If you need to add OCR (Optical Character Recognition) functionality to your application, one option is the Microsoft Office Document Imaging (MODI) library that comes with the Office package (2003, 2007). In addition, if you need to create Word 2007 documents from scanned files, you can use the Open XML SDK. In this article I will show you how to create a simple application that performs OCR on images and generates document files.
Prerequisites
First, you need to install the MODI library and the Open XML SDK.
The Microsoft Office Document Imaging 12 library is included in the Office 2007 package. Run the Office setup and check whether the component is installed; if it is not, select it for installation on your computer.

The Open XML SDK 1.0 is available for download at the Microsoft Download Center:
Creating a Windows Forms Application
The next step is to create a Windows Forms application. We'll call this DemoOCR.
Resolving Dependencies
To use the MODI library, we need to add a reference to the Microsoft Office Document Imaging 12 Type Library.
The MODI library contains an ActiveX control for viewing the scanned documents. We'll host such a control on the form, and to do that we need to add the control to the toolbox.
- Go to the toolbox and select the General category.
- Right-click and select Choose Items from the context menu.

- In the Choose Toolbox Items dialog, open the COM Components tab.
- Select the Microsoft Office Document Imaging Viewer Control 12.0 and press OK.

After this, the MODI Image Viewer control will be available in the toolbox and you can drag it to the form. This is how the form will look.

To save the parsed images as Word 2007 documents we'll use the Open XML SDK. For that we need to add two more references:
- DocumentFormat.OpenXml assembly
- WindowsBase assembly (version 3.0.0.0).

Parsing Images
Parsing an image (a scanned document) is quite easy. It requires the following steps:
- create a new MODI.Document object
- create the document from an image file (TIFF, BMP, JPEG, GIF)
- perform OCR on the document
You can be notified of the OCR progress by subscribing to the OnOCRProgress event. The following sample shows all of these steps. Here, md is a member of the form class.
private void btnParse_Click(object sender, EventArgs e)
{
    // Create the MODI document and subscribe to OCR progress notifications.
    md = new MODI.Document();
    md.OnOCRProgress +=
        new MODI._IDocumentEvents_OnOCRProgressEventHandler(OnOCRProgress);

    progressParsing.Value = 0;

    // Load the image file and run OCR on it.
    md.Create(textImagePath.Text);
    md.OCR(
        MODI.MiLANGUAGES.miLANG_ENGLISH,
        ckOrientImage.Checked,
        ckStraightenImage.Checked);

    // Display the recognized document in the viewer and reset the progress bar.
    axMiDocViewer.Document = md;
    progressParsing.Value = 0;
}

private void OnOCRProgress(int progress, ref bool cancel)
{
    progressParsing.Value = progress;
}
progressParsing is a ProgressBar control that shows the total progress of the OCR process.
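Since the document is created from a path typed into a text box, it may help to check the file extension before handing the path to MODI. The following is only a minimal sketch: the helper name HasSupportedExtension and the extension list are assumptions derived from the formats named above (TIFF, BMP, JPEG, GIF), not part of the MODI API.

```csharp
using System;
using System.IO;

public static class ImagePathSketch
{
    // Extensions assumed to correspond to the formats named in the article
    // (TIFF, BMP, JPEG, GIF); adjust the list to what your MODI version accepts.
    private static readonly string[] SupportedExtensions =
        { ".tif", ".tiff", ".bmp", ".jpg", ".jpeg", ".gif" };

    // Hypothetical helper: returns true if the path ends in one of the
    // extensions above, compared case-insensitively.
    public static bool HasSupportedExtension(string path)
    {
        if (string.IsNullOrEmpty(path))
            return false;

        string extension = Path.GetExtension(path);
        foreach (string supported in SupportedExtensions)
        {
            if (string.Equals(extension, supported, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(HasSupportedExtension(@"C:\scans\page1.TIF"));
        Console.WriteLine(HasSupportedExtension(@"C:\scans\notes.docx"));
    }
}
```

In btnParse_Click, such a check could run on textImagePath.Text before md.Create is called. It only inspects the file name, so a file with a matching extension but unreadable content would still be rejected later by MODI itself.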
Using the MODI Viewer Control
A single line of code is enough to display the document in the image viewer control. It already appears in the code above:
axMiDocViewer.Document = md;
This is how the control displays a screenshot of the MSDN page for the MODI 2003 Object Model.
Saving to a Word 2007 Document
The following function creates a simple Word 2007 document using the Open XML Format SDK 1.0.
private void SaveAsDocx(string text, string filename)
{
    char[] delimiters = { '\r', '\n' };
    string[] lines = text.Split(delimiters);

    // Build the main document markup: one paragraph per line of text.
    StringBuilder sb = new StringBuilder();
    sb.Append(@"<?xml version='1.0' encoding='UTF-8' standalone='yes'?>");
    sb.Append(@"<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>");
    sb.Append("<w:body>");
    foreach (string line in lines)
    {
        sb.Append("<w:p><w:r><w:t>");
        // Escape &, < and > so that recognized text cannot break the XML.
        sb.Append(System.Security.SecurityElement.Escape(line));
        sb.Append("</w:t></w:r></w:p>");
    }
    sb.Append("</w:body></w:document>");
    string docText = sb.ToString();

    // Create the package and write the markup into its main document part.
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(
            filename,
            WordprocessingDocumentType.Document))
    {
        MainDocumentPart docPart = wordDoc.AddMainDocumentPart();
        using (Stream partStream = docPart.GetStream())
        {
            byte[] buffer = new UTF8Encoding().GetBytes(docText);
            partStream.Write(buffer, 0, buffer.Length);
        }
    }
}
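The markup that SaveAsDocx assembles as a string could also be produced with System.Xml.XmlWriter, which escapes reserved characters such as & and < in the recognized text automatically. The sketch below is one possible variant, not the method used in the demo: the class name DocumentXmlSketch and the method WriteDocumentXml are hypothetical, and skipping empty lines is a choice made here rather than something the original function does.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class DocumentXmlSketch
{
    // WordprocessingML main namespace, the same one used in SaveAsDocx.
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: writes one w:p paragraph per non-empty line of text.
    public static void WriteDocumentXml(string text, Stream output)
    {
        string[] lines = text.Split(
            new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Encoding = new UTF8Encoding(false); // UTF-8 without a byte order mark

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument(true); // emits standalone="yes"
            writer.WriteStartElement("w", "document", W);
            writer.WriteStartElement("w", "body", W);
            foreach (string line in lines)
            {
                writer.WriteStartElement("w", "p", W);
                writer.WriteStartElement("w", "r", W);
                writer.WriteElementString("w", "t", W, line); // value is escaped by the writer
                writer.WriteEndElement(); // w:r
                writer.WriteEndElement(); // w:p
            }
            writer.WriteEndElement(); // w:body
            writer.WriteEndElement(); // w:document
            writer.WriteEndDocument();
        }
    }

    public static void Main()
    {
        using (MemoryStream ms = new MemoryStream())
        {
            WriteDocumentXml("Tom & Jerry\r\n1 < 2", ms);
            Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
        }
    }
}
```

Inside SaveAsDocx, the stream returned by docPart.GetStream() could be passed as the output argument in place of the MemoryStream used in this demonstration.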
To transform the entire MODI document into a Word document, we need to iterate over all the document pages, concatenate their text, and pass the result to the function above.
StringBuilder sb = new StringBuilder();
foreach (MODI.Image image in md.Images)
{
    sb.Append(image.Layout.Text);
}
SaveAsDocx(sb.ToString(), txtDocumentPath.Text);
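Because the loop appends the page texts back to back, the last line of one page could run into the first line of the next if a page's text does not end with a line break. The sketch below shows one way to guard against that. JoinPages is a hypothetical helper, and whether Layout.Text already ends with a line break is not something this sketch assumes either way.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextSketch
{
    // Hypothetical helper: concatenates page texts and makes sure each page
    // ends with a line break so that adjacent pages do not run together.
    public static string JoinPages(IEnumerable<string> pageTexts)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string pageText in pageTexts)
        {
            if (string.IsNullOrEmpty(pageText))
                continue; // skip pages that produced no text

            sb.Append(pageText);

            char last = pageText[pageText.Length - 1];
            if (last != '\n' && last != '\r')
                sb.Append("\r\n");
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // In the form, this list would be filled with image.Layout.Text
        // for each MODI.Image in md.Images.
        List<string> pages = new List<string>();
        pages.Add("First page");
        pages.Add("Second page\r\n");
        Console.Write(JoinPages(pages));
    }
}
```

The result of JoinPages can then be passed to SaveAsDocx in place of sb.ToString().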
It is also possible to save only a selection the user made in the image viewer. To enable selection in the control, its ActionState must be set to 5 (Select). The selected text is then available through the TextSelection property of the control.
SaveAsDocx(axMiDocViewer.TextSelection.Text, txtDocumentPath.Text);
After saving this selection, the Word 2007 document looks like this:

The demo application shown here is available in the archive attached to this article.
References
For more information, see the following references:
- Using the Microsoft Office Document Imaging 2003 Object Model
- Chapter 22: Office Open XML Essentials
- MiDocView Object
- Open XML Format SDK 1.0 at MSDN