sharparena.com

From Bitmaps to Word Documents with MODI and Open XML SDK

If you need to add OCR (Optical Character Recognition) functionality to your application, one option is the Microsoft Office Document Imaging library (MODI) that comes with the Office (2003, 2007) package. In addition, if you need to create Word 2007 documents from scanned files, you can use the Open XML SDK to create them. In this article I will show you how to create a simple application that performs OCR on images and generates document files.

Pre-requisites

First, you need to install the MODI library and the OpenXML SDK.

The Microsoft Office Document Imaging 12 library is available in the Office 2007 package. Run the Office setup and check whether that component is installed; if it is not, select it for installation on your computer.

The Open XML SDK 1.0 is available for download at the Microsoft Download Center.

Creating a Windows Forms Application

Resolving Dependencies

The next step is to create a Windows Forms application. We'll call this DemoOCR.

To use the MODI library, we need to add a reference to the Microsoft Office Document Imaging 12 Type Library.

The MODI library contains an ActiveX control for viewing the scanned documents. We'll host such a control on the form, and to do that we need to add the control to the toolbox.

  1. Go to the toolbox and select the General category.
  2. Right-click and select Choose Items from the context menu.
  3. In the Choose Toolbox Items dialog open the COM Components tab.
  4. Select the Microsoft Office Document Imaging Viewer Control 12.0 and press OK.

After this, the MODI Image Viewer control will be available in the toolbox and you can drag it to the form. This is how the form will look.

To save the parsed images as Word 2007 documents we'll use the Open XML SDK. For that we need to add two more references:

  1. DocumentFormat.OpenXml assembly
  2. WindowsBase assembly (version 3.0.0.0).

With that set, all the references are resolved and we can go further.

Parsing Images

Parsing an image (scanned document) is quite easy. It takes three steps:

  • create a new MODI.Document object
  • create the document from an image file (TIFF, BMP, JPEG, GIF)
  • perform OCR on the document

You can be notified of the OCR progress by subscribing to the OnOCRProgress event. All of these steps appear in the following sample. Here, md is a member of the form class.

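      private void btnParse_Click(object sender, EventArgs e)
      {
         // Create a fresh document for each run; md is a field of the form.
         md = new MODI.Document();

         // Report OCR progress through the progress bar.
         md.OnOCRProgress +=
           new MODI._IDocumentEvents_OnOCRProgressEventHandler(OnOCRProgress);
         progressParsing.Value = 0;

         // Load the image file (TIFF, BMP, JPEG or GIF) into the document.
         md.Create(textImagePath.Text);

         // Run OCR; the two flags control image orientation and straightening.
         md.OCR(
            MODI.MiLANGUAGES.miLANG_ENGLISH,
            ckOrientImage.Checked,
            ckStraightenImage.Checked);

         // Show the recognized document in the viewer and reset the bar.
         axMiDocViewer.Document = md;
         progressParsing.Value = 0;
      }

      // Called by MODI while the OCR is running.
      private void OnOCRProgress(int progress, ref bool cancel)
      {
         progressParsing.Value = progress;
      }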

progressParsing is a ProgressBar control that shows the total progress of the OCR process.

Using the MODI Viewer Control

A single line of code is enough to display the document in the image viewer control. It is already shown in the code above.

      axMiDocViewer.Document = md;

This is how the control displays a screenshot of the MSDN page for the MODI 2003 Object Model.

Saving to a Word 2007 Document

The following function creates a simple Word 2007 document using the Open XML Format SDK 1.0.

      // Requires: using System.IO; using System.Security; using System.Text;
      // using DocumentFormat.OpenXml; using DocumentFormat.OpenXml.Packaging;
      private void SaveAsDocx(string text, string filename)
      {
         char[] delimiters = { '\r', '\n' };
         string[] lines = text.Split(delimiters);

         StringBuilder sb = new StringBuilder();
         sb.Append(@"<?xml version='1.0' encoding='UTF-8' standalone='yes'?>");
         sb.Append(@"<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>");
         sb.Append("<w:body>");

         // One paragraph with a single run per line of text.
         foreach (string line in lines)
         {
            sb.Append("<w:p><w:r><w:t>");
            // Escape &, < and > so the OCR text cannot break the XML.
            sb.Append(SecurityElement.Escape(line));
            sb.Append("</w:t></w:r></w:p>");
         }

         sb.Append("</w:body></w:document>");
         string docText = sb.ToString();

         // Disposing the package closes the file, even if an exception is thrown.
         using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Create(
               filename,
               WordprocessingDocumentType.Document))
         {
            MainDocumentPart docPart = wordDoc.AddMainDocumentPart();

            // Write the XML into the main document part as UTF-8.
            using (Stream partStream = docPart.GetStream())
            {
               byte[] buffer = new UTF8Encoding().GetBytes(docText);
               partStream.Write(buffer, 0, buffer.Length);
            }
         }
      }
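
The function above assembles the document XML by string concatenation, so whatever ends up inside the w:t elements has to be valid XML content. As an alternative, here is a minimal sketch that builds the same structure with LINQ to XML, which escapes the text automatically. It assumes .NET 3.5 and a reference to System.Xml.Linq; the DocxXml class and the BuildDocumentXml method are hypothetical names that are not part of the demo project.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's demo project.
public static class DocxXml
{
    // WordprocessingML main namespace, the same one used in SaveAsDocx.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds the XML of the main document part: one <w:p> per line of text.
    public static string BuildDocumentXml(string text)
    {
        // "\r\n" is listed first so that a Windows line break counts once.
        string[] lines = text.Split(
            new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

        XElement body = new XElement(W + "body",
            lines.Select(line =>
                new XElement(W + "p",
                    new XElement(W + "r",
                        new XElement(W + "t",
                            // Keep leading and trailing spaces of each line.
                            new XAttribute(XNamespace.Xml + "space", "preserve"),
                            line)))));

        XElement document = new XElement(W + "document",
            // Declare the "w" prefix on the root element.
            new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
            body);

        // XElement.ToString does not emit an XML declaration, so add one.
        return "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
            + document.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string could be written to the main document part in the same way as docText in SaveAsDocx. Unlike the character-based split used there, this sketch treats "\r\n" as a single break, so it does not produce an extra empty paragraph between lines.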

To transform the entire MODI document into a Word document, we need to iterate over all the document pages, concatenate their text and then pass it to the function above.

         StringBuilder sb = new StringBuilder();
         foreach (MODI.Image image in md.Images)
         {
            sb.Append(image.Layout.Text);
         }

         SaveAsDocx(sb.ToString(), txtDocumentPath.Text);

It is also possible to save only a selection that the user made in the image viewer. To allow selecting text in the control, its ActionState property must be set to 5 (Select). The selected text is then available through the TextSelection property of the control.

         SaveAsDocx(axMiDocViewer.TextSelection.Text, txtDocumentPath.Text);

After saving this selection, the Word 2007 document looks like this:

The demo application shown here is available in the archive attached to this article.

Attachments:
DemoOCR.zip: OCR demo application in VS 2008 (11 Kb)
Last Updated on Monday, 18 May 2009 11:40