OCR Module

The Optical Character Recognition (OCR) module lets you index the text content of scanned documents stored as image or PDF files.

Supported document formats

.tiff, .tiff-fx, .pcx, .dcx, .bmp, .jpeg, .png, .max, .gif, .pbm, and .pdf

Supported languages
  • 120 non-Asian languages
  • 21 languages with spell-checking dictionaries for improved detection (Brazilian, Catalan, Czech, Danish, Dutch, English, Esperanto, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, Turkish)

Separate the source files that require OCR conversion from those that do not. Running the OCR converter only on the documents that need it gives the best indexing performance and conversion quality. You can separate the files either by moving them into different file system folders or by using source filters (see Adding and Modifying Source Filters).
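
As an illustration of the folder-based approach, the following Python sketch splits a list of files by extension, using the supported document formats listed above, and moves each group to its own folder. It is not part of CES: the function names and the destination folders are hypothetical. It also assumes that the extension alone identifies an OCR candidate, which is not reliable for .pdf files, since a PDF may already contain indexable text.

```python
import shutil
from pathlib import Path

# Extensions copied from the "Supported document formats" list.
OCR_EXTENSIONS = {
    ".tiff", ".tiff-fx", ".pcx", ".dcx", ".bmp", ".jpeg",
    ".png", ".max", ".gif", ".pbm", ".pdf",
}


def split_by_ocr_need(paths):
    """Return (ocr_candidates, other_files), preserving input order.

    Assumption: a file is an OCR candidate when its extension is in
    OCR_EXTENSIONS (compared case-insensitively).
    """
    ocr_candidates, other_files = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in OCR_EXTENSIONS:
            ocr_candidates.append(path)
        else:
            other_files.append(path)
    return ocr_candidates, other_files


def move_to_folder(files, destination):
    """Move each file into the destination folder, creating it if needed."""
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    for file in files:
        shutil.move(str(file), str(dest / file.name))
```

For example, calling split_by_ocr_need on the contents of a source folder and passing each returned list to move_to_folder with a different destination yields two folders that can then be indexed by separate sources.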

Note: The OCR module is optional. You must purchase it to receive the license that allows you to install and activate the module.

Deployment overview

  1. Download and install the OCR module (see Installing the OCR Module).

  2. Let CES know where to find the OCR converter (see Adding an OCR Open Converter).

  3. Create a new document type set, link the OCR open converter to the appropriate document types, assign the document type set to the sources containing the documents to convert with the OCR module, and then rebuild those sources (see Associating the OCR Open Converter to Document Types).