
OCR Module

The Optical Character Recognition (OCR) module allows you to index text content from files such as scanned documents stored in image or PDF files.

Supported document formats

.tiff, .tiff-fx, .pcx, .dcx, .bmp, .jpeg, .png, .max, .gif, .pbm, and .pdf

Supported languages

  • 120 non-Asian languages

  • 21 languages with spell checking dictionaries for improved detection (Brazilian, Catalan, Czech, Danish, Dutch, English, Esperanto, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, Turkish)

It is recommended to separate source files that require OCR conversion from those that do not. Doing so gives the best performance and conversion quality. You can separate them by moving the files into different file system folders or by using source filters (see Adding or Modifying Source Filters).
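As an illustration of the folder-based approach, the following Python sketch splits a list of file paths into OCR candidates and other files, using only the file extension. It is a minimal sketch and not part of CES: the function name `split_by_ocr_need` is hypothetical, and the extension set is copied from the supported formats listed above. An extension test cannot tell a scanned PDF from a PDF that already contains text, so treat the result as a first pass to review.

```python
from pathlib import Path

# Extensions copied from the "Supported document formats" list above.
OCR_EXTENSIONS = {
    ".tiff", ".tiff-fx", ".pcx", ".dcx", ".bmp", ".jpeg",
    ".png", ".max", ".gif", ".pbm", ".pdf",
}


def split_by_ocr_need(paths):
    """Return (ocr_candidates, others), decided by file extension only.

    Hypothetical helper: it does not inspect file content, so a PDF that
    already has a text layer still lands in ocr_candidates.
    """
    ocr_candidates, others = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in OCR_EXTENSIONS:
            ocr_candidates.append(path)
        else:
            others.append(path)
    return ocr_candidates, others
```

You could then move each group into its own folder and point a separate source at each one.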

Notes:

  • The OCR module is optional. You need to purchase it to receive the license allowing you to install and activate the module.

  • CES 7.0.5989+ (October 2013) The OCR module can run as a 64-bit process.

  • By default, the OCR module does not process small files. This avoids needlessly processing a large number of small files that contain no text, such as icons.

    Depending on your CES version:

  • The default OCR module configuration is a compromise between conversion completeness and performance, and can prevent some content in a document, or even entire documents, from being indexed. Consider adapting the configuration to your needs (see Tuning the OCR Module).

  • OCR licensing (CPUs, cores, and threads)

    OCR licensing works on a per-CPU basis. Each licensed OCR module allows you to run OCR on one CPU, with a number of threads equal to the number of cores in that CPU. This constraint comes from the third-party component at the core of the module, and it means that you need to take a few additional measures to maximize your OCR use:

    • Consider the number of CPUs on your server. Do you want to use all CPUs for OCR? Purchase one OCR module per CPU you want to use.

    • Provide Coveo with the total number of cores so that the license can be prepared accordingly. By default, OCR licensing includes only one thread.

    You can set the number of OCR threads for the local or remote converters (see Configuring a Remote Converter).

    Note that if you look at the processes in the Task Manager while OCR is running, you may occasionally see more than one OCR process, even with a single licensed OCR module. This is because a process is spawned for each individual page of a multi-page document.

    CES 7.0.6684+ (May 2014) You can see the number of concurrent OCR converters authorized by your CES license by reviewing the Audio Video Restrictions parameter value on the License page (see What Information Is Displayed in the License Page?).
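The licensing arithmetic described in the notes above can be sketched in a few lines of Python. This is only an illustration: the function name `ocr_license_plan` and the dictionary keys are hypothetical, and it assumes that every CPU you dedicate to OCR has the same number of cores and that the core count to report is the total across those CPUs.

```python
def ocr_license_plan(cpus_for_ocr: int, cores_per_cpu: int) -> dict:
    """Estimate OCR licensing needs from the rules stated above.

    Assumptions (not confirmed by the product documentation): all CPUs
    used for OCR have the same core count, and the total to report is
    the sum of cores over those CPUs.
    """
    if cpus_for_ocr < 1 or cores_per_cpu < 1:
        raise ValueError("CPU and core counts must be at least 1")
    return {
        # One OCR module per CPU you want to use.
        "modules_to_purchase": cpus_for_ocr,
        # Each module runs as many threads as its CPU has cores.
        "threads_per_module": cores_per_cpu,
        # Total threads across all licensed CPUs.
        "total_threads": cpus_for_ocr * cores_per_cpu,
    }


plan = ocr_license_plan(cpus_for_ocr=2, cores_per_cpu=4)
print(plan)
# {'modules_to_purchase': 2, 'threads_per_module': 4, 'total_threads': 8}
```

For a server with two quad-core CPUs that are both used for OCR, the sketch suggests two modules running four threads each.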

Deployment overview

  1. Download and install the OCR module on your Coveo server (see Installing the Optical Character Recognition Module).

  2. On the Coveo server, run the VB script contained in the CES7Admin-OCRNumberOfThreads.zip file to enable a number of OCR threads that matches the number of available cores.

  3. Let CES know where to find the OCR converter (see Adding an OCR Open Converter).

  4. Create a new document type set (see Creating a Document Type Set).

  5. Link the OCR open converter to appropriate document types (see Associating the OCR Open Converter to Document Types).

  6. Assign the document type set to sources containing documents to be converted with the OCR module (see Modifying the Document Type Set Used by a Source).

    Tip: When you have a large number of documents to process using OCR, it is good practice to:

    1. Start by indexing a source that contains only a small, representative sample of documents.

    2. Validate that all documents and all their desired content have been indexed as expected.

    3. If not, make adjustments (see Tuning the OCR Module).

    4. Once satisfied, index all documents.

  7. Rebuild the sources.
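One way to build the small representative sample mentioned in the tip under step 6 is to pick a few files of each type at random. The Python sketch below is a hedged illustration and not a Coveo tool: the function name `pick_sample` and its parameters are hypothetical, and it assumes that the file extension is a reasonable proxy for the kinds of documents in your source.

```python
import random
from collections import defaultdict
from pathlib import Path


def pick_sample(paths, per_type=5, seed=0):
    """Pick up to per_type files for each file extension.

    Hypothetical helper. The fixed seed makes the selection repeatable,
    so the same sample can be re-indexed after each tuning change.
    """
    rng = random.Random(seed)
    by_extension = defaultdict(list)
    for path in map(Path, paths):
        by_extension[path.suffix.lower()].append(path)

    sample = []
    for extension in sorted(by_extension):
        files = sorted(by_extension[extension])
        # Take everything when a type has fewer files than per_type.
        sample.extend(rng.sample(files, min(per_type, len(files))))
    return sample
```

You could copy the selected files into a dedicated folder, index that folder as its own source, and tune the OCR module before indexing the full set.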
