Tuning the OCR Module
Default behavior you should know about
-
PDF documents first go through the PDF converter. When the PDF converter finds at least a minimum amount of text in a PDF, the document is not sent to the OCR converter. This threshold avoids wasting time on PDF documents that contain text rather than images. However, if some of your PDF documents contain both text and images, the module can miss content that OCR would have extracted. You can adjust the minimum amount of text using the mintextskippdfocr parameter, which has a default value of 256 characters.
-
By default, the OCR module only converts up to the first 200 pages of each document to prevent excessive conversion times. To convert all pages of all documents, increase the maxpage parameter value to at least the number of pages in your longest document.
-
By default, the OCR module only converts documents that are at least 10 KB in size, on the assumption that smaller documents (such as icons, logos, and small images) do not contain content worth converting by OCR. Skipping them saves conversion resources. You can change this limit using the minsize parameter.
-
When a document goes through the OCR converter, its converted output is put into the CES cache. If you convert the same document again, the cached content is used rather than converting the document again, which saves conversion resources but also happens even if you changed script parameters. As a result, when you try different conversion parameter values, you may conclude that your changes had no effect when they would have had one if the document had actually been reconverted.
When you reconvert a set of documents, you can temporarily disable the use of the CES cache content for already indexed documents. In the OCRConverter7.js script file, as shown below, comment out (add // in front of) lines 12 to 17 inclusively as well as line 43.
// if (docCache.BytesCount > 0) {
//     // The OCR has already been done, copy the data in the OutputDocument.
//     docCache.SaveToFile(outputDocument);
//     CustomConversion.OutputDocument.LoadFromFile(outputDocument);
//     DocumentInfo.IsValid = true;
// } else {
    ...
// }
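The default checks described at the start of this section can be sketched as a small decision function. This is only an illustrative model written in plain JavaScript (for example, to run with Node.js), not the module's actual code: the function name planOcr, the shape of the doc object, the order of the checks, and the behavior exactly at each threshold are assumptions. Only the parameter names and default values come from this page.

```javascript
// Default values as documented on this page.
const DEFAULTS = { mintextskippdfocr: 256, minsize: 10240, maxpage: 200 };

// Hypothetical helper: decides whether a document would be sent to OCR
// and how many pages would be converted. The doc fields (isPdf, sizeBytes,
// pdfTextLength, pageCount) are illustrative names, not a CES API.
function planOcr(doc, params = {}) {
  const p = { ...DEFAULTS, ...params };

  // Documents smaller than minsize (in bytes) are not converted.
  if (doc.sizeBytes < p.minsize) {
    return { convert: false, reason: "below minsize" };
  }

  // PDFs for which the PDF converter already found enough text skip OCR.
  // Treating "equal to the threshold" as enough text is an assumption.
  if (doc.isPdf && doc.pdfTextLength >= p.mintextskippdfocr) {
    return { convert: false, reason: "PDF already contains enough text" };
  }

  // Only the first maxpage pages are converted.
  return { convert: true, pages: Math.min(doc.pageCount, p.maxpage) };
}

// A 500-page PDF with 300 characters of text: skipped with the defaults,
// converted (first 200 pages only) once mintextskippdfocr is raised.
const scanned = { isPdf: true, sizeBytes: 500000, pdfTextLength: 300, pageCount: 500 };
console.log(planOcr(scanned));
console.log(planOcr(scanned, { mintextskippdfocr: 1000 }));
```

Under these assumptions, a mixed text-and-image PDF is only reached by OCR when mintextskippdfocr is set above the amount of text the PDF converter finds in it.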
To tune the OCR module conversion
-
Ensure you keep an intact copy of the original [CES_Path]\OCR Module\OCRConversion7.js script.
-
Using a text editor:
-
Open your working copy of the OCRConverter7.js script.
-
For each parameter for which you want to use a value other than the default, in the image2Text.Init() function call on line 21, add a parameter-value pair to the first argument, which is a semicolon-separated string.
image2Text.Init("pdfloadflag=3; spellcheck=1; timeout=1200; language=en,fr,de,es; cleanuptmp=1; MaximumPixelsX=32000; MaximumPixelsY=32000", tempDirectory, parentProcess);
Refer to the following table for the available parameters.
| Parameter | Description | Default value | Maximum value |
| --- | --- | --- | --- |
| cleanuptmp | Whether to delete the temporary files generated by the OCR module. Consider setting to 0 for troubleshooting purposes to be able to inspect temporary file content. | 1 (true - use 0 for false) | |
| enginetimeout | OCR engine conversion timeout (milliseconds) for each call to the OCR API, such as when converting each page of a document. When this timeout occurs, the document is only indexed by reference (see What Is the Difference between Indexing by Reference and Indexing by Content?). Example: A log message when an enginetimeout occurs can look like: Indexed by reference (Converter specific failure (The C:\CES7\Script\OCRConversionDev7D.js custom conversion script generated the following error: class CGLCOM::ActiveScript::Exception: CESCustomConverter.OCR.Debug.7.0 - Exception: class APICallFailedException: Error in OCR function kRecPreprocessImg With the following error code: API_TIMEOUT_ERR (line: 27, column: 8).)) | 180000 | -1 (no timeout) |
| language | Codes of the languages that the OCR module should recognize in the documents. Supported languages: ca, cs, da, nl, en, fi, fr, de, el, hu, it, no, pl, pt, sl, es, svd, ru, tr | en,fr,de,es | |
| license | The OCR license file location. | [OCR_Install_Path]\license.txt | |
| MaximumPixelsX | Required parameter specifying the maximum image width (in pixels) above which a document is not converted. Example: Related error message: Indexed by reference (Converter specific failure (The C:\CES7\Script\OCRConversionDev7D.js custom conversion script generated the following error: class CGLCOM::ActiveScript::Exception: CESCustomConverter.OCR.Debug.7.0 - Exception: class APICallFailedException: Error in OCR function kRecLoadImg With the following error code: IMG_SIZE_ERR. If converting big images, consider using the MaximumPixelsX and MaximumPixelsY parameters with values above the default value of 8400 in the Open Converter Script. (line: 27, column: 8).)) | 8400 | 32000 |
| MaximumPixelsY | Required parameter specifying the maximum image height (in pixels) above which a document is not converted. | 8400 | 32000 |
| maxpage | Maximum number of pages converted for each document. | 200 | 2147483647 |
| maxretry | Maximum number of OCR engine initialization retries. | 2 | 2147483647 |
| minsize | Minimum document size (in bytes) below which documents are not converted by OCR. | 10240 | 2147483647 |
| mintextskippdfocr | For PDF documents only, the minimum number of characters found in the HTML generated by the PDF converter above which a PDF document is not processed by the OCR converter. | 256 | 2147483647 |
| ocrbinpath | Path of the OCR library binary files. | [OCR_Install_Path] | |
| ocrbin64path | Path of the OCR library 64-bit version binary files. | [OCR_Install_Path] | |
| pdfloadflag | Available PDF flag values: 1 - disable using PDF character codes; 2 - disable using PDF tags; 3 - process the PDF just as an image (without any PDF info). Flags can be OR-ed to combine the value of this setting. The recognition accuracy is usually worse if the value is 1 or 3. | 2 | 3 |
| spellcheck | Whether to use dictionaries available in several languages to improve the language detection. | 0 (false - use 1 for true) | |
| timeout | Whole document conversion timeout (seconds). When this timeout occurs, the document is only indexed by reference (see What Is the Difference between Indexing by Reference and Indexing by Content?). Example: A log message when a timeout occurs can look like: Indexed by reference (Converter specific failure (The C:\CES7\Script\OCRConversionDev7D.js custom conversion script generated the following error: class CGLCOM::ActiveScript::Exception: CESCustomConverter.OCR.Debug.7.0 - OCRS timeout. (line: 27, column: 8).)) | 600 | 2147483647 |
-
Save the file.
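Before editing the script, it can help to assemble and sanity-check the parameter string separately. The following is a minimal sketch in plain JavaScript (for example, to run with Node.js, not inside the conversion script itself). The helper name buildInitString is hypothetical and is not part of the OCR module. The maximum values are copied from the parameter table, and parameters without a listed numeric maximum are passed through unchecked.

```javascript
// Maximum values copied from the parameter table. enginetimeout is left out
// because its table entry is -1 (no timeout) rather than an upper bound.
const MAX_VALUES = {
  MaximumPixelsX: 32000,
  MaximumPixelsY: 32000,
  maxpage: 2147483647,
  maxretry: 2147483647,
  minsize: 2147483647,
  mintextskippdfocr: 2147483647,
  pdfloadflag: 3,
  timeout: 2147483647
};

// Hypothetical helper: turns an object of overrides into the
// semicolon-separated string expected as the first Init() argument.
function buildInitString(overrides) {
  const pairs = [];
  for (const [name, value] of Object.entries(overrides)) {
    const max = MAX_VALUES[name];
    if (max !== undefined && Number(value) > max) {
      throw new RangeError(name + "=" + value + " exceeds the maximum of " + max);
    }
    pairs.push(name + "=" + value);
  }
  return pairs.join("; ");
}

const initString = buildInitString({
  pdfloadflag: 1 | 2,            // flags OR-ed together: 1 | 2 === 3
  maxpage: 1000,
  timeout: 20 * 60,              // whole-document timeout, in seconds
  enginetimeout: 5 * 60 * 1000,  // per-API-call timeout, in milliseconds
  MaximumPixelsX: 32000,
  MaximumPixelsY: 32000
});
console.log(initString);
// pdfloadflag=3; maxpage=1000; timeout=1200; enginetimeout=300000; MaximumPixelsX=32000; MaximumPixelsY=32000
```

The printed string can then be pasted as the first argument of image2Text.Init(). Writing the two timeouts as products makes the unit difference (seconds for timeout, milliseconds for enginetimeout) harder to overlook.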
-
-
Consider enabling the OCR logs to get more feedback on the conversion process:
In the C:\tmp folder, create an empty file named OCR_TRACE_FLAG.flag.
Additional OCR information will be logged in the C:\tmp\ocr_tracefile.log file.
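When reviewing log messages such as the examples in the parameter table, the error text usually points to the parameter worth revisiting. The sketch below, in plain JavaScript (for example, to run with Node.js), maps the three markers that appear in those examples to the related parameters. The helper name suggestParameters is hypothetical, and matching on these substrings is an assumption based only on the example messages on this page.

```javascript
// Markers taken from the example log messages in the parameter table.
const HINTS = [
  { marker: "API_TIMEOUT_ERR", parameters: ["enginetimeout"] },
  { marker: "IMG_SIZE_ERR", parameters: ["MaximumPixelsX", "MaximumPixelsY"] },
  { marker: "OCRS timeout", parameters: ["timeout"] }
];

// Hypothetical helper: returns the parameters to consider increasing for a
// given log message, or an empty array when no known marker is found.
function suggestParameters(logMessage) {
  const hit = HINTS.find((hint) => logMessage.includes(hint.marker));
  return hit ? hit.parameters : [];
}

const message =
  "Error in OCR function kRecLoadImg With the following error code: IMG_SIZE_ERR.";
console.log(suggestParameters(message));
```

For the sample message above, the function returns the MaximumPixelsX and MaximumPixelsY parameter names. A message without any of the three markers yields an empty array.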
What's Next?
Ensure your modified OCRConverter7.js script is the one used by your OCR open converter (see Adding an OCR Open Converter).