Text Analytics Pipeline Configuration

Text analytics operations are defined in a pipeline consisting of one or more stages. The pipeline stages are defined in an XML configuration file that starts with a global configuration section (see Text Analytics Global Configuration Parameters). Each XML configuration file can define one or more text analytics runs or jobs (see Runs Versus Jobs).

Text analytics pipelines are registered in the Coveo Job Scheduler (CJS) service using TAnGO (see Managing Text Analytics Pipeline Configurations). The CJS service launches a pipeline either once or at specified regular intervals.

About Runs

A run is a text analytics pipeline that sequentially applies a set of stages to a set of documents.

Example: Typically, a set of documents is fetched from the Coveo unified index, text analytics metadata is extracted, and the metadata is injected back into the index in the form of tag fields.

The pipeline is composed of the following stage types, applied in this order:

Fetcher

A run always starts with a fetcher plugin that retrieves documents to be processed by the pipeline (see Predefined Text Analytics Fetcher). There can be only one fetcher plugin in a pipeline.
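
For example, a fetcher entry could take the following form. The Type value and the query parameter shown here are hypothetical placeholders, not actual plugin names (see Predefined Text Analytics Fetcher for the available plugin and its parameters).

<!-- Hypothetical fetcher: select documents from the unified index with a query -->
<Fetcher Type="IndexFetcher">
  <Query>@syssource=="MySource"</Query>
</Fetcher>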

Filters

The pipeline can contain one or more filter plugins used to exclude specific types of content from the fetched documents before they are processed further (see Predefined Text Analytics Filters).
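
For example, a filter entry could exclude documents by language. The plugin and parameter names below are hypothetical placeholders (see Predefined Text Analytics Filters for the actual filters and their parameters).

<Filters>
  <!-- Hypothetical filter: keep only English documents -->
  <Filter Type="LanguageFilter">
    <AllowedLanguages>English</AllowedLanguages>
  </Filter>
</Filters>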

Extractors

At the center of the text analytics process, one or more extractor plugins create and attach metadata to processed documents (see Predefined Text Analytics Extractors).
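
For example, an extractor entry could extract named entities from the document text. The plugin and parameter names below are hypothetical placeholders (see Predefined Text Analytics Extractors for the actual extractors and their parameters).

<Extractors>
  <!-- Hypothetical extractor: detect person and company names -->
  <Extractor Type="NamedEntityExtractor">
    <EntityTypes>Person;Company</EntityTypes>
  </Extractor>
</Extractors>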

Normalizers

One or more normalizer plugins clean up the metadata created by the extractors (see Predefined Text Analytics Normalizers).
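
For example, a normalizer entry could merge case variants of extracted values. The plugin and parameter names below are hypothetical placeholders (see Predefined Text Analytics Normalizers for the actual normalizers and their parameters).

<Normalizers>
  <!-- Hypothetical normalizer: fold extracted values to a canonical case -->
  <Normalizer Type="CaseNormalizer">
    <Mode>LowerCase</Mode>
  </Normalizer>
</Normalizers>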

Outputter

At the end of the pipeline, an outputter plugin saves the text analytics results to their destination, typically back into the unified index. There can be only one outputter plugin in a pipeline (see Predefined Text Analytics Outputters).
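
For example, an outputter entry could write the extracted metadata back into the unified index as tag fields. The plugin and parameter names below are hypothetical placeholders (see Predefined Text Analytics Outputters for the actual outputters and their parameters).

<!-- Hypothetical outputter: inject extracted metadata into a tag field -->
<Outputter Type="IndexOutputter">
  <TagField>@sysentities</TagField>
</Outputter>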

The pipeline structure for a run is shown in the following XML configuration file sample.

<?xml version="1.0" encoding="utf-8"?>
<TextAnalyticsService>
  <!-- Global configuration parameters -->
  <Configuration>
    ...
  </Configuration>
  <!-- Definition of the run -->
  <!-- Set the name of your run -->
  <Run Name="MainRun">
    <!-- Plugin used to fetch the documents to process -->
    <Fetcher>
      ...
    </Fetcher>
    <!-- Optionally exclude content from the fetched documents -->
    <Filters>
      <!-- First filter -->
      <Filter>...</Filter>
      ...
      <!-- Nth filter -->
      <Filter>...</Filter>
    </Filters>
    <!-- Extract metadata from the documents -->
    <Extractors>
      <!-- First extractor -->
      <Extractor>...</Extractor>
      ...
      <!-- Nth extractor -->
      <Extractor>...</Extractor>
    </Extractors>
    <!-- Normalize metadata names and values -->
    <Normalizers>
      <!-- First normalizer -->
      <Normalizer>...</Normalizer>
      ...
      <!-- Nth normalizer -->
      <Normalizer>...</Normalizer>
    </Normalizers>
    <!-- Plugin used to output the results of the text analytics run -->
    <Outputter>
      ...
    </Outputter>
  </Run>
</TextAnalyticsService>

About Jobs

A job is a one-stage pipeline that you can use to perform general tasks that should not be executed on each individual document.

Example: You can use a job to run CES tagging queries, copy files, programmatically change a CES configuration, perform a maintenance task, etc.
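
A job definition parallels the run definition shown above, but with a single stage. The Job element, its attributes, and the plugin name below are assumptions made by analogy with the Run sample, for illustration only.

<!-- Hypothetical one-stage job definition -->
<Job Name="MaintenanceJob">
  <!-- Single plugin performing the task, e.g., running CES tagging queries -->
  <Plugin Type="TaggingQueryRunner">
    ...
  </Plugin>
</Job>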

What's Next?

Review the procedure to create, run, and fine-tune text analytics pipelines (see Managing Text Analytics Pipeline Configurations).
