Predefined Text Analytics Extractors

The extractor stage is at the center of the text analytics process. Extractor plugins identify and extract specific types of information from the content of the processed documents.

Whitelister

The Whitelister plugin reads a flat text file containing a list of expressions, one per line, and searches each processed document for every expression. When there is a match, the matched expression is added to the document as a metadata value. Expressions must match on word boundaries, and only exact matches are extracted. Case sensitivity is configurable.

Example: The following Whitelister plugin populates the ToyotaVehicles metadata with occurrences of expressions found in the ToyotaVehicleList.txt file.

<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.Whitelister, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
   <MetadataName>ToyotaVehicles</MetadataName>
   <FilePath>D:\TextAnalytics\Config\whitelists\ToyotaVehicleList.txt</FilePath>
   <CaseSensitive>False</CaseSensitive>
  </Configuration>
</Extractor>  

Content of the ToyotaVehicleList.txt file:

Yaris Hatchback
Yaris
Corolla
Matrix
Prius c
Prius
Prius Plug-in
Prius v
Camry
Camry Hybrid
Venza
Avalon
Sienna
RAV4
Highlander
Highlander Hybrid
FJ Cruiser
4Runner
Sequoia
Tacoma
Tundra  
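
As a sketch of the case sensitivity setting, the following variant of the example above sets <CaseSensitive> to True. Assuming the setting behaves as its name suggests, an occurrence such as rav4 in a document would no longer match the RAV4 entry, while RAV4 still would. The metadata name and file path are reused from the example above.

```xml
<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.Whitelister, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
   <MetadataName>ToyotaVehicles</MetadataName>
   <FilePath>D:\TextAnalytics\Config\whitelists\ToyotaVehicleList.txt</FilePath>
   <!-- Assumption: True restricts matches to the exact letter case used in the list file -->
   <CaseSensitive>True</CaseSensitive>
  </Configuration>
</Extractor>
```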

CESQueryMetadataExtractor

The CESQueryMetadataExtractor plugin adds the specified value to the specified metadata when the document matches the specified query.

Example: With the following CESQueryMetadataExtractor extractor definition, Potato is added to the Vegetables metadata when the document matches the CES query potato AND (grow OR cook OR vegetable OR food).

<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.CESQueryMetadataExtractor, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
   <Query>potato AND (grow OR cook OR vegetable OR food)</Query>
   <MetadataName>Vegetables</MetadataName>
   <MetadataValue>Potato</MetadataValue>
  </Configuration>
</Extractor>  

Note: The CESQueryMetadataExtractor plugin sends one query to CES for each document and can therefore be slow when there are many documents.

MetadataAdderExtractor

The MetadataAdderExtractor plugin simply adds the specified value to the specified metadata for all documents.

Example: With the following MetadataAdderExtractor extractor definition, Potato will be added to the metadata Vegetables for all documents.

<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.MetadataAdderExtractor, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
   <MetadataName>Vegetables</MetadataName>
   <MetadataValue>Potato</MetadataValue>
  </Configuration>
</Extractor>  

RegexMetadataExtractor

The RegexMetadataExtractor plugin finds documents that match a specified regular expression and sets the value of the specified metadata using one of the following mutually exclusive methods:

  • By default, when neither the <MetadataValue> nor the <MetadataReplacement> parameter is defined, writes all found matches.

  • Optionally, when one or more <MetadataValue> parameters are defined, writes the specified value(s).

  • Optionally, when one <MetadataReplacement> parameter is defined, writes the corresponding regular expression replacement value.

This extractor is generic and powerful, but can require significant CPU resources when the regular expression is complex.

Use the following parameters to define the extractor:

<Regex>

This required parameter specifies the regular expression (regex) that must be matched.

<MetadataName>

This required parameter specifies the name of the metadata to which a value is set when the content of a document matches the regular expression.

<MetadataValue>

One or more occurrences of this optional parameter specify one or more values to save in the metadata when the regular expression finds a match.

<MetadataReplacement>

This optional parameter specifies a regex replacement string used to create the metadata value from the content of regex matches in the document.

<CaseSensitive>

This optional parameter can be set to True to force the regular expression to be case sensitive. The default is False.

Examples: Suppose a document to analyze contains only the text "I want to eat a potato".

With this content, for this document, the following extractor sets the Vegetables metadata to potato.

<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.RegexMetadataExtractor, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
   <MetadataName>Vegetables</MetadataName>
   <Regex>(potato|tuberosum)</Regex>
  </Configuration>
</Extractor>  

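With this content, for this document, the following extractor adds both potato and tuberosum to the Vegetables metadata, because the regular expression finds a match and the specified values are written regardless of which alternative matched.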

<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.RegexMetadataExtractor, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
   <MetadataName>Vegetables</MetadataName>
   <Regex>(potato|tuberosum)</Regex>
   <MetadataValue>potato</MetadataValue>
   <MetadataValue>tuberosum</MetadataValue>
  </Configuration>
</Extractor>  
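
The third method, <MetadataReplacement>, can be sketched in the same way. The following definition is illustrative only: it assumes that the replacement string uses .NET-style substitution syntax, where $1 stands for the first capture group, which this page does not confirm. The vegetable: prefix is a hypothetical value chosen for the example. Under that assumption, the value written to the Vegetables metadata would be built from the matched text rather than being the raw match.

```xml
<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.RegexMetadataExtractor, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
   <MetadataName>Vegetables</MetadataName>
   <Regex>(potato|tuberosum)</Regex>
   <!-- Assumption: $1 refers to the first capture group of the regex -->
   <MetadataReplacement>vegetable:$1</MetadataReplacement>
  </Configuration>
</Extractor>
```

Because the three methods are exclusive, this sketch does not define any <MetadataValue> parameter.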

SalienceMetadataExtractor

The SalienceMetadataExtractor entity discovery plugin can extract one or more types of information (Themes, Named Entities, Sentiment).

Example: The following extractor processes up to 4096 characters of each document and extracts themes, named entities, and multi-level sentiment.

<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.Salience.SalienceMetadataExtractor, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
    <MaxDocLenToProcess>4096</MaxDocLenToProcess>
    <!-- Location of Salience files -->
    <SalienceLicensePath>E:\TXTAN\Lexalytics\license.v5</SalienceLicensePath>
    <SalienceDirectory>D:\TXTAN\Lexalytics\data</SalienceDirectory>
    <SalienceUserDirectory>D:\TXTAN\Lexalytics\user</SalienceUserDirectory>
    <!-- Types of information to extract -->
    <ExtractThemes>True</ExtractThemes>
    <ExtractUserDefinedEntities>False</ExtractUserDefinedEntities>
    <ExtractNamedEntities>True</ExtractNamedEntities>
    <!-- Sentiment analysis configuration -->
    <ExtractSentiment>True</ExtractSentiment>
    <PositiveThreshold>0.15</PositiveThreshold>
    <NegativeThreshold>-0.15</NegativeThreshold>
    <SentimentLevels>True</SentimentLevels>
    <SentimentLevelMultiple>2</SentimentLevelMultiple>
  </Configuration>
</Extractor>  

Available parameters are:

<MaxDocLenToProcess>

The required MaxDocLenToProcess parameter defines the maximum number of characters from the content to send to the entity discovery plugin for each document. The recommended value is 4096. Assuming that English words contain 5 characters on average, 4096 characters correspond to about 800 words. A larger value can significantly increase the processing time of a run that includes many large documents.

Note: With MaxDocLenToProcess set to 4096, smaller documents such as emails are most likely fully processed. However, for longer documents containing thousands of words, such as knowledge base articles, only the first 800 or so words are processed.
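
As an illustration only, the following fragment doubles the limit. Under the same assumption of about 5 characters per word, 8192 characters correspond to roughly 1,600 words. The value 8192 is an arbitrary choice for this sketch, not a recommendation, and can be expected to increase processing time accordingly.

```xml
<Configuration>
  ...
  <!-- Hypothetical value: twice the recommended 4096 characters -->
  <MaxDocLenToProcess>8192</MaxDocLenToProcess>
  ...
</Configuration>
```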

<SalienceLicensePath>

This required parameter specifies the path and name of the Salience license file.

<SalienceDirectory>

This required parameter specifies the folder where Salience data files are located.

<SalienceUserDirectory>

This required parameter specifies the folder where Salience user data files are located.

<ExtractThemes>

Set this optional parameter to True to activate the extraction of themes, in which case you must also specify the <ThemeMetaName> global configuration parameter (see <ThemeMetaName>).

<ExtractNamedEntities>

Set this optional parameter to True to activate the extraction of predefined named entities. You cannot tell the entity discovery plugin which named entities to extract. When you are only interested in a subset of named entities, you can use a normalizer stage to remove unwanted named entity metadata (see MetadataFilter).

<ExtractSentiment>

Set this optional parameter to True to activate the sentiment analysis, in which case you must also specify the <SentimentMetaName> global configuration parameter (see <SentimentMetaName>).

<PositiveThreshold>
<NegativeThreshold>

These parameters are required only when sentiment analysis is activated. Their values are the sentiment score thresholds used to return a Positive or Negative sentiment. For a score between the two thresholds, the returned sentiment is Neutral. The recommended values are 0.15 and -0.15 respectively.

<SentimentLevels>

Set this optional parameter to True to instruct the plugin to return multiple sentiment levels: Very Positive, Positive, Neutral, Negative, and Very Negative.

<SentimentLevelMultiple>

When multiple sentiment levels are activated, this parameter specifies the multiplier applied to the <PositiveThreshold> and <NegativeThreshold> values to determine the threshold scores for returning the Very Positive and Very Negative sentiments respectively.

Example: With the <PositiveThreshold>, <NegativeThreshold>, and <SentimentLevelMultiple> parameters respectively set to 0.15, -0.15, and 2, a document with a sentiment score greater than 0.30 returns a Very Positive sentiment. Similarly, a document with a sentiment score lower than -0.30 returns a Very Negative sentiment.
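
The same arithmetic applies to other values. As a sketch with hypothetical values chosen only for illustration, and assuming the multiplier is applied to each threshold independently as described above, the following fragment would return Very Positive for scores greater than 0.40 (0.20 × 2) and Very Negative for scores lower than -0.20 (-0.10 × 2).

```xml
<Configuration>
  ...
  <ExtractSentiment>True</ExtractSentiment>
  <!-- Hypothetical asymmetric thresholds, not recommended values -->
  <PositiveThreshold>0.20</PositiveThreshold>
  <NegativeThreshold>-0.10</NegativeThreshold>
  <SentimentLevels>True</SentimentLevels>
  <!-- Very Positive above 0.20 x 2 = 0.40; Very Negative below -0.10 x 2 = -0.20 -->
  <SentimentLevelMultiple>2</SentimentLevelMultiple>
  ...
</Configuration>
```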

<VeryPositiveLabel>
<PositiveLabel>
<NeutralLabel>
<NegativeLabel>
<VeryNegativeLabel>

You can use these optional parameters to change the default labels of the sentiment levels.

Example:

<Extractor>
  <Impl>Coveo.TextAnalytics.Implementations.Salience.SalienceMetadataExtractor, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
     ...
    <VeryPositiveLabel>Great!</VeryPositiveLabel>
    <PositiveLabel>Good!</PositiveLabel>
    <NeutralLabel>OK</NeutralLabel>
    <NegativeLabel>Not so good...</NegativeLabel>
    <VeryNegativeLabel>It's horrible!</VeryNegativeLabel>
     ...  
  </Configuration>
</Extractor>   