
Predefined Text Analytics Filters

Filter plugins remove specific types of content from each fetched document so that this content is not processed by the rest of the pipeline. They run between the fetcher stage and the extractor stage.

LongNonSentenceLinesFilter

The LongNonSentenceLinesFilter plugin determines whether a block of text is structured like a sentence by checking for the presence of punctuation, spaces, etc. When it is not, as with a folder path or a line of programming code, the block of text is removed from the content sent to the rest of the pipeline. This filter is useful to avoid sending non-phrasal content to the entity discovery plugin, which could otherwise consume significant CPU resources on that text and extract no information. This filter does not have parameters.

<Filter>
  <Impl>Coveo.TextAnalytics.Implementations.LongNonSentenceLinesFilter, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
  </Configuration>
</Filter>  

NonSentenceFilter

The NonSentenceFilter plugin analyzes each line of text and determines whether it is a valid English sentence based on word lookup, such as stop words (a, the, of...), and statistical analysis (number of symbols, numbers, etc.). When a line is found not to be a sentence, it is removed from the content sent to the rest of the pipeline. This filter is useful to avoid sending non-phrasal content to the entity discovery plugin, which could otherwise consume significant CPU resources on that text and extract no information. This filter does not have parameters.

<Filter>
  <Impl>Coveo.TextAnalytics.Implementations.NonSentenceFilter, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
  </Configuration>
</Filter>
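To make the idea of combining a stop-word lookup with simple character statistics more concrete, here is a minimal Python sketch. It is not the plugin's implementation: the function name `looks_like_sentence`, the small stop-word set, and both thresholds are assumptions chosen only for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and",
              "that", "it", "was", "for", "on", "at"}

def looks_like_sentence(line, min_stop_ratio=0.15, max_symbol_ratio=0.3):
    """Guess whether a line reads like an English sentence.

    Both thresholds are assumed values, not documented plugin settings.
    """
    tokens = re.findall(r"[A-Za-z']+", line.lower())
    if not tokens:
        return False
    # Word lookup: share of tokens that are common stop words.
    stop_ratio = sum(t in STOP_WORDS for t in tokens) / len(tokens)
    # Statistics: share of non-letter characters, ignoring whitespace.
    compact = "".join(line.split())
    symbol_ratio = sum(not c.isalpha() for c in compact) / len(compact)
    return stop_ratio >= min_stop_ratio and symbol_ratio <= max_symbol_ratio

def keep_sentences(text):
    """Return the text with non-sentence lines removed."""
    return "\n".join(l for l in text.splitlines() if looks_like_sentence(l))
```

In this sketch, a line such as a Windows folder path fails the stop-word check, while a line of arithmetic code fails the symbol check, so both would be dropped before reaching later stages.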

RegexLineFilter

The RegexLineFilter plugin evaluates each line of processed documents against one or more regular expressions (regex) defined by the <Regex> parameter. If any one of the regexes matches the line, the line is removed from the content sent to the rest of the pipeline. This filter is generic and powerful, but it can require significant CPU resources when you specify complex regular expressions.

Example: The following filter removes lines starting with the # symbol.

<Filter>
  <Impl>Coveo.TextAnalytics.Implementations.RegexLineFilter, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
    <Regex>^#.*$</Regex>
  </Configuration>
</Filter>  
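The "remove the line if any regex matches" rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the plugin's code: the function name `filter_lines` is hypothetical, and the sketch assumes a pattern may match anywhere in a line unless it is anchored with ^.

```python
import re

def filter_lines(text, patterns):
    """Drop every line matched by at least one of the given patterns."""
    compiled = [re.compile(p) for p in patterns]
    kept = [line for line in text.splitlines()
            # Assumption: unanchored search, so "^" is needed to
            # restrict a pattern to the start of the line.
            if not any(rx.search(line) for rx in compiled)]
    return "\n".join(kept)

sample = "# build notes\nThe server restarted at noon.\nprice: #42"
print(filter_lines(sample, [r"^#.*$"]))
```

Under that assumption, the anchored pattern removes only the first line of the sample, whereas an unanchored `#` would also remove the last line because it contains the symbol. Compiling each pattern once, as done here, also keeps the per-line cost down when many lines are evaluated.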

EmailHeaderFilter

The EmailHeaderFilter plugin removes email header lines starting with From:, Sent:, Subject:, To:, and Importance:. In general, the email body retrieved from the index by the fetcher does not include header lines. This filter is useful in the rare cases where indexed email documents include the header lines and you want to process these messages without the header content. Note that the subject is generally set as the document title, so it is still processed even if the Subject: line is removed. This filter does not have parameters.

<Filter>
  <Impl>Coveo.TextAnalytics.Implementations.EmailHeaderFilter, Coveo.TextAnalytics.Implementations</Impl>
  <Configuration>
  </Configuration>
</Filter>  
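As a rough illustration of the effect, the following Python sketch drops lines that begin with one of the five prefixes listed above. It is not the plugin's implementation; the function name `strip_email_headers` is hypothetical, and the case-sensitive, start-of-line comparison is an assumption.

```python
# Prefixes taken from the description above.
HEADER_PREFIXES = ("From:", "Sent:", "Subject:", "To:", "Importance:")

def strip_email_headers(text):
    """Remove lines that start with a known email header prefix."""
    kept = [line for line in text.splitlines()
            # Assumption: exact, case-sensitive match at the line start.
            if not line.startswith(HEADER_PREFIXES)]
    return "\n".join(kept)

message = ("From: Ann\nSent: Monday\nTo: Bob\n"
           "Subject: Outage\nThe outage began at 9 a.m.")
print(strip_email_headers(message))
```

Only the body line of the sample message survives in this sketch, which mirrors the intent of processing the message without its header content.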

What's Next?

Evaluate available extractors (see Predefined Text Analytics Extractors).
