Entity Discovery Plugin
Themes
Themes are noun phrases extracted from full text based on linguistic analysis. A noun phrase is a sequence of one or more terms that can be replaced by a noun or pronoun. The entity discovery plugin typically extracts 2-word themes.
Example: The expression "community account management" is a sequence of terms that can be replaced by "it", and could therefore be extracted as a theme.
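To make the idea concrete, here is a deliberately simplified sketch of noun-phrase extraction. It is not the plugin's algorithm: the part-of-speech table `POS` and the function `extract_themes` are hypothetical, and the sketch treats any run of two or more consecutive nouns as a theme, whereas real linguistic analysis relies on a proper tagger and richer phrase rules.

```python
# Hypothetical, hand-written part-of-speech lookup (assumption for illustration).
# A real system would use a trained tagger instead of a fixed table.
POS = {
    "the": "DET", "team": "NOUN", "improved": "VERB", "community": "NOUN",
    "account": "NOUN", "management": "NOUN", "last": "ADJ", "year": "NOUN",
}

def extract_themes(sentence, min_len=2):
    """Return runs of at least min_len consecutive nouns as candidate themes."""
    themes, run = [], []
    for token in sentence.lower().rstrip(".").split():
        if POS.get(token) == "NOUN":
            run.append(token)
        else:
            # A non-noun token closes the current run.
            if len(run) >= min_len:
                themes.append(" ".join(run))
            run = []
    # Flush a run that reaches the end of the sentence.
    if len(run) >= min_len:
        themes.append(" ".join(run))
    return themes

print(extract_themes("The team improved community account management last year."))
# prints ['community account management']
```

Single nouns such as "team" and "year" are dropped because they fall below the minimum length set for this sketch.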
Theme extraction works best on documents containing regular text made of complete sentences, and poorly on documents made of text fragments, such as log files.
The relevance of a theme extracted from a single document is magnified when it appears in a correlation-ranked facet, allowing end users to easily pull subject-related documents from dispersed sources.
Named Entities
Named entities are unique text elements that can be classified into predefined categories. The entity discovery plugin can extract named entities for a number of categories (company names, product names, people, job titles, places, dates, and so on). When found in a document, named entity values are saved as metadata named after the corresponding category and attached to the document.
Example: With named entity extraction enabled, when processing the following sentence:
This article comments on Paul Baker's favorite restaurants in Boston.
The following metadata and values are created for this document:
- People = Paul Baker
- Place = Boston
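The mapping from text to category-named metadata can be sketched as follows. This is an illustration only: the `GAZETTEER` lookup table and the `extract_entity_metadata` function are hypothetical names, and a plain dictionary lookup stands in for whatever recognition method the plugin actually uses.

```python
from collections import defaultdict

# Hypothetical gazetteer (assumption): known entity values mapped to a category.
GAZETTEER = {
    "Paul Baker": "People",
    "Boston": "Place",
}

def extract_entity_metadata(text):
    """Return {category: [values]} for every gazetteer entry found in text."""
    metadata = defaultdict(list)
    for value, category in GAZETTEER.items():
        if value in text:
            metadata[category].append(value)
    return dict(metadata)

sentence = "This article comments on Paul Baker's favorite restaurants in Boston."
print(extract_entity_metadata(sentence))
# prints {'People': ['Paul Baker'], 'Place': ['Boston']}
```

Each category becomes a metadata name and each matched value is attached under it, which mirrors the People and Place entries listed above.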
With named entity extraction enabled, the entity discovery plugin always generates metadata for all categories that it supports and finds. In the text analytics pipeline, you can use a filter plugin to only pass metadata from named entity categories of interest to the outputter stage (see SalienceMetadataExtractor and MetadataFilter).
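The filtering step can be pictured with a small sketch. It does not use the actual MetadataFilter configuration or API, which this section does not describe; `filter_metadata`, `allowed_categories`, and the sample "Company" value are hypothetical and only illustrate keeping the categories of interest.

```python
def filter_metadata(metadata, allowed_categories):
    """Keep only the metadata whose category name is in allowed_categories."""
    return {
        category: values
        for category, values in metadata.items()
        if category in allowed_categories
    }

# Metadata as it might look after extraction (illustrative values only).
extracted = {
    "People": ["Paul Baker"],
    "Place": ["Boston"],
    "Company": ["Example Corp"],
}

# Only People and Place are passed on to the next stage.
print(filter_metadata(extracted, {"People", "Place"}))
# prints {'People': ['Paul Baker'], 'Place': ['Boston']}
```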
Facets using named entities allow end users to easily find documents referring to specific named entities.
Sentiment Analysis
The entity discovery plugin can use the optional Sentiment Analysis feature to calculate an overall sentiment score for a document by finding predefined expressions associated with positive or negative sentiment and summing their respective sentiment scores. Sentiment analysis produces the best results with full-sentence content that is likely to include judgments, opinions, or moods, such as posts from customer review communities.
Example: With sentiment analysis enabled, when processing the following sentence:
This is the worst and most painful software installation I ever went through. However, once installed, its features are nice.
An overall Negative sentiment is returned since the terms "worst" and "most painful" carry a significantly negative score compared to the word "nice", which carries a positive score. Other terms are Neutral as they do not carry sentiment.
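The summing logic described above can be sketched as follows. The lexicon and its integer scores are invented for illustration and are not the plugin's actual values; `SENTIMENT_LEXICON`, `score_document`, and `overall_sentiment` are hypothetical names.

```python
import re

# Hypothetical lexicon (assumption): expression -> sentiment score.
SENTIMENT_LEXICON = {
    "worst": -3,
    "most painful": -3,
    "nice": 1,
}

def score_document(text):
    """Sum the scores of every lexicon expression found in text."""
    lowered = text.lower()
    total = 0
    for expression, score in SENTIMENT_LEXICON.items():
        # Word boundaries avoid matching inside longer words.
        pattern = r"\b" + re.escape(expression) + r"\b"
        total += score * len(re.findall(pattern, lowered))
    return total

def overall_sentiment(score):
    """Map a summed score to the three default sentiment names."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

text = ("This is the worst and most painful software installation I ever went "
        "through. However, once installed, its features are nice.")
score = score_document(text)
print(score, overall_sentiment(score))
# prints -5 Negative
```

With these assumed scores, the two negative expressions outweigh the single positive one, so the document as a whole is labeled Negative.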
By default, sentiment analysis returns a Positive, Neutral, or Negative sentiment. You can, however, configure multiple levels and customize the sentiment level names (see SalienceMetadataExtractor).
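One way to picture multiple levels with custom names is a threshold table applied to the document score. This is a sketch under assumptions, not the SalienceMetadataExtractor configuration: the boundaries in `THRESHOLDS`, the names in `LEVELS`, and the function `sentiment_level` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical five-level scale (assumption). THRESHOLDS holds the boundaries
# between consecutive levels, so LEVELS has one more entry than THRESHOLDS.
THRESHOLDS = [-2.5, -0.5, 0.5, 2.5]
LEVELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]

def sentiment_level(score):
    """Return the name of the level whose range contains score."""
    return LEVELS[bisect_right(THRESHOLDS, score)]

for score in (-5, -1, 0, 1, 3):
    print(score, sentiment_level(score))
# prints:
# -5 Very Negative
# -1 Negative
# 0 Neutral
# 1 Positive
# 3 Very Positive
```

Renaming a level or adding a finer one only requires editing the two lists, as long as `LEVELS` keeps exactly one more entry than `THRESHOLDS`.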