What Is an Excerpt?
Tip: In .NET search interfaces featuring the Preferences link, you can configure the excerpt to include one to four lines of text (see Modifying .NET Search Interface Preferences).
Note: No excerpt appears in the search result for a copy-protected document (such as a PDF), to avoid showing its content in a context where users could copy it. When the excerpt is missing or empty for specific documents, your Coveo administrator can verify whether the documents are identified as copy protected in the index (see Reviewing Document Details from the Index Browser).
Example: In Adobe Acrobat, in the Password Security Settings dialog box, when you clear the Enable copying of text, images, and other content check box, the document becomes copy protected, and no excerpt appears in search results for this document.
Why Do Keywords Not Appear in Some Excerpts?
By default, Coveo search interfaces only return search results containing all searched keywords, so you can legitimately expect search result excerpts to include your search keywords. However, in cases like the ones listed below, one or more of your keywords may not appear in the search result excerpts.
- Keywords occur towards the end of a long document
  The tail of large documents is excluded when the excerpt is built (see How Are Excerpts Generated?), so when one or all of the searched keywords occur only in the excluded document section, the segments gathered for the excerpt may not include all searched keywords.
- Results are injected
  Various Coveo features (such as Top Results or Coveo Machine Learning Automatic Relevance Tuning) or search interface customizations by developers can inject results that may not include one or all of the searched keywords, which explains why some or none of the keywords appear in the excerpt.
- A thesaurus expands results
  Thesaurus entries defined by your index administrator may replace one or more of your keywords (see Adding Thesaurus Entries From the Administration Tool), so only the replacement synonym appears in the excerpt.
- Your query includes many keywords
  When a query contains many keywords (such as a long sentence), but the keyword occurrences are scattered in the document, it may be impossible to assemble a few segments (within 200 characters) that include all the keywords.
How Are Excerpts Generated?
At indexing time, the cleaned text of each item's content is recorded in the index. At query time, relevant segments that include the keywords are extracted from the recorded cleaned text to build the excerpt.
Note: The compressed cleaned text recorded for each item is limited in size (about 32 KB) to optimize the index size and query performance. For large documents (such as PDF files with several hundred pages), only the content at the beginning of the document can therefore appear in excerpts.
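As a rough illustration of this size limit, the following Python sketch shows why a keyword that occurs only near the end of a large document cannot appear in its excerpt. The `STORED_LIMIT` constant and both helper functions are hypothetical names introduced for this example, and the limit is modeled as a simple character cut-off, whereas the actual index applies it to compressed text.

```python
# Illustrative only: the limit is modeled as a plain character cut-off.
STORED_LIMIT = 32 * 1024  # assumed stand-in for the "about 32 KB" limit


def stored_text(cleaned_text: str, limit: int = STORED_LIMIT) -> str:
    """Return the leading part of the cleaned text kept for excerpts."""
    return cleaned_text[:limit]


def keyword_available(cleaned_text: str, keyword: str,
                      limit: int = STORED_LIMIT) -> bool:
    """Return True when the keyword occurs in the kept part of the text."""
    return keyword.lower() in stored_text(cleaned_text, limit).lower()


# A long document whose last words fall beyond the kept part.
document = "introduction " * 4000 + "conclusion about licensing"

print(keyword_available(document, "introduction"))  # keyword near the start
print(keyword_available(document, "licensing"))     # keyword only in the tail
```

In this sketch, the document can still match a query on "licensing" because the whole content is searchable, but no excerpt segment can be built around that keyword.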
The algorithm used to generate the excerpts is complex and weighs many criteria. The goal is to extract the most relevant segments around keywords and fit the result in 2 or 3 lines (typically 200 characters).
To help you understand how the excerpt is assembled, here are some of the criteria on which the algorithm is based:
- Create a contextual segment around each highlighted keyword:
  - Ideally a full sentence
  - Segment centered on highlighted keywords
  - Grow small sentences with content from adjacent sentences
- Evaluate each segment ranking score based on:
  - Keyword proximity and completeness
  - Number of non-stop-word keywords within the segment
  - Grammatical quality
  - Segment position in the document (better at beginning)
  - Average word length
  - Sentence length
- Keep the best segments:
  - Skip segments not bringing new keywords to maximize completeness
  - Merge overlapping segments
  - If not enough characters:
    - Grow best segments first
    - Merge small nearby sentences
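Some of the criteria above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual Coveo algorithm: it only covers sentence-based segments, a preference for segments with more keywords and an earlier position, skipping segments that bring no new keyword, and the 200-character budget. The `Segment` class, the `find_segments` and `build_excerpt` functions, and the `" ... "` separator are all assumptions made for this example; grammatical quality, word length, and segment growing or merging are not modeled.

```python
import re
from dataclasses import dataclass

MAX_EXCERPT_CHARS = 200  # typical excerpt length mentioned above
SEPARATOR = " ... "      # assumed separator between non-adjacent segments


@dataclass
class Segment:
    text: str
    position: int        # index of the sentence in the document
    keywords: frozenset  # searched keywords found in the sentence


def find_segments(text, keywords):
    """Split the text into sentences and keep those containing a keyword."""
    wanted = {k.lower() for k in keywords}
    segments = []
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"\w+", sentence.lower()))
        found = frozenset(wanted & words)
        if found:
            segments.append(Segment(sentence, position, found))
    return segments


def build_excerpt(text, keywords, max_chars=MAX_EXCERPT_CHARS):
    """Greedy sketch: prefer segments with more keywords, then earlier ones."""
    candidates = sorted(find_segments(text, keywords),
                        key=lambda s: (-len(s.keywords), s.position))
    covered, kept, used = set(), [], 0
    for segment in candidates:
        if segment.keywords <= covered:
            continue  # skip segments that bring no new keyword
        cost = len(segment.text) + (len(SEPARATOR) if kept else 0)
        if used + cost > max_chars:
            continue  # segment does not fit in the remaining budget
        kept.append(segment)
        covered |= segment.keywords
        used += cost
    kept.sort(key=lambda s: s.position)  # restore document order
    return SEPARATOR.join(s.text for s in kept)


text = ("Coveo indexes documents from many sources. The weather was nice. "
        "Excerpts show keywords in context. Excerpts are short.")
keywords = ["Coveo", "excerpts", "keywords"]
print(build_excerpt(text, keywords))
```

With a small budget (for example `max_chars=40`), the sketch keeps only the segment covering the most keywords, which mirrors how an excerpt can end up missing some searched keywords when they are scattered across the document.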