Chinese Japanese Korean Thai Language Improved Relevance
The Coveo Platform 7 has always supported Chinese, Japanese, Korean, and Thai (CJKT), but now offers improved relevance for these languages.
These languages do not use spacing characters to separate words. Previously, Coveo Enterprise Search (CES) indexed each character separately, as if it were a word, and used pairs of such characters to perform retrieval. That indexing method allowed end-users to find content, but it was not optimal for search result precision and ranking.
Note: This topic is also available in the following languages:
- Simplified Chinese
- Thai
- Korean
With the new indexing method, CES uses proven language-aware word tokenizers to identify and separate expressions into individual groups of inseparable characters, referred to hereafter as CJKT words. Each CJKT word is then indexed like a normal word. The meaning of CJKT words is thus preserved, and ranking is done on words rather than on individual characters, which improves relevance.
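To make the difference concrete, the following Python sketch contrasts the two approaches on the Chinese words used later in this topic. It is only an illustration: the `char_bigrams` and `longest_match_tokenize` functions and the tiny `LEXICON` are hypothetical, and greedy longest-match segmentation is a stand-in for the language-aware tokenizers that CES uses, whose internals are not described here.

```python
def char_bigrams(text):
    """Older style of retrieval unit: overlapping pairs of characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def longest_match_tokenize(text, lexicon):
    """Toy word tokenizer: greedy longest match against a known-word lexicon.

    Characters that no lexicon entry covers become single-character tokens.
    """
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical two-word lexicon, for illustration only.
LEXICON = {"支持", "多国语言"}
expression = "支持多国语言"

print(char_bigrams(expression))
print(longest_match_tokenize(expression, LEXICON))
```

The character pairs include fragments that straddle word boundaries, whereas the word tokens keep each meaningful unit intact, which is why ranking on words can be more relevant.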
Example: You can enter Chinese, Japanese, Korean, or Thai keywords in the search box to get relevant documents in the search results, with occurrences of the CJKT keywords highlighted in the title and excerpt of each result, as in the following Chinese example.
CES 7.0.6547+ (March 2014) Improved relevance for Chinese, Japanese, and Korean.
CES 7.0.6424+ (February 2014) Improved relevance for Thai.
CES 7.0.6547+ (March 2014) New indexes use the improved-relevance CJKT indexing method. When you upgrade CES from a version prior to 7.0.6547 to version 7.0.6547 or later, an existing index continues by default to use the original CJKT indexing method. To switch an existing index to the new CJKT indexing method, contact Coveo Support for assistance.
In the examples presented below, a group of identical uppercase letters (e.g., TTT) represents a CJKT word, while a group of different lowercase letters (e.g., abc) represents a non-CJKT word or term.
Example: The simplified Chinese expression for "Coveo supports many languages" is decomposed as follows:
Represented by: abc TTTUUU
- abc stands for coveo
- TTT stands for 支持 (support)
- UUU stands for 多国语言 (multilingual)
CJKT content is detected automatically, based on the Unicode character sets and encodings specific to each language.
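As a rough illustration of detection by character set, the Python sketch below checks code points against a few well-known Unicode blocks. The `SCRIPT_RANGES` table and `detect_scripts` function are hypothetical, the block list is deliberately incomplete, and how CES performs detection is not described here. Han ideographs are shared by Chinese and Japanese, so a range check alone cannot tell those two languages apart.

```python
# A few Unicode blocks relevant to CJKT text (not an exhaustive list).
SCRIPT_RANGES = {
    "han": (0x4E00, 0x9FFF),       # CJK Unified Ideographs
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "hangul": (0xAC00, 0xD7AF),    # Hangul Syllables
    "thai": (0x0E00, 0x0E7F),
}


def detect_scripts(text):
    """Return the set of CJKT script names found in the text."""
    found = set()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                found.add(name)
    return found


print(detect_scripts("coveo支持多国语言"))
print(detect_scripts("coveo"))
```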
Like words in other languages, CJKT words are indexed to identify which documents they appear in, and where they appear in each document.
At query time, a CJKT expression is split into CJKT words, and the search results include every document that contains all of those CJKT words. In the search interface, the searched CJKT words are highlighted in the title and excerpt of each search result.
End-users can search for CJKT words mixed with non-CJKT words or terms.
Note: Returned documents are ranked using the same process and criteria as for other languages (see Administration Tool - Ranking Menu).
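The indexing and query behavior described above can be pictured as a positional inverted index queried with an implicit AND. The Python sketch below is a minimal model under that assumption, not the CES implementation. The `build_index` and `search_all` names and the sample documents are hypothetical, and the documents are given already split into words using the placeholder notation (TTT and UUU for CJKT words, abc and xyz for non-CJKT words).

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, words in docs.items():
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    return index


def search_all(index, query_words):
    """Return the ids of documents that contain every query word."""
    doc_sets = [set(index.get(word, {})) for word in query_words]
    return set.intersection(*doc_sets) if doc_sets else set()


# Hypothetical documents, already tokenized into words.
docs = {
    1: ["abc", "TTT", "UUU"],
    2: ["TTT", "xyz"],
    3: ["UUU", "abc"],
}
index = build_index(docs)

print(search_all(index, ["TTT", "UUU"]))  # CJKT words only
print(search_all(index, ["abc", "UUU"]))  # non-CJKT word mixed with a CJKT word
```

Storing positions as well as document ids is what later makes highlighting and phrase matching possible.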
Prefixes and operators
In the search box of Coveo search interfaces, end-users can use search prefixes and operators with CJKT expressions (see Search Prefixes and Operators). The Boolean operators must be spelled in English (AND, OR, NEAR, NOT).
Example: End-users can use the OR operator between a word and a CJKT expression:
Examples: End-users can use the NEAR operator between a word and a CJKT expression:
Note: The NEAR operator supports matching a word or a phrase, but not a subexpression.
Examples: End-users can use the NOT or minus (-) operator. When the operator precedes a CJKT expression, the expression is expanded to an exact phrase match:
Examples: While stemming does not apply to CJKT (see About Stemming), end-users can still use the exact match plus (+) or number sign (#) operators in front of a CJKT expression to expand the expression as an exact phrase. The operator will be stripped.
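The operator rules above can be sketched in Python as a small rewriting step. This is an assumption-laden illustration rather than the CES query parser: `rewrite_term` and `split_words` are hypothetical names, the tuple they return is an invented representation, and the tokenizer only understands this topic's placeholder notation (a run of one repeated uppercase letter is a CJKT word, a run of lowercase letters is a non-CJKT word).

```python
import re

# Toy tokenizer for the placeholder notation only.
NOTATION_WORD = re.compile(r"([A-Z])\1*|[a-z]+")


def split_words(expression):
    """Split a placeholder expression such as 'TTTUUU' into its words."""
    return [match.group(0) for match in NOTATION_WORD.finditer(expression)]


def rewrite_term(term):
    """Return (negated, words, exact_phrase) for one query term.

    A leading '-' negates the term; a leading '+' or '#' is stripped.
    In both cases the expression is treated as an exact phrase.
    """
    if term.startswith("-"):
        return True, split_words(term[1:]), True
    if term.startswith(("+", "#")):
        return False, split_words(term[1:]), True
    return False, split_words(term), False


for term in ["TTTUUU", "+TTTUUU", "#TTTUUU", "-TTTUUU"]:
    print(term, rewrite_term(term))
```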
In the search box of Coveo search interfaces, end-users can search for a specific CJKT phrase. The semantics of the phrase are preserved.
Example: End-users can use double quotes to delimit an expression to exactly match (see Searching a Phrase):
A non-word character inside an expression causes the characters around it to be matched as an exact phrase.
Example: The presence of the dash (-) forces a conversion to an exact phrase match:
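A minimal Python sketch of the two phrase rules above, using the placeholder notation from this topic. The `parse_term` and `phrase_in` functions are hypothetical, documents are assumed to be already split into words, and the real CES parsing rules may differ in detail.

```python
import re


def parse_term(term):
    """Classify a query term as an exact phrase or a single word.

    Double quotes delimit a phrase; a non-word character such as a dash
    between words also turns the surrounding words into a phrase.
    """
    if len(term) >= 2 and term[0] == term[-1] == '"':
        return "phrase", term[1:-1].split()
    parts = [part for part in re.split(r"\W+", term) if part]
    if len(parts) > 1:
        return "phrase", parts
    return "word", parts


def phrase_in(words, phrase):
    """True if the phrase words occur consecutively, in order, in words."""
    size = len(phrase)
    return any(words[i:i + size] == phrase for i in range(len(words) - size + 1))


kind, phrase = parse_term("TTT-UUU")
print(kind, phrase)
print(phrase_in(["abc", "TTT", "UUU"], phrase))  # adjacent and in order
print(phrase_in(["TTT", "abc", "UUU"], phrase))  # both present, not adjacent
```

A document that merely contains both words does not satisfy the phrase; the words must be adjacent and in the same order.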
A Coveo administrator can enter CJKT expressions in thesaurus entries to expand queries (see Adding Thesaurus Entries From the Administration Tool).
Note: CJKT thesaurus entries are applied on CJKT words, so a CJKT expression and its individual CJKT words are considered equivalent.
Example: Entering TTTUUU or TTT UUU in a thesaurus entry has the same effect.
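One way to picture this equivalence is to normalize every thesaurus entry to the sequence of words it contains before storing or looking it up. The Python sketch below assumes exactly that; `entry_key`, the `thesaurus` dictionary, and the synonym `xyz` are hypothetical, and the tokenizer only handles this topic's placeholder notation.

```python
import re

# Toy tokenizer for the placeholder notation only: a run of one repeated
# uppercase letter is a CJKT word, a run of lowercase letters is a non-CJKT word.
NOTATION_WORD = re.compile(r"([A-Z])\1*|[a-z]+")


def entry_key(entry):
    """Normalize a thesaurus entry to the tuple of words it contains."""
    return tuple(match.group(0) for match in NOTATION_WORD.finditer(entry))


# Hypothetical thesaurus: the entry is stored under its normalized key.
thesaurus = {entry_key("TTTUUU"): ["xyz"]}

# Both written forms normalize to the same key, so they find the same entry.
print(entry_key("TTTUUU") == entry_key("TTT UUU"))
print(thesaurus.get(entry_key("TTT UUU")))
```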
Thai expressions can be used in field queries (see What Are Field Queries and Free Text Queries?). Matches within fields are more precise, because the expressions are converted to exact phrase matches.
Stop words
A Coveo administrator can include CJKT words as stop words (see Configuring the Index to Ignore Stop Words in Queries).
Did you mean
The Word Corrector Lexicon (WCL) supports CJKT words, so a mistyped CJKT expression can lead to Did You Mean suggestions at query time (see How Are Misspelled Words Handled?).
Note: The use of wildcards is not supported for Chinese, Japanese, Korean, and Thai.