
Chinese Japanese Korean Thai Language Improved Relevance

The Coveo Platform 7 has always supported Chinese, Japanese, Korean, and Thai (CJKT), but now offers improved relevance for these languages.

These languages do not use spacing characters to separate words. Previously, Coveo Enterprise Search (CES) indexed each character separately, as if it were a word, and used pairs of such characters to perform retrieval. This indexing method allowed end-users to find content, but was not optimal for the precision and ranking of search results.

With the new indexing method, Coveo Enterprise Search (CES) uses proven language-aware word tokenizers to split expressions into individual groups of inseparable characters, referred to hereafter as CJKT words. Each CJKT word is then indexed like a normal word. The meaning of CJKT words is thus preserved, and ranking is done on words rather than on individual characters, allowing for improved relevance.
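The difference between the two indexing methods can be sketched in a few lines of Python. This is only an illustration: the tokenizers CES actually uses are not described here, so the greedy longest-match segmenter and the two-entry TOY_LEXICON below are assumptions chosen for brevity, and the function names are hypothetical.

```python
def char_bigrams(text):
    """Overlapping character pairs, a rough stand-in for the earlier method."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def segment(text, lexicon):
    """Greedy longest-match segmentation against a toy lexicon (assumption)."""
    words, i = [], 0
    max_len = max(map(len, lexicon))
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon holding only the two words of the sample expression.
TOY_LEXICON = {"支持", "多国语言"}

print(char_bigrams("支持多国语言"))          # five overlapping pairs
print(segment("支持多国语言", TOY_LEXICON))  # two meaningful words
```

The bigram list contains pairs such as 持多 that span a word boundary and carry no meaning, while the segmented output keeps 支持 and 多国语言 intact, which is the property that word-based ranking relies on.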

Example: You can enter Chinese, Japanese, Korean, or Thai keywords in the search box to get relevant documents in search results and see highlighted CJKT keyword occurrences in the search results title and excerpt such as in the following Chinese example.

Notes:

  • CES 7.0.6547+ (March 2014) Improved relevance for Chinese, Japanese, and Korean.

  • CES 7.0.6424+ (February 2014) Improved relevance for Thai.

  • CES 7.0.6547+ (March 2014) New indexes use the improved relevance CJKT indexing method. When you upgrade CES from a version prior to CES 7.0.6547 to a CES 7.0.6547+ version, an existing index continues by default to use the original CJKT indexing method. If you want to switch to the new CJKT indexing method, contact Coveo Support for assistance.

In the examples presented below, a group of similar uppercase letters (ex.: TTT) represents a CJKT word, while a group of different lowercase letters (ex.: abc) represents a non-CJKT word or term.

Example: The simplified Chinese expression for "Coveo supports many languages" is decomposed as follows:

Original expression:

coveo支持多国语言

Represented by: abc TTTUUU
where:

abc stands for coveo

TTT stands for 支持 (support)

UUU stands for 多国语言 (multilingual)

Supported feature Description
Indexing

Automatic detection of CJKT content based on the Unicode character sets and encodings specific to each language.

Like words in other languages, CJKT words are indexed to record which documents they appear in and where they appear in each document.
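Detection by Unicode character set can be sketched as follows. The code point ranges are the standard Unicode blocks for Thai, Hiragana, Katakana, CJK Unified Ideographs, and Hangul Syllables, but the selection of blocks and the names CJKT_RANGES, is_cjkt, and split_runs are assumptions for illustration. The sketch ignores encodings and does not claim to reproduce the detection CES performs.

```python
from itertools import groupby

# Subset of Unicode blocks used for illustration (assumption).
CJKT_RANGES = [
    (0x0E00, 0x0E7F),  # Thai
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def is_cjkt(ch):
    """True when the character falls inside one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJKT_RANGES)


def split_runs(text):
    """Split text into consecutive (is_cjkt, run) pairs."""
    return [(flag, "".join(chars)) for flag, chars in groupby(text, key=is_cjkt)]


print(split_runs("coveo支持多国语言"))
```

For the sample expression, the non-CJKT run coveo is separated from the CJKT run 支持多国语言, which would then be handed to a word tokenizer.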

Search

At query time, a CJKT expression is split into CJKT words, and the search results present all documents containing all of these CJKT words. In the search interface, the searched CJKT words are highlighted in the title and excerpt of each search result.

End-users can search for CJKT words mixed with non-CJKT words or terms.

Example:

Typed query: abcTTTUUUdef
Transformed query: (abc TTT UUU def)
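The transformation above can be imitated on the placeholder notation used in this topic. In this sketch a run of one repeated uppercase letter stands for a CJKT word and a run of lowercase letters or digits stands for a non-CJKT term. The TOKEN pattern and the function names are hypothetical, and a real implementation would call a language-aware tokenizer instead of a regular expression.

```python
import re

# One repeated uppercase letter (a CJKT word) or a lowercase/digit run (a term).
TOKEN = re.compile(r"([A-Z])\1*|[a-z0-9]+")


def tokenize(expr):
    """Split a placeholder expression into CJKT words and non-CJKT terms."""
    return [m.group(0) for m in TOKEN.finditer(expr)]


def transform(query):
    """Join the tokens with spaces; all of them must match."""
    return "(" + " ".join(tokenize(query)) + ")"


print(transform("abcTTTUUUdef"))
```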

Note: Returned documents are ranked using the same process and criteria as for other languages (see Administration Tool - Ranking Menu).

Prefixes and operators

In the search box of Coveo search interfaces, end-users can use search prefixes and operators with CJKT expressions (see Search Prefixes and Operators). The Boolean operators must be spelled in English (AND, OR, NEAR, NOT).

Example: End-users can use the OR operator between a word and a CJKT expression:

Typed query: abc OR TTTUUU
Transformed query: (abc OR (TTT UUU))

Examples: End-users can use the NEAR operator between a word and a CJKT expression:

Typed query: abc NEAR TTTUUU
Transformed query: (abc NEAR "TTT UUU")
Typed query: r-cTTTUUU NEAR def
Transformed query: ("r c TTT UUU" NEAR def)

Note: The NEAR operator supports matching a word or a phrase, but not a subexpression.

Examples: End-users can use the NOT or minus operator that will expand to an exact phrase match when preceding a CJKT expression:

Typed query: NOT TTTUUU
Transformed query: NOT "TTT UUU"
Typed query: -TTTUUU
Transformed query: -"TTT UUU"

Examples: While stemming does not apply to CJKT (see About Stemming), end-users can still use the exact match plus (+) or number sign (#) operators in front of a CJKT expression to expand the expression as an exact phrase. The operator will be stripped.

Typed query: +TTTUUU
Transformed query: "TTT UUU"
Typed query: #TTTUUU
Transformed query: "TTT UUU"
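The prefix operator examples above follow one pattern: the CJKT expression after the operator becomes an exact phrase, NOT and minus are kept, and plus and number sign are stripped. A minimal sketch on the placeholder notation, with hypothetical function names and a regular expression standing in for the real tokenizer:

```python
import re

# One repeated uppercase letter (a CJKT word) or a lowercase/digit run (a term).
TOKEN = re.compile(r"([A-Z])\1*|[a-z0-9]+")


def phrase(expr):
    """Render the tokens of an expression as a quoted exact phrase."""
    return '"' + " ".join(m.group(0) for m in TOKEN.finditer(expr)) + '"'


def expand_prefix(query):
    """Expand a leading NOT, -, + or # in front of a CJKT expression."""
    if query.startswith("NOT "):
        return "NOT " + phrase(query[4:])
    if query.startswith("-"):
        return "-" + phrase(query[1:])
    if query.startswith(("+", "#")):
        return phrase(query[1:])  # the operator itself is stripped
    return query


for q in ("NOT TTTUUU", "-TTTUUU", "+TTTUUU", "#TTTUUU"):
    print(q, "->", expand_prefix(q))
```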
Phrase search

In the search box of Coveo search interfaces, end-users can search for a specific CJKT phrase. The meaning of the phrase is preserved.

Example: End-users can use double quotes to delimit an expression that must match exactly (see Searching a Phrase):

Typed query: abc"TTTUUUdef"
Transformed query: abc "TTT UUU def"

Non-word characters generate an exact phrase of surrounding characters.

Example: The presence of the dash (-) forces a conversion to an exact phrase match:

Typed query: TTTUUU-VVV
Transformed query: "TTT UUU VVV"
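The effect of a non-word character can be sketched in the same placeholder notation. The rule encoded below (any character other than a letter, digit, or space triggers an exact phrase) is an assumption inferred from the dash example, not a documented list of trigger characters, and the names are hypothetical.

```python
import re

# One repeated uppercase letter (a CJKT word) or a lowercase/digit run (a term).
TOKEN = re.compile(r"([A-Z])\1*|[a-z0-9]+")
# Anything other than a letter, digit, or whitespace (assumed trigger set).
NON_WORD = re.compile(r"[^A-Za-z0-9\s]")


def transform(expr):
    """Return an exact phrase when a non-word character is present."""
    words = " ".join(m.group(0) for m in TOKEN.finditer(expr))
    if NON_WORD.search(expr):
        return f'"{words}"'
    return f"({words})"


print(transform("TTTUUU-VVV"))
print(transform("TTTUUUVVV"))
```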
Thesaurus

Coveo administrators can enter CJKT expressions in thesaurus entries to expand queries (see Adding Thesaurus Entries From the Administration Tool).

Note: CJKT thesaurus entries are applied on CJKT words so a CJKT expression and its CJKT words are considered equivalent.

Example: Entering TTTUUU or TTT UUU in a thesaurus entry has the same effect.

Field queries

Thai expressions can be used in field queries (see What Are Field Queries and Free Text Queries?). Matches within fields are more precise, because they are converted to exact phrase matches.

Examples:

Typed query: @field=abcTTTUUU
Transformed query: @field="abc TTT UUU"
Typed query: @field=(abc,TTTUUU)
Transformed query: @field=(abc, "TTT UUU")
Typed query: @field=(abc,"TTTUUU")
Transformed query: @field=(abc, "TTT UUU")
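The three field query examples can be reproduced with a short sketch on the placeholder notation. It assumes that a field value containing at least one CJKT word is converted to a quoted phrase and that plain terms are left untouched, as the examples suggest. The function names are hypothetical and only the simple @field=value and @field=(a,b) forms are handled.

```python
import re

# One repeated uppercase letter (a CJKT word) or a lowercase/digit run (a term).
TOKEN = re.compile(r"([A-Z])\1*|[a-z0-9]+")


def quote_if_cjkt(value):
    """Quote a value as an exact phrase when it contains a CJKT word."""
    value = value.strip().strip('"')
    words = [m.group(0) for m in TOKEN.finditer(value)]
    if any(w.isupper() for w in words):
        return '"' + " ".join(words) + '"'
    return value


def transform_field_query(query):
    """Rewrite the value part of a field query."""
    name, _, value = query.partition("=")
    if value.startswith("(") and value.endswith(")"):
        items = value[1:-1].split(",")
        return f"{name}=(" + ", ".join(quote_if_cjkt(v) for v in items) + ")"
    return f"{name}={quote_if_cjkt(value)}"


print(transform_field_query("@field=abcTTTUUU"))
print(transform_field_query("@field=(abc,TTTUUU)"))
```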

 

Stop words

Coveo administrators can include CJKT words as stop words (see Configuring the Index to Ignore Stop Words in Queries).

Did you mean

The Word Corrector Lexicon (WCL) supports CJKT words, so a mistyped CJKT expression can lead to Did You Mean suggestions at query time (see How Are Misspelled Words Handled?).

Note: The use of wildcards is not supported for Chinese, Japanese, Korean, and Thai.
