Configuring and Indexing a Web Pages Source
To configure and index a Web Pages source
-
On the Coveo server, access the Administration Tool (see Opening the Administration Tool).
-
Select Index > Sources and Collections.
-
In the Collections section:
-
Select an existing collection in which you want to add the new source.
OR
-
Click Add to create a new collection (see Adding a Collection).
-
-
In the Sources section, click Add.
The Add Source page that appears is organized in three sections.
-
In the General Settings section of the Add Source page:
-
Enter the appropriate value for the following required parameters:
-
Name
-
Enter a descriptive name of your choice for the connector source.
Example: My Organization Website
-
Source Type
-
The connector used by this source. In this case, select Web Pages.
-
Addresses
-
The root URL for the website content that you want to index.
Example: http://www.myorganization.com/
-
You can also specify multiple URLs when they share the same configuration. This is useful when you want to index only specific sections of a website. Each URL must be on a separate line in the box.
Note: It is recommended to create independent sources for independent websites.
-
Refresh Schedule
-
Time interval at which the index is automatically refreshed to keep the index content up-to-date. By default, the Every day option instructs CES to refresh the source every day at 12 AM. Choose a refresh rate that matches how often the website content is updated.
Important: For a Web Pages source, a full refresh does not immediately detect deleted pages. A refresh removes a page from the index only after the page returns a 404 error three times in a row. Otherwise, a rebuild eliminates deleted web pages from the index.
Note: You can create new or modify existing source refresh schedules (see Creating or Modifying a Source Schedule).
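The deleted-page behavior described in the Important note above can be pictured with a small model. The following Python sketch is only an illustration, not the CES implementation: the NotFoundTracker class and its method names are hypothetical, and it assumes that any response other than 404 resets the count, which is one reading of "three times in a row".

```python
from collections import defaultdict

# Threshold taken from the text above: three 404 responses in a row.
REMOVAL_THRESHOLD = 3


class NotFoundTracker:
    """Hypothetical helper that counts consecutive 404 responses per URL."""

    def __init__(self, threshold=REMOVAL_THRESHOLD):
        self.threshold = threshold
        self.misses = defaultdict(int)

    def record(self, url, status_code):
        """Record one refresh result; return True when the page should be removed."""
        if status_code == 404:
            self.misses[url] += 1
        else:
            # Assumption: any other response breaks the streak.
            self.misses[url] = 0
        return self.misses[url] >= self.threshold


tracker = NotFoundTracker()
url = "http://www.myorganization.com/old-page.html"
# Six successive refreshes of the same page.
results = [tracker.record(url, code) for code in (404, 404, 200, 404, 404, 404)]
print(results)
```

In this model the page is flagged for removal only on the sixth refresh, because the 200 response on the third refresh restarts the count.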
-
-
Review the value for the following parameters that often do not need to be modified:
-
Rating
-
Change this value only when you want to globally change the rating of all items in this source relative to the rating of other sources (see Understanding Search Results Ranking).
-
Document Types
-
If you have defined custom document type sets, select the most appropriate one for this source (see What Are Document Type Sets?).
-
Active Languages
-
If you have defined a custom language set for this source, select it.
-
Fields
-
If you have defined custom field sets, select the most appropriate one for this source (see What Are Field Sets?).
-
-
-
In the Specific Connector Parameters & Options section of the Add Source page, review whether you need to change the default parameter values:
-
User Agent
-
Determines the name used by the Web Pages connector to identify itself when downloading pages. Leave empty to use the default value (CoveoEnterpriseSearch) configured for all Web Pages sources in the Web Connector page (Configuration > Connectors > Web Crawler).
-
User Agent Identifier
-
Determines the identifier used by the Web Pages connector to identify itself when downloading pages.
Some websites use the user agent identifier string to detect whether the visitor is a specific browser or a search engine crawler. Websites can check this string to detect the browser and its version, and can use this information to output different HTML and content.
Example: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Safari/532.5
Leave empty to use the default value (Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)) configured for all Web Pages sources in the Web Connector page (Configuration > Connectors > Web Crawler).
-
Kerberos Cross Domain
-
Specifies a semicolon-separated list of Service Principal Names for cross-domain authentication with Kerberos.
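To see what the User Agent Identifier parameter described above amounts to at the HTTP level, the following Python sketch builds a request that carries the example identifier string in its User-Agent header. It is an illustration only and does not show how CES sends requests. The build_request helper is hypothetical, and the request is constructed but never sent.

```python
from urllib.request import Request

# Example identifier string quoted in the parameter description above.
USER_AGENT_ID = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
    "AppleWebKit/532.5 (KHTML, like Gecko) Safari/532.5"
)


def build_request(url, user_agent=USER_AGENT_ID):
    """Build (but do not send) a GET request carrying the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("http://www.myorganization.com/")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

A website that varies its output by visitor would inspect this header value, which is why the identifier chosen for the source can change the HTML that gets indexed.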
In the Option section:
-
Index the document's metadata
-
When selected, CES indexes all the document metadata, even metadata that are not associated with a field. The orphan metadata are added to the body of the document so that they can be searched.
When cleared (default), only the values of system and custom fields that have the Free Text Queries attribute selected will be searchable without using a field query (see Adding a Field to Search On and What Are Field Queries and Free Text Queries?).
Example: A document has two metadata items:
-
LastEditedBy containing the value Hector Smith
-
Department containing the value RH
In CES, the custom field CorpDepartment is bound to the metadata Department and its Free Text Queries attribute is selected.
When the Index the document's metadata option is cleared, searching for RH returns the document because a field is indexing this value. Searching for hector does not return the document because no field is indexing this value.
When the Index the document's metadata option is selected, searching for hector also returns the document because CES indexed orphan metadata.
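The example above can be restated as a small model. The following Python sketch is an illustration under simplifying assumptions, not a description of how CES indexes documents: the searchable_terms function is hypothetical, matching is reduced to lowercase whole words, and only fields with the Free Text Queries attribute selected are listed in the bindings.

```python
def searchable_terms(metadata, field_bindings, index_all_metadata):
    """Return the lowercase terms that a free text query could match.

    metadata: metadata name -> value
    field_bindings: metadata name -> field name, limited to fields whose
        Free Text Queries attribute is selected
    index_all_metadata: models the "Index the document's metadata" option
    """
    terms = set()
    for name, value in metadata.items():
        # Orphan metadata are searchable only when the option is selected.
        if name in field_bindings or index_all_metadata:
            terms.update(value.lower().split())
    return terms


document = {"LastEditedBy": "Hector Smith", "Department": "RH"}
bindings = {"Department": "CorpDepartment"}

cleared = searchable_terms(document, bindings, index_all_metadata=False)
selected = searchable_terms(document, bindings, index_all_metadata=True)

print("rh" in cleared, "hector" in cleared)    # True False
print("rh" in selected, "hector" in selected)  # True True
```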
-
-
Document's addresses are case-sensitive
-
Select only when the addresses of website documents are case-sensitive. This option is cleared by default.
-
Generate a cached HTML version of indexed documents
-
Leave this check box selected (recommended). When indexing, CES creates HTML versions of indexed documents. In the search interfaces, users can then more rapidly review the content by clicking the Quick View link rather than opening the original web page. Consider clearing this check box only when you do not want to use Quick View links or when you want to save resources when building the source. This option is selected by default.
-
Open results with cached version
-
Leave this check box cleared (recommended) so that in the search interfaces, the main search result link opens the original web page. Consider selecting this check box only when you do not want users to be able to open the original web page but only see the HTML version of the document as a Quick View. In this case, you must also select Generate a cached HTML version of indexed documents. This option is cleared by default.
-
Reuse HTTP Connection
-
When crawling a website secured with Kerberos authentication, select this check box to keep the Kerberos connection alive between HTTP GET requests. This prevents repeating the Kerberos authentication for each request and can significantly improve the crawling performance.
-
Skip addresses with parameters (domain.com?parameters)
-
Select this check box to prevent CES from indexing pages whose addresses contain a query part that can return similar content. This avoids indexing duplicate pages and saves disk space. Clear this check box when the same address returns different content with different parameters. This option is selected by default.
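As an illustration of the Skip addresses with parameters option, the following Python sketch filters out addresses that contain a query part. It is a simplified model and not the connector's actual logic: the has_query and filter_addresses functions are hypothetical, and the sample addresses are made up for the example.

```python
from urllib.parse import urlsplit


def has_query(url):
    """Return True when the address contains a non-empty query part."""
    return bool(urlsplit(url).query)


def filter_addresses(urls, skip_with_parameters=True):
    """Model the option: drop addresses with a query part when it is selected."""
    if not skip_with_parameters:
        return list(urls)
    return [url for url in urls if not has_query(url)]


urls = [
    "http://www.myorganization.com/products",
    "http://www.myorganization.com/products?sort=price",
    "http://www.myorganization.com/products?page=2",
]

kept = filter_addresses(urls)
print(kept)
```

With the option selected, only the first address is kept. With the option cleared (skip_with_parameters=False), all three addresses are kept, which is the appropriate setting when each parameter value returns different content.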
-
-
In the Security section of the Add Source page, when authentication is needed to crawl the website, enter the appropriate value for the following parameters:
-
In the Authentication section, select one of the following options:
-
Crawl anonymously
Select when the full content of the website is available to everybody.
-
Crawl using the service identity
Select when the website is secured and the user identity of the CES service has full access to the website (see About the CES Service Logon Account).
-
Crawl using this identity
Select when the website is secured and you want to use a specific user identity to crawl the website content (see Adding a User Identity).
Note: You can set up Kerberos authentication to impersonate a user by creating and selecting a user identity for that user. The crawler threads then impersonate that user. The user must be from the same domain as the crawled web server. Consider selecting the Reuse HTTP Connection option.
-
-
In the Security Provider drop-down list, when you choose not to crawl anonymously, select the security provider that can authenticate the user identity specified in the Authentication section.
-
Click Save to save the source configuration and start indexing this source.
-
-
When the website you are indexing uses Kerberos authentication and you assigned a Kerberos user identity to the source:
-
In the navigation panel on the left, select Advanced.
-
(CES 7.0.6424+, February 2014) On the right, in the Crawling section, select the Enable Kerberos authentication option. When the option is cleared, NTLM or Basic authentication is used.
Note: Consider clearing the Enable Kerberos authentication option to prevent getting error messages similar to the following:
An error occurred while warming up search page [URL]: class CGLNetwork::NetworkAccessDenied: The login information of server (SERVER NAME) is invalid.
-
-
Click Start to build your source.
-
Validate that the source building process is executed without errors:
-
In the navigation panel on the left, click Status, and then validate that the indexing proceeds without errors.
OR
-
Open the CES Console to monitor the source building activities (see Using the CES Console).
-
What's Next?
Source-level permissions are not indexed for Web Pages sources. However, when web page files are stored on the same network as the Coveo Master server, you can associate file server permissions with them (see Modifying Source Security Permissions).
CES also supports form-based authentication to access certain secure web pages (see Indexing Secure Web Pages Using Forms).