Configuring and Indexing a Web Pages Source

A source defines a set of connector parameters specifying where and how to crawl a website.

To configure and index a Web Pages source

  1. On the Coveo server, access the Administration Tool (see Opening the Administration Tool).

  2. Select Index > Sources and Collections.

  3. In the Collections section, select the collection to which you want to add a Web Pages source, or click Add to create a new collection (see Adding a Collection).

  4. In the Sources section, click Add.

  5. In the Add Source page:

    1. Enter the appropriate value for the following required parameters:

      Name
      A descriptive name of your choice for the Web Pages source.
      Example: Coveo Website
      Source Type
      The connector used by this source. In this case, select Web Pages.
      Addresses
      The root URL for the website content that you wish to index.
      Example: http://www.coveo.com/
      You can also specify multiple URLs when they share the same configuration. This is useful when you want to index only specific sections of a website. Each URL must be on a separate line in the text box.
      Note: Creating a separate source for each independent website is recommended.
      Refresh Schedule
      Time interval at which the index is automatically refreshed to keep its content up to date. By default, the Every day option instructs CES to refresh the source every day at 12 AM. Choose a refresh rate that matches how often the website content is updated.
      Authentication
      Select one of the following options:
      • Crawl anonymously

        Select when the full content of the website is available to everybody.

      • Crawl using the service identity

        Select when the website is secured and the user identity of the CES service has full access to the website (see About the CES Service Logon Account).

      • Crawl using this identity

        Select when the website is secured and you want to use a specific user identity to crawl the website content (see Adding a User Identity).

    2. Consider modifying the default value for the following parameters:

      Rating
      Change this value only when you want to globally change the rating associated with all items in this source relative to the rating of other sources (see Understanding Search Results Ranking).
      Document Types
      If you have defined custom document type sets, select the most appropriate one for this source (see What Are Document Type Sets?).
      Active Languages
      If you have defined a custom language set for this source, select it.
      Fields
      If you have defined custom field sets, select the most appropriate one for this source (see What Are Field Sets?).
      User Agent

      Determines the name used by the Web Pages connector to identify itself when downloading pages. Leave empty to use the default value (CoveoEnterpriseSearch) configured for all Web Pages sources in the Web Connector page (Configuration > Connectors > Web Connector).

      User Agent Identifier

      Determines the identifier used by the Web Pages connector to identify itself when downloading pages. Some websites inspect the HTTP user agent string to detect whether the visitor is a specific browser (and version) or a search engine crawler, and can return different HTML and content depending on the result.

      Example: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Safari/532.5

      Leave empty to use the default value (Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)) configured for all Web Pages sources in the Web Connector page (Configuration > Connectors > Web Connector).

    3. Verify that the appropriate check boxes are selected:

      Index the document's metadata
      When selected, CES indexes all the document metadata, even metadata that are not associated with a field. The orphan metadata are added to the body of the document so that they can be searched. This option is cleared by default.
      Document's addresses are case-sensitive
      Select only when the addresses of website documents are case-sensitive. This option is cleared by default.
      Generate a cached HTML version of indexed documents
      Leave this check box selected (recommended). When indexing, CES creates HTML versions of indexed documents. In the search interfaces, users can then review the content more rapidly by clicking the Quick View link rather than opening the original web page. Consider clearing this check box only when you do not want to use Quick View links or when you want to save resources when building the source. This option is selected by default.
      Open results with cached version
      Leave this check box cleared (recommended) so that in the search interfaces, the main search result link opens the original web page. Consider selecting this check box only when you do not want users to be able to open the original web page but only see the HTML version of the document as a Quick View. In this case, you must also select Generate a cached HTML version of indexed documents. This option is cleared by default.
      Skip addresses with parameters (domain.com?parameters)

      Select this check box to prevent CES from indexing pages whose addresses contain a query part, which often return similar content. This saves disk space and avoids indexing duplicate pages. Clear this check box when the same address with different parameters returns different content. This option is selected by default.

    4. Click Save and Start to save the source configuration and start the indexing of the new source.

  6. Validate that the source building process is executed without errors:

    • In the navigation panel on the left, click Status, and then validate that the indexing proceeds without errors.

      OR

    • Open the CES Console to monitor the source building activities (see Using the CES Console).
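
The effect of the Skip addresses with parameters option from step 5 can be previewed on a list of candidate addresses before building the source. The following is a minimal Python sketch, not the CES implementation: it assumes that "addresses with parameters" means any URL with a non-empty query part (the text after `?`), and the function names and the sample URLs other than http://www.coveo.com/ are hypothetical.

```python
from urllib.parse import urlsplit


def has_parameters(address: str) -> bool:
    """Return True when the address has a non-empty query part (text after '?')."""
    return urlsplit(address).query != ""


def split_by_parameters(addresses):
    """Separate addresses into those kept and those skipped.

    Assumption: an address is skipped exactly when it has a query part,
    which approximates the behavior described for the check box.
    """
    kept, skipped = [], []
    for address in addresses:
        if has_parameters(address):
            skipped.append(address)
        else:
            kept.append(address)
    return kept, skipped


# Hypothetical sample addresses, for illustration only.
sample = [
    "http://www.coveo.com/",
    "http://www.coveo.com/products?page=2",
    "http://www.coveo.com/products",
]
kept, skipped = split_by_parameters(sample)
```

If many of the skipped addresses point to pages with distinct content, that suggests clearing the check box for this source.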

What's Next

Source-level permissions are not indexed for Web Pages sources. However, when web page files are stored on the same network as the Coveo Master server, you can associate file server permissions with them (see Modifying Source Security Permissions).

CES also supports form-based authentication to access certain secure web pages (see Indexing Secure Web Pages Using Forms).
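
Because some websites return different content depending on the user agent string (see the User Agent Identifier parameter in step 5), it can help to check what a site serves for a given identifier before setting it on the source. The following Python sketch only shows how a generic HTTP client attaches such a string to a request; it assumes the identifier is sent as the standard `User-Agent` header, and `build_request` is a hypothetical helper, not part of CES.

```python
from urllib.request import Request

# Default identifier quoted in the User Agent Identifier description.
DEFAULT_ID = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)"


def build_request(url: str, user_agent_id: str = "") -> Request:
    """Build a request whose User-Agent header carries the given identifier.

    An empty identifier falls back to DEFAULT_ID, mirroring the
    "leave empty to use the default value" behavior described in step 5.
    """
    return Request(url, headers={"User-Agent": user_agent_id or DEFAULT_ID})


# One request with the default identifier, one with the example from step 5.
default_request = build_request("http://www.coveo.com/")
custom_request = build_request(
    "http://www.coveo.com/",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
    "AppleWebKit/532.5 (KHTML, like Gecko) Safari/532.5",
)
```

Passing each request to `urllib.request.urlopen` and comparing the returned pages shows whether the website varies its output by user agent.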