Configuring and Indexing a Sitemap Source

A source defines a set of configuration parameters for one or more Sitemap files listing the content of your site.

To configure and index a source with the Sitemap connector

  1. On the Coveo server, access the Administration Tool (see Opening the Administration Tool).

  2. Select Index > Sources and Collections.

  3. In the Collections section:

    1. Select an existing collection in which you want to add the new source.

      OR

    2. Click Add to create a new collection (see Adding a Collection).

  4. In the Sources section, click Add.

  5. In the General Settings section of the Add Source page:

    1. Enter the appropriate value for the following required parameters:

      Name

      Enter a descriptive name of your choice for the connector source.

      Example: My Organization Website Sitemap

      Source Type

      The connector used by this source. In this case, select Sitemap.

      Addresses

      Enter the URLs to one or more Sitemap files or Sitemap index files in either the http:// or https:// form.

      Notes:

      • By default, Sitemap files and Sitemap index files that do not respect the following validations based on the Sitemap protocol are ignored during the indexing process (see Sitemap protocol):

        • A Sitemap file must be no larger than 10 MB once uncompressed (the limit applies to the uncompressed size even if the file is compressed with GZIP).

        • A Sitemap file cannot contain more than 50,000 URLs.

        • All referenced URLs must be less than 2,048 characters.

        • All referenced URLs must be relative to the Sitemap that references them and in the same domain. The location of a Sitemap file determines the set of URLs that can be included in that Sitemap.

          Example: A Sitemap file located at http://myorgwebsite.com/tech/sitemap.xml can include any URLs starting with http://myorgwebsite.com/tech/ but cannot include URLs starting with http://myorgwebsite.com/catalog/.

      • When you do not want your Sitemap files and Sitemap index files to be validated, add the ParseSitemapInStrictMode hidden parameter with the false value (see Modifying Hidden Sitemap Source Parameters). In this case, the above validations are not performed. Consequently, all web pages are indexed if their reference URL is valid and absolute.

      • When you want to retrieve the content of the web pages listed in an XML Sitemap, enter the direct Sitemap URL instead of the website address. Otherwise, the source could interpret the web page as a Sitemap file in HTML and crawl the discovered links.

        Example: You enter the following URL: http://myorgwebsite.com/sitemap.xml instead of http://myorgwebsite.com/.

      • The Sitemap connector can retrieve all links contained in a web page. In this case, the crawler treats the web page as a Sitemap file in HTML: it crawls the links listed on that page, but does not expand the links discovered on the linked pages.

        You can also select only a specific part of a web page to be indexed by adding the HtmlXPathSelectorExpression hidden parameter. The parameter value must be an XPath expression that selects one or more nodes of a web page containing the URLs to crawl (see Modifying Hidden Sitemap Source Parameters). By default, the connector indexes all listed web pages from an HTML Sitemap.

        Example: You want to index only a specific portion of the CBC Sitemap web page (only the web pages linked inside the cbc-sitemap div container), so you add the parameter with the following value: //div[@id='cbc-sitemap'].

        • Any XPath expression that selects nodes can be used to set the website portion to include (see XPath syntax).

        • You should also set the ParseSitemapInStrictMode hidden parameter to false since an HTML web page does not follow the Sitemap protocol (see Sitemap Protocol).
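
To illustrate what an HtmlXPathSelectorExpression value such as //div[@id='cbc-sitemap'] selects, the following Python sketch applies the same expression to a small, hypothetical XHTML page and lists the links found inside the selected node. The page markup, the URLs, and the function name are assumptions made for illustration; this is not how the connector itself is implemented. Python's ElementTree supports only a subset of XPath and requires well-formed markup, so the expression is prefixed with a dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page standing in for an HTML Sitemap page.
PAGE = """
<html>
  <body>
    <div id="header"><a href="http://myorgwebsite.com/login">Log in</a></div>
    <div id="cbc-sitemap">
      <ul>
        <li><a href="http://myorgwebsite.com/news">News</a></li>
        <li><a href="http://myorgwebsite.com/sports">Sports</a></li>
      </ul>
    </div>
  </body>
</html>
"""


def links_in_selected_nodes(page: str, xpath: str) -> list[str]:
    """Return the href of every <a> element inside the nodes selected by xpath."""
    root = ET.fromstring(page)
    urls = []
    # ElementTree paths must be relative, hence the leading dot.
    for node in root.findall("." + xpath):
        for anchor in node.iter("a"):
            href = anchor.get("href")
            if href:
                urls.append(href)
    return urls


selected = links_in_selected_nodes(PAGE, "//div[@id='cbc-sitemap']")
print(selected)
```

Only the two links inside the cbc-sitemap container are returned; the link in the header div is left out.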

      Examples:

      • http://myorgwebsite.com/sitemap.xml (Public website Sitemap)

      • http://myorgwebsite.com/sitemap.xml.gz (Public website Sitemap compressed with GZIP)

      • http://myorgwebsite.com/sitemap (Web page containing links such as a site map)

      You can enter more than one Sitemap file or Sitemap index file address on separate lines, but you must ensure that all source parameters apply to all Sitemap files. Otherwise, create other sources.
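
The strict-mode validations listed in the notes above can be sketched in Python as a way to check a Sitemap file before adding its address to a source. This is a minimal sketch under stated assumptions: the function name, the sample Sitemap content, and the wording of the reported problems are hypothetical, and the connector's actual checks may differ in detail.

```python
import xml.etree.ElementTree as ET

MAX_BYTES = 10 * 1024 * 1024  # 10 MB, measured on the uncompressed file
MAX_URLS = 50_000
MAX_URL_LENGTH = 2048  # URLs must be shorter than this
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def strict_mode_problems(sitemap_url: str, xml_bytes: bytes) -> list[str]:
    """Return a list of validation problems (empty when the file passes)."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append("file larger than 10 MB")
    root = ET.fromstring(xml_bytes)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    if len(locs) > MAX_URLS:
        problems.append("more than 50,000 URLs")
    # Listed URLs must start with the folder that contains the Sitemap file.
    base = sitemap_url.rsplit("/", 1)[0] + "/"
    for loc in locs:
        if len(loc) >= MAX_URL_LENGTH:
            problems.append(f"URL too long: {loc[:40]}...")
        if not loc.startswith(base):
            problems.append(f"URL outside {base}: {loc}")
    return problems


# Hypothetical Sitemap content with one URL outside the Sitemap location.
SAMPLE = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://myorgwebsite.com/tech/page1.html</loc></url>
  <url><loc>http://myorgwebsite.com/catalog/item.html</loc></url>
</urlset>"""

problems = strict_mode_problems("http://myorgwebsite.com/tech/sitemap.xml", SAMPLE)
print(problems)
```

With this sample, only the /catalog/ URL is reported, which matches the location rule described in the notes.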

      Refresh Schedule

      Time interval at which the index is automatically refreshed to keep the index content up-to-date. By default, the Every day option instructs CES to refresh the source every day at 12 AM (see Creating or Modifying a Source Schedule).

    2. Review the value for the following parameters that often do not need to be modified:

      Rating

      Change this value only when you want to globally change the rating associated with all items in this source relative to the rating of other sources (see Understanding Search Results Ranking).

      Example: If this source was for an important Sitemap, you may want to set this parameter to High, so that in the search interface, results from this source appear earlier in the search result list compared to those from other sources.

      Document Types

      If you defined custom document type sets, ensure that you select the most appropriate one for this source (see What Are Document Type Sets?).

      Active Languages

      If you defined custom language sets, ensure that you select the most appropriate one for this source (see Adding and Configuring a Language Set).

      Fields

      Select the field set that you created earlier (see Sitemap Connector Deployment Overview).

  6. In the Specific Connector Parameters & Options section of the Add Source page:

    1. Review if you need to change the default values for the following parameters:

      Number of Refresh Threads

      Determines the number of refresh threads that allow the connector to crawl web pages in parallel. The default value is 2 threads.

      Notes:

      • CES 7.0.8047+ (December 2015) The connector supports multiple threads (2+) for websites that use a form-based authentication.

      • CES 7.0.7914 (October 2015) You must set the value to 1.

      • Increasing this value may improve source refresh speed but puts more load on the website server.
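
The effect of the Number of Refresh Threads parameter can be pictured with the following Python sketch, in which a thread pool of the configured size downloads pages in parallel. The fetch_page function is a hypothetical stand-in that does not access the network, and the function and parameter names are assumptions; the sketch only illustrates the general idea of bounded parallel crawling, not the connector's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url: str) -> str:
    """Hypothetical stand-in for downloading one web page."""
    return f"content of {url}"


def refresh(urls: list[str], refresh_threads: int = 2) -> dict[str, str]:
    """Fetch all URLs using at most `refresh_threads` threads at a time."""
    with ThreadPoolExecutor(max_workers=refresh_threads) as pool:
        # pool.map preserves the order of the input URLs.
        return dict(zip(urls, pool.map(fetch_page, urls)))


urls = [
    "http://myorgwebsite.com/page1",
    "http://myorgwebsite.com/page2",
    "http://myorgwebsite.com/page3",
]
pages = refresh(urls, refresh_threads=2)
print(pages)
```

A higher refresh_threads value lets more downloads run at the same time, which is why it can shorten the refresh but also increases the number of simultaneous requests the website server must handle.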

      Mapping File

      The path to the mapping file. Leave the default value to use the default mapping file that comes with the connector (Coveo.CES.CustomCrawlers.Sitemap.MappingFile.xml). If you create a custom mapping file, enter the full path to your custom mapping file. Contact Coveo Support for assistance if you need to customize the mapping file.

      User-Agent HTTP header

      Determines the identifier used by the Sitemap connector to identify itself when downloading web pages. The default value is Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36.
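
To check how your web server responds to the connector's default User-Agent value, you can build a request carrying the same header. The following Python sketch only constructs the request and prints the header; it does not send anything. The function name and the Sitemap URL are assumptions made for illustration.

```python
import urllib.request

# Default User-Agent value documented for the Sitemap connector.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36"
)


def build_request(url: str, user_agent: str = DEFAULT_USER_AGENT) -> urllib.request.Request:
    """Build (without sending) an HTTP request that carries the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://myorgwebsite.com/sitemap.xml")
# urllib normalizes header names, so the key is read back as "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would send it, which lets you confirm that the website does not block or alter content for this identifier.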

    2. In the Parameters section, click Add Parameter to be able to change the default value of hidden parameters (see Modifying Hidden Sitemap Source Parameters).

      Notes:

      • CES 7.0.8225+ (March 2016) (For XML sitemaps only) When the last modification dates are not in a standard format (such as YYYY-MM-DDThh:mm:ss.sTZD), an error appears in the CES Console (SITEMAP_INVALID_FORMAT_ERROR with an Invalid date). In this case, add the DateFormat hidden parameter to specify the custom date format used in the Sitemap file. The format must use the MSDN format specifiers (see Custom Date and Time Format Strings and DateFormat).

        Example: yyyy;MM;ddTHH:mm:sszzz

      • CES 7.0.7814+ (August 2015) When you use basic authentication and you get an HTTP 404 error in the CES Console, you can add the ForceBasicAuthorizationHeader hidden parameter and set it to true (see ForceBasicAuthorizationHeader).
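
To see what kind of date the example DateFormat value yyyy;MM;ddTHH:mm:sszzz describes, the following Python sketch parses a date written in that layout. The DateFormat parameter itself uses .NET format specifiers; the strptime pattern below is an assumed Python equivalent (yyyy as %Y, MM as %m, dd as %d, HH as %H, mm as %M, ss as %S, zzz as %z), and the sample date is hypothetical.

```python
from datetime import datetime

# Assumed Python equivalent of the .NET format yyyy;MM;ddTHH:mm:sszzz
PYTHON_EQUIVALENT = "%Y;%m;%dT%H:%M:%S%z"


def parse_lastmod(value: str) -> datetime:
    """Parse a last modification date written with the custom layout."""
    return datetime.strptime(value, PYTHON_EQUIVALENT)


# Hypothetical last modification date as it could appear in a Sitemap file.
parsed = parse_lastmod("2016;03;15T10:30:00-05:00")
print(parsed.isoformat())
```

Trying the pattern against a few real values from your Sitemap file before setting the parameter can help confirm that every separator and time zone offset is accounted for.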

    3. In the Option section, review the default value of the following check boxes:

      Index subfolders

      This option, a generic connector parameter, is not taken into account and has no effect for the Sitemap connector.

      Index the document's metadata

      When selected, CES indexes all the document metadata, even metadata that are not associated with a field. The orphan metadata are added to the body of the document so that they can be searched using free text queries.

      When cleared (default), only the values of system and custom fields that have the Free Text Queries attribute selected will be searchable without using a field query (see Adding a Field to Search On and What Are Field Queries and Free Text Queries?).

      Example: A document has two metadata:

      • LastEditedBy containing the value Hector Smith

      • Department containing the value RH

      In CES, the custom field CorpDepartment is bound to the metadata Department and its Free Text Queries attribute is selected.

      When the Index the document's metadata option is cleared, searching for RH returns the document because a field is indexing this value. Searching for hector does not return the document because no field is indexing this value.

      When the Index the document's metadata option is selected, searching for hector also returns the document because CES indexed orphan metadata.
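
The example above can be modeled with the following Python sketch, which computes the words that a free text query can match depending on the Index the document's metadata option. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, the document body is ignored, and CES does not necessarily tokenize values this way.

```python
def free_text_terms(metadata, field_bindings, free_text_fields, index_all_metadata):
    """Return the lowercase words that a free text query can match."""
    terms = set()
    for name, value in metadata.items():
        field = field_bindings.get(name)  # None when the metadata is orphan
        bound_and_searchable = field in free_text_fields
        orphan = field is None
        if bound_and_searchable or (orphan and index_all_metadata):
            terms.update(value.lower().split())
    return terms


metadata = {"LastEditedBy": "Hector Smith", "Department": "RH"}
field_bindings = {"Department": "CorpDepartment"}  # metadata name -> field name
free_text_fields = {"CorpDepartment"}  # fields with Free Text Queries selected

cleared = free_text_terms(metadata, field_bindings, free_text_fields, False)
selected = free_text_terms(metadata, field_bindings, free_text_fields, True)
print("rh" in cleared, "hector" in cleared)
print("rh" in selected, "hector" in selected)
```

With the option cleared, only the value bound to CorpDepartment is searchable; with the option selected, the orphan LastEditedBy value becomes searchable as well.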

      Generate a cached HTML version of indexed documents

      When you select this check box (recommended), at indexing time, CES creates HTML versions of indexed documents. In the search interfaces, users can then more rapidly review the content by clicking the Quick View link rather than opening the original document with the original application. Consider clearing this check box only if you do not want to use Quick View links or to save resources when building the source.

      Open results with cached version

      Leave this check box cleared (recommended) so that in the search interfaces, the main search result link opens the original document with the original application. Consider selecting this check box only when you do not want users to be able to open the original document but only see the HTML version of the document as a Quick View. In this case, you must also select Generate a cached HTML version of indexed documents.

  7. When your website requires authentication, in the Security section of the Add Source page:

    1. In the Authentication drop-down list, select the Sitemap user identity that you created for this source (see Sitemap Connector Deployment Overview). Otherwise, select (none).

      Note: By specifying a User Identity, the connector can authenticate using the following supported authentication schemes:

      Some setups can be problematic, but most of the setups should be supported. You can use the ManualCookies hidden parameter when your website does not use one of these authentication schemes (see Modifying Hidden Sitemap Source Parameters).

    2. Click Save and Start to save the source configuration and build the source.

  8. Manually set the security on the source, by changing the default Permissions option to set the permissions globally on the source:

    Note: You get the following error message in the CES Console when the Index security permissions option is selected by default:

    Permissions indexing is not provided by the Sitemap connector. You must manually configure the permissions on the source.

    1. In the navigation panel on the left, select Permissions.

    2. In the Permissions page:

      1. Select the Specifies the security permissions to index option.

      2. Optionally, in the Allowed Users list, add or remove users or groups to precisely specify who has access to the content from this source.

        By default, the Active Directory everyone \S-1-1-0\ group specifies that any Active Directory user can see all the content from this source.

      3. Optionally, in the Denied Users list, add users or groups to specify who does not have access to the content from this source.

      4. Click Apply Changes.

  9. On the toolbar, click Start/Rebuild to start indexing your source.

  10. Validate that the source building process is executed without errors:

    • In the navigation panel on the left, click Status, and then validate that the indexing proceeds without errors.

      OR

    • Open the CES Console to monitor the source building activities (see Using the CES Console).

What's Next?

Set an incremental refresh schedule for your source (see Scheduling a Source Incremental Refresh).
