Modifying Hidden Sitemap Source Parameters
The following list describes the advanced hidden parameters available with Sitemap sources. The parameter type (integer, string, etc.) appears between parentheses following the parameter name.
-
IndexHtmlMetadata (Boolean) CES 7.0.8541+ (September 2016)
Whether metadata tags found in HTML files should be indexed. The content attribute of meta tags is indexed when the tag is keyed with one of the following attributes: name, property, itemprop, or http-equiv. The default value is false since the parameter has an impact on indexing performance.
Example: In the tag <meta property="og:title" content="The Article Title"/>, The Article Title is indexed.
Semicolon separated list of additional HTTP headers added to the connector requests in the following format: key1=\value1;key2=\value2.
-
DateFormat (String) CES 7.0.8225+ (March 2016)
(For XML sitemaps only) The custom date format of the Sitemap file. Specify it when the last modification dates are not in a standard format (e.g., YYYY-MM-DDThh:mm:ss.sTZD) and therefore trigger an error in the CES Console (SITEMAP_INVALID_FORMAT_ERROR with an Invalid date). The format must use the MSDN format specifiers (see Custom Date and Time Format Strings).
Example: yyyy;MM;ddTHH:mm:sszzz
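To see what kind of date string the example format describes, the following Python sketch translates it by hand into roughly equivalent strptime directives (yyyy to %Y, MM to %m, dd to %d, HH:mm:ss to %H:%M:%S, zzz to %z). The sample value and the function name are assumptions made for illustration; the connector itself interprets the .NET specifiers, not Python's.

```python
from datetime import datetime, timedelta

# Hand-translated equivalent of the .NET format "yyyy;MM;ddTHH:mm:sszzz".
# This mapping is an illustration only; CES uses the .NET specifiers directly.
PYTHON_EQUIVALENT = "%Y;%m;%dT%H:%M:%S%z"


def parse_lastmod(value: str) -> datetime:
    """Parse a hypothetical non-standard last modification date."""
    return datetime.strptime(value, PYTHON_EQUIVALENT)


# Hypothetical value using semicolons in the date part and a UTC offset.
sample = "2016;03;15T13:45:30-05:00"
parsed = parse_lastmod(sample)
print(parsed.isoformat())
```

A date that does not match the declared format raises a ValueError in this sketch, which is comparable in spirit to the invalid date error reported in the CES Console.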
-
FormAuthConfigurationPath (String) CES 7.0.7914+ (October 2015)
-
The path to the form-based authentication XML configuration file.
Note: The UseCookies hidden parameter must be set to true. If not, you get the following warning during your sitemap source rebuild:
Ensure the username and password are valid and that you are supplying all the values submitted by the form at "{0}". The request to authenticate did not support HTTP cookies but the authentication form set the following cookies: {0}. This may have caused the form authentication to fail. Consider turning on HTTP cookies support.
-
ForceBasicAuthorizationHeader (Boolean) CES 7.0.7814+ (August 2015)
-
Whether to force the basic authentication header in the web request without waiting for the server challenge. The default value is false. Set it to true when, for example, your server does not challenge the caller for authentication, or when you get an HTTP 404 error (often occurs on non-IIS servers) in the CES Console that looks like the following:
Exception during item expansion: https://myorgwebsite.com/basicauth/user/password. -> The remote server returned an error: (404) Not Found.
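Forcing the header means the credentials are sent with the first request instead of after a server challenge. As a rough illustration of what such a preemptive header contains under standard HTTP Basic authentication, here is a Python sketch; the function name and the credentials are placeholders, and this is not the connector's own code.

```python
import base64


def basic_authorization_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, for illustration only.
headers = basic_authorization_header("user", "password")
print(headers)
```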
-
ParseSitemapInStrictMode (Boolean)
-
Whether each Sitemap file should be parsed in strict mode. In strict mode, when the Sitemap file does not follow the protocol specification (see Protocol Standard Validations), the parsing throws an exception. The default value is false. Set it to true when you want to index your Sitemap files with protocol standard validations.
Note: CES 7.0.7914– (October 2015) The default value was true.
-
ReadTimeout (Integer)
-
The timeout duration in seconds when the connector reads web page content from a stream (i.e., downloading a Sitemap/web page content). The default value is 300 seconds.
-
Timeout (Integer)
-
The number of seconds to wait before the request (i.e., server responding to a request) times out. The default value is 100 seconds.
-
AllowAutoRedirect (Boolean)
-
Whether the request should automatically follow redirection responses from the web resource or not. The default value is true.
-
NumberOfRetries (Integer)
-
The number of retries allowed when a failed web request is recoverable. Only the following HTTP errors will be retried: 408, 500 and 503. The default value is 3 retries.
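The retry rule can be sketched in Python as follows. The send callable stands in for a web request and returns an HTTP status code; counting the initial attempt separately from the retries is an assumption, as the description does not say how the connector counts.

```python
from typing import Callable

RETRYABLE_STATUS_CODES = {408, 500, 503}  # taken from the parameter description


def fetch_with_retries(send: Callable[[], int], number_of_retries: int = 3) -> tuple:
    """Call send() until it returns a non-retryable status or retries run out.

    Returns (final_status, attempts_made).
    """
    attempts = 0
    while True:
        status = send()
        attempts += 1
        # Stop on any non-recoverable status, or once the initial attempt
        # plus number_of_retries retries have all been used.
        if status not in RETRYABLE_STATUS_CODES or attempts > number_of_retries:
            return status, attempts


# Simulated server: two recoverable failures, then success.
responses = iter([503, 500, 200])
result = fetch_with_retries(lambda: next(responses))
print(result)
```

In this sketch a 404 response is returned immediately, since it is not in the list of recoverable errors.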
-
UseCookies (Boolean)
-
Whether cookies must be enabled to crawl. The default value is false. Set the value to true when you want a cookie container to be initialized and reused for each web request during crawling.
-
A collection of manual cookies to inject with each HTTP web request in the following format:
MyCookieName=MyCookieValue;Domain=coveo.com;Expires=Wdy, DD Mon YYYY HH:MM:SS GMT;Path=/;Domain=mydomain.com;Secure;HttpOnly
where you need to enter your information at the specified places.
When you need to define more than one cookie, separate each cookie definition with the ;; separator. The default value is null. Use this parameter when your website does not use one of the four supported authentication schemes and thus needs a specific cookie to be used for crawling (see Supported Authentication Schemes).
Example: MyFirstCookie=MyFirstValue;Domain=www.coveo.com;;MySecondCookie=MySecondValue;Domain=www.example.com
Notes:
-
The only mandatory attributes are the cookie name, its value, and the domain to which the cookie belongs. All attributes must be separated using a semicolon (;) character.
-
The supported optional attributes are:
-
Expires: the expiration date in RFC 1123 format (Wdy, DD Mon YYYY HH:MM:SS GMT);
-
Path: the subfolder path to which the cookie belongs (relative to the root domain);
-
Secure: limits cookie communication to encrypted transmission;
-
HttpOnly: directs browsers to not expose cookies through channels other than HTTP (and HTTPS) requests.
-
The Secure and HttpOnly attributes do not have associated values. The presence of their attribute names indicates that their behaviors are enabled.
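To make the format concrete, the following Python sketch splits a value into individual cookies and checks the mandatory attributes. The function name and the dictionary layout are hypothetical, and the connector's actual parsing may differ.

```python
def parse_manual_cookies(value: str) -> list:
    """Split a ';;'-separated cookie list into one dict per cookie."""
    cookies = []
    for definition in value.split(";;"):
        parts = [part.strip() for part in definition.split(";") if part.strip()]
        if not parts:
            continue  # ignore empty definitions
        name, _, cookie_value = parts[0].partition("=")
        cookie = {"name": name, "value": cookie_value}
        for attribute in parts[1:]:
            key, separator, attribute_value = attribute.partition("=")
            # Secure and HttpOnly carry no value; their presence enables them.
            cookie[key.lower()] = attribute_value if separator else True
        if "domain" not in cookie:
            raise ValueError(f"cookie {name!r} is missing the mandatory Domain attribute")
        cookies.append(cookie)
    return cookies


example = ("MyFirstCookie=MyFirstValue;Domain=www.coveo.com;Secure;;"
           "MySecondCookie=MySecondValue;Domain=www.example.com")
cookies = parse_manual_cookies(example)
for cookie in cookies:
    print(cookie)
```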
-
-
HtmlXPathSelectorExpression (String) CES 7.0.7711+ (June 2015)
-
The XPath expression used to select one or more nodes in an HTML document containing the URLs to crawl. By default, the connector indexes all listed web pages from an HTML Sitemap.
Example: You only want to index a specific portion (only the web pages linked inside the cbc-sitemap div container) of the CBC Sitemap web page so you add the parameter with the following value: //div[@id='cbc-sitemap'].
Notes:
-
The ParseSitemapInStrictMode hidden parameter should be set to false since an HTML web page does not follow the Sitemap protocol (see Sitemap Protocol).
-
Any XPath selecting nodes can be used to set the website portion to index (see XPath Syntax).
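The effect of such an expression can be tried out with Python's standard library, as in the sketch below. The page content is hypothetical, and ElementTree only supports a subset of XPath (it needs a leading "." before "//" and well-formed markup), so this approximates the connector's behavior rather than reproducing it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML Sitemap page (ElementTree needs valid XML).
PAGE = """
<html><body>
  <div id="header"><a href="https://example.com/login">Login</a></div>
  <div id="cbc-sitemap">
    <ul>
      <li><a href="https://example.com/news">News</a></li>
      <li><a href="https://example.com/sports">Sports</a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)
# Equivalent of //div[@id='cbc-sitemap'] in ElementTree's XPath subset.
selected_nodes = root.findall(".//div[@id='cbc-sitemap']")
# Only links inside the selected nodes are kept as URLs to crawl.
urls = [a.get("href") for node in selected_nodes for a in node.iter("a")]
print(urls)
```

The login link in the header is left out because it sits outside the selected div.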
-
-
ScrapingConfiguration (String) CES 7.0.8541+ (September 2016)
-
The JSON web scraping configuration that allows you to specify CSS or XPath selectors to:
-
Filter pages.
-
Exclude page sections.
Example: Exclude page sections such as the header, the footer, or a side panel that are similar in all pages and are considered noise in the index.
-
Scrape metadata from pages.
Example: Extract to a metadata a blog post publication date string that is only available in a specific div in the page (not in a meta tag). You can then map the metadata to a field that can be used in the blog search result template so that search users can easily identify when the blog post was published.
This option adds useful flexibility to the Sitemap connector. The JSON configuration syntax is the same as the one used in Coveo Cloud V2 Web or Sitemap sources (see Web Scraping Configuration).
When you add the hidden parameter, simply paste the appropriate valid JSON configuration in the parameter value box.
Note: The Sitemap connector does not support the sub-item web scraping feature, which splits a crawled page into more than one index item. If you include such a configuration, it is simply ignored.
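As a hedged illustration, the Python sketch below builds a small configuration and checks that it serializes to valid JSON before it is pasted in the parameter value box. The key names ("for", "urls", "exclude", "metadata", "type", "path") are an assumption modeled on the Coveo Cloud V2 web scraping syntax and should be verified against the Web Scraping Configuration reference; the metadata name and the selectors are hypothetical.

```python
import json

# Assumed configuration shape; verify the key names against the
# Web Scraping Configuration reference before using it.
scraping_configuration = [
    {
        "for": {"urls": [".*"]},  # filter: pages this rule applies to
        "exclude": [
            {"type": "CSS", "path": "header"},  # noise common to all pages
            {"type": "CSS", "path": "footer"},
        ],
        "metadata": {
            # Hypothetical metadata name and selector for a publication date.
            "publicationdate": {
                "type": "XPATH",
                "path": "//div[@class='post-date']/text()",
            },
        },
    }
]

parameter_value = json.dumps(scraping_configuration)
# Round-trip to confirm the value is valid JSON before pasting it.
assert json.loads(parameter_value) == scraping_configuration
print(parameter_value)
```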
-
EnableJavaScript (Boolean) CES 7.0.8541+ (September 2016)
-
Whether JavaScript should be evaluated and rendered before indexing. The default value is false. This option is useful when you want to index the dynamically rendered content of crawled pages. Be aware, however, that activating this option has a significant impact on crawling performance.
-
IndexHtmlMetadata (Boolean) CES 7.0.8541+ (September 2016)
-
Whether the metadata tags found in HTML files should be scraped and passed to the index. The feature extracts the content attribute value for all meta HTML elements with a name, property, itemprop, or http-equiv attribute as well as the title HTML element value. The default value is false.
Notes:
-
Enabling this option may significantly reduce the crawling performance as the crawler must scrape each page.
-
The CES converter by default also more efficiently extracts meta HTML elements with a name attribute. Consider enabling this option only when you want to extract meta HTML elements with a property, itemprop, or http-equiv attribute.
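The extraction rule described for this parameter can be approximated with Python's standard HTML parser, as sketched below. The class name and the sample page are hypothetical, the title element is left out for brevity, and the connector's own implementation may behave differently.

```python
from html.parser import HTMLParser

KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaCollector(HTMLParser):
    """Collect the content attribute of keyed meta tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "content" not in attributes:
            return  # e.g. <meta charset="utf-8"/> has nothing to index
        for key in KEY_ATTRIBUTES:
            if key in attributes:
                self.metadata[attributes[key]] = attributes["content"]
                break


collector = MetaCollector()
collector.feed(
    "<html><head>"
    '<meta property="og:title" content="The Article Title"/>'
    '<meta name="description" content="A short summary"/>'
    '<meta charset="utf-8"/>'
    "</head></html>"
)
print(collector.metadata)
```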
-
To modify hidden Sitemap source parameters
-
Refer to Adding an Explicit Connector Parameter to add one or more Sitemap source parameters.
-
For a new Sitemap source, access the Add Source page of the Administration Tool to modify the value of the newly added advanced parameter:
-
Select Index > Sources and Collections.
-
Under Collections, select the collection in which you want to add the source.
-
Under Sources, click Add.
-
In the Add Source page, edit the newly added advanced parameter value.
-
-
For an existing Sitemap source, access the Source: ... General page of the Administration Tool to modify the value of the newly added advanced parameter:
-
Select Index > Sources and Collections.
-
Under Collections, select the collection containing the source you want to modify.
-
Under Sources, click the existing Sitemap source in which you want to modify the newly added advanced parameter.
-
In the Source: ... General page, edit the newly added advanced parameter value.
-
-
Rebuild your Sitemap source to apply the changes to the parameters (see Applying an Action to a Collection or a Source).