Configuring and Indexing an Amazon S3 Source
To configure and index an Amazon S3 source
-
On the Coveo server, access the Administration Tool (see Opening the Administration Tool).
-
Select Index > Sources and Collections.
-
In the Collections section:
-
Select the existing collection to which you want to add the new source.
OR
-
Click Add to create a new collection (see Adding a Collection).
-
In the Sources section, click Add.
The Add Source page that appears is organized in three sections.
-
In the General Settings section of the Add Source page:
-
Enter the appropriate value for the following required parameters:
-
Name
-
A descriptive name of your choice for the connector source.
Example: Amazon S3 Site
-
Source Type
-
The connector used by this source. In this case, select Amazon S3.
-
Addresses
-
The address of the Amazon S3 bucket, in one of the following styles:
-
Virtual-host style
Examples:
-
http://[bucket].s3.amazonaws.com/
-
http://[bucket].s3-[aws-region].amazonaws.com/
where you replace [bucket] with your actual bucket name and [aws-region] with your region-specific endpoint.
-
Path style
Examples:
-
http://s3.amazonaws.com/[bucket]
-
http://s3-[aws-region].amazonaws.com/[bucket]
where you replace [bucket] with your actual bucket name and [aws-region] with your region-specific endpoint.
-
You can enter more than one bucket address on separate lines, but you must ensure that all source parameters apply to all Amazon S3 buckets. Otherwise, create other sources for other buckets.
Notes:
-
The starting address must specify one bucket with its region. URLs that do not specify a region use the US Standard (us-east-1) region endpoint.
-
When the URL points to a folder inside a bucket, only keys starting with that prefix are crawled.
-
You can index more than one bucket.
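The address rules above can be sketched in Python. This is only an illustration of how the two address styles, the default region, and the folder prefix relate, not the connector's actual parsing logic. The function names parse_s3_address and keys_under_prefix, the regular expressions, and the bucket name mybucket are assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Hostname patterns for the two address styles described above.
# Assumption: only the "s3" and "s3-[aws-region]" host forms are handled.
_PATH_HOST = re.compile(r"^s3(?:-(?P<region>[a-z0-9-]+))?\.amazonaws\.com$")
_VIRTUAL_HOST = re.compile(
    r"^(?P<bucket>.+)\.s3(?:-(?P<region>[a-z0-9-]+))?\.amazonaws\.com$"
)

DEFAULT_REGION = "us-east-1"  # US Standard, used when the URL names no region


def parse_s3_address(address):
    """Return (bucket, region, key_prefix) for a source address."""
    parsed = urlparse(address)
    host = parsed.hostname or ""
    path = parsed.path.lstrip("/")

    match = _PATH_HOST.match(host)
    if match:
        # Path style: the first path segment is the bucket name.
        bucket, _, prefix = path.partition("/")
    else:
        match = _VIRTUAL_HOST.match(host)
        if not match:
            raise ValueError(f"Not a recognized Amazon S3 address: {address}")
        # Virtual-host style: the bucket name is part of the hostname.
        bucket, prefix = match.group("bucket"), path
    if not bucket:
        raise ValueError(f"No bucket specified in address: {address}")
    return bucket, match.group("region") or DEFAULT_REGION, prefix


def keys_under_prefix(keys, prefix):
    """Keep only the keys that start with the folder prefix of the address."""
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    for url in (
        "http://mybucket.s3.amazonaws.com/",
        "http://s3-us-west-2.amazonaws.com/mybucket/reports/2015/",
    ):
        print(url, "->", parse_s3_address(url))
```

Running the script prints the bucket, region, and prefix derived from each sample address, which can help confirm that every address entered for a source points to the intended bucket and folder.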
-
Fields
-
If you defined an Amazon S3 field set, select it (see Amazon S3 Connector Deployment Overview and What Are Field Sets?).
-
Refresh Schedule
-
Time interval at which the index is automatically refreshed to keep the index content up to date. By default, the Every day option instructs CES to refresh the source every day at 12 AM.
Note: You can create a new source refresh schedule or modify an existing one (see Creating or Modifying a Source Schedule).
-
Review the value for the following parameters that often do not need to be modified:
-
Rating
-
Change this value only when you want to globally change the ranking associated with all items in this source relative to the rating of other sources (see Understanding Search Results Ranking).
Example: If this source is for a legacy PLM, you may want to set this parameter to Low, so that in the search interface, results from this source appear later in the list compared to those from other sources.
-
Document Types
-
If you created a custom document type set for this source, select it (see Creating a Document Type Set). Otherwise, leave Default.
-
Active Languages
-
If you defined custom active language sets, ensure that you select the most appropriate one for this source (see Adding and Configuring a Language Set).
-
In the Specific Connector Parameters & Options section of the Add Source page:
-
When the Amazon S3 content is private, enter the appropriate value for the following parameters. Otherwise (for a public data set), leave them empty:
Notes:
-
The Access Key and Secret Key are accessible in the IAM console (see Understanding and Getting Your Security Credentials).
-
CES 7.0.7914+ (October 2015): The Access Key and Secret Key parameters are optional.
-
Access Key
-
The ID of the IAM account access key used to request data from the Amazon S3 servers.
Example: AKIAIOSFODNN74152KOP
-
Secret Key
-
The IAM account secret access key used to request data from the Amazon S3 servers.
Example: wJalrXUtnFEMI/K7MDENG/bPxRfiCYifc51AYQQf
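The rule above (both keys for private content, both empty for a public data set) can be expressed as a small pre-check before values are entered in the source. This is a hypothetical helper, not part of CES. The function name check_credentials is an assumption, and so is the choice to treat a single key entered without the other as an error.

```python
def check_credentials(access_key, secret_key):
    """Classify the key pair as 'public' (both empty) or 'private' (both set)."""
    access_key = (access_key or "").strip()
    secret_key = (secret_key or "").strip()

    if not access_key and not secret_key:
        return "public"   # public data set: leave both parameters empty
    if access_key and secret_key:
        return "private"  # private content: both parameters are filled in
    # Assumption: one key without the other is a configuration mistake.
    raise ValueError(
        "Enter both the Access Key and the Secret Key, or leave both empty."
    )


if __name__ == "__main__":
    print(check_credentials("", ""))
    print(check_credentials(
        "AKIAIOSFODNN74152KOP",
        "wJalrXUtnFEMI/K7MDENG/bPxRfiCYifc51AYQQf",
    ))
```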
-
In the Mapping File box, leave the default mapping file name (Coveo.CES.CustomCrawlers.AmazonS3.MappingFile.xml) unless you created a custom mapping file, in which case, enter the full path of your valid mapping file.
-
Click Add Parameter when you want to show and change the value of hidden source parameters (see Modifying Hidden Amazon S3 Source Parameters).
-
In the Option section, the state of check boxes generally does not need to be changed:
-
Index Subfolders
-
Check to index all subfolders below the specified starting addresses.
-
Index the document's metadata
-
When selected, CES indexes all the document metadata, even metadata that are not associated with a field. The orphan metadata are added to the body of the document so that they can be searched using free text queries.
When cleared (default), only the values of system and custom fields that have the Free Text Queries attribute selected will be searchable without using a field query (see Adding a Field to Search On and What Are Field Queries and Free Text Queries?).
Example: A document has two metadata:
-
LastEditedBy containing the value Hector Smith
-
Department containing the value RH
In CES, the custom field CorpDepartment is bound to the metadata Department and its Free Text Queries attribute is selected.
When the Index the document's metadata option is cleared, searching for RH returns the document because a field is indexing this value. Searching for hector does not return the document because no field is indexing this value.
When the Index the document's metadata option is selected, searching for hector also returns the document because CES indexed orphan metadata.
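The example above can be modeled with a short Python sketch. It is a toy simulation of the described behavior, not how CES indexes content. The function searchable_terms and its data structures are assumptions made for illustration.

```python
def searchable_terms(metadata, field_bindings, index_all_metadata):
    """Return the lowercase words that a free text query can match.

    metadata:           metadata name -> value found on the document
    field_bindings:     metadata name -> (field name, free_text_enabled)
    index_all_metadata: state of the "Index the document's metadata" option
    """
    terms = set()
    for name, value in metadata.items():
        binding = field_bindings.get(name)
        if binding is not None:
            # Metadata bound to a field: searchable as free text only when
            # the field's Free Text Queries attribute is selected.
            _field_name, searchable = binding
        else:
            # Orphan metadata: added to the body only when the option is selected.
            searchable = index_all_metadata
        if searchable:
            terms.update(value.lower().split())
    return terms


if __name__ == "__main__":
    metadata = {"LastEditedBy": "Hector Smith", "Department": "RH"}
    bindings = {"Department": ("CorpDepartment", True)}

    cleared = searchable_terms(metadata, bindings, index_all_metadata=False)
    selected = searchable_terms(metadata, bindings, index_all_metadata=True)
    print("hector" in cleared, "rh" in cleared)    # option cleared
    print("hector" in selected, "rh" in selected)  # option selected
```

With the option cleared, only the value bound to CorpDepartment is matched. With the option selected, the orphan LastEditedBy value is matched as well.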
-
Generate a cached HTML version of indexed documents
-
When you select this check box (recommended), at indexing time CES creates HTML versions of indexed documents and saves them in the unified index. In the search interfaces, users can then more rapidly review the content by clicking the Quick View link to open the HTML version of the item rather than opening the original document with the original application.
Consider clearing this check box only if you do not want to use Quick View links or if you want to save resources when building the source.
-
Open results with cached version
-
Leave this check box cleared (recommended) so that in the search interfaces, the main search result link opens the original document with the original application. Consider selecting this check box only when you do not want users to be able to open the original document but only see the HTML version of the document as a Quick View. When this option is selected, you must also select the Generate a cached HTML version of indexed documents check box.
-
Click Save to save the source configuration.
-
Because the Amazon S3 security model is not yet supported, the Amazon S3 connector does not index permissions. You must therefore change the default Permissions option to set the permissions globally on the source:
Note: You get the following error message in the CES Console when the Index security permissions option is selected by default:
Permissions indexing is not provided by AmazonS3Crawler.
-
In the navigation panel on the left, select Permissions.
-
In the Permissions page:
-
Select the Specifies the security permissions to index option.
-
In the Allowed Users list, add or remove users or groups to precisely specify who has access to the content from this source.
By default, the Active Directory everyone group specifies that any Active Directory user can see all the content from this source.
-
Optionally, in the Denied Users list, add users or groups to specify who does not have access to the content from this source.
-
Click Apply Changes.
-
On the toolbar, click Start/Rebuild to start indexing your source.
-
Validate that the source building process is executed without errors:
-
In the navigation panel on the left, click Status, and then validate that the indexing proceeds without errors.
OR
-
Open the CES Console to monitor the source building activities (see Using the CES Console).