Index the website using a web crawler

This section covers the configuration of DXP search to build a basic search of a public DXP Content Management site.

Crawling a DXP Content Management website is fundamentally the same as crawling any other website - as far as the web crawler is concerned, it is just crawling a set of HTML pages that are linked together using standard hyperlinks.

Before you start

Ensure you’ve followed the instructions to create a search package that will bundle up your search.

There are a few basic things you should make a note of:

  1. What is the URL of your website? You will need to tell the web crawler where to start crawling your site, and the home page is normally where you will start.

  2. Are there sections of your site that you should skip?

  3. Do you want to index just your HTML pages, or also include PDFs and Microsoft Office files that are part of your website?

The following steps are configured within the DXP search capability.

Create a web data source to index your site

In this step we will create the web data source that will be responsible for crawling and indexing your website.

  1. From the root page of the search administration dashboard, locate and open your search package. On the search package management screen, locate the two tabs for managing your data sources and results pages.

  2. Make sure you have the data sources tab selected and click the Create new data source button.

  3. From the popup, choose the create option. The attach option instead allows you to connect an existing data source to your search package.

  4. The data source creation wizard will load. Enter the following settings as you work through the steps:

    • Data source type: web

    • Data source name: <ENTER-A-NAME>

    • What website do you want to crawl?: Enter your website URL here, ensuring that it includes the desired protocol (for most sites this will be https://).

    • What do you want to exclude from your crawl?: You will see a list of preset values here which you can modify, but the defaults should work for most websites. Any URL that contains one of these exclude patterns anywhere in the full URL will be skipped. If there are parts of your site that you want to skip, add them here, e.g. you might add admin/ to exclude all pages that sit inside admin folders on your website.

      For DXP Content Management sites it is a good idea to add ?a= to the patterns. You might also consider removing the calendar preset, which skips any URL containing the word calendar, if you want your calendar pages indexed. However, if you have any calendars that can be browsed infinitely into the past and future, leave this preset in place (and also exclude the calendar using your robots.txt), then use a sitemap.xml to expose your calendar entries to search engines (see the sketch after these steps). This prevents web crawlers from getting stuck in a crawler trap while attempting to crawl your calendar.

    • Which non-HTML file types do you want to crawl as well?: This lists additional file types that will be included in your index. You can normally leave this as the default value.

  5. Review your settings, then advance to the finish step. The finish step allows you to start an update of your site immediately. Select this option, then click the finish button. The data source management screen will automatically load, and after a few seconds you should see a status message indicating that your search is updating.
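To make the exclusion and calendar advice in step 4 concrete, here is a minimal sketch. The admin/, ?a= and calendar values repeat the patterns discussed above, while the paths and host names (example.com, /events/calendar) are hypothetical placeholders to adapt for your site.

Exclude patterns, as entered in the crawl exclusion list:

    admin/
    ?a=
    calendar

A matching robots.txt, served at https://www.example.com/robots.txt, keeps well-behaved crawlers out of the infinitely browsable calendar views while still advertising your sitemap:

    User-agent: *
    Disallow: /events/calendar

    Sitemap: https://www.example.com/sitemap.xml

The sitemap.xml can then list each calendar entry individually, so the entries themselves remain discoverable even though the calendar views are blocked:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/events/calendar/open-day</loc>
      </url>
    </urlset>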

This completes the basic setup of the search crawler.

It might take some time to create an index of your website. If your website is large it may take many hours to complete.

Once the search update completes the status displayed on the data source management screen will change to a green update complete message.

In order to test the search we will need to use a results page.

Return to the search package management screen and select the results pages tab in the components section.

Select the results page you created when you created your search package.

To test your search enter a keyword into the search box and submit the search. You should see some search results from the index we just built.
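If you prefer to check the index from the command line, and assuming your deployment exposes the standard HTML search endpoint (the host name, endpoint path and parameter names below are assumptions, so adapt them to your environment), a quick test might look like:

    curl "https://<SEARCH-HOST>/s/search.html?collection=<DATA-SOURCE-NAME>&query=test"

If the update completed successfully, the response should be a results page containing matches from your site.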

Ensuring your binary documents have correct titles and metadata

A standard web crawl of your DXP Content Management site will not pick up metadata that is associated with your binary files (like PDFs and Word documents).

This is because the metadata is stored within the CMS database and the search only has access to the content that is stored within the file.

Additional steps need to be followed to ensure that the search indexer has access to this metadata.

Crawling an unpublished DXP Content Management website

If your website hasn’t yet been published (and you need to be a logged-in user to see the site) then you can crawl it by temporarily adding the following settings to your configuration.

Additional configuration can be added to your web data source to crawl your DXP Content Management site as a specified user.

Before you start

If you haven’t already created your data source, follow the steps above to create it.

In addition, you should set up a user account in DXP Content Management for your crawler. The account should have read-only access to the website. Note down the username and password of this user as you will need them to configure the web crawler.

Tutorial: Crawl an unpublished DXP Content Management site

From the data source management screen for your website data source:

  1. Select edit data source configuration from the settings panel.

  2. Use the add new button to add the following configuration settings. The group ID needs to be a unique integer that is used to group the four related settings together (see the sketch after these steps for how the settings fit together).

    Parameter key                                      | Group ID | URL parameter key | Value
    crawler.form_interaction.pre_crawl.*.url           | 1        |                   | https://<DXP-CMS-SERVER>/home
    crawler.form_interaction.pre_crawl.*.form_number   | 1        |                   | 1
    crawler.form_interaction.pre_crawl.*.cleartext.*   | 1        | sq_username       | <DXP-CMS-USER-NAME>
    crawler.form_interaction.pre_crawl.*.encrypted.*   | 1        | password          | <PASSWORD>

  3. Click the save all button to save your configuration.

  4. Return to the data source management screen and start a new update by selecting update this data source from the update panel.
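As a sanity check, the settings saved in step 2 would resemble the following sketch, assuming the * placeholders in the parameter keys are resolved from the group ID and URL parameter key columns (this is an illustration of intent rather than the exact stored format):

    crawler.form_interaction.pre_crawl.1.url=https://<DXP-CMS-SERVER>/home
    crawler.form_interaction.pre_crawl.1.form_number=1
    crawler.form_interaction.pre_crawl.1.cleartext.sq_username=<DXP-CMS-USER-NAME>
    crawler.form_interaction.pre_crawl.1.encrypted.password=<PASSWORD>

Read together, the intent is that the crawler loads https://<DXP-CMS-SERVER>/home before the crawl starts, fills in the first form on that page with the supplied sq_username and password values, and stores the password encrypted rather than in clear text.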

When crawling with authentication, keep in mind that you will index whatever the crawl user can see. This includes any in-page personalization, any non-live or safe edit content that the user currently has access to, and binary documents with DXP Content Management-managed URLs.