Search results are missing a particular URL

This article shows how to investigate the scenario where a specific URL that is expected to be in the search results cannot be found.

Details

There are many reasons why a URL might be missing from search results, including:

  • If the URL is rejected as it fails to match any include patterns.

  • If the URL is rejected as it matches an exclude pattern.

  • If the URL is rejected due to a match against robots.txt or <meta> robots directives, or because it was linked with a rel="nofollow" attribute.

  • If the URL is rejected due to a match against file type/mime type rules.

  • If the URL is rejected due to exceeding the configured maximum filesize.

  • If the URL is killed (in a kill_exact.cfg, kill_partial.cfg or query-kill.cfg) or detected as a duplicate of another page.

  • If a crawler trap rule is triggered (because there are too many files within the same folder or too many repeated folders in the directory structure).

  • If a canonical URL is detected (the page will be indexed under its canonical URL instead).

  • If a URL contains a # character (everything from the # onwards is discarded as this signifies an in-page anchor).

  • If the URL redirects to another URL (and the redirect target is itself rejected).

  • The URL may have timed out or returned an error when it was accessed.

  • If an error occurred while filtering the document.

  • If an error occurred on the web server and a misconfiguration of the server means that the error message is returned with an HTTP 200 status code. This can result in the URL being identified as a duplicate.

  • The update may have timed out before attempting to fetch the URL.

  • The URL may be an orphan (unlinked) page.

  • If the SimpleRevisitPolicy is enabled then the crawler may not have attempted to fetch the URL if it was linked from a page that rarely changes.

  • The URL may not match scoping configured for the results page, or applied to the query being run.

  • The licence may not be sufficient to index all of the documents.

  • You have a curator rule defined that removes the URL for certain queries.

Tutorial: Investigate a missing page

Step 1: access the URL directly

Before you look into the Funnelback configuration, start by accessing the URL directly.

For example, for a missing web page, open it in your web browser with the developer tools running so that you can inspect the HTTP headers returned by the web server (a command-line alternative is sketched after the list below).

  • Check the response headers to see if any redirects are returned. Look out for 3xx HTTP status codes as well as Location headers. If the web server is telling Funnelback to redirect, the URL may not be fetched because the redirect target matches an exclude rule. If you find a redirect, try searching the index for the redirect target URL.

  • View the source code for the HTML and check to see if a canonical URL is defined in page metadata. Look out for a <link> tag with rel="canonical" set as a property. The value of this tag will be used as the URL for the document when Funnelback indexes it regardless of what URL was used to access the page. If you find a canonical link tag then try searching the index for the canonical URL.

  • View the source code for the HTML and check to see if it has any robots meta tags. Look out for a <meta name="robots"> tag. If its value includes noindex then the page will be excluded from the search index.

  • Check the site's robots.txt to see if the URL of the page matches any of the Disallow rules.
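
As an alternative to the browser, the following minimal sketch uses curl to perform the same checks; the URL shown is a placeholder, so substitute the missing page's URL:

# show the redirect chain and response headers for the page
$ curl -sIL 'https://example.com/example/file.html'

# look for canonical link and robots meta tags in the page source
$ curl -s 'https://example.com/example/file.html' | grep -iE 'rel="canonical"|name="robots"'

# review the site's robots.txt for Disallow rules that match the URL
$ curl -s 'https://example.com/robots.txt'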

Step 2: check to see if the URL exists in the search index

Try the /collection-info/v1/collections/<DATA-SOURCE-ID>/url API call, which reports on whether the URL is in the index.
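
As a sketch, the call can be made with curl; this assumes the target URL is passed via a url query parameter and that you authenticate with an admin account, so check the API UI for the exact signature in your version:

$ curl -u '<ADMIN-USER>' 'https://<ADMIN-HOST>/collection-info/v1/collections/<DATA-SOURCE-ID>/url?url=https%3A%2F%2Fexample.com%2Fexample%2Ffile.html'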

If you’ve confirmed the URL is in the index, first go back to the search where you can’t find the page and access the JSON results (edit your URL, changing search.html to search.json). Once the JSON has loaded, search the response for the URL you can’t find. If you find it inside the additionalParameters.remove_urls element, this means a curator rule has been triggered that is configured to remove the URL.
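
A quick way to run this check from the command line is sketched below; it assumes jq is installed, that the standard search.json endpoint is used, and that the placeholders are replaced with your own values. The recursive jq filter avoids having to know exactly where remove_urls sits in the JSON:

$ curl -s 'https://<SEARCH-HOST>/s/search.json?collection=<RESULTS-PAGE-ID>&query=<QUERY>' | jq '.. | .remove_urls? // empty'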

If it’s not a curator rule and the URL is in the index, but you can’t see it in the search results, first try searching for the URL using the u and v metadata classes - it could be that the item ranks very poorly.

The u metadata class holds the hostname component of the URL and the v metadata class holds the path component of the URL. To search for http://example.com/example/file.html you could run the query: u:example.com v:example/file.html.
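
The same query can also be run directly against the JSON endpoint. This is a sketch with placeholder host and results page values, and the exact JSON path may differ slightly between versions:

$ curl -s 'https://<SEARCH-HOST>/s/search.json?collection=<RESULTS-PAGE-ID>&query=u%3Aexample.com+v%3Aexample%2Ffile.html' | jq '.response.resultPacket.results[].liveUrl'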

If you find it when searching for the URL directly then you might need to figure out why it’s ranking badly. SEO auditor can assist you with this investigation.

If you can’t find the URL using the u and v metadata classes, check that the results page you are searching isn’t scoped (e.g. via gscopes) and that hook scripts are not modifying your query.

Step 3: check the index and gather logs

If the URL doesn’t show up as being in the index then the next thing to check is whether it appears in the Step-Index.log for the data source that gathered it. It’s possible that the indexer detected it as a DUPLICATE or BINARY and removed it from the index, or that a canonical URL means it was indexed with a different URL. If the URL was 'excluded due to pattern', check that the URL doesn’t contain the Funnelback installation path. There is a default setting that will exclude any URLs containing the install path. This can be disabled using the -check_url_exclusion=false indexer option.
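
A minimal sketch of disabling that default exclusion is shown below; it assumes extra indexer options are supplied via the indexer_options setting in the data source configuration, and any existing options should be kept alongside it:

# data source configuration (append to any existing indexer_options value)
indexer_options=-check_url_exclusion=false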

If it’s missing from the Step-Index.log then check the gather logs. For a web data source start with the url_errors.log then look at the crawl.log.X files. You may need to grep the log files for the URL using a command similar to:

$ grep '<URL>' $SEARCH_HOME/data/<DATA-SOURCE-ID>/<VIEW>/*.log

You may see some log messages similar to the following in the crawl.log.X files:

Rejected: <URL>

This usually means that the URL was rejected due to a match against robots.txt.

Unacceptable: <URL>

This usually means that the URL was rejected due to a match against an exclude pattern or not matching any of the include patterns.

Unwanted type: [type] <URL>

This means that the URL was rejected due to a match against an unwanted mime type.

E <URL> [Exceeds max_download_size: <SIZE>] <TIMESTAMP>

This means that the document was skipped because it was larger than the configured maximum file size (default is 10MB). The crawler.max_download_size setting can be used to increase this.
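
A sketch of raising the limit in the data source configuration is shown below; it assumes the value is expressed in megabytes, in line with the 10MB default mentioned above:

# data source configuration (value assumed to be in MB)
crawler.max_download_size=50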

The url_errors.log file reports on specific errors that occurred (such as HTTP errors) when accessing URLs. The log lines follow a format similar to the line below and are fairly self-explanatory:

E <URL> [ERROR TYPE]

For web data sources, check the filter log (crawler.central.log) for any errors relating to the URL; for a non-web data source, filtering errors are usually recorded in the gather log. If an error occurs while filtering a URL it may result in the URL not being included in the index.

Check to see if the URL was killed: check whether there is a kill_exact.cfg or kill_partial.cfg for the data source and, if so, whether any of the patterns listed match the missing URL. Also check for a query-kill.cfg - if this is defined, you can look at the Step-KillDocumentsByQuery._kill-url-list.log log file to see if the missing URL is listed there. If any of these match the missing URL then it has been killed from the index.
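
The sketch below shows one way to check the kill configuration from the command line. It assumes the configuration files live under $SEARCH_HOME/conf/<DATA-SOURCE-ID>/; note that kill_partial.cfg entries match URL prefixes, so a grep for the full URL may miss partial matches:

# exact-match kill patterns
$ grep -iF '<URL>' $SEARCH_HOME/conf/<DATA-SOURCE-ID>/kill_exact.cfg

# prefix-match kill patterns - review by eye, as a grep for the full URL may miss them
$ cat $SEARCH_HOME/conf/<DATA-SOURCE-ID>/kill_partial.cfg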

Step 4: access the URL directly using the DEBUG API

Funnelback provides an API call that can be used to assist in debugging HTTP requests made by the crawler. This tool is particularly useful for debugging form-based authentication, but it is also very useful for debugging other missing URLs.