Search results are missing a particular URL

Background

This article shows how to investigate the scenario where a specific URL that is expected to be in the search results cannot be found.

Details

There are many reasons why a URL might be missing from search results, including:

  • If the URL is rejected as it fails to match any include patterns.

  • If the URL is rejected as it matches an exclude pattern.

  • If the URL is rejected due to a match against robots.txt or <meta> robots directives, or because it was linked with a rel="nofollow" attribute.

  • If the URL is rejected due to a match against file type/MIME type rules.

  • If the URL is rejected due to exceeding the configured maximum filesize.

  • If the URL is killed (in a kill_exact.cfg or kill_partial.cfg) or detected as a duplicate of another page.

  • If a crawler trap rule is triggered (because there are too many files within the same folder or too many repeated folders in the directory structure).

  • If a canonical URL is detected (the page will be indexed under the canonical URL rather than the URL you are searching for).

  • If a URL contains a # character (everything from the # onwards is discarded as this signifies an in-page anchor).

  • If the URL redirects to another URL (and the redirected URL is itself rejected).

  • The URL may have timed out or returned an error when it was accessed.

  • If an error occurred while filtering the document.

  • If an error occurred on the web server and a misconfiguration of the server means that the error message is returned with an HTTP 200 status code. This can result in the URL being identified as a duplicate.

  • The update may have timed out before attempting to fetch the URL.

  • The URL may be an orphan (unlinked) page.

  • If the SimpleRevisitPolicy is enabled then the crawler may not have attempted to fetch the URL if it was linked from a page that rarely changes.

  • The licence may not be sufficient to index all of the documents.

Tutorial: Investigate a missing page

Step 1: access the URL directly

Before you look into the Funnelback configuration, start by accessing the URL directly.

For example, for a missing web page, open it in your web browser. When opening the URL, have the browser developer tools running so you can inspect the HTTP headers returned by the web server.

  • Check the response headers to see if any redirects are returned. Look out for 3xx HTTP status codes as well as location headers. If the web server is telling Funnelback to redirect, it’s possible that the URL won’t be fetched if the redirected URL matches an exclude rule. If you find a redirect, try searching the index for the redirected URL (see the command-line sketch after this list).

  • View the source code for the HTML and check to see if a canonical URL is defined in page metadata. Look out for a <link> tag with rel="canonical" set as a property. The value of this tag will be used as the URL for the document when Funnelback indexes it regardless of what URL was used to access the page. If you find a canonical link tag then try searching the index for the canonical URL.

  • View the source code for the HTML and check to see if it has any robots meta tags. Look out for a <meta name="robots"> tag. If the value includes noindex then the page will be excluded from the search index.

  • Check the robots.txt for the site to see if the URL of the page matches any of the disallowed pages.
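If you prefer the command line, the same checks can be made with curl. This is a minimal sketch; <URL> and <SITE> are placeholders for the page you are investigating and its site root.

$ curl -sI '<URL>'
$ curl -s '<URL>' | grep -iE 'rel="canonical"|name="robots"'
$ curl -s '<SITE>/robots.txt'

The first command shows the HTTP status code and any Location header, the second looks for canonical <link> and robots <meta> tags in the page source, and the third retrieves the robots.txt rules for the site.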

Step 2: check to see if the URL exists in the search index

Try the /collection-info/v1/collections/<COLLECTION-ID>/url API call, which should report on whether the URL is in the index.
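For example, the call can be made with curl. This is a sketch only - the exact hostname, authentication method and query parameter name may differ in your environment, so check the API documentation on your server before relying on it:

$ curl -u <ADMIN-USER> 'https://<ADMIN-HOST>/collection-info/v1/collections/<COLLECTION-ID>/url?url=<URL>'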

If the URL is in the index but you can’t see it in the search results, first try searching for the URL using the u and v metadata classes - it could be that the item ranks very poorly.

The u metadata class holds the hostname component of the URL and the v metadata class holds the path component of the URL. To search for http://example.com/example/file.html you could run the query: u:example.com v:example/file.html.
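For example, the query can be run against the search endpoint with curl (assuming the standard /s/search.html endpoint; remember to URL-encode the query terms):

$ curl 'https://<SEARCH-HOST>/s/search.html?collection=<COLLECTION-ID>&query=u%3Aexample.com+v%3Aexample%2Ffile.html'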

If you find it when searching for the URL directly then you might need to figure out why it’s ranking poorly. SEO Auditor can assist you with this investigation.

If you can’t find the URL using the u and v metadata classes, check that the collection/profile you are searching isn’t scoped (e.g. via gscopes) and that hook scripts are not modifying your query.
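As a starting point, you can look for these from the command line. The paths and file names below are assumptions based on a typical installation (hook scripts in the collection’s conf directory, gscope-based scoping in the collection or profile configuration), so adjust them to your setup:

$ ls $SEARCH_HOME/conf/<COLLECTION-ID>/hook_*.groovy
$ grep -ri 'gscope' $SEARCH_HOME/conf/<COLLECTION-ID>/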

Step 3: check the index and gather logs

If it doesn’t show up as being in the index then the next thing to check is whether the URL appears in the Step-Index.log for the collection that gathered the URL. It’s possible that the indexer detected it as a DUPLICATE or BINARY and removed it from the index, or that a canonical URL means it was indexed under a different URL. If the URL was 'excluded due to pattern', check that the URL doesn’t contain the Funnelback installation path. There is a default setting that excludes any URLs containing the install path; this can be disabled using the -check_url_exclusion=false indexer option.
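To confirm what the indexer did with the URL you can grep the Step-Index.log directly. The log path below is an assumption based on a typical installation, so adjust it to where the log lives on your server:

$ grep '<URL>' $SEARCH_HOME/data/<COLLECTION-ID>/<VIEW>/log/Step-Index.log

If you need to disable the install-path exclusion, the option can be added via the indexer options in collection.cfg (assuming the indexer_options key is used in your version; append it to any existing options rather than replacing them):

indexer_options=-check_url_exclusion=false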

If it’s missing from the Step-Index.log then check the gather logs. For a web collection, start with the url_errors.log and then look at the crawl.log.X files. You may need to grep the log files for the URL using a command similar to:

$ grep '<URL>' $SEARCH_HOME/data/<COLLECTION-ID>/<VIEW>/*.log

You may see some log messages similar to the following in the crawl.log.X files:

Rejected: <URL>

This usually means that the URL was rejected due to a match against robots.txt.

Unacceptable: <URL>

This usually means that the URL was rejected due to a match against an exclude pattern or not matching any of the include patterns.

Unwanted type: [type] <URL>

This means that the URL was rejected due to a match against an unwanted mime type.

E <URL> [Exceeds max_download_size: <SIZE>] <TIMESTAMP>

This means that the document was skipped because it was larger than the configured maximum file size (the default is 10MB). The crawler.max_download_size setting can be used to increase this.
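If you need to raise the limit, the setting goes in collection.cfg. This is a sketch assuming the value is expressed in megabytes - check the setting’s documentation for the exact units in your version:

crawler.max_download_size=50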

The url_errors.log file reports on specific errors that occurred (such as HTTP errors) when accessing URLs. The log lines follow a format similar to the line below and are fairly self-explanatory:

E <URL> [ERROR TYPE]

Check the filter log for any errors for the URL (this is either the crawler.central.log or crawler.inline-filter.log for web collections, or usually the gather log for a non-web collection). If an error occurs while filtering a URL, it may result in the URL not being included in the index.

Step 4: access the URL directly using the DEBUG API

Funnelback provides an API call that can be used to assist in debugging HTTP requests made by the crawler. This tool is particularly useful for debugging form-based authentication but it is also very useful for debugging other missing URLs.