Search results are missing a particular URL
This article shows how to investigate the scenario where a specific URL that is expected to be in the search results cannot be found.
There are many reasons why a URL might be missing from search results including:
If the URL is rejected as it fails to match any include patterns.
If the URL is rejected as it matches an exclude pattern.
If the URL is rejected due to a match against <meta> robots directives or because it was linked with a rel="nofollow" attribute.
If the URL is rejected due to a match against file type/MIME type rules.
If the URL is rejected due to exceeding the configured maximum file size.
If the URL is killed (in a kill_partial.cfg) or detected as a duplicate of another page.
If a crawler trap rule is triggered (because there are too many files within the same folder or too many repeated folders in the directory structure).
If a canonical URL is detected.
If a URL contains a # character (everything from the # onwards is discarded as this signifies an in-page anchor).
If the URL redirects to another URL (and the redirect target is rejected).
The URL may have timed out or returned an error when it was accessed.
If an error occurred while filtering the document.
If an error occurred on the web server and a misconfiguration of the server means that the error message is returned with an HTTP 200 status code. This can result in the URL being identified as a duplicate.
The update may have timed out before attempting to fetch the URL.
The URL may be an orphan (unlinked) page.
If SimpleRevisitPolicy is enabled then the crawler may not have attempted to fetch the URL if it was linked from a page that rarely changes.
The URL doesn’t match any configured scoping applied to the results page, or for the query being run.
The licence may not be sufficient to index all of the documents.
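To illustrate the # rule in the list above, here is a minimal Python sketch that drops the in-page anchor from a URL before you search the index or logs for it. It uses only the standard library. The function name is hypothetical and this is not Funnelback's own normalisation code.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL with any in-page anchor (#...) removed."""
    return urldefrag(url).url


print(strip_fragment("https://example.com/guide.html#install"))
```

Searching for the stripped form avoids looking for a URL that could never have been stored with its anchor.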
Tutorial: Investigate a missing page
Step 1: access the URL directly
Before you look into the Funnelback configuration, start by accessing the URL directly.
For example, for a missing web page, open it in your web browser. When opening the URL, have the browser developer tools running so you can inspect the HTTP headers returned by the web server.
Check the response headers to see if any redirects are returned. Look out for 3xx HTTP status codes as well as location headers. If the web server is telling Funnelback to redirect, it’s possible that the URL won’t be fetched if the redirected URL matches an exclude rule. If you find a redirect, try searching the index for the redirected URL.
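The redirect check described above can be sketched in Python. This is a minimal illustration that takes a status code and headers you have already copied from the browser developer tools. The function name and the sample values are hypothetical.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin


def redirect_target(
    request_url: str, status: int, headers: Mapping[str, str]
) -> Optional[str]:
    """Return the absolute redirect target for a 3xx response, else None."""
    if not 300 <= status < 400:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location")
    if location is None:
        return None
    # The location value may be relative; resolve it against the requested URL.
    return urljoin(request_url, location)


print(redirect_target("https://example.com/old", 301, {"Location": "/new"}))
```

The value returned is the URL to search the index for in place of the original.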
View the source code for the HTML and check to see if a canonical URL is defined in page metadata. Look out for a <link> tag with rel="canonical" set as a property. The value of this tag will be used as the URL for the document when Funnelback indexes it, regardless of what URL was used to access the page. If you find a canonical link tag then try searching the index for the canonical URL.
View the source code for the HTML and check to see if it has any robots meta tags. Look out for a <meta name="robots"> tag. If the value is set to anything including noindex then the page will be excluded from the search index.
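The two HTML checks above (canonical link and robots meta tag) can be done by eye, but a small script helps when checking many pages. The following is a minimal sketch using Python's standard html.parser. The class and function names are hypothetical, the sample HTML is made up, and the parsing is a simplification rather than a reproduction of how Funnelback reads the page.

```python
from html.parser import HTMLParser


class PageDirectives(HTMLParser):
    """Collect the canonical URL and robots meta directives from HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = []

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            self.canonical = attributes.get("href")
        elif tag == "meta" and attributes.get("name", "").lower() == "robots":
            content = attributes.get("content", "")
            # Directives are comma separated, for example "noindex, follow".
            self.robots += [
                part.strip().lower() for part in content.split(",") if part.strip()
            ]


def inspect_html(html: str) -> PageDirectives:
    parser = PageDirectives()
    parser.feed(html)
    return parser


sample = (
    "<html><head>"
    '<link rel="canonical" href="https://example.com/page">'
    '<meta name="robots" content="noindex, follow">'
    "</head></html>"
)
result = inspect_html(sample)
print(result.canonical, "noindex" in result.robots)
```

If canonical is set, search the index for that URL. If robots contains noindex, the page is expected to be absent.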
Check the robots.txt for the site to see if the URL of the page matches any of the disallowed pages.
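A robots.txt check can also be scripted. The sketch below uses Python's standard urllib.robotparser on robots.txt text you have already downloaded. The helper name, the sample rules and the "*" user agent are assumptions; substitute the user agent your crawler actually sends, and note that Python's parser may not interpret every rule exactly as the Funnelback crawler does.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only, not taken from a real site.
rules = """User-agent: *
Disallow: /private/
"""
print(allowed_by_robots(rules, "https://example.com/private/report.html"))
print(allowed_by_robots(rules, "https://example.com/public/index.html"))
```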
Step 2: check to see if the URL exists in the search index
Use the /collection-info/v1/collections/<DATA-SOURCE-ID>/url API call, which should report whether the URL is in the index.
If the URL is in the index but you can’t see it in the search results, first try searching for the URL using the v metadata classes. It could be that the item ranks really badly.
If you find it when searching for the URL directly then you might need to figure out why it’s ranking badly. SEO auditor can assist you with this investigation.
If you can’t find the URL using the v metadata classes, check that the results page you are searching isn’t scoped (e.g. via gscopes) or that hook scripts are not modifying your query.
Step 3: check the index and gather logs
If it doesn’t show up as being in the index then the next thing to check is whether the URL appears in the Step-Index.log for the data source that gathered the URL. It’s possible that the indexer detected it as BINARY and removed it from the index, or that a canonical URL means that it was indexed with a different URL. If the URL was 'excluded due to pattern', check that the URL doesn’t contain the Funnelback installation path. There is a default setting that will exclude any URLs containing the install path. This can be disabled using the -check_url_exclusion=false indexer option.
If it’s missing from the Step-Index.log then check the gather logs. For a web data source start with the url_errors.log, then look at the crawl.log.X files. You may need to grep the log files for the URL using a command similar to:
$ grep '<URL>' $SEARCH_HOME/data/<DATA-SOURCE-ID>/<VIEW>/*.log
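If grep is not available, the same search can be sketched in Python. This is a minimal illustration using only the standard library. The function name is hypothetical, and the *.log pattern simply mirrors the command above, so widen it if your log file names carry a numeric suffix.

```python
from pathlib import Path


def find_url_in_logs(log_dir: str, url: str):
    """Yield (file name, line number, line) for each log line containing url."""
    for path in sorted(Path(log_dir).glob("*.log")):
        # Replace undecodable bytes rather than failing on an odd log line.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if url in line:
                    yield path.name, number, line.rstrip("\n")
```

Call it with the log directory for the data source and the URL, for example `list(find_url_in_logs(log_dir, url))`, where both values are yours to supply.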
You may see some log messages similar to the following in these log files:
This usually means that the URL was rejected due to a match against
This usually means that the URL was rejected due to a match against an exclude pattern or not matching any of the include patterns.
Unwanted type: [type] <URL>
This means that the URL was rejected due to a match against an unwanted MIME type.
E <URL> [Exceeds max_download_size: <SIZE>] <TIMESTAMP>
This means that the document was skipped because it was larger than the configured maximum file size (default is 10MB). The crawler.max_download_size setting can be used to increase this.
The url_errors.log file reports specific errors that occurred (such as HTTP errors) when accessing URLs. The log lines follow a format similar to the line below and are fairly self-explanatory:
E <URL> [ERROR TYPE]
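When a log contains many entries, it can help to sort the lines by rejection reason. The sketch below is a minimal Python illustration that recognises only the line formats shown above ("Unwanted type: [type] <URL>" and "E <URL> [...]"). The function name and reason labels are hypothetical, and real log lines may carry extra fields that this simple matching does not handle.

```python
import re

# Patterns follow the line formats quoted in this article.
UNWANTED = re.compile(r"^Unwanted type: \[(?P<detail>[^\]]*)\] (?P<url>\S+)")
ERROR = re.compile(r"^E (?P<url>\S+) \[(?P<detail>[^\]]*)\]")


def classify(line: str):
    """Return (reason, url, detail) for a recognised log line, else None."""
    match = UNWANTED.match(line)
    if match:
        return "unwanted type", match["url"], match["detail"]
    match = ERROR.match(line)
    if match:
        detail = match["detail"]
        too_large = detail.startswith("Exceeds max_download_size")
        return ("too large" if too_large else "error"), match["url"], detail
    return None
```

Lines that return None fall outside these formats and need to be read manually.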
Check the filter log (crawler.central.log for web data sources, or usually the gather log for a non-web data source) for any errors for the URL. If an error occurs while filtering a URL it may result in the URL not being included in the index.
Step 4: access the URL directly using the DEBUG API
Funnelback provides an API call that can be used to assist in debugging HTTP requests made by the crawler. This tool is particularly useful for debugging form-based authentication, but it is also a very useful tool for debugging other missing URLs.