Debugging filtering

This article provides advice for debugging filtering issues. It applies to built-in filters, filters provided by enabling a specific plugin, and legacy custom filters.

If filtering does not appear to be working as expected, there are a number of things to check:

  • Start by checking the filter logs. The log to check varies slightly between data source types, but it is typically the gather.log for web data sources and the gatherer log file for other data source types. Inspect the file for errors and search the log for the specific URL.

    An error in the log probably means that there is an error in the filter code. The solution is to fix the code, and the fix depends on the specific error. Note: it’s not uncommon to see errors in the filter logs from the accessibility auditor and from Tika conversion (converting binary files to text).

  • Use the filter debug API to see how a filter chain modifies a specific document. This API call (POST /filter/v1/debug/collections/{collection}/doc) is found in the debug section of the administration API. This API filters a specific URL using a specified data source configuration and enables you to look at the filtered text that is generated.

  • Other causes of filtering issues could be:

    • The filter does not run on the specific URL because the document fails the filter's programmed test (e.g. it has the wrong MIME type).

    • The filter is missing from the filter chain.

    • The filter is included in the filter chain, but the filters are not specified in the correct order. For example, a filter that analyses document text won’t work if it runs before the document is converted from binary to text.
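
The last two causes above (a filter missing from the chain, or placed too early in it) can be checked mechanically once the chain is written out as an ordered list. The following is a minimal sketch only: the function name check_chain and the filter names TextAnalyser and BinaryToText are hypothetical placeholders, not real filter names, and the sketch assumes the chain can be represented as a plain list of names in execution order.

```python
def check_chain(chain, filter_name, must_follow=None):
    """Return a list of problems found for filter_name in an ordered filter chain.

    chain       -- filter names in the order they run (hypothetical representation)
    filter_name -- the filter being debugged
    must_follow -- optional filter that has to run earlier (e.g. a binary-to-text converter)
    """
    problems = []
    if filter_name not in chain:
        problems.append(f"{filter_name} is missing from the filter chain")
        return problems
    if must_follow is not None:
        if must_follow not in chain:
            problems.append(f"{must_follow} is missing from the filter chain")
        elif chain.index(filter_name) < chain.index(must_follow):
            problems.append(f"{filter_name} runs before {must_follow}")
    return problems


# Placeholder names: the text analyser is listed before the converter it depends on.
chain = ["TextAnalyser", "BinaryToText"]
print(check_chain(chain, "TextAnalyser", must_follow="BinaryToText"))
```

An empty list means that neither of the two ordering problems was found. It does not cover the first cause, where the filter runs but skips the document because of its own test.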

If a filter error occurs, the document may be skipped and will then be missing from the index.
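
When a document is missing from the index, the log check described in the first step can be partly automated. The sketch below is illustrative and rests on assumptions: the log is treated as plain text, a line is treated as an error if it contains "error" or "exception" in any letter case, and the function names, sample lines and URLs are made up for the example. Real log lines may look different.

```python
import re

# Assumption: error lines contain "error" or "exception" somewhere (case-insensitive).
ERROR_PATTERN = re.compile(r"error|exception", re.IGNORECASE)


def scan_log_lines(lines, url):
    """Return (line_number, text) pairs for lines that mention url or look like errors."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if url in text or ERROR_PATTERN.search(text):
            hits.append((number, text))
    return hits


def scan_log_file(path, url):
    """Apply scan_log_lines to a log file on disk (path is supplied by the caller)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log_lines(handle, url)


# Made-up sample lines, used only to show the shape of the result.
sample = [
    "Fetched https://example.com/a.pdf",
    "java.lang.RuntimeException: boom",
    "Stored https://example.com/b.html",
]
for number, text in scan_log_lines(sample, "https://example.com/a.pdf"):
    print(number, text)
```

Lines that mention the URL show whether the document reached the filter at all, and error lines close to them are the most likely explanation for a skipped document.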