Debugging filtering

This article provides advice for debugging issues associated with filtering. It applies to built-in filters, filters provided by enabling a specific plugin, and legacy custom filters.

There are a number of things to check if filtering does not appear to be working as expected:

  • Start by checking the filter logs. The relevant log varies from collection to collection, but it is typically the gather.log for web collections and the gather’s log for other collection types. Inspect the file for errors and search the log for the specific URL.

    If there is an error in the log, it probably means that there is an error in the filter code. The solution is to fix the code, and the fix depends on the specific error. Note: it is not uncommon to see errors in the filter logs from the accessibility auditor and from Tika conversion (converting binary files to text).

  • Other causes of filtering issues could be:

    • The filter does not run on the specific URL because the document fails the filter’s programmed test (e.g. wrong MIME type).

    • The filter is missing from the filter chain.

    • The filter is included in the filter chain, but the filters are not specified in the correct order. For example, a filter that analyses document text won’t work if it runs before the document is converted from binary to text.
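The last two causes (a missing filter, or filters in the wrong order) can be checked mechanically. The following is a minimal Python sketch, not part of the product: the filter names `BinaryToTextFilter` and `TextAnalysisFilter` are hypothetical placeholders, and the ordering rules are assumptions you would replace with the real dependencies of your own filter chain.

```python
def check_filter_order(chain, must_precede):
    """Report missing filters and ordering problems in a filter chain.

    chain        -- filter names in the order they run
    must_precede -- (earlier, later) pairs: 'earlier' must run before 'later'
    """
    problems = []
    position = {name: index for index, name in enumerate(chain)}
    for earlier, later in must_precede:
        # A filter that is not in the chain at all can never run.
        for name in (earlier, later):
            if name not in position:
                problems.append(f"{name} is missing from the filter chain")
        # Both present: check that the dependency runs first.
        if earlier in position and later in position:
            if position[earlier] > position[later]:
                problems.append(f"{later} runs before {earlier}")
    return problems


# Hypothetical filter names, used only for illustration.
rules = [("BinaryToTextFilter", "TextAnalysisFilter")]
chain = ["TextAnalysisFilter", "BinaryToTextFilter"]

for problem in check_filter_order(chain, rules):
    print(problem)
```

An empty result means that every listed filter is present and runs after the filters it depends on. It does not prove that each filter's own test (such as the MIME type check) passes for a given URL.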

If a filter error occurs, the document may be skipped and will then be missing from the index.
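To find out whether a particular URL hit a filter error, the log check described at the start of this article can be scripted. The Python sketch below rests on several assumptions: the log is a plain-text file, error lines contain the word "error" or "exception", and the URL appears verbatim on the relevant lines. The function names and sample log lines are illustrative only and do not reflect the real log format.

```python
import re

# Assumption: error lines mention "error" or "exception" (case-insensitive).
ERROR_PATTERN = re.compile(r"error|exception", re.IGNORECASE)


def scan_log_lines(lines, url):
    """Return (url_hits, error_hits) as lists of (line_number, text)."""
    url_hits = []
    error_hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if url in text:
            url_hits.append((number, text))
        if ERROR_PATTERN.search(text):
            error_hits.append((number, text))
    return url_hits, error_hits


def scan_log_file(path, url):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log_lines(handle, url)


# Illustrative log lines only; the real log format will differ.
sample = [
    "Fetching https://example.com/a.pdf\n",
    "ERROR filtering https://example.com/a.pdf: java.lang.NullPointerException\n",
    "Stored https://example.com/b.html\n",
]
url_hits, error_hits = scan_log_lines(sample, "https://example.com/a.pdf")
for number, text in error_hits:
    print(f"line {number}: {text}")
```

If the URL never appears in the log, the filter may not have run on it at all, which points back to the filter's own test or to the filter chain configuration.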