crawler.non_html

Background

This option is a comma-separated list of file extensions for non-HTML files that the crawler should download, typically binary file types such as .pdf and .doc. These files are not parsed, so the crawler does not attempt to extract hyperlinks from them.

If crawler.inline_filtering_enabled is set to "true", these files will be filtered. To exclude a specific file type from filtering, add its MIME type to the filter.ignore.mimeTypes setting.

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.non_html key and set its value. The value can be any valid List<String>, written as a comma-separated list of file extensions without the leading dot.

Default value

crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm

Examples

Download PDF files only, and no other non-HTML file types.

crawler.non_html=pdf
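
Download PDF and Word documents, but exclude PDF files from filtering. This is a sketch rather than a tested configuration: it assumes inline filtering is enabled and that filter.ignore.mimeTypes accepts the standard PDF MIME type application/pdf in the form shown. Check the documentation for both keys before relying on it.

crawler.non_html=pdf,doc,docx
crawler.inline_filtering_enabled=true
filter.ignore.mimeTypes=application/pdf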