textify.cfg
textify.cfg defines which non-Java-based filters are used to extract plain text from binary documents.
This configuration can only be modified by a system administrator.
Configuring external filter programs
External filtering is not recommended for performance reasons, because a new process is started for every document that passes through an external filter. Where possible, use Tika to filter binary documents to text.
External filters are external programs that convert binary documents into plain text. textify.cfg contains lines specifying filters in the following form:
textify.cfg
extension=command
where:
- extension is the file extension of the target file, for example .doc or .pdf.
- command is the external command to run.
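As a rough illustration of the extension=command format, the following Python sketch reads such lines into a dictionary. The function name parse_textify_lines is hypothetical, and the handling of blank lines and # comments is an assumption, since this page does not say how textify.cfg treats them.

```python
def parse_textify_lines(lines):
    """Parse 'extension=command' lines into a dict of extension -> command."""
    filters = {}
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' comments are skipped. The
        # documentation does not describe how they are actually handled.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so the command may itself contain '='.
        extension, sep, command = line.partition("=")
        if not sep:
            continue  # not an extension=command line
        filters[extension.strip()] = command.strip()
    return filters


sample = [
    ".pdf=executable{perl} $SEARCH_HOME/bin/filter/pdf2html.pl TEXTIFY_INPUT",
]
filters = parse_textify_lines(sample)
print(filters[".pdf"])
```

Splitting on the first = keeps any later = characters inside the command intact.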
Command parameters
In the value for the filter’s command, the following tokens will be replaced by the information on the current file being filtered:
Token | Replaced by… |
---|---|
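TEXTIFY_INPUT | The path to the binary file (input). |
TEXTIFY_OUTPUT | The path to the plain text file (output). |
executable{name} | The executable program taken from executables.cfg, for example executable{perl}. |
$SEARCH_HOME | The environment variable SEARCH_HOME. |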
If TEXTIFY_INPUT or TEXTIFY_OUTPUT is not specified in the command, the corresponding file is used as the command's standard input or standard output respectively.
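The fallback to standard input and output can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: plan_io and the /tmp paths are hypothetical, only the two file tokens are handled, and the substitution is naive (no quoting of paths that contain spaces).

```python
def plan_io(command, input_path, output_path):
    """Substitute the file tokens and report which streams need redirecting."""
    # A token that is absent from the command means the matching file is
    # supplied through standard input or standard output instead.
    use_stdin = "TEXTIFY_INPUT" not in command
    use_stdout = "TEXTIFY_OUTPUT" not in command
    expanded = (
        command
        .replace("TEXTIFY_INPUT", input_path)
        .replace("TEXTIFY_OUTPUT", output_path)
    )
    return expanded, use_stdin, use_stdout


cmd = "executable{perl} $SEARCH_HOME/bin/filter/pdf2html.pl TEXTIFY_INPUT"
expanded, use_stdin, use_stdout = plan_io(cmd, "/tmp/in.pdf", "/tmp/out.txt")
print(expanded)
print(use_stdin, use_stdout)  # False True
```

In this example the command names its input file explicitly but not its output, so the plain text would be read from the command's standard output.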
Textify files
There are usually three textify.cfg files that are consulted to determine which filters should be used on particular files. These are (in order of precedence):
- collection specific textify.cfg ($SEARCH_HOME/conf/COLLECTION_NAME/textify.cfg)
- system wide textify.cfg ($SEARCH_HOME/conf/textify.cfg)
- default textify.cfg ($SEARCH_HOME/conf/textify.cfg.default)
The textify.cfg.default file should not be changed, as it is overwritten during an upgrade.
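The precedence order can be pictured with the Python sketch below. It assumes that the first file defining an extension wins and that a missing file is simply skipped; this page only states the order of precedence, so the exact merge behaviour is an assumption. The names candidate_files and find_filter, and the /opt/search and example-collection values, are hypothetical.

```python
import os


def candidate_files(search_home, collection):
    """Return the three textify.cfg paths, highest precedence first."""
    conf = os.path.join(search_home, "conf")
    return [
        os.path.join(conf, collection, "textify.cfg"),  # collection specific
        os.path.join(conf, "textify.cfg"),              # system wide
        os.path.join(conf, "textify.cfg.default"),      # default
    ]


def find_filter(extension, search_home, collection):
    """Return (path, command) from the first file that defines the extension."""
    for path in candidate_files(search_home, collection):
        if not os.path.isfile(path):
            continue  # assumption: a file that does not exist is skipped
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                key, sep, command = line.strip().partition("=")
                if sep and key.strip() == extension:
                    return path, command.strip()
    return None


for path in candidate_files("/opt/search", "example-collection"):
    print(path)
```

Under these assumptions, a .pdf entry in the collection specific file would be used in preference to one in the system wide or default file.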
Example
Here is the PDF filter from the standard textify.cfg:
textify.cfg
.pdf=executable{perl} $SEARCH_HOME/bin/filter/pdf2html.pl TEXTIFY_INPUT
See also
- executables.cfg configuration option