PADRE is the Funnelback "engine": it indexes the documents in a collection and queries those indexes. PADRE is in fact a suite of several programs. The main ones are:
- is the indexing checker program. It is a utility used for validating indexes.
- is the program for showing stored document information, including metadata.
- is the program for adjusting flag bits (see Document Flags).
- is the indexing program. It reads the text files gathered during an update and creates the search indexes.
- is the program used for query auto completion.
- is the program for displaying the contents of the results file.
- is the search program. It parses the CGI parameters and executes the user's query, returning XML results.
Controlling indexable content
By default, PADRE indexes all of the content within each document given to it. Several mechanisms provide finer control by excluding certain content from the index:
- Control the content that is gathered:
For more information, see the relevant section of the documentation for your collection type (e.g. Web crawler exclude patterns).
- Exclusion of sections within pages:
Sections of a page that are surrounded by special HTML comments (noindex tags) are ignored by the indexer. You can insert these tags into your documents yourself, or have them added automatically based on a regular expression (see Noindex expression). The automatic option is particularly useful for excluding common navigation elements, headers and footers. Note that these tags only apply within the body of the HTML document. For example:
... This section is indexed ... <!--noindex--> ... This section is not indexed ... <!--endnoindex--> ... This section is indexed ...
Note that the 'noindex' tags used in previous versions of Funnelback (beginnoindex, *start_indexing* and *stop_indexing*) still operate correctly.
- Exclusion of whole pages:
Inserting a special meta tag into a page instructs the indexer not to index that page. The meta tag is:
<meta name="robots" content="noindex">
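The two page-level exclusion mechanisms above can be checked offline before an update runs. The following is a minimal sketch in Python, not PADRE's own implementation: the function names `strip_noindex_sections` and `has_robots_noindex` are hypothetical, the marker strings are copied from the example above, and the sketch does not model the body-only rule or the older tag names.

```python
import re
from html.parser import HTMLParser

# Exact marker strings taken from the example in this page.
# DOTALL lets a section span several lines; the match is non-greedy so
# each <!--noindex--> pairs with the nearest following <!--endnoindex-->.
NOINDEX_SECTION = re.compile(r"<!--noindex-->.*?<!--endnoindex-->", re.DOTALL)


def strip_noindex_sections(html: str) -> str:
    """Return html with every <!--noindex--> ... <!--endnoindex--> section removed.

    Assumption: an opening marker with no closing marker is left untouched.
    """
    return NOINDEX_SECTION.sub("", html)


class _RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag carries a noindex directive."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # The content attribute may hold a comma-separated list of directives.
        directives = {d.strip().lower() for d in attributes.get("content", "").split(",")}
        if "noindex" in directives:
            self.noindex = True


def has_robots_noindex(html: str) -> bool:
    """Return True if the page asks not to be indexed via the robots meta tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


if __name__ == "__main__":
    page = "Indexed <!--noindex--> hidden <!--endnoindex--> also indexed"
    print(strip_noindex_sections(page))
    print(has_robots_noindex('<meta name="robots" content="noindex">'))
```

A check like this can help confirm that a page template emits the tags where intended. What actually ends up in the index is still determined by the indexer itself.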
You can tell how old an index is by checking the collectionUpdated element in the XML results for a search (search.xml):
<collectionUpdated>Tue Jul 31 11:24:47 2012</collectionUpdated>
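As an illustration, the age of an index could be computed from that element with a short script. This is a sketch under stated assumptions: the timestamp format is inferred from the single example above, the enclosing `<results>` element in the usage example is a hypothetical stand-in for the real search.xml structure, and the function names are illustrative only. The timestamp carries no time zone, so the result is only meaningful when compared against a clock in the same zone as the search server.

```python
from datetime import datetime, timedelta
from typing import Optional
import xml.etree.ElementTree as ET

# Format inferred from the example "Tue Jul 31 11:24:47 2012".
# %a and %b are locale-dependent; this assumes English day and month names.
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def index_updated(xml_text: str) -> Optional[datetime]:
    """Return the collectionUpdated timestamp, or None if the element is absent."""
    root = ET.fromstring(xml_text)
    # iter() searches the whole tree, so the element's exact position
    # in the document does not need to be known.
    for element in root.iter("collectionUpdated"):
        return datetime.strptime((element.text or "").strip(), TIME_FORMAT)
    return None


def index_age(xml_text: str, now: datetime) -> Optional[timedelta]:
    """Return how long ago the index was updated, relative to `now`."""
    updated = index_updated(xml_text)
    return None if updated is None else now - updated


if __name__ == "__main__":
    # Hypothetical wrapper element; only collectionUpdated comes from this page.
    sample = (
        "<results>"
        "<collectionUpdated>Tue Jul 31 11:24:47 2012</collectionUpdated>"
        "</results>"
    )
    print(index_updated(sample))
    print(index_age(sample, datetime.now()))
```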
The same information is also available in the index_time file in:
- PADRE binaries usage
- Custom Summaries
- Query logs
- Query Independent Evidence
- Document Flags
- Optimising your Web Site for Search
- Funnelback Ranking Algorithms