P@noptic 2.0.0 release notes
Released: 7 February 2001
Prior to v6.0 Funnelback was known as P@noptic Search.
Hardware/Architecture
- Declining hardware prices and larger collections suggest that the preferred minimum memory configuration for a new P@NOPTIC box is now 512 MB. Default memory parameters are now set to assume this capacity. P@NOPTIC will continue to work on smaller machines (for smaller collections), but the memory parameters will need to be reduced.
- A new filesystem structure has been adopted which makes use of a second disk, if one is present, to increase system capacity and provide rudimentary backup and recovery capabilities.
- If the system is equipped with two identical disks, system installation will create a 2-disk system and copy the initial /, /home, /usr and /var partitions to the second drive.
- We are working toward supporting hardware RAID, but explicit support is not yet present in the v2.0 release.
Admin GUI
- An advanced scheduling interface is now supplied alongside the simple once-per-day, 7-day one. It allows scheduling by day-of-month and by hour-of-day: for example, a collection can be updated on the hour between 0900 and 1700 on working days, or at midnight on the 13th and 27th of each month. More than one update can be scheduled to run simultaneously, but this is only advised for small collections.
- Local and meta collections can now be created and edited via the admin GUI.
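The advanced schedule described above can be modeled as a simple predicate over the current time. The function and field names below are illustrative, not P@NOPTIC's actual configuration format; this is a minimal sketch of the matching logic only.

```python
from datetime import datetime

def update_due(now, hours=None, weekdays=None, monthdays=None):
    """Return True if a collection update should fire at `now`.

    hours     -- set of hours (0-23) at which updates may start
    weekdays  -- set of weekday numbers (0=Monday .. 6=Sunday)
    monthdays -- set of days of the month (1-31)

    A None constraint matches everything, so the simple once-per-day
    schedule is just a single hour with no other limits.
    """
    if hours is not None and now.hour not in hours:
        return False
    if weekdays is not None and now.weekday() not in weekdays:
        return False
    if monthdays is not None and now.day not in monthdays:
        return False
    return True

# "On the hour between 0900 and 1700 on working days":
working_hours = dict(hours=set(range(9, 18)), weekdays={0, 1, 2, 3, 4})

# "At midnight on the 13th and 27th of each month":
fortnightly = dict(hours={0}, monthdays={13, 27})
```

A scheduler would evaluate this predicate once per hour for each collection and start an update whenever it returns True.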
- Meta collections are automatically updated when one of their component collections is updated.
- Different colors are now associated with each type of collection, and these are used in the main collection list display and in the scheduling interface.
- The main collection list is now sorted by recency of update, with the stalest collection first. Collections that appear in the schedule of automatic updates are marked with an asterisk.
- The admin GUI now correctly handles non-HTML file formats for the new FunnelBack crawler.
- Collection updates time out after a default period of 48 hours, at which point whatever has been collected so far is indexed.
- The update confirmation screen is now tailored to the type of collection and to the crawl options in effect. It provides much more detailed information about what specific actions will be taken as part of the update.
- Email messages to the search administrator now have informational subject lines and contain much more detail about the results of crawling and indexing.
Crawler (FunnelBack)
- The open source crawler pwget is no longer included in the distribution.
- The new crawler is called FunnelBack, a name that is both a cross between two very potent Australian spiders and a functional description.
- FunnelBack incorporates a large number of internal improvements that increase its robustness and allow it to handle a vast range of web-page pathologies with equanimity.
- FunnelBack conforms to the various Web RFCs (published specifications which define Web standards). Where possible, it does the best it can with web pages and servers that don't conform to these standards.
- FunnelBack currently uses memory-resident data structures, but can still handle more than a million URLs on a 512 MB machine.
- FunnelBack prevents overloading of web servers by imposing a minimum delay between the completion of one request to a server and the issue of the next. This delay is configurable on a per-collection basis.
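The per-server politeness delay can be sketched as a small gate that each crawler thread consults before fetching. The class and method names are hypothetical; the point is that the clock starts when a request *completes*, exactly as the note above describes.

```python
import time
from urllib.parse import urlsplit

class PolitenessGate:
    """Enforce a minimum delay between the completion of one request
    to a server and the issue of the next (a per-collection setting)."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self._last_done = {}  # hostname -> time the last request completed

    def wait(self, url):
        # Sleep until the minimum delay since the last completed
        # request to this host has elapsed.
        host = urlsplit(url).hostname
        last = self._last_done.get(host)
        if last is not None:
            remaining = self.min_delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)

    def done(self, url):
        # Record the completion time, so the delay is measured from the
        # end of one request to the start of the next.
        self._last_done[urlsplit(url).hostname] = time.monotonic()
```

A thread would call `wait(url)` before issuing a request and `done(url)` after the response arrives; hosts never seen before are fetched immediately.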
- FunnelBack attempts to process documents which are classified as text/html or text/plain by the publishing web server. However, because many web servers erroneously classify documents, FunnelBack avoids parsing documents which are more than a few megabytes long or which contain more than a few ASCII control characters.
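The misclassification guard above amounts to a cheap heuristic over the raw bytes. The specific thresholds below are invented for illustration ("a few megabytes", "a few control characters"), not FunnelBack's actual values.

```python
# ASCII control characters other than common whitespace (tab, LF, CR)
# rarely appear in genuine text/html or text/plain documents.
CONTROL = set(range(0x00, 0x20)) - {0x09, 0x0A, 0x0D}

def looks_binary(data: bytes,
                 max_length=4 * 1024 * 1024,   # "more than a few megabytes"
                 max_controls=8):              # "more than a few control chars"
    """Heuristic test for documents a server has probably misclassified
    as text: too long, or containing too many control characters."""
    if len(data) > max_length:
        return True
    controls = sum(1 for b in data if b in CONTROL)
    return controls > max_controls
```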
- Crawling is now subject to a configurable time limit, set to 48 hours by default. If the intranet being crawled contains a page generator that creates millions of unique but uninformative pages, it is better to stop after a reasonable time. After the time limit expires, the crawler exits and the harvest thus far is indexed and made ready for searching. Note that, assuming good load balance and 20 parallel threads, FunnelBack is capable of retrieving a million pages in 48 hours.
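A quick sanity check of that throughput figure, assuming perfectly even load balance across the threads:

```python
pages = 1_000_000
threads = 20
limit_seconds = 48 * 3600              # the 48-hour crawl time limit

pages_per_thread = pages / threads     # 50,000 pages per thread
secs_per_page = limit_seconds / pages_per_thread
# Roughly 3.5 seconds per page per thread, which leaves room for the
# per-server politeness delay as long as requests are spread across
# many hosts.
```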
Filter framework
- We have experienced problems with the open source filter used to extract text from PDF files: sometimes no text is extracted; sometimes the filter takes a grossly excessive amount of time.
- As an interim measure, the filter framework now applies a time limit to each filter operation.
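Since the filters are external programs, the interim time limit can be imposed by running each one under a hard timeout and skipping the document when the limit expires. The function, argument convention, and default limit below are assumptions for illustration, not the framework's actual interface.

```python
import subprocess

FILTER_TIMEOUT = 300  # seconds per document; the real limit is configurable

def run_filter(command, input_path, output_path, timeout=FILTER_TIMEOUT):
    """Run one external filter (e.g. a PDF-to-text converter) under a
    hard time limit. Returns True on success, False if the filter hung
    or failed, in which case the document is simply skipped."""
    try:
        subprocess.run(command + [input_path, output_path],
                       timeout=timeout, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False        # filter hung: give up on this document
    except subprocess.CalledProcessError:
        return False        # filter failed: likewise skip it
```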
- We hope a robust new PDF text extractor can be found in time for the next release.
Indexer/query processor (PADRE)
- Added image search capability via alt text and image filenames (metadata class i:).
- Added a metadata class (k:) to represent anchor text within a document.
- Improved ranking of homepages.
- Improved support for subset/subsite search.
- Added the config.pl script for configuring and building PADRE.
- Various internal cleanups and bug fixes.
- PADRE now produces an overall indexing summary in the last 30 lines of its output.
- Better support for different document formats (e.g. XML).
- A test for binary documents is applied during indexing. Documents which are determined to be binary are ignored.
- URLs are now extracted from the BASE HREF in the HTTP header if it exists, otherwise from the document filename. The .pan.txt suffixes inserted by the filter framework are not counted as part of the URL.
Query interface
- The query interface is more customizable: HTML headers and footers may be specified on a per-collection basis, as may certain colors used in result pages.
- Depending upon customization, the result page layout now permits more results on the first screen of output.
- Small improvements to the wording and presentation of results.
- The advanced search option is no longer available. (It provided very little additional functionality, and the individual-site search capability was extremely difficult to use with more than 100 or so servers.)
- In a forthcoming release we plan to replace advanced search with a graphical interface to P@NOPTIC's metadata search capability.