P@noptic 5.5.0 release notes

Crawler

  • Incremental crawling: If your web servers provide Content-Length information in response to HTTP HEAD requests, the time and network traffic required to update a crawled collection can be dramatically reduced. The administration interface now supports both full and incremental updates and allows n-incremental, 1-full scheduling patterns.

  • Increased crawler efficiency: More efficient memory use, less frequent checkpointing and more streamlined internal operations allow more parallelism and faster crawling.

  • Duplicate detection within the crawler: This can dramatically reduce the number of documents which are downloaded only to be subsequently deleted.

  • Improved crawler handling of problem web sites: Leads to even less missed content.
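
The incremental crawling idea above can be sketched roughly as follows. This is not Panoptic's crawler code: the function name needs_refetch and the stored_lengths record are hypothetical, and the sketch assumes the only signal available is the Content-Length header returned by an HTTP HEAD request.

```python
from typing import Mapping, Optional


def needs_refetch(url: str,
                  head_headers: Mapping[str, str],
                  stored_lengths: Mapping[str, int]) -> bool:
    """Return True if the document should be downloaded again.

    head_headers: headers from an HTTP HEAD response, keys lowercased.
    stored_lengths: hypothetical record of lengths seen at the previous crawl.
    """
    raw = head_headers.get("content-length")
    if raw is None or not raw.strip().isdigit():
        # No usable Content-Length: we cannot tell whether the page changed.
        return True
    previous: Optional[int] = stored_lengths.get(url)
    if previous is None:
        # The URL was not seen in the previous crawl.
        return True
    return int(raw) != previous


stored = {"http://example.com/a.html": 1024}
print(needs_refetch("http://example.com/a.html", {"content-length": "1024"}, stored))  # False
print(needs_refetch("http://example.com/a.html", {"content-length": "2048"}, stored))  # True
```

An unchanged length does not prove that a page is unchanged, which is presumably one reason the n-incremental, 1-full scheduling pattern is offered.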

Database/XML indexing

  • Support for Ingres and Sybase databases.

  • Enhanced efficiency in indexing large databases.

  • More faithful "metadata" summaries. Database fields or XML documents containing markup (e.g. HTML bold tags) are protected during indexing and can be faithfully rendered in titles and metadata summary elements.

Indexer/query processor

  • Ability to supply a customised stop word list via the -STOP query processor option. Note, however, that stop word elimination is not particularly important in Panoptic.

  • Live URLs in web collection search results now always correspond to the URL as supplied by the original web server.

  • Preliminary support for UTF-8 character encoding, but only for European languages at this stage. Please see query processor option -utf8, and indexer option -utf8input.

  • The length of the original document is now reported, rather than length after text extraction and storage on disk.

  • Substantially improved handling of dates in XML documents.

  • Better support for large collections of documents.

  • Query term highlighting in metadata summaries, except for URLs.

  • By default, uncanonicalised queries are logged rather than queries in which the words have been reordered.

  • Allow sites to avoid indexing repeated navigational elements by inserting HTML comments. If one of the strings *stop_indexing*, beginnoindex or noindex appears at the beginning of an HTML comment (with or without preceding white space), indexing will be suppressed until the next HTML comment beginning with either of *start_indexing* or endnoindex. Note that anchor text of outgoing links will still be propagated to the link targets for indexing.
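
The comment-based suppression rule above can be illustrated with a small Python sketch. It is only an approximation of the described behaviour, not PADRE's implementation: the function name indexable_text is hypothetical, and because the notes do not make clear whether the asterisks around stop_indexing and start_indexing are literal, the sketch assumes they are optional. It also does not model the propagation of anchor text to link targets.

```python
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Markers must appear at the start of the comment, after optional white space.
STOP = re.compile(r"\s*(\*?stop_indexing\*?|beginnoindex|noindex)")
START = re.compile(r"\s*(\*?start_indexing\*?|endnoindex)")


def indexable_text(html: str) -> str:
    """Return html with suppressed regions (and all comments) removed."""
    kept = []
    indexing = True
    pos = 0
    for m in COMMENT.finditer(html):
        if indexing:
            # Keep the text between the previous comment and this one.
            kept.append(html[pos:m.start()])
        body = m.group(1)
        if indexing and STOP.match(body):
            indexing = False
        elif not indexing and START.match(body):
            indexing = True
        pos = m.end()
    if indexing:
        kept.append(html[pos:])
    return "".join(kept)


page = ("<p>Intro</p><!-- noindex --><ul><li>Nav</li></ul>"
        "<!-- endnoindex --><p>Body</p>")
print(indexable_text(page))  # <p>Intro</p><p>Body</p>
```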

Panoptic 5.5.5

Upgrading to Panoptic 5.5.5

  • Administrators running Panoptic on Windows may wish to modify their collection update tasks in the Windows Task Scheduler to use update-win.pl instead of update.pl. This will enable the main collection update log to be created at SEARCH_HOME\log\update-<collection>.log during scheduled updates.

    Also note the upgrade issues included in earlier releases if upgrading from a version of Panoptic earlier than 5.5.4.

Panoptic 5.5.5 changes

Bug fixes

  • Fixed an intermittent crash in the Excel filter (xlhtml).

  • Fixed a bug in the processing of anchortext in cases where the link target contains whitespace.

  • Fixed an overflow in the expiry date of SSL certificates.

  • Fixed the permissions on the generated SSL certificates.

  • Enable the Panoptic Web administration virtual host to be included from Apache’s conf.d directory.

  • Added quotes around the certificate paths in the Apache configuration.

  • Fixed quoting and escaping issues in limit.pl (Windows).

  • Mandatory exclusion operators, -, in metadata search form elements were being misinterpreted due to recent HTML encoding changes in PADRE.

  • Fixed a bug in PADRE’s parsing of the index.bldinfo file.

  • Fixed a bug in scanning documents containing <title>hello</title> …​</title>.

  • Fixed a bug in search.cgi that prevented it from being run under suexec.

  • Fixed various warnings and errors in schedule.cgi.

  • Fixed a potential buffer overflow when reading the licence key.

  • Added a missing argument to the call to setup-search-location.pl.

  • Default result titles in PADRE’s XML are now enclosed in CDATA sections.

Enhancements

  • Improved extraction of metadata from Word files.

  • New script, update-win.pl, to enable update logging under Windows.

  • New feature, One Shot, which sends the user directly to the URL of the first search result when the oneshot CGI parameter is specified.

  • Upgraded the JRE to 1.5.0.04.

  • Removed the redundant xxxtextify text from filtered documents.

  • Upgraded the pdftotext filter on Solaris to version 3.00.

  • New form substitution tags, resifcollection and resifnotcollection, to enable collection-specific results presentation for meta-collections.

  • Update logs now include a warning if the status email fails to send.

  • The PDF text filter now outputs the subject metadata as description metadata and subject metadata.

  • The crawler will now follow hyperlinks that are not enclosed in quotes.

  • Upgraded Apache for Win32 to 2.0.54.

Panoptic 5.5.4

Upgrading to Panoptic 5.5.4

  • No issues when upgrading from 5.5.1 or later. Refer to the version 5.5.1 upgrade issues for information about upgrading from earlier versions.

Panoptic 5.5.4 changes

Bug fixes

  • The external metadata system was skipping default pages.

  • The resifnot{} tag was not working across multiple lines.

  • Fixed an anchortext matching problem on Windows.

  • Don’t override DC.Title metadata with empty HTML <title> elements.

  • Fixed an error in the <s:Truncate> help page.

New features

  • Added support for radio buttons in search interfaces.

  • xml-splitter.pl can now accept regular expression input instead of just a literal Xpath.

  • New official Debian and SuSE versions of Panoptic.

  • Better support for ISO8601 date scanning.

  • Added the new "Powered by Panoptic" logo to the default forms.

  • Added support for handling anchortext pointing to HTML redirects.

Panoptic 5.5.3

Upgrading to Panoptic 5.5.3

  • No issues when upgrading from 5.5.1 or later. Refer to the version 5.5.1 upgrade issues for information about upgrading from earlier versions.

Panoptic 5.5.3 changes

Bug fixes

  • Fixed a problem related to indexing broken symlinks and long filenames.

  • Update status emails include log files from the live view when the update is performed in the live view.

  • Click.cgi redirects and exits if referring URL is undefined.

  • Fixed uninitialised variable warning in Utils.pl.

  • Fixed a bug in CMWeb collections that caused the 'filter' option to be reset to 'false' when edited via the Web admin interface.

  • An empty scope parameter before query parameter no longer removes the query parameter.

  • Fixed the indexer crash when phantom documents are used in conjunction with the MDSF file.

  • Fixed the multiple match-point overflow error messages.

  • Numerous fixes to refine.cgi.

  • Reinstated the "next page" link.

  • XML files are correctly displayed in the cached view.

  • Fixed crawler execution problem that occurred intermittently when the update was started by cron.

  • Fixed a bug in whitespace queries.

  • Fixed a bug in '0' value queries.

  • The Solaris installer now creates the SSL certificates before attempting to configure Apache.

  • The Solaris installer no longer adds the "LoadModule Suexec" directive (in case it’s already compiled in).

Enhancements

  • Better use of CSS in the default forms files.

  • Collection size information is written to size.log.

  • Click.cgi (click-through logger) is now installed in the Web area.

  • Add support for logging result rank in click.cgi.

  • The inline thesaurus matches on whole words only.

  • Installation continues if Apache fails to restart.

  • Added new form tag operator: <s:Compare>

  • Added new form tier bar customisation tags: <s:TierBarFeaturedPages>, <s:TierBarFullyMatching> and <s:TierBarPartiallyMatching>

  • Added support for the showform CGI parameter that forces display of the initial search form by not executing the query processor.

  • Added support for the fp_tiers CGI parameter to enable/disable featured pages tiers.

  • Added support for 'separator' and 'label' attributes for <s:PrevNext>.

  • Fixed a bug in processing xxx_orplus queries.

  • The Apache configuration is now written to a temporary file if the SEARCH_SERVICE environment variable has been set to BUREAU in the existing Apache config.

  • Added map tags around the <s:PrevNext> tags for better accessibility.

  • Added support for collection option index to enable/disable indexing.

  • Added calendar to standard crawler exclude patterns.

  • Enable meta collections to be updated for the purpose of query log management.

  • Added new version of PDF introductory guide.

  • Added support for featifnot for enhanced display of featured pages.

  • Better support for dealing with date operators in refine.cgi.

  • Added robots noindex metadata to search forms.

  • Fixed display of featured pages when a description is not provided.

  • Inline thesaurus: Display all suggestions and sort lexicographically.

  • Integrated the new SecureCGI.pm library to enhance protection against cross-site scripting attacks.

  • Implemented a minimum term frequency of 2 for words to be added to the spelling dictionary.

  • The document format select list in the advanced form is now of type scoped, so as to remove result tiers that do not conform to the selected format.

  • Metadata query parameters can be combined with the "scoped and" (*_sand), "not" (*_not) and "and" (*_and) modifiers.

  • Multiple scope CGI parameters are joined to enable scopes to be selectable via check boxes.

  • The scope parameter is now displayed on a separate line on both forms.

  • The scope parameter is recorded as a hidden parameter in the advanced form to enable multiple scoped queries.

  • The current scope is displayed on the advanced form.

  • Added a work-around for a RedHat EL3 bug in the spelling dictionary builder.

  • Updated version of cpio.exe for use with the installer.

  • All calls to Perl scripts are prefixed with the full path to the Perl interpreter to avoid the Windows file association bug.

  • Encode special characters in Windows filenames.

  • PerlIS is no longer used by default with IIS (Perl.exe is far more stable).

  • Check that Windows user has administrator privileges.

  • Check that Windows user is using Perl 5.8.
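
The minimum term frequency rule for the spelling dictionary, mentioned in the list above, could look something like the following sketch. The function name spelling_candidates and the simple tokenisation are hypothetical, and the sketch assumes that frequency means total occurrences across the whole collection, which the notes do not specify.

```python
import re
from collections import Counter
from typing import Iterable, Set

MIN_TERM_FREQUENCY = 2  # threshold stated in the release note


def spelling_candidates(documents: Iterable[str],
                        min_freq: int = MIN_TERM_FREQUENCY) -> Set[str]:
    """Return the words that occur at least min_freq times overall."""
    counts: Counter = Counter()
    for text in documents:
        # Crude tokenisation (assumption): runs of ASCII letters, lowercased.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, n in counts.items() if n >= min_freq}


docs = ["search engine", "serach engine", "search results"]
print(sorted(spelling_candidates(docs)))  # ['engine', 'search']
```

A threshold like this keeps one-off misspellings such as "serach" out of the dictionary, so they are not later offered as suggestions.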

Panoptic 5.5.1

Upgrading to Panoptic 5.5.1

  • To enhance the customisability of the results pages, an HTML <br> element was taken out of the search wrapper (search.cgi) and put into the default interface form files. The results page formatting of existing collections that use the Featured Pages mechanism would benefit from having this <br> element inserted into the form files immediately after the featured pages link (see the new simple.form.dist file as a reference).

  • Use of the 'search-apache' group has been replaced with group 'search'. This change is made automatically during the upgrade to 5.5.x and doesn’t require any action from the administrator.

  • The query processor now outputs result document sizes as bytes instead of kilobytes. The document sizes also now represent the pre-filtered size. The default search wrapper, search.cgi, converts these sizes back to kilobytes but custom wrappers will need to handle this.

  • Titles and metadata summary fields in the PADRE XML output are now protected by CDATA sections. The search wrapper (search.cgi) has taken over the role of stripping the CDATA markup and encoding the encapsulated text into HTML.

    This doesn’t require any changes for users of the standard search.cgi wrapper. Administrators using a custom search wrapper will need to make these changes to their own wrapper or enable backwards compatibility mode by specifying -nocdata as an indexer and query processor option.

    For a full explanation of this change see: http://www.panopticsearch.com/AdminHelp/summaries.html
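
For administrators maintaining a custom wrapper, the two conversions described above might be sketched as follows. This is an illustration only, not the code in search.cgi: the function names are hypothetical, the sketch assumes that all markup inside a CDATA section should be HTML-escaped, and rounding up to whole kilobytes is likewise an assumption.

```python
import html
import re

CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_html(field: str) -> str:
    """Strip CDATA markup and HTML-encode the text it wrapped."""
    return CDATA.sub(lambda m: html.escape(m.group(1)), field)


def bytes_to_kilobytes(size_in_bytes: int) -> int:
    """Convert a byte count to whole kilobytes, rounding up (assumption)."""
    return (size_in_bytes + 1023) // 1024


print(cdata_to_html("<title><![CDATA[Fish & <b>chips</b>]]></title>"))
# <title>Fish &amp; &lt;b&gt;chips&lt;/b&gt;</title>
print(bytes_to_kilobytes(2048))  # 2
```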

Panoptic 5.5.1 changes

Bug Fixes

  • Featured pages are now displayed when there are zero results.

  • Correct detection of cron running on Debian for status.cgi.

  • Set the spelling_main_dictionary parameter to 'english' on Red Hat 9 installations.

  • Fixed the bug resulting from having an accented character as the last character in a summary.

  • PADRE now responds to -v before failing due to the licence check.

  • PADRE looks for the licence key in C:\Panoptic\search on Windows when SEARCH_HOME hasn’t been set.

  • The tmp_qresults file is unlinked after the query cache is loaded.

  • Stemming no longer deletes single character query terms.

  • Fixed crawler halting problem on some systems.

  • Fixed index spell parsing problems (due to accented characters).

  • Index-spell.pl now sends its STDERR to a log file.

Enhancements

  • Incremental crawling.

  • Exact text representation of metadata summaries fields.

  • Increased crawler efficiency: More efficient memory use, less frequent check-pointing and more streamlined internal operations allow more parallelism and faster crawling.

  • Improved crawler handling of problem web sites (including Lotus Domino based sites).

  • Enable parsing of standard date formats in the XML scanner.

  • More efficient database extraction.

  • Support for Lotus Domino, Ingres and Sybase databases.

  • Ability to supply a customised stop word list via the -STOP query processor option.

  • Live URLs in web collection search results now always correspond to the URL as supplied by the original web server.

  • Preliminary support for UTF-8 character encoding, but only for European languages at this stage. Please see query processor option -utf8, and indexer option -utf8input.

  • The length of the original document is now reported, rather than length after text extraction and storage on disk.

  • Better support for large collections of documents.

  • Query term highlighting in metadata summaries.

  • Support for crawling annodexed media files.

  • Support for integration with the iPlanet Web server.

  • PADRE scans up to 220 characters into each doc for <html> instead of only 120 to determine document type.

  • Improve PADRE’s ability to determine the fully qualified hostname for licence checking purposes.

  • Set the appropriate threading library based on kernel version in crawl.pl.

  • Move the DB2XML log file to the log directory under the collection root (so that it can be viewed by the admin interface) and rename it to database.log.

  • Make the max_heap_size argument apply to the DB2XML call.

  • Parameterised the Perl locale in crawl.pl.

  • All textify.conf files now refer to SEARCH_HOME as $SEARCH_HOME on Solaris and Windows.

  • Removed the query stats, query report and status scripts from the search area.

  • Now use whoami for root user detection.

  • Limit the number of Blat retries on Windows to 3.

  • Replace the 'search-apache' group with 'search'.

  • Check for the existence of files before chgrpping (the chgrp command is no longer silent with the -f option).

  • Add the aspell, pstotext, wvWare, catdoc, xlhtml and ppthtml source packages to the Solaris version.

  • Allow the Web user and group to be different.

  • Added a new line to the htpasswd file when not using Apache (for iPlanet Integration).

  • Improved the featured pages formatting.

  • Strip CDATA sections from input XML.

  • Protect summary titles and metadata fields with CDATA sections in PADRE’s output XML.

  • PADRE returns file sizes in bytes. These are then converted to kilobytes by search.cgi.

  • No longer display collection listing in search.cgi if environment variable SEARCH_SERVICE is set to BUREAU.

  • Upgrade Apache to 2.0.50 on Win32.

  • Upgrade to JRE 1.4.2_05 on Win32, Red Hat and Debian.