These guidelines will help you build a site that is highly searchable. A searchable site gives users a better search experience in Funnelback (and any other search product), plus greater visibility in global search engines such as Yahoo, Google or Bing. This translates to efficiency gains for employees and easier access to information for customers and stakeholders.
The following list gives suggestions on improving your site's searchability:
- Avoid excessive reliance on dynamically generated web pages
- Spiders work by following links. With dynamically generated content, they can potentially miss important pages or clutter up indexes with rubbish. When you do generate pages dynamically, give each page a single, short, human-readable URL.
- Avoid excessive use of <frameset>s and <frame>s
- Funnelback indexes the frame and its component pages separately. When a particular search result is returned it may appear without the context which would have been provided by the frame.
- Split large documents into smaller documents
- If some of your documents are very long, consider publishing them as separate chapters or sections. Imagine that your organisation has an administrative procedures manual (APM) which is 3,000 pages long and an HR employee enters the search query "long service leave". A PDF file of the whole APM wouldn't be a good answer to the query, even though it contains the best answer, because the HR employee would then need to search through the very large document for what they actually wanted. A far better answer would be a single HTML file containing "Section 13.4.5: Long Service Provisions".
- Exclude unsuitable material
- Configure your collections (or use robots.txt files) to prevent the crawler from accessing material which isn't suitable for searching. You may wish to exclude mirror sites and directories of non-textual data. Excess material increases disk space usage and slows down crawling, indexing and query processing. Focusing the indexed material may also improve the quality of results.
- Prevent individual pages that do not contain useful information from being indexed
- This may include pages that are useful in a browsing context, but are less likely to be appropriate as search results. Examples include A-Z listing pages and mid- and low-level index pages. The <meta name="robots" content="follow,noindex"/> robots metadata directive is appropriate here: it asks crawlers to follow the links on the page without indexing the page itself.
- Exclude portions of a page (such as navigation content)
- This might include navigational elements, headers, footers, etc. (See Controlling indexable content in PADRE for details.) The query-biased result summaries on some sites suffer in quality because they include sentences extracted from the site navigation text instead of the main document content. A solution to this problem is to add directives to the web pages to indicate that certain sections should not be indexed. Where these pages cannot be modified at the source, the use of a NoIndexFilterInjector is recommended. Note that anchor text is always indexed as part of the target document, so ranking quality is not affected.
- Supply correct date metadata
- This could be achieved by ensuring that page-level date metadata is published in a supported format, or by ensuring that your webserver is configured to send the correct document modified dates in the HTTP headers.
- Create concise page titles
- The <title> tag is often used as the search result title and helps provide a strong information scent. Titles should be unambiguous and give users a clear indication of the result's content, purpose and context.
- Create descriptive link text
- Link text is the words that form the visible text of a hyperlink in your HTML. Avoid generic link text such as "More..." or "Click here...", which says nothing about the page being linked to.
- Create good quality metadata
- Funnelback can be configured to index metadata and to use it for display purposes. For example, a metadata abstract can be presented instead of the auto-generated snippet. Good metadata can also be used to provide faceted navigation. Bad metadata is worse than having no metadata at all.
- Ensure your webserver serves appropriate status codes
- During crawling, any requested URL that returns a 200 OK status code is regarded as a valid page, even if the page itself contains a 'Broken / Not Found' message. Configure your webserver so that broken URLs return a 404 Not Found status code.
- The Web Robots Pages (robotstxt.org)
- Dublin Core Metadata Schema (Dublin Core Metadata Initiative)
- Twitter Cards markup Tag Reference
- OpenGraph Object Types
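
As an illustration of the "Exclude unsuitable material" guideline, the sketch below uses Python's standard urllib.robotparser module to check which URLs a set of robots.txt rules would block. The robots.txt content, the paths, the crawler name and the example.com URLs are all hypothetical placeholders; this is not Funnelback's own crawler logic, only a way to sanity-check rules before publishing them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths stand in for a mirror directory and a
# directory of non-textual data that should not be crawled.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mirror/
Disallow: /data/images/
"""


def crawl_permissions(robots_txt, user_agent, urls):
    """Return {url: bool} saying whether each URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    urls = [
        "https://www.example.com/manual/section-13-4-5.html",
        "https://www.example.com/mirror/index.html",
        "https://www.example.com/data/images/logo.png",
    ]
    # "ExampleCrawler" is a made-up user agent name.
    for url, allowed in crawl_permissions(ROBOTS_TXT, "ExampleCrawler", urls).items():
        print("allowed" if allowed else "blocked", url)
```

Running a check like this against a sample of real URLs helps confirm that the rules exclude only the material you intended.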
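
The guideline on preventing individual pages from being indexed relies on the robots metadata directive being present and spelled correctly. The following minimal sketch, which assumes the page HTML is already available as a string, extracts the directives from any robots meta tag so that listing pages can be audited. The function names and the sample page are illustrative assumptions.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def is_indexable(html):
    """A page is treated as indexable unless it declares noindex."""
    return "noindex" not in robots_directives(html)


if __name__ == "__main__":
    # Hypothetical A-Z listing page.
    page = '<html><head><meta name="robots" content="follow,noindex"/></head></html>'
    print(sorted(robots_directives(page)), is_indexable(page))
```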
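
To support the "Create descriptive link text" guideline, a simple audit can flag links whose text is generic. The sketch below is one possible approach using only the standard library; the list of vague phrases is an assumption and should be adapted to your own content.

```python
from html.parser import HTMLParser

# Assumed list of generic phrases; extend it to suit your site.
VAGUE_LINK_TEXT = {"more", "read more", "click here", "here"}


class LinkTextAuditor(HTMLParser):
    """Record (href, text) for links whose text is too generic."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments collected inside that link
        self.vague_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            # Ignore case and trailing dots when comparing.
            if text.lower().rstrip(". ") in VAGUE_LINK_TEXT:
                self.vague_links.append((self._href, text))
            self._href = None


def find_vague_links(html):
    """Return a list of (href, link text) pairs that need better text."""
    auditor = LinkTextAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.vague_links


if __name__ == "__main__":
    sample = (
        '<p><a href="/leave">Long service leave provisions</a> '
        '<a href="/more">More...</a> <a href="/help">Click here...</a></p>'
    )
    print(find_vague_links(sample))
```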
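
The status code guideline can be checked by looking for "soft 404" pages: pages that return 200 OK but whose content says the page is missing. The following is a heuristic sketch only. The phrase list and function names are assumptions, and a phrase match is a hint that a URL deserves a manual look, not proof that the server is misconfigured.

```python
import urllib.error
import urllib.request

# Assumed phrases that suggest an error page; tune these for your site.
NOT_FOUND_PHRASES = ("page not found", "404 not found", "broken link")


def is_soft_404(status_code, body):
    """True when a 200 response looks like a 'not found' page."""
    if status_code != 200:
        return False  # a real error status is the correct behaviour
    text = body.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)


def fetch_status_and_body(url, timeout=10):
    """Fetch a URL and return (status code, decoded body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the error carries the status.
        return err.code, err.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Offline demonstration with hand-written responses.
    print(is_soft_404(200, "<h1>Page Not Found</h1>"))
    print(is_soft_404(404, "<h1>Page Not Found</h1>"))
    print(is_soft_404(200, "<h1>Long service leave</h1>"))
```

A useful spot check is to call fetch_status_and_body on a URL that is known not to exist on your site and confirm that the returned status is 404 rather than 200.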