crawler.additional_link_extraction_pattern

Background

This option defines an additional regular expression used to extract URLs from content. The default is a pattern that recognizes absolute URLs. More specifically, it matches a URL that meets the following conditions:

  1. starts with the protocol header (i.e., http(s) or ftp(s)) or

  2. starts with a subdomain name, which is composed of alphanumeric characters (e.g., "www.", "docs.", etc.) and

  3. contains the domain name without the domain type (e.g., "google" from "google.com"), which is composed of one or more alphanumeric characters plus some designated special characters (i.e., -_+) and

  4. ends with a domain type, which is composed of 2 to 4 alphabetical characters (e.g., ".edu", ".com", ".au", etc.) or

  5. starts with an IP address (e.g., "192.10.0.1") or

  6. starts with "localhost" or

  7. ends with the port number after a colon (e.g., ":8443") or

  8. is followed by a path segment (e.g., "/test/am/i")

examples:

<loc>http://www.abc.net.au/mobile</loc>
{link : doc1.csiro.au}
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>
...
Refer to the unit tests for this key for more examples of accepted URLs.
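
As a rough illustration of how the conditions above combine, the following Python sketch runs a deliberately simplified stand-in pattern over the example snippets. The names PROTOCOL, LABEL, HOST, PORT, PATH and SIMPLIFIED are assumptions made for this sketch. The pattern is not the shipped default: it ignores IP addresses and localhost, does not range-check the port, and is evaluated with Python's re module, which may differ in detail from the crawler's own regular expression engine.

```python
import re

# Hypothetical building blocks, loosely following the conditions listed above.
PROTOCOL = r"(?:https?|ftps?)://"                  # http(s) or ftp(s)
LABEL = r"[a-zA-Z0-9][-a-zA-Z0-9_+]*"              # subdomain or domain name
HOST = rf"{LABEL}(?:\.{LABEL})*\.[a-zA-Z]{{2,4}}"  # labels plus a domain type
PORT = r"(?::[0-9]{1,5})?"                         # optional port (range not checked)
PATH = r"(?:/[-a-zA-Z0-9&@:%._+~#=?]*)*"           # optional path segments

# Simplified stand-in, NOT the default value of the configuration key.
SIMPLIFIED = re.compile(rf"(?:{PROTOCOL})?{HOST}{PORT}{PATH}")

snippets = [
    '<loc>http://www.abc.net.au/mobile</loc>',
    '{link : doc1.csiro.au}',
    '<a href="http://www.abc.net.au">ABC</a>',
    '<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>',
]

# The pattern has no capturing groups, so findall returns whole matches.
extracted = [SIMPLIFIED.findall(s) for s in snippets]
for snippet, urls in zip(snippets, extracted):
    print(urls, "<-", snippet)
```

Each snippet should yield exactly one URL, with the surrounding markup (tags, quotes, braces) left out of the match.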

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.additional_link_extraction_pattern key, and set the value. This can be set to any valid String value.

Default value

crawler.additional_link_extraction_pattern=(((http(s)?|ftp(s)?)://[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.|
                                            [a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.[a-zA-Z]{2,4} |
                                            ((http(s)?|ftp(s)?)://)localhost |
                                            ((http(s)?|ftp(s)?)://)?(((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}))
                                            (:(0|[1-9][0-9]{0,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5]))?
                                            ((/([\?-a-zA-Z0-9&@:%.\_+~#=&]{0,}))+)?

If no value is defined, then the above default is used.
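
The port and IP address parts of the default are the hardest to read, so the sketch below checks them in isolation. The two fragments are copied from the default value above; the helper names is_port and is_ipv4 are assumptions made for this sketch, and the check uses Python's re module rather than the crawler itself.

```python
import re

# Port fragment copied from the default value (without the trailing "?").
PORT = re.compile(
    r":(0|[1-9][0-9]{0,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
    r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])"
)
# IPv4 fragment copied from the default value.
IPV4 = re.compile(r"((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}")


def is_port(text):
    """Return True if the whole text is a colon plus an accepted port."""
    return PORT.fullmatch(text) is not None


def is_ipv4(text):
    """Return True if the whole text is accepted by the IPv4 fragment."""
    return IPV4.fullmatch(text) is not None


# Exhaustively check the port range the alternation is built for.
all_valid_ports_accepted = all(is_port(f":{n}") for n in range(65536))
any_larger_port_accepted = any(is_port(f":{n}") for n in range(65536, 100000))

print(is_port(":8443"), is_port(":65536"))
print(is_ipv4("192.10.0.1"), is_ipv4("256.1.1.1"))
print(all_valid_ports_accepted, any_larger_port_accepted)
```

Read this way, the long port alternation is a digit-by-digit encoding of the range 0 to 65535, and the IP address fragment limits each of the four octets to 0 to 255.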

Notes:

  1. This default value only works on absolute URLs. It does not work on relative URLs.

  2. Its activation is controlled by the boolean value of crawler.use_additional_link_extraction. If that value is false, this regular expression is NOT used. Otherwise, it is used in addition to crawler.link_extraction_regular_expression.

  3. It does not recognize Unicode or non-English alphabets.
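
Notes 1 and 3 can be illustrated with a small Python sketch. It uses only the host-name alternative copied from the default value above, so the behaviour of the full default may differ. The helper first_match, the relative link "/docs/page.html" and the host "münchen.example.de" are hypothetical and exist only for this example.

```python
import re

# Host-name alternative copied from the default value above
# (subdomain, domain name, then a 2 to 4 letter domain type).
HOST = re.compile(
    r"[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.[a-zA-Z]{2,4}"
)


def first_match(text):
    """Return the first substring the fragment matches, or None."""
    found = HOST.search(text)
    return found.group(0) if found else None


# An absolute host name is picked up.
absolute = first_match("{link : doc1.csiro.au}")
# A relative link has no host name for the fragment to latch onto.
relative = first_match('<a href="/docs/page.html">Docs</a>')
# A non-ASCII letter is outside [a-zA-Z0-9], so matching can only start after it.
non_ascii = first_match("münchen.example.de")

print(absolute, relative, non_ascii)
```

In this sketch the non-ASCII host is not rejected outright: the match starts after the unsupported character, so the extracted text is a truncated host name rather than the intended one.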

Examples

crawler.additional_link_extraction_pattern="(\s)(href|src)(\s)*=(\s)*('|")?\s*(.*?)(>|\4|(\s\w+\=))>?([^<]+)?(</a)?"