crawler.additional_link_extraction_pattern
Background
This option defines an additional regular expression that is used to extract URLs from content. Its default is a pattern that recognizes any absolute URL. More specifically, it matches a URL that meets the following conditions:
-
starts with a protocol scheme (i.e., http(s) or ftp(s)) or
-
starts with a subdomain name, which is composed of alphanumeric characters (e.g., "www.", "docs.", etc.) and
-
contains the domain name without the domain type (e.g., "google" from "google.com"), which is composed of one or more alphanumeric characters plus the special characters -, _ and + and
-
ends with a domain type, which is composed of 2 to 4 alphabetic characters (e.g., ".edu", ".com", ".au", etc.) or
-
starts with an IP address (e.g., "192.10.0.1") or
-
starts with "localhost" or
-
ends with a port number preceded by a colon (e.g., ":8443") or
-
is followed by a path segment (e.g., "/test/am/i")
Examples:
<loc>http://www.abc.net.au/mobile</loc>
{link : doc1.csiro.au}
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>
...
Please refer to the unit tests of this key for more examples of accepted URLs.
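To make the examples more concrete, the sketch below shows which substring a pattern of this general shape would pull out of each snippet. It is only an illustration: SIMPLE_URL is a simplified, hypothetical stand-in written for this example (it is not the product default), extract_links is a made-up helper name, and Python's re module is used here, which may behave differently from the crawler's own regular expression engine.

```python
import re

# Simplified stand-in for illustration only; this is NOT the product default.
SIMPLE_URL = re.compile(
    r"(?:(?:https?|ftps?)://)?"            # optional scheme
    r"[a-zA-Z0-9][-a-zA-Z0-9_+]*"          # first host label
    r"(?:\.[a-zA-Z0-9][-a-zA-Z0-9_+]*)+"   # one or more further labels
    r"(?::\d{1,5})?"                       # optional port
    r"(?:/[-a-zA-Z0-9&@:%._+~#=?]*)*"      # optional path segments
)


def extract_links(content):
    """Return every substring of content matched by the simplified pattern."""
    return [m.group(0) for m in SIMPLE_URL.finditer(content)]


# The sample snippets are the ones listed in the examples above.
samples = [
    '<loc>http://www.abc.net.au/mobile</loc>',
    '{link : doc1.csiro.au}',
    '<a href="http://www.abc.net.au">ABC</a>',
    '<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>',
]

for sample in samples:
    print(sample, "->", extract_links(sample))
```

Because the pattern works on raw content rather than parsed markup, it finds URLs inside element text, attribute values, and non-HTML fragments alike.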
Setting the key
Set this configuration key in the search package or data source configuration.
Use the configuration key editor to add or edit the crawler.additional_link_extraction_pattern key, and set the value. This can be set to any valid String value.
Default value
crawler.additional_link_extraction_pattern=(((http(s)?|ftp(s)?)://[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.|
[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\.[a-zA-Z]{2,4} |
((http(s)?|ftp(s)?)://)localhost |
((http(s)?|ftp(s)?)://)?(((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}))
(:(0|[1-9][0-9]{0,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5]))?
((/([\?-a-zA-Z0-9&@:%.\_+~#=&]{0,}))+)?
If no value is defined, then the above default is used.
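The default is easier to read when its parts are checked in isolation. The sketch below copies two fragments from the default value shown above: the IP address group and the port group (with the trailing ? dropped so that an empty string is rejected). The helper names is_port_suffix and is_ipv4 are hypothetical, and the check uses Python's re module, which is assumed to treat these particular fragments the same way as the crawler's engine.

```python
import re

# Port group copied from the default value; the trailing "?" is dropped here
# so that an empty string does not count as a port.
PORT = re.compile(
    r":(0|[1-9][0-9]{0,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
    r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])"
)

# IP address group copied from the default value.
IPV4 = re.compile(r"(((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4})")


def is_port_suffix(text):
    """True if text is a colon followed by a port in the range 0-65535."""
    return PORT.fullmatch(text) is not None


def is_ipv4(text):
    """True if the whole of text is matched by the IP address fragment."""
    return IPV4.fullmatch(text) is not None


print(is_port_suffix(":8443"))    # True
print(is_port_suffix(":65536"))   # False, above the highest valid port
print(is_ipv4("192.10.0.1"))      # True
print(is_ipv4("256.1.1.1"))       # False, 256 is not a valid octet
```

The long alternation in the port group exists to cap the value at 65535, which a plain digit count such as [0-9]{1,5} would not do.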
Notes:
-
The default value matches only absolute URLs. It does not match relative URLs.
-
Its activation is controlled by the boolean value of crawler.use_additional_link_extraction. If that value is false, this regular expression is NOT used. Otherwise, it is used in addition to crawler.link_extraction_regular_expression.
-
It does not recognize Unicode or non-English alphabets.
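The first and last notes follow directly from the character classes in the default, which list only ASCII letters, digits, and a few symbols. As a minimal sketch, the fragment below is the subdomain.domain.type alternative copied from the default value; host_matches is a hypothetical helper, the sample host names are made up for illustration, and Python's re module stands in for the crawler's engine.

```python
import re

# Host alternative copied from the default value:
# subdomain "." domain "." domain type (2 to 4 ASCII letters).
HOST = re.compile(
    r"[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\."
    r"[a-zA-Z0-9][-a-zA-Z0-9_+]{0,}\."
    r"[a-zA-Z]{2,4}"
)


def host_matches(text):
    """True if the whole of text is matched by the host fragment."""
    return HOST.fullmatch(text) is not None


print(host_matches("www.example.com"))   # True
print(host_matches("docs.csiro.au"))     # True
print(host_matches("www.bücher.de"))     # False, "ü" is outside the ASCII classes
print(host_matches("/test/am/i"))        # False, a relative path has no host part
```

A site whose links use internationalized host names or relative paths would therefore need a custom value for this key, or would need to rely on the crawler's other link extraction settings.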