crawler.link_extraction_regular_expression
Background
This option defines the regular expression that will be used to extract URLs from HTML links like the following:
<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>
Setting the key
Set this configuration key in the search package or data source configuration.
Use the configuration key editor to add or edit the crawler.link_extraction_regular_expression key, and set the value. This can be set to any valid String value.
Examples
crawler.link_extraction_group=5
crawler.link_extraction_regular_expression=\s(href|src)(\s)*=(\s)*(\'|\")?\\s*(.*?)(>|\"|\'|(\s\w+\=))
Extracted groups:
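- (href|src): handles link, a or img HTML tags.
- (\s): optional spaces
- (\s): optional spaces
- (\'|\"): quote that begins the URL
- (.*?): the URL (non-greedy pattern). This is the fifth group, which is the one selected by crawler.link_extraction_group=5.
- (>|\"|\'|(\s\w+\=)): ends the URL (a closing >, a quote, or the start of the next attribute)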
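
To see how the group number and the expression work together, here is a minimal sketch in Python that applies the expression above to sample tags and returns the fifth group. It is an illustration only, not the crawler's own implementation. It rests on two assumptions: that the doubled backslash in \\s* is read as a single \s*, and that Python's re module behaves like the crawler's regular expression engine for this pattern. The names LINK_PATTERN, LINK_GROUP and extract_links are hypothetical.

```python
import re

# Assumption: the "\\s*" in the configured value is read as "\s*".
# Assumption: Python's re engine matches this pattern the same way
# the crawler's own engine does; that is not confirmed by the source.
LINK_PATTERN = re.compile(
    r"""\s(href|src)(\s)*=(\s)*(\'|\")?\s*(.*?)(>|\"|\'|(\s\w+\=))"""
)

# Mirrors crawler.link_extraction_group=5: the group holding the URL.
LINK_GROUP = 5


def extract_links(html):
    """Return the text captured by LINK_GROUP for every match in html."""
    return [m.group(LINK_GROUP) for m in LINK_PATTERN.finditer(html)]


SAMPLE_HTML = """
<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>
"""

for url in extract_links(SAMPLE_HTML):
    print(url)
```

The expression requires whitespace before href or src, so an attribute that directly follows a closing quote with no space in between would not be matched by this sketch.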