crawler.link_extraction_regular_expression

Background

This option defines the regular expression that will be used to extract URLs from HTML links like the following:

<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.link_extraction_regular_expression key, and set the value. This can be set to any valid String value.

Default value

crawler.link_extraction_regular_expression=\s(href|src)(\s)*=(\s)*('|")?\s*(.*?)(>|\4|(\s\w+\=))>?([^<]+)?(</a)?

If no value is defined, then the above default is used.
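
As a rough illustration of how the default pattern behaves, the sketch below applies it with Python's re module. It assumes that the crawler's regular expression engine treats the pattern the same way Python does. The helper name extract_links is hypothetical, and the use of groups 5 and 8 comes from counting the parentheses in the pattern rather than from this page.

```python
import re

# Default value from this page, copied verbatim.
DEFAULT_PATTERN = re.compile(
    r"""\s(href|src)(\s)*=(\s)*('|")?\s*(.*?)(>|\4|(\s\w+\=))>?([^<]+)?(</a)?"""
)


def extract_links(html):
    """Return (url, trailing_text) pairs taken from groups 5 and 8.

    Group 8 holds whatever text follows the attribute up to the next '<';
    for a simple <a> tag that is the link text.
    """
    return [(m.group(5), m.group(8)) for m in DEFAULT_PATTERN.finditer(html)]


# Double-quoted and single-quoted attribute values are both handled.
print(extract_links('<a href="http://www.abc.net.au">ABC</a>'))
print(extract_links("<a href='http://www.abc.net.au'>ABC</a>"))
```

In this pattern the \4 backreference ends the URL at the same quote character that opened it, rather than at either kind of quote.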

Examples

crawler.link_extraction_group=5
crawler.link_extraction_regular_expression=\s(href|src)(\s)*=(\s)*(\'|\")?\s*(.*?)(>|\"|\'|(\s\w+\=))

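Extracted groups:

(href|src): group 1, the attribute name; handles link, a and img HTML tags
(\s): group 2, optional spaces before the equals sign
(\s): group 3, optional spaces after the equals sign
(\'|\"): group 4, the optional quote that begins the URL
(.*?): group 5, the URL (non-greedy pattern); this is the group selected by crawler.link_extraction_group=5
(>|\"|\'|(\s\w+\=)): group 6, the end of the URL: a closing angle bracket, a quote, or the start of the next attribute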
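
The example above can be tried out with a minimal Python sketch like the one below. It assumes that Python's re module treats the pattern the same way as the crawler's regular expression engine, and that the whitespace before the URL group is meant to be written with a single backslash (\s*). The helper name extract_urls and the constant names are hypothetical.

```python
import re

# Example pattern from this page; \s* before the URL group is written with a
# single backslash, which is an assumption about the intended value.
LINK_PATTERN = re.compile(
    r"""\s(href|src)(\s)*=(\s)*(\'|\")?\s*(.*?)(>|\"|\'|(\s\w+\=))"""
)

# Mirrors crawler.link_extraction_group=5: group 5 holds the URL.
URL_GROUP = 5


def extract_urls(html):
    """Return the URL captured by URL_GROUP for every match in html."""
    return [m.group(URL_GROUP) for m in LINK_PATTERN.finditer(html)]


sample = """<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>"""

for url in extract_urls(sample):
    print(url)
```

The pattern begins with \s, so an href or src attribute is only matched when it is preceded by whitespace.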
