Crawling rules configuration

Modified on Tue, 19 Mar 2024 at 02:57 PM

  1. Common Functions for All Rules
  2. How to Set up Crawling Rules
  3. Combination of Conditions and Settings

Crawling rules specify which type of URLs to include or exclude from crawling.

1. Common Functions for All Rules

  • To configure crawling rules, tick the ‘Use crawling rules’ checkbox.

If you need to crawl the website without the created rules but don’t want to delete them, untick this checkbox.

  • ‘Follow the links on filtered URLs’ → check it to follow all links located on the pages that meet the rules.

Note that filtered URLs will not be added to the results table but will be included in the ‘Skipped URLs’ report.

  • Add a rule → adds a new rule in the settings window. You can also add a rule with the Ctrl+N hotkey. The number of rules is not limited.
  • Filter logic → defines how the rules work together; choose one of two options:
    • AND → combines two or more rules: it returns ‘true’ only if all of the set rules return ‘true’.
    • OR → returns ‘true’ if at least one of the set rules returns ‘true’.
  • Clear rules → deletes all created rules.

To delete one particular rule, click the cross icon in the top right corner of its line.
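The AND/OR filter logic described above amounts to plain predicate combination. Here is a minimal illustrative sketch in Python (not the tool’s actual implementation; the example rules and URLs are hypothetical):

```python
# Illustrative sketch of AND/OR filter logic over crawling rules.
# Each rule is a predicate: URL -> bool.

def matches(url, rules, logic="AND"):
    """Return True if the URL satisfies the rule set under the chosen logic."""
    if logic == "AND":
        return all(rule(url) for rule in rules)  # every rule must return True
    return any(rule(url) for rule in rules)      # OR: at least one rule matches

# Two hypothetical rules
rules = [
    lambda url: "blog" in url,
    lambda url: url.endswith(".html"),
]

print(matches("https://example.com/blog/post.html", rules, "AND"))  # True
print(matches("https://example.com/about", rules, "OR"))            # False
```

With AND, a URL must satisfy every rule; with OR, matching any single rule is enough.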

2. How to Set up Crawling Rules


The rule line contains two dropdown lists: one for choosing the action and one for choosing the rule condition.

Possible rule actions:

  • Include → the crawler will add URLs matching the conditions to the results table.

  • Exclude → URLs matching the conditions will be added to the ‘Skipped URLs’ table.

Possible rule conditions:

  • Contains → restricts crawling by the text content in a URL, for example, URLs of a specified category.

  • Exactly matching → for including or excluding a particular URL from the report.

  • Matching RegExp → allows you to include/exclude URLs using regular expressions, for instance, to get URLs of a certain depth.

  • Begins with → includes/excludes URLs that start with a set value.

  • Ends with → includes/excludes URLs that end with a set value.

  • Length → restricts crawling by the number of characters in a URL. You can use the equality character (=) or the inequality characters (<, >, ≤, ≥, ≠).
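The condition types above boil down to simple string checks. The sketch below models them in Python for illustration only (the function names mirror the UI labels; the example URL and pattern are made up, and this is not the tool’s actual code):

```python
import operator
import re

# Illustrative implementations of the rule conditions (not the tool's code).
def contains(url, value):
    return value in url

def exactly_matching(url, value):
    return url == value

def matching_regexp(url, pattern):
    return re.search(pattern, url) is not None

def begins_with(url, value):
    return url.startswith(value)

def ends_with(url, value):
    return url.endswith(value)

def length(url, op, n):
    # Compare the number of characters in the URL using =, <, >, <=, >=, !=
    ops = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
           "<=": operator.le, ">=": operator.ge, "!=": operator.ne}
    return ops[op](len(url), n)

url = "https://example.com/category/spider"
# A depth-style pattern: host followed by exactly two path segments
print(matching_regexp(url, r"^https://[^/]+/[^/]+/[^/]+$"))  # True
```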

3. Combination of Conditions and Settings

3.1. Netpeak Spider allows combining the described conditions with each other and with other settings. For instance, to crawl URLs containing the word ‘spider’ within a two-click depth from the initial URL, set the max crawling depth to 2 on the ‘Restrictions’ tab.

3.2. Go to the ‘Rules’ tab and configure the following conditions:

  • Include URLs starting with ‘‘.

  • Include URLs that contain ‘spider’.
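The combination of a crawling rule with the depth restriction from this section can be sketched as follows (a simplified model, not the tool’s implementation; the depth limit of 2 and the ‘spider’ value come from the article’s example, the URLs are hypothetical):

```python
MAX_DEPTH = 2  # 'Restrictions' tab: max crawling depth

def should_include(url, depth):
    # Depth limit combined (AND logic) with the 'contains spider' rule
    return depth <= MAX_DEPTH and "spider" in url

print(should_include("https://example.com/spider-guide", 2))  # True
print(should_include("https://example.com/spider-guide", 3))  # False: too deep
print(should_include("https://example.com/blog", 1))          # False: no 'spider'
```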
