How to Crawl Slow Websites

Modified on Mon, 09 Oct 2023 at 07:40 PM

  1. Selecting crawling mode and crawling area.
  2. Crawling speed settings.
  3. Automatic crawling pause and resuming.

1. Selecting crawling mode and crawling area

If it is not necessary to crawl the entire website, you can restrict the crawling area so the program does not put the website under prolonged load. There are several ways to do this:

  • Limit crawling to one category → enter the URL of the necessary category in the ‘Initial URL’ field and enable the ‘Crawl only in directory’ option. It can be found under ‘Settings → General’. Keep in mind that this mode requires an appropriate URL structure: the category URL and the URLs of its pages must begin with the same path, for example, website/category and website/category/first-item.

  • Limit crawling by using rules → this feature helps you focus only on pages that match certain rules, for example, pages whose URLs contain particular words.

2. Crawling speed settings 

To adjust the crawling speed to a slow website, use the settings under ‘Settings → General’:

  • Decrease the number of threads → set no more than 5 threads in the corresponding field. This reduces the number of concurrent connections and decreases the load on the website.

  • Set up a delay between requests → adjust the delay between the requests the crawler sends to the server in the corresponding field. The delay is applied per thread, so if the website is sensitive to high load, combine a delay with a minimum number of threads.

  • Increase response timeout → by default, Netpeak Spider waits 30,000 milliseconds (30 seconds) for a page response and moves on to the next page if no response arrives within this time. If you know in advance that page response times are long, increase the response timeout.
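Together, the three settings above amount to a polite fetching loop: a capped thread count, a per-thread delay, and a response timeout. The sketch below illustrates the idea with Python's standard library; the constants and function names are assumptions for the example, not Netpeak Spider's implementation:

```python
# Conceptual sketch (not Netpeak Spider internals): a polite crawl loop with a
# capped thread count, a per-thread delay between requests, and a response timeout.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_THREADS = 5          # 'Number of threads' — keep at 5 or fewer for slow sites
DELAY_SECONDS = 2.0      # 'Delay between requests' — applied in each thread
TIMEOUT_SECONDS = 30.0   # 'Response timeout' — raise this for slow servers

def fetch(url: str) -> int:
    """Fetch one page politely and return its HTTP status code."""
    time.sleep(DELAY_SECONDS)                      # delay before each request
    with urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
        return resp.status

def crawl(urls: list[str]) -> list[int]:
    # At most MAX_THREADS pages are requested concurrently.
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        return list(pool.map(fetch, urls))
```

Because the delay runs inside each thread, the effective request rate is roughly `MAX_THREADS / DELAY_SECONDS` requests per second, which is why the article recommends combining a delay with a low thread count.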

3. Automatic crawling pause and resuming

If you encounter the ‘429 Too Many Requests’ status code during crawling, we recommend the following steps:

  1. Go to ‘Settings → Advanced’ and tick the options in the ‘Pause crawling automatically’ section:

  • When website returns the ‘429 Too Many Requests‘ status code.
  • When the response timeout is exceeded.

  2. Decrease the number of threads.

  3. Change the settings according to the recommendations in the first paragraph of this article.

  4. Save the settings.

  5. Resume crawling if the error appeared early in the crawl, restart the crawl from scratch, or recrawl the specific pages that returned incorrect status codes.
