How to Crawl Slow Websites

Modified on Mon, 15 Jul, 2024 at 5:34 PM

Selecting crawling mode and crawling area.
Crawling speed settings.
Automatic crawling pause and resuming.

1. Selecting crawling mode and crawling area

If you do not need to crawl the entire website, you can restrict the crawling area so that the program does not put a prolonged load on the website. There are several ways to do this:

Limit crawling to one category → enter the URL of the required category in the ‘Initial URL’ field and enable the ‘Crawl only in directory’ option, which can be found under ‘Settings → General’. Keep in mind that this mode only works if the category has an appropriate URL structure, where the URLs of the category and its pages begin with the same path. For example: website.com/category and website.com/category/first-item.

Limit crawling by using rules → this feature helps you focus only on pages that match certain rules, for example pages whose URLs contain particular words.
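Both restrictions can be pictured as simple URL filters. The Python sketch below only illustrates the idea and is not how Netpeak Spider implements these options. The function names `in_directory` and `contains_any`, the case-insensitive word match, and the sample URLs beyond the example above are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def in_directory(initial_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url lies in the directory of initial_url."""
    start = urlsplit(initial_url)
    cand = urlsplit(candidate_url)
    if start.netloc.lower() != cand.netloc.lower():
        return False  # a different host is always out of scope
    base = start.path.rstrip("/")
    # Accept the directory itself or anything below it.
    # Assumption of this sketch: /category-archive is NOT inside /category.
    return cand.path.rstrip("/") == base or cand.path.startswith(base + "/")


def contains_any(url: str, words: list[str]) -> bool:
    """Return True if the URL contains at least one of the given words."""
    lowered = url.lower()
    return any(word.lower() in lowered for word in words)


initial = "https://website.com/category"
found = [
    "https://website.com/category",
    "https://website.com/category/first-item",
    "https://website.com/category-archive/old-item",
    "https://website.com/blog/post",
]

# Way 1: keep only URLs inside the directory of the initial URL.
in_scope = [u for u in found if in_directory(initial, u)]

# Way 2: keep only URLs that match a rule (here: the URL contains "item").
by_rule = [u for u in found if contains_any(u, ["item"])]
```

In this sketch, `in_scope` keeps the category and its first item, while `by_rule` keeps every URL containing "item" regardless of its directory.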

2. Crawling speed settings

To adapt the crawling speed to the low performance of the crawled website, use the following settings under ‘Settings → General’:

Decrease the number of threads → set no more than 5 threads in the corresponding field. This reduces the number of concurrent connections and decreases the load on the website.

Set up a delay between requests → in the corresponding field, set the delay between requests that the crawler sends to the server. The delay is applied to each thread, so if the website is sensitive to high load, combine a delay with a minimum number of threads.

Increase the response timeout → by default, Netpeak Spider waits 30,000 milliseconds (30 seconds) for a page response and moves on to the next page if no response arrives within this time. If you know in advance that the website responds slowly, you can increase the response timeout.
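To see how the three settings interact, here is a minimal Python sketch of a crawler that uses a thread limit, a per-thread delay, and a response timeout. It is an illustration under stated assumptions, not Netpeak Spider's code: the `CrawlSettings` class, the 1,000 ms example delay, and the assumption that each thread waits for the delay after every request are all hypothetical.

```python
import socket
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class CrawlSettings:
    """Hypothetical settings object, not Netpeak Spider's configuration format."""
    threads: int = 5          # upper bound suggested above for slow websites
    delay_ms: int = 1000      # example value: pause of each thread after a request
    timeout_ms: int = 30_000  # default response timeout mentioned above

    def max_requests_per_second(self, avg_response_ms: float = 0.0) -> float:
        # Each thread spends (response time + delay) on one request, so the
        # total rate is bounded by threads / that per-request time.
        per_request_ms = self.delay_ms + avg_response_ms
        if per_request_ms <= 0:
            return float("inf")
        return self.threads * 1000 / per_request_ms


def fetch_status(url: str, settings: CrawlSettings):
    """Return the HTTP status code, or None if the response timeout was exceeded."""
    try:
        with urllib.request.urlopen(url, timeout=settings.timeout_ms / 1000) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx answers still count as a response
    except socket.timeout:
        return None  # timed out while waiting for the response
    except urllib.error.URLError as err:
        if isinstance(err.reason, socket.timeout):
            return None  # timed out while connecting
        raise


def crawl(urls, settings: CrawlSettings) -> dict:
    """Fetch all URLs with at most settings.threads concurrent connections."""
    def worker(url):
        status = fetch_status(url, settings)
        time.sleep(settings.delay_ms / 1000)  # the delay applies to each thread
        return url, status

    with ThreadPoolExecutor(max_workers=settings.threads) as pool:
        return dict(pool.map(worker, urls))
```

Under these assumptions, 5 threads with a 1,000 ms delay send at most 5 requests per second, while 1 thread with a 2,000 ms delay and a 500 ms average response time sends at most 0.4 requests per second. This is why the delay is most effective when combined with a small number of threads.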

3. Automatic crawling pause and resuming

If you encounter the ‘429 Too Many Requests’ status code during crawling, we recommend taking the following steps:

Go to ‘Settings → Advanced’ and tick the following options in the ‘Pause crawling automatically’ section:

When the website returns the ‘429 Too Many Requests’ status code.
When the response timeout is exceeded.

Decrease the number of threads.
Change the settings according to the recommendations in section 1 of this article.
Save settings.
Then do one of the following: continue crawling if the error appeared at the beginning, restart crawling, or recrawl only the pages that returned incorrect status codes.
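The pause-and-resume behaviour can be sketched as a loop that stops sending requests when a page returns 429 or exceeds the response timeout, waits, and then tries the same page again. The Python example below is a simplified illustration, not Netpeak Spider's logic: the function `crawl_with_auto_pause`, the fixed pause length, and the retry limit are hypothetical, and `fetch` stands for any function that returns a status code, or `None` when the timeout is exceeded.

```python
import time
from typing import Callable, Dict, Iterable, Optional

TOO_MANY_REQUESTS = 429


def crawl_with_auto_pause(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]],
    pause_seconds: float = 60.0,  # hypothetical pause length
    max_retries: int = 3,         # hypothetical retry limit per page
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[int]]:
    """Crawl urls one by one, pausing after a 429 or a timeout (None)."""
    results: Dict[str, Optional[int]] = {}
    for url in urls:
        status = None
        for attempt in range(max_retries + 1):
            status = fetch(url)
            if status is not None and status != TOO_MANY_REQUESTS:
                break  # normal response: move on to the next page
            if attempt < max_retries:
                sleep(pause_seconds)  # pause, then resume with the same page
        results[url] = status  # may still be 429 or None after all retries
    return results


def pages_to_recrawl(results: Dict[str, Optional[int]]) -> list:
    """List pages that still ended with 429 or a timeout."""
    return [u for u, s in results.items() if s is None or s == TOO_MANY_REQUESTS]
```

Pages returned by `pages_to_recrawl` correspond to the pages you would recrawl after lowering the number of threads or increasing the delay.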