Custom HTTP Headers

Modified on Tue, 9 Jul, 2024 at 4:18 PM

In Netpeak Spider 3.6, we implemented support for custom HTTP headers to make the program more flexible to configure. This lets you solve more advanced tasks, such as checking SEO issues on websites that use web form authentication.

1. What this feature is for

1.1. Crawling websites requiring authentication

Custom HTTP headers allow you to crawl or scrape data from websites whose content is available only to authorized users.

1.2. Avoiding crawling protection

With custom HTTP headers, a web server is more likely to treat requests sent by Netpeak Spider as coming from a regular user rather than from an automated tool.

1.3. Getting dynamic versions of pages

This feature is necessary when you have to crawl a website that serves different source code depending on parameters in HTTP headers, such as the device, client, region, language, or screen resolution.

2. How to configure HTTP headers

To configure custom HTTP headers, go to ‘Settings’ → ‘HTTP headers’.

2.1. The ‘User-Agent’, ‘Accept’, and ‘Accept-Encoding’ fields can’t be changed. If you create another header with a similar name, the crawler will ignore it to avoid errors during crawling.

2.2. The ‘Add header’ button adds a new row with ‘Name’ and ‘Value’ fields and a ‘Delete’ button. You can enter your own header name and value. The number of headers you can add in the settings is not limited.

2.3. The ‘Clear all’ button removes all added headers except the first three, and the ‘Reset settings to default’ button clears all added headers and restores the standard list of headers.

2.4. You can save a set of added headers as a template using the corresponding button.
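
The rule from 2.1 can be pictured with a short sketch. This is only an illustrative model, not Netpeak Spider's actual code: the function name merge_headers, the sample header values, and the assumption that names are compared case-insensitively are all hypothetical.

```python
# Hypothetical model of rule 2.1; not the crawler's real implementation.
# Assumption: header names are compared case-insensitively.
RESERVED = {"user-agent", "accept", "accept-encoding"}


def merge_headers(defaults, custom):
    """Return the default headers plus custom ones, skipping reserved names."""
    merged = dict(defaults)
    for name, value in custom:
        clean_name = name.strip()
        if clean_name.lower() in RESERVED:
            # A custom header that clashes with a fixed field is ignored.
            continue
        merged[clean_name] = value
    return merged


# Made-up sample values, used only for illustration.
defaults = {
    "User-Agent": "ExampleBot/1.0",
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
}
custom = [("accept", "text/html"), ("X-Debug", "1")]
result = merge_headers(defaults, custom)
print(result)
```

In this sketch the custom ‘accept’ row is dropped, while ‘X-Debug’ is added alongside the three fixed fields.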

3. Use cases

3.1. Checking changes on a website with ‘If-Modified-Since’ header

1. Add a new header in the ‘HTTP headers’ settings in the following format: If-Modified-Since: <day-name>, <day> <month> <year> <hour>:<minute>:<second> GMT.

Example: If-Modified-Since: Wed, 01 Jan 2020 07:28:00 GMT

2. In the ‘User agent’ settings, choose the user agent that will be sent in the request headers to the web server of the crawled website.

3. Enter an initial URL and hit the ‘Start’ button.
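
If you prefer to generate the header value rather than type it by hand, the following minimal sketch builds it with Python's standard library. The function name if_modified_since is an assumption made for this example. The output uses a zero-padded day, as the HTTP date format specifies.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def if_modified_since(moment):
    """Format a timezone-aware datetime as an HTTP date ending in GMT."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # usegmt=True requires a UTC datetime, so convert first.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


value = if_modified_since(datetime(2020, 1, 1, 7, 28, 0, tzinfo=timezone.utc))
print("If-Modified-Since:", value)
# prints: If-Modified-Since: Wed, 01 Jan 2020 07:28:00 GMT
```

Paste the printed value into the ‘Value’ field of the new header.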

What is it for?

If a page returns a 200 status code during crawling, it has been changed since the date specified in the ‘If-Modified-Since’ header. Otherwise, the page should return a 304 status code.
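
As a rough illustration of how to read the results, the sketch below groups crawled URLs by status code. The function name, the URLs, and the status codes are made-up sample data, not output from Netpeak Spider.

```python
def interpret_status(status):
    """Map a response status code to its meaning for an If-Modified-Since check."""
    if status == 200:
        return "modified"      # changed since the date in the header
    if status == 304:
        return "not modified"  # unchanged since the date in the header
    return "other"             # redirects, errors, and so on need a separate look


# Hypothetical crawl results: (URL, status code).
crawl_results = [
    ("https://example.com/", 200),
    ("https://example.com/about", 304),
    ("https://example.com/old-page", 404),
]

report = {url: interpret_status(status) for url, status in crawl_results}
for url, verdict in report.items():
    print(url, "->", verdict)
```

Note that a server which ignores conditional requests returns 200 for every page, so a report with no 304 responses at all may mean the site does not support this header.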

3.2. Crawling locale-adaptive content

To analyze locale-adaptive content, you can set any language and region value in headers such as ‘Accept-Language’, ‘Cookie’, or ‘Referer’, or in any other header with a unique name.
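
For example, an ‘Accept-Language’ value lists locales in order of preference, with optional quality weights. The sketch below composes such a value; the function name accept_language and the simple weighting scheme (0.1 lower for each following locale) are assumptions made for illustration.

```python
def accept_language(locales):
    """Build an Accept-Language value from locales ordered by preference."""
    parts = []
    for index, locale in enumerate(locales):
        if index == 0:
            # The first locale gets the implicit default weight of 1.
            parts.append(locale)
        else:
            # Assumed scheme: lower the weight by 0.1 per position, minimum 0.1.
            weight = max(1.0 - 0.1 * index, 0.1)
            parts.append(f"{locale};q={weight:.1f}")
    return ",".join(parts)


header_value = accept_language(["de-DE", "de", "en"])
print("Accept-Language:", header_value)
# prints: Accept-Language: de-DE,de;q=0.9,en;q=0.8
```

Add the result as a header named ‘Accept-Language’ to request the German version of a page, falling back to English.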