Netpeak Spider 2.1.1.1: Global Changes in the Crawling Algorithm

Modified on Tue, 5 Dec, 2023 at 11:32 AM

The main purpose of this update is to show you the idea behind the design of Netpeak Spider: a program built on crawling algorithms similar to those used by search engines. We will talk about solutions to some major problems that programs and services may encounter, particularly those that automate SEO audits.

1. Goals

One of our long-standing goals is to replicate – as closely as possible – the ways search engine robots crawl websites. To achieve this we need to:

correctly process HTTP response headers
consider the tags in the <head> element
display reports in a convenient way
lay the basis for the algorithm that will correctly calculate the internal PageRank

2. Difficulties

Here are some problems we encountered on the way to our goal:

search engines do not disclose the methods they use to crawl your website → we cannot be sure which tags they find, how they process them, whether they pass link juice, and so on. Our program, by contrast, is supposed to provide you with this information.
the large number of settings makes the crawl hard to customize → in the crawling settings you can switch any parameter on or off, including indexation instructions. However, if you switch off a parameter that determines indexation, or more generally one that affects whether search engines can access your page, the program will not show any issues you might have on that page.

3. Solutions

To address these problems, we reworked the crawling algorithm at its core and added some new features that will help you understand the way Netpeak Spider works:

3.1. New Status Codes

We have changed the way we manage status codes and introduced unique marks for the most common cases. A mark is appended to the status code after the & symbol:

Disallowed → indicates that the URL is blocked in robots.txt
Canonicalized → the URL contains a canonical tag pointing to another URL (note that if the page has a canonical tag pointing to itself, no such mark will appear)
Refresh Redirected → the URL has a refresh tag (either in an HTTP response header or as a meta refresh in the <head> element) pointing to another URL (the mark will not be displayed if the refresh tag points to the page itself)
Noindex / Nofollow → the URL contains instructions that prevent robots from indexing the page and/or following its links (the instructions themselves are located in the HTTP response headers or in the <head> element)

As a result, from now on, if a page blocked by robots.txt returns a 200 (OK) status code, you will receive a ‘200 OK & Disallowed’ response.
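
The combined status can be thought of as the HTTP status line followed by any crawler marks, joined with ' & '. Here is a minimal Python sketch of that idea. The helper name `status_label` and the way several marks are combined are illustrative assumptions, not Netpeak Spider's actual implementation:

```python
from typing import List


def status_label(code: int, reason: str, marks: List[str]) -> str:
    """Hypothetical helper: join an HTTP status with crawler marks using ' & '."""
    return " & ".join([f"{code} {reason}", *marks])


# A page that answers 200 but is blocked in robots.txt.
print(status_label(200, "OK", ["Disallowed"]))
# A page without any mark keeps its plain status.
print(status_label(404, "Not Found", []))
```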

3.2. User Agent

Pay attention to the user agent you choose in the settings, since it is now applied to all robot directives:

robots.txt file → the highest level: if you block a URL in robots.txt for a specific bot (like Googlebot) and turn on the setting that accounts for robots.txt, Netpeak Spider won’t even attempt to access this URL. Be careful! This is a rather common mistake among SEO specialists: they place an important tag such as <meta name="robots" content="noindex, follow"> on a page and then disallow the page in robots.txt, which hides the tag from search engine robots.
X-Robots-Tag → mid-level: X-Robots-Tag is almost identical to the meta robots tag. The only difference is that it is sent in the HTTP response headers, which is a faster way to give search engine robots indexing instructions for your page.
Meta Robots → the lowest level: a standard tag found in the <head> element of the document.

Try the preset user agent templates for search engine robots or enter your own data in the ‘Custom Settings’.
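
To see why the chosen user agent matters at the robots.txt level, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the example.com URL and the `is_allowed` helper are made up for illustration, and Netpeak Spider's own matching rules may differ in detail:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only Googlebot is kept out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/private/page"
print(is_allowed(ROBOTS_TXT, "Googlebot", url))  # blocked for this bot
print(is_allowed(ROBOTS_TXT, "Bingbot", url))    # falls back to the * group
```

The same URL is blocked for one bot and open to another, so a crawl made with a different user agent can produce a different report.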

3.3. Advanced Settings

In the ‘Advanced’ tab of the crawling settings you can now choose which instructions to take into account:

Robots.txt Instructions → if a page is blocked by robots.txt and this option is turned on, you will not see this page in the report, since Netpeak Spider doesn’t add such pages to the pending URLs. The only exception is when crawling a list of URLs: in this mode the final report will show all such pages with an ‘& Disallowed’ mark.
Canonical Instructions → if a page has a canonical tag pointing to another URL, the results table will show both the source URL and the target URL.
Refresh → likewise, if a page has a refresh tag pointing to another URL, the results table will show both pages.
X-Robots-Tag Instructions → if a page carries a noindex instruction, the page will not be shown in the report. If there is a nofollow instruction, the program will not follow any links from the page. The only exception is, again, crawling a list of URLs, where the status code will have an ‘& Noindex / Nofollow’ mark.
Meta Robots Instructions → identical to X-Robots-Tag instructions.
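
Since X-Robots-Tag and meta robots carry the same kind of instructions in two different places, a crawler has to read both. The sketch below shows one way to collect them in Python with the standard `html.parser` module. The names `split_directives`, `MetaRobotsParser` and `robots_directives` are hypothetical, headers are assumed to be a plain dict keyed exactly by "X-Robots-Tag", and user-agent-specific prefixes inside the header are ignored for brevity:

```python
from html.parser import HTMLParser
from typing import Dict, Set


def split_directives(value: str) -> Set[str]:
    """Split 'noindex, nofollow' into a set of lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attr.get("content") or "")


def robots_directives(headers: Dict[str, str], html: str) -> Set[str]:
    """Merge directives from the response header and the HTML document."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return split_directives(headers.get("X-Robots-Tag", "")) | parser.directives


page = '<html><head><meta name="robots" content="nofollow"></head></html>'
print(sorted(robots_directives({"X-Robots-Tag": "noindex"}, page)))
```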

3.4. Canonical Chains

In previous versions of Netpeak Spider, we showed how we detect redirect chains. Now the time has come to cover one more important issue: canonical chains. This turned out to be rather complicated, so we designed a separate table for this data.

A canonical tag pointing to a page that itself points to another page confuses search engines a great deal. That is why, by default, the program already treats a two-link chain as an issue (this can be easily changed in the settings).
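
Following a canonical chain is similar to following redirects: start from a page and keep moving to its canonical target until the target points to itself, is unknown, or repeats. A minimal Python sketch under stated assumptions: the `canonicals` mapping, both function names and the reading of "two-link chain" as two hops (A → B → C) are illustrative, not the program's documented behaviour:

```python
from typing import Dict, List


def canonical_chain(start: str, canonicals: Dict[str, str], limit: int = 20) -> List[str]:
    """Follow canonical targets from `start`; stop on self-reference, loop or limit."""
    chain = [start]
    current = start
    while current in canonicals and canonicals[current] != current:
        current = canonicals[current]
        looped = current in chain
        chain.append(current)
        if looped or len(chain) > limit:
            break
    return chain


def is_chain_issue(chain: List[str], max_links: int = 2) -> bool:
    """Assumed rule: report the chain once it has `max_links` hops or more."""
    return len(chain) - 1 >= max_links


# Hypothetical crawl data: page URL → URL in its canonical tag.
canonicals = {"/a": "/b", "/b": "/c", "/c": "/c"}
chain = canonical_chain("/a", canonicals)
print(chain, is_chain_issue(chain))
```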

3.5. Other Improvements to the Algorithm

canonical and meta refresh links, as well as links with redirects, will be added to the pending URLs under any settings
outgoing links from pages with canonical and refresh tags are now filtered according to the rules of search engine robots → if you turn on the setting that accounts for these instructions, the page will have only one outgoing link: the one given in the canonical or refresh tag
the mechanics of excluding links have been changed entirely → indexation and link instructions are now applied in the following priority: the rules in crawling settings → robots.txt instructions → Canonical → Refresh → X-Robots-Tag → Meta Robots. This priority means that if a page has two tags, a canonical (pointing to another URL) and a meta robots tag, only the canonical will be taken into account because of its higher priority.
the mechanics behind receiving information from robots.txt has been optimized
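
The priority order described above amounts to picking the first applicable instruction from a fixed list. A short Python sketch, where the labels and the `winning_instruction` helper are illustrative assumptions rather than names used by the program:

```python
from typing import Optional, Set

# Priority order as listed in the text, highest first.
PRIORITY = [
    "crawling settings",
    "robots.txt",
    "canonical",
    "refresh",
    "x-robots-tag",
    "meta robots",
]


def winning_instruction(found: Set[str]) -> Optional[str]:
    """Return the highest-priority instruction present on a page, if any."""
    for name in PRIORITY:
        if name in found:
            return name
    return None


# A page with both a canonical tag (to another URL) and a meta robots tag.
print(winning_instruction({"meta robots", "canonical"}))
```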

4. Additional New Features and Improvements

4.1. Redirect Chains or Canonical Chains Contain Pages Blocked by Robots.txt

To solve this problem we introduced new issues. In the report they appear as:

Canonical Chain Blocked by Robots.txt
Redirect Chain Blocked by Robots.txt

These issues let you detect indexation problems that are difficult to find manually.

4.2. Base Tag

We have added a separate ‘Base Tag’ parameter and are now able to detect a new issue, ‘Bad URL Base Tag Format’. This tag often causes a lot of trouble for SEO specialists. For instance, relative links should not be used in a <base> tag: following such links in a browser will work as expected, but search engine robots will uncover dozens of duplicates and ‘trash’, nonexistent pages. Simply put, this issue is extremely difficult to find manually.
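
To illustrate how a relative <base> value can produce unexpected URLs, the Python sketch below resolves one with the standard `urllib.parse.urljoin`. The example.com URLs and the `is_absolute_base` check are assumptions for illustration. The exact rule behind ‘Bad URL Base Tag Format’ is not described here, and how a given robot resolves such a base may differ:

```python
from urllib.parse import urljoin, urlparse


def is_absolute_base(href: str) -> bool:
    """Assumed check: a well-formed base has an http(s) scheme and a host."""
    parts = urlparse(href.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


page_url = "https://example.com/shop/item/"
base_href = "shop/"  # relative value placed in <base href="...">

# The relative base is resolved against the page URL first,
# and every relative link on the page is then resolved against that result.
effective_base = urljoin(page_url, base_href)
link = urljoin(effective_base, "cart")

print(is_absolute_base(base_href))  # False
print(effective_base)               # https://example.com/shop/item/shop/
print(link)                         # https://example.com/shop/item/shop/cart
```

Because the effective base depends on the URL of each page, the same relative link resolves to a different address on every page, which is one way nonexistent URLs can appear in a crawl.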

4.3. Title, Description, Keywords Length

For quite some time, Netpeak Spider has been checking pages for issues with title and description tags that exceed or fall short of the optimum length. Only now have we added separate parameters that show the current length.

4.4. Order of Parameters

We have tweaked the order of parameters in the crawling settings (‘Parameters’ tab) as well as in the results table. We hope you’ll enjoy it!

In a nutshell

We are getting ready for our next update, which will include enhanced technology for calculating internal PageRank as accurately as possible. For this reason, we have changed the crawling algorithm entirely and introduced new status codes and issues that show, in a convenient way, how the program and search engine robots crawl your website. Stay tuned for further updates!
