How to Crawl a Website Using Screaming Frog and a Headless Browser

What is a headless browser?

A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but are executed via a command line interface or using network communication. They are particularly useful for testing web pages as they are able to render and understand HTML the same way a browser would, including styling elements such as page layout, colour, font selection and execution of JavaScript and AJAX which are usually not available when using other testing methods. Google stated in 2009 that using a headless browser could help their search engine index content from websites that use AJAX.

Headless browsers are often used to test web pages en masse for quality control or to extract data. They are significant because they understand web pages as a browser would, with the caveat that browsers all (annoyingly) behave slightly differently. Headless browsers should, for example, be able to parse JavaScript. They can click on links and even cope with downloads. This is a classic example of software providing data to another piece of software without a GUI being necessary.
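As a rough illustration of the "extract data" step, the sketch below pulls link targets out of HTML that a headless browser has already rendered. It assumes the rendered page is available as a string; the names LinkExtractor and extract_links are hypothetical and chosen only for this example, and it uses nothing beyond Python's standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(rendered_html, base_url):
    """Return every link found in the rendered HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(rendered_html)
    return parser.links
```

Because the input is the page as rendered, links injected by JavaScript would be picked up too, which a plain HTTP fetch of the same URL might miss.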

Almost all headless browsers can be controlled via an API or a console, and developers have used them to automate their testing processes, take screenshots of what’s rendered inside them, and retrieve the content of web pages to supply it to other, more complex software.
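To make the console-driven workflow concrete, here is a minimal sketch that drives a Chromium-based browser in headless mode from Python. Note the assumptions: this article does not prescribe this tool, the binary name "chromium" varies by system (it may be "google-chrome" or a full path), and the function names are hypothetical. The --headless, --dump-dom and --screenshot switches are Chrome's own command-line flags.

```python
import subprocess


def build_headless_command(url, binary="chromium", screenshot_path=None):
    """Return the argument list for one headless run against `url`.

    `binary` is an assumption: use whatever name or path your system has.
    """
    command = [binary, "--headless", "--disable-gpu"]
    if screenshot_path is not None:
        # Save a PNG of the rendered page instead of printing the DOM.
        command.append(f"--screenshot={screenshot_path}")
    else:
        # Print the DOM, after JavaScript has run, to standard output.
        command.append("--dump-dom")
    command.append(url)
    return command


def fetch_rendered_html(url, binary="chromium", timeout=60):
    """Run the browser once and return the rendered HTML as a string."""
    result = subprocess.run(
        build_headless_command(url, binary=binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the browser exits with an error
    )
    return result.stdout
```

Calling fetch_rendered_html("https://example.com") would hand the rendered markup to whatever software consumes it next, while passing screenshot_path to build_headless_command covers the screenshot use case.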

Types of Headless Browser Available

Most successful headless browser projects nowadays are based on a real-life browser engine, the most widespread being PhantomJS, based on the WebKit engine used by Safari and, formerly, Chrome.

Similarly, there’s TrifleJS, originally built on Internet Explorer’s former Trident engine and now ported to Google’s V8; SlimerJS, on Firefox’s Gecko engine; Awesomium, based on Chromium; and HtmlUnit, built on Mozilla’s Rhino engine.

As for headless browsers built on custom browser engines, there’s Twill, but the project has been inactive for 8 years now. This seems to be a problem with all similar projects: a browser engine needs a huge amount of work and maintenance, so when resources and time run short the project hits a wall and is eventually abandoned.

For this reason we always recommend choosing a headless browser built on a known browser engine, since there’s a high chance that many people contribute to it and that in-depth documentation and tutorials are readily available.

P.S. You should check Asad Dhamani’s list of headless browsers and their derived technologies.

Malicious Use

Of course, this technology has proven highly useful to hackers as well: its automation features allow them to create and launch complex, yet fully controllable, attacks on various websites and web services.

Headless browsers are known to have been used in DDoS attacks, brute-force attacks, and schemes that falsely increase ad revenue by faking page loads and user interactions.

 

Sources:

  • Wikipedia
  • Andrew Girdwood
  • http://codebyte.in/seo-of-one-page-applications-built-with-javascript-frameworks/