
What is Web Scraping?

The term web scraping describes the automated copying of content from a website. In addition to legal and welcome forms of web scraping, such as the indexing search engines perform on websites, there are also harmful and abusive methods. Attackers use the technique, for instance, to copy the entire content of a website and publish it on another site, which can have consequences detrimental to the affected company's business.

How Web Scraping Works

01

A definition of web scraping

Web scraping, also known as screen scraping, generally refers to the process of extracting, copying, saving and reusing third-party content on the internet. In addition to manual scraping, where content is copied by hand, a number of tools for the automated copying of websites have also become established. A positive use case of web scraping is the indexing of websites by Google or other search engines. In most cases, this indexing is welcome, since it is the only way for users to find the company pages they are looking for on the internet. Malicious screen scraping intended to unlawfully misappropriate intellectual property, on the other hand, violates copyright law and is therefore illegal.

02

How does web scraping work?

A variety of technologies and tools are employed in web scraping:

Manual scraping

Both content and parts of the source code of websites are sometimes copied by hand. Online criminals resort to this method in particular when bots and other scraping programs are blocked, for example through rules in the robots.txt file.

Software tools

Web scraping tools such as Scraper API, ScrapeSimple and Octoparse enable the creation of web scrapers even by those with little to no programming knowledge. Developers also use these tools as the basis for developing their own scraping solutions.

Text pattern matching

Automated matching and extraction of information from web pages can also be accomplished by using commands in programming languages such as Perl or Python.
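As a minimal sketch of text pattern matching in Python, a regular expression can pull structured values out of raw page markup; the HTML snippet and the price pattern below are invented for illustration:

```python
import re

# Hypothetical fragment of a product page (markup invented for this example).
html = """
<div class="product"><span class="price">19.99 EUR</span></div>
<div class="product"><span class="price">4.50 EUR</span></div>
"""

# Pattern matching: extract every price from the raw HTML text.
prices = re.findall(r'<span class="price">([\d.]+) EUR</span>', html)
print(prices)  # ['19.99', '4.50']
```

Regex-based extraction like this is fragile: it breaks as soon as the page layout changes, which is one reason dedicated HTML parsers (see below) are usually preferred.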

HTTP manipulation

Content can be copied from static or dynamic websites using HTTP requests.
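A hypothetical example of how such an HTTP request is assembled with Python's standard library; the URL and the User-Agent string are placeholders. Scrapers often set a browser-like User-Agent so their traffic blends in with normal visitors:

```python
import urllib.request

# Build an HTTP GET request with a spoofed, browser-like User-Agent.
# The URL and header value are placeholders for illustration.
req = urllib.request.Request(
    "https://example.com/products?page=1",
    headers={"User-Agent": "Mozilla/5.0 (compatible; demo)"},
)

print(req.full_url)
print(req.get_header("User-agent"))
# Actually sending it would be: urllib.request.urlopen(req).read()
```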

Data mining

Data mining can also be used for web scraping. To do this, web developers rely on an analysis of the templates and scripts in which website content is embedded. They identify the content they are looking for and use a “wrapper” to display it on their own site.

HTML parser

HTML parsers like those built into browsers are used in web scraping to parse web pages and extract the desired content.
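A minimal sketch using Python's built-in `html.parser` module; the page fragment is invented for illustration. Instead of matching raw text, the parser walks the document's tag structure:

```python
from html.parser import HTMLParser

class HeadingScraper(HTMLParser):
    """Collects the text of every <h2> heading on a page."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

# Hypothetical page fragment for illustration.
page = "<h1>Shop</h1><h2>Laptop X</h2><p>999 EUR</p><h2>Phone Y</h2>"
scraper = HeadingScraper()
scraper.feed(page)
print(scraper.headings)  # ['Laptop X', 'Phone Y']
```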

Copying microformats

Microformats are a frequently used component of websites. They may contain metadata or semantic annotations. Extracting this data enables conclusions to be drawn about the location of special data snippets.
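A sketch of how microformat properties might be harvested, assuming h-card style class names (`p-name`, `u-email`); the markup below is a hypothetical example:

```python
from html.parser import HTMLParser

class MicroformatScraper(HTMLParser):
    """Extracts microformat property values (p-* and u-* class names)."""

    def __init__(self):
        super().__init__()
        self.current = None   # property name of the currently open tag
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for c in classes:
            if c.startswith(("p-", "u-")):  # microformat property prefixes
                self.current = c

    def handle_data(self, data):
        if self.current:
            self.properties[self.current] = data.strip()
            self.current = None

# Hypothetical h-card markup for illustration.
html = ('<div class="h-card"><span class="p-name">Jane Doe</span>'
        '<span class="u-email">jane@example.com</span></div>')
scraper = MicroformatScraper()
scraper.feed(html)
print(scraper.properties)
# {'p-name': 'Jane Doe', 'u-email': 'jane@example.com'}
```

This illustrates why published contact data in semantic markup is easy prey for scrapers: the class names announce exactly where names and e-mail addresses live.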

03

Use and areas of application

Web scraping is employed in many different areas. It is always used for data extraction, often for completely legitimate purposes, but misuse is also common practice.

Search engine web crawlers

Indexing websites is the basis for how search engines like Google and Bing work. Only by using web crawlers, which analyze and index URLs, is it possible to sort and present search results.

Web crawlers are “bots”, which are automated programs that perform defined and repetitive tasks.
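The behavior of such a crawler bot can be sketched as a breadth-first traversal of links; the "web" below is an in-memory stand-in for real pages, so no HTTP requests are made:

```python
from collections import deque

# Invented link graph standing in for real web pages.
PAGES = {
    "/":           ["/products", "/about"],
    "/products":   ["/products/1", "/products/2"],
    "/about":      [],
    "/products/1": [],
    "/products/2": ["/"],
}

def crawl(start):
    """Breadth-first crawl: visit each URL exactly once, following its links."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)  # a real bot would fetch and index the page here
        for link in PAGES[url]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))
# ['/', '/products', '/about', '/products/1', '/products/2']
```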

Replacement for web services

Screen scrapers can serve as a replacement for web services. This is of particular interest to companies that want to provide their customers with specific analytical data on a website: operating a dedicated web service for this purpose is expensive, so a screen scraper that extracts the required data is often the more cost-effective option.

Remixing

Remixing or mashup combines content from different web services. The result is a new service. Remixing is often done via interfaces, but if no such APIs are available, the screen scraping technique is also used here.


Misuse

The misuse of web scraping or web harvesting can have a variety of objectives:

  • Price grabbing: One special form of web scraping is called price grabbing. Here, retailers use bots to extract a competitor’s product prices in order to intentionally undercut them and acquire customers. Thanks to the great transparency of prices on the internet, customers quickly end up going to the next-cheapest retailer, which results in greater price pressure.

  • Content/product grabbing: In content grabbing, bots target the content of a website rather than prices or price structures. Attackers copy intricately designed product pages in online shops true to the original and use the expensively created content for their own e-commerce portals. Online marketplaces, job exchanges and classified ads are also popular targets of content grabbing.

  • Increased loading times: Web scraping wastes valuable server capacity: large numbers of bots constantly refresh product pages in search of new pricing information. This results in slower loading times for human users, especially during peak periods. If the desired web content takes too long to load, customers quickly turn to the competition.

  • Phishing: Cyber criminals use web scraping to grab email addresses published on the internet and use them for phishing. In addition, criminals can recreate the original site for phishing activities by making a deceptively realistic copy of the original site.

04

How can companies block web scraping?

There are several measures companies can take to protect a website against scraping:

  • Bot management: Using bot management solutions, companies are able to fine-tune which bots are allowed to access information on the website and which to treat as malware.

  • robots.txt: Using the robots.txt file, site operators can specify which areas of the domain may be crawled and exclude specific bots from the outset.

  • Captcha prompts: Captcha prompts can also be integrated into websites to provide protection against bot requests.

  • Appropriate integration of telephone numbers and e-mail addresses: Site operators protect contact data from scraping by putting the information behind a contact form. The data can also be embedded via CSS, which makes it harder for bots to harvest as plain text.

  • Firewall: Strict firewall rules for web servers also protect against unwanted scraping attacks.
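The robots.txt mechanism from the list above can be tried out with Python's standard `urllib.robotparser`; the rules and bot names below are hypothetical. Note that robots.txt is only advisory: well-behaved crawlers honor it, but malicious scrapers are free to ignore it.

```python
import urllib.robotparser

# Hypothetical robots.txt: ban one scraper entirely, and keep /private/
# off limits for everyone else.
rules = """\
User-agent: BadScraperBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("BadScraperBot", "https://example.com/products"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/products"))      # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))     # False
```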


05

Scraping as spam

In many cases, websites with scraped content that do not cite the source are in violation of copyright law. In addition, search engines like Google classify them as spam. For websites with original content, these spam sites also pose a risk, because when in doubt, search engines could consider the legitimate website as duplicate content and penalize it accordingly. This results in a considerably lower SEO ranking. In order to proactively combat web scraping, companies and webmasters use special Google Alerts, for example, which provide information about suspicious content on the internet.

06

Legal framework: is screen scraping legal?

Many forms of web scraping are legally permissible. This applies, for example, to online portals that compare the prices of different retailers. A 2014 ruling by the German Federal Court of Justice clarified this: as long as no technical protection designed to prevent screen scraping is circumvented, there is no anti-competitive obstruction.

However, web scraping becomes a problem when it infringes copyright law. Anyone who integrates copyrightable material into a website without citing the source is therefore acting unlawfully.

In addition, if web scraping is misused, for phishing for example, the scraping itself may not be unlawful, but the activities that follow may be.


07

Web Scraping: Things you need to know

Web scraping is an integral part of the modern internet. Many popular services such as search engines or price comparison portals would not be possible without the automated retrieval of information from websites. Abuse of it, however, also poses serious risks for companies, such as unscrupulous competitors extracting expensively produced content from an online shop and copying it to their own. The load from traffic caused by bots acting autonomously is also not negligible. Currently, bots generate about half of website traffic. That is why effective bot management is a crucial factor in protecting a company website from scraping attacks.
