
How to Do Python in Web Scraping: Practical Tutorial 2025

Learn how to do web scraping in Python with this practical 2025 guide. Discover top libraries, anti-scraping techniques, and legal considerations.


Live Proxies Editorial Team


11 February 2025

Data is the fuel of the internet in 2025. Whether it’s AI models training on vast datasets or businesses analyzing trends, web scraping has become an essential tool for accessing structured information from websites. Websites, apps, and widgets are shiny UI wrappers displaying massive data sets for users. Data has been the new gold since the early tech startup boom in 2008, and the value of data has only increased with the rise of AI and LLMs.

These products need the best and latest data to return good results to users. Companies have embraced web scraping to get this data. From articles and forum posts to tutorials and videos, if it has data value, it can be scraped.

A crucial skill for any developer in 2025 is web scraping. Knowing how to do it well and ethically matters for engineers at established companies and startups alike. There are many languages you can use for web scraping, but one that stands out is Python. In this article, we will examine how to write web scraping scripts in Python, the best libraries for the job, how to overcome anti-scraping measures, and the legal implications of web scraping.

What is Web Scraping in Python?

Before we can start writing scripts, we need to define what web scraping is and why we would want to use it. So much information is available to us on web pages. Rich data exists in many forms, from product listing pages to stock reports. But how can we access this data?

If you ask any data scientist or analyst what the most difficult part of their job is, it is not analyzing the data. The most difficult part is getting and cleaning the data. Web scraping allows developers to do both.

Web scraping is simply the extraction of data from a website. This could be as simple as retrieving the title text from an article, or as complicated as paging through shareholder documents to grab financial information for specific quarters.

[Image: Web scraping in Python]

There are endless applications for web scraping including:

  • Product Data Retrieval: Scraping product pages to get data including images, descriptions, and reviews
  • Anti-Counterfeiting: Scraping the web to look for counterfeit goods or competitors using your products
  • Lead Generation: Scraping contact forms and websites to create leads for your goods or services
  • Content Review: Scrape articles, videos, and audio to verify that sponsored content aligns with advertisers' messaging and target market
  • Customer Sentiment: Scrape blog posts and social media posts to understand reactions to new products, services, or company news
  • Trade Platform Data: Scrape companies' websites to get alternative data for financial platforms and institutions

The applications are endless, but the best tools are well-known. In the coming sections, we will describe why Python works well for web scraping.

Why Use Python for Web Scraping?

You may be wondering why you would choose Python over other languages when web scraping. Python is a forgiving coding language with many web scraping libraries and extensive community support.

Diving deeper into the reasons:

  • Dynamically Typed: Python is a dynamically typed language. Unlike statically typed languages like Java, Python does not enforce variable types, which makes development faster.
  • Many Packages: Python has a lot of web scraping packages. Your goal with building a web scraper is to implement a solution not build every tool. Python offers a wide variety of packages to speed up your development.
  • Community Support: Python has extensive community support. Because Python is seen as a great web scraping language, the community has created many video and written tutorials and documentation for common issues.

You may say this all sounds great, but why not choose another programming language like JavaScript or R? Here’s a breakdown of Python, JavaScript and R that should clear up why Python is the best choice:

| Feature | Python | JavaScript | R |
|---|---|---|---|
| Ease of Use | ✅ Easy-to-read, beginner-friendly | ❌ Requires more setup | ❌ Not beginner-friendly |
| Scraping Speed | ✅ Fast with Scrapy | ✅ Fast with Puppeteer | ❌ Slower |
| Best For | ✅ General web scraping | ✅ JavaScript-heavy scraping | ❌ Statistical data scraping |

Whether you are just starting to scrape or looking to move to a more stable language, Python is an excellent choice.

How to Set Up Your Python Environment for Web Scraping

Every developer knows the key to quick, good development is a great environment. Having the right version of your language to support all features, being able to manage external libraries, and having an editor to improve your code is crucial.

Picking out the right tools can be hard. There are many new package managers, IDEs, and extensions trying to disrupt existing products. Let's discuss tools you should use in 2025 to improve your web scraping.

Step 1: Download Language Binary

Your language binary is the runtime that executes code in your chosen language. In our case, we need a version of Python. We can download a copy of Python from the official downloads page (https://www.python.org/downloads/), which should look like this:

[Image: Python downloads page]

Step 2: Download an Integrated Development Environment (IDE)

Once you have downloaded your copy of Python, it is time to download an IDE. An IDE, or integrated development environment, is software that makes writing and managing your code easier.

IDEs offer features like file diagrams, syntax highlighting, local servers and terminals, and many plugins for easier development. Popular IDEs include VSCode, PyCharm, and Cursor. We will use VSCode.

To download VSCode, navigate to the download page (https://code.visualstudio.com/download). You should see the following page:

[Image: VSCode download page]

Step 3: Package Managers

Popular coding languages like Java, Python, and JavaScript have extensive third-party libraries. These libraries, both open-source and proprietary, offer prebuilt solutions for common coding needs.

We will make extensive use of the libraries while web scraping. To do so we will need a package manager. This is software that helps retrieve, update, and delete software from our setup.

We will be using pip, which is included with the Python installation from earlier.

Step 4: Python Libraries and Packages

We have our language binary: Python, a place to code in: VSCode, and software to manage our libraries: Pip. Now we need to download some libraries specifically for web scraping.

There are many libraries available to us. With Pip they are all a single command line prompt away from us. We will need to use the Beautiful Soup library while web scraping. Let's download it.

To download Beautiful Soup, we will use Pip. Open your VSCode integrated terminal and enter the command shown below:

// Command to install Beautiful Soup

pip install beautifulsoup4

Once you hit enter, pip will contact the Python Package Index and install the latest version of Beautiful Soup into your environment. Now, your environment is fully set up and ready to start scraping.

Which Tools Are Best for Web Scraping in Python?

One of the reasons Python is such a great language for web scraping is its extensive set of libraries. Whether scraping a small static site, crawling sprawling web ecosystems, or working with heavily JavaScript-rendered pages, there is a library for you.

Three of the most popular libraries for web scraping in Python are Beautiful Soup, Scrapy, and Selenium. We will go through each in more detail, but below is a table that outlines some of the strengths and limitations of each library:

| Feature | Beautiful Soup | Scrapy | Selenium |
|---|---|---|---|
| Use Case | Simple web scraping | Large-scale scraping | Interacting with dynamic content |
| Ease of Use | ✅ Beginner-friendly | ❌ Steeper learning curve | ✅ Easy for automation but can be heavy for just scraping |
| Speed | ❌ Slower | ✅ Fast and efficient | ❌ Slower |
| Async Support | ❌ No native async support | ✅ Built-in asynchronous support | ❌ No native async support |
| Dynamic Content | 🟠 Limited | 🟠 Limited | ✅ Fully supports dynamic content |
| Performance | 🟠 Medium, for small to medium projects | ✅ High, ideal for large-scale crawlers | ❌ Low, not ideal for large-scale scraping |
| Headless Browsing | ❌ No | ❌ No | ✅ Yes, supports headless browser modes (e.g., Chrome, Firefox) |

Library Option 1: Beautiful Soup

Beautiful Soup is popular for two reasons: it is simple to use and lightweight. Because you are interacting with static HTML documents, you do not have to worry about client-side rendered content or crawling between web pages.

Instead, you can solely focus on how you interact with the DOM. This library has limitations as you do not have features like page-to-page crawling or proxy services.
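To give a feel for the library, here is a minimal, hedged sketch of parsing a small HTML snippet with Beautiful Soup; the snippet and selectors below are made up purely for demonstration:

// Python code for a minimal Beautiful Soup example

from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for demonstration
html = """
<html>
  <body>
    <h1>Example Title</h1>
    <p class="intro">First paragraph of the page.</p>
    <a href="https://example.com/next">Next page</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab the heading text, a paragraph by class, and a link's href attribute
print(soup.h1.get_text())
print(soup.find("p", class_="intro").get_text())
print(soup.find("a")["href"])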

Library Option 2: Scrapy

Scrapy can be thought of as Beautiful Soup plus much more. It has a steeper learning curve due to its fuller feature set. Scrapy can parse static pages like Beautiful Soup does, but it also supports proxies and page-to-page crawling.

Scrapy does not render JavaScript on its own, so JavaScript-heavy websites need extra tooling. It can also be hard to pick up because it is more of a framework than a simple library like Beautiful Soup.
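For context, a bare-bones Scrapy spider looks something like the sketch below. The quotes.toscrape.com practice site and its CSS selectors are used here only as an illustrative example:

// Python code for a minimal Scrapy spider (illustrative)

import scrapy

class QuotesSpider(scrapy.Spider):
    # The target site and selectors below are illustrative, not from this article's examples
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow pagination links, if present, to crawl page to page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

You would run a spider like this with "scrapy runspider quotes_spider.py -o quotes.json", letting Scrapy handle scheduling, retries, and output for you.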

Library Option 3: Selenium

If you need your web scraper to act like a human, then Selenium is the library for you. Selenium excels at appearing human through its browser automation-based logic.

With Selenium, you can mimic page clicks and scrolls to get past interactions and anti-crawling precautions that stop simpler scrapers (though CAPTCHAs are specifically designed to resist automation).

The downsides to Selenium have to do with its setup. To use Selenium, you have to download the library, a web driver, and a browser for automation. Selenium can also be easier to detect, since automated browsers expose signals (such as the navigator.webdriver flag) that tell websites the traffic is coming from an automated script.

How to Scrape Websites Using Python

Scraping websites with Python comes down to a few common steps. There are some wrinkles introduced by custom services, authentication, different rendering cycles, and anti-scraping tools but the majority of web scraping is:

  1. Researching and understanding your target website.
  2. Fetching the webpages and content from the website.
  3. Parsing the data from the returned payloads.
  4. Cleaning and storing the data for later use.

Hopefully, your target is consistent in its page structure, rendering type, and HTML class and ID naming. Websites are living documents, and updates to existing themes or new sections added later can explode the complexity of your web scraping.

[Image: The steps of web scraping]

In the coming sections, we will look at how to handle each step of web scraping with practical code examples to get you started immediately.

Understanding Website Structure

Many technologies and services in tech borrow real-world terms to explain code-based concepts. You have bookmarks to save webpages that you frequently use. You browse an online store by looking over product pages. You store information on webpages similar to how you store information on pages in a book.

A website can be understood as a sort of digital book. It is a collection of webpages that contain content related to the book's theme. There is a sitemap, or index, that documents all of the different sections of the website.

To better understand the technology behind webpages, we need to understand HTML documents. HTML, which stands for HyperText Markup Language, is a markup language that lets developers structure content and attach extra information to display rich experiences to users.

Each webpage is an HTML document, which consists of a head and a body section:

[Image: Basic structure of an HTML document]

(src: https://stuyhsdesign.wordpress.com/basic-html/structure-html-document/)

Each of these sections are contained within HTML tags, which are the atomic elements of a webpage. The HTML tags all have specific behaviors and meanings that the browser understands.

The most common tags for content within the page are <p>, <div>, <img>, <a>, and <h1>–<h6> tags: paragraph, div, image, link, and heading tags. These tags display text, images, and links and give structure to the page.

To take our understanding of webpages to the next level, let's use the browser tools to inspect a popular news website: NPR.

[Image: NPR article page]

(src: https://www.npr.org/2025/01/30/nx-s1-5279214/economy-gdp-consumer-spending-tariffs)

This image shows what a reader may view when they navigate to the article page. However, there is so much more data available to us if we inspect the page source code. This is the HTML document that structures the text, images, and links that give the page structure and make it interactive.

All modern browsers give us the ability to view the source code of a page. To do this, we can right-click on any part of the page and select the "Inspect" option in the menu.

In this case, I want to look at the source code that displays the article header. Right-click on the article heading text and click Inspect:

[Image: Inspecting the NPR article header in browser developer tools]

Here we see the HTML for the header:

// NPR Article Title Header HTML tag

<h1>The U.S. economy is still doing well as Americans continue to spend</h1>

This HTML tag is an <h1> tag, which is the highest-level heading a page can have. The <h1> tag is important for web crawlers because it usually signals the main topic of the page.

There is a picture below the header. If we want to collect the link that this image is served from, we can right-click it and inspect it:

<img src="https://npr.brightspotcdn.com/dims3/default/strip/false/crop/4500x3000+0+0/resize/1100/quality/85/format/jpeg/?url=http%3A%2F%2Fnpr-brightspot.s3.amazonaws.com%2Fcc%2F86%2Ff878cad743a19887593fd36a6089%2Fgettyimages-2186442665.jpg" 
     class="img" 
     alt="Consumer spending kept the U.S. economy humming in October, November and December. The economy grew at an annual rate of 2.3% during the quarter." 
     data-template="https://npr.brightspotcdn.com/dims3/default/strip/false/crop/4500x3000+0+0/resize/{width}/quality/{quality}/format/{format}/?url=http%3A%2F%2Fnpr-brightspot.s3.amazonaws.com%2Fcc%2F86%2Ff878cad743a19887593fd36a6089%2Fgettyimages-2186442665.jpg" 
     data-format="jpeg">

Now we are looking at a <img> tag. These are attribute-rich tags because they do more than just contain text. They have a src attribute that states the URL for retrieving the image. They may also have an alt attribute that tells screen readers what to communicate to people who have visual disabilities.

All of this information may be useful to our webscraper, and there is a lot more information in the HTML document. In the coming sections we will continue to explore HTML documents and look at how to scrape this data programmatically to get information for our apps.

Fetching Web Page Content

We understand how to target elements in a web page. Now, it is time to figure out how to retrieve the webpage itself. Modern webpages respond to HTTP (hypertext transfer protocol) requests. These requests consist of a few different elements:

[Image: Components of an HTTP request]

For each request, we will have headers that describe who or what is making the request, a verb that describes what kind of action will be performed, and an optional body that may include information for performing the action.

Most of our calls while web scraping will be GET calls, as we want to read data from a website. We will not usually have a body while web scraping, so we can ignore that for now. The last part of the HTTP request is the headers section.

This is an important section to review and modify as websites will review this data when determining if you are scraping their site. Some of the most common headers associated with web scraping are:

  1. User-Agent: Describes your browser and hardware
  2. Accept-Language: Describes your locale and language
  3. Accept-Encoding: Describes how to compress data
  4. Accept: Describes the types of files you expect in the response
  5. Referer: Describes the last page visited by the user

It is important to set these values in your requests as the headers can tell your target website that the activity is a scraper and not human. Another important item to include in requests is cookies.

Cookies are strings passed between a client and server that give context about who wants data and their permissions. With libraries like Python’s requests, we can manage cookies using the Session object.
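As a small, hedged illustration of that Session behavior, the sketch below uses httpbin.org purely as a convenient test endpoint:

// Python code showing cookie persistence with requests.Session

import requests

# A Session persists cookies (and default headers) across requests automatically
session = requests.Session()
session.headers.update({"User-Agent": "MyScraper/1.0"})  # placeholder User-Agent

# The first response sets a cookie via a redirect...
session.get("https://httpbin.org/cookies/set/session_id/abc123")

# ...and later requests in the same Session send that cookie back without extra work
response = session.get("https://httpbin.org/cookies")
print(response.json())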

Let’s look at a quick Python requests call that sets headers and cookies in a GET call:

// Python code example

import requests

# Define the URL for the GitHub API endpoint
url = "https://api.github.com/user"

# Set up headers and cookies
headers = {
    "User-Agent": "YourAppName",  # GitHub requires a User-Agent header
    "Authorization": "Bearer YOUR_ACCESS_TOKEN"  # Optional, if using OAuth token
}

# Set a cookie value to send in our request
cookies = {
    "github_cookie": "ABC_123_XYZ" 
}

# Make the GET request
response = requests.get(url, headers=headers, cookies=cookies)

# Check if the request was successful
if response.status_code == 200:
    print("Success:", response.json())  # Print the JSON response
else:
    print(f"Failed with status code: {response.status_code}")

Here we are making a GET call to retrieve a user profile from GitHub. We set our URL, then create a headers map and a cookies map. The requests library makes setting headers and cookies simple, as we can pass them directly as arguments to our GET call. We should receive a successful response object as long as we supply valid credentials.

The final thing we need to understand about HTTP requests is response codes. These codes give us information about the result of our request. The three categories we are interested in are 2xx (successful calls), 4xx (client errors), and 5xx (server errors).

Each status code corresponds to a different result, including missing authentication, bad requests, server timeouts, and exceeded rate limits. Most of the errors you will hit are 4xx statuses: issues with your requests. Here are the common 4xx errors and some tips to fix them:

| Status Code | Meaning | How to Fix |
|---|---|---|
| 400 Bad Request | The request is malformed | Check the URL, query parameters, and request syntax |
| 401 Unauthorized | Authentication is missing or invalid | Add or refresh credentials, tokens, or cookies |
| 403 Forbidden | The server refuses to serve the request | Review headers (e.g., User-Agent), permissions, and the site's scraping rules |
| 404 Not Found | The page does not exist | Verify the URL and check whether the page has moved |
| 429 Too Many Requests | You have exceeded the rate limit | Slow down, add delays or backoff, and respect the limit |
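As a hedged sketch of how you might react to these status codes in practice (the URL below is a placeholder):

// Python code for handling common response codes

import requests

response = requests.get("https://example.com/some-page")

if response.status_code == 200:
    html = response.text  # success: hand the HTML off to your parser
elif response.status_code == 429:
    # Too Many Requests: honor the Retry-After header if present (assumed numeric here)
    wait_seconds = int(response.headers.get("Retry-After", 60))
    print(f"Rate limited; waiting {wait_seconds} seconds before retrying")
elif 400 <= response.status_code < 500:
    print(f"Client error {response.status_code}: check the URL, headers, and authentication")
else:
    print(f"Server error {response.status_code}: retry later with a backoff")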

Parsing HTML with Beautiful Soup

Once we make a page request, we should have an HTML document. This is the source code the browser uses to display a webpage. We can imagine this as a printed document. Our job now is to highlight and copy the important information.

Beautiful Soup is a Python library that excels at navigating HTML documents. We can target data like HTML DOM elements, CSS selectors, IDs, and data attributes. For dynamic selectors that have some structure, we can use regular expressions to create pattern-searching code.

Let's look at some code for retrieving the data of an NPR article using Python and the Beautiful Soup and requests libraries:

// Python code for parsing call to NPR article using BS4

import requests
from bs4 import BeautifulSoup

# URL of the NPR article
url = "https://www.npr.org/2025/01/30/nx-s1-5279214/economy-gdp-consumer-spending-tariffs" 

# Send a request to get the page content
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# 1. Retrieve the title
title = soup.find("h1", {"class": "storytitle"}).get_text(strip=True)

# 2. Retrieve the author
author_tag = soup.find("span", {"class": "byline"})
author = author_tag.get_text(strip=True) if author_tag else "Author not available"

# 3. Retrieve the date
date_tag = soup.find("time")
date = date_tag.get_text(strip=True) if date_tag else "Date not available"

# 4. Retrieve images and their metadata
images = []
image_tags = soup.find_all("img")
for img in image_tags:
    img_url = img.get("src")
    img_alt = img.get("alt", "No alt text available")
    img_title = img.get("title", "No title available")
    images.append({
        "url": img_url,
        "alt": img_alt,
        "title": img_title
    })

# 5. Retrieve the article text
article_paragraphs = soup.find_all("p", {"class": "storytext"})
article_text = " ".join([p.get_text(strip=True) for p in article_paragraphs])

# Output the data
print("Title:", title)
print("Author:", author)
print("Date:", date)
print("Images:")
for image in images:
    print(f"  - {image['title']} (alt: {image['alt']}) -> {image['url']}")
print("\nArticle Text:")
print(article_text)

In this code example, we have already inspected the web pages to get the correct CSS selectors for each of our target items. We are concerned with getting: the title, the author, the date, a list of all the images and their metadata, and finally, the text body.

This body text can be useful as we try to understand the document. Two important visualization actions we can take are sentiment analysis and word frequency graphing.

Before we start working on visualization we can clean our data with code like this:

// Python for cleaning body text

import re

# Clean up extra spaces, newlines, or unwanted characters
cleaned_text = " ".join(article_text.split())

# Remove any remaining unwanted characters (e.g., special characters)
cleaned_text = re.sub(r"[^A-Za-z0-9\s.,;!?-]", "", cleaned_text)

print(cleaned_text)

Visualization 1: Sentiment Analysis

Have you ever wondered about the tones or emotions of a text? This is what sentiment analysis focuses on: determining the tone and intention of a text. Python offers support for sentiment analysis through the textblob package.

To perform analysis on our cleaned text, we can use the following code:

// Python code for sentiment analysis

from textblob import TextBlob

# Perform sentiment analysis
blob = TextBlob(cleaned_text)
sentiment = blob.sentiment

print("Sentiment:", sentiment)

Visualization 2: Word Frequency Graphing

Another common text visualization method is to look at word frequency. Are certain figures or places mentioned often throughout the body of the text? Do certain adjectives pop up when describing an event or person?

To figure this out we can write a short script like the following to work with our cleaned body text from the article:

// Python code for word frequency graphing

from collections import Counter
import re

# Split the text into words and count their occurrences
words = re.findall(r'\b\w+\b', cleaned_text.lower())
word_counts = Counter(words)

# Print the most common words
print(word_counts.most_common(10))  # Top 10 most common words
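To actually graph these frequencies, one option is a simple bar chart with matplotlib (assumed to be installed separately via pip), continuing from the word_counts computed above:

// Python code for plotting word frequencies

import matplotlib.pyplot as plt

# Take the ten most common words computed above
top_words = word_counts.most_common(10)
labels = [word for word, _ in top_words]
counts = [count for _, count in top_words]

# Plot a simple bar chart of word frequencies
plt.bar(labels, counts)
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Top 10 words in the article")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()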

Handling JavaScript-Rendered Content

Not all websites return a complete HTML document when you make a request. As websites have become more advanced, they have implemented client-side rendering, where HTML is created dynamically using JavaScript.

These pages load in segments and change with user interactions instead of loading at once. Tools like Beautiful Soup, Scrapy, and Requests will not help us get all the data we need. Instead, we have to turn to tools like Selenium and Playwright.

We will look at Selenium, a coding tool for automating web browser interactions. It was initially created as an automation tool for testing but has proven useful for scraping dynamic content from web pages.

With Selenium you can target webpages and program the browser to mimic user actions such as button clicks and page scrolls. There are some ways JavaScript-rendered content may appear:

  1. Page Template: A page template is sent from the server which then calls back to the server for the data to populate it (lazy loading)
  2. Infinite Scroll: Infinite scroll requires the user to navigate to a specific part of a page before calling for more content
  3. Clicks and Links: A button or link must be clicked before the page calls out to populate a section

For all of these use cases, we need to simulate an action after the DOM is loaded. This is why we need automation frameworks like Selenium. Let's look at configuring Selenium and Python to perform web scraping.

Step 1: Install Python in your environment

We will need Python on our computer to build our scraper. Please reference the earlier section on setting up your environment if you do not already have Python installed.

Step 2: Install Selenium

The code we need to integrate Selenium into our script comes as a Python package. Run the following command in your workspace to download the Selenium package:

// python command to install Selenium

pip install selenium

Step 3: Download Web Driver

Python with Selenium can't directly access your browser. To perform automation we will need software to control the browser. This is what a web driver does.

There are different web drivers for different browsers. We will be working with Google Chrome, but you could also get web drivers for Firefox or Microsoft Edge.

To download ChromeDriver, navigate to the official ChromeDriver downloads page (https://chromedriver.chromium.org/downloads) and download a version that matches your installed Chrome version:

[Image: ChromeDriver downloads page]

Once you have downloaded ChromeDriver, place the executable somewhere your system can find it, such as /usr/local/bin.

Step 4: Test Your Web Driver Installation

You have downloaded ChromeDriver, but it is always good to test that it works. Sometimes the hardest part of software development is environment setup.

The following code is a diagnostic script to ensure that Python, Selenium, and ChromeDriver are ready to use:

// Python code to verify ChromeDriver installation

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Create a new instance of Chrome WebDriver using the driver placed in /usr/local/bin
service = Service('/usr/local/bin/chromedriver')
driver = webdriver.Chrome(service=service)
driver.get('http://www.google.com/')

# Wait for a few seconds to let the browser load
driver.implicitly_wait(5)

# Print the title of the page to verify everything is working
print(driver.title)
driver.quit()

Step 5: Write Your Selenium Script

Our environment is ready to use, so let's write a simple script to test out Selenium. Remember, unlike Beautiful Soup, where we parse a static document, Selenium is interaction-based.

For our example, we will open Google and navigate to content through "user actions" instead of retrieving a single web page and parsing it. The code for our Selenium script will look like this:

// Python code example for interacting with Google

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Initialize the WebDriver (adjust the driver path if necessary)
service = Service('/usr/local/bin/chromedriver')
driver = webdriver.Chrome(service=service)

# Open a webpage
driver.get('https://www.google.com')

# Find the search box and input a search query
search_box = driver.find_element(By.NAME, 'q')
search_box.send_keys('Selenium WebDriver tutorial')
search_box.send_keys(Keys.RETURN)  # Press Enter

# Wait for results to load
driver.implicitly_wait(3)

# Print the first result’s title
first_result = driver.find_element(By.XPATH, '(//h3)[1]')
print(first_result.text)

# Close the browser
driver.quit()
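For the infinite-scroll case mentioned earlier, the usual pattern is to scroll with JavaScript and wait for new content to load. Below is a hedged sketch of that idea; the target URL is a placeholder:

// Python code sketch for scraping an infinite-scroll page with Selenium

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Same driver location as in the earlier examples
service = Service('/usr/local/bin/chromedriver')
driver = webdriver.Chrome(service=service)
driver.get("https://example.com/infinite-feed")  # placeholder URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom and give the page time to fetch more content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content loaded, so we have reached the end
    last_height = new_height

print(driver.page_source[:500])  # the rendered HTML is now ready for parsing
driver.quit()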

Cleaning and Storing Data

Writing web scraping scripts is a tedious task. Even with good attention to detail, you may have hard-to-use data. This is where cleaning comes in.

Once we have crawled and scraped our target web pages, we will have collections of data stored in memory on our computers. To make this usable we need to clean the data and store it somewhere. What does cleaning look like?

There are many steps to cleaning data, but we will look at the five most common ones and how to handle them.

Cleaning 1: Removing Duplicate Entries

This step is straightforward: we want to remove duplicate records from our data sets. Sometimes the duplicates come from repeated data on the webpage we target, other times they come from issues with our scripts.

Below is a code example showing how to use Pandas to remove duplicate records from data stored in memory:

// Python Code

import pandas as pd

# Sample dataset with a duplicated record
data = {
    'ID': [1, 2, 2, 3],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [25, 30, 30, 35]
}

df = pd.DataFrame(data)

# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

print(df)

// New Data Output

   ID     Name  Age
0   1    Alice   25
1   2      Bob   30
3   3  Charlie   35

Cleaning 2: Handling Missing Values

Some of our data may have incomplete records due to missing data on the webpage, or our scripts missing elements while scraping. This incomplete data can cause issues for our data analysts who don't account for missing values in their reports or dashboards.

With Pandas, we have the option to identify and insert in place a given value for missing items in our datasets:

// Python Code

# Sample dataset with missing values
data = {
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40]
}

df = pd.DataFrame(data)

# Fill missing values in 'Name' with 'Unknown' and in 'Age' with the mean value
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)

// Cleaned Data

   ID     Name        Age
0   1    Alice  25.000000
1   2      Bob  33.333333
2   3  Charlie  35.000000
3   4  Unknown  40.000000

Cleaning 3: Standardizing Date and Time Formats

Even when we have all the data from our web scraper, there can be inconsistency in the format. This is especially prevalent with dates and times.

Instead of forcing analysts or developers to build around these inconsistencies, we can normalize the formats of these date and time values:

// Python Code

# Sample dataset with dates in different formats
data = {
    'Event': ['A', 'B', 'C'],
    'Date': ['2023/10/01', '01-10-2023', '2023-10-03']
}

df = pd.DataFrame(data)

# Convert 'Date' column to a standard datetime format
df['Date'] = pd.to_datetime(df['Date'], format='mixed', dayfirst=True, errors='coerce')  # 'mixed' (pandas 2.0+) parses each value's format individually

print(df)

// Cleaned Data

  Event       Date
0     A 2023-10-01
1     B 2023-10-01
2     C 2023-10-03

Cleaning 4: Removing HTML Tags and Special Characters

Webpages can use different formats, characters, and elements to customize data for viewing. This may result in HTML tags or special character artifacts wrapping our data points.

These artifacts cause issues when accessing the data as they are not in the string or number format we expect. We can use Pandas with regular expressions to clean up this data:

// Python Code

import re
import pandas as pd

# Sample dataset with HTML tags
data = {
    'Product': ['<p>Apple</p>', '<div>Banana</div>', '<span>Carrot</span>']
}

df = pd.DataFrame(data)

# Remove HTML tags using regular expressions
df['Product'] = df['Product'].apply(lambda x: re.sub('<.*?>', '', x))

print(df)

// Cleaned Data

   Product
0    Apple
1   Banana
2   Carrot

Cleaning 5: Normalizing Text

The last case of cleaning we will look at is normalizing text. The process of normalization comes into play when we have many different ways of explaining the same data.

Say we are storing data for the given US state of a record. If our records are for Washington state we could have: WA, WASH, Washington, Wa, etc.

These all represent the same state data value but are not easy to work with because they differ in their literal value. We can normalize and thus clean this data to be easier to work with:

// Python Code

# Sample dataset with inconsistent text formatting
import pandas as pd

data = {
    'Product': [' Apple ', '   BANANA', 'carrot   ']
}

# Create a df structure using our data 
df = pd.DataFrame(data)

# Normalize the text by lowering case and removing extra spaces
df['Product'] = df['Product'].str.lower().str.strip().str.replace(r'\s+', ' ', regex=True)

print(df)

// Clean Data

  Product
0    apple
1   banana
2   carrot

Once you have cleaned and formatted your data, you may want to output your data in a format that other programs can use. The two most common formats for storing pandas data are CSV and JSON.

Both formats have their benefits. CSV data resembles a table with comma-delimited values making up each row or record. This format is useful if you access the data with table structure in mind.

However, if you work with data using an OOP language like JavaScript, JSON may be a better choice. Though JavaScript Object Notation has JavaScript in its name, it is widely supported by many scripting languages for reading and writing.
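As a quick sketch, pandas can write a cleaned DataFrame to either format in a single call; the file names below are placeholders:

// Python code for exporting cleaned data to CSV and JSON

import pandas as pd

df = pd.DataFrame({
    "Product": ["apple", "banana", "carrot"],
    "Price": [1.25, 0.50, 0.75],
})

# Write the cleaned data to CSV and JSON files (placeholder file names)
df.to_csv("scraped_products.csv", index=False)
df.to_json("scraped_products.json", orient="records", indent=2)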

How to Avoid Web Scraping Pitfalls

As you begin web scraping you will find many prevention settings or developer designs that make scraping difficult. Some of the most common pitfalls include:

  1. CAPTCHAs
  2. IP bans
  3. Rate Limits

These tools and precautions are meant to prevent abuse by web scrapers and malicious actors. As with any countermeasure, there are ways to continue scraping even when these tools are in place.

Pitfall 1: CAPTCHAs

CAPTCHA, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart, is a web tool meant to prevent web scrapers from accessing certain webpages. CAPTCHAs are usually visual tests or puzzles that a human should be able to solve but a machine should not.

[Image: Example CAPTCHA challenge]

Pitfall 2: IP Bans

IP bans are a technique for blocking specific IP addresses, or entire ranges of them, from accessing a server. A single IP may be flagged for malicious behavior, or a whole block of addresses for a given country may be banned.

You can avoid being IP banned by using proxies to spread your web traffic across many IP addresses. By making requests from many IPs you are less likely to get flagged for spamming a server.

The Python requests library supports routing traffic through proxies out of the box, and Scrapy users can add middleware such as scrapy-proxies to make proxy rotation easy.
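Here is a hedged sketch of routing requests traffic through a proxy; the proxy address and credentials are placeholders you would replace with your provider's details:

// Python code for sending requests through a proxy

import requests

# Placeholder proxy credentials and host; substitute your provider's values
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

# httpbin.org/ip is used here only as a convenient way to confirm the outgoing IP
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.json())  # should show the proxy's IP rather than your own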

Pitfall 3: Rate Limits

Websites use rate limits to prevent abuse of their services. They set a limit for acceptable requests made from one source over a given period of time. This could be 10 requests per second or 1000 requests per minute.

Usually they identify requests by the IP address of a requestor or an API key. These both serve as fingerprints to ensure that a user does not use all of a server's resources.

Once you know a website's rate limit, you can use Python's time.sleep() function between requests to perform a backoff. This is a preset waiting period during which you don't make requests, ensuring you stay within a service's rate limit.
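A minimal sketch of that idea, assuming a limit of roughly one request per second (the URLs below are placeholders):

// Python code for pacing requests with time.sleep

import time
import requests

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

for url in urls:
    response = requests.get(url)
    print(f"{url} -> {response.status_code}")
    time.sleep(1)  # pause between requests to stay under the site's rate limit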

Ethical and Legal Aspects of Web Scraping

When you're web scraping, a lot of what you do operates on the honor system. Sure, websites may rate limit you or try to block you from accessing certain data. However, if you try hard enough, you will probably succeed in accessing it.

Just because you can does not mean you should. Just as you would not break into an empty store to steal goods, you should not web scrape in areas that are explicitly forbidden.

Let's look at some ethical best practices and legal cases that resulted from web scrapers breaking the rules.

1. Respect robots.txt files and Terms of Service

Before you begin scraping, always check a website's robots.txt file. This file is the starting point to determine which pages, if any, you are allowed to scrape.

Along with checking the robots.txt file, look at the website's terms of service. This will outline, in writing, expectations on data use and ingestion by third parties.

One high-visibility case involving ToS came in 2017 between HiQ and LinkedIn. In this case, HiQ was scraping publicly available data, and LinkedIn challenged them with a cease-and-desist letter.

While the court ruled in HiQ's favor due to the public nature of the data, the contentiousness of the case and ruling highlights the importance of reading a website's policies.

2. Avoid Overloading a Target's Servers

Even if a website does not set explicit rate limits, set your own so that you do not block others from using the site. When you send too many requests to a website, you are effectively performing a denial-of-service attack, which is illegal and harmful to the target company and its users.

An early-2000s case of server overloading occurred between eBay and Bidder's Edge. The court ruled in eBay's favor, finding that Bidder's Edge's scraping of product listings was harming both eBay and its users by overloading eBay's servers and degrading the quality of the online auction house.

3. Do not Scrape Private or Restricted Data

Do not scrape data behind paywalls or logins. Especially for paid services, this type of behavior may be not only against ToS but also illegal. This activity is akin to stealing data that you do not have authority to move offsite.

A 2009 verdict in favor of Facebook backed this up. Facebook won in a lawsuit against Power Ventures. Power Ventures was scraping Facebook profiles against the wishes of Facebook.

4. Provide Attribution for Data and Respect Copyright

Our final ethical practice is to provide attribution for data and respect copyright. If you use a website's data for publication or commercial reasons, give credit to the source you scraped from.

If data is protected by copyright, do not use the data against the wishes of the creator.

A 2020 court ruling found Thyrfty potentially liable for copyright infringement for scraping Sears' product catalog without permission.

As you have seen, there are some grey areas when it comes to ethical and legal constraints on web scraping. When you're in doubt, reference this table to get a sense of whether what you're doing is allowed:

| Scraping Type | Legal? | Notes |
|---|---|---|
| Scraping publicly available data | ✅ Yes | Allowed, but check the robots.txt |
| Scraping behind logins | ❌ No | Violates Terms of Service |
| Scraping with permission | ✅ Yes | Always the safest option |

Common Mistakes Beginners Make

There are many mistakes beginners make when they start web scraping. Though web scraping is easy to get into, mastering it takes time and many lessons learned. Some mistakes almost all beginners make are covered below.

1. Ignoring robots.txt

The first mistake is ignoring robots.txt. This file describes which pages on a website, if any, are open to web scraping. While there is not usually an immediate reaction to illicit scraping, a webmaster may review their network logs and decide to ban your IP from accessing the website. Always check the robots.txt file and respect the rules outlined in it.

2. Ignoring Rate Limits with Overly Aggressive Scraping

Another common mistake is overly aggressive web scraping. Just because you can send tens of thousands of requests a minute to a site does not mean you should. You do not want to DDOS a site to get stock information. This is illegal.

You also do not want to get throttled or have your IP blocked for exceeding rate limits. Pay attention to the described rate limits and avoid making more requests than you need to for any given task.

3. Poor or No Data Validation

A final mistake beginners make is poor or no data validation. The whole point of web scraping is to retrieve good data. Page layouts are often inconsistent, and developers are human and prone to bugs. Do not let these inconsistencies poison your data set.

Make sure you look at and test multiple web pages for each part of your script to ensure that the script is resilient against page differences. You do not want to spend a lot of time in Pandas cleaning up dirty data sets.
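A simple, hedged way to guard against this is to validate each scraped record before keeping it; the required fields below are illustrative:

// Python code for basic data validation

# Illustrative required fields for an article scraper
REQUIRED_FIELDS = ["title", "author", "date"]

def is_valid(record):
    """Return True if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

scraped_records = [
    {"title": "Article A", "author": "Jane", "date": "2025-01-30"},
    {"title": "", "author": None, "date": "2025-01-31"},  # would be rejected
]

clean_records = [r for r in scraped_records if is_valid(r)]
print(f"Kept {len(clean_records)} of {len(scraped_records)} records")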

Advanced Tips for Efficient Web Scraping

As your web scraping operations get more advanced, you will need to employ new tools and more efficient methods for scraping. Many different methods can increase the efficiency of your scraping including: multi-threading, asynchronous scraping, and distributed scraping.

1. Multi-threading Scraping

With multi-threaded scraping, we run multiple scraping threads at once. Because web scraping is mostly waiting on network I/O, threads let us overlap those waits instead of starting and stopping each task sequentially.

We can see an example of that in action in the following code which makes use of Python's threading and requests libraries:

// Python code for Multi-threading

# Import dependencies
import threading
import requests

# create method to retrieve a web page given a url
def scrape_url(url):
    response = requests.get(url)
    print(f"Scraped: {url} - Status: {response.status_code}")

# list of urls to target
urls = ["https://example.com", "https://example.org", "https://example.net"]
threads = []

# Creating and starting threads
for url in urls:
    thread = threading.Thread(target=scrape_url, args=(url,))
    thread.start()
    threads.append(thread)

# Waiting for all threads to finish
for thread in threads:
    thread.join()

2. Asynchronous Scraping

Synchronous actions happen one after another. Asynchronous actions can occur at any time, regardless of other actions. With asynchronous scraping we make many requests at once for pages or assets, and process them as they come rather than being stuck with a specific order.

This scraping method is more efficient as a long running request or task will not block other tasks from running and finishing concurrently. To implement this in Python we will make use of the asyncio library to perform non-blocking network requests:

// Python code for Async Scraping

import asyncio
import aiohttp

# Create method that will accept a session object and a url to get content from page
async def scrape_url(session, url):
    async with session.get(url) as response:
        print(f"Scraped: {url} - Status: {response.status}")

# Method to create a list of tasks for scraping the list of URLs
async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        urls = ["https://example.com", "https://example.org", "https://example.net"]
        for url in urls:
            tasks.append(scrape_url(session, url))
        await asyncio.gather(*tasks)

# Running the asynchronous scraping
asyncio.run(main())

3. Distributed Scraping

The final advanced scraping method we will look at is distributed scraping. Instead of spreading requests and processes across CPU cores or using asynchronous actions, we will split our requests across machines.

This method works best if you have multiple machines or access to a cloud service. Distributed scraping shines as it has great scalability and performance. You will need to leverage a distributed system like Apache Kafka or Celery in order to use this process.

We can see an example of distributed scraping with Celery below:

// Python code for Distributed Scraping

from celery import Celery
import requests

app = Celery('scraper', broker='pyamqp://guest@localhost//')

@app.task
def scrape_url(url):
    response = requests.get(url)
    return f"Scraped: {url} - Status: {response.status_code}"

# Example to run the task in a distributed system
result = scrape_url.apply_async(args=["https://example.com"])
print(result.get())

How to Choose the Right Websites to Scrape

Some websites are just better than others when it comes to web scraping. When you find one of these websites you will be blown away by how easy it is to pull good, complete datasets from the website. Some signs of a good website to scrape include:

  • Rich and complete data sets
  • Frequently updated data
  • Data in table format, or consistent and easy-to-navigate DOM elements
  • DOM elements with unique and consistent identifiers
  • Robots.txt files that do not disallow scraping
  • Sitemap.xml files that clearly display all available web pages

The more consistent a website is in how it displays its data, the fewer edge cases you have to account for in your web scraper. Blog websites with a single blog template are a great example of this consistency. The header below is from Medium, which offers simple, easy-to-scrape templated articles:

[Image: Medium article header]

We know there are heading HTML tags for the titles, body tags for the text, and consistent div tags with identifiers for the author and date. Another easy page to scrape is one with consistent tables and graphs.

[Image: Example website with consistent data tables]

We have consistent tables to extract all the information we need on a target. Along with ease of coding comes ease of access to the web pages. We will touch more on robots.txt and sitemap.xml files shortly, but having the right values in each greatly speeds up web scraper development and improves accuracy.

What Makes a Website Ideal for Scraping?

Some websites are easier to scrape than others. When you find a website that is easy to scrape, the task of scripting will seem like nothing. What makes a website ideal for web scraping?

The first thing to look for is static content. If the server returns you a complete webpage without JavaScript-rendered content, you are in a good position. Static content allows you to make a single request and have all the content you need to scrape.

The second sign of a good website is a clear and consistent structure. If you are scraping an e-commerce website and the pages fall into two clear categories of collection and product pages, where each page has the same sections, that is a good website.

A final sign of a good website to scrape is a lack of anti-scraping technology. If there are no Captchas, rate limiting, authorization headers, or IP whitelisting then it is easier to scrape without having your scraper blocked.

How to Analyze a Website Before Scraping

We have already looked at making basic requests for HTML, more complicated requests for JavaScript-rendered content, and ethical considerations for webpages. Now, let's talk about two important items to analyze before web scraping.

The first item is the robots.txt file. This is a simple text file stored at the root of the target website. It describes which user agents will be rejected outright (AI crawlers such as OpenAI's are a popular target right now) as well as which paths, or groups of paths, are off-limits to web scraping. The robots.txt file does not technically prevent you from scraping; it is a set of instructions for how you should behave on the website. Let's take a look at a popular website: ESPN.

To get to ESPN's robots.txt file we can enter the following link in your browser:

https://www.espn.com/robots.txt

When you enter this in your browser you will be returned this text file (I have abridged it, but left in important items for our review):

# robots.txt for www.espn.com
User-agent: claritybot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: *
Disallow: */admin/
Disallow: */boxscore?
Disallow: */calendar/
Disallow: */cat/
Disallow: */conversation?
Disallow: */conversation/
Disallow: */conversations?
Disallow: */conversations/
Disallow: */databaseresults/
Disallow: */date/
Disallow: */deportes/
Disallow: */flash/
Disallow: */login?
Disallow: */login

The first section lists specific User-Agent values that ESPN watches for. If a request is made with any of these User-Agent values, it is prohibited from accessing any path on the site. For all other user agents (User-agent: *), there are rules for specific paths and sets of pages that we are not allowed to scrape. There is a lot here that we need to pay attention to and respect.
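To check these rules programmatically instead of reading the file by hand, Python's standard library includes urllib.robotparser. A short sketch:

// Python code for checking robots.txt rules with urllib.robotparser

from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt
parser = RobotFileParser()
parser.set_url("https://www.espn.com/robots.txt")
parser.read()

# Ask whether a given user agent may fetch a given path.
# Note: the standard-library parser has limited support for wildcard rules,
# so sanity-check important paths against the raw file as well.
print(parser.can_fetch("*", "https://www.espn.com/nba/"))
print(parser.can_fetch("GPTBot", "https://www.espn.com/"))  # False, since GPTBot is fully disallowed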

Once we have rules for what we are not allowed to scrape, it would be helpful to know a list of all the paths that exist for scraping. This is where our second item comes in: the sitemap.

A sitemap is usually an XML file that displays all the possible content on a website. These files can be massive.

To get ESPN's sitemap we can go to the following URL: https://www.espn.com/sitemap.xml

We will be returned the following sitemap, which is a sort of table of contents for the different sections of the website.

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.espn.com/googlenewssitemap</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.espn.com/sitemap/watch-espn-videos/</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.espn.com/sitemap/videos</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.espn.com/where-to-watch.xml</loc>
  </sitemap>
</sitemapindex>

This is a high-level sitemap, but we can see that there are many subsections on the website. Navigating to any of these links returns individual sitemap files listing the content that search engines like Google use when indexing ESPN.

When analyzing websites, you must use both the robots.txt file and the sitemap. One tells you what you can and can't access while the other tells you what exists on the website.
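A short, hedged sketch of pulling the sitemap index and listing its entries with the standard library:

// Python code for listing sitemap entries

import requests
import xml.etree.ElementTree as ET

# Fetch the sitemap index and parse the XML
response = requests.get("https://www.espn.com/sitemap.xml")
root = ET.fromstring(response.content)

# Sitemap files use this namespace for all of their elements
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Print every <loc> entry listed in the index
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)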

How to Scale Web Scraping Projects

There will be projects where the size of your target and the complexity of the pages you are scraping will outpace the tools we have discussed before. You will need to scale both your resources and the tools that you are using.

A solution to heavier compute needs is to add more machines. While this can help, simply spreading tasks across more machines does not by itself make your web scraping more effective. Instead, you should scale out (more machines) and also scale your methods (more efficient scraping techniques). This is where tools like Scrapy Clusters come in.

With Scrapy Clusters you create distributed crawling machines that perform a specific action for a webpage instead of performing all actions. For instance, you could have a crawling machine that retrieves the web page and breaks it up into text section 1, text section 2, images, etc.

We will look deeper into what this distributed crawling machine architecture looks like next.

Distributed Web Scraping with Python

As mentioned in the previous section, distributed crawling machines can help make your operations more efficient. It does this by distributing scraping tasks across many machines or processes to handle high amounts of data efficiently.

[Image: Distributed web scraping architecture]

A good companion library for Scrapy is Scrapy Clusters which helps with the distribution and orchestration of tasks among multiple scrapers. Scrapy Clusters uses a data store, like Redis, to share data and tasks between multiple web scraper processes.

This allows multiple Scrapy spider instances to work together without replicating scraping efforts or making duplicate requests. In order to run a successful cluster we need:

  1. Redis: a data store to manage the data during the process of web scraping for tasks and target URLs
  2. Scrapy instances: individual Scrapy spiders that complete given tasks
  3. Scrapy middleware: will handle overhead tasks like headers, retried requests, etc.

Now that we understand what Scrapy Clusters are, let's look at some code examples. We will need two files to run our clusters. The first is for our Scrapy instances:

// Python code to create Scrapy spider

import scrapy
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'my_spider'
    redis_key = 'my_spider:start_urls'  # The key in Redis to read the starting URLs

    def parse(self, response):
        # Extract data from the page (this is just an example)
        title = response.xpath('//title/text()').get()
        yield {'title': title}

        # Follow links to other pages and add them to the scraping queue
        next_page = response.xpath('//a[@class="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, self.parse)

This code sets up a Scrapy spider instance with a method to parse responses. Next, we need a configuration file for managing Scrapy Clusters and Redis:

// Python code to manage Redis and Scrapy Clusters Settings

# Enable Scrapy-Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Redis connection settings
REDIS_URL = 'redis://localhost:6379'

# Scrapy Cluster related settings
SCHEDULER_PERSIST = True  # Keep the Redis queue persistent
CONCURRENT_REQUESTS = 16  # Customize based on your needs
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5  # Customize based on site rules

# Enable logging to monitor distributed crawling status
LOG_LEVEL = 'INFO'

Storing and Managing Large Data Sets

Once you have retrieved data, you need a way to store it. Here comes another design decision: what kind of database to use for your records. The two most common paradigms for databases are SQL and NoSQL.

Both database types have their pros and cons. Here are some considerations when choosing:

| Feature | SQL (Relational Database) | NoSQL (Non-Relational Database) |
|---|---|---|
| Data Structure | Structured, organized in tables with rows and columns (schema-based) | Flexible, stores data in JSON, key-value, document, or graph formats (schema-less) |
| Products | MySQL, PostgreSQL, SQLite, MS SQL Server | MongoDB, Cassandra, CouchDB, Redis, Elasticsearch |
| Schema | Fixed schema, must define tables and columns before storing data | Schema-less, no need to predefine data structure, adaptable to changing data |
| Flexibility | Less flexible; changing schema can be complex and requires migration | Highly flexible; can store semi-structured or unstructured data (e.g., JSON, XML) |
| Data Integrity | Strong consistency and ACID compliance (Atomicity, Consistency, Isolation, Durability) | Eventual consistency (may trade off some ACID properties for scalability) |
| Query Language | SQL (Structured Query Language), complex queries supported for joins, aggregates | Query languages vary (e.g., MongoDB uses BSON/JSON-based queries), less complex for joins but supports more flexible searches |
| Use for Web Scraping | Ideal for structured data, data with predefined relationships (e.g., product catalogs, user data) | Ideal for unstructured or semi-structured data, data with changing schema (e.g., social media posts, log data, documents) |

You won't lack for choices, as both database types have many paid and open-source options. SQL databases are seen as more rigid since their table structure requires planning. NoSQL databases are more flexible, with document-based rather than record-based data storage.

SQL databases can be easier to work with as they are ACID-based and consistency is built in. NoSQL records require more safeguards as the data is not as rigidly defined.

If you have specific data you are scraping, SQL databases make more sense because of their consistency and reliability. With dynamic webpages, or webpages that are inconsistent, NoSQL databases are a great way to store unstructured data now for structured use later.

Some good use cases for SQL web scraping are: product pages, stock reports, and user profiles. NoSQL web scraping applications include news articles, social media posts, and logs.
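As a small sketch of the SQL route, Python's built-in sqlite3 module is enough to store structured scraped records locally; the table layout below is illustrative:

// Python code for storing scraped records in SQLite

import sqlite3

# Connect to (or create) a local SQLite database file
conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

# Illustrative table for scraped product data
cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        price REAL,
        scraped_at TEXT
    )
""")

# Insert a scraped record using parameterized values to avoid injection issues
cursor.execute(
    "INSERT INTO products (name, price, scraped_at) VALUES (?, ?, ?)",
    ("Example Widget", 19.99, "2025-02-11"),
)

conn.commit()
conn.close()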

Once you have decided if you are using SQL or NoSQL databases, it is time to decide if cloud or local storage is better for your web scraper.

Cloud storage has become big business since the early 2010s, and many applications and companies now run on the cloud, with Amazon Web Services and Microsoft Azure among the largest providers.

Much of the difference between cloud versus local storage is who manages the hardware and software of the servers, and where the servers are physically located. Here are some more considerations:

| Feature | Cloud Storage | Local Storage |
|---|---|---|
| Scalability | Highly scalable, can handle massive amounts of data with minimal effort, auto-scaling | Limited by the physical hardware; scaling requires purchasing more storage devices or upgrading infrastructure |
| Cost | Pay-as-you-go, typically based on data volume, storage class, and retrieval frequency | Upfront cost for hardware and ongoing maintenance cost; no variable cost based on usage |
| Accessibility | Accessible globally via the internet | Only accessible locally or within a local network |
| Durability | Highly durable with data replication across regions | Lower durability, risk of data loss in case of hardware failure |
| Performance | Dependent on network speed; latency may be an issue when accessing or retrieving large datasets | Typically faster for local data access; no dependency on network speed, suitable for high-performance computing |
| Data Security | Highly secure, with built-in encryption (at rest and in transit), and integrated access management (IAM) systems | Security is your responsibility (physical security, data encryption); limited by local security measures |
| Use Cases | Ideal for large-scale, distributed, and scalable web scraping projects where data must be accessed from different locations, or when the dataset grows rapidly | Ideal for small to medium-sized scraping projects, where data access and speed are the primary concerns and scale is manageable |

Cloud storage is more scalable, secure, and data redundant. However, the costs can increase quickly based on your provider, and you have to rely on network conditions for speed of accessing your data.

Local storage is great if you need low-latency reads, as it is not network-dependent. You are also directly in control of costs, though maintenance overhead can be an issue.

How to Monitor and Maintain Web Scraping Scripts

One major downside of web scraping is that the internet is always changing. Websites will update their themes or webpage structure. Some software libraries will introduce breaking changes that cause older code not to work. There are constant changes that can bring your scraping to a halt.

Manually checking on the effectiveness of your scripts is not realistic. To maintain oversight and control of your operations, you must monitor your scripts. Luckily, there are ways to automate this monitoring using Python.

Automating Monitoring with Python

A good rule in programming is: if you can automate a task with relatively little effort, do it. If you are running long-running scripts, or web scraping scripts that run daily, weekly, or monthly, you will want to add automated monitoring.

Automated monitoring will allow you to start the script on your machine or a hosted server and forget about it unless you get a notification. Hopefully, after a few iterations, you will not get any notifications, but in the meantime, Python has great solutions for automated monitoring.

One popular tool for automated monitoring is the APScheduler (Advanced Python Scheduler) package. This package allows you to schedule Python code similar to how you would with CRON jobs.

With this library, you can section off your code into jobs that can be run and monitored individually. This allows you to have better visibility into what jobs failed and why instead of seeing a single script failure for all the pages of a website or sections of a page.

// Python code for APScheduler

from apscheduler.schedulers.background import BackgroundScheduler
import time
import logging

# Setting up logging
logging.basicConfig(level=logging.INFO)

# Method to run and log web scraping job
def web_scraping_job():
    logging.info("Web scraping task started")
    try:
        my_web_scraping_method_for_target_a()  # Placeholder for your actual scraping logic
        logging.info("Web scraping task completed")
    except Exception as e:
        logging.error(f"Error occurred: {e}")

# Method to print off the status of the scraper
def monitor_job():
    logging.info("Monitoring: Checking if the scraper is running.")

# Setting up the scheduler
scheduler = BackgroundScheduler()
scheduler.add_job(web_scraping_job, 'interval', seconds=15, id='scraping_job')
scheduler.add_job(monitor_job, 'interval', seconds=5, id='monitor_job')

# Start the scheduler
scheduler.start()

try:
    while True:
        time.sleep(1)  # Keep the main thread alive to let the scheduler run
except (KeyboardInterrupt, SystemExit):
    scheduler.shutdown()

In this code snippet we set up two jobs: a web scraping job and a monitoring job. Once we have defined what the jobs do, we schedule them to run at set intervals.

If either job raises an exception, we catch it and log the error. This is a simple script, but APScheduler can scale up as you add more web scraping jobs or more sophisticated monitoring methods.

Adapting Scripts to Website Changes

Websites are not static documents. They change over time. This could be in the form of a new website theme, new website styling, new page components, or a fully new service.

Any of the listed changes will break our scripts, or make them less effective. We need strategies to know when these changes occur and to fix our scripts to ensure we get the data we need.

Strategy 1: Break up Scripts into Modules

It is common during early development to create single, monolithic web scraping scripts. When websites change, these scripts will break and will not give clear direction for what you need to fix.

To troubleshoot and fix your scripts faster, break up your scripts into many focused methods. Maybe you have one method for scraping author bios and another to target images and their metadata.

What is important is to have singularly focused methods. When a site changes and these methods break, you know exactly what method and area of the website to troubleshoot.
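
As a rough sketch, a modular scraper might be organized like the following. The selectors and helper names here are hypothetical; the point is that each method targets exactly one part of the page.

// Python code for a modular scraper layout

import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    # One method responsible only for fetching and parsing the HTML
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

def scrape_author_bio(soup):
    # One method responsible only for the author bio (hypothetical selector)
    bio = soup.select_one('.author-bio')
    return bio.get_text(strip=True) if bio else None

def scrape_images(soup):
    # One method responsible only for images and their metadata
    return [{'src': img.get('src'), 'alt': img.get('alt')} for img in soup.find_all('img')]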

Strategy 2: Add Robust Error Handling and Logging

When something breaks, you want to know the where and what of the error. This is where robust error handling and logging come in. Add try/except blocks to your modular methods to pinpoint what is breaking.

If you have certain errors you expect, create specific error messages for these. Add logging for each record you go through so you know what page or section failed. The more information you output and the better you structure it, the easier it will be to fix your scripts.
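
A minimal sketch of what that might look like, assuming the hypothetical fetch_page and scrape_author_bio helpers from the previous sketch and a list of page URLs:

// Python code for per-page error handling and logging

import logging

logging.basicConfig(level=logging.INFO)

def scrape_pages(page_urls):
    for url in page_urls:
        try:
            soup = fetch_page(url)  # helper from the previous sketch
            bio = scrape_author_bio(soup)
            if bio is None:
                logging.warning(f"No author bio found on {url}")
            else:
                logging.info(f"Scraped author bio from {url}")
        except Exception as e:
            # Log which page failed and why, then keep going
            logging.error(f"Failed to scrape {url}: {e}")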

Strategy 3: Use the Right Tools

Some tools handle dynamic or constantly changing websites better than others. If you are scraping dynamic websites, reach for tools like Scrapy or Selenium.

Scrapy copes well with variable page structure and offers built-in retries and request delays for dealing with anti-bot measures, while Selenium can render JavaScript-heavy pages and wait for content to appear.
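
As an illustration, Scrapy exposes retry and delay behavior through project settings. A sketch of a settings.py fragment (the values are illustrative, not recommendations):

// Python code for Scrapy retry and delay settings (settings.py)

# Retry failed requests a few times before giving up
RETRY_ENABLED = True
RETRY_TIMES = 3

# Add a delay between requests to reduce the chance of being blocked
DOWNLOAD_DELAY = 2

# Let Scrapy adjust the request rate based on server response times
AUTOTHROTTLE_ENABLED = True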

Leveraging APIs as an Alternative to Scripting

When available, APIs (application programming interfaces) are a great alternative to web scraping scripts. Companies of all sizes create APIs that allow users to access data about their services and users.

Companies like GitHub have APIs that let you get user and repository information with a few requests. Twitter has APIs that enable you to pull other users’ tweets or even manage your account, all from a program.

APIs are great for scripting services because they expose endpoints with specific resources. Web scraping can make more information available to you, but requires a lot more effort than setting up an account and writing some HTTP requests.

APIs have drawbacks as they can require payments for use and are likely to change. The frequency of change may be similar to how often a web page’s structure changes. APIs and web scraping scripts share that pain point.

How to Integrate APIs in Python

While APIs have their own drawbacks compared to web scraping, when they work, they make gathering data easy. We will step through a quick example of integrating with GitHub's API.

We want to use the GitHub API to request the profile information for a specific user. Our code will look like this:

// Python Code Below

import requests

# Define the GitHub API URL
username = 'octocat'  # Replace with any GitHub username
url = f'https://api.github.com/users/{username}'

# Send a GET request to the GitHub API to retrieve the profile data
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the response as JSON
    profile_data = response.json()
    
    # Extract and display relevant information
    print(f"GitHub Username: {profile_data['login']}")
    print(f"Name: {profile_data['name']}")
    print(f"Company: {profile_data['company']}")
    print(f"Location: {profile_data['location']}")
    print(f"Public Repositories: {profile_data['public_repos']}")
    print(f"Followers: {profile_data['followers']}")
    print(f"Following: {profile_data['following']}")
    print(f"Bio: {profile_data['bio']}")
else:
    print(f"Failed to retrieve data for {username}. Status code: {response.status_code}")

We import the requests library, then define a username and the URL for our request. We send the request and store the output in the response variable.

From this response variable, we can read the user's login, company, location, public repository count, and many other values. With just a few lines of code, we made a call to GitHub and got back a large, well-formatted data object.

This API does not require authentication, but if we need to add auth, the requests library has us covered. Check out the following example to see how easy it is to create a tuple with your username and password and add it to your request.
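
Here is a minimal sketch using placeholder credentials (GitHub expects a personal access token in place of a password):

// Python code for adding basic authentication to a request

import requests

url = 'https://api.github.com/user'

# Placeholder credentials; replace with your own username and token
response = requests.get(url, auth=('your_username', 'your_personal_access_token'))

print(response.status_code)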

If you want to add rate limits, you can either take a naive approach of pausing between a set number of requests per period, or add a package such as ratelimit. It depends on how sophisticated your solution needs to be.
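
The naive approach might look something like this sketch, which simply sleeps between requests to stay under a hypothetical limit of 20 requests per minute:

// Python code for a naive rate limit

import time
import requests

# Placeholder list of targets
urls_to_scrape = ['https://api.github.com/users/octocat'] * 5

REQUESTS_PER_MINUTE = 20
delay_seconds = 60 / REQUESTS_PER_MINUTE  # 3 seconds between requests

for url in urls_to_scrape:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(delay_seconds)  # wait before the next request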

How to Use Proxy Servers for Web Scraping

Depending on the number of requests you are making and the services you are targeting, you may need to use proxy servers while web scraping. A proxy is simply an intermediary that sits between two parties.

For web scraping, a proxy server sits between the requesting application and the target service to mask the request's origin. Large-scale web scrapers may require thousands of requests a minute. This type of traffic will be flagged and blocked by a webmaster if all the traffic comes from a single IP.

You could throttle a single IP’s traffic to stay compliant or use proxy servers to spread the requests and maintain the speed of your web scraper. To use proxy servers, you must understand what types are available.

The three most common types of proxy servers are data center proxy servers, residential proxy servers, and rotating proxies. Let's look closer at the proxy types:

| Proxy Type | Description | IP Source | Speed | Reliability | Cost | Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| Data Center Proxies | Proxies hosted in data centers, not tied to a residential ISP | Data centers (non-ISP owned) | ✅ High | 🟠 Medium | ✅ Low to Medium | Ideal for tasks that require speed, such as large-scale data gathering |
| Residential Proxies | Proxies that use real consumer IPs assigned by ISPs | Real homes and ISPs | 🟠 Medium | ✅ High | ❌ High | Best for accessing geo-restricted content or for more evasive scraping |
| Rotating Proxies | Proxies that automatically rotate IP addresses at set intervals | Mixed (data center or residential) | 🟠 Varies | ✅ High | ❌ Medium to High | Perfect for large-scale web scraping where IP bans are a concern |

1. Data Center Proxies

The best use case for data center proxies is high-volume jobs where anonymity is not critical, such as scraping large amounts of non-sensitive data or targeting sites without aggressive anti-bot protection.

A good target for data center proxies is the product pages of public-facing stores.

2. Residential Proxies

These proxies are best suited when anonymity is your top priority, but they are the most expensive. They use real ISP-assigned IP addresses to make requests to your target, which helps evade anti-bot measures.

Residential proxies are great for social media and geo-targeted web scraping where you must match a certain profile to access data.

3. Rotating Proxies

These proxies are great for long-running, high-volume web scraping tasks. They rotate through a pool of IPs at set intervals, which helps prevent rate-limit errors and IP bans.

They work well where persistent IP bans are common, such as search engine and social media scraping.

Setting up Proxies with Python

We have talked about what proxy servers are, but now let’s look at how to implement them. One of the best ways to utilize proxies is proxy rotation: we keep a collection of proxy IPs and switch between them as we make requests.

Let us look at a quick example of how to set proxies in your HTTP requests using the requests Python library:

// Python code for rotating proxies

import requests
import random

# List of proxy servers (this could be fetched dynamically from your provider)
proxies_list = [
    'http://username:password@proxy1_ip:port',
    'http://username:password@proxy2_ip:port',
    'http://username:password@proxy3_ip:port'
]

# Randomly choose a single proxy and use it for both HTTP and HTTPS traffic
chosen_proxy = random.choice(proxies_list)
proxy = {'http': chosen_proxy, 'https': chosen_proxy}

url = 'https://httpbin.org/ip'

response = requests.get(url, proxies=proxy)

if response.status_code == 200:
    print("Response from proxy:", response.json())
else:
    print("Failed to retrieve content:", response.status_code)

In this code, we create a list of proxy addresses. Next, we set a random proxy value to use from our list. We make the request and then check the outcome. If we have a list of targets, we can rotate through our list of proxies to keep them fresh and avoid IP bans.
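
As a rough follow-up sketch, a loop over multiple target URLs could pick a fresh proxy for each request. This reuses the proxies_list from the snippet above; the target URLs are placeholders.

// Python code for rotating proxies across multiple targets

target_urls = [
    'https://httpbin.org/ip',
    'https://httpbin.org/headers',
    'https://httpbin.org/user-agent'
]

for url in target_urls:
    # Pick a fresh proxy for each request
    chosen_proxy = random.choice(proxies_list)
    proxy = {'http': chosen_proxy, 'https': chosen_proxy}
    try:
        response = requests.get(url, proxies=proxy, timeout=10)
        print(url, response.status_code)
    except requests.exceptions.RequestException as e:
        print(f"Request to {url} failed: {e}")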

Best Proxy Providers for Web Scraping

Now that we have looked at how to integrate proxy servers into requests, let us review what proxy providers you may want to use in your web scraper.

1. Live Proxies

Live Proxies is a very popular service due to their real-time managed IP-pool which ensures long-running processes have fresh IPs to avoid bans or dropped calls. They specialize in high-quality residential IPs which make them the perfect choice for dealing with strict anti-bot targets.

2. Smart Proxy

Smart Proxy boasts over 40 million global IP addresses which allows you to bypass any geo-targeting IP preventions. They integrate well with Python scraping tools like Scrapy and the requests library. Smart Proxy focuses its products on mid to large-tier scraping projects.

3. Oxylabs

Oxylabs offers over 100 million residential and datacenter IPs, ideal for large-scale scraping. Their proxies integrate seamlessly with Python tools like Scrapy and requests, ensuring high success rates for bypassing geo-restrictions and anti-bot defenses. Oxylabs caters to enterprise-level data extraction needs.

Frequently Asked Questions About Web Scraping in Python

We’ve looked at web scraping, how to perform it with Python, and many important details to optimize your scraping operations. There are a few questions you may consistently have while scraping. We will focus on those questions and answers in the following sections.

Is web scraping legal?

Web scraping itself is generally legal if you follow your target site’s rules and terms of service. Each site should have a robots.txt file, which should be your first stop when determining the rules for scraping.

As mentioned earlier, this file tells bots which URLs they are allowed to crawl. It sets the boundaries for what the site permits. If it disallows all URLs, then you should not scrape the site.
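
You can check robots.txt programmatically with Python's built-in urllib.robotparser module. A minimal sketch, using an example domain and path:

// Python code for checking robots.txt before scraping

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://example.com/robots.txt')
robots.read()

# Check whether a generic user agent may fetch a given path
if robots.can_fetch('*', 'https://example.com/products/'):
    print("Allowed to scrape this path")
else:
    print("Disallowed by robots.txt")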

Along with reviewing the robots.txt file, you must pay attention to the site’s copyright. Just because you can scrape content doesn’t mean that you can use it in any way you desire. The copyright policy will outline what you can legally do with your scraped data.

What are the best Python libraries for web scraping?

The great thing about software is that developers are constantly building new tools to solve existing problems or improve on existing solutions. While new Python web scraping libraries may be released later in 2025, these are some of the best libraries right now.

1. Beautiful Soup

A simple and adaptable library, which makes it a great tool for beginning web scrapers. You can find the download and documentation page here.

2. Scrapy

An extensible framework that lets you tie in other libraries and tools to turbocharge your web scraper. The official download page can be found here.

3. Selenium

This is a web automation framework that lets users interact with target web pages the way a browser would, including JavaScript-driven interactions. The official website can be found here.

How long does it take to learn web scraping with Python?

The amount of time it takes to learn how to scrape with Python is dependent on how well you understand web pages. If you already know what HTML and CSS are, understand network requests, and can code in Python, you can learn to web scrape in a matter of hours. If you are missing this prerequisite knowledge, learning to web scrape can take a week or more. It all depends on what you know.

The best way to learn is to understand the fundamental components of a web page: HTML and CSS. Once you are comfortable with those, learn how basic client-server interactions work for a webpage. Finally, learn Python to the point where you can work with variables and object instantiation.

At this point, you have enough coding and web page knowledge to learn web scraping applications in a matter of hours.

Can I scrape any website using Python?

Python is a great start for scraping any website. However, there are limitations to using Python for web scraping. To be able to scrape any website you may need to deploy more than just Python.

Server-side rendered content is the best fit for pure request-based Python scraping. For client-side rendered content, you will need a tool like Selenium to render the page and mimic user interactions. This is not the only tool you might use, but it is one of the most commonly paired.

Along with client-side rendered content, you must pay attention to authentication and authorization, rate limiting, and legal limitations. For more information on these, please review other frequently asked questions.

How can I avoid being blocked while scraping?

Not every website wants to be scraped. While they might not explicitly state in their robots.txt file that they don’t allow scraping, certain websites will employ tactics to discourage scraping. These tactics rely on knowing that requests are coming from a single source and limiting access to this source. There are many strategies to get around these limits. Let’s take a quick look at the top 3 methods.

1. User Agents

A user-agent is a string sent with your request that identifies what kind of device and browser you are using. Certain user agents are flagged as bots. To avoid being flagged, you can spoof your user-agent value.
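
A minimal sketch of setting a custom user-agent header with the requests library; the user-agent string below is just an example of a common desktop browser value:

// Python code for setting a custom user-agent

import requests

headers = {
    # Example desktop browser user-agent string
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'
}

response = requests.get('https://httpbin.org/user-agent', headers=headers)
print(response.json())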

2. Proxies

Websites will look for certain identifiers like your IP address to determine if you are making too many requests. You can use proxies to make your requests look like they originate from many IP addresses.

3. Rate Limiting

Websites will limit the number of requests a single source can make over a timespan (for example, 20 requests per minute). You can back off your requests to remain in their accepted range and avoid getting blocked.

What are some common errors in web scraping and how to fix them?

You will encounter errors while trying to get your web scraping script working. Some are more common than others but are not too difficult to account for. Let’s review the 3 most common.

1. HTTP Errors

At some point you will be greeted with 4xx or 5xx errors. These happen when a request fails because of either a client error (you messed up) or a server error (the target is having issues).

To account for these HTTP errors, add try/except blocks to your web scraping scripts so your scraper doesn’t crash on a single bad HTTP request.
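
A minimal sketch using the requests library's raise_for_status() together with a try/except block; the URL is a placeholder that always returns a 404:

// Python code for handling HTTP errors

import requests

url = 'https://httpbin.org/status/404'

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    print("Page fetched successfully")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error for {url}: {e}")
except requests.exceptions.RequestException as e:
    print(f"Request failed for {url}: {e}")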

2. Incorrect Selectors

You don’t want your script to break because you made a mistake with your selectors or the page structure is inconsistent. If you do not account for selectors returning incorrect or missing data, you may end up with a broken script or bad data.

To correct for this, add fallback selectors to account for variations in page content and structure.
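
For example, with Beautiful Soup you can try a primary selector and fall back to an alternative when it returns nothing; the selectors and HTML below are hypothetical.

// Python code for fallback selectors

from bs4 import BeautifulSoup

html = "<html><body><h2 class='product-name'>Example Widget</h2></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Try the primary selector first, then a fallback if the structure changed
title = soup.select_one('h1.product-title') or soup.select_one('h2.product-name')

if title:
    print(title.get_text(strip=True))
else:
    print("No title found with any known selector")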

3. JavaScript Rendered Content

There are times when you may not get any data back even after checking your scraping script multiple times. This is often due to JavaScript-rendered content, which is not present in the initial HTML response you receive.

To fix this, you can use a library like Selenium, which renders the page's JavaScript before you extract data.
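
A minimal Selenium sketch that loads a page in a real browser and reads the rendered content; it assumes a local Chrome installation, and the URL is a placeholder.

// Python code for scraping JavaScript-rendered content with Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires Chrome installed locally
driver.get('https://example.com')

# The browser has executed the page's JavaScript, so rendered content is available
print(driver.title)
for heading in driver.find_elements(By.TAG_NAME, 'h1'):
    print(heading.text)

driver.quit()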

How do I handle CAPTCHA while web scraping?

CAPTCHAs were created to defeat web scrapers, so how are we supposed to bypass them? As with every hurdle in software, more software is the answer.

If you want to handle CAPTCHAs while web scraping you can use a CAPTCHA-solving service such as 2Captcha. These services can be slow and cost money. Do research before you settle on one product.

What are the alternatives to web scraping?

Web scraping is not the only way to gather data for your app or service. There are three other common methods for collecting data: APIs, open data portals, and paid data services. APIs (application programming interfaces) are code layers that allow two programs to talk to each other. In this case, an API lets your app request data directly from the provider instead of scraping pages, which is helpful because you get structured data without writing scraping scripts.

Drawbacks of APIs include the need for authorization, poor documentation, breaking changes, and high prices. All of the limits from APIs are dictated by the owners of the API. If they don’t document their service well it is hard to use. If they introduce breaking changes in a version update, then your app may break.

Another alternative to web scraping is using open data portals. These are usually public resources that combine public and reusable data into an easy-to-use interface. The benefit of open data portals is that they are usually free. The main downside is that the data is often exposed only through a web interface or a limited API, which can make programmatic use awkward.

The final popular alternative to web scraping is paid data services. These are similar to APIs in that they offer a code interface to retrieve data but differ in quality and support. Paid data services cost money, which is their main downside. However, because they cost money, they usually offer more support and are less likely to introduce breaking changes.
