How To Scrape Amazon: Product Data & Reviews (2025)

Learn how to scrape Amazon for product data and reviews in 2025. Explore the best Python tools, avoid anti-scraping measures, and gather valuable insights.

Live Proxies Editorial Team

11 February 2025

Scraping e-commerce websites like Amazon and retrieving product data and reviews has become an invaluable business strategy. With over 300 million active customers and 12 million products listed on the platform, Amazon offers a massive repository of data for market research, competitor analysis, and customer sentiment insights. Studies show that 90% of consumers read online reviews before making a purchase, and 72% of customers say positive reviews increase their trust in a brand. Through web scraping, companies can monitor pricing trends, consumer preferences, and other factors that help optimize their strategies. This article walks through how to scrape Amazon, clean the resulting data, and put it to use while following technical and ethical best practices.

Is It Legal to Scrape Amazon Data?

Scraping Amazon data comes with some risks. While collecting publicly available data is not always illegal, sending too many requests at once and putting pressure on Amazon’s servers can create real legal exposure. Amazon’s terms of service generally forbid scraping, and breaking those rules could lead to account bans or even legal action. Scraping might also be treated as unauthorized access to Amazon’s systems under certain laws. To play it safe, using Amazon’s official APIs is the best way to get product data while staying on the right side of the rules.

Understanding Amazon’s Anti-Scraping Measures

Amazon uses strong anti-scraping mechanisms to protect its data, which makes it difficult for automated bots to scrape information. These include IP rate limiting, session tracking, and request fingerprinting, all of which are designed to detect and block unauthorized scraping attempts. To identify these challenges, one needs specialized tools and techniques, such as browser developer tools and extensions like Wappalyzer. Below, we discuss the main anti-scraping methods Amazon uses and how to detect them.

Common Anti-Scraping Techniques Used by Amazon

  1. IP-Based Rate Limiting: Amazon monitors the frequency of requests from individual IP addresses. Exceeding certain thresholds can result in temporary or permanent IP bans (see the pacing sketch after this list).
  2. Session Management for Data: Amazon uses session-based tracking to associate specific data, like product reviews, with a user’s session. This makes it challenging to scrape reviews without maintaining valid session tokens.
  3. Request Fingerprinting: Amazon identifies patterns in request parameters. Even if you rotate IPs, sending identical request payloads can trigger blocking mechanisms.
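
The rate limiting described in point 1 has a direct practical consequence: any scraper has to pace itself. Below is a minimal sketch of randomized delays between requests; the delay bounds are illustrative assumptions, not Amazon’s actual thresholds.

import random
import time

import requests

urls = ["https://www.amazon.com/dp/B0CLHH6DQN"]  # pages to fetch

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Wait 3-8 seconds between requests to stay under per-IP rate limits.
    # These bounds are assumptions for illustration, not documented limits.
    time.sleep(random.uniform(3, 8))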

Tools to Detect Anti-Scraping Challenges

  1. Detecting anti-scraping measures can be straightforward or quite tricky, depending on the website. One tool that can help is the Wappalyzer extension, which shows you the technologies a site is using, including popular anti-bot services. Amazon is a bit different, though: it doesn’t rely on third-party anti-bot services.
  2. Another way to inspect the website is through browser developer tools. Perform tests like blocking JavaScript files to check for browser fingerprinting and dynamic content loading mechanisms. These steps help you identify anti-scraping techniques effectively.

How to Set Up Your Environment for Scraping Amazon

Before you start scraping Amazon, it's essential to set up the right environment and tools. The following steps will help you get everything in place to begin your web scraping project efficiently.

  • Python: The programming language for web scraping.
  • Pip: Python’s package manager.
  • BeautifulSoup: A library for parsing HTML.
  • Requests: For sending HTTP requests.

Configuring Your Development Environment

  1. Install Python from python.org.
  2. Create a virtual environment with python -m venv env and activate it (source env/bin/activate on Linux/macOS, env\Scripts\activate on Windows).
  3. Install the necessary libraries: pip install requests beautifulsoup4

Set up your IDE for an organized workflow, ensuring efficient script management.

How to Extract Product Data from Amazon

Scraping product data from Amazon involves identifying key HTML elements and parsing them into a structured format. Start by inspecting the page structure with browser developer tools to locate details such as the product name, price, and ratings. Once you have identified the elements, you can extract them with Python and BeautifulSoup, sending requests that mimic real browser behavior to reduce the chance of being flagged as a bot. Use browser developer tools to inspect the HTML structure of Amazon’s product pages and target elements such as:

  • Product Name: //span[@id="productTitle"]/text()
  • Price: (//span[@class="a-offscreen"]/text())[1]
  • Ratings: //span[@id="acrPopover"]/@title
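
The selectors above are XPath expressions. If you want to evaluate them directly instead of translating them into BeautifulSoup calls, the lxml library can run XPath against a fetched page. A minimal sketch, assuming response holds a successful product-page response like the one fetched in the next section:

from lxml import html

# Parse the raw HTML into an lxml tree
tree = html.fromstring(response.content)

# Evaluate the same XPath expressions listed above
title = tree.xpath('//span[@id="productTitle"]/text()')
price = tree.xpath('(//span[@class="a-offscreen"]/text())[1]')
rating = tree.xpath('//span[@id="acrPopover"]/@title')

print(title[0].strip() if title else "N/A")
print(price[0] if price else "N/A")
print(rating[0] if rating else "N/A")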

Parsing HTML for Product Data

Here’s a sample code snippet to extract product details:

from bs4 import BeautifulSoup
import requests

# URL of the Amazon product page
url = "https://www.amazon.com/Samsung-SM-S711B-Factory-Unlocked-International/dp/B0CLHH6DQN"

# Headers to mimic a real browser and avoid bot detection
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not A(Brand";v="8", "Chromium";v="132", "Google Chrome";v="132"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
}

# Send a GET request to the Amazon page
response = requests.get(url, headers=headers)

# Check if the response is valid (status code 200 and sufficient content length)
if response.status_code != 200 or len(response.text) < 10000:
    raise requests.exceptions.RequestException("Failed to fetch the page or encountered a captcha.")

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Extract product details
product_details = {
    "title": soup.find("span", id="productTitle").get_text(strip=True),  # Product title
    "ratings": soup.find("span", id="acrPopover")["title"],  # Average rating
    "price": soup.find("span", class_="a-offscreen").get_text(strip=True),  # Product price
}

# Print the extracted product details
print(product_details)
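
Note that the find() calls above assume every element is present; on a CAPTCHA page or a variant layout they return None, and the chained .get_text() call raises an AttributeError. Here is a small defensive variant, using a safe_text helper of our own (not part of BeautifulSoup):

def safe_text(element, attr=None):
    # Return an attribute or the stripped text of a tag, or "N/A" if missing
    if element is None:
        return "N/A"
    return element[attr] if attr else element.get_text(strip=True)

product_details = {
    "title": safe_text(soup.find("span", id="productTitle")),
    "ratings": safe_text(soup.find("span", id="acrPopover"), attr="title"),
    "price": safe_text(soup.find("span", class_="a-offscreen")),
}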

How to Scrape Amazon Reviews

Extracting Amazon reviews means navigating through multiple pages, each of which requires valid session information. Since Amazon has updated its pagination system, you must include cookies and session tokens to access review pages. Below is a step-by-step guide to scraping reviews from an Amazon product using Python and BeautifulSoup.

Navigating Pagination for Reviews

Amazon reviews are usually spread across multiple pages, so you’ll need to automate the navigation through them using libraries like Requests. Recently, Amazon made some changes to how reviews are paginated. Now, you need to pass along a user session with each request. This means that for every new page of reviews you want to access, you’ll need to include valid session details like cookies and session tokens to get the data successfully.

You can control page navigation of Amazon reviews by modifying the pageNumber parameter in the URL, which determines which page of reviews is displayed. For example, in the URL https://www.amazon.com/product-reviews/B0CLHH6DQN/ref=cm_cr_arp_d_paging_btm_next_2?pageNumber=3, pageNumber=3 indicates that you're viewing the third page of reviews. To access the next set of reviews, simply increase the pageNumber value (e.g., pageNumber=4 for the fourth page).

Here's an example of extracting reviews from the initial page:

Note: You need to pass valid cookies; otherwise, Amazon will return a sign-in page instead of the reviews.

from bs4 import BeautifulSoup
import requests

# Change this to navigate to a different page
page_number = 1

# URL for the Amazon product reviews page
url = f"https://www.amazon.com/product-reviews/B0CLHH6DQN/ref=cm_cr_arp_d_paging_btm_next_2?pageNumber={page_number}"

# Logged-in session cookies (replace with actual cookies if needed)
cookies = {
    # Example: 'session-id': '1234567890',
}

# Custom headers to mimic a real browser and avoid bot detection
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',  # Do Not Track preference
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not A(Brand";v="8", "Chromium";v="132", "Google Chrome";v="132"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
}

# Send a GET request to fetch the reviews page
response = requests.get(url, headers=headers, cookies=cookies)

# Check if the response is valid (status code 200 and sufficient content length)
if response.status_code != 200 or len(response.text) < 10000:
    raise requests.exceptions.RequestException("Failed to fetch the page or encountered a captcha. Status code: {}, Content length: {}".format(response.status_code, len(response.text)))

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Extract review elements based on the new selector for 'li' elements
review_elements = soup.find_all("li", class_="review aok-relative")
reviews = []

# Loop through the review elements and extract relevant data
for review_element in review_elements:
    reviews.append({
        "reviewer_name": review_element.find("span", class_="a-profile-name").get_text(strip=True) if review_element.find("span", class_="a-profile-name") else "N/A",  # Reviewer's name
        "review_title": [span.get_text(strip=True) for span in review_element.select("a[data-hook='review-title'] span")][2] if len(review_element.select("a[data-hook='review-title'] span")) > 2 else "N/A",  # Review title (3rd span in the selector)
        "review_date": review_element.find("span", {"data-hook": "review-date"}).get_text(strip=True) if review_element.find("span", {"data-hook": "review-date"}) else "N/A",  # Review date
        "review_text": review_element.find("span", {"data-hook": "review-body"}).find("span").get_text(strip=True) if review_element.find("span", {"data-hook": "review-body"}) else "N/A",  # Review text
        "review_rating": review_element.find("a", {"data-hook": "review-title"}).find("i").find("span").get_text(strip=True) if review_element.find("a", {"data-hook": "review-title"}) else "N/A",  # Review rating
    })
# Print the list of reviews
print(reviews)
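
To collect reviews from several pages, wrap the fetch-and-parse logic above in a loop over pageNumber, pausing between requests. A minimal sketch, reusing the headers and cookies defined above and assuming a hypothetical parse_reviews(soup) helper that wraps the extraction loop:

import time

all_reviews = []

for page_number in range(1, 6):  # first five pages
    url = (
        "https://www.amazon.com/product-reviews/B0CLHH6DQN/"
        f"ref=cm_cr_arp_d_paging_btm_next_2?pageNumber={page_number}"
    )
    response = requests.get(url, headers=headers, cookies=cookies)
    if response.status_code != 200 or len(response.text) < 10000:
        break  # blocked, captcha, or no more review pages
    soup = BeautifulSoup(response.content, 'html.parser')
    all_reviews.extend(parse_reviews(soup))  # hypothetical helper wrapping the extraction loop above
    time.sleep(3)  # pause between pages to reduce the chance of a block

print(len(all_reviews), "reviews collected")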

Post-Processing Data

Clean and transform the scraped data:

# Import datetime module for date manipulation
from datetime import datetime

def clean_review_data(data):
    """
    Cleans up the review date and review rating for a list of reviews.

    Args:
        data (list): A list of dictionaries, where each dictionary contains review data with keys:
                     - "review_date": Raw date string (e.g., "Reviewed in the United States on January 1, 2022").
                     - "review_rating": Raw rating string (e.g., "4.5 out of 5 stars").
    
    Returns:
        list: A list of dictionaries with cleaned review data, where:
              - "review_date" is reformatted to "YYYY-MM-DD".
              - "review_rating" is converted to a float.
    """
    for review in data:
        # Clean and reformat the review date:
        # Remove the prefix "Reviewed in the United States on " and convert it to "YYYY-MM-DD" format.
        review["review_date"] = datetime.strptime(
            review["review_date"].replace("Reviewed in the United States on ", ""), "%B %d, %Y"
        ).strftime("%Y-%m-%d")
        
        # Extract and convert the review rating to a float:
        # Take the numeric portion (e.g., "4.5") from the rating string and convert it to a float.
        review["review_rating"] = float(review["review_rating"].split(" ")[0])
    
    # Return the cleaned review data
    return data

# Apply the cleaning function to the reviews list
cleaned_reviews = clean_review_data(reviews)

# Print the cleaned reviews to verify the output
print(cleaned_reviews)
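
Keep in mind that the cleaner assumes US-marketplace date strings and numeric ratings; reviews scraped from other locales, or rows where extraction fell back to "N/A", will raise an exception. A defensive variant that skips unparseable rows instead:

def clean_review_data_safe(data):
    # Like clean_review_data, but skips rows that fail to parse
    cleaned = []
    for review in data:
        try:
            review["review_date"] = datetime.strptime(
                review["review_date"].replace("Reviewed in the United States on ", ""),
                "%B %d, %Y",
            ).strftime("%Y-%m-%d")
            review["review_rating"] = float(review["review_rating"].split(" ")[0])
            cleaned.append(review)
        except (ValueError, AttributeError):
            continue  # non-US date format or an "N/A" placeholder; skip the row
    return cleaned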

How to Handle Amazon’s Anti-Scraping Measures

Bypassing Amazon's anti-scraping measures is challenging, but a few complementary methods make it manageable. Rotating proxy servers and handling CAPTCHA challenges are the primary techniques for successful data extraction. The next steps describe how to avoid IP blocks and handle CAPTCHA challenges when scraping Amazon.

Implement rotating proxies to prevent IP blocks:

proxies = {
    "http": "http://proxy_ip:proxy_port",
    "https": "http://proxy_ip:proxy_port"
}
response = requests.get(url, headers=headers, proxies=proxies)
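
The dictionary above routes every request through a single proxy; rotation means switching proxies between requests. A minimal sketch that cycles through a pool, reusing the browser-like headers defined earlier; the endpoint addresses are placeholders for whatever your provider issues:

from itertools import cycle

import requests

# Placeholder endpoints; substitute the rotating proxies from your provider.
proxy_pool = cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

urls_to_scrape = ["https://www.amazon.com/dp/B0CLHH6DQN"]  # example target

for url in urls_to_scrape:
    proxy = next(proxy_pool)  # a different proxy for each request
    response = requests.get(
        url,
        headers=headers,  # reuse the browser-like headers defined earlier
        proxies={"http": proxy, "https": proxy},
    )
    print(url, "->", response.status_code)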

Why Live Proxies is a Better Choice

Live Proxies stands out as an excellent provider for rotating proxies due to several key advantages:

  • Transparency: Live Proxies is committed to addressing concerns openly, ensuring you're always informed about the services and features.
  • Private Proxies: Unlike shared proxy pools, Live Proxies offers private proxies, meaning your IPs are unique to you and not shared with others.
  • Reliability: Clients who need long-term solutions appreciate the consistent proxy quality and the fast resolution of any concerns.
  • Excellent Customer Support: Live Proxies offers 24/7 customer support, ensuring quick responses and client satisfaction.
  • Highest Quality IPs: With stable and high-quality IPs, Live Proxies ensures that your requests remain unblocked on Amazon and other websites.
  • Custom B2B Network: Tailored solutions for large-scale enterprise requirements ensure that Live Proxies meets your specific needs, whether you're handling small tasks or large scraping projects.

Amazon primarily uses image CAPTCHAs that display distorted text. To handle these CAPTCHAs:

  1. Retry Requests: When encountering a CAPTCHA, retry the request after verifying the response status code and content (see the sketch after this list). Note that Amazon’s responses may have a status code of 200 but still contain a CAPTCHA challenge.
  2. Use CAPTCHA-Solving Services: If solving CAPTCHAs is necessary, rely on third-party services like 2Captcha or Capsolver that provide automated CAPTCHA-solving solutions for a cost.
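
Here is a minimal retry sketch for the first approach. The CAPTCHA check mirrors the content-length heuristic used in the earlier snippets, and the backoff timings are illustrative assumptions:

import time

import requests

def fetch_with_retries(url, headers, max_retries=3):
    # Fetch a page, retrying with backoff when the response looks like a CAPTCHA
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        # Amazon may return HTTP 200 for a CAPTCHA page, so check the body length too
        if response.status_code == 200 and len(response.text) >= 10000:
            return response
        time.sleep(2 ** attempt * 5)  # back off: 5s, 10s, 20s (illustrative values)
    raise RuntimeError(f"Still blocked after {max_retries} attempts: {url}")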

Best Practices for Scraping Amazon in 2025

Scraping Amazon data helps businesses in 2025 understand market trends, customer behavior, and competitor actions. However, the insights are only as valuable as the data behind them, so data accuracy and script reliability are critical. Below are best practices for both.

Ensuring Data Accuracy

To make sure your data is accurate, it's a good idea to add validation checks. For example, you can confirm that fields like price are of the right type and format, such as a positive number once the currency symbol is stripped. You can also set up rules to validate other fields and catch inconsistencies, as in the sketch below.
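
A minimal sketch of such checks, run against the product_details dictionary built earlier; the expected formats are assumptions based on the pages scraped above:

def validate_product(details):
    # Run basic sanity checks on a scraped product record
    errors = []

    if not details.get("title"):
        errors.append("missing title")

    # Price should look like "$399.99" and parse to a positive number
    price = details.get("price", "").replace("$", "").replace(",", "")
    try:
        if float(price) <= 0:
            errors.append("non-positive price")
    except ValueError:
        errors.append("unparseable price")

    # Ratings arrive as e.g. "4.3 out of 5 stars" and should fall within 1-5
    try:
        rating = float(details.get("ratings", "").split(" ")[0])
        if not 1.0 <= rating <= 5.0:
            errors.append("rating out of range")
    except ValueError:
        errors.append("unparseable rating")

    return errors

print(validate_product(product_details))  # an empty list means the record passed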

Read more: Best Practices for Data Verification and Validation

Maintaining Script Reliability

To keep your script running smoothly, you could create a custom parser just for Amazon and set up unit tests to check that everything’s working as expected. It’s important to regularly test this parser, especially since Amazon updates its site often and pages can have different structures. Having a dedicated parser will make things easier. Plus, using version control like Git and automating your workflow with CI/CD tools can help ensure your code stays reliable and up to date.
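
As a concrete example, if you factor the extraction logic above into a parse_product(html) function (a hypothetical refactor) and commit a saved copy of a product page as a fixture, a unit test can flag layout changes the moment they break parsing:

import unittest

class TestAmazonParser(unittest.TestCase):
    def test_parse_product_fixture(self):
        # A saved copy of a product page, committed alongside the tests
        with open("fixtures/product_page.html", encoding="utf-8") as f:
            page_html = f.read()
        details = parse_product(page_html)  # hypothetical wrapper around the BeautifulSoup code
        self.assertTrue(details["title"])
        self.assertTrue(details["price"].startswith("$"))

if __name__ == "__main__":
    unittest.main()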

Read more: Scraping Reliability (7 Tips for Nocoders)

Alternatives to Scraping: Using Amazon’s APIs

While scraping Amazon’s website can provide valuable data, it comes with legal and ethical risks. In contrast, using Amazon's official APIs offers a more compliant, structured, and stable way to access data. Below is a comparison between scraping and using APIs, highlighting key differences in legality, reliability, cost, and scalability.

| Aspect      | Scraping                            | API                              |
| ----------- | ----------------------------------- | -------------------------------- |
| Legality    | Risky; may violate terms of service | Fully legal and compliant        |
| Data Access | Access all visible page data        | Limited to structured API data   |
| Reliability | Breaks with site changes            | Stable and well-maintained       |
| Cost        | Usually free but resource-heavy     | Often paid with usage limits     |
| Ease of Use | Requires parsing and bot evasion    | Simplified, structured responses |
| Scalability | Needs proxies and infrastructure    | Limited by rate quotas           |
| Ethics      | Questionable without consent        | Fully ethical and approved       |

How to Get Started with Amazon’s APIs

  • Create an Amazon Developer Account: Sign up on the Amazon Developer Portal.
  • Request API Access: Apply for access to the required API, such as the Product Advertising API or Amazon Business API.
  • Get Your API Token: After approval, you'll receive an API token for authentication.
  • Follow the Guide: Use the Product Search API v1 Reference to get started with making API requests.

How to Use Amazon Data for Business Insights

Amazon data can be of great help to businesses in tracking competitor activities, monitoring product performance, and leveraging historical trends. Below are some key ways to use Amazon data effectively for business growth.

  • Competitive Analysis: Track competitors' pricing and reviews to understand their strategies and adjust yours accordingly.
  • Monitoring Product Performance: Analyze trends in sales ranks and customer feedback to assess product demand and customer satisfaction.
  • Leveraging Historical Data: Use past pricing and review patterns to predict future trends and make informed decisions.

Analyzing Amazon Reviews Using Pandas

Extracting and analyzing Amazon reviews can provide valuable insights into customer satisfaction, product performance, and common concerns. Using Pandas, we can efficiently process the data, visualize rating distributions, and track trends over time. The analysis includes:

  • Rating Distribution: A count plot shows the frequency of each rating, highlighting customer sentiment.
  • Average Rating Over Time: A line chart illustrates how ratings fluctuate monthly, indicating any shifts in user satisfaction.
  • Review Text Analysis: A word cloud reveals frequently mentioned words, helping identify common themes in user feedback.

By leveraging Pandas, Seaborn, and Matplotlib, we can transform raw reviews into actionable business insights.

Installing Required Libraries

To perform this analysis, install the necessary Python libraries using the following command:

pip install pandas matplotlib seaborn wordcloud

Complete code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample reviews data
reviews = [
    {'reviewer_name': 'M. C. Gianetta', 'review_title': 'Exactly as described', 'review_date': 'Reviewed in the United States on November 18, 2024', 'review_text': "Ordering a phone online can sometimes be challenging. Occasionally things don't meet expectations. That's not the case here. Brand new in a sealed box. A good phone at a good price, delivered on a timely basis.", 'review_rating': '5.0 out of 5 stars'},
    {'reviewer_name': 'Bryan k Boone', 'review_title': 'So far so good', 'review_date': 'Reviewed in the United States on September 8, 2024', 'review_text': "Haven't had it a month yet but I'm pleased. Great pictures. Battery lasts much longer than my galaxy 9s. It arrived in great condition,  no problem with anything. It's a excellent price. One thing I dislike but isn't a huge deal, the charger cord is short.", 'review_rating': '5.0 out of 5 stars'},
    {'reviewer_name': 'Annie', 'review_title': 'Samsung S23 Fe', 'review_date': 'Reviewed in the United States on October 7, 2024', 'review_text': 'I love my new phone, works really well. Really excellent  upgrade from an A72. It does get hot when charging, but it cools down when placed in front of a fan or a/c. Very easy to transfer data to new phone. This was definitely a brand new phone. I will recommend this phone to anyone. Would have given 5 stars if I had gotten a charger block as promised. It only came with the USB cable.', 'review_rating': '4.0 out of 5 stars'},
    {'reviewer_name': 'Christine J. Elliston', 'review_title': 'It works well just as expected.', 'review_date': 'Reviewed in the United States on January 17, 2025', 'review_text': "I bought it for my husband's birthday and he just loves it.", 'review_rating': '5.0 out of 5 stars'},
    {'reviewer_name': 'Chris', 'review_title': 'Its an international phone from India', 'review_date': 'Reviewed in the United States on January 2, 2025', 'review_text': 'After using this phone and everything is set to india. And nothing works. And cant have apps no voicemail. Called samsung they said its an international phone no where in the description says international phone or that its from india. Misleading and terrible phone.', 'review_rating': '1.0 out of 5 stars'},
    {'reviewer_name': 'E&C', 'review_title': "Couldn't be happier", 'review_date': 'Reviewed in the United States on October 22, 2024', 'review_text': "I've had this phone for a while now and it's perfect. The color resolution is amazing. Camara is top notch. The battery last all day without needed extra charge. The cost is was well blow the regular. Speed is super fast. I love this phone and would buy again", 'review_rating': '5.0 out of 5 stars'},
    {'reviewer_name': 'Marcelo', 'review_title': 'Funciona sin problemas en Argentina', 'review_date': 'Reviewed in the United States on December 20, 2024', 'review_text': 'Funciona sin problemas en Argentina. Sería bueno que incluyan el cargador como parte del combo de compra. Solo trae el cable.', 'review_rating': '5.0 out of 5 stars'},
    {'reviewer_name': 'Amazon Customer', 'review_title': 'Good quality', 'review_date': 'Reviewed in the United States on September 12, 2024', 'review_text': 'Nice case looks good feel ok. Glad I got it.', 'review_rating': '4.0 out of 5 stars'},
    {'reviewer_name': 'Tajera', 'review_title': 'Has expected', 'review_date': 'Reviewed in the United States on November 6, 2024', 'review_text': "This wasn't damaged at all and it exactly like I read and saw", 'review_rating': '4.0 out of 5 stars'},
    {'reviewer_name': 'Placeholder', 'review_title': 'GB AND HOW LONG WITHOUT CHARGING', 'review_date': 'Reviewed in the United States on August 20, 2024', 'review_text': "You can get around  8 hours without charging, leaving about 28% left. The phone does run hot but someone said it's because of the type processor used on international phones. The western phone has snapdragon? This phone has exynos (I believe). It has all that the western phone has. You need to check to see if Samsung pay is on the international phone , someone said it wasn't. Otherwise a good phone price less than the western phone. I gave it a 4 because it runs hot (sometimes).", 'review_rating': '4.0 out of 5 stars'}
]

# Convert the reviews data into a Pandas DataFrame
df = pd.DataFrame(reviews)

# Clean the 'review_rating' column to extract numeric ratings
df['review_rating'] = df['review_rating'].str.extract(r'(\d+\.\d+)').astype(float)

# Visualizations
plt.figure(figsize=(12, 6))

# 1. Distribution of Ratings
plt.subplot(1, 2, 1)
sns.countplot(x='review_rating', data=df, palette='viridis')
plt.title('Distribution of Review Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')

# 2. Average Rating Over Time
# Extract the review date and convert to datetime
df['review_date'] = pd.to_datetime(df['review_date'].str.extract(r'on (.*)')[0])
df.set_index('review_date', inplace=True)

plt.subplot(1, 2, 2)
df.resample('M')['review_rating'].mean().plot(kind='line', marker='o', color='orange')
plt.title('Average Rating Over Time')
plt.xlabel('Review Date')
plt.ylabel('Average Rating')

plt.tight_layout()
plt.show()

# 3. Word Cloud for Review Text (Optional)
from wordcloud import WordCloud

# Combine all review text into a single string
text = " ".join(review for review in df['review_text'])

# Generate a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Review Text')
plt.show()

Conclusion

Scraping Amazon takes a mix of technical know-how and a strong commitment to legal and ethical standards. By using tools like Requests, BeautifulSoup, and rotating proxies, you can responsibly collect valuable product and review data while staying on the right side of the rules. It's all about balancing the right tools with respect for Amazon's policies.

Frequently Asked Questions About Scraping Amazon

Is scraping Amazon against its Terms of Service?

Yes, it violates Amazon’s ToS. However, scraping public data may be permissible under specific legal contexts.

Can I use proxies to scrape Amazon?

Yes, proxies help bypass IP blocks. Use residential or rotating proxies for best results.

How do I scrape Amazon data without being blocked?

Use strategies like user-agent rotation, request delays, and proxies to avoid detection.

What’s the difference between scraping and APIs for Amazon data?

Scraping provides flexibility but risks legal issues, while APIs offer structured, legal access to data.
