Scraping e-commerce websites like Amazon and retrieving product data and reviews has become an invaluable business strategy. With over 300 million active customers and 12 million products listed on the platform, Amazon offers a massive repository of data for market research, competitor analysis, and customer sentiment insights. Studies show that 90% of consumers read online reviews before making a purchase, and 72% of customers say positive reviews increase their trust in a brand. Through web scraping, companies can monitor pricing trends, consumer preferences, and other factors that help optimize their strategies. This article explains how to scrape Amazon, clean the resulting data, and put it to use while following technical and ethical best practices.
Is It Legal to Scrape Amazon Data?
Scraping Amazon data comes with some risks. While grabbing publicly available data may not always be illegal, making too many requests at once and putting pressure on Amazon’s servers can definitely cause legal issues. Amazon’s terms of service usually forbid scraping and breaking those rules could lead to account bans or even legal action. Plus, scraping might be seen as unauthorized access to their systems, which could land you in hot water under certain laws. To play it safe, using Amazon’s official APIs is the best way to get product data while staying on the right side of the rules.
Understanding Amazon’s Anti-Scraping Measures
Amazon uses strong anti-scraping mechanisms to protect its data, which makes it difficult for automated bots to scrape information. These include IP rate limiting, session tracking, and request fingerprinting, all of which are designed to detect and block unauthorized scraping attempts. To identify these challenges, one needs specialized tools and techniques, such as browser developer tools and extensions like Wappalyzer. Below, we discuss the main anti-scraping methods Amazon uses and how to detect them.
Common Anti-Scraping Techniques Used by Amazon
- IP-Based Rate Limiting: Amazon monitors the frequency of requests from individual IP addresses. Exceeding certain thresholds can result in temporary or permanent IP bans.
- Session Management for Data: Amazon uses session-based tracking to associate specific data, like product reviews, with a user’s session. This makes it challenging to scrape reviews without maintaining valid session tokens.
- Request Fingerprinting: Amazon identifies patterns in request parameters. Even if you rotate IPs, sending identical request payloads can trigger blocking mechanisms.
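One practical consequence of IP-based rate limiting is that requests sent at a fixed, machine-like interval are easy to flag. A minimal sketch of pacing requests with randomized delays is shown below; the `paced_fetch` helper and the delay range are illustrative assumptions, not an Amazon-specified threshold:

```python
import random
import time

def paced_fetch(fetch, urls, min_delay=2.0, max_delay=6.0):
    """Call fetch(url) for each URL, sleeping a randomized interval
    between requests so they don't arrive at a fixed, bot-like cadence."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no delay before the first request
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function (no network needed):
print(paced_fetch(len, ["https://example.com/a", "https://example.com/b"], 0, 0))
```

The same wrapper works with `requests.get` as the `fetch` argument once you add real headers and error handling.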
Tools to Detect Anti-Scraping Challenges
- Detecting anti-scraping measures can be pretty straightforward or quite tricky, depending on the website. One tool that can help is the Wappalyzer extension, which shows you the technologies a site is using, including popular anti-bot services. But Amazon is a bit different: it doesn’t rely on third-party services.
- Another way to inspect the website is through browser developer tools. Perform tests like blocking JavaScript files to check for browser fingerprinting and dynamic content loading mechanisms. These steps help you identify anti-scraping techniques effectively.
How to Set Up Your Environment for Scraping Amazon
Before you start scraping Amazon, it's essential to set up the right environment and tools. The following steps will help you get everything in place to begin your web scraping project efficiently.
- Python: The programming language for web scraping.
- Pip: Python’s package manager.
- BeautifulSoup: A library for parsing HTML.
- Requests: For sending HTTP requests.
Configuring Your Development Environment
- Install Python from python.org.
- Create a virtual environment using python -m venv env
- Install necessary libraries: pip install requests beautifulsoup4
Set up your IDE for an organized workflow, ensuring efficient script management.
How to Extract Product Data from Amazon
Scraping product data from Amazon involves identifying the key HTML elements and parsing them into a structured format. Start by inspecting the page structure with browser developer tools to locate details such as the product name, price, and ratings. Once you have identified the elements, you can extract them with Python and BeautifulSoup, sending requests with browser-like headers to mimic real user behavior and reduce the chance of bot detection. Use browser developer tools to inspect the HTML structure of Amazon’s product pages and target elements such as:
- Product Name: //span[@id="productTitle"]/text()
- Price: (//span[@class="a-offscreen"]/text())[1]
- Ratings: //span[@id="acrPopover"]/@title
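Note that the selectors above are XPath expressions, which BeautifulSoup does not evaluate; a library such as lxml (`pip install lxml`) can. The sketch below applies the exact same XPaths to a trimmed, stand-in page fragment rather than Amazon's full markup:

```python
from lxml import html  # third-party: pip install lxml

# Stand-in fragment of a product page (illustrative, not Amazon's real markup)
sample = """
<html><body>
  <span id="productTitle"> Samsung Galaxy S23 FE </span>
  <span class="a-offscreen">$399.99</span>
  <span class="a-offscreen">$449.99</span>
  <span id="acrPopover" title="4.3 out of 5 stars"></span>
</body></html>
"""

tree = html.fromstring(sample)
# The same XPath expressions listed above:
title = tree.xpath('//span[@id="productTitle"]/text()')[0].strip()
price = tree.xpath('(//span[@class="a-offscreen"]/text())[1]')[0]
rating = tree.xpath('//span[@id="acrPopover"]/@title')[0]
print(title, price, rating)
```

The `(…)[1]` around the price XPath matters: `a-offscreen` spans appear several times per page, and XPath indexing (1-based) picks the first one.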
Parsing HTML for Product Data
Here’s a sample code snippet to extract product details:
from bs4 import BeautifulSoup
import requests
# URL of the Amazon product page
url = "https://www.amazon.com/Samsung-SM-S711B-Factory-Unlocked-International/dp/B0CLHH6DQN"
# Headers to mimic a real browser and avoid bot detection
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-IN,en;q=0.9',
'cache-control': 'no-cache',
'dnt': '1',
'pragma': 'no-cache',
'priority': 'u=0, i',
'sec-ch-ua': '"Not A(Brand";v="8", "Chromium";v="132", "Google Chrome";v="132"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
}
# Send a GET request to the Amazon page
response = requests.get(url, headers=headers)
# Check if the response is valid (status code 200 and sufficient content length)
if response.status_code != 200 or len(response.text) < 10000:
raise requests.exceptions.RequestException("Failed to fetch the page or encountered a captcha.")
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Extract product details
product_details = {
"title": soup.find("span", id="productTitle").get_text(strip=True), # Product title
"ratings": soup.find("span", id="acrPopover")["title"], # Average rating
"price": soup.find("span", class_="a-offscreen").get_text(strip=True), # Product price
}
# Print the extracted product details
print(product_details)
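The snippet above assumes every element exists; `soup.find` returns `None` for a missing tag, so a captcha page or a redesigned layout raises `AttributeError`. A small defensive helper (a sketch; `safe_text` is not part of BeautifulSoup's API) avoids that:

```python
from bs4 import BeautifulSoup

def safe_text(soup, name, default="N/A", **attrs):
    """Return the stripped text of the first matching tag, or a default
    when the tag is missing (e.g. on a captcha or redesigned page)."""
    tag = soup.find(name, **attrs)
    return tag.get_text(strip=True) if tag else default

demo = BeautifulSoup('<span id="productTitle"> Phone </span>', "html.parser")
print(safe_text(demo, "span", id="productTitle"))  # tag present
print(safe_text(demo, "span", id="acrPopover"))    # tag missing
```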
How to Scrape Amazon Reviews
Extracting Amazon reviews requires navigating through multiple pages and passing valid session information with each request. Since Amazon updated its pagination system, you must include cookies and session tokens to access review pages. Below is a step-by-step guide to scraping reviews from an Amazon product using Python and BeautifulSoup.
Navigating Pagination for Reviews
Amazon reviews are usually spread across multiple pages, so you’ll need to automate the navigation through them using libraries like Requests. Recently, Amazon made some changes to how reviews are paginated. Now, you need to pass along a user session with each request. This means that for every new page of reviews you want to access, you’ll need to include valid session details like cookies and session tokens to get the data successfully.
You can control the page navigation of Amazon reviews by modifying the **pageNumber** parameter in the URL. This parameter determines which page of reviews is displayed. For example, in the URL https://www.amazon.com/product-reviews/B0CLHH6DQN/ref=cm_cr_arp_d_paging_btm_next_2?pageNumber=3, the pageNumber=3 indicates that you're viewing the third page of reviews. To access the next set of reviews, simply increase the pageNumber value (e.g., pageNumber=4 for the fourth page).
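The pagination scheme above can be wrapped in a small URL builder so a loop can walk the pages. This is a sketch based on the example URL; only the pageNumber query parameter actually selects the page, while the ref= path segment is reproduced for fidelity:

```python
def review_page_url(asin, page_number):
    """Build a reviews URL for a product ASIN and page number, mirroring
    the URL pattern shown above."""
    return (
        f"https://www.amazon.com/product-reviews/{asin}"
        f"/ref=cm_cr_arp_d_paging_btm_next_{page_number}"
        f"?pageNumber={page_number}"
    )

# Walk the first three review pages for the sample product:
for page in range(1, 4):
    print(review_page_url("B0CLHH6DQN", page))
```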
Here's an example of extracting reviews from the initial page:
Note: You need to pass valid cookies; otherwise, Amazon will return a sign-in page instead of the reviews.
from bs4 import BeautifulSoup
import requests
# Change this to navigate to a different page
page_number = 1
# URL for the Amazon product reviews page
url = f"https://www.amazon.com/product-reviews/B0CLHH6DQN/ref=cm_cr_arp_d_paging_btm_next_2?pageNumber={page_number}"
# Logged-in session cookies (replace with actual cookies if needed)
cookies = {
# Example: 'session-id': '1234567890',
}
# Custom headers to mimic a real browser and avoid bot detection
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-IN,en;q=0.9',
'cache-control': 'no-cache',
'dnt': '1', # Do Not Track preference
'pragma': 'no-cache',
'priority': 'u=0, i',
'sec-ch-ua': '"Not A(Brand";v="8", "Chromium";v="132", "Google Chrome";v="132"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
}
# Send a GET request to fetch the reviews page
response = requests.get(url, headers=headers, cookies=cookies)
# Check if the response is valid (status code 200 and sufficient content length)
if response.status_code != 200 or len(response.text) < 10000:
raise requests.exceptions.RequestException("Failed to fetch the page or encountered a captcha. Status code: {}, Content length: {}".format(response.status_code, len(response.text)))
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Extract review elements based on the new selector for 'li' elements
review_elements = soup.find_all("li", class_="review aok-relative")
reviews = []
# Loop through the review elements and extract relevant data
for review_element in review_elements:
reviews.append({
"reviewer_name": review_element.find("span", class_="a-profile-name").get_text(strip=True) if review_element.find("span", class_="a-profile-name") else "N/A", # Reviewer's name
"review_title": [span.get_text(strip=True) for span in review_element.select("a[data-hook='review-title'] span")][2] if len(review_element.select("a[data-hook='review-title'] span")) > 2 else "N/A", # Review title (3rd span in the selector)
"review_date": review_element.find("span", {"data-hook": "review-date"}).get_text(strip=True) if review_element.find("span", {"data-hook": "review-date"}) else "N/A", # Review date
"review_text": review_element.find("span", {"data-hook": "review-body"}).find("span").get_text(strip=True) if review_element.find("span", {"data-hook": "review-body"}) else "N/A", # Review text
"review_rating": review_element.find("a", {"data-hook": "review-title"}).find("i").find("span").get_text(strip=True) if review_element.find("a", {"data-hook": "review-title"}) else "N/A", # Review rating
})
# Print the list of reviews
print(reviews)
Post-Processing Data
Clean and transform the scraped data:
# Import datetime module for date manipulation
from datetime import datetime
def clean_review_data(data):
"""
Cleans up the review date and review rating for a list of reviews.
Args:
data (list): A list of dictionaries, where each dictionary contains review data with keys:
- "review_date": Raw date string (e.g., "Reviewed in the United States on January 1, 2022").
- "review_rating": Raw rating string (e.g., "4.5 out of 5 stars").
Returns:
list: A list of dictionaries with cleaned review data, where:
- "review_date" is reformatted to "YYYY-MM-DD".
- "review_rating" is converted to a float.
"""
for review in data:
# Clean and reformat the review date:
# Remove the prefix "Reviewed in the United States on " and convert it to "YYYY-MM-DD" format.
review["review_date"] = datetime.strptime(
review["review_date"].replace("Reviewed in the United States on ", ""), "%B %d, %Y"
).strftime("%Y-%m-%d")
# Extract and convert the review rating to a float:
# Take the numeric portion (e.g., "4.5") from the rating string and convert it to a float.
review["review_rating"] = float(review["review_rating"].split(" ")[0])
# Return the cleaned review data
return data
# Apply the cleaning function to the reviews list
cleaned_reviews = clean_review_data(reviews)
# Print the cleaned reviews to verify the output
print(cleaned_reviews)
How to Handle Amazon’s Anti-Scraping Measures
Bypassing Amazon's anti-scraping measures is challenging, but a combination of methods can overcome it. Rotating proxy servers and handling CAPTCHA challenges are the primary techniques for successful data extraction. The next sections describe how to avoid IP blocks and handle CAPTCHA challenges when scraping Amazon.
Implement rotating proxies to prevent IP blocks:
proxies = {
"http": "http://proxy_ip:proxy_port",
"https": "http://proxy_ip:proxy_port"
}
response = requests.get(url, headers=headers, proxies=proxies)
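In practice you rotate through a pool of proxies rather than reusing one. A minimal round-robin sketch is below; the proxy addresses are hypothetical placeholders to be replaced with your provider's endpoints:

```python
import itertools

import requests  # third-party, used throughout this article

# Hypothetical proxy endpoints -- replace with your provider's addresses.
proxy_list = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_cycle = itertools.cycle(proxy_list)

def get_with_rotation(url, **kwargs):
    """Send each request through the next proxy in the round-robin cycle."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, **kwargs)
```

Managed rotating-proxy services do this server-side, so a single gateway endpoint yields a different exit IP per request.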
Why Live Proxies is a Better Choice
Live Proxies stands out as an excellent provider for rotating proxies due to several key advantages:
- Transparency: Live Proxies is committed to addressing concerns openly, ensuring you're always informed about the services and features.
- Private Proxies: Unlike shared proxy pools, Live Proxies offers private proxies, meaning your IPs are unique to you and not shared with others.
- Reliability: Clients who need long-term solutions appreciate the consistent proxy quality and the fast resolution of any concerns.
- Excellent Customer Support: Live Proxies offers 24/7 customer support, ensuring quick responses and client satisfaction.
- Highest Quality IPs: With stable and high-quality IPs, Live Proxies ensures that your requests remain unblocked on Amazon and other websites.
- Custom B2B Network: Tailored solutions for large-scale enterprise requirements ensure that Live Proxies meets your specific needs, whether you're handling small tasks or large scraping projects.
Amazon primarily uses image CAPTCHAs that display distorted text. To handle these CAPTCHAs:
- Retry Requests: When encountering a CAPTCHA, retry the request after verifying the response status code and content. Note that Amazon’s responses may have a status code of 200 but still contain a CAPTCHA challenge.
- Use CAPTCHA-Solving Services: If solving CAPTCHAs is necessary, rely on third-party services like 2Captcha or Capsolver that provide automated CAPTCHA-solving solutions for a cost.
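The retry strategy can be sketched as follows. Because Amazon may serve a captcha with HTTP 200, the check inspects the body as well; the `CAPTCHA_MARKER` string is an assumption based on commonly observed blocked responses and should be verified against a real one:

```python
import random
import time

# Assumed marker text on Amazon's interstitial captcha page -- verify it.
CAPTCHA_MARKER = "api-services-support@amazon.com"

def looks_like_captcha(status_code, body):
    """Captcha pages can arrive with HTTP 200, so inspect the body too."""
    return status_code != 200 or CAPTCHA_MARKER in body or len(body) < 10000

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=5.0):
    """Retry with exponential backoff plus jitter while responses look blocked.
    fetch(url) must return a (status_code, body_text) pair."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if not looks_like_captcha(status, body):
            return body
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    raise RuntimeError(f"Still blocked after {max_attempts} attempts: {url}")
```

Pair `fetch_with_retries` with proxy rotation so each retry can go out through a fresh IP.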
Best Practices for Scraping Amazon in 2025
Scraping Amazon data helps businesses in 2025 understand market trends, customer behavior, and competitor activity. However, data accuracy and script reliability are critical to extracting valuable insights. Below are best practices for Amazon scraping.
Ensuring Data Accuracy
To make sure your data is accurate, it's a good idea to add some validation checks. For example, you can confirm that fields like price have the expected type, such as ensuring the price parses as a positive number. You can also set up rules to validate other fields to catch any inconsistencies.
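A minimal validation sketch is shown below. The field names follow the product_details dict built earlier in this article; the specific rules are illustrative assumptions, not a complete schema:

```python
def validate_product(record):
    """Sanity-check one scraped product record and return a list of errors."""
    errors = []
    price = record.get("price", "")
    if not price.startswith("$"):
        errors.append("price missing currency symbol")
    else:
        try:
            if float(price.lstrip("$").replace(",", "")) <= 0:
                errors.append("price must be positive")
        except ValueError:
            errors.append("price is not numeric")
    if not record.get("title"):
        errors.append("title is empty")
    return errors

print(validate_product({"title": "Phone", "price": "$399.99"}))  # valid record
print(validate_product({"title": "", "price": "free"}))          # invalid record
```

Running records through such checks before storage catches captcha pages and layout changes that silently yield empty or malformed fields.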
Read more: Best Practices for Data Verification and Validation
Maintaining Script Reliability
To keep your script running smoothly, you could create a custom parser just for Amazon and set up unit tests to check that everything’s working as expected. It’s important to regularly test this parser, especially since Amazon updates its site often and pages can have different structures. Having a dedicated parser will make things easier. Plus, using version control like Git and automating your workflow with CI/CD tools can help ensure your code stays reliable and up to date.
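The unit-test idea above can be sketched with the standard library's unittest module. The `parse_title` function is a hypothetical miniature of the parser, exercised against saved HTML fixtures rather than live pages:

```python
import unittest

from bs4 import BeautifulSoup

def parse_title(page_html):
    """Tiny parser under test; mirrors the title extraction shown earlier."""
    tag = BeautifulSoup(page_html, "html.parser").find("span", id="productTitle")
    return tag.get_text(strip=True) if tag else None

class TestAmazonParser(unittest.TestCase):
    def test_title_present(self):
        self.assertEqual(parse_title('<span id="productTitle"> X </span>'), "X")

    def test_title_missing_on_captcha_page(self):
        self.assertIsNone(parse_title("<div>Enter the characters you see</div>"))
```

Run the suite with `python -m unittest` after any dependency upgrade or whenever scraped fields start coming back empty, since that usually signals a markup change.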
Read more: Scraping Reliability (7 Tips for Nocoders)
Alternatives to Scraping: Using Amazon’s APIs
While scraping Amazon’s website can provide valuable data, it comes with legal and ethical risks. In contrast, using Amazon's official APIs offers a more compliant, structured, and stable way to access data. Below is a comparison between scraping and using APIs, highlighting key differences in legality, reliability, cost, and scalability.
Aspect | Scraping | API |
---|---|---|
Legality | Risky; may violate terms of service | Fully legal and compliant |
Data Access | Access all visible page data | Limited to structured API data |
Reliability | Breaks with site changes | Stable and well-maintained |
Cost | Usually free but resource-heavy | Often paid with usage limits |
Ease of Use | Requires parsing and bot evasion | Simplified, structured responses |
Scalability | Needs proxies and infrastructure | Limited by rate quotas |
Ethics | Questionable without consent | Fully ethical and approved |
How to Get Started with Amazon’s APIs
- Create an Amazon Developer Account: Sign up on the Amazon Developer Portal.
- Request API Access: Apply for access to the required API, such as the Product Advertising API or Amazon Business API.
- Get Your API Token: After approval, you'll receive an API token for authentication.
- Follow the Guide: Use the Product Search API v1 Reference to get started with making API requests.
How to Use Amazon Data for Business Insights
Amazon data can be of great help to businesses in tracking competitor activities, monitoring product performance, and leveraging historical trends. Below are some key ways to use Amazon data effectively for business growth.
- Competitive Analysis: Track competitors' pricing and reviews to understand their strategies and adjust yours accordingly.
- Monitoring Product Performance: Analyze trends in sales ranks and customer feedback to assess product demand and customer satisfaction.
- Leveraging Historical Data: Use past pricing and review patterns to predict future trends and make informed decisions.
Analyzing Amazon Reviews Using Pandas
Extracting and analyzing Amazon reviews can provide valuable insights into customer satisfaction, product performance, and common concerns. Using Pandas, we can efficiently process the data, visualize rating distributions, and track trends over time. The analysis includes:
- Rating Distribution: A count plot shows the frequency of each rating, highlighting customer sentiment.
- Average Rating Over Time: A line chart illustrates how ratings fluctuate monthly, indicating any shifts in user satisfaction.
- Review Text Analysis: A word cloud reveals frequently mentioned words, helping identify common themes in user feedback.
By leveraging Pandas, Seaborn, and Matplotlib, we can transform raw reviews into actionable business insights.
Installing Required Libraries
To perform this analysis, install the necessary Python libraries using the following command:
pip install pandas matplotlib seaborn wordcloud
Complete code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample reviews data
reviews = [
{'reviewer_name': 'M. C. Gianetta', 'review_title': 'Exactly as described', 'review_date': 'Reviewed in the United States on November 18, 2024', 'review_text': "Ordering a phone online can sometimes be challenging. Occasionally things don't meet expectations. That's not the case here. Brand new in a sealed box. A good phone at a good price, delivered on a timely basis.", 'review_rating': '5.0 out of 5 stars'},
{'reviewer_name': 'Bryan k Boone', 'review_title': 'So far so good', 'review_date': 'Reviewed in the United States on September 8, 2024', 'review_text': "Haven't had it a month yet but I'm pleased. Great pictures. Battery lasts much longer than my galaxy 9s. It arrived in great condition, no problem with anything. It's a excellent price. One thing I dislike but isn't a huge deal, the charger cord is short.", 'review_rating': '5.0 out of 5 stars'},
{'reviewer_name': 'Annie', 'review_title': 'Samsung S23 Fe', 'review_date': 'Reviewed in the United States on October 7, 2024', 'review_text': 'I love my new phone, works really well. Really excellent upgrade from an A72. It does get hot when charging, but it cools down when placed in front of a fan or a/c. Very easy to transfer data to new phone. This was definitely a brand new phone. I will recommend this phone to anyone. Would have given 5 stars if I had gotten a charger block as promised. It only came with the USB cable.', 'review_rating': '4.0 out of 5 stars'},
{'reviewer_name': 'Christine J. Elliston', 'review_title': 'It works well just as expected.', 'review_date': 'Reviewed in the United States on January 17, 2025', 'review_text': "I bought it for my husband's birthday and he just loves it.", 'review_rating': '5.0 out of 5 stars'},
{'reviewer_name': 'Chris', 'review_title': 'Its an international phone from India', 'review_date': 'Reviewed in the United States on January 2, 2025', 'review_text': 'After using this phone and everything is set to india. And nothing works. And cant have apps no voicemail. Called samsung they said its an international phone no where in the description says international phone or that its from india. Misleading and terrible phone.', 'review_rating': '1.0 out of 5 stars'},
{'reviewer_name': 'E&C', 'review_title': "Couldn't be happier", 'review_date': 'Reviewed in the United States on October 22, 2024', 'review_text': "I've had this phone for a while now and it's perfect. The color resolution is amazing. Camara is top notch. The battery last all day without needed extra charge. The cost is was well blow the regular. Speed is super fast. I love this phone and would buy again", 'review_rating': '5.0 out of 5 stars'},
{'reviewer_name': 'Marcelo', 'review_title': 'Funciona sin problemas en Argentina', 'review_date': 'Reviewed in the United States on December 20, 2024', 'review_text': 'Funciona sin problemas en Argentina. Sería bueno que incluyan el cargador como parte del combo de compra. Solo trae el cable.', 'review_rating': '5.0 out of 5 stars'},
{'reviewer_name': 'Amazon Customer', 'review_title': 'Good quality', 'review_date': 'Reviewed in the United States on September 12, 2024', 'review_text': 'Nice case looks good feel ok. Glad I got it.', 'review_rating': '4.0 out of 5 stars'},
{'reviewer_name': 'Tajera', 'review_title': 'Has expected', 'review_date': 'Reviewed in the United States on November 6, 2024', 'review_text': "This wasn't damaged at all and it exactly like I read and saw", 'review_rating': '4.0 out of 5 stars'},
{'reviewer_name': 'Placeholder', 'review_title': 'GB AND HOW LONG WITHOUT CHARGING', 'review_date': 'Reviewed in the United States on August 20, 2024', 'review_text': "You can get around 8 hours without charging, leaving about 28% left. The phone does run hot but someone said it's because of the type processor used on international phones. The western phone has snapdragon? This phone has exynos (I believe). It has all that the western phone has. You need to check to see if Samsung pay is on the international phone , someone said it wasn't. Otherwise a good phone price less than the western phone. I gave it a 4 because it runs hot (sometimes).", 'review_rating': '4.0 out of 5 stars'}
]
# Convert the reviews data into a Pandas DataFrame
df = pd.DataFrame(reviews)
# Clean the 'review_rating' column to extract numeric ratings
df['review_rating'] = df['review_rating'].str.extract(r'(\d+\.\d+)').astype(float)
# Visualizations
plt.figure(figsize=(12, 6))
# 1. Distribution of Ratings
plt.subplot(1, 2, 1)
sns.countplot(x='review_rating', data=df, palette='viridis')
plt.title('Distribution of Review Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
# 2. Average Rating Over Time
# Extract the review date and convert to datetime
df['review_date'] = pd.to_datetime(df['review_date'].str.extract(r'on (.*)')[0])
df.set_index('review_date', inplace=True)
plt.subplot(1, 2, 2)
df.resample('M')['review_rating'].mean().plot(kind='line', marker='o', color='orange')
plt.title('Average Rating Over Time')
plt.xlabel('Review Date')
plt.ylabel('Average Rating')
plt.tight_layout()
plt.show()
# 3. Word Cloud for Review Text (Optional)
from wordcloud import WordCloud
# Combine all review text into a single string
text = " ".join(review for review in df['review_text'])
# Generate a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Review Text')
plt.show()
Conclusion
Scraping Amazon takes a mix of technical know-how and a strong commitment to legal and ethical standards. By using tools like Requests, BeautifulSoup, and rotating proxies, you can responsibly collect valuable product and review data while staying on the right side of the rules. It's all about balancing the right tools with respect for Amazon's policies.
Frequently Asked Questions About Scraping Amazon
Is scraping Amazon against its Terms of Service?
Yes, it violates Amazon’s ToS. However, scraping public data may be permissible under specific legal contexts.
Can I use proxies to scrape Amazon?
Yes, proxies help bypass IP blocks. Use residential or rotating proxies for best results.
How do I scrape Amazon data without being blocked?
Use strategies like user-agent rotation, request delays, and proxies to avoid detection.
What’s the difference between scraping and APIs for Amazon data?
Scraping provides flexibility but risks legal issues, while APIs offer structured, legal access to data.