
How to Scrape Walmart: Data, Prices, Products, and Reviews (2025)

Learn how to scrape Walmart in 2025, extract titles, prices, products and reviews via JSON, handle pagination, stay compliant and avoid blocks with proxies.

Live Proxies Editorial Team

31 October 2025

For businesses, extracting data from eCommerce giants like Walmart is invaluable, yet challenging due to Walmart's advanced anti-bot defenses. This guide walks you through a complete approach to effectively scrape Walmart data at scale in 2025 while also addressing common obstacles. We will begin with a quick method for immediate results, then proceed to a more robust and scalable solution.

How to scrape Walmart price & title in 10 lines?

You can scrape basic product data from Walmart with a minimal script. Walmart's website is built on Next.js, a framework that conveniently embeds page data within a JSON object inside a <script> tag. By targeting this JSON, you can bypass fragile HTML parsing and extract structured data directly.

The Python snippet below demonstrates this technique:

import json, curl_cffi

# Fetch the search results page, impersonating a Chrome browser
response = curl_cffi.get(
    "https://www.walmart.com/search?q=running+shoes", impersonate="chrome"
)

# Find the start and end of the __NEXT_DATA__ JSON blob
start = response.text.find('<script id="__NEXT_DATA__"')
start = response.text.find(">", start) + 1
end = response.text.find("</script>", start)

# Parse the JSON and iterate through products
data = json.loads(response.text[start:end])
for stack in data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"]:
    for item in stack.get("items", []):
        if item.get("__typename") == "Product":
            print(f"{item.get('name', 'No name')} - ${item.get('price', 0)}")
            

In this script, we use curl_cffi to mimic a real Chrome browser and then extract the JSON content from the __NEXT_DATA__ tag. We filter for items where __typename is "Product" to exclude ads and other non-product content. This method is far more reliable than parsing HTML classes, which change frequently.

Here’s the console output:

Is it legal to scrape Walmart?

Scraping publicly available data from Walmart is generally considered legal, but it's a legal gray area that requires a careful approach. The key is to stick to data that is visible to any user, respect the website's rules, and avoid causing any disruption to their service.

Best practices for ethical scraping

To guide your scraping activities ethically and responsibly, follow these best practices. Understanding web scraping fundamentals is the first step to ensuring your data collection efforts remain within acceptable boundaries.

  • Do focus exclusively on publicly visible data, such as product listings, prices, and reviews. This kind of general data retrieval is common for market analysis.
  • Don’t attempt to access any information behind a login, such as user accounts or personal data. This is a clear boundary violation.
  • Don’t ignore Walmart’s robots.txt file. While not legally binding, it outlines the website owner's wishes and is a key part of responsible bot etiquette (a quick way to check it programmatically is sketched after this list).
  • Don’t overwhelm the site with high-frequency requests. An aggressive scraper can mimic a denial-of-service (DoS) attack, which can have legal consequences.
  • Do be mindful of the data you collect. Anonymize it where possible and immediately discard any personal information you might have inadvertently scraped from public reviews.
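
As a follow-up to the robots.txt point, here's a minimal sketch using Python's built-in urllib.robotparser to check whether a given path is allowed for your crawler. Treat the result as guidance rather than a guarantee of compliance.

from urllib.robotparser import RobotFileParser

# Fetch and parse Walmart's live robots.txt
robots = RobotFileParser("https://www.walmart.com/robots.txt")
robots.read()

# Check whether a generic crawler may fetch a search results path
print(robots.can_fetch("*", "https://www.walmart.com/search?q=running+shoes"))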

Note – this information is for educational purposes only and does not constitute legal advice. Web scraping laws are complex and constantly evolving. Always consult with a legal professional to ensure your project is compliant.

With the legal considerations in mind, let's explore the specific data points you can collect from Walmart.

What Walmart data to scrape?

Walmart is a goldmine of eCommerce data. If you know where to look, you can extract clean, structured data to fuel your market research, price intelligence projects, or AI models. Here’s a checklist of the key data points you should be targeting:

  • Product details. Essential identifiers like Title, Brand, SKU, Item ID, and GTIN/UPC. The category path (e.g., "Home > Appliances > Coffee Makers") provides crucial context.
  • Pricing information. This includes the current price, "was" price for items on sale, calculated savings, and promotional badges like "Flash Deal". This data is essential for tracking dynamic pricing strategies.
  • Stock & fulfillment. Availability status ("In stock", "Out of stock"), product condition ("New", "Refurbished"), and fulfillment options like delivery or in-store pickup.
  • Product variants. Many products come in different sizes, colors, or pack quantities. Each variant often has its own unique product ID, price, and availability.
  • Marketplace sellers & variant pricing. Distinguish between items "Sold by Walmart" and those from third-party sellers. This is key for analyzing variant pricing by seller, helping you track how different sellers price the same product variants (e.g., size, color).
  • Ratings & reviews. The aggregate star rating and total review count provide a quick quality signal, while individual reviews offer deeper insights.
  • Media. URLs for high-resolution product images and videos are valuable for catalogs and visual analysis.
  • Specifications. Detailed attributes like dimensions, weight, material, and other technical features.

To illustrate, here's a simplified sample of the JSON structure for a single product, showing how neatly this data is organized:

{
    "itemId": "5162907971",
    "productId": "1RIZUQ0XICHV",
    "sku": "5162907971",
    "name": "Mainstays Black 12-Cup Drip Coffee Maker",
    "brand": "Mainstays",
    "categoryPathName": "Home Page/Home/Appliances/Kitchen Appliances/Coffee Shop/Coffee Makers",
    "price": {
      "price": 15.88,
      "priceString": "$15.88"
    },
    "wasPrice": {
      "price": 239.99,
      "priceString": "$239.99"
    },
    "availabilityStatus": {
      "display": "In stock",
      "value": "IN_STOCK"
    },
    "condition": {
      "text": "New"
    },
    "fulfillmentType": "DELIVERY",
    "sellerName": "Walmart.com",
    "averageOverallRating": 4.5,
    "numberOfReviews": 4527,
    "imageInfo": {
      "allImages": [
        {
          "url": "https://i5.walmartimages.com/seo/Mainstays-Black-12-Cup-Drip-Coffee-Maker_a5765781.jpeg"
        }
      ]
    },
    "specifications": [
      {"name": "Material", "value": "Plastic;Glass"},
      {"name": "Weight", "value": "3.3 lb"}
    ]
}

Note – this is a simplified example. The full JSON from the website is much richer.

With a clear map of the data you need, the next step is building a scraper. Let's dive in and build a robust solution using the top language for web scraping, Python.

How to scrape Walmart with Python?

While a 10-line script is great for a quick test, a production-level scraper needs to be resilient, structured, and scalable. Let’s build a robust Walmart scraper using Python, a common starting point for many Python web scraping projects.

Setting up your environment

First, let's get your project set up. Using a virtual environment is a best practice for managing your project's dependencies.

Create and activate a virtual environment:

# Create the environment
python -m venv .venv

# Activate on Windows CMD
.venv\Scripts\activate.bat

# Activate on macOS/Linux
source .venv/bin/activate

Install the necessary libraries:

pip install beautifulsoup4 curl-cffi lxml

Here’s a quick rundown of our toolkit:

  • curl-cffi. Makes HTTP requests that impersonate a real browser's TLS fingerprint. This makes your scraper much harder for anti-bot systems to detect compared to standard libraries.
  • BeautifulSoup4. A powerful and developer-friendly library for parsing HTML. We'll use it to easily locate the hidden data we need.
  • lxml. A high-performance parser that BeautifulSoup4 uses under the hood to process HTML at lightning speed.

Locating and parsing Walmart's __NEXT_DATA__ JSON

Start by navigating to the Walmart website and searching for a product, like "running shoes". The page will look something like this:


As mentioned, the most reliable way to get data from Walmart is by targeting the __NEXT_DATA__ JSON blob embedded in the page's HTML. To find this, right-click anywhere on the search results page and select "Inspect" to open your browser's developer tools.


In the developer tools, navigate to the "Network" tab, make sure "Doc" is selected, and refresh the page. Click on the first request in the list, which corresponds to the page itself.


In the "Response" tab for that request, you will see the full HTML of the page. Here, you can find the <script id="__NEXT_DATA__"> tag containing all the product data in a clean JSON format. This is what our script will target. Effective data parsing is key to turning this raw HTML into structured, usable data.


Building the Walmart search scraper

Let’s build the Walmart scraper step by step.

Step 1 – making the HTTP request

We start by defining a function that sends a GET request to a Walmart search URL. The impersonate="chrome" argument in curl_cffi is crucial for mimicking a real browser. We also include checks for failed requests or CAPTCHA pages.

import json
import curl_cffi.requests as requests
from bs4 import BeautifulSoup

def extract_walmart_products(search_url):
    try:
        # Make a request with Chrome browser impersonation
        response = requests.get(search_url, impersonate="chrome", timeout=30)
        
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
        
        page_source = response.text
        
        # Check for Walmart's "Robot or human" CAPTCHA page
        if "Robot or human" in page_source:
            raise Exception("Walmart CAPTCHA page detected - request has been blocked")
        
        # ... next steps will go here ...

Step 2 – locating and parsing the hidden JSON data

Once we have the HTML content, we use BeautifulSoup to find the <script id="__NEXT_DATA__"> tag. We then extract its content and parse it as JSON.

    # ... continued from Step 1 ...
        
        soup = BeautifulSoup(page_source, 'lxml')
        script_tag = soup.find('script', {'id': '__NEXT_DATA__'})
        
        if not script_tag:
            raise Exception("__NEXT_DATA__ script not found in page HTML")
            
        data = json.loads(script_tag.get_text())

Step 3 – navigating the JSON and extracting product information

The product data is nested deep within the JSON object. We need to traverse this structure to reach the list of products. The path is typically props -> pageProps -> initialData -> searchResult -> itemStacks.

Each "itemStack" contains a mix of content, including products, ads, and recommendations. We must iterate through these and filter for actual products, which are identified by __typename == "Product".

  # ... continued from Step 2 ...

        # Navigate through the nested JSON structure to find product data
        search_result = data['props']['pageProps']['initialData']['searchResult']
        product_stacks = search_result['itemStacks']
        
        products = []
        for stack in product_stacks:
            for item in stack.get('items', []):
                # Filter out ads and recommendations, keeping only products
                if item.get("__typename") == "Product":
                    products.append(item)

        if not products:
            raise Exception("No products found in the extracted data")
            
        return products
    
    except Exception as e:
        print(f"Extraction failed: {e}")
        return []

Step 4 – putting it all together

Finally, we create a main function to run the scraper and save the results to a JSON file.

# (Place the extract_walmart_products function from above here)

def main():
    url = "https://www.walmart.com/search?q=running+shoes"
    print(f"Scraping products from: {url}")
    
    products = extract_walmart_products(url)
    
    if products:
        with open("walmart_products.json", 'w', encoding='utf-8') as f:
            json.dump(products, f, indent=2, ensure_ascii=False)
        print(f"✓ Extracted {len(products)} products to walmart_products.json")
    else:
        print("No products were extracted.")

if __name__ == "__main__":
    main()
    

Below is the complete script:

import json
import curl_cffi.requests as requests
from bs4 import BeautifulSoup


def extract_walmart_products(query: str):
    """
    Scrapes product data from a Walmart search results page.
    """
    search_url = f"https://www.walmart.com/search?q={query}"
    print(f"Scraping products from: {search_url}")

    try:
        # Make a request with Chrome browser impersonation to avoid blocking
        response = requests.get(search_url, impersonate="chrome", timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

        page_source = response.text

        # Check for Walmart's "Robot or human" CAPTCHA page
        if "Robot or human" in page_source:
            raise Exception("Request blocked by Walmart CAPTCHA page.")

        soup = BeautifulSoup(page_source, "lxml")
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

        if not script_tag:
            raise Exception("__NEXT_DATA__ script not found in page HTML.")

        data = json.loads(script_tag.get_text())

        # Navigate through the nested JSON structure to find product data
        search_result = data["props"]["pageProps"]["initialData"]["searchResult"]
        product_stacks = search_result["itemStacks"]

        products = []
        for stack in product_stacks:
            for item in stack.get("items", []):
                # Filter out ads and recommendations, keeping only products
                if item.get("__typename") == "Product":
                    products.append(item)

        if not products:
            raise Exception("No products found in the extracted data.")

        return products

    except Exception as e:
        print(f"An error occurred: {e}")
        return []


def main():
    search_query = "running shoes"
    products = extract_walmart_products(search_query)

    if products:
        filename = "walmart_products.json"
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(products, f, indent=2, ensure_ascii=False)
        print(f"✓ Extracted {len(products)} products to {filename}")
    else:
        print("No products were extracted.")


if __name__ == "__main__":
    main()

Running the script produces a clean JSON file with detailed product information. Here’s a truncated sample of the output:

[
  {
    "__typename": "Product",
    "name": "Nike Air Zoom Pegasus 39 Men's Running Shoes",
    "id": "2L6X6K6VWQZ5",
    "priceInfo": {
      "currentPrice": {
        "price": 89.99,
        "priceString": "$89.99"
      }
    },
    "imageInfo": {
      "thumbnailUrl": "https://i5.walmartimages.com/asr/...",
      "size": "LARGE"
    },
    "averageRating": 4.5,
    "numberOfReviews": 1247,
    "brand": "Nike",
    "canonicalUrl": "/ip/Nike-Air-Zoom-Pegasus-39/12345",
    "availabilityStatusV2": {
      "status": "IN_STOCK"
    }
    // ... truncated (each product contains 40+ more fields)
  },
  {
    "__typename": "Product",
    "name": "Adidas Ultraboost 22 Running Shoes",
    "id": "3M7Y7L7XRZA6",
    "priceInfo": {
      "currentPrice": {
        "price": 119.99,
        "priceString": "$119.99"
      }
    }
    // ... truncated for brevity
  }
]

While this is a great start, most scraping tasks require collecting data from all pages, not just the first one. You need to master two key features of any eCommerce site: search filters and pagination. Let's scale up our scraper to handle both.

Applying search filters via URL parameters

Walmart uses URL parameters to manage filtering and sorting. You can discover these by applying filters on the website and observing the changes in the URL, or by inspecting network requests in your browser's developer tools.


Here are a few common examples:

  • Sort by best seller – &sort=best_seller

  • Sort by price (low to high) – &sort=price_low

  • Set a price range – &min_price=10&max_price=50

By appending these to your base search URL, you can tell your scraper exactly what to target. For instance, to find the best-selling running shoes over $10, your URL would look like this:

https://www.walmart.com/search?q=running+shoes&min_price=10&sort=best_seller
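
If you prefer to build these URLs in code, here's a minimal sketch using Python's standard urllib.parse. The build_search_url helper is an illustrative name, and the parameters are the ones listed above:

from urllib.parse import urlencode

def build_search_url(query, sort=None, min_price=None, max_price=None):
    """Compose a Walmart search URL from the filter parameters described above."""
    params = {"q": query}
    if sort:
        params["sort"] = sort                # e.g. "best_seller" or "price_low"
    if min_price is not None:
        params["min_price"] = min_price
    if max_price is not None:
        params["max_price"] = max_price
    return "https://www.walmart.com/search?" + urlencode(params)

# Best-selling running shoes over $10
print(build_search_url("running shoes", sort="best_seller", min_price=10))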

Navigating multiple pages (pagination)

To collect all products from a search or category, you'll need to loop through every page of results by manipulating Walmart's pagination parameters. Walmart uses the page parameter for this (e.g., &page=1, &page=2).


We can create a controller function that iteratively scrapes each page until no more products are found. The complete code below includes this pagination logic and a polite delay between requests.

Here's how to modify our script to handle pagination:

  1. Detect the last page. We'll add a check in extract_walmart_products to see if the page contains the "no search results" message. If it does, we return an empty list, signaling the end of the pagination loop.

  2. Create a pagination loop. A new function, scrape_all_pages, will manage the loop, incrementing the page number and collecting products from each page.

  3. Add a polite delay. It's crucial to add a time.sleep() delay between requests to avoid overwhelming Walmart's servers.

Here is the complete, updated code with pagination logic:

import json
import time
import curl_cffi.requests as requests
from bs4 import BeautifulSoup


def extract_walmart_products(search_url):
    # (The function from the previous section, with one addition)
    try:
        response = requests.get(search_url, impersonate="chrome", timeout=30)
        response.raise_for_status()

        page_source = response.text

        # ADDITION: Check for the "no results" page to stop pagination
        if "There were no search results" in page_source:
            return []

        # Check for Walmart's "Robot or human" CAPTCHA page
        if "Robot or human" in page_source:
            raise Exception("Request blocked by Walmart CAPTCHA page.")

        soup = BeautifulSoup(page_source, "lxml")
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

        if not script_tag:
            raise Exception("__NEXT_DATA__ script not found in page HTML.")

        data = json.loads(script_tag.get_text())

        # Navigate through the nested JSON structure to find product data
        search_result = data["props"]["pageProps"]["initialData"]["searchResult"]
        product_stacks = search_result["itemStacks"]

        products = []
        for stack in product_stacks:
            for item in stack.get("items", []):
                # Filter out ads and recommendations, keeping only products
                if item.get("__typename") == "Product":
                    products.append(item)

        # An empty list signals the end of pagination
        return products

    except Exception as e:
        print(f"Extraction failed on {search_url}: {e}")
        return []


def scrape_all_pages(base_search_url):
    all_products = []
    page = 1

    while True:
        # Construct URL for the current page
        paginated_url = f"{base_search_url}&page={page}"
        print(f"Scraping page {page}: {paginated_url}")

        products_on_page = extract_walmart_products(paginated_url)

        # Stop if no products are found on the page
        if not products_on_page:
            print(f"No products found on page {page}. Stopping pagination.")
            break

        all_products.extend(products_on_page)
        print(f"Found {len(products_on_page)} products on page {page}.")

        page += 1
        time.sleep(2)  # Respectful delay between requests

    return all_products


def main():
    base_url = "https://www.walmart.com/search?q=running+shoes"
    print(f"Starting pagination scrape from: {base_url}")

    all_products = scrape_all_pages(base_url)

    if all_products:
        with open("walmart_all_products.json", "w", encoding="utf-8") as f:
            json.dump(all_products, f, indent=2, ensure_ascii=False)
        print(
            f"\n✓ Extracted {len(all_products)} total products to walmart_all_products.json"
        )
    else:
        print("No products were extracted from any pages.")


if __name__ == "__main__":
    main()

Running this script will now paginate through all available pages for the search query:


This will save a single file containing hundreds of products.


With a large dataset scraped, the next step is to decide how to store it effectively.

How to structure/store data you scrape from Walmart?

Saving raw JSON is a good start, but larger projects benefit from a more organized storage solution. The best choice depends on your needs.

  • JSON. Excellent for storing nested, unstructured data exactly as scraped. It’s human-readable and supported by most programming languages. Best for initial data dumps and small projects.
  • CSV. A flat, tabular format ideal for analysis in spreadsheets (Excel, Google Sheets) or for import into data tools like Pandas. You’ll need to flatten nested JSON, selecting only the fields needed for each column (a minimal flattening sketch follows this list).
  • Databases. The most robust option for large-scale, ongoing projects. Databases allow efficient querying, indexing, and management of vast datasets. SQLite is a lightweight, file-based option for getting started, while PostgreSQL is a powerful choice for production systems.
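
To make the CSV option concrete, here's a minimal sketch that flattens a few fields from the walmart_products.json file produced earlier into a spreadsheet-friendly CSV. The chosen fields match the sample output above; adjust the keys to whatever your scraped JSON actually contains.

import csv
import json

# Load the raw products saved by the search scraper
with open("walmart_products.json", encoding="utf-8") as f:
    products = json.load(f)

rows = []
for item in products:
    # Flatten only the columns we need; .get() keeps missing keys from crashing the run
    current_price = (item.get("priceInfo") or {}).get("currentPrice") or {}
    rows.append({
        "id": item.get("id"),
        "name": item.get("name"),
        "brand": item.get("brand"),
        "price": current_price.get("price"),
        "rating": item.get("averageRating"),
        "reviews": item.get("numberOfReviews"),
        "url": item.get("canonicalUrl"),
    })

fieldnames = ["id", "name", "brand", "price", "rating", "reviews", "url"]
with open("walmart_products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to walmart_products.csv")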

Beyond search results, individual product pages contain even more detailed information worth capturing.

Scraping individual product pages

While search pages provide a summary, individual product pages are where you find complete details like full specifications, long descriptions, shipping information, and more.


The scraping technique is the same – we're still targeting the __NEXT_DATA__ JSON blob – but the path to the data is slightly different. On a product page, you'll find the data located at this path:

data['props']['pageProps']['initialData']['data']['product']

Here is a dedicated function to scrape a single product page. It's nearly identical to our search page scraper, just updated with the new JSON path.

import json
import curl_cffi.requests as requests
from bs4 import BeautifulSoup

def scrape_single_product(product_url: str):
    """
    Scrapes detailed data from a single Walmart product page.
    """
    print(f"Scraping product details from: {product_url}")
    try:
        # Make a request with Chrome browser impersonation
        response = requests.get(product_url, impersonate="chrome110", timeout=30)
        response.raise_for_status()

        page_source = response.text

        # Check for bot detection
        if "Robot or human" in page_source:
            raise Exception("Request blocked by Walmart CAPTCHA page.")

        soup = BeautifulSoup(page_source, "lxml")
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
        if not script_tag:
            raise Exception("__NEXT_DATA__ script not found in page HTML.")

        data = json.loads(script_tag.get_text())
        
        # Navigate to product data using the updated path for product pages
        product_data = data["props"]["pageProps"]["initialData"]["data"]["product"]
        
        if not product_data:
            raise Exception("No product data found in the JSON structure.")
            
        return product_data

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def main():
    # Example URL for a specific product
    product_url = "https://www.walmart.com/ip/Avia-Men-s-5000-Performance-Walking-Sneakers/2556648281"
    
    product_details = scrape_single_product(product_url)
    
    if product_details:
        filename = "walmart_product_details.json"
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(product_details, f, indent=2, ensure_ascii=False)
        print(f"✓ Extracted detailed product data to {filename}")
    else:
        print("Failed to extract product data.")

if __name__ == "__main__":
    main()

Running the script successfully produces a walmart_product_details.json file. Here’s a glimpse of the data you'll get, truncated for readability:

{
  "name": "Avia Men's 5000 Athletic Performance Running Shoes, Wide Width Available",
  "id": "4HR1PX7BZS5F",
  "primaryProductId": "2556648281",
  "shortDescription": "Avia Men's athletic performance running shoes...",
  "longDescription": "Designed for active lifestyles, these Avia men's running shoes...",
  "primaryOffer": {
    "offerPrice": 29.97,
    "wasPrice": {
      "price": 39.97,
      "priceString": "$39.97"
    }
  },
  "imageInfo": {
    "thumbnailUrl": "https://i5.walmartimages.com/asr/...",
    "allImages": [
      {
        "url": "https://i5.walmartimages.com/seo/...",
        "size": "LARGE"
      }
    ]
  },
  "averageRating": 4.2,
  "numberOfReviews": 1847,
  "specifications": [
    {
      "name": "Brand",
      "value": "Avia"
    },
    {
      "name": "Material",
      "value": "Synthetic"
    }
  ],
  "reviews": {
    "customerReviews": [
      {
        "rating": 5,
        "reviewText": "Great shoes for the price...",
        "reviewer": "Verified Purchase"
      }
    ]
  }
}

This script will save a highly detailed JSON file for the specified product. But there's another valuable dataset on these pages worth scraping: customer reviews.

Scrape Walmart reviews with Python

Customer reviews provide invaluable insights into product quality, common issues, and user sentiment. While a product page shows a few top reviews, the complete dataset is often located on a separate, paginated reviews page.


The URL structure for reviews typically looks like this:

https://www.walmart.com/reviews/product/{PRODUCT_ID}?page={PAGE_NUMBER}

Once again, all review data is embedded in the __NEXT_DATA__ script tag. The JSON path to the reviews is: data['props']['pageProps']['initialData']['data']['reviews']['customerReviews']

You can make your scraper incredibly efficient by using URL parameters to fetch only the reviews you need. This is much faster than scraping everything and filtering the data later.

Here are the most useful filter parameters:

  • Verified purchases only – add vp=true to the URL.

  • Filter by rating – use ratings=5 for 5-star reviews, ratings=1 for 1-star reviews, and so on.

  • Sort by most recent – add sort=submission-desc to get the newest reviews first.

You can also combine these filters. To get the most recent, 5-star reviews from verified purchasers, the URL would be:

.../reviews/product/2556648281?ratings=5&vp=true&sort=submission-desc
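
If you'd rather assemble these review URLs in code, here's a small sketch (the build_reviews_url helper name is illustrative) that plugs the filter parameters described above into the reviews endpoint:

from urllib.parse import urlencode

def build_reviews_url(product_id, page=1, ratings=None, verified_only=False, sort=None):
    """Compose a Walmart reviews URL using the filter parameters described above."""
    params = {"page": page}
    if ratings:
        params["ratings"] = ratings    # e.g. 5 for 5-star reviews only
    if verified_only:
        params["vp"] = "true"          # verified purchases only
    if sort:
        params["sort"] = sort          # e.g. "submission-desc" for newest first
    return f"https://www.walmart.com/reviews/product/{product_id}?" + urlencode(params)

# Most recent 5-star reviews from verified purchasers
print(build_reviews_url("2556648281", ratings=5, verified_only=True, sort="submission-desc"))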

The following script paginates through all review pages for a given product ID and collects every review. It checks for the “Next Page” button to determine when to stop.

import json
import curl_cffi.requests as requests
import time
from bs4 import BeautifulSoup


def extract_reviews_from_page(reviews_url):
    """Extracts reviews from a single Walmart reviews page."""
    try:
        response = requests.get(
            reviews_url,
            impersonate="chrome",
            timeout=30,
            headers={"referer": "https://www.walmart.com/"},
        )
        if response.status_code != 200 or "Robot or human" in response.text:
            return [], False

        page_source = response.text
        # Check if the "Next Page" button exists to determine if there are more pages
        has_next_page = 'data-testid="NextPage"' in page_source

        # Use BeautifulSoup to extract JSON data
        soup = BeautifulSoup(page_source, "html.parser")
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

        if not script_tag:
            return [], False

        data = json.loads(script_tag.get_text())
        reviews = data["props"]["pageProps"]["initialData"]["data"]["reviews"].get(
            "customerReviews", []
        )
        return reviews, has_next_page

    except Exception:
        return [], False


def extract_all_reviews(product_id):
    """Extracts all reviews for a product by paginating through all pages."""
    all_reviews = []
    page = 1

    while True:
        reviews_url = (
            f"https://www.walmart.com/reviews/product/{product_id}?page={page}"
        )
        print(f"Scraping reviews page {page}...")

        reviews, has_next = extract_reviews_from_page(reviews_url)
        if not reviews:
            break

        all_reviews.extend(reviews)
        print(f"Found {len(reviews)} reviews on page {page}.")

        if not has_next:
            break

        page += 1
        time.sleep(1)  # Be polite

    return all_reviews


def main():
    product_id = "2556648281"  # Example product ID
    all_reviews = extract_all_reviews(product_id)

    if all_reviews:
        with open("walmart_reviews.json", "w", encoding="utf-8") as f:
            json.dump(all_reviews, f, indent=2, ensure_ascii=False)
        print(f"\n✓ Extracted {len(all_reviews)} total reviews to walmart_reviews.json")


if __name__ == "__main__":
    main()

Here’s a sample of the clean, structured data you get for each review:

[
  {
    "rating": 5,
    "title": "Love these shoes!",
    "text": "I love everything about these shoes. The look, the comfort, the price, the value. They are inexpensive, but they don't look cheap...",
    "reviewer": "Ja***",
    "submissionTime": "2025-07-13",
    "helpfulVotes": 0,
    "isVerifiedPurchase": true,
    "productName": "Avia Men's 5000 Athletic Performance Running Shoes"
  },
  {
    "rating": 5,
    "text": "These are great, well-made shoes that don't cost an arm and a leg! I'd rather buy these than Nike's!",
    "reviewer": "Kel***",
    "submissionTime": "2025-07-22",
    "helpfulVotes": 0,
    "isVerifiedPurchase": true,
    "productName": "Avia Men's 5000 Athletic Performance Running Shoes"
  }
]

This powerful script allows you to scrape thousands of reviews, providing a rich dataset for sentiment analysis and market research. However, as you scale up your efforts to scrape Walmart reviews with Python or any data from Walmart, you will inevitably run into Walmart's defenses.

Common challenges in Walmart scraping

The scripts we've built are great, but when you try to extract data at scale, you'll inevitably hit a wall. Understanding Walmart’s defenses is the first step to building a scraper that can overcome them.

Challenge 1 – IP rate limiting and blocks

Sending too many requests from a single IP address is the fastest way to get blocked. eCommerce sites like Walmart enforce strict rate limits to protect their stability. Exceed these limits, and you'll find your IP has been banned, a frustrating dead end for any serious scraping project. Relying on a single IP is a non-starter for scalable data collection.

Challenge 2 – sophisticated bot detection

Walmart uses advanced anti-bot solutions (e.g. HUMAN Security, formerly PerimeterX) that analyze far more than just your IP. These systems combine fingerprinting and behavioral analysis, monitor your request patterns, and look for any signs of automation. If your script gets flagged, you'll be served a "Press & Hold" CAPTCHA page, which will stop your scraper cold.


Challenge 3 – geo-restricted data

Walmart displays different prices, product availability, and shipping options based on your geographic location. If your scraper's IP is in New York, you'll only see data relevant to New York shoppers. This is a major problem if you need accurate data for other regions, like tracking stock in California or comparing prices in Texas. Without controlling your location, your data will be inconsistent and unreliable.

How to scrape Walmart without blocks?

The key to a successful, large-scale web scraping project is to make your requests indistinguishable from those of real users. This requires a smarter strategy, with a high-quality proxy network at its core.

The role of high-quality proxies

A proxy server acts as an intermediary, masking your true IP address. For scraping, this is essential for rotating your IP and simulating requests from different users and locations. However, not all proxies are created equal. Basic datacenter proxies are easily detected and blocked. The solution is to use a residential proxy network. A residential proxy uses IP addresses from real internet service providers (ISPs), making your requests appear as legitimate residential traffic.

Every failed request costs you time, bandwidth, and compute resources. Live Proxies offers a premium network of ethically sourced residential proxies that are perfect for bypassing Walmart's defenses. Here's why they are effective:

  • Massive IP pool. With millions of IPs across 55+ countries, you can avoid IP bans and collect accurate, location-specific data by using targeted proxies.
  • Private IP allocation. Many providers share IPs among customers, leading to "burnt" IPs that are already flagged. Live Proxies can allocate dedicated IP pools for eCommerce targets, ensuring a clean reputation. Instead of one large pool shared by everyone, your IPs are unique and only shared with you.
  • Dynamic rotation & sticky sessions. You can rotate your IP with every request to maximize anonymity or use "sticky" sessions to maintain the same IP for up to 60 minutes for multi-step tasks.

Integrating proxies into your Python script

Integrating Live Proxies into your script is simple. You just need to add the proxies argument to your request call.

Step 1 – configure your proxy credentials

# Format: "username:password@endpoint:port"
PROXY = "YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@YOUR_PROXY_ENDPOINT:YOUR_PROXY_PORT"

proxies = {
    'http': f'http://{PROXY}',
    'https': f'http://{PROXY}',
}

Step 2 – add the proxies parameter to your request

response = curl_cffi.get(
    product_url, 
    impersonate="chrome", 
    timeout=30, 
    proxies=proxies
)

Here's the complete script:

import json
import curl_cffi
from bs4 import BeautifulSoup

# Configure your Live Proxies credentials
PROXY = "YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@YOUR_PROXY_ENDPOINT:YOUR_PROXY_PORT"

proxies = {
    "http": f"http://{PROXY}",
    "https": f"http://{PROXY}",
}


def extract_single_product(product_url):
    try:
        response = curl_cffi.get(
            product_url, impersonate="chrome", timeout=30, proxies=proxies
        )

        page_source = response.text

        soup = BeautifulSoup(page_source, "html.parser")
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

        if not script_tag:
            raise Exception("__NEXT_DATA__ script not found")

        data = json.loads(script_tag.get_text())
        product_data = data["props"]["pageProps"]["initialData"]["data"]["product"]

        if not product_data:
            raise Exception("No product data found")

        return product_data

    except (json.JSONDecodeError, KeyError) as e:
        print(f"Data extraction failed: {e}")
        return {}
    except Exception as e:
        print(f"Single product extraction failed: {e}")
        return {}


def display_location_info(product_data):
    """Display location information from product data"""
    try:
        fulfillment_options = product_data.get("fulfillmentOptions", [])
        location_info = product_data.get("location", {})

        if fulfillment_options:
            location_text = fulfillment_options[0].get("locationText", "N/A")
            print(f"\n📍 Fulfillment location: {location_text}")

        if location_info:
            store_ids = location_info.get("storeIds", [])
            if store_ids:
                print(f"🏪 Store ID: {store_ids[0]}")
    except Exception:
        pass


def main():
    product_url = "https://www.walmart.com/ip/Avia-Men-s-02-Air-Sneaker/10212008376"
    product_details = extract_single_product(product_url)

    if product_details:
        display_location_info(product_details)
        with open("walmart_product_details.json", "w", encoding="utf-8") as f:
            json.dump(product_details, f, indent=2, ensure_ascii=False)
        print("✓ Product data saved to walmart_product_details.json")
    else:
        print("❌ No product data extracted")


if __name__ == "__main__":
    main()

Here’s the console output scraping location-specific data, such as fulfillment location and store ID:


For large-scale data collection, combine residential proxies with other advanced techniques – fingerprint alignment, pacing, and targeted retries – to make your scraper more resilient:

  • Retry logic with exponential backoff. Network errors and temporary blocks are inevitable. Instead of letting your scraper fail, implement a retry mechanism. Using a library like Tenacity, you can automatically retry failed requests with an increasing delay, which prevents hammering the server (see the sketch after this list).
  • Asynchronous requests. Using Python's asyncio library with a client like aiohttp allows you to send hundreds of requests concurrently. Instead of waiting for each request, your scraper can manage many open connections at once, drastically increasing your data throughput.
  • Distributed task queues. For scraping millions of pages, task queues like Celery, RQ, or Prefect are essential. They allow you to distribute scraping jobs across multiple independent workers, making it easy to scale your operation and process a massive number of URLs in parallel.
  • Robust monitoring and logging. To maintain a healthy scraper, you must monitor its performance in real-time. For every request, log key metrics to a dashboard (using tools like Prometheus, Grafana, or Datadog) like the HTTP status code, latency (response time), parse success, and whether a block signal like a CAPTCHA was detected. This data is essential for quickly diagnosing problems and understanding your scraper's success rate.
  • Headless browsers. Although Walmart is a modern, JavaScript-heavy website, our method is effective because it conveniently pre-loads its data into the initial HTML. For websites that lack this shortcut and require JavaScript to run in the browser for content to load, a headless browser is essential. These tools render the complete webpage in the background, executing all JavaScript to simulate a real user session. Popular options for headless browsers include Playwright, Selenium, Puppeteer, and other third-party frameworks like SeleniumBase (a powerful framework built on top of Selenium).
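
To make the first point concrete, here's a minimal retry sketch using Tenacity. The fetch_page helper and the specific backoff values are illustrative starting points, not tuned recommendations:

import curl_cffi.requests as requests
from tenacity import retry, stop_after_attempt, wait_exponential


# Retry failed fetches up to 5 times, with exponentially growing waits capped at 60 seconds
@retry(wait=wait_exponential(multiplier=2, max=60), stop=stop_after_attempt(5), reraise=True)
def fetch_page(url):
    response = requests.get(url, impersonate="chrome", timeout=30)
    response.raise_for_status()  # 4xx/5xx responses trigger a retry
    if "Robot or human" in response.text:
        # Treat a CAPTCHA page as a retryable failure (ideally paired with a fresh proxy IP)
        raise Exception(f"CAPTCHA served for {url}")
    return response.text


if __name__ == "__main__":
    html = fetch_page("https://www.walmart.com/search?q=running+shoes")
    print(f"Fetched {len(html)} characters")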

Use cases for Walmart scraped data

Once you have a reliable stream of Walmart data, you can unlock powerful business insights. Here are a few popular use cases for this data:

  • Competitive intelligence. Monitor competitor pricing, promotions, and product assortments in real time. Automatically track new product launches and adjust your own strategy accordingly.
  • Dynamic pricing optimization. Implement automated pricing algorithms that respond to market changes, competitor prices, and inventory levels to maximize revenue and profit margins.
  • Market research. Analyze customer reviews and ratings at scale to identify product gaps, emerging trends, and consumer pain points. This data can inform new product development and feature improvements.
  • Brand reputation management. Track your brand's performance by monitoring product ratings, review sentiment, and price positioning relative to your competitors.
  • Product sourcing. Identify high-demand products with poor reviews or frequent stockouts, presenting opportunities for private label sellers to enter the market with an improved offering.

Conclusion

Scraping Walmart for data, prices, products, and reviews is challenging, but achievable with a sophisticated, scalable approach. While a basic script is fine for a quick look, a robust solution means targeting the hidden JSON, managing pagination correctly, and, crucially, routing requests through a high-quality residential proxy network such as Live Proxies. This strategy lets you overcome blocks and build a reliable data pipeline.

Ready to build a scraper that won't get blocked?

Explore our pricing plans or sign up for a free trial to test the power of our residential proxy network today!

FAQs

Why is Walmart ZIP code price scraping so complex?

Walmart's pricing varies geographically due to several factors:

  • Local inventory and promotions. A specific store might offer clearance sales or promotions that aren't available elsewhere to manage excess inventory. For instance, a sale in Omaha, Nebraska, may not be replicated in Miami, Florida.
  • Shipping and logistics. The cost of fulfilling and shipping an item can differ based on the delivery location (e.g., rural address versus a major city warehouse), which impacts the final price.
  • Regional pricing strategies. Prices are sometimes adjusted to compete with local brick-and-mortar competitors within a particular market.

Therefore, it is crucial to record the ZIP code used when scraping any price data. Without this location context, the data can be incomplete and misleading.
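
As a small illustration of that point, you could wrap every scraped record with the location context it was collected under (the field names here are purely illustrative):

import json
from datetime import datetime, timezone

def tag_with_location(product_data, zip_code, store_id=None):
    """Wrap a scraped record with the location context it was collected under."""
    return {
        "zip_code": zip_code,                                  # the ZIP the scraper was "shopping" from
        "store_id": store_id,                                  # optional store context, if known
        "scraped_at": datetime.now(timezone.utc).isoformat(),  # timestamp for price-history tracking
        "product": product_data,
    }

# Example: record that this price was observed for the 94103 area
record = tag_with_location({"itemId": "5162907971", "price": 15.88}, zip_code="94103")
print(json.dumps(record, indent=2))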

How often should I re-crawl?

The ideal crawl frequency depends entirely on the product's volatility. A one-size-fits-all approach is inefficient.

  • High-volatility products. For popular items like new electronics or best-selling apparel, where prices and stock change often, a daily crawl is a good baseline.
  • Low-volatility products. For "long-tail" items or staples that are rarely discounted (e.g., a specific brand of socks), a weekly crawl is usually sufficient and saves resources.
  • Sales events. During major promotions like Black Friday or special flash sales, prices and inventory can fluctuate by the minute. Hourly (or even more frequent) crawling is necessary to capture accurate data during these short windows.

Can I only extract reviews?

Yes, and it's more efficient to do so. Instead of loading the entire product page, you can point your scraper directly at Walmart's dedicated reviews URL (e.g., walmart.com/reviews/product/{PRODUCT_ID}). However, to get an unbiased sample, avoid scraping just the first few pages. These pages are often sorted by "Most relevant", which can skew your sentiment analysis. A better strategy is to sample reviews from across the entire dataset: for example, scrape pages 1-3, a few pages from the middle, and the last few pages to get a mix of recent, old, positive, and negative feedback.
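
One way to pick such a sample is sketched below; the page counts are illustrative, and the resulting page numbers would be fed to the reviews scraper shown earlier:

def sample_review_pages(total_pages, edge=3, middle=3):
    """Pick pages from the start, middle, and end of the review listing for a balanced sample."""
    if total_pages <= 2 * edge + middle:
        return list(range(1, total_pages + 1))  # small listings: just take every page
    first = list(range(1, edge + 1))                             # "most relevant" / newest pages
    mid_start = total_pages // 2 - middle // 2
    mid = list(range(mid_start, mid_start + middle))             # a slice from the middle
    last = list(range(total_pages - edge + 1, total_pages + 1))  # the final pages
    return first + mid + last

print(sample_review_pages(40))  # [1, 2, 3, 19, 20, 21, 38, 39, 40]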

Why do I get HTML one day and JSON the next?

If you get structured JSON data one day and different HTML the next, you are likely encountering A/B testing or a new site deployment. Large eCommerce platforms constantly test new layouts on different segments of users and roll out updates.

A robust scraper must be resilient to these changes. The best practice is to build dual parsers:

  1. A primary parser that targets the efficient JSON object.
  2. A fallback parser that can extract data from the raw HTML structure if the JSON is not found.
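
Here's a hedged sketch of that dual-parser pattern. The JSON path is the one used throughout this guide; the CSS selectors in the HTML fallback are placeholders you'd need to verify against the current markup:

import json
from bs4 import BeautifulSoup

def parse_products(page_source):
    """Try the __NEXT_DATA__ JSON first, then fall back to parsing the visible HTML."""
    soup = BeautifulSoup(page_source, "lxml")

    # Primary parser: the embedded JSON blob
    script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
    if script_tag:
        try:
            data = json.loads(script_tag.get_text())
            stacks = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"]
            return [
                item
                for stack in stacks
                for item in stack.get("items", [])
                if item.get("__typename") == "Product"
            ]
        except (json.JSONDecodeError, KeyError):
            pass  # structure changed or JSON was malformed – fall through to the HTML parser

    # Fallback parser: scrape the rendered HTML (selectors are illustrative placeholders)
    products = []
    for card in soup.select("[data-item-id]"):
        name_tag = card.select_one("span[data-automation-id='product-title']")
        price_tag = card.select_one("[data-automation-id='product-price']")
        products.append({
            "name": name_tag.get_text(strip=True) if name_tag else None,
            "price": price_tag.get_text(strip=True) if price_tag else None,
        })
    return products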