
How to Scrape Bing Search Results, Ads, News, and Maps With Python in 2026

Scrape Bing in 2026 with Python: organic results, AI answers, ads, news, maps, images, and recipes using curl_cffi and SeleniumBase with clean schemas.


Live Proxies Editorial Team

Content Manager


2 February 2026

Bing is often overlooked, but it's a key search provider used by ChatGPT Search, and it captures a B2B audience that Google under-serves, especially on corporate devices. If you try to scrape it with a standard script at scale, you will often run into TLS fingerprinting and behavior-based blocks. In this guide, we'll walk through building scrapers for Bing Search, AI Answers, Maps, News, Ads, Images, and Recipes so that you can get the data reliably.

Can We Scrape Bing Search Results in 10 Lines

If you only need the organic search results (the "blue links"), you don't need a heavy browser. Bing's basic results are Server-Side Rendered (SSR), which means the data is present in the initial HTML response.

However, standard Python requests often trigger bot detection quickly, especially at scale, primarily due to TLS fingerprinting and other behavioral signals that Bing monitors. The solution is to use curl_cffi, a library that mimics the TLS handshake of a real Chrome browser.

When to use this:

  • You only need the first page of organic results
  • You need high speed and low cost (no headless browser overhead)

Here's the code:

import json
from curl_cffi import requests # pip install curl_cffi
from bs4 import BeautifulSoup # pip install beautifulsoup4

def scrape_bing_static(query):
    # impersonate="chrome" bypasses TLS fingerprinting
    headers = {"Accept-Language": "en-US,en;q=0.9", "Referer": "https://www.bing.com/"}
    response = requests.get(
        "https://www.bing.com/search",
        params={"q": query},
        impersonate="chrome",
        headers=headers,
    )

    if "b_results" not in response.text:
        return {"error": "No results (Blocked, Captcha, or Consent Wall)"}

    soup = BeautifulSoup(response.content, "html.parser")
    results = []

    for r in soup.select("li.b_algo"):
        title = r.select_one("h2")
        link = r.select_one("a")
        if title and link:
            snippet_el = r.select_one(".b_caption p")
            results.append(
                {
                    "title": title.get_text(strip=True),
                    "link": link["href"],
                    "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
                }
            )

    return results

if __name__ == "__main__":
    print(json.dumps(scrape_bing_static("agentic ai tutorial"), indent=2))

When You Need Browser Automation

The static method above has a hard limit: dynamic content.

Bing's more advanced features, like AI Answers, Maps pins, News carousels, and Recipes, load much of their content through JavaScript after the initial page renders. The exact DOM you get also depends on your query type, location, and which A/B test variant Bing serves you. A static curl_cffi request won't capture this dynamic content.

For pagination beyond the first page, session continuity matters. When Bing sees requests for first=11 or first=21 with cookies and a consistent fingerprint from earlier pages, you're far less likely to hit blocks or redirects. Browser automation handles this naturally, making it the safer choice for multi-page scraping at scale.
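
If you do stay static, here is a minimal sketch of that idea: one curl_cffi Session is reused so the cookies set on page 1 travel with the first=11 and first=21 requests. This is an illustration (the function name is ours), not the approach used in the rest of this guide.

from curl_cffi import requests  # reuses the Chrome TLS fingerprint from earlier

def fetch_pages(query, pages=3):
    html_pages = []
    # The Session keeps cookies between requests, giving us session continuity
    session = requests.Session(impersonate="chrome")
    for page in range(1, pages + 1):
        params = {"q": query}
        if page > 1:
            # Bing pagination: first=11 is page 2, first=21 is page 3
            params["first"] = (page - 1) * 10 + 1
        resp = session.get("https://www.bing.com/search", params=params)
        html_pages.append(resp.text)
    return html_pages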

For these cases, we use SeleniumBase. It's a wrapper around Selenium that automatically patches the WebDriver to remove "I am a robot" flags (like navigator.webdriver).

Use this for:

  • AI Chat/Copilot answers
  • Maps & Local data
  • Pagination (Pages 2+)
  • Images & News

Before diving into the scrapers, let's look at which data fields drive business value for each Bing surface.

What Bing Scraping Data Fields Matter for Business

Scraping every pixel on the page is inefficient and expensive. You need a lean schema that maps directly to business value, not just raw HTML.

Here are the specific data fields you should target for each Bing surface:

  • Organic Search. Target fields: Rank, Title, Snippet, URL. Use case: SEO rank tracking and competitor content gaps.
  • Ads (PPC). Target fields: Position, Advertiser, Headline, Target URL. Use case: Monitoring ad spend and "Share of Voice".
  • News. Target fields: Headline, Source, Time Published, URL. Use case: Brand reputation and PR alerts.
  • Maps (Local). Target fields: Name, Website, Phone, Address, Coordinates. Use case: Lead gen and NAP (Name, Address, Phone) consistency.
  • People Also Ask. Target fields: Question, Answer, Source Link. Use case: Identifying user intent and content opportunities.
  • AI Answers. Target fields: Summary, Citations, Code Snippets. Use case: "Generative Engine Optimization" (GEO) tracking.
  • Images. Target fields: Title, Image URL, Source Page. Use case: Building ML training datasets.
  • Recipes. Target fields: Name, Rating, Prep Time, Source URL. Use case: Trend analysis and competitive research.

Now that we've defined what to extract, let's look at how to do it legally and securely.

Is Scraping Bing Search Legal and Safe in 2026

Scraping public data is a standard industry practice, but it is important to do it responsibly. The general rule of thumb is to scrape only publicly visible data where no login is required. When you log into an account, you agree to stricter terms that usually prohibit automation, so the safest approach is to always scrape as a guest. Additionally, be mindful of privacy and avoid collecting sensitive personal information.

Further reading: How to Scrape YouTube: A Complete Guide to Videos, Comments, and Transcripts (2026) and How to Scrape Yelp Places and Reviews in 2025 (With and Without Python).

How to Scrape Bing Search Results with Python Step by Step

Organic results are the baseline for SEO tracking and competitor analysis. But if you just grab the HTML, you will find that the URLs are wrapped in encoded tracking redirects, the ranks reset on every page, and the content is loaded dynamically.

To get quality data, we need a stack that handles JavaScript execution and bot detection simultaneously.


Setup: Quick Start

Bing typically blocks standard automation. We need SeleniumBase with "Stealth Mode" enabled to automatically patch common browser bot flags (such as navigator.webdriver), which helps us appear more like a real user.

The tech stack:

  • seleniumbase. Browser automation with anti-detect features.
  • beautifulsoup4. To parse the HTML after it renders.
  • requests / curl_cffi. (Optional) For quick static checks without launching a browser.

# Create and activate a virtual environment
python -m venv bing-env
source bing-env/bin/activate  # Windows: bing-env\Scripts\activate.bat

# Install dependencies
pip install seleniumbase beautifulsoup4 requests
seleniumbase install chromedriver

Parse Organic Results and Rank

We have to solve three specific engineering problems to get clean data:

  • Tracking redirects. Bing wraps URLs in bing.com/ck/a? redirects. Clicking through them is slow and triggers bot detection. We will decode the base64 u parameter directly to extract the clean URL without ever touching the redirect.
  • Rank calculation. Bing resets ranking on every page (top of Page 2 is "Rank 1"). We calculate absolute rank using ((page_num - 1) * 10) + index.
  • Lazy loading. Bing loads results as you scroll. We must force a scroll to the bottom of the page to trigger all JavaScript elements before parsing.

The Code

Save this as bing_organic_scraper.py:

import json
import logging
import argparse
import base64
from urllib.parse import quote_plus, urlparse, parse_qs
from bs4 import BeautifulSoup
from seleniumbase import SB

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(message)s", datefmt="%H:%M:%S"
)


class BingOrganicScraper:
    def _clean_url(self, url):
        """
        Decodes Bing's tracking URLs (bing.com/ck/a?u=...).
        Target URL is base64-encoded in the 'u' parameter.
        """
        if not url or "bing.com/ck/a?" not in url:
            return url
        try:
            parsed = urlparse(url)
            query_params = parse_qs(parsed.query)
            u_param = query_params.get("u", [None])[0]
            if u_param:
                if u_param.startswith("a1"):
                    u_param = u_param[2:]
                # Add padding to make base64 valid (length must be multiple of 4)
                padded = u_param + "=" * (-len(u_param) % 4)
                return base64.urlsafe_b64decode(padded).decode("utf-8")
        except Exception:
            pass
        return url

    def scrape(self, query: str, max_pages: int = 3):
        logging.info(f"Starting scrape: '{query}' ({max_pages} pages)")

        all_results = []
        unique_links = set()

        with SB(uc=True, headless=True, window_size="1920,1080") as sb:

            for page_num in range(1, max_pages + 1):

                encoded_query = quote_plus(query)
                url = f"https://www.bing.com/search?q={encoded_query}"

                # Bing pagination: first=1 (page 1), first=11 (page 2), first=21 (page 3)
                if page_num > 1:
                    offset = (page_num - 1) * 10 + 1
                    url += f"&first={offset}"

                logging.info(f"Processing page {page_num}...")

                try:
                    sb.open(url)
                    sb.wait_for_element("#b_results", timeout=15)

                    # Scroll to trigger lazy-load
                    sb.execute_script("window.scrollTo(0, 800);")
                    sb.sleep(0.5)
                    sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                    sb.sleep(1.5)

                    page_source = sb.get_page_source()
                    page_data = self._parse_page(page_source, page_num, unique_links)

                    if not page_data:
                        logging.warning(
                            f"No results found on page {page_num}. (Captcha or End of Results?)"
                        )

                    all_results.extend(page_data)

                    # Rate limiting between pages
                    sb.sleep(2)

                except Exception as e:
                    logging.error(f"Error on page {page_num}: {e}")
                    continue

            return all_results

    def _parse_page(self, html, page_num, unique_links):
        soup = BeautifulSoup(html, "html.parser")
        results = []

        # Note: If this selector breaks, Bing likely updated their CSS classes
        organic_items = soup.select("li.b_algo")

        for index, item in enumerate(organic_items, start=1):
            t = item.select_one("h2 a")
            if not t:
                continue

            link = self._clean_url(t.get("href"))

            # Skip duplicates across pages
            if link in unique_links:
                continue
            unique_links.add(link)

            # Multiple selectors for A/B testing variations
            s = item.select_one(".b_lineclamp2, .b_caption p, .b_algoSlug")

            rank = ((page_num - 1) * 10) + index

            results.append(
                {
                    "rank": rank,
                    "title": t.get_text(" ", strip=True),
                    "link": link,
                    "snippet": s.get_text(" ", strip=True) if s else "",
                    "page": page_num,
                }
            )

        logging.info(f"Found {len(results)} new results on page {page_num}")
        return results


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-q", "--query", type=str, default="best python frameworks")
    parser.add_argument("-p", "--pages", type=int, default=3)
    parser.add_argument("-o", "--output", type=str)
    args = parser.parse_args()

    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    scraper = BingOrganicScraper()
    data = scraper.scrape(args.query, args.pages)

    if args.output:
        fname = f"{args.output}.json"
    else:
        fname = f"organic_{args.query.replace(' ', '_')}.json"

    with open(fname, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    logging.info(f"Saved {len(data)} results to {fname}")

Running the Scraper

Run it from your terminal. You can control the query and depth via arguments:

# Single page test
python bing_organic_scraper.py -q "agentic ai tutorial" -p 1

# Full run (3 pages)
python bing_organic_scraper.py -q "best python frameworks" -p 3 -o "frameworks"

Understanding the Output

When you run this, you will get a clean JSON array. Notice that the rank is continuous (e.g., the top result on Page 2 is accurately labeled as Rank 11, not Rank 1).

[
  {
    "rank": 1,
    "title": "Top 10 Python Frameworks [2025 Updated]",
    "link": "https://www.geeksforgeeks.org/blogs/best-python-frameworks/",
    "snippet": "23 Jul 2025 · This article presents the 10 best Python frameworks...",
    "page": 1
  },
  {
    "rank": 11,
    "title": "The 9 Best Python Frameworks in 2025",
    "link": "https://blog.botcity.dev/2024/11/28/python-frameworks/",
    "snippet": "28 Nov 2024 · Python Framework vs. Python Library...",
    "page": 2
  }
]
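
Because the schema is stable between runs, rank tracking becomes a simple diff. Here is a small illustrative helper (the two file names are placeholders for saved runs of bing_organic_scraper.py):

import json

def diff_ranks(old_file, new_file):
    """Compare two scrape runs and report rank changes per URL."""
    with open(old_file, encoding="utf-8") as f:
        old = {r["link"]: r["rank"] for r in json.load(f)}
    with open(new_file, encoding="utf-8") as f:
        new = {r["link"]: r["rank"] for r in json.load(f)}

    for link, rank in sorted(new.items(), key=lambda kv: kv[1]):
        previous = old.get(link)
        if previous is None:
            print(f"NEW   rank {rank}: {link}")
        elif previous != rank:
            print(f"MOVED {previous} -> {rank}: {link}")

if __name__ == "__main__":
    diff_ranks("organic_run_monday.json", "organic_run_friday.json")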

How to Scrape Bing AI Answers

AI answers (Copilot summaries, developer cards, generative overviews) are high-value targets because they synthesize data from multiple sources.


But here is the catch: They typically do not exist in the initial HTML. Bing injects them via JavaScript after the page loads, and the DOM structure changes completely depending on the user's intent (e.g., coding vs. general knowledge).

To scrape this reliably, we need a "Wait and Prioritize" strategy.

The Strategy

We are dealing with three specific challenges here:

  • Async Injection. If you parse the HTML immediately after loading, you will get nothing. We must explicitly wait for specific containers to appear in the DOM.
  • Three Different Containers. Bing doesn't use one standard box.
    • .developer_answercard_wrapper: Best for code (has syntax highlighting).
    • .b_genserp_container: Best for general knowledge (has citations).
    • #copans_container: The fallback Chat/Copilot interface (hardest to parse).
  • Priority Logic. Sometimes multiple containers are loaded. Our script will prioritize the Developer Card (most structured), then the Generative SERP, and finally Copilot as a fallback.

The CSS Selectors

Here are the specific targets we are looking for:

  • .developer_answercard_wrapper: The "Tech Card" container (code snippets)
  • .devmag_code: The actual code block inside the Tech Card
  • .b_genserp_container: The "Generative AI" container (text summaries)
  • .gs_heroTextHeader: The main summary headline
  • .gs_cit a: Citation links
  • #copans_container: The Chat/Copilot fallback container

The Code

Save this as bing_ai_scraper.py.

Note: We force the window size to 1920x1080 because Bing often hides the AI sidebar on smaller viewports.

import json
import logging
import argparse
import base64
from urllib.parse import quote_plus, urlparse, parse_qs
from bs4 import BeautifulSoup
from seleniumbase import SB

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%H:%M:%S",
)


class BingAIScraper:
    def _clean_url(self, url):
        """
        Decodes Bing's tracking URLs (bing.com/ck/a?u=...).
        Target URL is base64-encoded in the 'u' parameter.
        """
        if not url or "bing.com/ck/a?" not in url:
            return url
        try:
            parsed = urlparse(url)
            query_params = parse_qs(parsed.query)
            u_param = query_params.get("u", [None])[0]
            if u_param:
                if u_param.startswith("a1"):
                    u_param = u_param[2:]
                # Add padding to make base64 valid (length must be multiple of 4)
                padded = u_param + "=" * (-len(u_param) % 4)
                return base64.urlsafe_b64decode(padded).decode("utf-8")
        except Exception:
            pass
        return url

    def scrape(self, query: str):
        logging.info("Starting AI scrape for: '%s'", query)

        # Force US region where AI features are most stable
        url = f"https://www.bing.com/search?q={quote_plus(query)}&cc=US&setlang=en"

        # Desktop viewport required for AI sidebar to render
        with SB(uc=True, headless=True, window_size="1920,1080") as sb:
            sb.open(url)
            sb.wait_for_element("#b_results", timeout=10)

            # AI answers are injected dynamically via JS
            logging.info("Waiting for AI containers...")
            try:
                sb.wait_for_element(
                    "#copans_container, .b_genserp_container, .developer_answercard_wrapper",
                    timeout=5,
                )
            except Exception:
                logging.info(
                    "No AI container appeared (timeout). Proceeding to parse..."
                )

            html = sb.get_page_source()
            result = self._parse_ai_data(html)
            result["query"] = query

            return result

    def _parse_ai_data(self, html):
        soup = BeautifulSoup(html, "html.parser")

        data = {
            "ai_answer_found": False,
            "type": None,
            "summary": None,
            "content": [],
            "code_snippets": [],
            "sources": [],
        }

        # Three AI result types in priority order:
        # developer_card (tech/code), generative_serp (standard AI), copilot (chat fallback)
        rich_card = soup.select_one(".developer_answercard_wrapper")
        gen_serp = soup.select_one(".b_genserp_container")
        copilot = soup.select_one("#copans_container")

        if rich_card:
            data["ai_answer_found"] = True
            data["type"] = "developer_card"

            title = rich_card.select_one("h2")
            data["summary"] = title.get_text(" ", strip=True) if title else "AI Answer"

            data["content"] = [
                s.get_text(" ", strip=True)
                for s in rich_card.select(".devmag_cntnt_snip, .rd_sub_header")
            ]

            data["code_snippets"] = [
                c.get_text("\n", strip=True) for c in rich_card.select(".devmag_code")
            ]

            data["sources"] = [
                self._clean_url(a.get("href"))
                for a in rich_card.select(".rd_attr_items a")
            ]

        elif gen_serp:
            data["ai_answer_found"] = True
            data["type"] = "generative_serp"

            summary_header = gen_serp.select_one(".gs_heroTextHeader")
            data["summary"] = (
                summary_header.get_text(" ", strip=True) if summary_header else ""
            )

            content_area = gen_serp.select_one(".gs_text.gs_mdr")
            if content_area:
                for block in content_area.select("h3, .gs_p"):
                    t = block.get_text(" ", strip=True)
                    if len(t) > 1:
                        data["content"].append(t)

            data["sources"] = [
                self._clean_url(a.get("href"))
                for a in gen_serp.select(".gs_cit a")
                if a.get("href")
            ]

        elif copilot:
            data["ai_answer_found"] = True
            data["type"] = "copilot_fallback"
            data["content"] = [copilot.get_text("\n", strip=True)]

        return data


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-q", "--query", type=str, default="what are llm tokens")
    parser.add_argument("-o", "--output", type=str)
    args = parser.parse_args()

    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    scraper = BingAIScraper()
    data = scraper.scrape(args.query)

    if args.output:
        fname = f"{args.output}.json"
    else:
        fname = f"ai_{args.query.replace(' ', '_')}.json"

    with open(fname, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    logging.info(f"Saved to {fname} (AI answer found: {data['ai_answer_found']})")

Running the Scraper

You can target different AI types by changing your query intent:

# Knowledge Query (Triggers Generative SERP)
python bing_ai_scraper.py -q "what are llm tokens" -o ai_tokens

# Coding Query (Triggers Developer Card)
python bing_ai_scraper.py -q "python list comprehension syntax" -o ai_python

Understanding the Output

The type field tells you exactly which container we matched.

{
  "ai_answer_found": true,
  "type": "generative_serp",
  "summary": "Tokens in Large Language Models (LLMs) are the basic units of text...",
  "content": [
    "Definition of Tokens",
    "In the context of LLMs, a token is a segment of text...",
    "Whole Words: For example, \"apple\" or \"run.\""
  ],
  "sources": [
    "https://itsfoss.com/llm-token/",
    "https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens"
  ]
}
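
For GEO tracking, the sources list is usually what matters most. Here is a quick sketch that checks whether a given domain (the one below is just an example taken from the output above) is cited in a saved AI answer:

import json
from urllib.parse import urlparse

def is_domain_cited(ai_json_path, domain):
    """Return True if the domain appears in the AI answer's citation list."""
    with open(ai_json_path, encoding="utf-8") as f:
        data = json.load(f)
    cited = {
        urlparse(url).netloc.removeprefix("www.")
        for url in data.get("sources", [])
        if url
    }
    return domain in cited

print(is_domain_cited("ai_tokens.json", "learn.microsoft.com"))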

How to Scrape "People Also Ask" (PAA) from Bing

"People Also Ask" (PAA) boxes are critical for mapping search intent and finding content gaps.

How to scrape "People Also Ask" (PAA) from Bing

The challenge here is that PAA is an interactive element. The answers are hidden inside collapsed accordions, and new questions are lazily loaded as you scroll. If you just requests.get() the page, you will get nothing.

The Strategy

We need a three-step interaction pipeline:

  • Trigger Lazy Loading. Bing loads the first 3-4 questions initially. To get the rest, we must scroll. We will use a randomized scroll function to mimic human reading behavior, which prompts Bing to inject more questions into the DOM.
  • Force Expansion (JS Injection). Clicking accordions one by one with Selenium's .click() is slow and flaky. Instead, we inject a snippet of JavaScript that grabs the accordion headers and fires their click events in one pass, expanding the answers without the overhead of individual clicks.
  • Stateful Parsing. Once everything is expanded and visible, we parse the HTML.

The CSS Selectors

Here is the structure of a PAA card:

  • .acf-accn-itm: The container for a single Q&A pair
  • .acf-accn-itm__hdr-label: The question text
  • .paa-txt: The answer text
  • a.paa-content: The source link (citation)

The Code

Save this as bing_paa_scraper.py.

import json
import logging
import argparse
import base64
import time
import random
from urllib.parse import quote_plus, urlparse, parse_qs
from bs4 import BeautifulSoup
from seleniumbase import SB

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%H:%M:%S",
)


class BingPAAScraper:
    def _clean_url(self, url):
        """
        Decodes Bing's tracking URLs (bing.com/ck/a?u=...).
        Target URL is base64-encoded in the 'u' parameter.
        """
        if not url or "bing.com/ck/a?" not in url:
            return url
        try:
            parsed = urlparse(url)
            query_params = parse_qs(parsed.query)
            u_param = query_params.get("u", [None])[0]

            if u_param:
                if u_param.startswith("a1"):
                    u_param = u_param[2:]

                # Add padding to make base64 valid (length must be multiple of 4)
                padded = u_param + "=" * (-len(u_param) % 4)

                return base64.urlsafe_b64decode(padded).decode("utf-8")
        except Exception:
            pass
        return url

    def scrape(self, query: str):
        logging.info(f"Starting PAA scrape for: '{query}'")

        url = f"https://www.bing.com/search?q={quote_plus(query)}"

        # Use stealth mode to bypass bot detection
        with SB(uc=True, headless=True) as sb:
            try:
                sb.open(url)
                sb.wait_for_element("#b_results", timeout=15)

                self._slow_scroll_to_bottom(sb)
                self._expand_paa(sb)

                html = sb.get_page_source()
                return self._parse_paa(html)

            except Exception as e:
                logging.error(f"Error during scrape: {e}")
                return []

    def _slow_scroll_to_bottom(self, sb):
        """
        Mimics human scrolling behavior to trigger lazy-load and avoid detection.
        """
        logging.info("Scrolling page to trigger lazy-load...")

        last_height = sb.execute_script("return document.body.scrollHeight")

        while True:
            # Randomize scroll behavior to appear human
            step = random.randint(400, 600)
            sb.execute_script(f"window.scrollBy(0, {step});")
            sb.sleep(random.uniform(0.5, 1.0))

            new_height = sb.execute_script("return document.body.scrollHeight")
            scrolled_amount = sb.execute_script(
                "return window.scrollY + window.innerHeight"
            )

            if scrolled_amount >= new_height:
                sb.sleep(1.5)
                new_height = sb.execute_script("return document.body.scrollHeight")
                if scrolled_amount >= new_height:
                    break
            last_height = new_height

    def _expand_paa(self, sb):
        """
        Uses direct JS execution to click PAA accordions.
        More reliable than Selenium's .click() when elements overlap.
        """
        try:
            # Scroll to top so elements are renderable
            sb.execute_script("window.scrollTo(0, 0);")
            sb.sleep(0.5)

            if sb.is_element_visible(".acf-accn-itm"):
                sb.scroll_to(".acf-accn-itm")
                sb.sleep(0.5)

            logging.info("Injecting JS to expand accordions...")

            sb.execute_script(
                """
                let buttons = document.querySelectorAll('.acf-accn-itm__hdr');
                for(let i = 0; i < Math.min(buttons.length, 4); i++) {
                    try { buttons[i].click(); } catch(e) {}
                }
            """
            )
            sb.sleep(2.5)
        except Exception as e:
            logging.warning(f"Could not expand PAA: {e}")

    def _parse_paa(self, html):
        soup = BeautifulSoup(html, "html.parser")
        paa_results = []
        seen_questions = set()

        for item in soup.select(".acf-accn-itm"):
            q_elem = item.select_one(".acf-accn-itm__hdr-label")
            a_elem = item.select_one(".paa-txt")
            link_elem = item.select_one("a.paa-content")

            if q_elem:
                question = q_elem.get_text(" ", strip=True)
                if question in seen_questions:
                    continue

                answer_text = "Could not extract answer"
                if a_elem:
                    answer_text = a_elem.get_text("\n", strip=True)

                paa_results.append(
                    {
                        "question": question,
                        "answer": answer_text,
                        "source_link": (
                            self._clean_url(link_elem.get("href"))
                            if link_elem
                            else None
                        ),
                    }
                )
                seen_questions.add(question)

        return paa_results


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-q", "--query", type=str, default="vibe coding tools")
    parser.add_argument("-o", "--output", type=str)
    args = parser.parse_args()

    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    scraper = BingPAAScraper()
    data = scraper.scrape(args.query)

    if args.output:
        fname = f"{args.output}.json"
    else:
        fname = f"paa_{args.query.replace(' ', '_')}.json"

    with open(fname, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    logging.info(f"Saved {len(data)} PAA items to {fname}")

Running the scraper

# Standard run
python bing_paa_scraper.py -q "vibe coding tools"

Understanding the Output

The output gives you the direct Q&A pairs. Note that the answer text is only captured if the JavaScript expansion was successful.

[
  {
    "question": "What is a vibe coding tool?",
    "answer": "Vibe coding tools are modern platforms that combine AI assistance...",
    "source_link": "https://codeconductor.ai/blog/vibe-coding-tools/"
  }
]
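
If your content team works in spreadsheets, flattening that JSON into a CSV takes only a few lines. A small sketch (the input file name is whatever bing_paa_scraper.py produced for your query):

import csv
import json

with open("paa_vibe_coding_tools.json", encoding="utf-8") as f:
    rows = json.load(f)

with open("paa_questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer", "source_link"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} questions to paa_questions.csv")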

How to Scrape Bing Ad Results Without Mistakes

Ad results are the only way to see exactly what competitors are paying for. They reveal copy strategies, landing pages, and "Share of Voice".


The challenge? Bing ads are fragmented. They appear at the top and bottom of the page, and the DOM structure is full of legacy code and actual typos (e.g., icondomian). If you trust the standard class names, you may miss a significant portion of the data.

The Strategy

To build a robust ad scraper, we need to handle three specific issues:

  • Dual Positions. Ads live in .b_ad.sb_top (Header) and .b_ad.sb_bottom (Footer). We must scrape both and tag them to calculate the true rank.
  • Legacy Typo. Bing's CSS is messy. Display URLs sometimes use .b_adurl and sometimes .b_addurl (double 'd'). The advertiser domain class is often .b_icondomian (misspelled). Our scraper checks for both correct and broken spellings.
  • Tracking Wrappers. Every ad link is wrapped in a bing.com/aclk redirect. We will decode the base64 u parameter to get the actual landing page without clicking.

The CSS Selectors

Here is the "messy reality" of Bing's Ad DOM:

  • .b_ad.sb_top: Top ad container
  • .b_ad.sb_bottom: Bottom ad container
  • .b_adTopIcon_domain, .b_icondomian: Advertiser name
  • .b_adurl, .b_addurl: Display URL
  • .b_vlist2col li a: Sitelinks (sub-links under the main ad)

The Code

Save this as bing_ads_scraper.py.

import json
import logging
import base64
import argparse
from urllib.parse import quote_plus, urlparse, parse_qs, unquote
from bs4 import BeautifulSoup
from seleniumbase import SB

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%H:%M:%S",
)


class BingAdsExtractor:
    def _clean_url(self, url):
        """
        Decodes Bing's tracking URLs to get actual destination.
        Bing wraps links in bing.com/aclk with target in base64-encoded 'u' parameter.
        """
        if not url or ("bing.com/aclk" not in url and "bing.com/ck/a" not in url):
            return url

        try:
            parsed = urlparse(url)
            query_params = parse_qs(parsed.query)
            u_param = query_params.get("u", [None])[0]

            if u_param:
                if u_param.startswith("a1"):
                    u_param = u_param[2:]

                # Add padding to make base64 string valid (length must be multiple of 4)
                u_param += "=" * (-len(u_param) % 4)

                decoded = base64.urlsafe_b64decode(u_param).decode("utf-8")
                return unquote(decoded)
        except Exception:
            pass
        return url

    def _parse_ads(self, html):
        soup = BeautifulSoup(html, "html.parser")
        ads = []
        seen = set()

        for container in soup.select("li.b_ad ul"):
            parent_classes = container.parent.get("class", [])
            position = "top" if "b_adTop" in parent_classes else "bottom"

            for rank, item in enumerate(
                container.find_all("li", recursive=False), start=1
            ):
                try:
                    title_el = item.select_one("h2 a")
                    if not title_el:
                        continue

                    title = title_el.get_text(" ", strip=True)
                    url = self._clean_url(title_el.get("href"))

                    # Skip duplicates (same ad can appear multiple times in DOM)
                    fingerprint = (title, url)
                    if fingerprint in seen:
                        continue
                    seen.add(fingerprint)

                    # Note: .b_icondomian is a Bing typo that still exists in production
                    domain_el = item.select_one(".b_adTopIcon_domain, .b_icondomian")
                    if domain_el:
                        domain = domain_el.get_text(strip=True)
                    else:
                        cite = item.select_one("cite")
                        domain = (
                            cite.get_text(strip=True).split(" ")[0]
                            if cite
                            else "Unknown"
                        )

                    display_el = item.select_one(".b_adurl cite, .b_addurl cite")
                    display_url = (
                        display_el.get_text(strip=True) if display_el else domain
                    )

                    desc_el = item.select_one(".b_ad_description")
                    if desc_el:
                        # Remove "Ad" labels before extracting description text
                        for label in desc_el.select(".b_adSlug"):
                            label.decompose()
                        description = desc_el.get_text(" ", strip=True)
                    else:
                        description = ""

                    callouts = []
                    for callout in item.select(".b_secondaryText, .b_topAd"):
                        text = (
                            callout.get_text(strip=True)
                            .replace("\u00a0", " ")
                            .replace("\u00b7", "|")
                        )
                        callouts.append(text)

                    sitelinks = []
                    for link in item.select(
                        ".b_vlist2col li a, .hsl_carousel .slide a"
                    ):
                        link_title = link.get_text(strip=True)
                        link_url = self._clean_url(link.get("href"))
                        if link_title and link_url:
                            sitelinks.append({"title": link_title, "url": link_url})

                    ads.append(
                        {
                            "title": title,
                            "advertiser": domain,
                            "position": position,
                            "rank_in_block": rank,
                            "display_url": display_url,
                            "real_url": url,
                            "description": description,
                            "callouts": callouts,
                            "sitelinks": sitelinks,
                        }
                    )

                except Exception as e:
                    logging.warning(f"Error parsing ad: {e}")

        return ads

    def scrape(self, query):
        logging.info(f"Scraping ads for: {query}")
        url = f"https://www.bing.com/search?q={quote_plus(query)}"

        # Use stealth mode to bypass bot detection
        with SB(uc=True, headless=True) as sb:
            sb.open(url)
            sb.wait_for_element("#b_results", timeout=15)

            # Trigger footer ads to load
            sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sb.sleep(3)

            html = sb.get_page_source()
            return self._parse_ads(html)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-q", "--query", type=str, default="marketing automation software"
    )
    parser.add_argument("-o", "--output", type=str, default="ads_output")
    args = parser.parse_args()

    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    scraper = BingAdsExtractor()
    ads = scraper.scrape(args.query)

    filename = f"{args.output}.json"
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(ads, f, indent=2, ensure_ascii=False)

    logging.info(f"Done! {len(ads)} ads saved to {filename}")

Running the Scraper

Ads only appear for commercial queries. Try this:

# High competition query
python bing_ads_scraper.py -q "marketing automation tools"

Understanding the Output

The output distinguishes between Top (Premium) and Bottom (Footer) placement, which is critical for estimating ad spend.

[
  {
    "title": "Marketing Automation Software - Best of 2025",
    "advertiser": "Wix.com",
    "position": "top",
    "rank_in_block": 1,
    "display_url": "wix.com",
    "real_url": "https://www.wix.com/blog/best-marketing-automation-software...",
    "callouts": [
      "220,000,000+ users | SEO learning hub"
    ]
  }
]
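
With the advertiser and position fields captured, a basic "Share of Voice" calculation is just post-processing. A sketch over the saved ads JSON:

import json
from collections import Counter

def share_of_voice(ads_file):
    """Return each advertiser's share of the captured ad slots, as a percentage."""
    with open(ads_file, encoding="utf-8") as f:
        ads = json.load(f)
    counts = Counter(ad["advertiser"] for ad in ads)
    total = sum(counts.values()) or 1
    return {adv: round(100 * n / total, 1) for adv, n in counts.most_common()}

print(share_of_voice("ads_output.json"))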

How to Scrape Bing News Results Quickly


News results are time-sensitive. If you are building a monitoring tool, you don't want "relevant" news from 2021; you want what happened in the last hour.

The challenge with Bing News isn't the HTML; it's the URL. Bing uses a cryptic qft parameter with internal interval codes that are not documented.

The Strategy

To get this working, we need to handle three specific mechanics:

  • The URL Hack. We reverse-engineered the qft parameter. You need to pass specific interval="X" strings to filter by time.

    • interval="4": Past Hour (Critical for breaking news)
    • interval="7": Past 24 Hours
    • Crucial Detail: You must append &form=PTFTNR to the URL, or Bing ignores your filters completely.
  • Dual Extraction. Bing frequently A/B tests its CSS class names. However, the data-* attributes on the .news-card element (like data-title, data-url) are much more stable. Our scraper prioritizes these attributes and only falls back to CSS selectors if they are missing.

  • Batch Loading. Bing typically loads approximately 8 articles initially (this may vary). To get more, we scroll to the bottom to trigger the lazy-loader.

Note: The interval parameter codes shown in this guide are reverse-engineered from Bing's network traffic and are not officially documented. These codes may change without notice.

The CSS Selectors

  • .news-card: The main container for each article
  • data-title (attribute): Primary source for the headline
  • data-url (attribute): Primary source for the direct link
  • data-author (attribute): Primary source for the publisher/source
  • a.title: Fallback headline selector if the attributes are missing
  • span[aria-label]: The relative time string (e.g., "53m", "2h")

The Code

Save this as bing_news_scraper.py.

import json
import logging
import argparse
import datetime
import time
from urllib.parse import quote_plus
from typing import List, Dict, Any

from seleniumbase import SB
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")


class BingNewsScraper:
    def scrape(
        self, query: str, count: int, sort_by: str = "relevance", recency: str = "any"
    ) -> Dict[str, Any]:
        url = self._build_url(query, sort_by, recency)
        logging.info(f"Target URL: {url}")

        # Use stealth mode to avoid bot detection and CAPTCHA challenges
        with SB(uc=True, headless=True) as sb:
            sb.open(url)
            sb.wait_for_element(".news-card", timeout=10)

            self._scroll_to_load(sb, count)

            html = sb.get_page_source()

            articles = self._parse_articles(html)
            articles = articles[:count]

            results = {
                "search_parameters": {
                    "query": query,
                    "sort": sort_by,
                    "recency": recency,
                    "url_used": url,
                },
                "total_count": len(articles),
                "scraped_at": datetime.datetime.now().isoformat(),
                "articles": articles,
            }

            logging.info(f"Scraping finished. Extracted {len(articles)} articles.")
            return results

    def _build_url(self, query: str, sort_by: str, recency: str) -> str:
        base_url = "https://www.bing.com/news/search"

        # Bing's internal time interval codes (reverse-engineered from network traffic)
        time_filters = {
            "hour": 'interval="4"',
            "day": 'interval="7"',
            "week": 'interval="8"',
            "month": 'interval="9"',
            "year": 'interval="10"',
        }

        filters = []
        if recency in time_filters:
            filters.append(time_filters[recency])

        # sortbydate="1" forces chronological order (default is 0 for relevance)
        if sort_by == "date":
            filters.append('sortbydate="1"')

        url = f"{base_url}?q={quote_plus(query)}"
        if filters:
            # Multiple filters in 'qft' are separated by '+'
            url += f"&qft={quote_plus('+'.join(filters))}"

        # form=PTFTNR is required for filters to apply correctly
        url += "&form=PTFTNR"

        return url

    def _scroll_to_load(self, sb, count: int):
        loaded = 0
        retries = 0
        max_retries = 3

        while loaded < count:
            cards = sb.find_elements(".news-card")
            current = len(cards)
            logging.info(f"Loaded {current} of {count} articles...")

            # Track consecutive failed attempts to prevent infinite loops
            if current == loaded:
                retries += 1
                if retries >= max_retries:
                    logging.warning("No new articles loaded after scrolling.")
                    break
            else:
                retries = 0
                loaded = current

            sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sb.sleep(2.0)

    def _parse_articles(self, html: str) -> List[Dict[str, str]]:
        """
        Two-tier extraction strategy:
        1. data-* attributes (reliable)
        2. CSS selectors (fallback for A/B tests)
        """
        soup = BeautifulSoup(html, "html.parser")
        articles = []

        for card in soup.select(".news-card"):
            try:
                title = card.get("data-title")
                url = card.get("data-url")

                # Fallback to visible elements if data attributes missing
                if not title or not url:
                    link = card.select_one("a.title")
                    if link:
                        if not title:
                            title = link.get_text(strip=True)
                        if not url:
                            url = link.get("href")

                source = card.get("data-author") or "Unknown"

                published_el = card.select_one("span[aria-label]")
                published = published_el.get_text(strip=True) if published_el else None

                snippet_el = card.select_one(".snippet")
                snippet = snippet_el.get_text(strip=True) if snippet_el else ""

                img = card.select_one("img.rms_img")
                # data-src-hq often contains higher resolution images
                image = img.get("data-src-hq") or img.get("src") if img else None

                if title and url:
                    articles.append(
                        {
                            "title": title,
                            "url": url,
                            "source": source,
                            "published": published,
                            "snippet": snippet,
                            "image": image,
                        }
                    )

            except Exception:
                pass

        return articles


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Bing News Scraper with Filters")
    parser.add_argument("-q", "--query", type=str, required=True, help="Search query")
    parser.add_argument(
        "-c", "--count", type=int, default=20, help="Number of articles to scrape"
    )
    parser.add_argument("-o", "--output", type=str, help="Custom output filename")
    parser.add_argument(
        "--sort",
        type=str,
        choices=["relevance", "date"],
        default="relevance",
        help="Sort by: 'relevance' or 'date'",
    )
    parser.add_argument(
        "--when",
        type=str,
        choices=["any", "hour", "day", "week", "month", "year"],
        default="any",
        help="Time range: hour, day, week, month, year",
    )

    args = parser.parse_args()

    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    scraper = BingNewsScraper()
    data = scraper.scrape(args.query, args.count, args.sort, args.when)

    if args.output:
        filename = f"{args.output}.json"
    else:
        safe_q = "".join([c if c.isalnum() else "_" for c in args.query]).lower()
        timestamp = int(time.time())
        filename = f"news_{safe_q}_{args.when}_{args.sort}_{timestamp}.json"

    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    logging.info(f"Results saved to: {filename}")

Running the Scraper

You can now run the time-based queries:

# Get the last 10 breaking news items (Past Hour, Sorted by Date)
python bing_news_scraper.py -q "breaking news" -c 10 --when hour --sort date

# Get 20 AI articles from the last week
python bing_news_scraper.py -q "artificial intelligence" -c 20 --when week

Understanding the Output

The published field returns the raw relative time string (e.g., "20m" or "2h"). If you are storing this in a database, you should normalize these to UTC timestamps immediately after scraping.

[
  {
    "title": "Corona Remedies IPO: GMP signals strong debut",
    "url": "https://www.msn.com/en-in/news/other/corona-remedies-ipo...",
    "source": "Mint on MSN",
    "published": "20m",
    "image": "//th.bing.com/th?id=OVFT.afNtSH4aw5tC7OrXHoXk3S..."
  }
]
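
Here is a rough normalizer for those relative strings. It assumes the article was scraped "now" and only handles the minute/hour/day shorthand; Bing sometimes shows absolute dates for older items, which this sketch leaves alone by returning None:

import re
from datetime import datetime, timedelta, timezone

def normalize_published(raw, scraped_at=None):
    """Convert strings like '20m', '2h', or '3d' into an approximate UTC timestamp."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    match = re.fullmatch(r"(\d+)\s*([mhd])", (raw or "").strip().lower())
    if not match:
        return None  # absolute dates or unknown formats: keep the raw value instead
    value, unit = int(match.group(1)), match.group(2)
    delta = {
        "m": timedelta(minutes=value),
        "h": timedelta(hours=value),
        "d": timedelta(days=value),
    }[unit]
    return (scraped_at - delta).isoformat()

print(normalize_published("20m"))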

How to Scrape Bing Maps Places Safely

Maps data is the gold standard for lead generation because it's structured: Name, Phone, Address, Website, and Coordinates.


But scraping Bing Maps is frustrating if you don't know the UI architecture. The results live in a fixed sidebar that is separate from the main page scroll. If you tell Selenium to window.scrollTo(0, 1000), absolutely nothing will happen.

The Strategy

We need to solve three specific challenges:

  • Sidebar Scrolling. We must locate the specific container (.b_lstcards) and scroll that element using JavaScript.
  • The Hard Limit. Unlike Google Maps, which can paginate through hundreds of results, Bing Maps often returns a more limited set. The exact count varies by query, location, and business density – results commonly stop loading after several dozen entries, though this behavior may change.
  • The "Hidden" JSON. Instead of scraping messy HTML text (where phone numbers and addresses are formatted inconsistently), we will extract the data-entity attribute from each card. This contains a clean JSON object with the raw data.

The CSS Selectors

  • .b_lstcards: The sidebar. This is what we must scroll
  • .b_maglistcard: The individual business card container
  • data-entity (attribute): A JSON attribute on the card containing the clean data
  • .l_rev_pirs: Star rating (e.g., "4/5")

The Code

Save this as bing_maps_scraper.py.

import time
import json
import csv
import logging
import argparse
from urllib.parse import quote_plus
from seleniumbase import SB
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")


class BingMapsScraper:
    def scrape(self, query: str, count: int):
        logging.info("Scraping maps for: '%s' (target: %d)", query, count)
        url = f"https://www.bing.com/maps?q={quote_plus(query)}"

        # Desktop viewport required for sidebar rendering
        with SB(uc=True, headless=True, window_size="1920,1080") as sb:
            sb.open(url)
            sb.wait_for_element(".b_maglistcard", timeout=10)

            self._scroll_to_load(sb, count)

            logging.info("Extracting data...")
            return self._parse_places(sb.get_page_source())

    def _scroll_to_load(self, sb, count: int):
        """Scrolls the sidebar container, not the main window."""
        loaded = 0
        retries = 0
        max_retries = 5

        while loaded < count:
            cards = sb.find_elements(".b_maglistcard")
            current = len(cards)
            logging.info("Loaded %d / %d places...", current, count)

            if current >= count:
                break

            if current == loaded and current > 0:
                retries += 1
                if retries >= max_retries:
                    logging.warning("No new places loaded after 5 retries. Stopping.")
                    break
            else:
                retries = 0
                loaded = current

            # Scroll the sidebar element (.b_lstcards), not window
            sb.execute_script(
                """
                (function() {
                    let sidebar = document.querySelector('.b_lstcards');
                    if(sidebar) {
                        sidebar.scrollTop = sidebar.scrollHeight;
                    }
                })();
            """
            )
            time.sleep(2.5)

    def _parse_places(self, html: str):
        """Extracts structured data from data-entity JSON attribute."""
        soup = BeautifulSoup(html, "html.parser")
        places = []

        for card in soup.select(".b_maglistcard"):
            try:
                entity_json = card.get("data-entity")
                if not entity_json:
                    continue

                data = json.loads(entity_json)
                entity = data.get("entity", {})
                geo = data.get("geometry", {})

                address_raw = entity.get("address", "N/A")
                address = (
                    address_raw.get("formattedAddress", "N/A")
                    if isinstance(address_raw, dict)
                    else str(address_raw)
                )

                # Reviews are in separate DOM elements, not in JSON
                rating_el = card.select_one(".l_rev_pirs")
                rating = rating_el.get_text(strip=True) if rating_el else "N/A"

                reviews_el = card.select_one(".l_rev_rc")
                reviews = reviews_el.get_text(strip=True) if reviews_el else "0"

                places.append(
                    {
                        "name": entity.get("title"),
                        "phone": entity.get("phone", "N/A"),
                        "address": address,
                        "website": entity.get("website", "N/A"),
                        "category": entity.get("primaryCategoryName", "N/A"),
                        # Note: Bing uses y=latitude, x=longitude (not standard x/y convention)
                        "latitude": geo.get("y"),
                        "longitude": geo.get("x"),
                        "rating": rating,
                        "review_count": reviews,
                    }
                )

            except Exception as e:
                logging.warning("Error parsing place: %s", e)

        return places


def _save_results(data, base_filename):
    json_file = f"{base_filename}.json"
    with open(json_file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    if data:
        csv_file = f"{base_filename}.csv"
        headers = list(data[0].keys())
        with open(csv_file, "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.DictWriter(f, fieldnames=headers)
            writer.writeheader()
            writer.writerows(data)
        logging.info(f"Saved {len(data)} results to {csv_file}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Bing Maps Scraper")
    parser.add_argument("-q", "--query", type=str, default="coffee shops in Seattle")
    parser.add_argument("-c", "--count", type=int, default=10)
    parser.add_argument("-o", "--output", type=str, help="Output filename")

    args = parser.parse_args()
    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    scraper = BingMapsScraper()
    places = scraper.scrape(args.query, args.count)

    if args.output:
        fname = args.output
    else:
        fname = f"maps_{args.query.replace(' ', '_')}"

    _save_results(places, fname)

Running the Scraper

You can get both JSON and CSV output automatically.

# Get 20 leads for "Dentists in New York"
python bing_maps_scraper.py -q "dentists in New York" -c 20

Understanding the Output

Because we parsed the JSON, the data is extremely clean. Note that Bing uses y for Latitude and x for Longitude (a quirk of their coordinate system).

{
  "name": "Balthazar",
  "phone": "(212) 965-1414",
  "address": "80 Spring St, New York, NY 10012",
  "category": "Restaurant",
  "latitude": 40.72263336,
  "longitude": -73.99828338
}
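
Because the coordinates come out clean, converting the results into GeoJSON for mapping tools is straightforward. A sketch (the input file name follows the scraper's default naming for the "dentists in New York" run above):

import json

def to_geojson(places):
    """Wrap scraped places in a GeoJSON FeatureCollection (longitude first, per the spec)."""
    features = []
    for place in places:
        if place.get("latitude") is None or place.get("longitude") is None:
            continue
        features.append(
            {
                "type": "Feature",
                "geometry": {
                    "type": "Point",
                    "coordinates": [place["longitude"], place["latitude"]],
                },
                "properties": {
                    k: v for k, v in place.items() if k not in ("latitude", "longitude")
                },
            }
        )
    return {"type": "FeatureCollection", "features": features}

with open("maps_dentists_in_New_York.json", encoding="utf-8") as f:
    places = json.load(f)

with open("places.geojson", "w", encoding="utf-8") as f:
    json.dump(to_geojson(places), f, indent=2)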

How to Scrape Bing Images

If you just scrape the visible <img> tags on Bing, you will be disappointed. Those tags only contain low-res, base64-encoded thumbnails that look terrible when scaled up.


To get the original, high-resolution URL (essential for ML datasets or visual research), you must parse the hidden m attribute. Bing stores the real metadata there as a JSON string.

The Strategy

We need to solve three specific mechanics to make this work:

  • The Hidden JSON. We target the anchor tag <a class="iusc">. Inside, there is an attribute called m. We parse this string as JSON to extract the murl (the real, high-res image URL).
  • Infinite Scroll + "See More". Bing auto-loads images for a while, but eventually hits a "See More Results" button. Our script detects this button and clicks it to keep the pipeline moving.
  • Duplicate Filtering. Infinite scrolling often re-renders elements. We use a set() to track URLs and ensure we don't save the same image twice.

The CSS Selectors

  • a.iusc: The container for the image result
  • m (attribute): Hidden JSON containing murl (the high-res URL)
  • a.btn_seemore: The "See More Results" button that breaks infinite scroll (needs clicking)

The Code

Save this as bing_image_scraper.py.

import json
import logging
import argparse
import time
from urllib.parse import quote_plus
from seleniumbase import SB

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%H:%M:%S",
)


class BingImageScraper:
    def scrape(self, query: str, limit: int = 50):
        logging.info(f"Starting Image Scrape for: '{query}' (Target: {limit})")
        url = f"https://www.bing.com/images/search?q={quote_plus(query)}"

        image_data = []
        unique_urls = set()

        with SB(uc=True, headless=True) as sb:
            sb.open(url)
            sb.wait_for_element("a.iusc", timeout=10)

            logging.info("Page loaded. collecting images...")

            last_count = 0
            retries = 0

            while len(image_data) < limit:
                elements = sb.find_elements("a.iusc")

                for elem in elements:
                    if len(image_data) >= limit:
                        break

                    try:
                        # 'm' attribute contains JSON with image URL and metadata
                        m_attr = elem.get_attribute("m")
                        if not m_attr:
                            continue

                        data = json.loads(m_attr)
                        img_url = data.get("murl")
                        title = data.get("t")

                        if img_url and img_url not in unique_urls:
                            unique_urls.add(img_url)
                            image_data.append(
                                {
                                    "title": title,
                                    "image_url": img_url,
                                    "source_url": data.get("purl"),
                                }
                            )
                    except Exception:
                        continue

                logging.info(f"Collected {len(image_data)} / {limit} images...")

                if len(image_data) >= limit:
                    break

                sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                sb.sleep(1.5)

                if sb.is_element_visible("a.btn_seemore"):
                    logging.info("Clicking 'See More' button...")
                    sb.click("a.btn_seemore")
                    sb.sleep(2)

                current_count = len(sb.find_elements("a.iusc"))
                if current_count == last_count:
                    retries += 1
                    if retries >= 3:
                        logging.warning(
                            "No new images loading. Reached end of results?"
                        )
                        break
                else:
                    retries = 0

                last_count = current_count

        return image_data


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Bing Image Scraper")
    parser.add_argument(
        "-q", "--query", type=str, required=True, help="Image search query"
    )
    parser.add_argument(
        "-c", "--count", type=int, default=50, help="Number of images to scrape"
    )
    parser.add_argument("-o", "--output", type=str, help="Output filename")

    args = parser.parse_args()

    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    scraper = BingImageScraper()
    images = scraper.scrape(args.query, args.count)

    if images:
        fname = (
            args.output
            if args.output
            else f"images_{args.query.replace(' ', '_')}.json"
        )
        with open(fname, "w", encoding="utf-8") as f:
            json.dump(images, f, indent=2, ensure_ascii=False)
        logging.info(f"Saved {len(images)} images to {fname}")
    else:
        logging.error("Scrape failed or no images found.")

Running the Scraper

Run this from your terminal:

# Get 50 high-res sunset photos
python bing_image_scraper.py -q "sunset photography" -c 50

# Get 100 architecture photos for a dataset
python bing_image_scraper.py -q "modern architecture" -c 100

Understanding the Output

The image_url field is the important one here: it links directly to the original file on the source site, not to Bing's compressed thumbnail.

[
  {
    "title": "55 Beautiful Examples of Sunset Photography",
    "image_url": "https://www.thephotoargus.com/wp-content/uploads/2019/09/sunsetphotography01.jpg",
    "source_url": "https://www.thephotoargus.com/beautiful-examples-of-sunset-photography/"
  }
]
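
If you also need the files themselves (for example, to build an ML dataset), a minimal downloader sketch might look like this. It reuses curl_cffi from earlier; the JSON filename and output folder are placeholders.

import json
import os
from urllib.parse import urlparse
from curl_cffi import requests

def download_images(json_path, out_dir="images"):
    os.makedirs(out_dir, exist_ok=True)
    with open(json_path, encoding="utf-8") as f:
        items = json.load(f)

    for i, item in enumerate(items):
        url = item.get("image_url")
        if not url:
            continue
        # Keep the original extension when the URL has one; default to .jpg
        ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
        try:
            resp = requests.get(url, impersonate="chrome", timeout=30)
            if resp.status_code == 200:
                with open(os.path.join(out_dir, f"img_{i:04d}{ext}"), "wb") as out:
                    out.write(resp.content)
        except Exception as e:
            print(f"Skipped {url}: {e}")

# download_images("images_sunset_photography.json")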

How to Scrape Bing Recipes

Recipe data (Calories, Prep Time, Ratings) is highly valuable for food aggregators. Bing presents this data in a sleek Carousel at the top of the search results.

The challenge is lazy loading. Bing only renders the first 3 or 4 recipes in the DOM; the rest do not exist until you click the "Next" button. If you just scroll, you will miss most of the data.

The Strategy

To scrape this reliably, we need a different interaction model than standard scrolling:

  • Active Navigation. We must locate the .btn.next element inside the carousel container (.b_acf_crsl) and click it repeatedly to force Bing to render the hidden slides.
  • Calculated Clicks. We don't click blindly; we derive the number of clicks from your target count (roughly 3 items per slide, so about count / 3 clicks plus a small buffer). This prevents wasting time clicking a finished carousel.
  • Custom HTML Tags. Bing uses non-standard elements like <acf-badge> for star ratings. Class-based selectors won't match them, so we target the custom tag name directly (acf-badge span).

The CSS Selectors

Selector | Description
.b_acf_crsl | The carousel container
.btn.next | The navigation button we must click to load more items
.acf_p_multi_str | Stats line (e.g., "1 hr · 450 cals")
acf-badge span | Custom element holding the star rating (e.g., "4.5")

The Code

Save this as bing_recipe_scraper.py.

import json
import logging
import argparse
import math
import datetime
from seleniumbase import SB
from bs4 import BeautifulSoup

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%H:%M:%S",
)


class BingRecipeScraper:
    def scrape(self, query: str, count: int = 20):
        # Append "recipes" to trigger carousel if not present
        full_query = f"{query} recipes" if "recipe" not in query.lower() else query
        url = f"https://www.bing.com/search?q={full_query.replace(' ', '+')}"

        logging.info(f"Starting scrape: '{query}' (Target: {count})")

        with SB(uc=True, headless=True) as sb:
            sb.open(url)
            sb.wait_for_element("#b_results", timeout=10)

            if sb.is_element_visible(".b_acf_crsl"):
                estimated_clicks = math.ceil(count / 3) + 2
                max_clicks = min(estimated_clicks, 30)
                self._expand_carousel(sb, max_clicks)
            else:
                logging.warning("No recipe carousel found for this query")

            html = sb.get_page_source()
            scraped_data = self._parse_full_page(html)

            final_output = {
                "meta": {
                    "query": query,
                    "scraped_at": datetime.datetime.now().isoformat(),
                    "locale": "en-US",
                    "total_carousel_items": len(scraped_data["carousel_top"]),
                    "total_organic_items": len(scraped_data["organic_list"]),
                },
                "data": {
                    "recipes_surface": scraped_data["carousel_top"][:count],
                    "organic_results": scraped_data["organic_list"],
                },
            }

            return final_output

    def _expand_carousel(self, sb, max_clicks):
        next_btn_selector = ".b_acf_crsl .btn.next"

        if not sb.is_element_visible(".b_acf_crsl"):
            logging.warning("No recipe carousel found on this page.")
            return

        logging.info(f"Expanding carousel (Max clicks: {max_clicks})...")

        for i in range(max_clicks):
            try:
                if sb.is_element_visible(next_btn_selector):
                    classes = sb.get_attribute(next_btn_selector, "class")
                    if "disabled" in classes:
                        logging.info("Carousel reached the end.")
                        break

                    sb.click(next_btn_selector)
                    sb.sleep(0.5)
                else:
                    break
            except Exception as e:
                logging.warning(f"Error clicking carousel: {e}")
                break

    def _parse_full_page(self, html):
        soup = BeautifulSoup(html, "html.parser")
        data = {"carousel_top": [], "organic_list": []}
        seen_titles = set()

        slides = soup.select(".b_acf_crsl .slide")
        for slide in slides:
            try:
                title_elem = slide.select_one(".acf_p_title")
                if not title_elem:
                    continue

                title = title_elem.get_text(strip=True)
                if title in seen_titles:
                    continue

                stats = slide.select_one(".acf_p_multi_str")
                link = slide.select_one(".b_acf_card_link")
                img = slide.select_one(".cico img")
                # Note: acf-badge is a custom element, not a standard CSS class
                rating = slide.select_one("acf-badge span")

                full_link = link.get("href") if link else None
                if full_link and full_link.startswith("/"):
                    full_link = f"https://www.bing.com{full_link}"

                data["carousel_top"].append(
                    {
                        "title": title,
                        "rating": rating.get_text(strip=True) if rating else None,
                        "time_and_stats": stats.get_text(strip=True) if stats else None,
                        "link": full_link,
                        "thumbnail": img.get("src") if img else None,
                    }
                )
                seen_titles.add(title)
            except Exception:
                continue

        for item in soup.select("li.b_algo"):
            try:
                title_elem = item.select_one("h2 a")
                if not title_elem:
                    continue

                fact_row = item.select_one(".b_factrow")
                snippet = item.select_one(".b_lineclamp2")

                data["organic_list"].append(
                    {
                        "title": title_elem.get_text(strip=True),
                        "link": title_elem.get("href"),
                        "stats": (
                            fact_row.get_text(" | ", strip=True) if fact_row else None
                        ),
                        "snippet": snippet.get_text(strip=True) if snippet else None,
                    }
                )
            except Exception:
                continue

        return data


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-q",
        "--query",
        type=str,
        required=True,
        help="Recipe search term (e.g. 'Pasta')",
    )
    parser.add_argument(
        "-c", "--count", type=int, default=20, help="Number of carousel items to target"
    )
    parser.add_argument("-o", "--output", type=str, help="Output JSON filename")
    args = parser.parse_args()

    logging.getLogger("seleniumbase").setLevel(logging.WARNING)

    bot = BingRecipeScraper()

    results = bot.scrape(args.query, args.count)

    if results:
        fname = (
            args.output
            if args.output
            else f"recipes_{args.query.replace(' ', '_')}.json"
        )
        with open(fname, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        logging.info(
            f"Saved {len(results['data']['recipes_surface'])} carousel + {len(results['data']['organic_results'])} organic to {fname}"
        )
    else:
        logging.error("Scrape failed.")

Running the Scraper

The script automatically appends "recipes" to your query to maximize the chance of triggering the carousel.

# Get 20 Chocolate Cake recipes
python bing_recipe_scraper.py -q "chocolate cake" -c 20

# Get 50 Pasta recipes for a content database
python bing_recipe_scraper.py -q "pasta" -c 50

Understanding the Output

Note how the time_and_stats field aggregates multiple data points (time, calories, servings) because Bing groups them into a single string. You can split it later with a small regex helper like the one below.

[
  {
    "title": "Chocolate Cake Recipe",
    "rating": "4.9",
    "time_and_stats": "1 hr 5 min · 1138 cals · 10 servs",
    "link": "https://www.bing.com/images/search?view=detailv2&..."
  }
]
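
A minimal splitting helper, assuming the " · " separator and the hr/min, cals, and servs wording shown above (other locales will differ):

import re

def split_stats(stats: str):
    parts = [p.strip() for p in stats.split("·")] if stats else []
    result = {"time": None, "calories": None, "servings": None}
    for part in parts:
        if re.search(r"\b(hr|min)\b", part):
            result["time"] = part
        elif "cal" in part:
            m = re.search(r"\d+", part)
            result["calories"] = int(m.group()) if m else None
        elif "serv" in part:
            m = re.search(r"\d+", part)
            result["servings"] = int(m.group()) if m else None
    return result

print(split_stats("1 hr 5 min · 1138 cals · 10 servs"))
# {'time': '1 hr 5 min', 'calories': 1138, 'servings': 10}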

How to Structure, Clean, and Store Bing Scraping Data

Scraping provides raw data, but the output is often inconsistent across different search surfaces. Organic results use link, Ads use real_url, and Maps results often lack a website entirely. Normalizing this data is necessary before you store it.

The Cleanup Pipeline (Pandas)

Using Pandas allows you to handle schema normalization and deduplication efficiently in memory.

The main logical challenge here is deduplication. Standard web scraping deduplicates by URL, but a large share of local businesses on Maps have no website at all. If you deduplicate by URL, you will silently collapse legitimate businesses that all share a null value. The solution is to switch deduplication keys based on the data type: Name and Address for Maps, URL for everything else.

Here is a standard normalization function:

import pandas as pd
from datetime import datetime, timezone

def clean_bing_data(results):
    df = pd.DataFrame(results)
    if df.empty:
        return df
    
    # 1. Normalize Schema
    # Map all URL variations to a single standard 'url' column
    column_mapping = {
        'link': 'url',      # Organic
        'real_url': 'url',  # Ads
        'website': 'url'    # Maps
    }
    df = df.rename(columns=column_mapping)
    
    # 2. Deduplication Logic
    # Maps results often lack URLs. Deduplicate by Address instead to avoid data loss.
    if 'address' in df.columns and 'name' in df.columns:
        df = df.drop_duplicates(subset=['name', 'address'], keep='first')
    elif 'url' in df.columns:
        df = df.drop_duplicates(subset=['url'], keep='first')
    
    # 3. Text Normalization
    # Strip whitespace and fix internal newlines
    text_cols = ['title', 'snippet', 'description', 'name', 'address']
    for col in text_cols:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip().str.replace(r'\s+', ' ', regex=True)
    
    # 4. Type Casting
    if 'rank' in df.columns:
        df['rank'] = pd.to_numeric(df['rank'], errors='coerce').fillna(0).astype(int)
    
    # 5. Audit Trail
    df['scraped_at'] = datetime.now(timezone.utc).isoformat()  # utcnow() is deprecated in Python 3.12+
    
    return df

Storage Strategy

Your storage choice should depend on your data volume and access patterns.

Volume | Recommended Format | Why
Small (< 10k rows) | CSV / JSON | Human-readable. Good for quick analysis or sharing with non-technical teams.
Medium (10k+ rows) | Parquet | Compressed columnar format. It saves disk space and preserves data types better than CSV.
Ad-hoc analysis | SQLite | Serverless SQL. Useful for joining distinct datasets locally, such as combining organic results with ads.
Production | PostgreSQL | Best for time-series tracking. Use this to query rank history over long periods.

Recommendation: Partition your files by date (e.g., data/YYYY-MM-DD/results.parquet). This allows you to compare datasets between dates without loading the entire history.
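
A minimal sketch of that partitioning pattern with Pandas (writing Parquet requires pyarrow or fastparquet; the paths are just examples):

import os
from datetime import date
import pandas as pd

def save_partitioned(df: pd.DataFrame, base_dir: str = "data"):
    # One folder per scrape date, e.g. data/2026-02-02/results.parquet
    partition = os.path.join(base_dir, date.today().isoformat())
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "results.parquet")
    df.to_parquet(path, index=False)
    return path

# Compare two days by loading only those partitions:
# old = pd.read_parquet("data/2026-02-01/results.parquet")
# new = pd.read_parquet("data/2026-02-02/results.parquet")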

How to Scale Bing Scraping Safely

A single script works well for testing. However, running that same script in a loop for thousands of queries will likely get your IP address blocked. Moving to production requires a shift in architecture.

The Production Architecture

Scaling safely means decoupling "job submission" from "scraping execution". Instead of running scrapers synchronously, push jobs into a queue system like Redis or Celery. Worker processes pick up these jobs one by one. This allows you to control the exact number of concurrent browsers, preventing you from overwhelming Bing or your own server.

If a scraper fails, do not retry immediately. Implement an exponential backoff strategy where the wait time increases after each failure. If an IP address triggers a 429 (Too Many Requests) or a CAPTCHA, trip a circuit breaker for that proxy and pause it for a set period.
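
Here is a minimal sketch of both ideas, exponential backoff and a per-proxy circuit breaker. The thresholds, cooldown period, and the scrape callable are assumptions you should tune for your own pipeline.

import time
import random

PROXY_COOLDOWN = {}  # proxy -> unix timestamp until which it is paused

def proxy_available(proxy):
    return time.time() > PROXY_COOLDOWN.get(proxy, 0)

def trip_breaker(proxy, pause_seconds=900):
    # Pause this proxy after a 429 or CAPTCHA signal
    PROXY_COOLDOWN[proxy] = time.time() + pause_seconds

def scrape_with_backoff(scrape, query, proxy, max_retries=4):
    for attempt in range(max_retries):
        try:
            return scrape(query, proxy)
        except Exception as e:
            message = str(e).lower()
            if "429" in message or "captcha" in message:
                trip_breaker(proxy)
                raise
            # Exponential backoff with jitter: ~2s, 4s, 8s, ...
            time.sleep((2 ** (attempt + 1)) + random.uniform(0, 1))
    raise RuntimeError(f"Giving up on '{query}' after {max_retries} attempts")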

Monitoring and Logging

You need visibility into your scraper's health. A sudden drop in response size often indicates that Bing is serving a CAPTCHA page, even if the HTTP status is 200.

Here is a wrapper to add metrics observability to your scraper. Note that we do not capture HTML snapshots here because the browser session has already closed; that logic belongs inside the scraper itself.

import logging
import time
from datetime import datetime

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)

def scrape_with_logging(query, scraper_func):
    start = time.time()
    try:
        # Execute the scraper
        results = scraper_func(query)
        duration = round(time.time() - start, 2)
        
        # Log success with duration to spot throttling (latency spikes)
        logging.info(f"SUCCESS | Query: {query} | Time: {duration}s | Items: {len(results)}")
        return results
        
    except Exception as e:
        duration = round(time.time() - start, 2)
        error_type = type(e).__name__
        
        logging.error(f"FAILURE | Query: {query} | Time: {duration}s | Error: {error_type}")
        raise  # re-raise with the original traceback
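
To act on the response-size signal mentioned above, you can add a cheap sanity check before parsing. The 50 KB threshold is an assumption; calibrate it against the size of your own successful responses.

def looks_blocked(html: str, min_bytes: int = 50_000) -> bool:
    # Normal Bing result pages are typically hundreds of KB;
    # CAPTCHA and block pages are much smaller, even with HTTP 200.
    if len(html.encode("utf-8")) < min_bytes:
        return True
    # Marker element that the organic parser relies on
    return "b_results" not in html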

Choosing the Right Tools

For infrastructure, stick to standard tools. Celery or RQ are excellent for managing job queues. For the actual requests, AsyncIO allows you to handle network operations without blocking your CPU.
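
As a sketch of the AsyncIO side, curl_cffi ships an AsyncSession that pairs well with a small semaphore to cap concurrency (the limit of 5 is an arbitrary example):

import asyncio
from curl_cffi.requests import AsyncSession

async def fetch(session, query, sem):
    async with sem:  # cap concurrent requests
        r = await session.get("https://www.bing.com/search", params={"q": query})
        return query, r.status_code, len(r.text)

async def run_batch(queries, concurrency=5):
    sem = asyncio.Semaphore(concurrency)
    async with AsyncSession(impersonate="chrome") as session:
        return await asyncio.gather(*(fetch(session, q, sem) for q in queries))

# stats = asyncio.run(run_batch(["web scraping", "rank tracking"]))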

Finally, managing IP reputation is the most significant factor in reliability. You should not use your server's IP address for production scraping. Use a residential proxy service that rotates IPs automatically. This distributes your traffic pattern and makes your activity look like distinct users rather than a single bot.

How to Avoid Blocks When Scraping Bing

Bing blocks scrapers based on two factors: IP reputation and Request Velocity. If you send 100 requests from a datacenter IP (like AWS or DigitalOcean), you will likely get blocked quickly. It doesn't matter how good your headless browser is; if the IP is flagged, the request fails.

Free or datacenter proxies are generally not suitable for production Bing scraping. Those IP ranges are publicly known and often blacklisted by default on Bing. To run this in production, you need residential proxies.

Residential Proxies: The Solution

Residential proxies route traffic through ISP networks, which can make requests look more like normal user traffic and improve success rates. Live Proxies provides residential proxies built for higher frequency scraping and time sensitive operations.

IP Pool Contamination

Most proxy providers share IP pools across all customers:

  • Customer A scrapes Bing aggressively → IPs get flagged.
  • Customer B (you) buys proxies from the same provider → gets those burned IPs.
  • You inherit Customer A's bad reputation.

Using Live Proxies for Bing scraping

Live Proxies uses target-specific IP allocation:

When you scrape Bing through Live Proxies, your IPs are isolated from other Bing scrapers. The same IPs might be allocated to customers scraping Amazon or Google, but your Bing activity doesn't affect them, and their activity doesn't burn your Bing reputation.

  • Sign up and purchase a plan
  • Copy proxy credentials from the dashboard
  • Implement:
# Credentials as shown in the dashboard: host:port:username:password
PROXIES = [
    "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-1:bW2VN4Zc5YSyK5nF82tK",
    "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-2:bW2VN4Zc5YSyK5nF82tK",
    "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-3:bW2VN4Zc5YSyK5nF82tK",
    # ...
]

Method 1: Random Proxy Selection

from seleniumbase import SB
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import random

def scrape_with_rotation(query):
    # SeleniumBase expects "user:pass@host:port", so reorder the
    # dashboard-style "host:port:user:pass" string before passing it in
    host, port, user, password = random.choice(PROXIES).split(":")
    proxy = f"{user}:{password}@{host}:{port}"

    with SB(uc=True, proxy=proxy) as sb:
        sb.open(f"https://www.bing.com/search?q={quote_plus(query)}")
        sb.wait_for_element("#b_results", timeout=10)

        soup = BeautifulSoup(sb.get_page_source(), "html.parser")
        results = []
        for r in soup.select("li.b_algo"):
            title = r.select_one("h2 a")
            if title:
                results.append({"title": title.get_text(strip=True), "link": title.get("href")})
        return results

queries = ["sunset photography", "machine learning", "web scraping"]
for query in queries:
    data = scrape_with_rotation(query)
    print(f"{len(data)} results for '{query}'")

Use this for scraping different keywords where each query is independent (rank tracking, keyword research).

Method 2: Sticky Session

STICKY_PROXY = "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-1:bW2VN4Zc5YSyK5nF82tK"
host, port, user, password = STICKY_PROXY.split(":")

with SB(uc=True, proxy=f"{user}:{password}@{host}:{port}") as sb:
    sb.open("https://www.bing.com/search?q=sunset")

    # All requests in this session use the same IP.
    # Ideal for pagination, form fills, and multi-step workflows.
    html = sb.get_page_source()

Use this for pagination or multi-step workflows where Bing expects the same visitor (getting pages 1-10 of results).

Session behavior: Each sticky session maintains the same IP for up to 60 minutes.

View plans (start at $70)

Need enterprise scale? Contact sales

How to Keep Bing Scraping Secure and Compliant

Security isn't just about encryption; it’s about good hygiene. The most effective security measure is simply not logging in. By sticking to public, unauthenticated data, you eliminate the risk of exposing customer credentials or violating strict user agreements.

Beyond that, treat your scraper like any production software. Never hardcode your proxy credentials directly in the script – always use environment variables so you don't accidentally push secrets to a code repository. If you store the data, keep it secure and limit access to only the team members who need it. 
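
A minimal sketch of the environment-variable approach (the variable names here are arbitrary):

import os

# Set these in your shell or CI secrets, never in the script:
#   export LIVEPROXIES_USER="..."
#   export LIVEPROXIES_PASS="..."
#   export LIVEPROXIES_ENDPOINT="host:port"
user = os.environ["LIVEPROXIES_USER"]
password = os.environ["LIVEPROXIES_PASS"]
endpoint = os.environ["LIVEPROXIES_ENDPOINT"]

proxy = f"{user}:{password}@{endpoint}"  # ready for SB(uc=True, proxy=proxy)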

Finally, maintain a simple internal note explaining what you scrape and why. It helps your team stay aligned and serves as a useful record if anyone ever questions your data practices.

Best Use Cases for Bing Scraping

We generally build Bing scrapers to feed three specific data pipelines: SEO monitoring, News aggregation, and Local market analysis. Here is the architectural pattern for each.

SEO and SEM "Share of Voice"

The goal here is to measure exactly how often your domain appears against competitors for high-value keywords. You start with a list of category keywords (e.g., "project management software") and target markets. The process involves running the Organic and Ads scrapers weekly to store rank, advertiser names, and ad copy. 

The key metric to track is Share of Voice (SoV) – the percentage of searches where your domain appears in the Top 3 positions. A significant drop usually indicates an algorithm update or a competitor's bid aggression.
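
Assuming you store each weekly snapshot as rows with keyword, domain, and rank columns (column names are illustrative), SoV reduces to a small aggregation:

import pandas as pd

def share_of_voice(df: pd.DataFrame, domain: str) -> float:
    # Percentage of tracked keywords where `domain` appears in the Top 3
    top3 = df[df["rank"] <= 3]
    hits = top3.loc[top3["domain"] == domain, "keyword"].nunique()
    total = df["keyword"].nunique()
    return round(100 * hits / total, 1) if total else 0.0

# share_of_voice(weekly_snapshot, "example.com") -> e.g. 42.5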

News Latency Monitoring

For PR teams, the focus is on tracking how fast stories are indexed by search engines. The pipeline runs the News scraper hourly using the recency="hour" filter against a list of brand names and product keywords. The critical number here is Time-to-Index, which measures the delta between the Published Time and the Scraped Time. Tier-1 publications typically appear within 15–60 minutes; if this extends to 3+ hours, you may have an indexing issue or Bing's indexing speed may have changed.
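
Time-to-Index is just the delta between those two timestamps; a sketch assuming ISO-formatted published_at and scraped_at fields:

from datetime import datetime

def time_to_index_minutes(published_at: str, scraped_at: str) -> float:
    # Minutes between publication and the first scrape that saw the story
    published = datetime.fromisoformat(published_at)
    scraped = datetime.fromisoformat(scraped_at)
    return round((scraped - published).total_seconds() / 60, 1)

# time_to_index_minutes("2026-02-02T09:00:00+00:00", "2026-02-02T09:42:00+00:00") -> 42.0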

Local Reputation Analysis

This pipeline monitors reputation across multi-location chains, such as franchises or retail stores. You generate a list of local queries (e.g., "coffee shops in Seattle") and run the Maps scraper to extract star ratings and review counts from the top results. The metric to watch is Average Rating Spread. For example, comparing your average rating in a specific city against the top 3 competitors reveals operational problems in specific regions that aggregate data might hide.
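
A sketch of that comparison, assuming a DataFrame with city, rating, and is_ours columns (column names are illustrative):

import pandas as pd

def rating_spread(df: pd.DataFrame) -> pd.DataFrame:
    # Your average rating per city vs. the top 3 competitors in that city
    ours = df[df["is_ours"]].groupby("city")["rating"].mean()
    top3_comp = (
        df[~df["is_ours"]]
        .sort_values("rating", ascending=False)
        .groupby("city")
        .head(3)
        .groupby("city")["rating"]
        .mean()
    )
    out = pd.DataFrame({"ours": ours, "top3_competitors": top3_comp})
    out["spread"] = out["ours"] - out["top3_competitors"]
    return out.round(2)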

Further reading: How to Scrape Google Search Results Using Python in 2025: Code and No-Code Solutions and How to Scrape X.com (Twitter) with Python and Without in 2025.

What is the Bottom Line on Scraping Bing

Start with a single script and a few test queries. You need to verify how Bing structures its data before you attempt to scale.

For production, the bottleneck will always be IP reputation. You cannot scale this on datacenter IPs; they tend to get blocked almost immediately. Live Proxies uses target-specific IP allocation, so your Bing IPs aren't shared with other Bing scrapers, but regardless of the provider, make sure you are on a residential network with rotation.

The scrapers we built handle the complexity of dynamic loading and parsing. You can focus on what matters: extracting insights from the data.

FAQs

How often should I recrawl Bing search results

The crawl frequency depends entirely on the volatility of the data source. Search ranks generally require a weekly cadence to capture algorithm updates, whereas news monitoring demands hourly scrapes to catch breaking stories before they age out. Ad campaigns typically rotate daily, so a 24-hour cycle is appropriate there. Conversely, Maps data is highly stable; unless you are specifically tracking review velocity, monthly scraping is sufficient.

Why are Bing results different by region and device

Bing personalizes results based on IP geolocation, device type, and search history. To see what a real user in Berlin sees, you must route your request through a German residential proxy and set a mobile User-Agent header. Additionally, ensure you are clearing cookies or using incognito contexts between queries to prevent previous search history from biasing the current results.
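
As a sketch with curl_cffi, assuming the German residential proxy endpoint is supplied via an environment variable:

import os
from curl_cffi import requests

MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Mobile Safari/537.36"
)
de_proxy = os.environ["DE_PROXY_URL"]  # e.g. http://user:pass@host:port

resp = requests.get(
    "https://www.bing.com/search",
    params={"q": "kaffeemaschine test"},
    impersonate="chrome",
    headers={"User-Agent": MOBILE_UA, "Accept-Language": "de-DE,de;q=0.9"},
    proxies={"http": de_proxy, "https": de_proxy},
)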

Can I collect only Bing ads and skip organic

Yes, and this is a common optimization for competitive intelligence. You can target the top and bottom containers specifically while ignoring the organic results. This approach reduces parsing overhead significantly and lowers your storage costs since you are discarding the majority of the page content.

What if HTML layouts change during a run

Bing runs A/B tests frequently, so rigid selectors will eventually break. Best practice is to use multiple fallback selectors within the parser logic. We also recommend versioning your parsers and logging which version processed each result. If a selector fails, you can update the parser and reprocess the stored raw HTML without having to re-scrape the live site.
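
A minimal fallback-selector helper might look like this (the example selectors are assumptions about possible layout variants):

def select_first(element, selectors):
    # Return the first match from a priority-ordered list of selectors
    for sel in selectors:
        found = element.select_one(sel)
        if found is not None:
            return found
    return None

# snippet = select_first(result, [".b_caption p", ".b_lineclamp2", "p"])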

Do proxies really help with Bing scraping

Residential proxies are effectively a requirement for production scraping on Bing. Datacenter IPs (like those from AWS or DigitalOcean) are easy to fingerprint and are typically blocked after a few requests. Residential proxies route traffic through real home internet connections, which makes your scraper's requests look like ordinary user traffic. In practice, this is the most reliable way to maintain a high success rate at scale.

How do I keep costs low as I scale

The biggest cost driver is the headless browser. For standard organic results, consider switching to a TLS-mimicking HTTP client (like curl_cffi) for the initial request, which is significantly cheaper than running a full Chrome instance. Store your output in Parquet format rather than JSON to significantly reduce storage volume (typically 70-90% compression, depending on your data structure). Finally, use Spot instances for your worker nodes to cut compute costs.

Can I mix API data and scraped data

You can, but you must normalize the schemas first. Bing’s official API often uses different field names than what you extract from the HTML. The best approach is to define a single internal schema and write transformers for both data sources to map into that standard. Tag each record with source: api or source: scraped so you can trace data quality issues back to the origin.

What data should I never store

Avoid storing PII (Personally Identifiable Information) found in snippets, such as personal emails or phone numbers that aren't clearly business contacts. For raw HTML, implement a retention policy – extract the structured data immediately and delete the raw files after 30 days. From a security standpoint, ensure you never log proxy credentials, API keys, or session tokens in plain text.

How do I test that my data is accurate

Automated validation is key. You should implement checks that verify "sanity" metrics: ensure organic result counts meet a minimum threshold (e.g., at least 8 results per page) and cross-reference a sample of ad URLs to ensure they resolve to live landing pages. If you are scraping news, compare the extracted timestamp against the publisher's date to ensure you aren't picking up stale cache data.

Where should I put alerts and what should they say

Alerts should be routed to a dedicated Slack channel or PagerDuty service, but only for actionable failures. Specific thresholds, such as error rates spiking above 10% or average response times exceeding 5 seconds, are better than alerting on every single failure. The alert payload should include the query that failed, the error type (e.g., "Selector Not Found"), and the proxy IP used, which helps you quickly identify if the issue is a code break or a proxy ban.