Bing is often overlooked, but it's a key search provider used by ChatGPT Search, and it captures a B2B audience, especially on corporate devices, that Google under-serves. If you try to scrape it with a standard script at scale, you will often run into TLS fingerprinting and behavior-based blocks. In this guide, we'll walk through building scrapers for Bing Search, AI Answers, Maps, News, Ads, Images, and Recipes so that you can get the data reliably.
Can We Scrape Bing Search Results in 10 Lines
If you only need the organic search results (the "blue links"), you don't need a heavy browser. Bing's basic results are Server-Side Rendered (SSR), which means the data is present in the initial HTML response.
However, standard Python requests often trigger bot detection quickly, especially at scale, primarily due to TLS fingerprinting and other behavioral signals that Bing monitors. The solution is to use curl_cffi, a library that mimics the TLS handshake of a real Chrome browser.
When to use this:
- You only need the first page of organic results
- You need high speed and low cost (no headless browser overhead)
Here's the code:
import json
from curl_cffi import requests # pip install curl_cffi
from bs4 import BeautifulSoup # pip install beautifulsoup4
def scrape_bing_static(query):
# impersonate="chrome" bypasses TLS fingerprinting
headers = {"Accept-Language": "en-US,en;q=0.9", "Referer": "https://www.bing.com/"}
response = requests.get(
"https://www.bing.com/search",
params={"q": query},
impersonate="chrome",
headers=headers,
)
if "b_results" not in response.text:
return {"error": "No results (Blocked, Captcha, or Consent Wall)"}
soup = BeautifulSoup(response.content, "html.parser")
results = []
for r in soup.select("li.b_algo"):
title = r.select_one("h2")
link = r.select_one("a")
if title and link:
snippet_el = r.select_one(".b_caption p")
results.append(
{
"title": title.get_text(strip=True),
"link": link["href"],
"snippet": snippet_el.get_text(strip=True) if snippet_el else "",
}
)
return results
if __name__ == "__main__":
print(json.dumps(scrape_bing_static("agentic ai tutorial"), indent=2))
When You Need Browser Automation
The static method above has a hard limit: dynamic content.
Bing's more advanced features, like AI Answers, Maps pins, News carousels, and Recipes, load much of their content through JavaScript after the initial page renders. The exact DOM you get also depends on your query type, location, and which A/B test variant Bing serves you. A static curl_cffi request won't capture this dynamic content.
For pagination beyond the first page, session continuity matters. When Bing sees requests for first=11 or first=21 with cookies and a consistent fingerprint from earlier pages, you're far less likely to hit blocks or redirects. Browser automation handles this naturally, making it the safer choice for multi-page scraping at scale.
For these cases, we use SeleniumBase. It's a wrapper around Selenium that automatically patches the WebDriver to remove "I am a robot" flags (like navigator.webdriver).
Use this for:
- AI Chat/Copilot answers
- Maps & Local data
- Pagination (Pages 2+)
- Images & News
Before diving into the scrapers, let's look at which data fields drive business value for each Bing surface.
What Bing Scraping Data Fields Matter for Business
Scraping every pixel on the page is inefficient and expensive. You need a lean schema that maps directly to business value, not just raw HTML.
Here are the specific data fields you should target for each Bing surface:
| Surface | Target Fields | Business Use Case |
|---|---|---|
| Organic Search | Rank, Title, Snippet, URL | SEO rank tracking and competitor content gaps |
| Ads (PPC) | Position, Advertiser, Headline, Target URL | Monitoring ad spend and "Share of Voice" |
| News | Headline, Source, Time Published, URL | Brand reputation and PR alerts |
| Maps (Local) | Name, Website, Phone, Address, Coordinates | Lead gen and NAP (Name, Address, Phone) consistency |
| People Also Ask | Question, Answer, Source Link | Identifying user intent and content opportunities |
| AI Answers | Summary, Citations, Code Snippets | "Generative Engine Optimization" (GEO) tracking |
| Images | Title, Image URL, Source Page | Building ML training datasets |
| Recipes | Name, Rating, Prep Time, Source URL | Trend analysis and competitive research |
Now that we've defined what to extract, let's look at how to do it legally and securely.
Is Scraping Bing Search Legal and Safe in 2026
Scraping public data is a standard industry practice, but it is important to do it responsibly. The general rule of thumb is to scrape only publicly visible data where no login is required. When you log into an account, you agree to stricter terms that usually prohibit automation, so the safest approach is to always scrape as a guest. Additionally, be mindful of privacy and avoid collecting sensitive personal information.
Further reading: How to Scrape YouTube: A Complete Guide to Videos, Comments, and Transcripts (2026) and How to Scrape Yelp Places and Reviews in 2025 (With and Without Python).
How to Scrape Bing Search Results with Python Step by Step
Organic results are the baseline for SEO tracking and competitor analysis. But if you just grab the HTML, you will find that the URLs are hidden behind base64-encoded tracking redirects, the ranks reset on every page, and parts of the content are lazy-loaded.
To get quality data, we need a stack that handles JavaScript execution and bot detection simultaneously.

Setup Quick Start
Bing typically blocks standard automation. We need SeleniumBase with UC Mode (uc=True) enabled to automatically patch common browser bot flags (such as navigator.webdriver), which helps us appear more like a real user.
The tech stack:
- seleniumbase. Browser automation with anti-detect features.
- beautifulsoup4. To parse the HTML after it renders.
- requests / curl_cffi. (Optional) For quick static checks without launching a browser.
# Create and activate a virtual environment
python -m venv bing-env
source bing-env/bin/activate # Windows: bing-env\Scripts\activate.bat
# Install dependencies
pip install seleniumbase beautifulsoup4 requests
seleniumbase install chromedriver
Parse Organic Results and Rank
We have to solve three specific engineering problems to get clean data:
- Tracking redirects. Bing wraps URLs in bing.com/ck/a? redirects. Clicking through them is slow and triggers bot detection. We will decode the base64 u parameter directly to extract the clean URL without interacting with the redirect.
- Rank calculation. Bing resets ranking on every page (top of Page 2 is "Rank 1"). We calculate absolute rank using ((page_num - 1) * 10) + index.
- Lazy loading. Bing loads results as you scroll. We must force a scroll to the bottom of the page to trigger lazy-loaded elements before parsing.
The Code
Save this as bing_organic_scraper.py:
import json
import logging
import argparse
import base64
from urllib.parse import quote_plus, urlparse, parse_qs
from bs4 import BeautifulSoup
from seleniumbase import SB
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(message)s", datefmt="%H:%M:%S"
)
class BingOrganicScraper:
def _clean_url(self, url):
"""
Decodes Bing's tracking URLs (bing.com/ck/a?u=...).
Target URL is base64-encoded in the 'u' parameter.
"""
if not url or "bing.com/ck/a?" not in url:
return url
try:
parsed = urlparse(url)
query_params = parse_qs(parsed.query)
u_param = query_params.get("u", [None])[0]
if u_param:
if u_param.startswith("a1"):
u_param = u_param[2:]
# Add padding to make base64 valid (length must be multiple of 4)
padded = u_param + "=" * (-len(u_param) % 4)
return base64.urlsafe_b64decode(padded).decode("utf-8")
except Exception:
pass
return url
def scrape(self, query: str, max_pages: int = 3):
logging.info(f"Starting scrape: '{query}' ({max_pages} pages)")
all_results = []
unique_links = set()
with SB(uc=True, headless=True, window_size="1920,1080") as sb:
for page_num in range(1, max_pages + 1):
encoded_query = quote_plus(query)
url = f"https://www.bing.com/search?q={encoded_query}"
# Bing pagination: first=1 (page 1), first=11 (page 2), first=21 (page 3)
if page_num > 1:
offset = (page_num - 1) * 10 + 1
url += f"&first={offset}"
logging.info(f"Processing page {page_num}...")
try:
sb.open(url)
sb.wait_for_element("#b_results", timeout=15)
# Scroll to trigger lazy-load
sb.execute_script("window.scrollTo(0, 800);")
sb.sleep(0.5)
sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sb.sleep(1.5)
page_source = sb.get_page_source()
page_data = self._parse_page(page_source, page_num, unique_links)
if not page_data:
logging.warning(
f"No results found on page {page_num}. (Captcha or End of Results?)"
)
all_results.extend(page_data)
# Rate limiting between pages
sb.sleep(2)
except Exception as e:
logging.error(f"Error on page {page_num}: {e}")
continue
return all_results
def _parse_page(self, html, page_num, unique_links):
soup = BeautifulSoup(html, "html.parser")
results = []
# Note: If this selector breaks, Bing likely updated their CSS classes
organic_items = soup.select("li.b_algo")
for index, item in enumerate(organic_items, start=1):
t = item.select_one("h2 a")
if not t:
continue
link = self._clean_url(t.get("href"))
# Skip duplicates across pages
if link in unique_links:
continue
unique_links.add(link)
# Multiple selectors for A/B testing variations
s = item.select_one(".b_lineclamp2, .b_caption p, .b_algoSlug")
rank = ((page_num - 1) * 10) + index
results.append(
{
"rank": rank,
"title": t.get_text(" ", strip=True),
"link": link,
"snippet": s.get_text(" ", strip=True) if s else "",
"page": page_num,
}
)
logging.info(f"Found {len(results)} new results on page {page_num}")
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("-q", "--query", type=str, default="best python frameworks")
parser.add_argument("-p", "--pages", type=int, default=3)
parser.add_argument("-o", "--output", type=str)
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
scraper = BingOrganicScraper()
data = scraper.scrape(args.query, args.pages)
if args.output:
fname = f"{args.output}.json"
else:
fname = f"organic_{args.query.replace(' ', '_')}.json"
with open(fname, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
logging.info(f"Saved {len(data)} results to {fname}")
Running the Scraper
Run it from your terminal. You can control the query and depth via arguments:
# Single page test
python bing_organic_scraper.py -q "agentic ai tutorial" -p 1
# Full run (3 pages)
python bing_organic_scraper.py -q "best python frameworks" -p 3 -o "frameworks"
Understanding the Output
When you run this, you will get a clean JSON object. Notice that the Rank is continuous (e.g., the top result on Page 2 is accurately labeled as Rank 11, not Rank 1).
[
{
"rank": 1,
"title": "Top 10 Python Frameworks [2025 Updated]",
"link": "https://www.geeksforgeeks.org/blogs/best-python-frameworks/",
"snippet": "23 Jul 2025 · This article presents the 10 best Python frameworks...",
"page": 1
},
{
"rank": 11,
"title": "The 9 Best Python Frameworks in 2025",
"link": "https://blog.botcity.dev/2024/11/28/python-frameworks/",
"snippet": "28 Nov 2024 · Python Framework vs. Python Library...",
"page": 2
}
]
How to Scrape Bing AI Answers
AI answers (Copilot summaries, developer cards, generative overviews) are high-value targets because they synthesize data from multiple sources.

But here is the catch: They typically do not exist in the initial HTML. Bing injects them via JavaScript after the page loads, and the DOM structure changes completely depending on the user's intent (e.g., coding vs. general knowledge).
To scrape this reliably, we need a "Wait and Prioritize" strategy.
The Strategy
We are dealing with three specific challenges here:
- Async Injection. If you parse the HTML immediately after loading, you will get nothing. We must explicitly wait for specific containers to appear in the DOM.
- Three Different Containers. Bing doesn't use one standard box.
- .developer_answercard_wrapper: Best for code (has syntax highlighting).
- .b_genserp_container: Best for general knowledge (has citations).
- #copans_container: The fallback Chat/Copilot interface (hardest to parse).
- Priority Logic. Sometimes multiple containers are loaded. Our script will prioritize the Developer Card (most structured), then the Generative SERP, and finally Copilot as a fallback.
The CSS Selectors
Here are the specific targets we are looking for:
| Selector | What it holds |
|---|---|
| .developer_answercard_wrapper | The "Tech Card" container (Code snippets) |
| .devmag_code | The actual code block inside the Tech Card |
| .b_genserp_container | The "Generative AI" container (Text summaries) |
| .gs_heroTextHeader | The main summary headline |
| .gs_cit a | Citation links |
| #copans_container | The Chat/Copilot fallback container |
The Code
Save this as bing_ai_scraper.py.
Note: We force the window size to 1920x1080 because Bing often hides the AI sidebar on smaller viewports.
import json
import logging
import argparse
import base64
from urllib.parse import quote_plus, urlparse, parse_qs
from bs4 import BeautifulSoup
from seleniumbase import SB
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
datefmt="%H:%M:%S",
)
class BingAIScraper:
def _clean_url(self, url):
"""
Decodes Bing's tracking URLs (bing.com/ck/a?u=...).
Target URL is base64-encoded in the 'u' parameter.
"""
if not url or "bing.com/ck/a?" not in url:
return url
try:
parsed = urlparse(url)
query_params = parse_qs(parsed.query)
u_param = query_params.get("u", [None])[0]
if u_param:
if u_param.startswith("a1"):
u_param = u_param[2:]
# Add padding to make base64 valid (length must be multiple of 4)
padded = u_param + "=" * (-len(u_param) % 4)
return base64.urlsafe_b64decode(padded).decode("utf-8")
except Exception:
pass
return url
def scrape(self, query: str):
logging.info("Starting AI scrape for: '%s'", query)
# Force US region where AI features are most stable
url = f"https://www.bing.com/search?q={quote_plus(query)}&cc=US&setlang=en"
# Desktop viewport required for AI sidebar to render
with SB(uc=True, headless=True, window_size="1920,1080") as sb:
sb.open(url)
sb.wait_for_element("#b_results", timeout=10)
# AI answers are injected dynamically via JS
logging.info("Waiting for AI containers...")
try:
sb.wait_for_element(
"#copans_container, .b_genserp_container, .developer_answercard_wrapper",
timeout=5,
)
except Exception:
logging.info(
"No AI container appeared (timeout). Proceeding to parse..."
)
html = sb.get_page_source()
result = self._parse_ai_data(html)
result["query"] = query
return result
def _parse_ai_data(self, html):
soup = BeautifulSoup(html, "html.parser")
data = {
"ai_answer_found": False,
"type": None,
"summary": None,
"content": [],
"code_snippets": [],
"sources": [],
}
# Three AI result types in priority order:
# developer_card (tech/code), generative_serp (standard AI), copilot (chat fallback)
rich_card = soup.select_one(".developer_answercard_wrapper")
gen_serp = soup.select_one(".b_genserp_container")
copilot = soup.select_one("#copans_container")
if rich_card:
data["ai_answer_found"] = True
data["type"] = "developer_card"
title = rich_card.select_one("h2")
data["summary"] = title.get_text(" ", strip=True) if title else "AI Answer"
data["content"] = [
s.get_text(" ", strip=True)
for s in rich_card.select(".devmag_cntnt_snip, .rd_sub_header")
]
data["code_snippets"] = [
c.get_text("\n", strip=True) for c in rich_card.select(".devmag_code")
]
data["sources"] = [
self._clean_url(a.get("href"))
for a in rich_card.select(".rd_attr_items a")
]
elif gen_serp:
data["ai_answer_found"] = True
data["type"] = "generative_serp"
summary_header = gen_serp.select_one(".gs_heroTextHeader")
data["summary"] = (
summary_header.get_text(" ", strip=True) if summary_header else ""
)
content_area = gen_serp.select_one(".gs_text.gs_mdr")
if content_area:
for block in content_area.select("h3, .gs_p"):
t = block.get_text(" ", strip=True)
if len(t) > 1:
data["content"].append(t)
data["sources"] = [
self._clean_url(a.get("href"))
for a in gen_serp.select(".gs_cit a")
if a.get("href")
]
elif copilot:
data["ai_answer_found"] = True
data["type"] = "copilot_fallback"
data["content"] = [copilot.get_text("\n", strip=True)]
return data
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("-q", "--query", type=str, default="what are llm tokens")
parser.add_argument("-o", "--output", type=str)
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
scraper = BingAIScraper()
data = scraper.scrape(args.query)
if args.output:
fname = f"{args.output}.json"
else:
fname = f"ai_{args.query.replace(' ', '_')}.json"
with open(fname, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
logging.info(f"Saved to {fname} (AI answer found: {data['ai_answer_found']})")
Running the Scraper
You can target different AI types by changing your query intent:
# Knowledge Query (Triggers Generative SERP)
python bing_ai_scraper.py -q "what are llm tokens" -o ai_tokens
# Coding Query (Triggers Developer Card)
python bing_ai_scraper.py -q "python list comprehension syntax" -o ai_python
Understanding the Output
The type field tells you exactly which container we matched.
{
"ai_answer_found": true,
"type": "generative_serp",
"summary": "Tokens in Large Language Models (LLMs) are the basic units of text...",
"content": [
"Definition of Tokens",
"In the context of LLMs, a token is a segment of text...",
"Whole Words: For example, \"apple\" or \"run.\""
],
"sources": [
"https://itsfoss.com/llm-token/",
"https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens"
]
}
How to Scrape "People Also Ask" (PAA) from Bing
"People Also Ask" (PAA) boxes are critical for mapping search intent and finding content gaps.

The challenge here is that PAA is an interactive element. The answers are hidden inside collapsed accordions, and new questions are lazily loaded as you scroll. If you just requests.get() the page, you will get nothing.
The Strategy
We need a three-step interaction pipeline:
- Trigger Lazy Loading. Bing loads the first 3-4 questions initially. To get the rest, we must scroll. We will use a randomized scroll function to mimic human reading behavior, which prompts Bing to inject more questions into the DOM.
- Force Expansion (JS Injection). Clicking accordions one by one using Selenium's .click() is slow and flaky. Instead, we inject a snippet of JavaScript that finds the accordion headers and fires their click events in a single pass, expanding them without the overhead of individual clicks.
- Stateful Parsing. Once everything is expanded and visible, we parse the HTML.
The CSS Selectors
Here is the structure of a PAA card:
| Selector | Description |
|---|---|
| .acf-accn-itm | The container for a single Q&A pair |
| .acf-accn-itm__hdr-label | The Question text |
| .paa-txt | The Answer text |
| a.paa-content | The Source Link (citation) |
The Code
Save this as bing_paa_scraper.py.
import json
import logging
import argparse
import base64
import time
import random
from urllib.parse import quote_plus, urlparse, parse_qs
from bs4 import BeautifulSoup
from seleniumbase import SB
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
datefmt="%H:%M:%S",
)
class BingPAAScraper:
def _clean_url(self, url):
"""
Decodes Bing's tracking URLs (bing.com/ck/a?u=...).
Target URL is base64-encoded in the 'u' parameter.
"""
if not url or "bing.com/ck/a?" not in url:
return url
try:
parsed = urlparse(url)
query_params = parse_qs(parsed.query)
u_param = query_params.get("u", [None])[0]
if u_param:
if u_param.startswith("a1"):
u_param = u_param[2:]
# Add padding to make base64 valid (length must be multiple of 4)
padded = u_param + "=" * (-len(u_param) % 4)
return base64.urlsafe_b64decode(padded).decode("utf-8")
except Exception:
pass
return url
def scrape(self, query: str):
logging.info(f"Starting PAA scrape for: '{query}'")
url = f"https://www.bing.com/search?q={quote_plus(query)}"
# Use stealth mode to bypass bot detection
with SB(uc=True, headless=True) as sb:
try:
sb.open(url)
sb.wait_for_element("#b_results", timeout=15)
self._slow_scroll_to_bottom(sb)
self._expand_paa(sb)
html = sb.get_page_source()
return self._parse_paa(html)
except Exception as e:
logging.error(f"Error during scrape: {e}")
return []
def _slow_scroll_to_bottom(self, sb):
"""
Mimics human scrolling behavior to trigger lazy-load and avoid detection.
"""
logging.info("Scrolling page to trigger lazy-load...")
last_height = sb.execute_script("return document.body.scrollHeight")
while True:
# Randomize scroll behavior to appear human
step = random.randint(400, 600)
sb.execute_script(f"window.scrollBy(0, {step});")
sb.sleep(random.uniform(0.5, 1.0))
new_height = sb.execute_script("return document.body.scrollHeight")
scrolled_amount = sb.execute_script(
"return window.scrollY + window.innerHeight"
)
if scrolled_amount >= new_height:
sb.sleep(1.5)
new_height = sb.execute_script("return document.body.scrollHeight")
if scrolled_amount >= new_height:
break
last_height = new_height
def _expand_paa(self, sb):
"""
Uses direct JS execution to click PAA accordions.
More reliable than Selenium's .click() when elements overlap.
"""
try:
# Scroll to top so elements are renderable
sb.execute_script("window.scrollTo(0, 0);")
sb.sleep(0.5)
if sb.is_element_visible(".acf-accn-itm"):
sb.scroll_to(".acf-accn-itm")
sb.sleep(0.5)
logging.info("Injecting JS to expand accordions...")
sb.execute_script(
"""
let buttons = document.querySelectorAll('.acf-accn-itm__hdr');
for(let i = 0; i < Math.min(buttons.length, 4); i++) {
try { buttons[i].click(); } catch(e) {}
}
"""
)
sb.sleep(2.5)
except Exception as e:
logging.warning(f"Could not expand PAA: {e}")
def _parse_paa(self, html):
soup = BeautifulSoup(html, "html.parser")
paa_results = []
seen_questions = set()
for item in soup.select(".acf-accn-itm"):
q_elem = item.select_one(".acf-accn-itm__hdr-label")
a_elem = item.select_one(".paa-txt")
link_elem = item.select_one("a.paa-content")
if q_elem:
question = q_elem.get_text(" ", strip=True)
if question in seen_questions:
continue
answer_text = "Could not extract answer"
if a_elem:
answer_text = a_elem.get_text("\n", strip=True)
paa_results.append(
{
"question": question,
"answer": answer_text,
"source_link": (
self._clean_url(link_elem.get("href"))
if link_elem
else None
),
}
)
seen_questions.add(question)
return paa_results
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("-q", "--query", type=str, default="vibe coding tools")
parser.add_argument("-o", "--output", type=str)
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
scraper = BingPAAScraper()
data = scraper.scrape(args.query)
if args.output:
fname = f"{args.output}.json"
else:
fname = f"paa_{args.query.replace(' ', '_')}.json"
with open(fname, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
logging.info(f"Saved {len(data)} PAA items to {fname}")
Running the Scraper
# Standard run
python bing_paa_scraper.py -q "vibe coding tools"
Understanding the Output
The output gives you the direct Q&A pairs. Note that the answer text is only captured if the JavaScript expansion was successful.
[
{
"question": "What is a vibe coding tool?",
"answer": "Vibe coding tools are modern platforms that combine AI assistance...",
"source_link": "https://codeconductor.ai/blog/vibe-coding-tools/"
}
]
How to Scrape Bing Ad Results Without Mistakes
Ad results are the only way to see exactly what competitors are paying for. They reveal copy strategies, landing pages, and "Share of Voice".

The challenge? Bing ads are fragmented. They appear at the top and bottom of the page, and the DOM structure is full of legacy code and actual typos (e.g., icondomian). If you trust the standard class names, you may miss a significant portion of the data.
The Strategy
To build a robust ad scraper, we need to handle three specific issues:
- Dual Positions. Ads live in .b_ad.sb_top (Header) and .b_ad.sb_bottom (Footer). We must scrape both and tag them to calculate the true rank.
- Legacy Typo. Bing's CSS is messy. Display URLs sometimes use .b_adurl and sometimes .b_addurl (double 'd'). The advertiser domain class is often .b_icondomian (misspelled). Our scraper checks for both correct and broken spellings.
- Tracking Wrappers. Every ad link is wrapped in a bing.com/aclk redirect. We will decode the Base64 u parameter to get the actual landing page without clicking.
The CSS Selectors
Here is the "messy reality" of Bing's Ad DOM:
| Selector | Description |
|---|---|
| .b_ad.sb_top | Top Ad Container |
| .b_ad.sb_bottom | Bottom Ad Container |
| .b_adTopIcon_domain, .b_icondomian | Advertiser Name |
| .b_adurl, .b_addurl | Display URL |
| .b_vlist2col li a | Sitelinks (Sub-links under the main ad) |
The Code
Save this as bing_ads_scraper.py.
import json
import logging
import base64
import argparse
from urllib.parse import quote_plus, urlparse, parse_qs, unquote
from bs4 import BeautifulSoup
from seleniumbase import SB
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
datefmt="%H:%M:%S",
)
class BingAdsExtractor:
def _clean_url(self, url):
"""
Decodes Bing's tracking URLs to get actual destination.
Bing wraps links in bing.com/aclk with target in base64-encoded 'u' parameter.
"""
if not url or ("bing.com/aclk" not in url and "bing.com/ck/a" not in url):
return url
try:
parsed = urlparse(url)
query_params = parse_qs(parsed.query)
u_param = query_params.get("u", [None])[0]
if u_param:
if u_param.startswith("a1"):
u_param = u_param[2:]
# Add padding to make base64 string valid (length must be multiple of 4)
u_param += "=" * (-len(u_param) % 4)
decoded = base64.urlsafe_b64decode(u_param).decode("utf-8")
return unquote(decoded)
except Exception:
pass
return url
def _parse_ads(self, html):
soup = BeautifulSoup(html, "html.parser")
ads = []
seen = set()
for container in soup.select("li.b_ad ul"):
parent_classes = container.parent.get("class", [])
position = "top" if "b_adTop" in parent_classes else "bottom"
for rank, item in enumerate(
container.find_all("li", recursive=False), start=1
):
try:
title_el = item.select_one("h2 a")
if not title_el:
continue
title = title_el.get_text(" ", strip=True)
url = self._clean_url(title_el.get("href"))
# Skip duplicates (same ad can appear multiple times in DOM)
fingerprint = (title, url)
if fingerprint in seen:
continue
seen.add(fingerprint)
# Note: .b_icondomian is a Bing typo that still exists in production
domain_el = item.select_one(".b_adTopIcon_domain, .b_icondomian")
if domain_el:
domain = domain_el.get_text(strip=True)
else:
cite = item.select_one("cite")
domain = (
cite.get_text(strip=True).split(" ")[0]
if cite
else "Unknown"
)
display_el = item.select_one(".b_adurl cite")
display_url = (
display_el.get_text(strip=True) if display_el else domain
)
desc_el = item.select_one(".b_ad_description")
if desc_el:
# Remove "Ad" labels before extracting description text
for label in desc_el.select(".b_adSlug"):
label.decompose()
description = desc_el.get_text(" ", strip=True)
else:
description = ""
callouts = []
for callout in item.select(".b_secondaryText, .b_topAd"):
text = (
callout.get_text(strip=True)
.replace("\u00a0", " ")
.replace("\u00b7", "|")
)
callouts.append(text)
sitelinks = []
for link in item.select(
".b_vlist2col li a, .hsl_carousel .slide a"
):
link_title = link.get_text(strip=True)
link_url = self._clean_url(link.get("href"))
if link_title and link_url:
sitelinks.append({"title": link_title, "url": link_url})
ads.append(
{
"title": title,
"advertiser": domain,
"position": position,
"rank_in_block": rank,
"display_url": display_url,
"real_url": url,
"description": description,
"callouts": callouts,
"sitelinks": sitelinks,
}
)
except Exception as e:
logging.warning(f"Error parsing ad: {e}")
return ads
def scrape(self, query):
logging.info(f"Scraping ads for: {query}")
url = f"https://www.bing.com/search?q={quote_plus(query)}"
# Use stealth mode to bypass bot detection
with SB(uc=True, headless=True) as sb:
sb.open(url)
sb.wait_for_element("#b_results", timeout=15)
# Trigger footer ads to load
sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sb.sleep(3)
html = sb.get_page_source()
return self._parse_ads(html)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-q", "--query", type=str, default="marketing automation software"
)
parser.add_argument("-o", "--output", type=str, default="ads_output")
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
scraper = BingAdsExtractor()
ads = scraper.scrape(args.query)
filename = f"{args.output}.json"
with open(filename, "w", encoding="utf-8") as f:
json.dump(ads, f, indent=2, ensure_ascii=False)
logging.info(f"Done! {len(ads)} ads saved to {filename}")
Running the Scraper
Ads only appear for commercial queries. Try this:
# High competition query
python bing_ads_scraper.py -q "marketing automation tools"
Understanding the Output
The output distinguishes between Top (Premium) and Bottom (Footer) placement, which is critical for estimating ad spend.
[
{
"title": "Marketing Automation Software - Best of 2025",
"advertiser": "Wix.com",
"position": "top",
"rank_in_block": 1,
"display_url": "wix.com",
"real_url": "https://www.wix.com/blog/best-marketing-automation-software...",
"callouts": [
"220,000,000+ users | SEO learning hub"
]
}
]
How to Scrape Bing News Results Quickly

News results are time-sensitive. If you are building a monitoring tool, you don't want "relevant" news from 2021; you want what happened in the last hour.
The challenge with Bing News isn't the HTML; it's the URL. Bing uses a cryptic qft parameter with internal interval codes that are not documented.
The Strategy
To get this working, we need to handle three specific mechanics:
- The URL Hack. We reverse-engineered the qft parameter. You need to pass specific interval="X" strings to filter by time.
- interval="4": Past Hour (Critical for breaking news)
- interval="7": Past 24 Hours
- Crucial Detail: You must append &form=PTFTNR to the URL, or Bing ignores your filters completely.
- Dual Extraction. Bing frequently A/B tests its CSS class names. However, the data-* attributes on the .news-card element (like data-title, data-url) are much more stable. Our scraper prioritizes these attributes and only falls back to CSS selectors if they are missing.
- Batch Loading. Bing typically loads approximately 8 articles initially (this may vary). To get more, we scroll to the bottom to trigger the lazy-loader.
Note: The interval parameter codes shown in this guide are reverse-engineered from Bing's network traffic and are not officially documented. These codes may change without notice.
The CSS Selectors
| Selector | Description |
|---|---|
| .news-card | The main container for each article |
| data-title | Primary: The headline (attribute) |
| data-url | Primary: The direct link (attribute) |
| data-author | Primary: The publisher/source (attribute) |
| a.title | Fallback: Headline selector if attributes fail |
| span[aria-label] | The relative time string (e.g., "53m", "2h") |
The Code
Save this as bing_news_scraper.py.
import json
import logging
import argparse
import datetime
import time
from urllib.parse import quote_plus
from typing import List, Dict, Any
from seleniumbase import SB
from bs4 import BeautifulSoup
logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
class BingNewsScraper:
def scrape(
self, query: str, count: int, sort_by: str = "relevance", recency: str = "any"
) -> Dict[str, Any]:
url = self._build_url(query, sort_by, recency)
logging.info(f"Target URL: {url}")
# Use stealth mode to avoid bot detection and CAPTCHA challenges
with SB(uc=True, headless=True) as sb:
sb.open(url)
sb.wait_for_element(".news-card", timeout=10)
self._scroll_to_load(sb, count)
html = sb.get_page_source()
articles = self._parse_articles(html)
articles = articles[:count]
results = {
"search_parameters": {
"query": query,
"sort": sort_by,
"recency": recency,
"url_used": url,
},
"total_count": len(articles),
"scraped_at": datetime.datetime.now().isoformat(),
"articles": articles,
}
logging.info(f"Scraping finished. Extracted {len(articles)} articles.")
return results
def _build_url(self, query: str, sort_by: str, recency: str) -> str:
base_url = "https://www.bing.com/news/search"
# Bing's internal time interval codes (reverse-engineered from network traffic)
time_filters = {
"hour": 'interval="4"',
"day": 'interval="7"',
"week": 'interval="8"',
"month": 'interval="9"',
"year": 'interval="10"',
}
filters = []
if recency in time_filters:
filters.append(time_filters[recency])
# sortbydate="1" forces chronological order (default is 0 for relevance)
if sort_by == "date":
filters.append('sortbydate="1"')
url = f"{base_url}?q={quote_plus(query)}"
if filters:
# Multiple filters in 'qft' are separated by '+'
url += f"&qft={quote_plus('+'.join(filters))}"
# form=PTFTNR is required for filters to apply correctly
url += "&form=PTFTNR"
return url
def _scroll_to_load(self, sb, count: int):
loaded = 0
retries = 0
max_retries = 3
while loaded < count:
cards = sb.find_elements(".news-card")
current = len(cards)
logging.info(f"Loaded {current} of {count} articles...")
# Track consecutive failed attempts to prevent infinite loops
if current == loaded:
retries += 1
if retries >= max_retries:
logging.warning("No new articles loaded after scrolling.")
break
else:
retries = 0
loaded = current
sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sb.sleep(2.0)
def _parse_articles(self, html: str) -> List[Dict[str, str]]:
"""
Two-tier extraction strategy:
1. data-* attributes (reliable)
2. CSS selectors (fallback for A/B tests)
"""
soup = BeautifulSoup(html, "html.parser")
articles = []
for card in soup.select(".news-card"):
try:
title = card.get("data-title")
url = card.get("data-url")
# Fallback to visible elements if data attributes missing
if not title or not url:
link = card.select_one("a.title")
if link:
if not title:
title = link.get_text(strip=True)
if not url:
url = link.get("href")
source = card.get("data-author") or "Unknown"
published_el = card.select_one("span[aria-label]")
published = published_el.get_text(strip=True) if published_el else None
snippet_el = card.select_one(".snippet")
snippet = snippet_el.get_text(strip=True) if snippet_el else ""
img = card.select_one("img.rms_img")
# data-src-hq often contains higher resolution images
image = img.get("data-src-hq") or img.get("src") if img else None
if title and url:
articles.append(
{
"title": title,
"url": url,
"source": source,
"published": published,
"snippet": snippet,
"image": image,
}
)
except Exception:
pass
return articles
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Bing News Scraper with Filters")
parser.add_argument("-q", "--query", type=str, required=True, help="Search query")
parser.add_argument(
"-c", "--count", type=int, default=20, help="Number of articles to scrape"
)
parser.add_argument("-o", "--output", type=str, help="Custom output filename")
parser.add_argument(
"--sort",
type=str,
choices=["relevance", "date"],
default="relevance",
help="Sort by: 'relevance' or 'date'",
)
parser.add_argument(
"--when",
type=str,
choices=["any", "hour", "day", "week", "month", "year"],
default="any",
help="Time range: hour, day, week, month, year",
)
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
scraper = BingNewsScraper()
data = scraper.scrape(args.query, args.count, args.sort, args.when)
if args.output:
filename = f"{args.output}.json"
else:
safe_q = "".join([c if c.isalnum() else "_" for c in args.query]).lower()
timestamp = int(time.time())
filename = f"news_{safe_q}_{args.when}_{args.sort}_{timestamp}.json"
with open(filename, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
logging.info(f"Results saved to: {filename}")
Running the Scraper
You can now run the time-based queries:
# Get the last 10 breaking news items (Past Hour, Sorted by Date)
python bing_news_scraper.py -q "breaking news" -c 10 --when hour --sort date
# Get 20 AI articles from the last week
python bing_news_scraper.py -q "artificial intelligence" -c 20 --when week
Understanding the Output
The published field returns the raw relative time string (e.g., "20m" or "2h"). If you are storing this in a database, you should normalize these to UTC timestamps immediately after scraping.
[
{
"title": "Corona Remedies IPO: GMP signals strong debut",
"url": "https://www.msn.com/en-in/news/other/corona-remedies-ipo...",
"source": "Mint on MSN",
"published": "20m",
"image": "//th.bing.com/th?id=OVFT.afNtSH4aw5tC7OrXHoXk3S..."
}
]
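If you do need absolute timestamps, a small helper can convert those relative strings before storage. This is a minimal sketch that assumes the simple "20m" / "2h" / "3d" patterns shown above; Bing also emits other formats (such as full dates for older articles), which this deliberately returns as None.
import re
from datetime import datetime, timedelta, timezone

# Assumption: Bing's relative strings look like "20m", "2h", or "3d".
# Anything else (e.g., "Yesterday", full dates) is returned as None.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def relative_to_utc(published, now=None):
    """Convert a relative string like '20m' into an absolute UTC ISO timestamp."""
    if not published:
        return None
    now = now or datetime.now(timezone.utc)
    match = re.fullmatch(r"(\d+)\s*([mhd])", published.strip().lower())
    if not match:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return (now - timedelta(**{_UNITS[unit]: value})).isoformat()

# Example: relative_to_utc("20m") returns an ISO string roughly 20 minutes in the past (UTC).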
How to Scrape Bing Maps Places Safely
Maps data is the gold standard for lead generation because it's structured: Name, Phone, Address, Website, and Coordinates.

But scraping Bing Maps is frustrating if you don't know the UI architecture. The results live in a fixed sidebar that is separate from the main page scroll. If you tell Selenium to window.scrollTo(0, 1000), absolutely nothing will happen.
The Strategy
We need to solve three specific challenges:
- Sidebar Scrolling. We must locate the specific container (.b_lstcards) and scroll that element using JavaScript.
- The Hard Limit. Unlike Google Maps, which can paginate through hundreds of results, Bing Maps often returns a more limited set. The exact count varies by query, location, and business density – results commonly stop loading after several dozen entries, though this behavior may change.
- The "Hidden" JSON. Instead of scraping messy HTML text (where phone numbers and addresses are formatted inconsistently), we will extract the data-entity attribute from each card. This contains a clean JSON object with the raw data.
The CSS Selectors
| Selector | Description |
|---|---|
| .b_lstcards | The Sidebar. This is what we must scroll |
| .b_maglistcard | The individual business card container |
| data-entity | The data. A JSON attribute on the card containing clean data |
| .l_rev_pirs | Star Rating (e.g., "4/5") |
The Code
Save this as bing_maps_scraper.py.
import time
import json
import csv
import logging
import argparse
from urllib.parse import quote_plus
from seleniumbase import SB
from bs4 import BeautifulSoup
logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
class BingMapsScraper:
def scrape(self, query: str, count: int):
logging.info("Scraping maps for: '%s' (target: %d)", query, count)
url = f"https://www.bing.com/maps?q={quote_plus(query)}"
# Desktop viewport required for sidebar rendering
with SB(uc=True, headless=True, window_size="1920,1080") as sb:
sb.open(url)
sb.wait_for_element(".b_maglistcard", timeout=10)
self._scroll_to_load(sb, count)
logging.info("Extracting data...")
return self._parse_places(sb.get_page_source())
def _scroll_to_load(self, sb, count: int):
"""Scrolls the sidebar container, not the main window."""
loaded = 0
retries = 0
max_retries = 5
while loaded < count:
cards = sb.find_elements(".b_maglistcard")
current = len(cards)
logging.info("Loaded %d / %d places...", current, count)
if current >= count:
break
if current == loaded and current > 0:
retries += 1
if retries >= max_retries:
logging.warning("No new places loaded after 5 retries. Stopping.")
break
else:
retries = 0
loaded = current
# Scroll the sidebar element (.b_lstcards), not window
sb.execute_script(
"""
(function() {
let sidebar = document.querySelector('.b_lstcards');
if(sidebar) {
sidebar.scrollTop = sidebar.scrollHeight;
}
})();
"""
)
time.sleep(2.5)
def _parse_places(self, html: str):
"""Extracts structured data from data-entity JSON attribute."""
soup = BeautifulSoup(html, "html.parser")
places = []
for card in soup.select(".b_maglistcard"):
try:
entity_json = card.get("data-entity")
if not entity_json:
continue
data = json.loads(entity_json)
entity = data.get("entity", {})
geo = data.get("geometry", {})
address_raw = entity.get("address", "N/A")
address = (
address_raw.get("formattedAddress", "N/A")
if isinstance(address_raw, dict)
else str(address_raw)
)
# Reviews are in separate DOM elements, not in JSON
rating_el = card.select_one(".l_rev_pirs")
rating = rating_el.get_text(strip=True) if rating_el else "N/A"
reviews_el = card.select_one(".l_rev_rc")
reviews = reviews_el.get_text(strip=True) if reviews_el else "0"
places.append(
{
"name": entity.get("title"),
"phone": entity.get("phone", "N/A"),
"address": address,
"website": entity.get("website", "N/A"),
"category": entity.get("primaryCategoryName", "N/A"),
# Note: Bing uses y=latitude, x=longitude (not standard x/y convention)
"latitude": geo.get("y"),
"longitude": geo.get("x"),
"rating": rating,
"review_count": reviews,
}
)
except Exception as e:
logging.warning("Error parsing place: %s", e)
return places
def _save_results(data, base_filename):
json_file = f"{base_filename}.json"
with open(json_file, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
if data:
csv_file = f"{base_filename}.csv"
headers = list(data[0].keys())
with open(csv_file, "w", newline="", encoding="utf-8-sig") as f:
writer = csv.DictWriter(f, fieldnames=headers)
writer.writeheader()
writer.writerows(data)
logging.info(f"Saved {len(data)} results to {csv_file}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Bing Maps Scraper")
parser.add_argument("-q", "--query", type=str, default="coffee shops in Seattle")
parser.add_argument("-c", "--count", type=int, default=10)
parser.add_argument("-o", "--output", type=str, help="Output filename")
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
scraper = BingMapsScraper()
places = scraper.scrape(args.query, args.count)
if args.output:
fname = args.output
else:
fname = f"maps_{args.query.replace(' ', '_')}"
_save_results(places, fname)
Running the Scraper
You can get both JSON and CSV output automatically.
# Get 20 leads for "Dentists in New York"
python bing_maps_scraper.py -q "dentists in New York" -c 20
Understanding the Output
Because we parsed the JSON, the data is extremely clean. Note that Bing uses y for Latitude and x for Longitude (a quirk of their coordinate system).
{
"name": "Balthazar",
"phone": "(212) 965-1414",
"address": "80 Spring St, New York, NY 10012",
"category": "Restaurant",
"latitude": 40.72263336,
"longitude": -73.99828338
}
How to Scrape Bing Images
If you just scrape the visible <img> tags on Bing, you will be disappointed. Those tags only contain low-res, base64-encoded thumbnails that look terrible when scaled up.

To get the original, high-resolution URL (essential for ML datasets or visual research), you must parse the hidden m attribute. Bing stores the real metadata there as a JSON string.
The Strategy
We need to solve three specific mechanics to make this work:
- The Hidden JSON. We target the anchor tag <a class="iusc">. Inside, there is an attribute called m. We parse this string as JSON to extract the murl (Real Image URL).
- Infinite Scroll + "See More". Bing auto-loads images for a while, but eventually hits a "See More Results" button. Our script detects this button and clicks it to keep the pipeline moving.
- Duplicate Filtering. Infinite scrolling often re-renders elements. We use a set() to track URLs and ensure we don't save the same image twice.
The CSS Selectors
| Selector | Description |
|---|---|
| a.iusc | The container for the image result |
| m (attribute) | Hidden JSON containing murl (High-Res URL) |
| a.btn_seemore | The button that breaks infinite scroll (needs clicking) |
The Code
Save this as bing_image_scraper.py.
import json
import logging
import argparse
import time
from urllib.parse import quote_plus
from seleniumbase import SB
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
datefmt="%H:%M:%S",
)
class BingImageScraper:
def scrape(self, query: str, limit: int = 50):
logging.info(f"Starting Image Scrape for: '{query}' (Target: {limit})")
url = f"https://www.bing.com/images/search?q={quote_plus(query)}"
image_data = []
unique_urls = set()
with SB(uc=True, headless=True) as sb:
sb.open(url)
sb.wait_for_element("a.iusc", timeout=10)
logging.info("Page loaded. collecting images...")
last_count = 0
retries = 0
while len(image_data) < limit:
elements = sb.find_elements("a.iusc")
for elem in elements:
if len(image_data) >= limit:
break
try:
# 'm' attribute contains JSON with image URL and metadata
m_attr = elem.get_attribute("m")
if not m_attr:
continue
data = json.loads(m_attr)
img_url = data.get("murl")
title = data.get("t")
if img_url and img_url not in unique_urls:
unique_urls.add(img_url)
image_data.append(
{
"title": title,
"image_url": img_url,
"source_url": data.get("purl"),
}
)
except Exception:
continue
logging.info(f"Collected {len(image_data)} / {limit} images...")
if len(image_data) >= limit:
break
sb.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sb.sleep(1.5)
if sb.is_element_visible("a.btn_seemore"):
logging.info("Clicking 'See More' button...")
sb.click("a.btn_seemore")
sb.sleep(2)
current_count = len(sb.find_elements("a.iusc"))
if current_count == last_count:
retries += 1
if retries >= 3:
logging.warning(
"No new images loading. Reached end of results?"
)
break
else:
retries = 0
last_count = current_count
return image_data
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Bing Image Scraper")
parser.add_argument(
"-q", "--query", type=str, required=True, help="Image search query"
)
parser.add_argument(
"-c", "--count", type=int, default=50, help="Number of images to scrape"
)
parser.add_argument("-o", "--output", type=str, help="Output filename")
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
scraper = BingImageScraper()
images = scraper.scrape(args.query, args.count)
if images:
fname = (
args.output
if args.output
else f"images_{args.query.replace(' ', '_')}.json"
)
with open(fname, "w", encoding="utf-8") as f:
json.dump(images, f, indent=2, ensure_ascii=False)
logging.info(f"Saved {len(images)} images to {fname}")
else:
logging.error("Scrape failed or no images found.")
Running the Scraper
Run this from your terminal:
# Get 50 high-res sunset photos
python bing_image_scraper.py -q "sunset photography" -c 50
# Get 100 architecture photos for a dataset
python bing_image_scraper.py -q "modern architecture" -c 100
Understanding the Output
The image_url field is very useful here. It links directly to the source file, not the compressed Bing version.
[
{
"title": "55 Beautiful Examples of Sunset Photography",
"image_url": "https://www.thephotoargus.com/wp-content/uploads/2019/09/sunsetphotography01.jpg",
"source_url": "https://www.thephotoargus.com/beautiful-examples-of-sunset-photography/"
}
]
How to Scrape Bing Recipes
Recipe data (Calories, Prep Time, Ratings) is highly valuable for food aggregators. Bing presents this data in a sleek Carousel at the top of the search results.

The challenge? Lazy loading. Bing only loads the first 3 or 4 recipes in the DOM. The rest do not exist until you physically click the "Next" button. If you just scroll, you may miss most of the data.
The Strategy
To scrape this reliably, we need a different interaction model than standard scrolling:
- Active Navigation. We must locate the .btn.next element inside the carousel container (.b_acf_crsl) and click it repeatedly to force Bing to render the hidden slides.
- Calculated Clicks. We don't click blindly. We calculate the required clicks based on your target count (count / 3 items per slide). This prevents wasting time clicking on a finished carousel.
- Custom HTML Tags. Bing uses non-standard HTML tags like <acf-badge> for star ratings. Standard CSS selectors often miss these, so we target the tag name explicitly.
The CSS Selectors
| Selector | Description |
|---|---|
| .b_acf_crsl | The Carousel Container |
| .btn.next | The navigation button we must click to load more items |
| .acf_p_multi_str | Stats line (e.g., "1 hr · 450 cals") |
| acf-badge span | Custom Tag. Holds the star rating (e.g., "4.5") |
The Code
Save this as bing_recipe_scraper.py.
import json
import logging
import argparse
import math
import datetime
from seleniumbase import SB
from bs4 import BeautifulSoup
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
datefmt="%H:%M:%S",
)
class BingRecipeScraper:
def scrape(self, query: str, count: int = 20):
# Append "recipes" to trigger carousel if not present
full_query = f"{query} recipes" if "recipe" not in query.lower() else query
url = f"https://www.bing.com/search?q={full_query.replace(' ', '+')}"
logging.info(f"Starting scrape: '{query}' (Target: {count})")
with SB(uc=True, headless=True) as sb:
sb.open(url)
sb.wait_for_element("#b_results", timeout=10)
if sb.is_element_visible(".b_acf_crsl"):
estimated_clicks = math.ceil(count / 3) + 2
max_clicks = min(estimated_clicks, 30)
self._expand_carousel(sb, max_clicks)
else:
logging.warning("No recipe carousel found for this query")
html = sb.get_page_source()
scraped_data = self._parse_full_page(html)
final_output = {
"meta": {
"query": query,
"scraped_at": datetime.datetime.now().isoformat(),
"locale": "en-US",
"total_carousel_items": len(scraped_data["carousel_top"]),
"total_organic_items": len(scraped_data["organic_list"]),
},
"data": {
"recipes_surface": scraped_data["carousel_top"][:count],
"organic_results": scraped_data["organic_list"],
},
}
return final_output
def _expand_carousel(self, sb, max_clicks):
next_btn_selector = ".b_acf_crsl .btn.next"
if not sb.is_element_visible(".b_acf_crsl"):
logging.warning("No recipe carousel found on this page.")
return
logging.info(f"Expanding carousel (Max clicks: {max_clicks})...")
for i in range(max_clicks):
try:
if sb.is_element_visible(next_btn_selector):
classes = sb.get_attribute(next_btn_selector, "class")
if "disabled" in classes:
logging.info("Carousel reached the end.")
break
sb.click(next_btn_selector)
sb.sleep(0.5)
else:
break
except Exception as e:
logging.warning(f"Error clicking carousel: {e}")
break
def _parse_full_page(self, html):
soup = BeautifulSoup(html, "html.parser")
data = {"carousel_top": [], "organic_list": []}
seen_titles = set()
slides = soup.select(".b_acf_crsl .slide")
for slide in slides:
try:
title_elem = slide.select_one(".acf_p_title")
if not title_elem:
continue
title = title_elem.get_text(strip=True)
if title in seen_titles:
continue
stats = slide.select_one(".acf_p_multi_str")
link = slide.select_one(".b_acf_card_link")
img = slide.select_one(".cico img")
# Note: acf-badge is a custom element, not a standard CSS class
rating = slide.select_one("acf-badge span")
full_link = link.get("href") if link else None
if full_link and full_link.startswith("/"):
full_link = f"https://www.bing.com{full_link}"
data["carousel_top"].append(
{
"title": title,
"rating": rating.get_text(strip=True) if rating else None,
"time_and_stats": stats.get_text(strip=True) if stats else None,
"link": full_link,
"thumbnail": img.get("src") if img else None,
}
)
seen_titles.add(title)
except Exception:
continue
for item in soup.select("li.b_algo"):
try:
title_elem = item.select_one("h2 a")
if not title_elem:
continue
fact_row = item.select_one(".b_factrow")
snippet = item.select_one(".b_lineclamp2")
data["organic_list"].append(
{
"title": title_elem.get_text(strip=True),
"link": title_elem.get("href"),
"stats": (
fact_row.get_text(" | ", strip=True) if fact_row else None
),
"snippet": snippet.get_text(strip=True) if snippet else None,
}
)
except Exception:
continue
return data
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-q",
"--query",
type=str,
required=True,
help="Recipe search term (e.g. 'Pasta')",
)
parser.add_argument(
"-c", "--count", type=int, default=20, help="Number of carousel items to target"
)
parser.add_argument("-o", "--output", type=str, help="Output JSON filename")
args = parser.parse_args()
logging.getLogger("seleniumbase").setLevel(logging.WARNING)
bot = BingRecipeScraper()
results = bot.scrape(args.query, args.count)
if results:
fname = (
args.output
if args.output
else f"recipes_{args.query.replace(' ', '_')}.json"
)
with open(fname, "w", encoding="utf-8") as f:
json.dump(results, f, indent=2, ensure_ascii=False)
logging.info(
f"Saved {len(results['data']['recipes_surface'])} carousel + {len(results['data']['organic_results'])} organic to {fname}"
)
else:
logging.error("Scrape failed.")
Running the Scraper
The script automatically appends "recipes" to your query to maximize the chance of triggering the carousel.
# Get 20 Chocolate Cake recipes
python bing_recipe_scraper.py -q "chocolate cake" -c 20
# Get 50 Pasta recipes for a content database
python bing_recipe_scraper.py -q "pasta" -c 50
Understanding the Output
Note how the time_and_stats field aggregates multiple data points (Time, Calories, Servings) because Bing groups them in a single string. You can regex split this later if needed (see the sketch after the example below).
[
{
"title": "Chocolate Cake Recipe",
"rating": "4.9",
"stats": "1 hr 5 min · 1138 cals · 10 servs",
"link": "https://www.bing.com/images/search?view=detailv2&..."
}
]
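Here is a minimal splitter for that aggregated string. It assumes the " · "-separated "time · cals · servs" format shown above; any part Bing omits or renames simply stays None.
import re

def split_recipe_stats(stats):
    """Split '1 hr 5 min · 1138 cals · 10 servs' into separate fields."""
    result = {"time": None, "calories": None, "servings": None}
    if not stats:
        return result
    for part in (p.strip() for p in stats.split("·")):
        if re.search(r"\bcals?\b", part):
            result["calories"] = part
        elif re.search(r"\bservs?\b|\bservings?\b", part):
            result["servings"] = part
        elif re.search(r"\b(hr|hrs|min|mins)\b", part):
            result["time"] = part
    return result

# split_recipe_stats("1 hr 5 min · 1138 cals · 10 servs")
# -> {"time": "1 hr 5 min", "calories": "1138 cals", "servings": "10 servs"}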
How to Structure, Clean, and Store Bing Scraping Data
Scraping provides raw data, but the output is often inconsistent across different search surfaces. Organic results use link, Ads use real_url, and Maps results often lack a website entirely. Normalizing this data is necessary before you store it.
The Cleanup Pipeline (Pandas)
Using Pandas allows you to handle schema normalization and deduplication efficiently in memory.
The main logical challenge here is deduplication. Standard web scraping deduplicates by URL. However, many local businesses on Maps do not have a website at all. If you deduplicate by URL, you will unknowingly remove legitimate businesses that return a null value. The solution is to switch deduplication strategies based on the data type: use Name and Address for Maps, and URL for everything else.
Here is a standard normalization function:
import pandas as pd
from datetime import datetime
def clean_bing_data(results):
df = pd.DataFrame(results)
if df.empty: return df
# 1. Normalize Schema
# Map all URL variations to a single standard 'url' column
column_mapping = {
'link': 'url', # Organic
'real_url': 'url', # Ads
'website': 'url' # Maps
}
df = df.rename(columns=column_mapping)
# 2. Deduplication Logic
# Maps results often lack URLs. Deduplicate by Address instead to avoid data loss.
if 'address' in df.columns and 'name' in df.columns:
df = df.drop_duplicates(subset=['name', 'address'], keep='first')
elif 'url' in df.columns:
df = df.drop_duplicates(subset=['url'], keep='first')
# 3. Text Normalization
# Strip whitespace and fix internal newlines
text_cols = ['title', 'snippet', 'description', 'name', 'address']
for col in text_cols:
if col in df.columns:
df[col] = df[col].astype(str).str.strip().str.replace(r'\s+', ' ', regex=True)
# 4. Type Casting
if 'rank' in df.columns:
df['rank'] = pd.to_numeric(df['rank'], errors='coerce').fillna(0).astype(int)
# 5. Audit Trail
df['scraped_at'] = datetime.utcnow().isoformat()
return df
Storage Strategy
Your storage choice should depend on your data volume and access patterns.
| Volume | Recommended Format | Why |
|---|---|---|
| Small (< 10k) | CSV / JSON | Human-readable. Good for quick analysis or sharing with non-technical teams. |
| Medium (10k+) | Parquet | Compressed columnar format. It saves disk space and preserves data types better than CSV. |
| Ad-Hoc | SQLite | Serverless SQL. Useful for joining distinct datasets locally, such as combining organic results with ads. |
| Production | PostgreSQL | Best for time-series tracking. Use this to query rank history over long periods. |
Recommendation: Partition your files by date (e.g., data/YYYY-MM-DD/results.parquet). This allows you to compare datasets between dates without loading the entire history.
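As a concrete illustration of that layout, here is a minimal sketch of a daily Parquet snapshot using Pandas; it assumes a Parquet engine such as pyarrow is installed and that df is the cleaned DataFrame returned by clean_bing_data above.
from datetime import date
from pathlib import Path

def save_daily_snapshot(df, base_dir="data"):
    """Write the cleaned DataFrame to data/YYYY-MM-DD/results.parquet."""
    folder = Path(base_dir) / date.today().isoformat()
    folder.mkdir(parents=True, exist_ok=True)
    out_path = folder / "results.parquet"
    # to_parquet needs a Parquet engine such as pyarrow (pip install pyarrow)
    df.to_parquet(out_path, index=False)
    return out_path

# Comparing two days later is just two reads:
# old = pd.read_parquet("data/2025-01-01/results.parquet")
# new = pd.read_parquet("data/2025-01-02/results.parquet")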
How to Scale Bing Scraping Safely
A single script works well for testing. However, running that same script in a loop for thousands of queries will likely get your IP address blocked. Moving to production requires a shift in architecture.
The Production Architecture
Scaling safely means decoupling "job submission" from "scraping execution". Instead of running scrapers synchronously, push jobs into a queue system like Redis or Celery. Worker processes pick up these jobs one by one. This allows you to control the exact number of concurrent browsers, preventing you from overwhelming Bing or your own server.
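Here is a rough sketch of that submit/worker split using RQ (Redis Queue); it assumes a local Redis server, that rq is installed (pip install rq redis), and a hypothetical run_job wrapper you would add around one of the scrapers above. Celery follows the same pattern with a different API.
# producer.py - submits scrape jobs instead of running them inline
# Assumptions: Redis is running locally, rq is installed, and
# bing_jobs.run_job is a hypothetical wrapper around one of the scrapers above.
from redis import Redis
from rq import Queue

queue = Queue("bing_scrapes", connection=Redis())

def submit_jobs(queries):
    for query in queries:
        # Workers started with `rq worker bing_scrapes` pick these up one by one,
        # so the number of concurrent browsers stays bounded by the worker count.
        queue.enqueue("bing_jobs.run_job", query, job_timeout=300)

if __name__ == "__main__":
    submit_jobs(["best python frameworks", "marketing automation tools"])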
If a scraper fails, do not retry immediately. Implement an exponential backoff strategy where the wait time increases after each failure. If an IP address triggers a 429 (Too Many Requests) or a CAPTCHA, trip a circuit breaker for that proxy and pause it for a set period.
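A rough sketch of both patterns in plain Python follows; the thresholds and cooldown values are arbitrary starting points, not tested defaults:

import random
import time

def fetch_with_backoff(scraper_func, query, max_attempts=4):
    # Exponential backoff: wait ~2s, 4s, 8s (plus jitter) between retries
    for attempt in range(max_attempts):
        try:
            return scraper_func(query)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** (attempt + 1) + random.uniform(0, 1))

class ProxyBreaker:
    # Bench a proxy for `cooldown` seconds after `threshold` consecutive failures
    def __init__(self, threshold=3, cooldown=900):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = {}       # proxy -> consecutive failure count
        self.benched_until = {}  # proxy -> unix timestamp when it can be retried

    def record_failure(self, proxy):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
        if self.failures[proxy] >= self.threshold:
            self.benched_until[proxy] = time.time() + self.cooldown
            self.failures[proxy] = 0

    def record_success(self, proxy):
        self.failures[proxy] = 0

    def is_available(self, proxy):
        return time.time() >= self.benched_until.get(proxy, 0)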
Monitoring and Logging
You need visibility into your scraper's health. A sudden drop in response size often indicates that Bing is serving a CAPTCHA page, even if the HTTP status is 200.
Here is a wrapper that adds basic observability (duration, item counts, and failure types) to any scraper function. Note that we do not capture HTML snapshots here because the browser session has already closed; that logic belongs inside the scraper itself.
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)

def scrape_with_logging(query, scraper_func):
    start = time.time()
    try:
        # Execute the scraper
        results = scraper_func(query)
        duration = round(time.time() - start, 2)
        # Log success with duration to spot throttling (latency spikes)
        logging.info(f"SUCCESS | Query: {query} | Time: {duration}s | Items: {len(results)}")
        return results
    except Exception as e:
        duration = round(time.time() - start, 2)
        error_type = type(e).__name__
        logging.error(f"FAILURE | Query: {query} | Time: {duration}s | Error: {error_type}")
        raise
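Usage is just a matter of wrapping whichever scraper you are running, for example:

# Wrap the static scraper from earlier in this guide
results = scrape_with_logging("agentic ai tutorial", scrape_bing_static)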
Choosing the Right Tools
For infrastructure, stick to standard tools. Celery or RQ are excellent for managing job queues. For the actual requests, AsyncIO allows you to handle network operations without blocking your CPU.
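As a sketch of the AsyncIO side, you can wrap the blocking scraper from earlier in a worker thread and cap concurrency with a semaphore (the concurrency limit here is an arbitrary example):

import asyncio

async def scrape_many(queries, max_concurrency=5):
    # Cap concurrent scrapes so you don't overwhelm Bing or your own workers
    sem = asyncio.Semaphore(max_concurrency)

    async def one(query):
        async with sem:
            # Run the blocking scraper in a thread so the event loop stays free
            return await asyncio.to_thread(scrape_bing_static, query)

    return await asyncio.gather(*(one(q) for q in queries))

# asyncio.run(scrape_many(["web scraping", "machine learning"]))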
Finally, managing IP reputation is the most significant factor in reliability. You should not use your server's IP address for production scraping. Use a residential proxy service that rotates IPs automatically. This distributes your traffic pattern and makes your activity look like distinct users rather than a single bot.
How to Avoid Blocks When Scraping Bing
Bing blocks scrapers based on two factors: IP reputation and Request Velocity. If you send 100 requests from a datacenter IP (like AWS or DigitalOcean), you will likely get blocked quickly. It doesn't matter how good your headless browser is; if the IP is flagged, the request fails.
Free or datacenter proxies are generally not suitable for production Bing scraping. Those IP ranges are publicly known and often blacklisted by default on Bing. To run this in production, you need residential proxies.
Residential Proxies: The Solution
Residential proxies route traffic through ISP networks, which can make requests look more like normal user traffic and improve success rates. Live Proxies provides residential proxies built for higher-frequency scraping and time-sensitive operations.
IP Pool Contamination
Most proxy providers share IP pools across all customers:
- Customer A scrapes Bing aggressively → IPs get flagged.
- Customer B (you) buys proxies from the same provider → gets those burned IPs.
- You inherit Customer A's bad reputation.
Using Live Proxies for Bing scraping
Live Proxies uses target-specific IP allocation:
When you scrape Bing using Live Proxies, your IPs are isolated from other Bing scrapers through target-specific IP allocation. Those same IPs might be allocated to customers scraping Amazon or Google, but their activity on those sites doesn't affect your Bing reputation, and your Bing activity doesn't affect them.
- Sign up and purchase a plan
- Copy proxy credentials from the dashboard
- Implement:
# Format as copied from the dashboard: host:port:username:password
PROXIES = [
    "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-1:bW2VN4Zc5YSyK5nF82tK",
    "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-2:bW2VN4Zc5YSyK5nF82tK",
    "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-3:bW2VN4Zc5YSyK5nF82tK",
    # ...
]
Method 1: Random Proxy Selection
from seleniumbase import SB
import random

def scrape_with_rotation(query):
    # Pick a random proxy per query and convert it from the host:port:user:pass
    # format above to the user:pass@host:port string SeleniumBase expects
    host, port, user, password = random.choice(PROXIES).split(":")
    proxy = f"{user}:{password}@{host}:{port}"
    with SB(uc=True, proxy=proxy) as sb:
        sb.open(f"https://www.bing.com/search?q={query}")
        sb.wait_for_element("#b_results", timeout=10)
        html = sb.get_page_source()
        results = []  # ... parse results from html (same selectors as the organic scraper)
    return results

queries = ["sunset photography", "machine learning", "web scraping"]
for query in queries:
    data = scrape_with_rotation(query)
    print(f"{len(data)} results for '{query}'")
Use this for scraping different keywords where each query is independent (rank tracking, keyword research).
Method 2: Sticky Session
STICKY_PROXY = "131.143.362.36:7383:LV71125532-mDmfksl3onyoy-1:bW2VN4Zc5YSyK5nF82tK"

# Reuse the same host:port:user:pass -> user:pass@host:port conversion as in Method 1
host, port, user, password = STICKY_PROXY.split(":")
with SB(uc=True, proxy=f"{user}:{password}@{host}:{port}") as sb:
    sb.open("https://www.bing.com/search?q=sunset")
    # All requests in this session use the same IP
    # Perfect for pagination, form fills, multi-step workflows
    html = sb.get_page_source()
Session behavior: Each sticky session maintains the same IP for up to 60 minutes.
→ View plans (start at $70)
Need enterprise scale? Contact sales
How to Keep Bing Scraping Secure and Compliant
Security isn't just about encryption; it’s about good hygiene. The most effective security measure is simply not logging in. By sticking to public, unauthenticated data, you eliminate the risk of exposing customer credentials or violating strict user agreements.
Beyond that, treat your scraper like any production software. Never hardcode your proxy credentials directly in the script – always use environment variables so you don't accidentally push secrets to a code repository. If you store the data, keep it secure and limit access to only the team members who need it.
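For example, here is a minimal sketch of loading proxy credentials from the environment; the variable names PROXY_USER, PROXY_PASS, PROXY_HOST, and PROXY_PORT are just illustrative:

import os

# Read proxy credentials from the environment instead of hardcoding them
PROXY = (
    f"{os.environ['PROXY_USER']}:{os.environ['PROXY_PASS']}"
    f"@{os.environ['PROXY_HOST']}:{os.environ['PROXY_PORT']}"
)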
Finally, maintain a simple internal note explaining what you scrape and why. It helps your team stay aligned and serves as a useful record if anyone ever questions your data practices.
Best Use Cases for Bing Scraping
We generally build Bing scrapers to feed three specific data pipelines: SEO monitoring, News aggregation, and Local market analysis. Here is the architectural pattern for each.
SEO and SEM "Share of Voice"
The goal here is to measure exactly how often your domain appears against competitors for high-value keywords. You start with a list of category keywords (e.g., "project management software") and target markets. The process involves running the Organic and Ads scrapers weekly to store rank, advertiser names, and ad copy.
The key metric to track is Share of Voice (SoV) – the percentage of searches where your domain appears in the Top 3 positions. A significant drop usually indicates an algorithm update or a competitor's bid aggression.
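A rough sketch of that calculation, assuming your stored organic results have query, url, and rank columns (query is an extra column you attach at scrape time):

import pandas as pd

def share_of_voice(df, domain, top_n=3):
    # Share of queries where `domain` appears in the top N organic positions
    top = df[df["rank"] <= top_n]
    hits = top[top["url"].str.contains(domain, na=False, regex=False)]["query"].nunique()
    total = df["query"].nunique()
    return hits / total if total else 0.0

# share_of_voice(clean_bing_data(results), "yourdomain.com")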
News Latency Monitoring
For PR teams, the focus is on tracking how fast stories are indexed by search engines. The pipeline runs the News scraper hourly using the recency="hour" filter against a list of brand names and product keywords. The critical number here is Time-to-Index, which measures the delta between the Published Time and the Scraped Time. Tier-1 publications typically appear within 15–60 minutes; if this extends to 3+ hours, you may have an indexing issue or Bing's indexing speed may have changed.
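As a sketch, the delta is a simple timestamp subtraction; the published_at column name is an assumption about how you store the publisher date, while scraped_at comes from the cleanup step above:

import pandas as pd

def time_to_index(df):
    # Minutes between the publisher's timestamp and when we scraped the result
    published = pd.to_datetime(df["published_at"], errors="coerce", utc=True)
    scraped = pd.to_datetime(df["scraped_at"], errors="coerce", utc=True)
    df["time_to_index_minutes"] = (scraped - published).dt.total_seconds() / 60
    return df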
Local Reputation Analysis
This pipeline monitors reputation across multi-location chains, such as franchises or retail stores. You generate a list of local queries (e.g., "coffee shops in Seattle") and run the Maps scraper to extract star ratings and review counts from the top results. The metric to watch is Average Rating Spread. For example, comparing your average rating in a specific city against the top 3 competitors reveals operational problems in specific regions that aggregate data might hide.
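A minimal sketch of that comparison, assuming Maps rows with city, name, and rating columns (the city column is something you attach from the query you ran):

import pandas as pd

def rating_spread(df, your_brand):
    # Difference between your average rating and competitors' average rating, per city
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    is_yours = df["name"].str.contains(your_brand, case=False, na=False)
    yours = df[is_yours].groupby("city")["rating"].mean()
    competitors = df[~is_yours].groupby("city")["rating"].mean()
    return (yours - competitors).sort_values()  # most negative cities need attention first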
Further reading: How to Scrape Google Search Results Using Python in 2025: Code and No-Code Solutions and How to Scrape X.com (Twitter) with Python and Without in 2025.
What is the Bottom Line on Scraping Bing
Start with a single script and a few test queries. You need to verify how Bing structures its data before you attempt to scale.
For production, the bottleneck will always be IP Reputation. You cannot scale this on datacenter IPs; they are typically blocked within the first few requests. Live Proxies uses target-specific IP allocation, so your Bing IPs aren't shared with other Bing scrapers, but regardless of the provider, make sure you are using a residential network with rotation.
The scrapers we built handle the complexity of dynamic loading and parsing. You can focus on what matters: extracting insights from the data.
FAQs
How often should I recrawl Bing search results
The crawl frequency depends entirely on the volatility of the data source. Search ranks generally require a weekly cadence to capture algorithm updates, whereas news monitoring demands hourly scrapes to catch breaking stories before they age out. Ad campaigns typically rotate daily, so a 24-hour cycle is appropriate there. Conversely, Maps data is highly stable; unless you are specifically tracking review velocity, monthly scraping is sufficient.
Why are Bing results different by region and device
Bing personalizes results based on IP geolocation, device type, and search history. To see what a real user in Berlin sees, you must route your request through a German residential proxy and set a mobile User-Agent header. Additionally, ensure you are clearing cookies or using incognito contexts between queries to prevent previous search history from biasing the current results.
Can I collect only Bing ads and skip organic
Yes, and this is a common optimization for competitive intelligence. You can target only the ad containers at the top and bottom of the page and ignore the organic results. This reduces parsing overhead significantly and lowers storage costs, since you discard the majority of the page content.
What if HTML layouts change during a run
Bing runs A/B tests frequently, so rigid selectors will eventually break. Best practice is to use multiple fallback selectors within the parser logic. We also recommend versioning your parsers and logging which version processed each result. If a selector fails, you can update the parser and reprocess the stored raw HTML without having to re-scrape the live site.
Do proxies really help with Bing scraping
Residential proxies are effectively a requirement for production scraping on Bing. Datacenter IPs (like those from AWS or DigitalOcean) are often fingerprinted and typically blocked after a few requests. Residential proxies route traffic through real home internet connections, which makes your scraper much harder to distinguish from organic users. In practice, this is the most reliable way to maintain a high success rate at scale.
How do I keep costs low as I scale
The biggest cost driver is the headless browser. For standard organic results, consider switching to a TLS-mimicking HTTP client (like curl_cffi) for the initial request, which is significantly cheaper than running a full Chrome instance. Store your output in Parquet format rather than JSON to significantly reduce storage volume (typically 70-90% compression, depending on your data structure). Finally, use Spot instances for your worker nodes to cut compute costs.
Can I mix API data and scraped data
You can, but you must normalize the schemas first. Bing’s official API often uses different field names than what you extract from the HTML. The best approach is to define a single internal schema and write transformers for both data sources to map into that standard. Tag each record with source: api or source: scraped so you can trace data quality issues back to the origin.
What data should I never store
Avoid storing PII (Personally Identifiable Information) found in snippets, such as personal emails or phone numbers that aren't clearly business contacts. For raw HTML, implement a retention policy – extract the structured data immediately and delete the raw files after 30 days. From a security standpoint, ensure you never log proxy credentials, API keys, or session tokens in plain text.
How do I test that my data is accurate
Automated validation is key. You should implement checks that verify "sanity" metrics: ensure organic result counts meet a minimum threshold (e.g., at least 8 results per page) and cross-reference a sample of ad URLs to ensure they resolve to live landing pages. If you are scraping news, compare the extracted timestamp against the publisher's date to ensure you aren't picking up stale cache data.
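For example, a small sketch of those sanity checks on a page of organic results (the minimum-count threshold is illustrative, and the link field matches the organic scraper's output shape):

def validate_batch(results, min_results=8):
    # Basic sanity checks on one page of parsed organic results
    issues = []
    if len(results) < min_results:
        issues.append(f"Only {len(results)} results; possible CAPTCHA or partial page")
    if any(not r.get("link", "").startswith("http") for r in results):
        issues.append("Found results with malformed or missing URLs")
    return issues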
Where should I put alerts and what should they say
Alerts should be routed to a dedicated Slack channel or PagerDuty service, but only for actionable failures. Specific thresholds, such as error rates spiking above 10% or average response times exceeding 5 seconds, are better than alerting on every single failure. The alert payload should include the query that failed, the error type (e.g., "Selector Not Found"), and the proxy IP used, which helps you quickly identify if the issue is a code break or a proxy ban.