Building a Universal Product Feed from Amazon Data with Python & Pandas

Technical Talk for Developers

Scraping data is often only half the battle; formatting it is the other half. You might successfully extract thousands of product details from Amazon, but if that data isn't structured correctly, it's useless for downstream applications. Ad platforms like Google Merchant Center or Facebook Catalog require strict, standardized formats before they allow you to upload a single product.

This guide bridges the gap between raw web data and a usable ad catalog. We’ll build a Python-based pipeline that scrapes Amazon product details using BeautifulSoup and transforms that raw data into a Google Shopping-ready CSV using Pandas.

By the end of this tutorial, you’ll have a reusable script that takes a list of Amazon URLs and outputs a clean google_shopping_feed.csv ready for import.

Prerequisites & Setup

We will use three primary libraries for this project:

  1. Requests: To fetch the HTML content of the Amazon pages.

  2. BeautifulSoup4: To parse the HTML and extract specific data points.

  3. Pandas: To clean the data and handle the final CSV export.

Install these via pip:

pip install pandas beautifulsoup4 requests

A Note on Amazon Scraping: Amazon employs sophisticated anti-bot measures. For a production-level pipeline, you would typically use a proxy rotation service or the ScrapeOps Proxy Port to avoid blocks. For this tutorial, we will use custom headers to mimic a real browser.

1. The Extraction (BeautifulSoup)

We need a function that targets specific Amazon DOM elements. We’re looking for the Product Title, Price, Availability, Image URL, and ASIN (Amazon Standard Identification Number).

import requests
from bs4 import BeautifulSoup

def scrape_amazon_product(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            return None

        soup = BeautifulSoup(response.content, "html.parser")

        # Look up each element once; a missing element falls back to a default
        title_el = soup.find("span", {"id": "productTitle"})
        price_el = soup.find("span", {"class": "a-offscreen"})
        stock_el = soup.find("div", {"id": "availability"})
        image_el = soup.find("img", {"id": "landingImage"})

        data = {
            "raw_title": title_el.get_text(strip=True) if title_el else None,
            "raw_price": price_el.get_text(strip=True) if price_el else None,
            "availability": stock_el.get_text(strip=True) if stock_el else "In Stock",
            # .get() returns None instead of raising if the src attribute is absent
            "image_url": image_el.get("src") if image_el else None,
            # Take the path segment after /dp/ and drop any query string
            "asin": url.split("/dp/")[1].split("/")[0].split("?")[0] if "/dp/" in url else None,
            "link": url
        }
        return data
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

In this snippet, we use soup.find() to locate elements by their ID or class name. Amazon's markup varies by product category, so each lookup is guarded by a conditional expression that returns a fallback value instead of crashing when an element is missing.
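One way to extend the fallback idea is to try several selectors in order and keep the first one that matches. The sketch below is only an illustration: first_text, TITLE_SELECTORS, and PRICE_SELECTORS are hypothetical names, and the selector lists are assumptions that should be checked against the pages you actually scrape.

```python
from bs4 import BeautifulSoup

def first_text(soup, selectors):
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            text = element.get_text(strip=True)
            if text:
                return text
    return None

# Hypothetical fallback chains (assumptions, not a verified list of Amazon layouts)
TITLE_SELECTORS = ["span#productTitle", "h1#title"]
PRICE_SELECTORS = ["span.a-price span.a-offscreen", "span.a-offscreen"]

# Tiny stand-in page so the helper can be tried without a network request
sample_html = '<h1 id="title"> Example Widget </h1><span class="a-offscreen">$19.99</span>'
soup = BeautifulSoup(sample_html, "html.parser")

title = first_text(soup, TITLE_SELECTORS)
price = first_text(soup, PRICE_SELECTORS)
print(title, price)
```

Keeping the selectors in plain lists means a layout change can usually be handled by adding one more entry rather than editing the scraping function.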

Writing Amazon selectors manually is a great learning exercise, but in real workflows, many teams rely on ready-made Amazon scraper scripts to handle layout changes and product page variations.

2. Loading Data into Pandas

Once we can scrape a single page, we need to process a list of URLs and move that data into a Pandas DataFrame. A DataFrame lets us clean, rename, and export whole columns at once instead of looping over Python lists by hand.

import pandas as pd

product_urls = [
    "https://www.amazon.com/dp/B08N5KWBXH/",
    "https://www.amazon.com/dp/B09G96TFF7/",
]

raw_results = []
for url in product_urls:
    product_data = scrape_amazon_product(url)
    if product_data:
        raw_results.append(product_data)

# Initialize the DataFrame
df = pd.DataFrame(raw_results)
print(df.head())

At this stage, the data is "dirty." Prices are likely strings like $999.00, titles might contain trailing whitespace, and the availability field is often a long sentence like "Only 5 left in stock - order soon."

3. Data Cleaning & Normalization

Google Merchant Center expects a numeric price (we append the currency code in the next section) and an availability value from a fixed vocabulary, such as in_stock, out_of_stock, preorder, or backorder.

We'll use Pandas' .apply() and .str.strip() methods to sanitize the data:

import re

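def clean_price(price_str):
    # Treat None, NaN, or any other non-string value as a missing price
    if not isinstance(price_str, str):
        return None
    # Remove currency symbols and commas, keep digits and the decimal point
    numeric_price = re.sub(r'[^\d.]', '', price_str)
    try:
        return float(numeric_price)
    except ValueError:
        # Nothing parsable was left, e.g. an empty string
        return None

def normalize_availability(status):
    # A missing status cannot be confirmed as purchasable
    if not isinstance(status, str):
        return "out_of_stock"
    status = status.lower()
    # "Only 5 left in stock" also contains "in stock"
    if "in stock" in status:
        return "in_stock"
    return "out_of_stock"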

# Apply cleaning logic
df['price'] = df['raw_price'].apply(clean_price)
df['availability'] = df['availability'].apply(normalize_availability)
df['raw_title'] = df['raw_title'].str.strip()

# Drop the raw columns we no longer need
df = df.drop(columns=['raw_price'])

Converting the price to a float and normalizing the availability brings these two fields in line with the Google Merchant Center Product Data Specification.
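The normalize_availability function above only ever returns in_stock or out_of_stock. As a rough sketch of how other values such as preorder could be produced, the example below maps phrases to feed values with an ordered rule list. The name map_availability and the phrases in AVAILABILITY_RULES are assumptions for illustration; the wording Amazon actually shows may differ.

```python
# Ordered (phrase, feed value) pairs; the first phrase found in the text wins.
# The phrases are assumptions and should be adjusted to the text you observe.
AVAILABILITY_RULES = [
    ("pre-order", "preorder"),
    ("preorder", "preorder"),
    ("out of stock", "out_of_stock"),
    ("unavailable", "out_of_stock"),
    ("in stock", "in_stock"),
]

def map_availability(status, default="out_of_stock"):
    """Map a free-text availability message to a feed value."""
    if not isinstance(status, str):
        return default
    text = status.lower()
    for phrase, value in AVAILABILITY_RULES:
        if phrase in text:
            return value
    return default

for message in ["Only 5 left in stock - order soon.",
                "Available for pre-order",
                "Currently unavailable.",
                None]:
    print(repr(message), "->", map_availability(message))
```

Defaulting to out_of_stock is a conservative choice: an unrecognized message is less risky to under-advertise than to list as purchasable.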

4. Mapping to Google Merchant Center Specs

Now we need to map the Amazon-extracted fields to the specific header names Google requires. We also need to add "static" columns—values that are the same for every product in our feed, such as the condition of the item.

Amazon Field    Google Field    Description
asin            id              Unique identifier
raw_title       title           Product name
link            link            Direct URL
image_url       image_link      Main image URL
price           price           Numeric price + Currency

# Rename columns to match Google's schema
mapping = {
    'asin': 'id',
    'raw_title': 'title',
    'image_url': 'image_link'
}
df = df.rename(columns=mapping)

# Add static required fields
df['condition'] = 'new'
df['brand'] = 'Generic' 
df['identifier_exists'] = 'no' # Set to 'yes' if you have GTIN/MPN

# Format price for Google (e.g., "99.99 USD"); missing prices stay empty
df['price'] = df['price'].map(lambda p: f"{p:.2f} USD" if pd.notna(p) else None)

The identifier_exists field matters here. If a product has no GTIN (barcode) or MPN, Google may disapprove or limit the item unless you explicitly state that no unique identifier exists for it.
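If some of your products do have a GTIN, the flag can be set per row instead of as a constant. This is a minimal sketch under an assumption: the gtin column below is hypothetical, since the scraper in this tutorial does not collect it.

```python
import pandas as pd

# Hypothetical feed rows; "gtin" is an assumed column, not produced by the scraper above
feed = pd.DataFrame({
    "id": ["B000000001", "B000000002"],
    "gtin": ["00012345678905", None],
})

# A row has an identifier only if its GTIN is present and not blank
has_gtin = feed["gtin"].fillna("").str.strip() != ""
feed["identifier_exists"] = has_gtin.map({True: "yes", False: "no"})

print(feed)
```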

5. Exporting the Feed

The final step is generating the file. While Google supports XML, CSV is the easiest format to debug and manage. We must use UTF-8 encoding to handle any special characters in product titles.

# Reorder columns to be clean
final_columns = ['id', 'title', 'link', 'price', 'availability', 'image_link', 'condition', 'brand', 'identifier_exists']
df = df[final_columns]

# Export to CSV
df.to_csv('google_shopping_feed.csv', index=False, encoding='utf-8')

print("Feed generated successfully: google_shopping_feed.csv")

When you open this file in Excel or Google Sheets, you should see a structured table where every column matches a Google requirement.

Validation & Automation Tips

Before uploading your feed to a live campaign, use the Diagnostics tool within Google Merchant Center. It will flag common errors such as:

  • Missing Images: Check if your scraper was blocked before fetching the image URL.

  • Price Mismatch: Google may crawl your site to verify the CSV price matches the actual page.

  • Invalid Characters: Ensure your CSV encoding is strictly UTF-8.
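Some of these problems can be caught locally before the upload. The sketch below checks one feed row at a time; validate_row, REQUIRED_FIELDS, and the price pattern are illustrative assumptions, not an official Google validator.

```python
import re

REQUIRED_FIELDS = ["id", "title", "link", "price", "availability", "image_link", "condition"]
ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "preorder", "backorder"}
# A number with up to two decimals, a space, and a three-letter currency code
PRICE_PATTERN = re.compile(r"^\d+(\.\d{1,2})? [A-Z]{3}$")

def is_blank(value):
    """True for None, NaN, or a string that is empty after stripping."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return str(value).strip() == ""

def validate_row(row):
    """Return a list of problems found in one feed row (a dict)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if is_blank(row.get(field))]
    if row.get("availability") not in ALLOWED_AVAILABILITY:
        problems.append("availability is not a recognized value")
    if not PRICE_PATTERN.match(str(row.get("price", ""))):
        problems.append("price is not in '<number> <currency>' form")
    return problems

good = {
    "id": "B000000001",
    "title": "Example Widget",
    "link": "https://www.amazon.com/dp/B000000001/",
    "price": "19.99 USD",
    "availability": "in_stock",
    "image_link": "https://example.com/widget.jpg",
    "condition": "new",
}
bad = dict(good, price="nan USD", image_link="")

print(validate_row(good))
print(validate_row(bad))
```

With the DataFrame from this tutorial, the rows could be produced with df.to_dict("records") and any row that returns a non-empty list held back from the export.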

To keep your feed fresh, you can schedule this script with a cron job (on Linux) or Task Scheduler (on Windows) to run once every 24 hours. If you scale to thousands of URLs, consider using a database like SQLite to track which products have already been scraped and avoid redundant requests.
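The SQLite idea could look roughly like the following. It is a sketch using Python's built-in sqlite3 module; the table name scraped and the helpers open_tracker, needs_scrape, and mark_scraped are hypothetical, and the 24-hour window simply mirrors the schedule suggested above.

```python
import sqlite3
import time

def open_tracker(path=":memory:"):
    """Open (or create) the tracking database; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped ("
        "url TEXT PRIMARY KEY, scraped_at REAL NOT NULL)"
    )
    return conn

def needs_scrape(conn, url, max_age_seconds=24 * 3600, now=None):
    """True if the URL was never scraped or the last scrape is too old."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT scraped_at FROM scraped WHERE url = ?", (url,)
    ).fetchone()
    return row is None or (now - row[0]) >= max_age_seconds

def mark_scraped(conn, url, now=None):
    """Record a successful scrape, replacing any earlier timestamp."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO scraped (url, scraped_at) VALUES (?, ?)",
        (url, now),
    )
    conn.commit()

conn = open_tracker()
url = "https://www.amazon.com/dp/B08N5KWBXH/"
if needs_scrape(conn, url):
    # The call to scrape_amazon_product(url) would go here
    mark_scraped(conn, url)
print(needs_scrape(conn, url))
```

Passing a file path instead of ":memory:" lets the record survive between scheduled runs.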

To wrap up

Building a universal product feed is about data translation. By using Pandas, we’ve turned a messy Amazon scrape into a structured asset that ad platforms can understand.

Key Takeaways:

  • Extraction is the foundation: Use specific selectors and fallbacks in BeautifulSoup.

  • Clean your data: Use Regex and Pandas .apply() to strip currency symbols and normalize statuses.

  • Schema Mapping: Always align your final columns with the target platform's specifications.

  • Static Fields: Include required fields like condition and identifier_exists to avoid rejection.

As a next step, try expanding your scraper to include the description or google_product_category attributes to improve search relevance.
