Web Scraping with LLaMA 3: Turn Any Website into Structured JSON (2025 Guide)

With LLaMA 3, I can easily turn messy HTML into neat, structured JSON. It’s a smarter and more reliable way to scrape data, making the whole process smoother and less prone to errors. Let’s dive into how it works!

What is LLaMA 3?

LLaMA 3, which stands for Large Language Model Meta AI, is Meta’s third iteration of an open-weight language model, released in April 2024. It is a large language model that supports various tasks, from text generation to natural language understanding. The original LLaMA 3 release shipped in 8-billion-parameter (8B) and 70-billion-parameter (70B) sizes, and the later LLaMA 3.1 release added a 405-billion-parameter (405B) variant. These models are designed to run efficiently on various hardware setups, making them accessible for many developers and businesses.

Unlike traditional scraping methods, which rely on predefined selectors, LLaMA 3 understands the content contextually. This means it can intelligently extract information from web pages, even if the layout changes or the website has complex, dynamically loaded content.

Why Use LLaMA 3 for Web Scraping?

Web scraping has become more complicated over time due to several factors:

  • Dynamic Content: Many websites now load content dynamically using JavaScript. This makes traditional scraping methods that rely on static HTML selectors ineffective.
  • Website Layout Changes: Web pages frequently update their designs, breaking scraping scripts that depend on specific element locations.
  • Anti-Bot Protections: Many websites have implemented anti-bot measures, such as CAPTCHA challenges, IP blocking, and JavaScript-based protections, making it harder to scrape data without getting blocked.

Traditional scraping methods, like using XPath or CSS selectors, are fragile because they break when the website layout changes. However, LLaMA 3 offers a new way of handling these challenges. It can parse the content contextually, making it less likely to break due to minor changes in the website layout.

Some of the benefits of using LLaMA 3 for web scraping include:

  • Flexibility: LLaMA 3 can adapt to different website structures, making it perfect for scraping dynamic and frequently changing websites.
  • Efficiency: By converting raw HTML into clean, structured data in JSON format, LLaMA 3 makes it easier to store and process scraped information.
  • Reliability: The model’s contextual understanding of content ensures that it extracts only relevant data, reducing the risk of errors.

🔗 Boost Web Scraping with Bright Data’s MCP

For more reliable and up-to-date results, consider integrating Bright Data’s Model Context Protocol (MCP) with your LLaMA 3 setup. MCP provides real-time web access, bypasses geo-restrictions and bot protections, and ensures your model processes the freshest data available. It’s especially useful when scraping dynamic or protected sites — making your pipeline more robust and accurate without extra complexity.


Note: I am not affiliated with Bright Data.

Setting Up Your Environment for LLaMA 3

Before diving into the scraping process, there are a few prerequisites that you need to have in place:

  • Python 3: Make sure Python 3 is installed on your system. This guide assumes you have basic Python knowledge.
  • Operating System Compatibility: Ollama, the tool used here to run LLaMA 3 locally, supports macOS (Big Sur or later), Linux, and Windows (10 or later).
  • Hardware Resources: Depending on your chosen model size, you’ll need sufficient RAM and disk space. For example, LLaMA 3.1 with 8 billion parameters requires about 6–8 GB of RAM and 4.9 GB of disk space.

Once your environment is ready, you’ll need to install Ollama, a tool for downloading, setting up, and running LLaMA models locally.

Installing Ollama

Ollama simplifies the process of downloading and setting up LLaMA 3 models. To get started:

  1. Visit the official Ollama website and download the application compatible with your operating system.
  2. Follow the installation instructions provided on the website.

After installing Ollama, you’ll need to select the right model based on your hardware and use case. For most users, the llama3.1:8b model is ideal because it offers a good balance between performance and resource requirements.

Running LLaMA 3 Locally

Once you’ve installed Ollama, you can begin using LLaMA 3 by following these steps:

  1. Download the Model: Open your terminal and run the following command to download the LLaMA 3.1 model:
ollama run llama3.1:8b

This command will download the model and start an interactive prompt where you can test the model by sending queries like:

>>> who are you?

I am LLaMA, an AI assistant developed by Meta AI…

  2. Start the Ollama Server: To run the LLaMA model as a local server, use the following command:
ollama serve

This starts the server at http://127.0.0.1:11434/, which you can access from your browser to verify that the server is running.

Building a Web Scraper Using LLaMA 3

Now that LLaMA 3 is set up and running, let’s walk through the process of building a simple web scraper to extract product information from an e-commerce website like Amazon.

The scraper will follow a multi-stage workflow:

  1. Browser Automation: Use Selenium to load the page and render dynamic content.
  2. HTML Extraction: Identify the product details container on the webpage.
  3. Markdown Conversion: Convert the HTML content to Markdown for better processing by LLaMA.
  4. Data Extraction with LLaMA: Use LLaMA to extract structured data and convert it into JSON format.
  5. Output Handling: Save the extracted data to a JSON file for further analysis.

Step 1: Install Required Libraries

To get started, install the necessary Python libraries:

pip install requests selenium webdriver-manager markdownify

  • requests: Allows sending HTTP requests to the LLaMA model.
  • selenium: Automates browser interactions, which is especially useful for websites with dynamic content.
  • webdriver-manager: Helps manage the correct version of ChromeDriver needed for Selenium.
  • markdownify: Converts HTML content into Markdown format for easier processing by LLaMA.

Step 2: Set Up the Selenium WebDriver

Next, you’ll need to set up a headless browser using Selenium. This allows you to interact with websites programmatically without opening a visual browser.

Here’s how to initialize a headless browser with Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

Step 3: Extract HTML from Amazon Product Page

Now, load the product page and extract the HTML of its details container. Amazon product pages render the main product information inside an element with the ID ppd, which you can wait for and capture using Selenium:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Navigate to the product page first (replace with your target URL)
driver.get("https://www.amazon.com/dp/XXXXXXXXXX")

wait = WebDriverWait(driver, 15)
product_container = wait.until(EC.presence_of_element_located((By.ID, "ppd")))
page_html = product_container.get_attribute("outerHTML")

Step 4: Convert HTML to Markdown

Convert the extracted HTML into Markdown to optimize LLaMA’s processing. Markdown is cleaner and more token-efficient than raw HTML, which makes it easier for large language models to process.

from markdownify import markdownify as md
clean_text = md(page_html, heading_style="ATX")
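One practical concern: the model’s context window is finite (the request in Step 6 sets num_ctx to 12000), so very long pages should be trimmed before sending. Below is a rough, stdlib-only sketch; truncate_for_context is my own helper, and the ~4-characters-per-token ratio is only a heuristic, not a real tokenizer:

```python
def truncate_for_context(text: str, max_tokens: int = 12000, chars_per_token: int = 4) -> str:
    """Trim text to roughly fit a token budget.

    Uses the common ~4-characters-per-token approximation; for exact
    counts you would need the model's actual tokenizer.
    """
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    # Cut at the last newline before the limit to avoid splitting a line.
    cut = text.rfind("\n", 0, max_chars)
    return text[: cut if cut > 0 else max_chars]
```

You could apply it as clean_text = truncate_for_context(clean_text) before building the request.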

Step 5: Create a Structured Data Extraction Prompt

The key to successful scraping with LLaMA 3 is crafting the right prompt. The prompt instructs LLaMA on what data to extract from the provided content.

Here’s a sample prompt for extracting product details:

PROMPT = (
    "You are an expert Amazon product data extractor. Your task is to extract product data from the provided content. "
    "Return ONLY valid JSON with EXACTLY the following fields and formats:\n\n"
    "{\n"
    '  "title": "string - the product title",\n'
    '  "price": number - the current price (numerical value only),\n'
    '  "original_price": number or null - the original price if available,\n'
    '  "discount": number or null - the discount percentage if available,\n'
    '  "rating": number or null - the average rating (0–5 scale),\n'
    '  "review_count": number or null - total number of reviews,\n'
    '  "description": "string - main product description",\n'
    '  "features": ["string"] - list of bullet point features,\n'
    '  "availability": "string - stock status",\n'
    '  "asin": "string - 10-character Amazon ID"\n'
    "}\n\n"
    "Return ONLY the JSON without any additional text."
)
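Even with a low temperature and an explicit schema in the prompt, the model can still omit fields or return wrong types, so it is worth validating the parsed response before trusting it. Here is a minimal stdlib-only check against the fields requested above; validate_product is my own helper, not part of any library:

```python
# Expected schema from the prompt: field name -> allowed Python types.
# type(None) in a tuple marks a nullable field.
EXPECTED_FIELDS = {
    "title": (str,),
    "price": (int, float),
    "original_price": (int, float, type(None)),
    "discount": (int, float, type(None)),
    "rating": (int, float, type(None)),
    "review_count": (int, type(None)),
    "description": (str,),
    "features": (list,),
    "availability": (str,),
    "asin": (str,),
}

def validate_product(data: dict) -> list:
    """Return a list of problems; an empty list means the data looks valid."""
    problems = []
    for field, types in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], types):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return problems
```

If the returned list is non-empty, you could retry the request or log the raw model output for inspection.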

Step 6: Call the LLaMA API

With the Markdown content ready, send it to the LLaMA API to extract the structured data. You’ll use the following Python code to send the request:

import requests
import json

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": f"{PROMPT}\n\n{clean_text}",
        "stream": False,
        "format": "json",
        "options": {
            "temperature": 0.1,
            "num_ctx": 12000,
        },
    },
    timeout=250,
)

raw_output = response.json()["response"].strip()
product_data = json.loads(raw_output)
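Even with "format": "json" in the request, some model outputs occasionally wrap the JSON in extra text, which makes a strict json.loads call fail. A defensive, stdlib-only fallback (parse_json_loosely is my own helper, not part of Ollama’s API) scans for the first decodable object:

```python
import json

def parse_json_loosely(raw: str) -> dict:
    """Parse raw model output as JSON, tolerating surrounding text.

    Tries a strict parse first, then scans for the first '{' from which
    a complete JSON object can be decoded.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    decoder = json.JSONDecoder()
    for i, ch in enumerate(raw):
        if ch == "{":
            try:
                obj, _ = decoder.raw_decode(raw, i)
                return obj
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON object found in model output")
```

Swapping json.loads(raw_output) for parse_json_loosely(raw_output) makes the pipeline tolerant of occasional chatty responses.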

Step 7: Save the Results

Finally, save the extracted product data to a JSON file:

with open("product_data.json", "w", encoding="utf-8") as f:
    json.dump(product_data, f, indent=2, ensure_ascii=False)

Conclusion

Using LLaMA 3 for web scraping is a game-changer. It extracts data from websites more efficiently and reliably than traditional selector-based methods, and converting raw HTML into structured JSON makes the results easy to store and process. Whether you are scraping product details from e-commerce sites like Amazon or extracting data from other complex websites, LLaMA 3 offers a flexible, powerful way to make web scraping both simpler and more resilient.
