Web Scraping With Goutte in PHP: Complete Guide

I’ll walk you through everything step by step — from setting things up to getting real data from websites with Goutte. I’ll also share what to watch out for, since scraping can sometimes get tricky. If you’re ready to learn something useful and fun, let’s dive in and start scraping!

What is Goutte?

Goutte is a small PHP library that helps you scrape data from websites. It makes it easy to send HTTP requests (like visiting a webpage) and to examine a site’s HTML code. You can then collect information on the page. Goutte works well for static pages, which don’t load data with JavaScript.

Goutte uses other PHP tools in the background. It is based on Symfony’s DomCrawler and BrowserKit. These tools help you explore web pages, click links, and fill out forms.
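
For example, here is a minimal sketch of what those components let you do beyond simple page loads. The link text and form field below are hypothetical, just to illustrate the API:

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');

// Follow a link by its visible text ('Next page' is a hypothetical label);
// BrowserKit resolves the target URL for you.
$crawler = $client->click($crawler->selectLink('Next page')->link());

// Fill out and submit a form by its button label ('Search' is hypothetical).
$form = $crawler->selectButton('Search')->form();
$crawler = $client->submit($form, ['q' => 'hockey']);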

How to Scrape Web Data in PHP Using Goutte?

Before using Goutte, you need to have some tools ready:

  1. PHP installed (version 7.1 or newer).
  2. Composer (PHP’s dependency manager).
  3. A code editor like VS Code, Sublime Text, or PHPStorm.

Check if PHP is installed:

php -v

If your PHP version is below 7.1, update it first.

Step 1: Create a New Project

Open your terminal or command prompt. Create a new folder for your project:

mkdir php-goutte-scraper
cd php-goutte-scraper

Initialize the project using Composer:

composer init

You can press enter to accept the default values for now. After that, install Goutte:

composer require fabpot/goutte

This command adds Goutte to your project. Composer will download all the needed files.

Step 2: Create Your PHP File

In the folder, create a file called index.php. Open it in your code editor. Start by adding this code:

<?php

require_once __DIR__ . '/vendor/autoload.php';

use Goutte\Client;

$client = new Client();

// Your scraping code will go here

Now you’re ready to start scraping!

Step 3: Choose a Website to Scrape

Let’s choose a simple example site that allows scraping:
https://www.scrapethissite.com/pages/forms/

This page shows data about hockey teams. Our goal is to collect team names, wins, and other stats.

Step 4: Load the Page With Goutte

Add this to your script:

$url = 'https://www.scrapethissite.com/pages/forms/';
$crawler = $client->request('GET', $url);

This loads the page and gives you access to its HTML content through the $crawler object.
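
If you want to be defensive, you can also check the HTTP status code before parsing. Here is a minimal sketch using the response object that BrowserKit keeps for the last request:

$status = $client->getResponse()->getStatusCode();

if ($status !== 200) {
    die("Request failed with HTTP status $status\n");
}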

Step 5: Find the Data You Want

Now let’s look for the table that holds the data. You can inspect the website in your browser. Right-click on a team and click “Inspect.” You’ll see that each row of the table has a class called team.

To get each row, use this code:

$crawler->filter('tr.team')->each(function ($row) {
    $teamName = trim($row->filter('.name')->text());
    $wins = trim($row->filter('.wins')->text());
    $losses = trim($row->filter('.losses')->text());

    echo "Team: $teamName, Wins: $wins, Losses: $losses\n";
});

This will print out the name, wins, and losses of each hockey team.

Step 6: Save the Data

Instead of printing the data, let’s save it in an array. Then, we can write it to a file.

$teams = [];

$crawler->filter('tr.team')->each(function ($row) use (&$teams) {
    $teams[] = [
        'team'   => trim($row->filter('.name')->text()),
        'wins'   => trim($row->filter('.wins')->text()),
        'losses' => trim($row->filter('.losses')->text()),
    ];
});

Now the $teams array holds all the scraped data.

Step 7: Handle Pagination

The site has multiple pages. We need to go through each one to get all the data.

First, create a function to get all page URLs:

function getAllPageUrls($client, $startUrl) {
    $crawler = $client->request('GET', $startUrl);
    $urls = [$startUrl];

    $crawler->filter('.pagination li a')->each(function ($node) use (&$urls) {
        $href = $node->attr('href');
        $url = 'https://www.scrapethissite.com' . $href;

        if (!in_array($url, $urls)) {
            $urls[] = $url;
        }
    });

    return $urls;
}

Then call the function:

$urls = getAllPageUrls($client, $url);

Loop over all pages:

$teams = [];

foreach ($urls as $pageUrl) {
    echo "Scraping: $pageUrl\n";
    $crawler = $client->request('GET', $pageUrl);

    $crawler->filter('tr.team')->each(function ($row) use (&$teams) {
        $teams[] = [
            'team'   => trim($row->filter('.name')->text()),
            'wins'   => trim($row->filter('.wins')->text()),
            'losses' => trim($row->filter('.losses')->text()),
        ];
    });
}

Now your scraper collects data from all pages.

Step 8: Export to CSV

Once you have the data, you can export it to a file. Here’s how to write the results into teams.csv:

$file = fopen('teams.csv', 'w');

// Write the header row
fputcsv($file, ['Team Name', 'Wins', 'Losses']);

// Write one row per team
foreach ($teams as $team) {
    fputcsv($file, [$team['team'], $team['wins'], $team['losses']]);
}

fclose($file);
echo "Data exported to teams.csv\n";

Now your data is saved and ready to use in Excel or any spreadsheet software.

Running the Script

Save your code and go back to the terminal. Run the script with:

php index.php

You should see logs of which page is being scraped, and in the end, a file called teams.csv will be created.

Limitations of Goutte

Goutte is easy to use, but it has a few issues:

  • It is deprecated and no longer maintained.
  • It can’t handle JavaScript-heavy websites.
  • It doesn’t work well with anti-bot protections.
  • It has no advanced proxy support.
  • It only works with static websites.

For basic needs, Goutte still works. But if you need more power, read on.

Better Alternatives in 2025

Here are better and more updated ways to scrape websites in PHP:

1. Symfony Components

Goutte uses Symfony tools. You can use them directly:

composer require symfony/browser-kit symfony/dom-crawler symfony/http-client

Replace Goutte like this:

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$client = new HttpBrowser(HttpClient::create());

This method is more future-proof and gives more control.
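
To make the switch concrete, here is a minimal sketch of the hockey-team scrape from earlier, rewritten with HttpBrowser. The selectors are the same ones used above; only the client changes:

<?php

require_once __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$client = new HttpBrowser(HttpClient::create());
$crawler = $client->request('GET', 'https://www.scrapethissite.com/pages/forms/');

// The crawler API is the same one Goutte exposes, so filter() works unchanged.
$crawler->filter('tr.team')->each(function ($row) {
    echo trim($row->filter('.name')->text()) . "\n";
});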

2. Using Guzzle With a Parser

Guzzle is a powerful HTTP client. Combine it with a parser like symfony/dom-crawler or paquettg/php-html-parser:

composer require guzzlehttp/guzzle symfony/dom-crawler

You make a request using Guzzle and pass the HTML to DomCrawler. This gives you more flexibility. Interested in Guzzle and other libraries? Read my article that lists the best PHP web scraping libraries.
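
Here is a small sketch of that workflow against the same demo page, assuming both packages from the command above are installed:

<?php

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// Guzzle handles the HTTP side: headers, timeouts, and so on.
$client = new Client(['timeout' => 10]);
$response = $client->get('https://www.scrapethissite.com/pages/forms/');

// DomCrawler parses the raw HTML string returned by Guzzle.
$crawler = new Crawler((string) $response->getBody());

$crawler->filter('tr.team')->each(function (Crawler $row) {
    echo trim($row->filter('.name')->text()) . "\n";
});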

3. Use APIs or Web Unlocker Services

If the website is protected, regular scrapers fail. Some websites use Cloudflare, CAPTCHAs, or other bot-blocking tools. In that case, try a web unlocker API such as Bright Data or Oxylabs. These services bypass protections and return clean HTML.

Final Thoughts

Web scraping with Goutte is a great way to start if you’re working with PHP. It’s easy to use and perfect for small projects or learning the basics of scraping. In this guide, I showed you how to set up your project, scrape website data, deal with multiple pages, and save everything to a CSV file. I also pointed out some newer tools you might want to explore later. Goutte may be outdated, but it still works well for many simple tasks. Give it a try, play around with it, and see what data you can collect.
