Playwright web scraping: how to scrape the web with Playwright in 2024

Playwright is now an established player in the web scraping and automation world. In 2024, there's really no reason to use Puppeteer instead.


Playwright is a browser automation library very similar to Puppeteer. Both allow you to control a web browser with only a few lines of code. The possibilities are endless, from automating mundane tasks and testing web applications to data mining.

Related ➡️ Playwright vs. Puppeteer: which is better?

With Playwright, you can run Firefox and WebKit (the engine behind Safari), not only Chromium-based browsers. It will also save you time, because Playwright automates away repetitive code, such as waiting for buttons to appear on the page.

You don’t need to be familiar with Playwright, Puppeteer or web scraping to enjoy this tutorial, but knowledge of HTML, CSS and JavaScript is expected.

In this tutorial, you’ll learn how to:

  1. Start a browser with Playwright
  2. Click buttons and wait for actions
  3. Extract data from a website

Related ➡️ Learn more about Playwright and why it's useful for web scraping and automation

The project

To showcase the basics of Playwright, we will create a simple scraper that extracts data about GitHub Topics. You’ll be able to select a topic and the scraper will return information about repositories tagged with this topic.

The page for JavaScript GitHub Topics

We will use Playwright to start a browser, open the GitHub topic page, click the Load more button to display more repositories, and then extract the following information:

  • Owner
  • Name
  • URL
  • Number of stars
  • Description
  • List of repository topics

Installation

To use Playwright you’ll need Node.js and a package manager. We’ll use npm, which comes preinstalled with Node.js. You can confirm that both are installed on your machine by running:

node -v && npm -v

To get the most out of this tutorial, you need Node.js version 20 or higher. If you’re missing either Node.js or npm, or have unsupported versions, visit the installation tutorial to get started.

Related ➡️ How to install Node.js properly

Now that we know our environment checks out, let’s create a new project and install Playwright.

mkdir playwright-scraper && cd playwright-scraper
npm init -y
npm i playwright

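Depending on your Playwright version, the browser binaries are either downloaded automatically during installation or have to be downloaded separately by running npx playwright install, so this step may take a bit longer.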

Complete the installation by adding "type": "module" into the package.json file. This will enable use of modern JavaScript syntax. If you don't do this, Node.js will throw SyntaxError: Cannot use import statement outside a module when you run your code. Learn more about ECMAScript modules in Node.js.

{
    "type": "module",
    "name": "playwright-scraper",
    // ... other fields
}

Building a Playwright scraper

Creating a scraper with Playwright is surprisingly easy, even if you have no previous scraping experience. If you understand JavaScript and CSS, it will be a piece of cake.

In your project folder, create a file called scraper.js and open it in your favorite code editor. First, we will confirm that Playwright is correctly installed and working by running a simple script.

// Import the Chromium browser into our scraper.
import { chromium } from 'playwright';

// Open a Chromium browser. We use headless: false
// to be able to watch the browser window.
const browser = await chromium.launch({
    headless: false
});

// Open a new page / tab in the browser.
const page = await browser.newPage();

// Tell the tab to navigate to the JavaScript topic page.
await page.goto('https://github.com/topics/javascript');

// Pause for 10 seconds, to see what's going on.
await page.waitForTimeout(10000);

// Turn off the browser to clean up after ourselves.
await browser.close();

Now run it using your code editor or by executing the following command in your project folder.

node scraper.js

If you saw a Chromium window open and the GitHub Topics page successfully loaded, congratulations, you just robotized your web browser with Playwright.

JavaScript GitHub Topics

Loading more repositories

When you first open the topic page, the number of displayed repositories is limited to 20. You can load more by clicking the Load more… button at the bottom of the page.

Load more button at the bottom of the GitHub page

There are two things we need to tell Playwright to load more repositories:

  1. Click the Load more… button.
  2. Wait for the repositories to load.

Clicking buttons is extremely easy with Playwright. By prefixing text= to a string you’re looking for, Playwright will find the element that includes this string and click it. It will also wait for the element to appear if it’s not rendered on the page yet.

await page.click('text=Load more');

This is a huge improvement over Puppeteer, and it makes Playwright lovely to work with.

After clicking, we need to wait for the repositories to load. If we didn't, the scraper could finish before the new repositories show up on the page, and we would miss that data. page.waitForFunction() allows you to execute a function inside the browser and wait until the function returns true.

await page.waitForFunction(() => {
    const repoCards = document.querySelectorAll('article.border');
    // GitHub displays 20 repositories per page.
    // We wait until there's more than 20.
    return repoCards.length > 20;
});
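To build some intuition for what page.waitForFunction() does for us, here's a minimal sketch of the underlying idea in plain JavaScript: keep re-checking a condition until it returns true or a timeout expires. This is not Playwright's actual implementation. The waitUntil helper, its options and the simulated card counter are all made up for illustration.

```javascript
// Hypothetical helper: re-check `predicate` every `interval` ms until it
// returns a truthy value, or reject once `timeout` ms have passed.
function waitUntil(predicate, { timeout = 2000, interval = 20 } = {}) {
    const deadline = Date.now() + timeout;
    return new Promise((resolve, reject) => {
        const check = () => {
            if (predicate()) return resolve(true);
            if (Date.now() >= deadline) {
                return reject(new Error(`Condition not met within ${timeout} ms`));
            }
            setTimeout(check, interval);
        };
        check();
    });
}

// Simulate a page that starts with 20 cards and gets 20 more after 100 ms.
let cardCount = 20;
setTimeout(() => { cardCount += 20; }, 100);

waitUntil(() => cardCount > 20)
    .then(() => console.log(`Cards on the page: ${cardCount}`));
```

The timeout matters: without it, a condition that never becomes true would keep the script waiting forever.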

To find that article.border selector, we used browser DevTools, which you can open in most browsers by right-clicking anywhere on the page and selecting Inspect. It means: Select the <article> tag with the border class.

Chrome DevTools

If you're not familiar with DevTools and CSS selectors, visit the Web scraping for beginners course in our academy. It's free and open-source.

Let’s plug this into our code and do a test run. I've removed earlier comments to make it easier for you to find new changes. We will use this method throughout the whole tutorial.

import { chromium } from 'playwright';

const browser = await chromium.launch({
    headless: false
});

const page = await browser.newPage({
    // We have to add this flag to enable JavaScript execution
    // on GitHub. waitForFunction() would not work otherwise.
    bypassCSP: true,
});

await page.goto('https://github.com/topics/javascript');

// Click and tell Playwright to keep watching for more than
// 20 repository cards to appear in the page.
await page.click('text=Load more');
await page.waitForFunction(() => {
    const repoCards = document.querySelectorAll('article.border');
    return repoCards.length > 20;
});

await page.waitForTimeout(10000);
await browser.close();

If you watch the run, you’ll see that the browser first scrolls down and clicks the Load more… button, which changes the text into Loading more. After a second or two, you’ll see the next batch of 20 repositories appear. Great job!

Extracting data with Playwright

Now that we know how to load more repositories, we will extract the data we want. To do this, we’ll use the page.$$eval() function. It tells the browser to find certain elements and then execute a JavaScript function with those elements. Here's the extraction code:

const repos = await page.$$eval('article.border', (repoCards) => {
    return repoCards.map(card => {
        const [user, repo] = card.querySelectorAll('h3 a');
        const stars = card.querySelector('#repo-stars-counter-star')
            .getAttribute('title');
        const description = card.querySelector('div.px-3 > p');
        const topics = card.querySelectorAll('a.topic-tag');

        const toText = (element) => element && element.innerText.trim();
        const parseNumber = (text) => Number(text.replace(/,/g, ''));

        return {
            user: toText(user),
            repo: toText(repo),
            url: repo.href,
            stars: parseNumber(stars),
            description: toText(description),
            topics: Array.from(topics).map((t) => toText(t)),
        };
    });
});

It works like this: page.$$eval() finds our repositories and executes the provided function in the browser. We get repoCards, which is an array of all the repo elements. The return value of the function becomes the return value of the page.$$eval() call. Thanks to Playwright, you can pull data out of the browser and save it to a variable in Node.js. Magic ✨
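The two small helpers in the extraction code do most of the cleanup, so it can be useful to see how they behave on their own. Here's a quick sketch you can run in plain Node.js, without a browser. The objects standing in for DOM elements and their values are made up for illustration.

```javascript
// The same helpers as in the extraction code, run outside the browser.
const toText = (element) => element && element.innerText.trim();
const parseNumber = (text) => Number(text.replace(/,/g, ''));

// Plain objects standing in for DOM elements (illustrative values only).
const fakeDescription = { innerText: '  A JavaScript library  \n' };
const missingDescription = null;

console.log(toText(fakeDescription));    // surrounding whitespace is trimmed
console.log(toText(missingDescription)); // null, because the element is missing
console.log(parseNumber('201,555'));     // 201555
```

Note that only toText() guards against a missing element. The stars lookup and repo.href in the extraction code would throw if their elements were not found on a card.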

If you’re struggling to understand the extraction code itself, be sure to check out this guide on working with CSS selectors and this tutorial on using those selectors to find HTML elements.

And here’s the code with extraction included. When you run it, you’ll see 40 repositories with their information printed to the console.

import { chromium } from 'playwright';

const browser = await chromium.launch({
    headless: false
});

const page = await browser.newPage({
    bypassCSP: true,
});

await page.goto('https://github.com/topics/javascript');
await page.click('text=Load more');
await page.waitForFunction(() => {
    const repoCards = document.querySelectorAll('article.border');
    return repoCards.length > 20;
});

// Extract data from the page. Selecting all 'article.border'
// elements will return all the repository cards we're looking for.
const repos = await page.$$eval('article.border', (repoCards) => {
    return repoCards.map(card => {
        const [user, repo] = card.querySelectorAll('h3 a');
        const stars = card.querySelector('#repo-stars-counter-star')
            .getAttribute('title');
        const description = card.querySelector('div.px-3 > p');
        const topics = card.querySelectorAll('a.topic-tag');

        const toText = (element) => element && element.innerText.trim();
        const parseNumber = (text) => Number(text.replace(/,/g, ''));

        return {
            user: toText(user),
            repo: toText(repo),
            url: repo.href,
            stars: parseNumber(stars),
            description: toText(description),
            topics: Array.from(topics).map((t) => toText(t)),
        };
    });
});


// Print the results 🚀
console.log(`We extracted ${repos.length} repositories.`);
console.dir(repos);

await page.waitForTimeout(10000);
await browser.close();

Summary

So far we learned how to start a browser with Playwright, and how to control its actions with some of Playwright’s most useful functions: page.click() to emulate mouse clicks, page.waitForFunction() to wait for things to happen and page.$$eval() to extract data from a browser page. But no real scraping project finishes after scraping one page. Scraping is predominantly used to build large datasets for data analytics.

Let's simulate this by extending this tutorial to scrape not only the first 40 repositories, but any number of them. To do this we will have to click the Load more... button repeatedly, not just once.

Further, we will add scraping of the number of commits in the main branch of each of the collected repositories. This number is not available on the topics page, so we'll have to visit each repository page individually and get it from there. Our scraper will learn to crawl.

Crawling with Playwright

While Playwright is absolutely amazing for controlling browsers, it's not primarily a web scraping tool. It is possible to crawl with Playwright, but trust me, it's painful. You have to open and close browsers and tabs, keep track of what you've already crawled and what failed and needs to be retried, handle all errors so they don't crash your crawler, and manage memory and CPU so that too many open tabs don't overwhelm your machine. It's doable, but I'd much rather focus on crawling, not browser management.
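To make that concrete, here's a rough sketch of just the bookkeeping part: a queue that skips duplicate URLs, retries failures a few times and records what ultimately failed. It runs in plain Node.js with a fake visit function instead of a real browser. All names here (crawl, fakeVisit, maxRetries) and the example.com URLs are hypothetical, and a real crawler would still need browser, tab and memory management on top.

```javascript
// Process a list of URLs one by one with deduplication and retries.
async function crawl(startUrls, visit, { maxRetries = 2 } = {}) {
    const seen = new Set();
    const queue = [];
    const results = [];
    const failed = [];

    // Skip URLs we have already queued.
    const enqueue = (url) => {
        if (seen.has(url)) return;
        seen.add(url);
        queue.push({ url, retries: 0 });
    };
    startUrls.forEach(enqueue);

    while (queue.length > 0) {
        const request = queue.shift();
        try {
            results.push(await visit(request.url));
        } catch (error) {
            // Retry a few times, then give up without crashing the crawl.
            if (request.retries < maxRetries) {
                queue.push({ ...request, retries: request.retries + 1 });
            } else {
                failed.push({ url: request.url, error: error.message });
            }
        }
    }
    return { results, failed };
}

// Fake page visit: one URL fails once, another one always fails.
const attempts = new Map();
async function fakeVisit(url) {
    const attempt = (attempts.get(url) ?? 0) + 1;
    attempts.set(url, attempt);
    if (url.endsWith('/flaky') && attempt === 1) throw new Error('Navigation failed');
    if (url.endsWith('/broken')) throw new Error('Always fails');
    return { url, attempt };
}

crawl([
    'https://example.com/a',
    'https://example.com/a', // duplicate, visited only once
    'https://example.com/flaky',
    'https://example.com/broken',
], fakeVisit).then(({ results, failed }) => {
    console.log('Succeeded:', results);
    console.log('Failed:', failed);
});
```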

Comfortable scraping and crawling with Playwright is better done together with another library. This library is called Crawlee, and it's also free and open-source, just like Playwright. Crawlee wraps Playwright and grants access to all of Playwright's functionality, but also provides useful crawling and scraping tools like error handling, queue management, storages, proxies or fingerprints out of the box. Crawlee's goal is to help you build reliable crawlers, and to do it fast.

Crawlee installation

We can add Crawlee to our project by executing the following command in the project's folder:

npm install crawlee

Crawlee will recognize that Playwright is already installed and will be able to use it right away. To quickly test this, let's create a new file crawlee.js and use the following code inside the file:

// Crawlee works with other libraries like Puppeteer
// or Cheerio as well. Now we want to work with Playwright.
import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler manages browsers and browser tabs.
// You don't have to manually open and close them.
// It also handles navigation (goto), errors and retries.
const crawler = new PlaywrightCrawler({
    // Request handler gives you access to the currently
    // open page. Similar to the pure Playwright examples
    // above, we can use it to control the browser's page.
    requestHandler: async ({ page }) => {
        // Get the title of the page just to test things.
        const title = await page.title()
        console.log(title);
    }
})

// Here we start the crawler on the selected URLs.
await crawler.run(['https://github.com/topics/javascript']);

The above code uses the PlaywrightCrawler class of Crawlee to manage Playwright and crawl the web with it. This time, it only opens one page and gets its title. For a test, this is enough.

node crawlee.js

After executing the above command, you'll see several log lines printed by Crawlee and among them the following line. This means that everything's working as expected.

javascript · GitHub Topics · GitHub

Make the headless browser's window visible

You probably noticed that no browser window opened. That's because Crawlee (same as Playwright) runs headless by default. If you want to see what's going on in the browser, you have to switch headless to false.

// Don't forget to import the Configuration class.
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Configure Crawlee to launch all browsers
// in headful mode (window visible).
Configuration.set('headless', false);

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        const title = await page.title()
        console.log(title);
        // We can easily use the timeout from previous
        // examples to stop the page from closing quickly.
        await page.waitForTimeout(10000);
    }
})

await crawler.run(['https://github.com/topics/javascript']);

There are other ways to run Playwright headful, like setting the Playwright launch option, but we find using Crawlee Configuration the best, because it also supports JSON config files and environment variables, which is very useful in production.

Scrolling with Playwright

Now that we know Crawlee and Playwright work together as expected, we can start leveraging some of Crawlee's tools to help us scrape the commit counts of the top 100 JavaScript repositories. First, let's take a look at clicking the Load more... button enough times to load 100 repos.

Crawlee has a function for exactly this purpose. It's called infiniteScroll, and it automatically handles websites that either have infinite scroll (the feature where you load more items by simply scrolling) or a similar design with a Load more... button. Let's see how it's used.

import { PlaywrightCrawler, Configuration } from 'crawlee';

Configuration.set('headless', false);

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, infiniteScroll }) => {
        const title = await page.title()
        console.log(title);

        // The crawler will keep scrolling and ...
        await infiniteScroll({
            // clicking this button, until ...
            buttonSelector: 'text=Load more',
            // this function returns true, which will
            // happen once GitHub has displayed 100 repos.
            stopScrollCallback: async () => {
                const repos = await page.$$('article.border');
                return repos.length >= 100;
            },
        })

        await page.waitForTimeout(10000);
    }
})

await crawler.run(['https://github.com/topics/javascript']);

After adding the new code and running the above example, you should see Playwright automating the clicking and scrolling until there are 100 repos visible on the page. Extending this to 200 or 1000 repos is as simple as changing the number in stopScrollCallback, although for really high numbers you might need to use proxies.
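If you're curious about the logic behind a "click until done" loop, here's a conceptual sketch in plain JavaScript. It is not how Crawlee implements infiniteScroll. The loadUntil function, its parameters and the simulated page are hypothetical and only show the idea of repeating an action until a stop callback returns true.

```javascript
// Keep "clicking" a load-more button until the stop callback returns true,
// the button disappears, or a safety limit of clicks is reached.
async function loadUntil({ clickLoadMore, shouldStop, maxClicks = 50 }) {
    let clicks = 0;
    while (clicks < maxClicks && !(await shouldStop())) {
        const clicked = await clickLoadMore();
        if (!clicked) break; // no button left, nothing more to load
        clicks += 1;
    }
    return clicks;
}

// Simulated page: starts with 20 repos, each click adds up to 20 more,
// and 90 repos are available in total.
const fakePage = {
    repos: 20,
    total: 90,
    async clickLoadMore() {
        if (this.repos >= this.total) return false;
        this.repos = Math.min(this.repos + 20, this.total);
        return true;
    },
};

loadUntil({
    clickLoadMore: () => fakePage.clickLoadMore(),
    shouldStop: async () => fakePage.repos >= 60,
}).then((clicks) => {
    console.log(`Clicked ${clicks} times, ${fakePage.repos} repos loaded`);
});
```

The maxClicks limit is a design choice in this sketch: it keeps the loop from running forever if the stop condition can never be met.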

Adding our Playwright code to Crawlee

As mentioned earlier, Crawlee only wraps Playwright, so we can easily reuse the scraping code we wrote in the first section of this tutorial. As a reminder, this is the scraping code we used to extract data from the repo cards.

const repos = await page.$$eval('article.border', (repoCards) => {
    return repoCards.map(card => {
        const [user, repo] = card.querySelectorAll('h3 a');
        const stars = card.querySelector('#repo-stars-counter-star')
            .getAttribute('title');
        const description = card.querySelector('div.px-3 > p');
        const topics = card.querySelectorAll('a.topic-tag');

        const toText = (element) => element && element.innerText.trim();
        const parseNumber = (text) => Number(text.replace(/,/g, ''));

        return {
            user: toText(user),
            repo: toText(repo),
            url: repo.href,
            stars: parseNumber(stars),
            description: toText(description),
            topics: Array.from(topics).map((t) => toText(t)),
        };
    });
});

To use it in our Crawlee crawler, we simply paste it after the infiniteScroll, to make sure we extract all the data, and then we print the results to the console. After the crawler's done its job, you'll see data from 100 repos printed to the terminal.

import { PlaywrightCrawler, Configuration } from 'crawlee';

Configuration.set('headless', false);

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, infiniteScroll }) => {
        const title = await page.title()
        console.log(title);

        await infiniteScroll({
            buttonSelector: 'text=Load more',
            stopScrollCallback: async () => {
                const repos = await page.$$('article.border');
                return repos.length >= 100;
            },
        })

        // This is exactly the same code as we used with pure Playwright.
        const repos = await page.$$eval('article.border', (repoCards) => {
            return repoCards.map(card => {
                const [user, repo] = card.querySelectorAll('h3 a');
                const stars = card.querySelector('#repo-stars-counter-star')
                    .getAttribute('title');
                const description = card.querySelector('div.px-3 > p');
                const topics = card.querySelectorAll('a.topic-tag');

                const toText = (element) => element && element.innerText.trim();
                const parseNumber = (text) => Number(text.replace(/,/g, ''));

                return {
                    user: toText(user),
                    repo: toText(repo),
                    url: repo.href,
                    stars: parseNumber(stars),
                    description: toText(description),
                    topics: Array.from(topics).map((t) => toText(t)),
                };
            });
        });

        // Print the repos to the console
        // to make sure everything works.
        console.log('Repository count:', repos.length);
        console.dir(repos);

        await page.waitForTimeout(10000);
    }
})

await crawler.run(['https://github.com/topics/javascript']);

Now that we extracted all the information that's available on the topic page, we need to get the commit counts. Those are only available on the individual repository pages. This means we have to take all the links we have collected, visit them with Playwright and extract the commit counts from their HTML.

Crawling extracted URLs

Crawlee gives us an easy way to crawl with Playwright, because it will handle enqueueing, network errors and retries for us, without sacrificing full control of each individual page. To add the repositories to the queue, we will use the URLs we already extracted.

Since the code is quite long now, we will always show the new segments first and then the complete, runnable example.

First we need to add an import of the Request class.

import { PlaywrightCrawler, Configuration, Request } from 'crawlee';

Then, at the end of the requestHandler we add new code that adds more pages to the request queue.

// Turn the repository data we extracted into new requests to crawl.
const requests = repos.map(repo => new Request({
    // URL tells Crawlee which page to open
    url: repo.url,
    // labels are helpful for easy identification of requests
    label: 'repository',
    // userData allows us to store any JSON serializable data.
    // It will be kept together with the request and saved
    // persistently, so that no data is lost.
    userData: repo,
}));

// Add the requests to the crawler's queue.
// The crawler will automatically process them.
await crawler.addRequests(requests);

Thanks to the code above, Crawlee will open all the individual repository pages, but we have created a problem. We now have two kinds of pages to process: the initial topic page and all the repository pages. They require different logic. For now, let's solve it with a simple if statement, but later in the tutorial we will use a Router to clean up the solution.

In Crawlee, requests are best identified by their assigned label. That's why we added the repository label to the requests in the previous code example.

// inside the requestHandler
const title = await page.title()
console.log(title);

// We need to separate the logic for the original
// topic page and for the new repository page.
if (request.label === 'repository') {
    // For now, let's just confirm our crawler works
    // by logging the URLs it visits.
    console.log('Scraping:', request.url);
} else {
    // The original, topic page code goes here.
}
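The if statement works fine for two page types, but it grows quickly as you add more. As a rough illustration of the general idea of dispatching by label, here's a plain JavaScript sketch with a lookup table of handlers. This is not Crawlee's Router. The handlers map, defaultHandler and handleRequest are hypothetical names, and the request objects are simplified stand-ins.

```javascript
// Hypothetical handlers keyed by request label.
const handlers = new Map([
    ['repository', (request) => `Scraping repository: ${request.url}`],
]);

// Requests without a known label, such as the initial
// topic page, fall back to this handler.
const defaultHandler = (request) => `Scraping topic page: ${request.url}`;

function handleRequest(request) {
    const handler = handlers.get(request.label) ?? defaultHandler;
    return handler(request);
}

console.log(handleRequest({ url: 'https://github.com/topics/javascript' }));
console.log(handleRequest({ url: 'https://github.com/vuejs/vue', label: 'repository' }));
```

Adding a new page type then means adding one entry to the map instead of another else if branch.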

Now that we have separate logic for the individual page types, let's make stopScrollCallback stop scrolling right away by lowering the number of repos to 20. This will let us get results faster in the test runs.

// At the top of the file.
const REPO_COUNT = 20;

// Inside stopScrollCallback
return repos.length >= REPO_COUNT;

Great. It's time to run the crawler to confirm that we set up everything correctly. You can try to make the above changes in your code yourself, or you can use the complete runnable code below.

import { PlaywrightCrawler, Configuration, Request } from 'crawlee';

const REPO_COUNT = 20;

Configuration.set('headless', false);

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, infiniteScroll }) => {
        const title = await page.title()
        console.log(title);

        if (request.label === 'repository') {
            console.log('Scraping:', request.url);
        } else {
            await infiniteScroll({
                buttonSelector: 'text=Load more',
                stopScrollCallback: async () => {
                    const repos = await page.$$('article.border');
                    return repos.length >= REPO_COUNT;
                },
            });

            const repos = await page.$$eval('article.border', (repoCards) => {
                return repoCards.map(card => {
                    const [user, repo] = card.querySelectorAll('h3 a');
                    const stars = card.querySelector('#repo-stars-counter-star')
                        .getAttribute('title');
                    const description = card.querySelector('div.px-3 > p');
                    const topics = card.querySelectorAll('a.topic-tag');

                    const toText = (element) => element && element.innerText.trim();
                    const parseNumber = (text) => Number(text.replace(/,/g, ''));

                    return {
                        user: toText(user),
                        repo: toText(repo),
                        url: repo.href,
                        stars: parseNumber(stars),
                        description: toText(description),
                        topics: Array.from(topics)
                            .map((t) => toText(t)),
                    };
                });
            });

            console.log('Repository count:', repos.length);
            const requests = repos.map(repo => new Request({
                url: repo.url,
                label: 'repository',
                userData: repo,
            }));

            await crawler.addRequests(requests);
        }
    }
})

await crawler.run(['https://github.com/topics/javascript']);

As you could see after running the above code, scraping with an open browser window can get a bit overwhelming. We recommend you turn off headful mode now and only turn it on when you need it for debugging.

// Delete this line or set headless to true
// to turn off the visible browser window.
Configuration.set('headless', false);

Extracting commit counts

We're almost there. The last thing missing is extracting the commit counts from individual repos. To do that, we need to get back to browser DevTools and take a look at the page's structure.

Extracting commit counts from DevTools

Now that we know what the HTML looks like, we have several options for getting the commit count. We'll explore extracting it either with the usual CSS selector or with Playwright's powerful locator API.

Using CSS selectors with Playwright

After inspecting the page's HTML we found that the commit count can be isolated using the following CSS selector:

.react-last-commit-history-group span

We can use the page.locator() function of Playwright (also part of the locator API) to find any element using a CSS selector. We can simply input the selector we found in DevTools and Playwright will find the element for us and extract its text.

const commitText = await page
    .locator('.react-last-commit-history-group span')
    .textContent();

CSS selectors are the bread and butter of web scraping, but they can break easily. Why? Because websites are updated often, and updates change their structure. If a selector depends on a combination of several variables, like our selector above, which relies on a CSS class name, an HTML element name and their relative position in the DOM, it's more likely that one of those variables will change with a website update and the selector will break.

Using Playwright locator API

To make our scraper more reliable we can use Playwright's locator API to craft a more user-centric selection mechanism. One useful benefit of the locator API is that you can easily combine conditions that reflect how a user would see the element on screen.

const commitText = await page
    .getByRole('link') // check all links
    .filter({ hasText: 'Commits' }) // that contain text "Commits"
    .first() // and pick the first one
    .textContent()

For example, in the above locator, we first find all the links on the page. The commit count itself is a link, and that is unlikely to change unless GitHub does a major redesign. Still, this gets us every link on the page, so we keep only those that contain the text Commits and pick the first match. The word Commits is also unlikely to disappear from a commit count. Now we have a more resilient selector that focuses on how a human perceives the page and not on the structure for machines.

Parsing the commit count

In any case, the CSS selector and the locator API will only get us so far. Before we save the commit count, we have to strip the extra characters from the string and turn it into a number.

const numberStrings = commitText.match(/\d+/g);
const commitCount = Number(numberStrings.join(''));
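The two lines above assume the text always contains digits. If you'd like the parsing to survive a missing or unexpected value, a slightly more defensive variant could look like the sketch below. The parseCommitCount function name and the sample input strings are assumptions made for illustration, not something taken from GitHub's markup.

```javascript
// Turn text such as "3,544 Commits" into the number 3544.
// Returns null when there is no text or it contains no digits.
function parseCommitCount(commitText) {
    const numberStrings = (commitText ?? '').match(/\d+/g);
    if (!numberStrings) return null;
    return Number(numberStrings.join(''));
}

console.log(parseCommitCount('3,544 Commits')); // 3544
console.log(parseCommitCount('Commits'));       // null
console.log(parseCommitCount(null));            // null
```

Keep in mind that joining all digit groups also merges any other number that happens to appear in the text, so it's worth checking what the locator actually returns before relying on the result.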

Finally, we combine all this code into the first part of our if statement and here's the complete, runnable example.

import { PlaywrightCrawler, Request } from 'crawlee';

const REPO_COUNT = 20;

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, infiniteScroll }) => {
        const title = await page.title()
        console.log(title);

        if (request.label === 'repository') {
            const commitText = await page
                .getByRole('link')
                .filter({ hasText: 'Commits' })
                .first()
                .textContent()
            const numberStrings = commitText.match(/\d+/g);
            const commitCount = Number(numberStrings.join(''));
            console.log(commitCount);
        } else {
            await infiniteScroll({
                buttonSelector: 'text=Load more',
                stopScrollCallback: async () => {
                    const repos = await page.$$('article.border');
                    return repos.length >= REPO_COUNT;
                },
            });

            const repos = await page.$$eval('article.border', (repoCards) => {
                return repoCards.map(card => {
                    const [user, repo] = card.querySelectorAll('h3 a');
                    const stars = card.querySelector('#repo-stars-counter-star')
                        .getAttribute('title');
                    const description = card.querySelector('div.px-3 > p');
                    const topics = card.querySelectorAll('a.topic-tag');

                    const toText = (element) => element && element.innerText.trim();
                    const parseNumber = (text) => Number(text.replace(/,/g, ''));

                    return {
                        user: toText(user),
                        repo: toText(repo),
                        url: repo.href,
                        stars: parseNumber(stars),
                        description: toText(description),
                        topics: Array.from(topics)
                            .map((t) => toText(t)),
                    };
                });
            });

            console.log('Repository count:', repos.length);
            const requests = repos.map(repo => new Request({
                url: repo.url,
                label: 'repository',
                userData: repo,
            }));

            await crawler.addRequests(requests);
        }
    }
})

await crawler.run(['https://github.com/topics/javascript']);

If you run the above code, you will see the commit counts of all repositories logged to the console.

Saving extracted data

But logging the data to the console is not very useful in production, so let's use Crawlee's Dataset class to save the scraped data to disk.

import { PlaywrightCrawler, Request, Dataset } from 'crawlee';

// ... inside requestHandler

await Dataset.pushData({
    ...request.userData,
    commitCount,
});

Do you remember that we saved all the information about the repo we extracted from the topic page to the userData property of our requests? Now we can easily merge this data with our commitCount and save the whole object to disk. This will create a JSON file for each repository in the following directory.

./storage/datasets/default

You can go there to inspect the files, and you'll find JSONs like this.

{
	"user": "vuejs",
	"repo": "vue",
	"url": "https://github.com/vuejs/vue",
	"stars": 201555,
	"description": "🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.",
	"topics": [
		"javascript",
		"framework",
		"vue",
		"frontend"
	],
	"label": "repository",
	"commitCount": 3544
}
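If the spread syntax used in the pushData() call is new to you, here's a small standalone sketch of the merge that produces a record like the one above. It uses a shortened copy of the example values and runs in plain Node.js, without Crawlee.

```javascript
// userData as it arrives with the request (values from the example above).
const userData = {
    user: 'vuejs',
    repo: 'vue',
    url: 'https://github.com/vuejs/vue',
    stars: 201555,
    label: 'repository',
};
const commitCount = 3544;

// Spread copies every userData property into a new object,
// then the shorthand property adds commitCount next to them.
const record = { ...userData, commitCount };

console.log(record);
console.log('commitCount' in userData); // false, the original object is unchanged
```

Properties listed after the spread win over same-named keys from userData, which is handy if you ever need to override a value before saving.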

To make it even easier to process the data, we can save it to a CSV or one big JSON with one line of code.

// This should be added as the very last
// function call. After await crawler.run()
await Dataset.exportToCSV('repositories');

Crawlee will save the CSV in this location:

./storage/key_value_stores/default/repositories.csv

The final runnable code looks like this. We could also increase REPO_COUNT back to 100 to get the top 100 JavaScript repositories, but this will most likely lead to getting your IP rate limited by GitHub. So let's increase it only to 40.

import { PlaywrightCrawler, Request, Dataset } from 'crawlee';

const REPO_COUNT = 40;

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, infiniteScroll }) => {
        const title = await page.title();
        console.log(title);

        if (request.label === 'repository') {
            const commitText = await page
                .getByRole('link')
                .filter({ hasText: 'Commits' })
                .first()
                .textContent();
            const numberStrings = commitText.match(/\d+/g);
            const commitCount = Number(numberStrings.join(''));

            await Dataset.pushData({
                ...request.userData,
                commitCount,
            });
        } else {
            await infiniteScroll({
                buttonSelector: 'text=Load more',
                stopScrollCallback: async () => {
                    const repos = await page.$$('article.border');
                    return repos.length >= REPO_COUNT;
                },
            });

            const repos = await page.$$eval('article.border', (repoCards) => {
                return repoCards.map(card => {
                    const [user, repo] = card.querySelectorAll('h3 a');
                    const stars = card.querySelector('#repo-stars-counter-star')
                        .getAttribute('title');
                    const description = card.querySelector('div.px-3 > p');
                    const topics = card.querySelectorAll('a.topic-tag');

                    const toText = (element) => element && element.innerText.trim();
                    const parseNumber = (text) => Number(text.replace(/,/g, ''));

                    return {
                        user: toText(user),
                        repo: toText(repo),
                        url: repo.href,
                        stars: parseNumber(stars),
                        description: toText(description),
                        topics: Array.from(topics)
                            .map((t) => toText(t)),
                    };
                });
            });

            console.log('Repository count:', repos.length);
            const requests = repos.map(repo => new Request({
                url: repo.url,
                label: 'repository',
                userData: repo,
            }));

            await crawler.addRequests(requests);
        }
    }
});

await crawler.run(['https://github.com/topics/javascript']);
await Dataset.exportToCSV('repositories');

When you run this code, you'll see the crawler printing the individual page titles into the console, and after it finishes, you'll find your CSV in the location shown above.
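One fragile spot in the repository branch is the commit count parsing: String.prototype.match returns null when the text contains no digits, so calling join on the result would throw. A slightly more defensive version could look like the sketch below. The helper name parseCommitCount and the decision to return null on failure are assumptions made for illustration; they are not part of the original code.

```javascript
// Hypothetical helper: extracts the number from link text such as
// '3,544 Commits'. Returns null instead of throwing when no digits are found.
function parseCommitCount(commitText) {
    // Playwright's textContent() may resolve to null, so fall back to ''.
    const numberStrings = (commitText ?? '').match(/\d+/g);
    if (!numberStrings) return null;
    // Joining the digit groups drops thousands separators: ['3', '544'] -> '3544'.
    return Number(numberStrings.join(''));
}

console.log(parseCommitCount('3,544 Commits')); // 3544
console.log(parseCommitCount('Commits')); // null
console.log(parseCommitCount(null)); // null
```

In the handler, you would pass commitText to this helper and then decide whether to skip the record or save it with a null commitCount.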

Deployment to cloud

After taking the time to write this tutorial, we will also use it for a bit of shameless self-promotion. Apify is a cloud platform built to help you develop, run and maintain your web scrapers easily and efficiently. It comes with tons of features like queues, storages and proxies, and it supports Playwright without any extra configuration. You can run the above scraper, save results and control everything with a powerful API, and you can do it 10 times faster than on AWS or a similar general-purpose cloud.

To learn more, visit our homepage or jump directly to the Getting Started course in the Apify Academy, where you can find more free courses on Playwright and web scraping in general.

Bonus: routing

Earlier we mentioned that there's a better way of structuring your code with Playwright and Crawlee than to put everything into a single function. In this section we'll explore the Router class of Crawlee and how you can use it to make your code more manageable.

So far, we've kept all our code in a single file called crawlee.js. We will add a new file, router.js, and move all our request handling logic there. Thanks to the router, we can split the code we had in requestHandler into as many functions as we want, and the crawler will automatically route each request to the right function based on the label we set on the Request.

// router.js
import { createPlaywrightRouter, Dataset, Request } from 'crawlee';

// We create a Playwright specific router to
// get intellisense and typechecks for our IDE.
export const router = createPlaywrightRouter();

const REPO_COUNT = 40;

router.use(async (ctx) => {
    // This is for middlewares - functions that will be
    // executed on all routes, irrespective of label.
});

router.addHandler('repository', async (ctx) => {
    // This handler will execute for all requests
    // with the 'repository' label.
});

router.addDefaultHandler(async (ctx) => {
    // This handler will execute for requests
    // that don't have a label.
});

We can then move our existing logic into this skeleton and the router.js file will look like this.

// router.js
import { createPlaywrightRouter, Dataset, Request } from 'crawlee';

export const router = createPlaywrightRouter();

const REPO_COUNT = 40;

router.use(async ({ page }) => {
    const title = await page.title();
    console.log(title);
});

router.addHandler('repository', async ({ page, request }) => {
    const commitText = await page
        .getByRole('link')
        .filter({ hasText: 'Commits' })
        .first()
        .textContent();
    const numberStrings = commitText.match(/\d+/g);
    const commitCount = Number(numberStrings.join(''));

    await Dataset.pushData({
        ...request.userData,
        commitCount,
    });
});

router.addDefaultHandler(async ({ page, infiniteScroll, crawler }) => {
    await infiniteScroll({
        buttonSelector: 'text=Load more',
        stopScrollCallback: async () => {
            const repos = await page.$$('article.border');
            return repos.length >= REPO_COUNT;
        },
    });

    const repos = await page.$$eval('article.border', (repoCards) => {
        return repoCards.map(card => {
            const [user, repo] = card.querySelectorAll('h3 a');
            const stars = card.querySelector('#repo-stars-counter-star')
                .getAttribute('title');
            const description = card.querySelector('div.px-3 > p');
            const topics = card.querySelectorAll('a.topic-tag');

            const toText = (element) => element && element.innerText.trim();
            const parseNumber = (text) => Number(text.replace(/,/g, ''));

            return {
                user: toText(user),
                repo: toText(repo),
                url: repo.href,
                stars: parseNumber(stars),
                description: toText(description),
                topics: Array.from(topics)
                    .map((t) => toText(t)),
            };
        });
    });

    console.log('Repository count:', repos.length);
    const requests = repos.map(repo => new Request({
        url: repo.url,
        label: 'repository',
        userData: repo,
    }));

    await crawler.addRequests(requests);
});

If this is still too much for you, feel free to split it even more, for example into one route per file. With the crawling logic removed, crawlee.js is now very short and readable.

// crawlee.js
import { Dataset, PlaywrightCrawler } from 'crawlee';
import { router } from './router.js';

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

await crawler.run(['https://github.com/topics/javascript']);
await Dataset.exportToCSV('repositories');

When you run crawlee.js, it will behave exactly the same as before the split, but thanks to the router, the code will be much easier to read and maintain.
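If you're curious how label-based routing works conceptually, the idea fits in a few lines of plain JavaScript. The sketch below is only an illustration and not Crawlee's actual implementation; createMiniRouter and everything inside it are hypothetical names.

```javascript
// A toy router: NOT Crawlee's implementation, just the dispatch idea.
function createMiniRouter() {
    const handlers = new Map(); // label -> handler
    const middlewares = [];
    let defaultHandler = null;

    // The router is itself a function, so it can be used as a request handler.
    const router = async (ctx) => {
        // Middlewares run for every request, regardless of label.
        for (const middleware of middlewares) await middleware(ctx);
        // Pick the handler registered for the label, else the default one.
        const handler = handlers.get(ctx.request.label) ?? defaultHandler;
        if (!handler) throw new Error(`No handler for label: ${ctx.request.label}`);
        return handler(ctx);
    };

    router.use = (middleware) => { middlewares.push(middleware); };
    router.addHandler = (label, handler) => { handlers.set(label, handler); };
    router.addDefaultHandler = (handler) => { defaultHandler = handler; };
    return router;
}

const miniRouter = createMiniRouter();
const calls = [];
miniRouter.use(async () => { calls.push('middleware'); });
miniRouter.addHandler('repository', async () => { calls.push('repository'); });
miniRouter.addDefaultHandler(async () => { calls.push('default'); });

// Simulate one labeled request and one request without a label.
miniRouter({ request: { label: 'repository' } })
    .then(() => miniRouter({ request: {} }))
    .then(() => console.log(calls));
```

The labeled request reaches the 'repository' handler and the unlabeled one falls through to the default handler, with the middleware running before both. That is the same behavior the router.js file above relies on.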

What to learn next?

So there you have it. A complete web scraping and crawling tutorial for Playwright. If you're interested in learning more about Playwright, Puppeteer and web scraping in general, visit our free academy course where we explore Playwright features in more detail in step-by-step lessons with code examples and detailed explanations.

If you're only starting out and would like to learn the basics of web scraping, our web scraping for beginners course explains the basic concepts and gets you ready to tackle more difficult challenges.

Ondra Urban
COO of Apify, but I think of myself as the chief debugging officer. Apify’s mission is to make the web more programmable. My mission is to make Apify the well-oiled machine that can achieve that goal.
