
Creating a Review Scraper with ChatGPT: A Journey into Automation

Building an efficient and reliable scraper can be quite a chore when dealing with complex websites, pagination, and proxy rotation, especially for someone who hasn't written a single line of code, like ever.


Recently, I went on a journey to create a web scraper for collecting reviews from the OMR Reviews website: one of my agencies needed to analyze market sentiment in Germany, and there were no out-of-the-box scrapers available on Apify. The process went well, and ChatGPT helped me build and tune the scraper to meet all our requirements.


Key Takeaways

  1. Coding with ChatGPT is Easy: ChatGPT was a great partner-in-crime, helping me debug errors, refine logic, and even write documentation.

  2. Iterative Development: Building the scraper was not a one-and-done task. It required testing, feedback, and incremental improvements.

  3. Adaptability Matters: Websites change. Structuring the scraper to handle dynamic behavior (like proxy rotation and pagination) made it resilient.


The Problem: Scraping Reviews with Pagination

My goal was simple: scrape reviews from a product page on a website. The reviews spanned multiple pages, each with its own URL, and there were several challenges:

  • Dynamic Pagination: URLs for subsequent pages required appending /2, /3, etc.

  • JavaScript-Rendered Content: The reviews were dynamically loaded.

  • Avoiding IP Blocks: The scraper needed proxy rotation to prevent being blocked.


Step 1: Determining the Requirements

I started by outlining the core features my scraper needed:

  • Handle multi-page navigation.

  • Extract structured data like positive feedback, negative feedback, and use cases.

  • Support scraping up to 100 reviews in total.

  • Use proxy rotation to avoid detection.


Step 2: Collaborating with ChatGPT

Instead of coding everything from scratch, I turned to ChatGPT. Here’s how it helped:

Initial Setup: ChatGPT guided me in setting up a web scraper on the Apify platform, using Axios to fetch each page's HTML and Cheerio to parse it.


The initial code fetched reviews from a single page:

const axios = require('axios');
const cheerio = require('cheerio');

const response = await axios.get(url);
const $ = cheerio.load(response.data);

const reviews = [];
// Each review card carries a data-testid; pull the positive and
// negative quotes out of it.
$('[data-testid="text-review-quotes"]').each((i, element) => {
    const positive = $(element).find('[data-testid="text-review-quotes-positive"] [data-testid="review-quote-answer"]').text().trim() || '';
    const negative = $(element).find('[data-testid="text-review-negative"] [data-testid="review-quote-answer"]').text().trim() || '';
    reviews.push({ positive, negative });
});
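
For context, here's a minimal sketch of how that snippet could sit inside the actor's entry point, using the Apify SDK's Actor API. The file name and overall structure are my assumptions for illustration; only the parsing logic above comes from the actual scraper:

// main.js: minimal actor skeleton (Apify SDK v3). The scraping logic
// from above goes where the placeholder comment sits.
const { Actor } = require('apify');
const axios = require('axios');
const cheerio = require('cheerio');

Actor.main(async () => {
    // Input fields are defined by the input schema shown later in this post.
    const { url, maxReviews = 100 } = await Actor.getInput();

    const reviews = [];
    // ... fetch pages and fill `reviews` as shown above ...

    // Save the results to the actor's default dataset.
    await Actor.pushData(reviews);
});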

Handling Pagination: The website used a /2, /3, /4 URL structure for pagination. ChatGPT updated the scraper to dynamically navigate pages until the desired number of reviews was collected:

let currentPage = 1;
while (reviews.length < maxReviews) {
    // Subsequent pages live at <url>/2, <url>/3, and so on.
    const currentPageUrl = `${url}/${currentPage}`;
    const response = await axios.get(currentPageUrl);
    // Extract reviews from this page (same Cheerio logic as above),
    // then move on to the next one.
    currentPage++;
}

Adding Proxy Rotation: To avoid being blocked, I asked ChatGPT to incorporate proxy rotation using Apify's proxy capabilities:

// Comma-separated list of proxy URLs, supplied through an environment variable.
const proxyUrls = process.env.APIFY_PROXY_URLS?.split(',') || [];
let proxyIndex = 0;

// Cycle through the list so each request can go out via a different proxy.
const proxyUrl = proxyUrls.length > 0 ? proxyUrls[proxyIndex++ % proxyUrls.length] : null;
const axiosConfig = proxyUrl
    ? { httpsAgent: new (require('https-proxy-agent'))(proxyUrl) } // works with https-proxy-agent v5 and below; v7+ exports { HttpsProxyAgent }
    : {};
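
The resulting config is then passed straight to Axios on each request (await axios.get(currentPageUrl, axiosConfig)). On the Apify platform there's also a built-in alternative; here's a sketch, assuming the same Apify SDK v3 Actor API as above, that asks the platform for rotating proxy URLs instead of maintaining a list by hand:

// Inside a running actor (i.e. after Actor.init() or within Actor.main()):
// let Apify hand out the proxy URLs.
const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl(); // a fresh proxy URL per call
const axiosConfig = { httpsAgent: new (require('https-proxy-agent'))(proxyUrl) };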

Input Schema for User Configurability: I needed a clear way to pass parameters like the target URL and maximum reviews. ChatGPT helped me design an input_schema.json:

{
    "title": "Review Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "Reviews page URL",
            "type": "string",
            "editor": "textfield",
            "default": "https://example.com/reviews"
        },
        "maxReviews": {
            "title": "Maximum number of reviews",
            "type": "integer",
            "default": 100
        }
    },
    "required": ["url"]
}
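
With the schema in place, a run can be configured with a plain JSON input, for example (the URL is just the schema's placeholder default; any reviews page URL goes here):

{
    "url": "https://example.com/reviews",
    "maxReviews": 50
}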

Step 3: Testing and Iteration

Despite the promising start, the scraper stopped at 37 reviews. I shared this issue with ChatGPT, which helped me refine the pagination logic and debug errors. For example:

  • Ensuring pagination stopped only when there were no more pages (see the sketch after this list).

  • Validating the JSON structure to fix schema errors.
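
The pagination fix boiled down to a small guard, sketched here in isolation (it appears in context in the final loop below):

// Keep paginating while pages still contain review cards;
// only an empty page signals that we've reached the end.
const reviewCards = $('[data-testid="text-review-quotes"]');
if (reviewCards.length === 0) break;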


Step 4: The Final Solution

After several iterations, I ended up with a robust scraper that:

  • Navigates through pagination seamlessly.

  • Extracts structured review data (positive, negative, and problem-solving).

  • Stops after collecting the specified number of reviews (up to 100).

  • Uses proxy rotation for resilience.


Here’s the core scraping loop:

let currentPage = 1;
while (reviews.length < maxReviews) {
    const currentPageUrl = `${url}/${currentPage}`;
    // axiosConfig carries the rotating proxy agent from the proxy step.
    const response = await axios.get(currentPageUrl, axiosConfig);
    const $ = cheerio.load(response.data);

    const reviewCards = $('[data-testid="text-review-quotes"]');
    // An empty page means we've run past the last one, so stop.
    if (reviewCards.length === 0) break;

    reviewCards.each((i, element) => {
        if (reviews.length >= maxReviews) return false; // respect the review cap
        const positive = $(element).find('[data-testid="text-review-quotes-positive"] [data-testid="review-quote-answer"]').text().trim() || '';
        const negative = $(element).find('[data-testid="text-review-negative"] [data-testid="review-quote-answer"]').text().trim() || '';
        reviews.push({ positive, negative });
    });
    currentPage++;
}

Final Thoughts


With ChatGPT’s assistance, I went from a vague idea to a functional tool in about two hours (I'm sure the next scraper will go much faster). If you’re working on a similar project, consider leveraging AI tools like ChatGPT: they're a game changer for rapid development and problem-solving.


If you have any questions or want to build your own scraper, feel free to reach out!
