
Creating a Review Scraper with ChatGPT: A Journey into Automation

Building an efficient and reliable scraper can be quite a chore when dealing with complex websites, pagination, and proxy rotation, especially for someone who hasn't written a single line of code, like ever.


Recently, I went on a journey to create a web scraper for collecting reviews from the OMR Reviews website: one of my agencies needed to analyze market sentiment in Germany, and there were no out-of-the-box scrapers available on Apify. The process went well, and ChatGPT helped me build and tune the scraper to meet all our requirements.


Key Takeaways

  1. Coding with ChatGPT is Easy: ChatGPT was a great partner-in-crime, helping me debug errors, refine logic, and even write documentation.

  2. Iterative Development: Building the scraper was not a one-and-done task. It required testing, feedback, and incremental improvements.

  3. Adaptability Matters: Websites change. Structuring the scraper to handle dynamic behavior (like proxy rotation and pagination) made it resilient.


The Problem: Scraping Reviews with Pagination

My goal was simple: scrape reviews from a product page on a website. The reviews spanned multiple pages, each with its own URL, and there were several challenges:

  • Dynamic Pagination: URLs for subsequent pages required appending /2, /3, etc.

  • JavaScript-Rendered Content: The reviews were dynamically loaded.

  • Avoiding IP Blocks: The scraper needed proxy rotation to prevent being blocked.


Step 1: Determining the Requirements

I started by outlining the core features my scraper needed:

  • Handle multi-page navigation.

  • Extract structured data like positive feedback, negative feedback, and use cases.

  • Support scraping up to 100 reviews in total.

  • Use proxy rotation to avoid detection.


Step 2: Collaborating with ChatGPT

Instead of coding everything from scratch, I turned to ChatGPT. Here’s how it helped:

Initial Setup: ChatGPT guided me in setting up a web scraper on the Apify platform, using Axios to fetch each page's HTML and Cheerio to parse it.


The initial code fetched reviews from a single page:

const axios = require('axios');
const cheerio = require('cheerio');

const response = await axios.get(url);
const $ = cheerio.load(response.data);

const reviews = [];
// Each review card carries a data-testid; pull the positive and
// negative quotes out of it.
$('[data-testid="text-review-quotes"]').each((i, element) => {
    const positive = $(element).find('[data-testid="text-review-quotes-positive"] [data-testid="review-quote-answer"]').text().trim() || '';
    const negative = $(element).find('[data-testid="text-review-negative"] [data-testid="review-quote-answer"]').text().trim() || '';
    reviews.push({ positive, negative });
});
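
For context, here's a minimal sketch of how that snippet could sit inside the actor's entry point, using the Apify SDK's Actor API. The file name and overall structure are my assumptions for illustration; only the parsing logic above comes from the actual scraper:

// main.js: minimal actor skeleton (Apify SDK v3). The scraping logic
// from above goes where the placeholder comment sits.
const { Actor } = require('apify');
const axios = require('axios');
const cheerio = require('cheerio');

Actor.main(async () => {
    // Input fields are defined by the input schema shown later in this post.
    const { url, maxReviews = 100 } = await Actor.getInput();

    const reviews = [];
    // ... fetch pages and fill `reviews` as shown above ...

    // Save the results to the actor's default dataset.
    await Actor.pushData(reviews);
});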

Handling Pagination: The website used a /2, /3, /4 URL structure for pagination. ChatGPT updated the scraper to dynamically navigate pages until the desired number of reviews was collected:

let currentPage = 1;
while (reviews.length < maxReviews) {
    // Subsequent pages live at <url>/2, <url>/3, and so on.
    const currentPageUrl = `${url}/${currentPage}`;
    const response = await axios.get(currentPageUrl);
    // Extract reviews from this page (same Cheerio logic as above),
    // then move on to the next one.
    currentPage++;
}

Adding Proxy Rotation: To avoid being blocked, I asked ChatGPT to incorporate proxy rotation using Apify's proxy capabilities:

// Comma-separated list of proxy URLs, supplied through an environment variable.
const proxyUrls = process.env.APIFY_PROXY_URLS?.split(',') || [];
let proxyIndex = 0;

// Cycle through the list so each request can go out via a different proxy.
const proxyUrl = proxyUrls.length > 0 ? proxyUrls[proxyIndex++ % proxyUrls.length] : null;
const axiosConfig = proxyUrl
    ? { httpsAgent: new (require('https-proxy-agent'))(proxyUrl) } // works with https-proxy-agent v5 and below; v7+ exports { HttpsProxyAgent }
    : {};
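
The resulting config is then passed straight to Axios on each request (await axios.get(currentPageUrl, axiosConfig)). On the Apify platform there's also a built-in alternative; here's a sketch, assuming the same Apify SDK v3 Actor API as above, that asks the platform for rotating proxy URLs instead of maintaining a list by hand:

// Inside a running actor (i.e. after Actor.init() or within Actor.main()):
// let Apify hand out the proxy URLs.
const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl(); // a fresh proxy URL per call
const axiosConfig = { httpsAgent: new (require('https-proxy-agent'))(proxyUrl) };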

Input Schema for User Configurability: I needed a clear way to pass parameters like the target URL and maximum reviews. ChatGPT helped me design an input_schema.json:

{
    "title": "Review Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "Reviews page URL",
            "type": "string",
            "editor": "textfield",
            "default": "https://example.com/reviews"
        },
        "maxReviews": {
            "title": "Maximum number of reviews",
            "type": "integer",
            "default": 100
        }
    },
    "required": ["url"]
}
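
With the schema in place, a run can be configured with a plain JSON input, for example (the URL is just the schema's placeholder default; any reviews page URL goes here):

{
    "url": "https://example.com/reviews",
    "maxReviews": 50
}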

Step 3: Testing and Iteration

Despite the promising start, the scraper stopped at 37 reviews. I shared this issue with ChatGPT, which helped me refine the pagination logic and debug errors. For example:

  • Ensuring pagination stopped only when there were no more pages (see the sketch after this list).

  • Validating the JSON structure to fix schema errors.
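
The pagination fix boiled down to a small guard, sketched here in isolation (it appears in context in the final loop below):

// Keep paginating while pages still contain review cards;
// only an empty page signals that we've reached the end.
const reviewCards = $('[data-testid="text-review-quotes"]');
if (reviewCards.length === 0) break;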


Step 4: The Final Solution

After several iterations, I ended up with a robust scraper that:

  • Navigates through pagination seamlessly.

  • Extracts structured review data (positive, negative, and problem-solving).

  • Stops after collecting the specified number of reviews (up to 100).

  • Uses proxy rotation for resilience.


Here’s the core scraping loop:

let currentPage = 1;
while (reviews.length < maxReviews) {
    const currentPageUrl = `${url}/${currentPage}`;
    // axiosConfig carries the rotating proxy agent from the proxy step.
    const response = await axios.get(currentPageUrl, axiosConfig);
    const $ = cheerio.load(response.data);

    const reviewCards = $('[data-testid="text-review-quotes"]');
    // An empty page means we've run past the last one, so stop.
    if (reviewCards.length === 0) break;

    reviewCards.each((i, element) => {
        if (reviews.length >= maxReviews) return false; // respect the review cap
        const positive = $(element).find('[data-testid="text-review-quotes-positive"] [data-testid="review-quote-answer"]').text().trim() || '';
        const negative = $(element).find('[data-testid="text-review-negative"] [data-testid="review-quote-answer"]').text().trim() || '';
        reviews.push({ positive, negative });
    });
    currentPage++;
}

Final Thoughts


With ChatGPT’s assistance, I went from a vague idea to a functional tool in about two hours (I'm sure the next scraper will go much faster). If you’re working on a similar project, consider leveraging AI tools like ChatGPT: they're a game changer for rapid development and problem-solving.


If you have any questions or want to build your own scraper, feel free to reach out!
