Building an efficient and reliable scraper can be quite a chore when dealing with complex websites, pagination, and proxy rotation - especially for someone who hasn't written a single line of code, like ever.
Recently, I went on a journey to create a web scraper for collecting reviews from the OMR Reviews website - one of my agencies needed to analyze market sentiment in Germany, and there were no out-of-the-box scrapers available on Apify. The process went well, and ChatGPT helped me build and tune the scraper to meet all our requirements.
Key Takeaways
Coding with ChatGPT is Easy: ChatGPT was a great partner-in-crime, helping me debug errors, refine logic, and even write documentation.
Iterative Development: Building the scraper was not a one-and-done task. It required testing, feedback, and incremental improvements.
Adaptability Matters: Websites change. Structuring the scraper to handle dynamic behavior (like proxy rotation and pagination) made it resilient.
The Problem: Scraping Reviews with Pagination
My goal was simple: scrape reviews from a product page on a website. The reviews spanned multiple pages, and each page had its unique structure. Here were the challenges:
Dynamic Pagination: URLs for subsequent pages required appending /2, /3, etc.
JavaScript-Rendered Content: The reviews were dynamically loaded.
Avoiding IP Blocks: The scraper needed proxy rotation to prevent being blocked.
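That pagination scheme is easy to express as a tiny helper. This is a sketch of my own (the name buildPageUrl is mine, not from the final scraper, and the URL is a placeholder), assuming page 1 lives at the bare product URL:

```javascript
// Build the URL for a given results page. The site serves page 1 at the
// bare product URL and appends /2, /3, ... for subsequent pages.
function buildPageUrl(baseUrl, page) {
  return page === 1 ? baseUrl : `${baseUrl}/${page}`;
}

console.log(buildPageUrl('https://example.com/reviews', 1)); // bare URL
console.log(buildPageUrl('https://example.com/reviews', 3)); // .../reviews/3
```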
Step 1: Determining the Requirements
I started by outlining the core features my scraper needed:
Handle multi-page navigation.
Extract structured data like positive feedback, negative feedback, and use cases.
Support scraping up to 100 reviews in total.
Use proxy rotation to avoid detection.
Step 2: Collaborating with ChatGPT
Instead of coding everything from scratch, I turned to ChatGPT. Here’s how it helped:
Initial Setup: ChatGPT guided me in setting up a web scraper using the Apify platform, leveraging tools like Axios and Cheerio for HTML fetching and parsing.
The initial code fetched reviews from a single page:
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch the page and parse the review blocks out of the HTML.
const response = await axios.get(url);
const $ = cheerio.load(response.data);
const reviews = [];
$('[data-testid="text-review-quotes"]').each((i, element) => {
  const positive = $(element).find('[data-testid="text-review-quotes-positive"] [data-testid="review-quote-answer"]').text().trim() || '';
  const negative = $(element).find('[data-testid="text-review-negative"] [data-testid="review-quote-answer"]').text().trim() || '';
  reviews.push({ positive, negative });
});
Handling Pagination: The website used a /2, /3, /4 URL structure for pagination. ChatGPT updated the scraper to dynamically navigate pages until the desired number of reviews was collected:
while (reviews.length < maxReviews) {
  const currentPageUrl = `${url}/${currentPage}`;
  const response = await axios.get(currentPageUrl);
  // Extract reviews here, then move on to the next page.
  currentPage++;
}
Adding Proxy Rotation: To avoid being blocked, I asked ChatGPT to incorporate proxy rotation using Apify's proxy capabilities:
const proxyUrls = process.env.APIFY_PROXY_URLS?.split(',') || [];
const proxyUrl = proxyUrls.length > 0 ? proxyUrls[proxyIndex % proxyUrls.length] : null;
const axiosConfig = proxyUrl
? { httpsAgent: new (require('https-proxy-agent'))(proxyUrl) }
: {};
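The round-robin selection behind proxyIndex is just modular arithmetic. Here is a self-contained sketch (makeProxyRotator is my own name for illustration, not part of the scraper or Apify's API; the proxy URLs are placeholders):

```javascript
// Cycle through a list of proxy URLs, one per request.
// Returns null when no proxies are configured, so the caller can
// fall back to a direct connection (as the axiosConfig above does).
function makeProxyRotator(proxyUrls) {
  let proxyIndex = 0;
  return () => {
    if (proxyUrls.length === 0) return null;
    return proxyUrls[proxyIndex++ % proxyUrls.length];
  };
}

// Example: two proxies are used alternately.
const nextProxy = makeProxyRotator(['http://p1:8000', 'http://p2:8000']);
nextProxy(); // first proxy
nextProxy(); // second proxy
nextProxy(); // wraps around to the first again
```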
Input Schema for User Configurability: I needed a clear way to pass parameters like the target URL and maximum reviews. ChatGPT helped me design an input_schema.json:
{
  "title": "Review scraper input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "url": {
      "title": "Product page URL",
      "type": "string",
      "description": "The review page to start scraping from.",
      "editor": "textfield",
      "default": "https://example.com/reviews"
    },
    "maxReviews": {
      "title": "Maximum reviews",
      "type": "integer",
      "description": "Stop after collecting this many reviews.",
      "default": 100
    }
  },
  "required": ["url"]
}
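Inside the actor, the input then only needs to be merged with those defaults and validated. A minimal, dependency-free sketch of that step (resolveInput is a hypothetical helper of mine - on the Apify platform itself, Actor.getInput() together with the schema takes care of this):

```javascript
// Defaults mirroring input_schema.json above.
const DEFAULTS = { url: 'https://example.com/reviews', maxReviews: 100 };

// Merge a possibly partial user input with the defaults and enforce
// the "required": ["url"] constraint from the schema.
function resolveInput(rawInput = {}) {
  const input = { ...DEFAULTS, ...rawInput };
  if (typeof input.url !== 'string' || input.url.length === 0) {
    throw new Error('Input field "url" is required');
  }
  return input;
}

const input = resolveInput({ maxReviews: 50 });
// input.url falls back to the default; input.maxReviews is 50
```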
Step 3: Testing and Iteration
Despite initial success, the scraper stopped at 37 reviews. I shared this issue with ChatGPT, which helped me refine the pagination logic and debug errors. For example:
Ensuring pagination stopped only when there were no more pages.
Validating the JSON structure to fix schema errors.
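Condensed to its control flow, that pagination fix looks like this (collectReviews and the synchronous fetchPage stub are my own names for illustration - the real scraper awaits axios instead):

```javascript
// Collect reviews page by page. fetchPage(page) stands in for the
// axios + cheerio extraction and returns the reviews found on that page.
function collectReviews(fetchPage, maxReviews) {
  const reviews = [];
  let page = 1;
  while (reviews.length < maxReviews) {
    const pageReviews = fetchPage(page);
    if (pageReviews.length === 0) break; // pagination exhausted - stop cleanly
    reviews.push(...pageReviews);
    page++;
  }
  return reviews.slice(0, maxReviews); // trim overshoot from the final page
}
```

With a stub serving 10 reviews per page across 4 pages, asking for 35 returns exactly 35, and asking for 100 stops at 40 instead of looping forever - the two failure modes behind the "stuck at 37" bug.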
Step 4: The Final Solution
After several iterations, I ended up with a robust scraper that:
Navigates through pagination seamlessly.
Extracts structured review data (positive, negative, and problem-solving).
Stops after collecting the specified number of reviews (up to 100).
Uses proxy rotation for resilience.
Here’s the core scraping loop:
while (reviews.length < maxReviews) {
  const currentPageUrl = `${url}/${currentPage}`;
  const response = await axios.get(currentPageUrl);
  const $ = cheerio.load(response.data);
  const reviewElements = $('[data-testid="text-review-quotes"]');
  if (reviewElements.length === 0) break; // no more pages - stop here
  reviewElements.each((i, element) => {
    const positive = $(element).find('[data-testid="text-review-quotes-positive"] [data-testid="review-quote-answer"]').text().trim() || '';
    const negative = $(element).find('[data-testid="text-review-negative"] [data-testid="review-quote-answer"]').text().trim() || '';
    reviews.push({ positive, negative });
  });
  currentPage++;
}
reviews.length = Math.min(reviews.length, maxReviews); // trim any overshoot from the last page
Final Thoughts
With ChatGPT’s assistance, I went from a vague idea to a functional tool in about 2 hours (I'm sure the next scraper will go much faster). If you’re working on a similar project, consider leveraging AI tools like ChatGPT - they're a game changer for rapid development and problem-solving.
If you have any questions or want to build your own scraper, feel free to reach out!