This n8n workflow automates scraping multiple webpages listed in a website's sitemap and saving their content to Google Drive. It is designed for users who need to extract and archive website information efficiently.
The workflow begins with a manual trigger for testing purposes. It fetches the sitemap XML file, which lists the site's URLs, converts the XML data into JSON, and then splits the list of URLs into individual items so each URL can be processed separately.
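As a rough illustration of this fetch-and-split step outside n8n, the sketch below does the same thing in plain JavaScript. It assumes Node.js 18+ (for the global fetch) and the xml2js package; the sitemap URL is only an example consistent with the site used later in the workflow.

```javascript
const { parseStringPromise } = require('xml2js');

async function fetchSitemapUrls(sitemapUrl) {
  const res = await fetch(sitemapUrl);
  const xml = await res.text();
  // Convert the sitemap XML into a JSON structure.
  const parsed = await parseStringPromise(xml);
  // A standard sitemap lists each page under <urlset><url><loc>.
  return parsed.urlset.url.map((entry) => entry.loc[0]);
}

// Each URL becomes an individual item, mirroring the split step.
fetchSitemapUrls('https://ai.pydantic.dev/sitemap.xml')
  .then((urls) => urls.forEach((url) => console.log(url)));
```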
For each URL, the workflow filters pages by topic, keeping links that contain ‘agent’ or ‘tool’ as well as the specific URL ‘https://ai.pydantic.dev/’. It then prepares the page for scraping by setting the target URL.
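The filter itself boils down to a simple predicate. A minimal sketch of the conditions described above (the function name shouldScrape is hypothetical, not a node from the workflow):

```javascript
// Keep pages about agents or tools, plus the site root.
function shouldScrape(url) {
  return (
    url === 'https://ai.pydantic.dev/' ||
    url.includes('agent') ||
    url.includes('tool')
  );
}

console.log(shouldScrape('https://ai.pydantic.dev/agents/'));  // true
console.log(shouldScrape('https://ai.pydantic.dev/install/')); // false
```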
Using the Jina.ai web scraper service, the workflow retrieves the content of each page. A code node then extracts the title and markdown content from the raw response and combines them into a structured format.
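A hedged sketch of this scrape-and-extract step, assuming the Jina Reader endpoint at https://r.jina.ai/ and its JSON response shape (a data object carrying title and content fields); requesting JSON output yields both pieces in one call:

```javascript
async function scrapePage(targetUrl) {
  // The Reader service is called by prefixing the target URL.
  const res = await fetch(`https://r.jina.ai/${targetUrl}`, {
    headers: { Accept: 'application/json' },
  });
  const { data } = await res.json();
  // Combine title and markdown into one structured item,
  // as the workflow's code node does.
  return { title: data.title, markdown: data.content };
}

scrapePage('https://ai.pydantic.dev/').then((page) =>
  console.log(page.title, page.markdown.slice(0, 200))
);
```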
Finally, the content is saved to Google Drive, with each filename derived from the webpage title. Throughout the workflow, informational sticky notes explain its purpose and usage.
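Deriving a safe filename from a page title usually needs some sanitization. A small illustrative helper (toFilename is a hypothetical name, and the character set and length limit are assumptions, not taken from the workflow):

```javascript
function toFilename(title) {
  return (
    title
      .trim()
      .replace(/[\\/:*?"<>|]+/g, '') // strip characters unsafe in filenames
      .replace(/\s+/g, ' ')          // collapse runs of whitespace
      .slice(0, 100) + '.md'         // cap length, save as markdown
  );
}

console.log(toFilename('Agents - PydanticAI')); // "Agents - PydanticAI.md"
```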
This automation is practical for website content auditing, archiving, data analysis, or building comprehensive offline copies of multi-page sites.