In this article, we will explain how to crawl websites with Node.js. As an example, we will show how a script can extract information from a travel business website listing hotels in Amsterdam. The script uses a technique called web scraping to explore the website systematically, starting from the main page and gradually moving to other pages, following links and collecting specific details from each hotel page it encounters.
Firstly, let’s set up a Node.js project. To do so, you need to run the following command:
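Assuming the project folder name used throughout this article (any name works):

```bash
mkdir web-scraper-nodejs
```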
Now you've created an empty directory for your web-scraper-nodejs project. You can name this project folder whatever you'd like.
To enter the folder, you can use the following command:
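Assuming the folder name from the previous step:

```bash
cd web-scraper-nodejs
```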
For the next step, you will need npm (Node Package Manager), a tool that comes bundled with the Node.js runtime environment. It lets developers easily install, manage, and share reusable code packages, also known as modules or libraries.
Initialize an npm project with the following command:
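A typical invocation (the -y flag is explained below):

```bash
npm init -y
```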
This command sets up a new npm project for you. To quickly create a default project, include the -y flag: it tells npm to skip the interactive setup and initialize the project with default settings. If you don't include the -y flag, the terminal will prompt you with a few questions that you need to answer.
After completing these steps, you should have a package.json file in the web-scraper-nodejs directory. Now you can create a JS file (we named it app_parallel.js) and copy this example of the code to crawl our chosen website:
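Below is a minimal sketch of such a parallel crawler. It assumes the axios and cheerio packages (install them with `npm install axios cheerio`); the start URL, the CSS selectors inside scrapData($), and the local proxy port are placeholders you will need to adapt to your target site and proxy setup:

```javascript
// A sketch of a parallel crawler using axios + cheerio.
// Install the dependencies first: npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');

// Route every request through the local gost proxy tunnel
// started in the next step (assumed to listen on port 8080).
const client = axios.create({
  proxy: { protocol: 'http', host: '127.0.0.1', port: 8080 },
  timeout: 10000,
});

// Extract the data we care about from one hotel page.
// The selectors below are hypothetical placeholders — replace them
// with selectors that match the markup of your target site.
function scrapData($) {
  return {
    name: $('h1.hotel-name').first().text().trim(),
    price: $('span.price').first().text().trim(),
  };
}

// Collect same-site links from a page so the crawler can keep exploring.
function extractLinks($, baseUrl) {
  const links = new Set();
  $('a[href]').each((_, el) => {
    try {
      const url = new URL($(el).attr('href'), baseUrl);
      if (url.origin === new URL(baseUrl).origin) links.add(url.href);
    } catch (_) { /* ignore malformed hrefs */ }
  });
  return [...links];
}

// Download a single page and return its extracted data plus outgoing links.
async function crawlPage(url) {
  const { data: html } = await client.get(url);
  const $ = cheerio.load(html);
  return { data: scrapData($), links: extractLinks($, url) };
}

async function main() {
  const startUrl = 'https://www.example-travel-site.com/hotels/amsterdam'; // placeholder URL
  const maxPages = 20;   // safety limit on the number of pages visited
  const concurrency = 5; // pages fetched in parallel per batch

  const visited = new Set();
  const queue = [startUrl];

  while (queue.length > 0 && visited.size < maxPages) {
    // Take the next batch of unvisited URLs and fetch them in parallel.
    const batch = [...new Set(queue.splice(0, concurrency))].filter((u) => !visited.has(u));
    batch.forEach((u) => visited.add(u));

    const results = await Promise.allSettled(batch.map(crawlPage));

    for (const result of results) {
      if (result.status !== 'fulfilled') continue;
      const { data, links } = result.value;
      if (data.name && data.price) {
        console.log(`Hotel name: ${data.name}`);
        console.log(`Price: ${data.price}`);
      }
      queue.push(...links.filter((link) => !visited.has(link)));
    }
  }
}

main().catch(console.error);
```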
You can change the URL in the async main function and adjust the scrapData($) function if you'd like to extract different data. This function extracts the specified data by selecting elements with CSS selectors.
Don't forget that before you can run the code, you will need to install Go Simple Tunnel (gost) from https://github.com/go-gost/gost/releases. On macOS, you can also install it by running the 'brew install gost' command.
After this, you can run a proxy tunnel using the following command:
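A minimal example, assuming gost should listen as a local HTTP proxy on port 8080 (the port the script above points at) and forward traffic to an upstream proxy; the upstream address and credentials below are placeholders you must replace with your own provider's details:

```bash
gost -L http://:8080 -F http://username:password@proxy.example.com:3128
```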
After all these steps, you are ready to run the script and get back the name and price of each hotel.
To do that, use the following command:
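Assuming the file name used above:

```bash
node app_parallel.js
```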
Your output should look like this:
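With the placeholder selectors from the sketch above, each scraped hotel page produces two console lines of this shape (the values are placeholders, not real results):

```
Hotel name: <hotel name>
Price: <price>
```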
Congratulations! You have successfully learned how to crawl websites using Node.js.
To crawl a website, you use a web crawler or spider, which is a program that systematically navigates through web pages, follows links, and collects data from the site. You typically start from a specific URL, then recursively visit linked pages, extracting information as you go.
JS crawling, or JavaScript crawling, involves the process of crawling and scraping websites that heavily rely on JavaScript to load and display content. Unlike traditional HTML-based crawling, JS crawling requires tools or techniques that can execute JavaScript code to access and extract data from web pages that use dynamic rendering.
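As a brief illustration (separate from the axios/cheerio script above), a headless browser such as Puppeteer can render a JavaScript-heavy page before its HTML is read; the URL below is a placeholder:

```javascript
// A minimal sketch using the puppeteer package: npm install puppeteer
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until network activity settles so client-side rendering can finish.
  await page.goto('https://www.example-travel-site.com/hotels/amsterdam', {
    waitUntil: 'networkidle2',
  });
  const renderedHtml = await page.content(); // fully rendered HTML, ready for a parser like cheerio
  console.log(renderedHtml.length);
  await browser.close();
})();
```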
The time it takes to crawl a website varies widely, depending on factors like the website's size, complexity, and your crawling speed. Small sites may take minutes, while large ones with millions of pages could take hours or even days.