If you’re wondering how to make a Chrome Extension, Chrome’s extension documentation is great for basic implementations. However, using more advanced features requires a lot of Googling and Stack Overflow. Let’s make an intermediate Chrome extension that interacts with the page: it will find the first external link on the page and open it in a new tab.
manifest.json
The manifest.json file tells Chrome important information about your extension, like its name and which permissions it needs.
The most basic possible extension is a directory with a manifest.json file. Let’s create a directory and put the following JSON into manifest.json:
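A minimal sketch of such a manifest, assuming manifest version 2 as used throughout this tutorial. The name and version values are placeholders chosen for illustration:

```json
{
  "manifest_version": 2,
  "name": "My Cool Extension",
  "version": "0.1"
}
```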
That’s the most basic possible manifest.json, with all required fields filled in. The manifest_version should always be 2, because version 1 is unsupported as of January 2014. So far our extension does absolutely nothing, but let’s load it into Chrome anyway.
Load your extension into Chrome
To load your extension in Chrome, open up chrome://extensions/ in your browser and click “Developer mode” in the top right. Now click “Load unpacked extension…” and select the extension’s directory. You should now see your extension in the list.
When you change or add code in your extension, just come back to this page and reload the page. Chrome will reload your extension.
Content scripts
A content script is “a JavaScript file that runs in the context of web pages.” This means that a content script can interact with web pages that the browser visits. Not every JavaScript file in a Chrome extension can do this; we’ll see why later.
Let’s add a content script named content.js:
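A content script that pops up an alert could be as small as the sketch below. The greeting text is an assumption, and the typeof guard is only there so the file can also be loaded outside a browser:

```javascript
// content.js
// Assumed greeting text; any string works.
var GREETING = "Hello from your Chrome extension!";

// alert exists in a page context; the guard keeps the file loadable elsewhere.
if (typeof alert === "function") {
  alert(GREETING);
}
```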
To inject the script, we need to tell our manifest.json file about it. Add this to your manifest.json file:
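A sketch of the entry, which goes inside the top-level object of manifest.json next to the existing fields:

```json
"content_scripts": [
  {
    "matches": ["<all_urls>"],
    "js": ["content.js"]
  }
]
```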
This tells Chrome to inject content.js into every page we visit using the special <all_urls> URL pattern. If we want to inject the script on only some pages, we can use match patterns. Here are a few examples of values for 'matches':
- ['https://mail.google.com/*', 'http://mail.google.com/*'] injects our script into HTTPS and HTTP Gmail. If we have / at the end instead of /*, it matches the URLs exactly, and so would only inject into https://mail.google.com/, not https://mail.google.com/mail/u/0/#inbox. Usually that isn’t what you want.
- http://*/* will match any http URL, but no other scheme. For example, this won’t inject your script into https sites.
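To get a feel for the wildcard behaviour described above, here is a deliberately simplified matcher. It is not Chrome’s real implementation: matchesPattern is a hypothetical helper that treats * as "any run of characters" and ignores special cases such as <all_urls>.

```javascript
// Simplified illustration of match-pattern wildcards; NOT Chrome's real matcher.
function matchesPattern(pattern, url) {
  // Escape regex metacharacters except "*", then let "*" match any run of characters.
  var escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  var regex = new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
  return regex.test(url);
}

var inbox = "https://mail.google.com/mail/u/0/#inbox";
console.log(matchesPattern("https://mail.google.com/*", inbox));   // true
console.log(matchesPattern("https://mail.google.com/", inbox));    // false
console.log(matchesPattern("http://*/*", "https://example.com/")); // false
```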
Reload your Chrome extension. Every single page you visit now pops up an alert. Let’s log the first URL on the page instead.
Logging the URL
jQuery isn’t necessary, but it makes everything easier. First, download a version of jQuery from the jQuery CDN and put it in your extension’s folder. I downloaded the latest minified version, jquery-2.1.3.min.js. To load it, add it to manifest.json before 'content.js'. Your whole manifest.json should look like this:
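A sketch of the manifest at this stage, again with a placeholder name and version. The order of the "js" array matters, since the files are injected in the order listed:

```json
{
  "manifest_version": 2,
  "name": "My Cool Extension",
  "version": "0.1",
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["jquery-2.1.3.min.js", "content.js"]
    }
  ]
}
```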
Now that we have jQuery, let’s use it to log the URL of the first external link on the page in content.js:
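One way to sketch this is below. It assumes that "external" can be approximated as "the href starts with http", which also matches absolute links back to the same site. Passing jQuery in as a parameter is a choice made here so the function can be exercised on its own:

```javascript
// content.js
// Returns the href of the first anchor whose href starts with "http".
// Takes jQuery as a parameter (an assumption made here for testability).
function firstExternalHref($) {
  return $("a[href^='http']").eq(0).attr("href");
}

// jQuery is available because the manifest loads it before content.js.
if (typeof jQuery !== "undefined") {
  console.log(firstExternalHref(jQuery));
}
```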
Note that we don’t need to use jQuery to check if the document has loaded. By default, Chrome injects content scripts after the DOM is complete.
Try it out - you should see the output in your console on every page you visit.
Browser Actions
When an extension adds a little icon next to your address bar, that’s a browser action. Your extension can listen for clicks on that button and then do something.
Put the icon.png from Google’s extension tutorial in your extension folder and add this to manifest.json:
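A sketch of the entry, assuming the icon file is named icon.png and sits in the extension’s root folder:

```json
"browser_action": {
  "default_icon": "icon.png"
}
```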
In order to use the browser action, we need to add message passing.
Message passing
A content script has access to the current page, but is limited in the APIs it can access. For example, it cannot listen for clicks on the browser action. We need to add a different type of script to our extension, a background script, which has access to every Chrome API but cannot access the current page. As Google puts it:
Content scripts have some limitations. They cannot use chrome.* APIs, with the exception of extension, i18n, runtime, and storage.
So the content script will be able to pull a URL out of the current page, but will need to hand that URL over to the background script to do something useful with it. In order to communicate, we’ll use what Google calls message passing, which allows scripts to send and listen for messages. It is the only way for content scripts and background scripts to interact.
Add the following to tell manifest.json about the background script:
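A sketch of the entry in manifest version 2 form, assuming the script is named background.js:

```json
"background": {
  "scripts": ["background.js"]
}
```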
Now we’ll add background.js:
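A sketch of a background script that reacts to a click on the browser action and messages the active tab. The wrapper function registerBrowserActionHandler and the payload value "clicked_browser_action" are choices made for this example; the chrome.browserAction and chrome.tabs calls are the manifest version 2 APIs:

```javascript
// background.js
// Registers the click handler on the given chrome-like API object.
function registerBrowserActionHandler(api) {
  api.browserAction.onClicked.addListener(function (tab) {
    // Find the active tab in the current window and message its content script.
    api.tabs.query({ active: true, currentWindow: true }, function (tabs) {
      var activeTab = tabs[0];
      api.tabs.sendMessage(activeTab.id, { message: "clicked_browser_action" });
    });
  });
}

// Only wire things up when running inside the extension.
if (typeof chrome !== "undefined") {
  registerBrowserActionHandler(chrome);
}
```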
This sends an arbitrary JSON payload to the current tab. The keys of the JSON payload can be anything, but I chose 'message' for simplicity. Now we need to listen for that message in content.js:
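A sketch of the listening side, assuming the payload value is "clicked_browser_action" and that jQuery is loaded before content.js. The wrapper function registerMessageListener is hypothetical and exists only so the logic can be run outside Chrome:

```javascript
// content.js
// Logs the first http(s) link on the page whenever the click message arrives.
function registerMessageListener(api, $) {
  api.runtime.onMessage.addListener(function (request, sender, sendResponse) {
    if (request.message === "clicked_browser_action") {
      var firstHref = $("a[href^='http']").eq(0).attr("href");
      console.log(firstHref);
    }
  });
}

// Only wire things up when running inside the extension.
if (typeof chrome !== "undefined" && typeof jQuery !== "undefined") {
  registerMessageListener(chrome, jQuery);
}
```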
Notice that all of our previous code has been moved into the listener, so that it is only run when the payload is received. Every time you click the browser action icon, you should see a URL get logged to the console. If it’s not working, try reloading the extension and then reloading the page.
Opening a new tab
We can use the chrome.tabs API to open a new tab:
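The call itself is chrome.tabs.create with a url property. The sketch below wraps it in a hypothetical helper, openInNewTab, and the URL in the usage comment is a placeholder:

```javascript
// Opens a URL in a new tab via the chrome.tabs API.
function openInNewTab(api, url) {
  api.tabs.create({ url: url });
}

// Inside the extension this would be called as:
//   openInNewTab(chrome, "https://example.com/");
```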
But chrome.tabs can only be used by background.js, so we’ll have to add some more message passing since background.js can open the tab, but can’t grab the URL. Here’s the idea:
- Listen for a click on the browser action in background.js. When it’s clicked, send a clicked_browser_action event to content.js.
- When content.js receives the event, it grabs the URL of the first link on the page. Then it sends open_new_tab back to background.js with the URL to open.
- background.js listens for open_new_tab and opens a new tab with the given URL when it receives the message.
Clicking on the browser action will trigger background.js, which will send a message to content.js, which will send a URL back to background.js, which will open a new tab with the given URL.
First, we need to tell content.js to send the URL to background.js. Change content.js to use this code:
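A sketch of the updated content script. The "url" key in the payload and the wrapper function registerClickResponder are assumptions made for this example; chrome.runtime.sendMessage is the call that hands the URL to the background script:

```javascript
// content.js
// On the click message, find the first http(s) link and send it to background.js.
function registerClickResponder(api, $) {
  api.runtime.onMessage.addListener(function (request, sender, sendResponse) {
    if (request.message === "clicked_browser_action") {
      var firstHref = $("a[href^='http']").eq(0).attr("href");
      console.log(firstHref);
      // New compared with the previous version: pass the URL to background.js.
      api.runtime.sendMessage({ message: "open_new_tab", url: firstHref });
    }
  });
}

// Only wire things up when running inside the extension.
if (typeof chrome !== "undefined" && typeof jQuery !== "undefined") {
  registerClickResponder(chrome, jQuery);
}
```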
Now we need to add some code to tell background.js to listen for that event:
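A sketch of the listener, to be added to background.js alongside the browser action click handler. It assumes the payload carries the link under a "url" key, and registerOpenTabListener is a hypothetical wrapper used so the logic can run outside Chrome:

```javascript
// background.js (in addition to the browser action click handler)
// Opens a new tab whenever a content script sends an open_new_tab message.
function registerOpenTabListener(api) {
  api.runtime.onMessage.addListener(function (request, sender, sendResponse) {
    if (request.message === "open_new_tab") {
      api.tabs.create({ url: request.url });
    }
  });
}

// Only wire things up when running inside the extension.
if (typeof chrome !== "undefined") {
  registerOpenTabListener(chrome);
}
```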
Now when you click on the browser action icon, it opens a new tab with the first external URL on the page.
Wrapping it up
The full content.js and background.js are above. Here’s the full manifest.json:
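Putting the pieces from the earlier steps together, the manifest would look roughly like this. The name and version are placeholders, and the key order is arbitrary:

```json
{
  "manifest_version": 2,
  "name": "My Cool Extension",
  "version": "0.1",
  "background": {
    "scripts": ["background.js"]
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["jquery-2.1.3.min.js", "content.js"]
    }
  ],
  "browser_action": {
    "default_icon": "icon.png"
  }
}
```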
And here’s the full directory structure:
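Assuming the file names used throughout this tutorial and everything kept in one flat folder, the layout would be:

```
.
├── background.js
├── content.js
├── icon.png
├── jquery-2.1.3.min.js
└── manifest.json
```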
More on how to make a Chrome extension
For more information, try the official Chrome extension documentation.