Js Web Scraping




If you’re wondering how to make a Chrome Extension, Chrome’s extension documentation is great for basic implementations. However, using more advanced features requires a lot of Googling and Stack Overflow. Let’s make an intermediate Chrome extension that interacts with the page: it will find the first external link on the page and open it in a new tab.


manifest.json

The manifest.json file tells Chrome important information about your extension, like its name and which permissions it needs.

The most basic possible extension is a directory with a manifest.json file. Let’s create a directory and put the following JSON into manifest.json:
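
A minimal manifest of this kind might look like the sketch below. It uses the Manifest V2 format this tutorial describes; the name and version values are placeholders we made up, so substitute your own.

```json
{
  "manifest_version": 2,
  "name": "My Cool Extension",
  "version": "0.1"
}
```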

That’s the most basic possible manifest.json, with all required fields filled in. This tutorial uses manifest_version 2 (version 1 has been unsupported since January 2014; current Chrome releases have since moved to Manifest V3, which renames some of the keys used below). So far our extension does absolutely nothing, but let’s load it into Chrome anyway.

Load your extension into Chrome

To load your extension in Chrome, open up chrome://extensions/ in your browser and click “Developer mode” in the top right. Now click “Load unpacked extension…” and select the extension’s directory. You should now see your extension in the list.

When you change or add code in your extension, just come back to this page and reload it. Chrome will reload your extension.

Content scripts

A content script is “a JavaScript file that runs in the context of web pages.” This means that a content script can interact with web pages that the browser visits. Not every JavaScript file in a Chrome extension can do this; we’ll see why later.

Let’s add a content script named content.js:
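
Since the next steps talk about an alert popping up on every page, a first content.js could be as small as the sketch below. The message text is a placeholder of ours, and the `typeof` guard is only there so the file stays harmless if it is ever run outside a web page.

```javascript
// content.js
// Placeholder text; the exact wording of the alert is an assumption of this sketch.
const greeting = "Hello from your Chrome extension!";

// alert() exists on web pages; the guard keeps the file harmless elsewhere.
if (typeof alert === "function") {
  alert(greeting);
}
```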

To inject the script, we need to tell our manifest.json file about it.

Add this to your manifest.json file:
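
A sketch of the entry in question, in the Manifest V2 format: the content_scripts key goes inside the existing top-level object, next to name and version. The outer braces are shown only so the fragment is valid JSON on its own.

```json
{
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["content.js"]
    }
  ]
}
```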

This tells Chrome to inject content.js into every page we visit using the special <all_urls> URL pattern. If we want to inject the script on only some pages, we can use match patterns. Here are a few examples of values for 'matches':

  • ['https://mail.google.com/*', 'http://mail.google.com/*'] injects our script into HTTPS and HTTP Gmail. If we have / at the end instead of /*, it matches the URLs exactly, and so would only inject into https://mail.google.com/, not https://mail.google.com/mail/u/0/#inbox. Usually that isn’t what you want.
  • http://*/* will match any http URL, but no other scheme. For example, this won’t inject your script into https sites.
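
To get a feel for the examples above, here is a deliberately simplified matcher. It is not Chrome's implementation: matchesPattern is a hypothetical helper that treats * as "any run of characters" and everything else as literal text, and it ignores the extra rules Chrome applies (for example to <all_urls> and to hosts that start with "*.").

```javascript
// Simplified illustration of comparing a match pattern with a URL.
// Assumption: "*" matches any run of characters, everything else is literal.
function matchesPattern(pattern, url) {
  const escaped = pattern
    .split("*")
    // Escape regex metacharacters in the literal parts of the pattern.
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + "$").test(url);
}

const inbox = "https://mail.google.com/mail/u/0/#inbox";
console.log(matchesPattern("https://mail.google.com/*", inbox)); // true
console.log(matchesPattern("https://mail.google.com/", inbox)); // false
console.log(matchesPattern("http://*/*", "https://example.com/")); // false
```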

Reload your Chrome extension. Every single page you visit now pops up an alert. Let’s log the first URL on the page instead.

Logging the URL

jQuery isn’t necessary, but it makes everything easier. First, download a version of jQuery from the jQuery CDN and put it in your extension’s folder. I downloaded the latest minified version, jquery-2.1.3.min.js. To load it, add it to manifest.json before 'content.js'. Your whole manifest.json should look like this:
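
Put together, the manifest at this stage might look like the following sketch (Manifest V2 format; name and version are placeholders of ours). The order of the "js" array matters, because the files are injected in the order listed.

```json
{
  "manifest_version": 2,
  "name": "My Cool Extension",
  "version": "0.1",
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["jquery-2.1.3.min.js", "content.js"]
    }
  ]
}
```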

Now that we have jQuery, let’s use it to log the URL of the first external link on the page in content.js:
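
One way to write this is sketched below. It assumes that "external link" can be approximated by "an anchor whose href starts with http", which is our simplification. firstExternalHref is a hypothetical helper that receives jQuery as a parameter so the lookup can be tried out separately from the page.

```javascript
// content.js
// Assumption: a link whose href starts with "http" counts as external.
const EXTERNAL_LINK_SELECTOR = "a[href^='http']";

// Returns the href of the first matching link, or undefined if there is none.
function firstExternalHref($) {
  return $(EXTERNAL_LINK_SELECTOR).eq(0).attr("href");
}

// jQuery is only defined when jquery-2.1.3.min.js was injected before this file.
if (typeof jQuery === "function") {
  console.log(firstExternalHref(jQuery));
}
```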

Note that we don’t need to use jQuery to check if the document has loaded. By default, Chrome injects content scripts after the DOM is complete.

Try it out - you should see the output in your console on every page you visit.

Browser Actions

When an extension adds a little icon next to your address bar, that’s a browser action. Your extension can listen for clicks on that button and then do something.

Put the icon.png from Google’s extension tutorial in your extension folder and add this to manifest.json:
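
A sketch of the browser action entry in the Manifest V2 format; as before, the key is added to the existing top-level object, and the outer braces are only there to keep the fragment valid JSON.

```json
{
  "browser_action": {
    "default_icon": "icon.png"
  }
}
```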

In order to use the browser action, we need to add message passing.

Message passing

A content script has access to the current page, but is limited in the APIs it can access. For example, it cannot listen for clicks on the browser action. We need to add a different type of script to our extension, a background script, which has access to every Chrome API but cannot access the current page. As Google puts it:

Content scripts have some limitations. They cannot use chrome.* APIs, with the exception of extension, i18n, runtime, and storage.

So the content script will be able to pull a URL out of the current page, but will need to hand that URL over to the background script to do something useful with it. In order to communicate, we’ll use what Google calls message passing, which allows scripts to send and listen for messages. It is the only way for content scripts and background scripts to interact.

Add the following to tell manifest.json about the background script:
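
In the Manifest V2 format used here, the background entry might look like this sketch (again a key inside the existing top-level object):

```json
{
  "background": {
    "scripts": ["background.js"]
  }
}
```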

Now we’ll add background.js:
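
A sketch of what background.js could contain, using the Manifest V2 chrome.browserAction API. notifyActiveTab is a hypothetical helper of ours; it takes the chrome object as a parameter so the logic can be exercised with a stand-in object.

```javascript
// background.js
// Looks up the active tab in the current window and sends it a message.
function notifyActiveTab(chromeApi) {
  chromeApi.tabs.query({ active: true, currentWindow: true }, function (tabs) {
    const activeTab = tabs[0];
    chromeApi.tabs.sendMessage(activeTab.id, { message: "clicked_browser_action" });
  });
}

// Register the click listener only when running inside the extension.
if (typeof chrome !== "undefined" && chrome.browserAction) {
  chrome.browserAction.onClicked.addListener(function () {
    notifyActiveTab(chrome);
  });
}
```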

This sends an arbitrary JSON payload to the current tab. The keys of the JSON payload can be anything, but I chose 'message' for simplicity. Now we need to listen for that message in content.js:
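
The listener side might be sketched like this. handleMessage is a hypothetical helper; the link lookup and the logger are passed in as functions, which is our choice for keeping the message check separate from the page, and the selector again assumes that external links are those whose href starts with http.

```javascript
// content.js
// Logs the first external link only when the expected message arrives.
function handleMessage(request, findFirstHref, log) {
  if (request.message === "clicked_browser_action") {
    log(findFirstHref());
  }
}

// Register the listener only when running as a content script.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(function (request) {
    handleMessage(
      request,
      function () { return $("a[href^='http']").eq(0).attr("href"); },
      function (href) { console.log(href); }
    );
  });
}
```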

Notice that all of our previous code has been moved into the listener, so that it is only run when the payload is received. Every time you click the browser action icon, you should see a URL get logged to the console. If it’s not working, try reloading the extension and then reloading the page.

Opening a new tab

We can use the chrome.tabs API to open a new tab:
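
The call itself is chrome.tabs.create. In the sketch below, openInNewTab is a hypothetical wrapper and the URL is just an example value.

```javascript
// Opens the given URL in a new tab using the chrome.tabs API.
function openInNewTab(chromeApi, url) {
  chromeApi.tabs.create({ url: url });
}

// chrome.tabs is only available in privileged extension scripts.
if (typeof chrome !== "undefined" && chrome.tabs) {
  openInNewTab(chrome, "https://www.example.com/");
}
```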

But chrome.tabs can only be used by background.js, so we’ll have to add some more message passing since background.js can open the tab, but can’t grab the URL. Here’s the idea:

  1. Listen for a click on the browser action in background.js. When it’s clicked, send a clicked_browser_action event to content.js.
  2. When content.js receives the event, it grabs the URL of the first link on the page. Then it sends open_new_tab back to background.js with the URL to open.
  3. background.js listens for open_new_tab and opens a new tab with the given URL when it receives the message.

Clicking on the browser action will trigger background.js, which will send a message to content.js, which will send a URL back to background.js, which will open a new tab with the given URL.

First, we need to tell content.js to send the URL to background.js. Change content.js to use this code:
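
A sketch of the updated content.js: instead of logging, it replies with an open_new_tab message. buildOpenTabMessage is a hypothetical helper, and the selector keeps the earlier assumption that external links have an href starting with http.

```javascript
// content.js
// Builds the reply for background.js, or returns null for unrelated messages.
function buildOpenTabMessage(request, findFirstHref) {
  if (request.message !== "clicked_browser_action") {
    return null;
  }
  return { message: "open_new_tab", url: findFirstHref() };
}

// Register the listener only when running as a content script.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(function (request) {
    const reply = buildOpenTabMessage(request, function () {
      return $("a[href^='http']").eq(0).attr("href");
    });
    if (reply) {
      chrome.runtime.sendMessage(reply);
    }
  });
}
```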

Now we need to add some code to tell background.js to listen for that event:
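
On the background side, the extra listener could look like the sketch below, added to background.js alongside the click handler. handleOpenNewTab is a hypothetical helper that takes the chrome object as a parameter.

```javascript
// background.js
// Opens a new tab when content.js asks for one.
function handleOpenNewTab(request, chromeApi) {
  if (request.message === "open_new_tab") {
    chromeApi.tabs.create({ url: request.url });
  }
}

// Register the listener only when running inside the extension.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(function (request) {
    handleOpenNewTab(request, chrome);
  });
}
```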

Now when you click on the browser action icon, it opens a new tab with the first external URL on the page.

Wrapping it up

The full content.js and background.js are above. Here’s the full manifest.json:
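
Combining the pieces from the earlier steps, the complete manifest might look like this sketch (Manifest V2 format; name and version are placeholders of ours):

```json
{
  "manifest_version": 2,
  "name": "My Cool Extension",
  "version": "0.1",
  "background": {
    "scripts": ["background.js"]
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["jquery-2.1.3.min.js", "content.js"]
    }
  ],
  "browser_action": {
    "default_icon": "icon.png"
  }
}
```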

And here’s the full directory structure:
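
Based on the files mentioned in this walkthrough, the layout would be roughly as follows; the directory name is a placeholder of ours.

```text
my-extension/
├── background.js
├── content.js
├── icon.png
├── jquery-2.1.3.min.js
└── manifest.json
```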

More on how to make a Chrome extension

For more information, try the official Chrome extension documentation.