Scrape Websites with PhantomJS
Web scraping, also known as data extraction or web harvesting, is paramount to many businesses that rely on data from different websites to craft effective marketing strategies or make well-informed decisions. While harvesting publicly available data is generally legal, not every website owner is pleased to have their data scraped.
Therefore, you must use dedicated tools or APIs to extract data from a website. The use of headless browsers is another way to accomplish this.
A headless browser is a web browser without a graphical user interface (GUI). It allows you to control a web page from the command line or via scripting. A good example of a headless browser is PhantomJS, which is commonly used in web scraping.
But how to use PhantomJS efficiently? What does PhantomJS require for optimal functioning? Let’s take a closer look at it.
What Is PhantomJS?
PhantomJS is a headless browser. In simple terms, it lacks a graphical user interface. Instead, it runs and accesses web pages through a command line or programmatically, allowing faster navigation and automation.
Besides web scraping, you can also use it to automate specific tasks. These may include data collection or code testing.
Getting PhantomJS on your desktop is straightforward: go to the PhantomJS website and download the package for your platform. PhantomJS is available for FreeBSD, Linux, macOS and Windows.
How does a headless browser differ from a web scraping API?
A web scraping API is a cloud-based service that uses proprietary technology to extract structured data from any website. APIs are specialized for a particular database, program, or website, so the data you get from them is more organized and structured.
A headless browser has a broader scope since it isn't tied to a specific source. However, what it returns is the raw page content (the HTML and rendered DOM), which you have to parse and structure yourself.
What Can You Do With PhantomJS?
PhantomJS has many use cases. Here are the most common ones.
Page automation
You can use the DOM API to extract information from web pages. Or, you can use a library, such as jQuery, to manipulate the page.
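For instance, a minimal PhantomJS script along these lines (the URL is just a placeholder) uses page.evaluate to read the page title and count the links on the page:

// page-automation.js: a minimal sketch, run with the phantomjs binary
var page = require('webpage').create();
page.open('https://example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
  } else {
    // page.evaluate runs inside the page, so the DOM API is available directly
    var info = page.evaluate(function () {
      return {
        title: document.title,
        links: document.querySelectorAll('a').length
      };
    });
    console.log('Title: ' + info.title + ' (' + info.links + ' links)');
    phantom.exit();
  }
});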
Headless testing
You can use it to automate tests on web pages, such as A/B testing. The process is called headless testing since you can run tests without interacting with the website. PhantomJS lets you run functional tests using QUnit, WebDriver, Mocha and other similar frameworks.
Since there's no graphical user interface, tests run faster. Plus, errors are reported directly on the command line.
Many developers use PhantomJS in combination with a Continuous Integration system to test code before it goes live. It helps them catch potential bugs before they cause damage.
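As a rough sketch of the idea (the URL and the assertion are placeholders, not a full test framework), a PhantomJS script can load a page, check something about it, and exit with a non-zero status code so the CI job fails when the check does:

// smoke-test.js: hypothetical smoke test; CI treats a non-zero exit code as a failure
var page = require('webpage').create();
page.open('https://example.com/', function (status) {
  if (status !== 'success') {
    console.log('FAIL: page did not load');
    phantom.exit(1);
  } else {
    var ok = page.evaluate(function () {
      // Placeholder assertion: the page renders a top-level heading
      return document.querySelector('h1') !== null;
    });
    console.log(ok ? 'PASS' : 'FAIL: no h1 element found');
    phantom.exit(ok ? 0 : 1);
  }
});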
Network monitoring
PhantomJS is also well suited to network monitoring. You can use it to track how a page loads or examine the requests it makes. Combined with tools like Jenkins and YSlow, it makes it easy to automate website performance analysis.
For example, you can extract the price fluctuation data if you want to watch the stock market. Similarly, brands and businesses can scrape social media websites to get engagement data for influencers.
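At a basic level, PhantomJS exposes onResourceRequested and onResourceReceived callbacks that let you log every request a page makes and roughly how long each response takes. Here's a minimal sketch (the URL is a placeholder):

// network-log.js: a minimal sketch of request/response logging
var page = require('webpage').create();
var startTimes = {};

// Log every request the page makes
page.onResourceRequested = function (requestData) {
  startTimes[requestData.id] = Date.now();
  console.log('Request  #' + requestData.id + ': ' + requestData.url);
};

// Log how long each response took once it has fully arrived
page.onResourceReceived = function (response) {
  if (response.stage === 'end' && startTimes[response.id]) {
    console.log('Response #' + response.id + ': ' + (Date.now() - startTimes[response.id]) + ' ms');
  }
};

page.open('https://example.com/', function () {
  phantom.exit();
});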
Screen capture
PhantomJS screen capture is useful for taking screenshots of websites and pages. You can capture single or multiple web pages with a few lines of code.
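For example, a script along these lines (the URL and viewport size are example values) loads a page and saves it as a PNG:

// screenshot.js: a minimal screen-capture sketch
var page = require('webpage').create();
page.viewportSize = { width: 1280, height: 800 }; // example viewport
page.open('https://example.com/', function (status) {
  if (status === 'success') {
    page.render('screenshot.png'); // PNG, JPEG, GIF and PDF output are supported
  }
  phantom.exit();
});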
How to Scrape Websites With PhantomJS?
Node.js developers frequently use PhantomJS to scrape websites. It's easy, fast and efficient. The example below shows how to fetch the HTML content of a page from its URL.
Step 1: Set up package.json
The package.json file forms the crux of a Node.js project. It contains metadata about your project and helps you organize the app's dependencies.
Even if you don't plan to publish your work to npm, you should still set up package.json, because npm uses it to keep track of your project's dependencies. You can find package.json in the root directory of your project.
Step 2: Install npm packages
The Node Package Manager is the standard package manager for Node. It offers an extensive collection of packages that you can use to speed up your development process.
You can install Node.js and npm together using the official installers from nodejs.org; npm ships with Node.js.
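If you plan to follow the script in Step 4, you'll also need the phantom bridge package from npm. A typical setup looks something like this (the -y flag simply accepts npm's defaults):

$ npm init -y          # generates a starter package.json (see the next step)
$ npm install phantom  # installs the phantom bridge used by the script below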
Step 3: Create a project folder with a package.json file
The package.json must have two fields: name and version. The name field has your package’s name. It must be one word and lowercase. You can add underscores or hyphens to it.
The version field should be according to the semantic versioning guidelines. It has to be in x.x.x form. You can also add an author field if you want. Here’s how an example looks.
{
  "name": "my-first-package",
  "version": "1.0.0",
  "author": "Your Name <[email protected]>"
}
Step 4: Create a script
Next, you have to create a PhantomJS script. It should be named index.js and placed in the root directory of your project. The code for the script is given below.
const phantom = require('phantom');

const main = async () => {
  const instance = await phantom.create();
  const page = await instance.createPage();

  // Log every resource the page requests while loading
  await page.on('onResourceRequested', function (requestData) {
    console.info('Requesting', requestData.url);
  });

  const url = 'https://webmd.com/';
  console.log('URL::', url);

  const status = await page.open(url);
  console.log('STATUS::', status);

  const content = await page.property('content');
  console.log('CONTENT::', content);

  await instance.exit();
};

main().catch(console.log);
Step 5: Run the script
Run $ node index.js in the terminal. The script will print the HTML content of the page you specified above.
How to Use PhantomJS to Scrape Websites With a Login Page
Many websites are password-protected. You'll need to tweak your code a little for such websites. Here's how you set up the page and the login and home URLs for the site you want to scrape:
var page = require('webpage').create();
var login = 'https://website.com/login';
var home = 'https://website.com/home';
After that, enter this code:
page.open(login, function (status) {
  if (status !== 'success') {
    console.log('fail!');
    phantom.exit(1);
  } else {
    // Fill in and submit the login form (assumes the page already loads jQuery)
    page.evaluate(function () {
      $("input[name=email]").val("user");
      $("input[name=password]").val("pass");
      $("input[type=submit]").click();
    });
    // Give the login request a moment to complete, then open the page behind the login
    setTimeout(function () {
      page.open(home, function (status) {
        if (status !== 'success') {
          console.log('fail2');
          phantom.exit(1);
          return;
        }
        // Mark the page so the screenshot shows we reached it, then capture it
        page.evaluate(function () {
          $('body').css('border', '1px solid red');
        });
        page.render('page.png');
        console.log('finished!');
        phantom.exit();
      });
    }, 500);
  }
});
The page.render command takes a screenshot of the web page when you call it. The page.evaluate command runs the supplied function inside the page's own context, so you can use the page's DOM, or its jQuery if the site loads it, to perform any manipulation you need.
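Note that the evaluate calls above assume the login page already loads jQuery. If it doesn't, one option is to pull jQuery in with page.includeJs before filling in the form; here's a rough sketch (the CDN URL is just an example):

page.open(login, function (status) {
  // Load jQuery from a CDN into the page, then fill in and submit the form as before
  page.includeJs('https://code.jquery.com/jquery-3.6.0.min.js', function () {
    page.evaluate(function () {
      $("input[name=email]").val("user");
      $("input[name=password]").val("pass");
      $("input[type=submit]").click();
    });
  });
});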
Pros of PhantomJS
Here are some notable benefits of using PhantomJS:
- Availability: PhantomJS is available for all platforms, so you don’t need to worry about compatibility with your operating system.
- Speed: PhantomJS is faster than other scraping tools. It does the job in no time and makes it easier for developers to work with.
- Software Integration: You can use several testing tools, such as CasperJS and Mocha, with PhantomJS. It also works in a Jenkins-based CI environment. Alternatively, you can write JavaScript unit tests with Jasmine, execute them with PhantomJS, and integrate them into TeamCity or a similar continuous build environment.
Cons of PhantomJS
Although PhantomJS offers several advantages, there are some drawbacks to consider:
- Memory Usage: Since PhantomJS is a full-fledged browser, it might take up more memory than other tools.
- Not a Testing Framework: Keep in mind that you can run unit tests using PhantomJS, but it’s not a testing framework in itself. So, you’ll need to use other tools like Casper or Mocha in conjunction with it.
PhantomJS Alternative: How Else to Scrape Websites?
As you can see, scraping with PhantomJS and Node.js takes real effort. Building web scrapers in Rust or any other programming language requires similar skill and expertise.
Luckily, you can skip this hassle with Scraping Robot. Since the web scraper is fully automated, you only have to provide the URL and watch the scraper take charge.
We also handle proxy management and rotation for you, ensuring your scraping jobs are always successful. Furthermore, the scraper can be used to scrape data from any website, regardless of the complexity involved. Here are the other things we do for you:
- Updating Scraping Robot frequently
- Metadata parsing
- Monitoring anti-scraping updates of your target websites
- CAPTCHA solving
With Scraping Robot, you essentially get PhantomJS as a service minus the need to code. We provide 24/7 support in case you run into any trouble. You can also see usage statistics to determine your scraping rate for your preferred period.
Are you interested in giving Scraping Robot a shot to simplify web scraping? Sign up today, and you'll get 5,000 free scrapes per month. Once you get the hang of things and want to use our scraper more often, you can bump your plan up to the Business tier with 500,000 scrapes per month.
The best part? Each scrape costs as little as $0.0018. If you want an even better bang for your buck, choose our Enterprise tier, with prices as low as $0.00045 per scrape.
Speed Up Your Scraping Process
Now you know how to use PhantomJS for web scraping. But you also know how tricky and time-consuming it can be, and you'll need intermediate-level coding knowledge to use it to its full potential.
With Scraping Robot, you get the best of both worlds: you never have to touch a PhantomJS terminal, and you speed up the scraping process at the same time. Take advantage of our 5,000 free scrapes per month and see why data extraction is easier with us.