Updated on Kisan Patel
This post explains how to scrape data from a web page using the cURL library in PHP.
When we make a cURL request, the server responds with the requested resource, in this case the HTML source of the web page. We are then free to scrape whatever data we need from that HTML.
Here, we create a function called curlGet(), which accepts a single parameter, $url.
function curlGet($url) {
    $ch = curl_init(); // Initialising cURL session

    // Setting cURL options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);

    $results = curl_exec($ch); // Executing cURL session
    curl_close($ch); // Closing cURL session

    return $results; // Return the results
}
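We execute the cURL request and store the returned string in the $results variable, as shown in the line below. Because CURLOPT_RETURNTRANSFER is set, curl_exec() returns the response body as a string instead of printing it: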
$results = curl_exec($ch);
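One thing worth knowing here is that curl_exec() returns false when the request fails, so $results is not always a string. The sketch below is one possible way to surface such failures. The function name curlGetOrFail(), the timeout values, and the decision to treat HTTP status codes of 400 and above as errors are assumptions made for illustration, not part of the original curlGet().

```php
<?php
// Hypothetical variant of curlGet() that reports failures instead of
// returning false. The name, timeouts, and status check are assumptions.
function curlGetOrFail(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // assumed: 5 seconds to connect
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);       // assumed: 15 seconds overall

    $body = curl_exec($ch);
    if ($body === false) {
        // Transport-level failure (DNS, refused connection, timeout, ...)
        $message = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL request failed: $message");
    }

    // Status of the final response after any redirects
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status $status for $url");
    }

    return $body;
}
```

A caller would wrap curlGetOrFail() in a try/catch block and decide whether to retry, log, or skip the page.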
With the curlGet() function we have requested and downloaded a web page. Now we need to scrape the data we require from it.
XPath can be used to navigate through the elements of an XML document. So we parse the web page into a DOM document, against which we run XPath queries to scrape the required elements based on their tags and attributes, such as CSS classes and IDs.
// Function to return XPath object
function returnXPathObject($htmlPage) {
    $xmlPageDom = new DOMDocument(); // Instantiating a new DOMDocument object
    @$xmlPageDom->loadHTML($htmlPage); // Loading the HTML from downloaded page
    $xmlPageXPath = new DOMXPath($xmlPageDom); // Instantiating new XPath DOM object

    return $xmlPageXPath; // Returning XPath object
}
The returnXPathObject() function above accepts a single parameter, $htmlPage, parses it into a DOM document, and returns a DOMXPath object for querying that document.
Now we are free to scrape the required data from the parsed document.
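To make the idea of selecting by IDs and CSS classes concrete, here is a minimal sketch that parses an inline HTML string and queries it both ways. The markup, the id "main-title", and the class names are invented for illustration. The sketch also uses libxml_use_internal_errors() rather than the @ operator to keep parser warnings quiet, which is a swapped-in alternative and not the approach used elsewhere in this post.

```php
<?php
// Hypothetical markup: the id and class names are invented for this sketch.
$html = <<<HTML
<!doctype html>
<html>
<body>
  <h1 id="main-title">Great Title</h1>
  <p class="price featured">19.99</p>
  <p class="price">4.50</p>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // Collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select by id
$titleNodes = $xpath->query('//*[@id="main-title"]');
$title = $titleNodes->length > 0 ? trim($titleNodes->item(0)->nodeValue) : null;

// Select by class. An element can carry several classes, so the class
// attribute is padded with spaces and searched for the whole token.
$prices = [];
$query = '//p[contains(concat(" ", normalize-space(@class), " "), " price ")]';
foreach ($xpath->query($query) as $node) {
    $prices[] = trim($node->nodeValue);
}

print_r(['title' => $title, 'prices' => $prices]);
```

The concat/normalize-space pattern is needed because XPath 1.0 has no built-in class selector, and a plain contains(@class, "price") would also match a class such as "price-old".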
For example, suppose we want to scrape the following HTML file, saved on the local computer and served by a local web server.
index.html
<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>Scrapping Demo</title>
</head>
<body>
    <h1>Great Title</h1>
</body>
</html>
We use the code below to scrape the index.html page:
$scrapData = array(); // Declaring array to store scraped data

// Calling function curlGet and storing returned results in $scrapPage variable
$scrapPage = curlGet('http://localhost/test/scrap/index.html');

$dataPageXpath = returnXPathObject($scrapPage); // Instantiating new XPath DOM object

$title = $dataPageXpath->query('//h1'); // Querying for <h1> (title of page)

// If title exists
if ($title->length > 0) {
    $scrapData['title'] = $title->item(0)->nodeValue; // Add title to array
}

print_r($scrapData);
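The query above reads only the first match. When a page contains several matching elements, the node list returned by query() can be iterated with foreach, and attributes can be read with getAttribute(). The following is a small sketch under assumed markup: the links are hypothetical and are not part of index.html.

```php
<?php
// Hypothetical markup: a page with several links (not part of index.html).
$html = '<html><body>'
      . '<a href="/first">First</a>'
      . '<a href="/second">Second</a>'
      . '<a>No target</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // Suppressing parser warnings, as in returnXPathObject()
$xpath = new DOMXPath($dom);

$links = array();

// //a[@href] matches only anchors that have an href attribute
foreach ($xpath->query('//a[@href]') as $anchor) {
    $links[] = array(
        'text' => trim($anchor->nodeValue),      // Visible link text
        'href' => $anchor->getAttribute('href'), // Link target
    );
}

print_r($links);
```

Here the third anchor is skipped because it has no href, so $links ends up with two entries.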
The complete code:
<?php
// Function to make GET request using cURL
function curlGet($url) {
    $ch = curl_init(); // Initialising cURL session

    // Setting cURL options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);

    $results = curl_exec($ch); // Executing cURL session
    curl_close($ch); // Closing cURL session

    return $results; // Return the results
}

$scrapData = array(); // Declaring array to store scraped data

// Function to return XPath object
function returnXPathObject($htmlPage) {
    $xmlPageDom = new DOMDocument(); // Instantiating a new DOMDocument object
    @$xmlPageDom->loadHTML($htmlPage); // Loading the HTML from downloaded page
    $xmlPageXPath = new DOMXPath($xmlPageDom); // Instantiating new XPath DOM object

    return $xmlPageXPath; // Returning XPath object
}

// Calling function curlGet and storing returned results in $scrapPage variable
$scrapPage = curlGet('http://localhost/test/scrap/index.html');

$dataPageXpath = returnXPathObject($scrapPage); // Instantiating new XPath DOM object

$title = $dataPageXpath->query('//h1'); // Querying for <h1> (title of page)

// If title exists
if ($title->length > 0) {
    $scrapData['title'] = $title->item(0)->nodeValue; // Add title to array
}

print_r($scrapData);
?>