Scrape Data from a Web Page Using the cURL Library in PHP

Updated on     Kisan Patel

This post explains how to scrape data from a web page using the cURL library in PHP.

When we make a cURL request, the server responds with the requested resource, here the HTML code of the web page. We are then free to scrape the data we require from that HTML.

Here, we create a function called curlGet(), which accepts a single parameter, $url.

function curlGet($url) {
     $ch = curl_init(); // Initialising cURL session

     // Setting cURL options
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
     curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
     curl_setopt($ch, CURLOPT_URL, $url);
     $results = curl_exec($ch); // Executing cURL session
     curl_close($ch); // Closing cURL session
     return $results; // Return the results
}
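
Note that curl_exec() returns false when the transfer fails, so curlGet() as written can pass a boolean on to the parsing step. A minimal sketch of a stricter variant follows. The name curlGetChecked(), the timeout values, and the choice to throw a RuntimeException are illustrative assumptions, not part of the original function.

```php
<?php
// Hypothetical stricter variant of curlGet(); the name and timeouts are assumptions.
function curlGetChecked(string $url): string {
    $ch = curl_init(); // Initialising cURL session

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // Seconds to wait for a connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);       // Seconds allowed for the whole transfer
    curl_setopt($ch, CURLOPT_URL, $url);

    $results = curl_exec($ch); // String on success, false on failure

    if ($results === false) {
        // curl_error() describes the last failure on this handle
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // Treat HTTP error statuses (4xx and 5xx) as failures as well
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException('Unexpected HTTP status: ' . $status);
    }

    // The handle is released when $ch goes out of scope
    return $results;
}
```

A caller would wrap curlGetChecked() in a try/catch block and decide whether to retry, log, or skip the page when the exception is thrown.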

Inside the function, we execute the cURL request and store the returned string in the $results variable, as shown in the line below:

$results = curl_exec($ch);

By using the curlGet() function, we have requested and downloaded a web page. Now we need to scrape the data that we require.

XPath can be used to navigate through the elements of an XML document. So we load the downloaded HTML into a DOM document, and then use XPath to select the required elements based on their tags and attributes, such as class and id.

// Function to return XPath object
function returnXPathObject($htmlPage) {
    $xmlPageDom = new DOMDocument(); // Instantiating a new DOMDocument object
    @$xmlPageDom->loadHTML($htmlPage); // Loading the HTML from downloaded page
    $xmlPageXPath = new DOMXPath($xmlPageDom); // Instantiating new XPath DOM object
    return $xmlPageXPath; // Returning XPath object
}

The returnXPathObject() function above accepts a single parameter, $htmlPage, loads it into a DOM document, and returns an XPath object for querying it.
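
The @ operator in returnXPathObject() hides every warning raised by loadHTML(). One possible alternative, sketched below, is to let libxml buffer its own warnings with libxml_use_internal_errors(). The function name returnXPathObjectQuiet() is hypothetical and the sample markup is only for illustration.

```php
<?php
// Hypothetical variant of returnXPathObject() that buffers parser warnings
// through libxml instead of silencing them with @.
function returnXPathObjectQuiet(string $htmlPage): DOMXPath {
    $previous = libxml_use_internal_errors(true); // Buffer libxml warnings
    $xmlPageDom = new DOMDocument();
    $xmlPageDom->loadHTML($htmlPage);             // Loading the HTML
    libxml_clear_errors();                        // Discard the buffered warnings
    libxml_use_internal_errors($previous);        // Restore the earlier setting
    return new DOMXPath($xmlPageDom);
}

$xpath = returnXPathObjectQuiet('<html><body><h1>Great Title</h1></body></html>');
echo $xpath->query('//h1')->item(0)->nodeValue, PHP_EOL;
```

If the warnings are of interest, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.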

Now we are free to scrape the required data from the document.
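
As a rough illustration of selecting elements by ID and by CSS class, the sketch below runs two XPath queries against a small inline document. The markup, the id "main", and the class names are assumptions made up for this example.

```php
<?php
// Hypothetical markup used only for this illustration.
$html = '<html><body>'
      . '<div id="main">'
      . '<p class="price featured">10</p>'
      . '<p class="price">20</p>'
      . '<p class="pricey">30</p>'
      . '</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the element whose id attribute is "main"
$main = $xpath->query('//*[@id="main"]');

// Select <p> elements that carry the class "price". Padding the class list
// with spaces makes the match apply to whole class names only.
$prices = $xpath->query(
    '//p[contains(concat(" ", normalize-space(@class), " "), " price ")]'
);

$values = array();
foreach ($prices as $node) {
    $values[] = $node->nodeValue; // Collect the text of each matching node
}

print_r($values);
```

A plain contains(@class, "price") test would also match the "pricey" paragraph, which is why the class list is padded with spaces before matching.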

For example, suppose we want to scrape the following HTML file, saved on the local computer and served from localhost.

index.html

<!doctype html>
<html lang="en">
  <head>
     <meta charset="utf-8">
     <title>Scrapping Demo</title>
  </head>
  <body>
     <h1>Great Title</h1>
  </body>
</html>

Now we use the code below to scrape the index.html page:

$scrapData = array(); // Declaring array to store scraped data

$scrapPage = curlGet('http://localhost/test/scrap/index.html');
// Calling function curlGet and storing returned results in $scrapPage variable

$dataPageXpath = returnXPathObject($scrapPage); // Instantiating new XPath DOM object

$title = $dataPageXpath->query('//h1'); // Querying for <h1> (title of page)

// If title exists
if ($title->length > 0) {
     $scrapData['title'] = $title->item(0)->nodeValue; // Add title to array
}

print_r($scrapData);
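
The "if title exists" check tends to repeat for every field that is scraped. One way to factor it out is a small helper, sketched below. The name firstNodeText() and the '//h2' query are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical helper: text of the first node matching $query, or null if none.
function firstNodeText(DOMXPath $xpath, string $query): ?string {
    $nodes = $xpath->query($query);
    // query() returns false for an invalid expression
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue); // Strip surrounding whitespace
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><h1> Great Title </h1></body></html>');
$xpath = new DOMXPath($dom);

$scrapData = array();
$scrapData['title'] = firstNodeText($xpath, '//h1');    // Text of the <h1>
$scrapData['subtitle'] = firstNodeText($xpath, '//h2'); // null, the page has no <h2>

var_dump($scrapData);
```

Returning null for a missing element keeps the array keys stable, so later code can tell "not found" apart from an empty string.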


The complete code:

<?php
// Function to make GET request using cURL
function curlGet($url) {
    $ch = curl_init(); // Initialising cURL session
 
    // Setting cURL options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    $results = curl_exec($ch); // Executing cURL session
    curl_close($ch); // Closing cURL session
    return $results; // Return the results
}

$scrapData = array(); // Declaring array to store scraped data
// Function to return XPath object
function returnXPathObject($htmlPage) {
    $xmlPageDom = new DOMDocument(); // Instantiating a new DOMDocument object
    @$xmlPageDom->loadHTML($htmlPage); // Loading the HTML from downloaded page
    $xmlPageXPath = new DOMXPath($xmlPageDom); // Instantiating new XPath DOM object
    return $xmlPageXPath; // Returning XPath object
}

$scrapPage = curlGet('http://localhost/test/scrap/index.html');
// Calling function curlGet and storing returned results in $scrapPage variable

$dataPageXpath = returnXPathObject($scrapPage); // Instantiating new XPath DOM object

$title = $dataPageXpath->query('//h1'); // Querying for <h1>(title of page)

// If title exists
if ($title->length > 0) {
    $scrapData['title'] = $title->item(0)->nodeValue; // Add title to array
}

print_r($scrapData);

?>
