Extract data from a PDF

#linux #php #pdf

One of our providers gives us some data as a PDF, and I have to produce a JSON object from it for further processing.

The textual information was no problem: I used pdftotext to extract the text.

$content = shell_exec('pdftotext -enc UTF-8 -layout input.pdf -');

Then I used regular expressions to extract the data:

 $anagrafica=array();
 if(preg_match('/^Denominazione\W*(.*)/m', $content, $aDenominazione)) {
     $anagrafica['denominazione']=$aDenominazione[1];
 }
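The same pattern extends to several fields. Here is a minimal sketch, assuming every label sits at the start of a line as in the snippet above. The `extractFields` helper, the `Indirizzo` label and the sample text are hypothetical, introduced only for illustration.

```php
<?php
// Hypothetical helper: maps JSON keys to the labels found in the pdftotext output.
function extractFields(string $content, array $labels): array
{
    $result = [];
    foreach ($labels as $key => $label) {
        // Same regular expression as above, with the label made variable.
        $pattern = '/^' . preg_quote($label, '/') . '\W*(.*)/m';
        if (preg_match($pattern, $content, $m)) {
            $result[$key] = trim($m[1]);
        }
    }
    return $result;
}

// Made-up sample text standing in for the pdftotext output.
$sample = "Denominazione:   ACME S.r.l.\nIndirizzo:   Via Roma 1\n";
$anagrafica = extractFields($sample, [
    'denominazione' => 'Denominazione',
    'indirizzo'     => 'Indirizzo',
]);

echo json_encode($anagrafica, JSON_PRETTY_PRINT), "\n";
```

Labels that are missing from the text are simply skipped, so the resulting array only contains the fields that were actually found.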

But how to extract the data of the semaphores (traffic-light indicators), which are images without labels?

I used the Linux command pdftohtml:

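$images = [];
// escapeshellarg() protects against spaces and quotes in the file path.
$rawImages = shell_exec('pdftohtml -enc UTF-8 -noframes -stdout -xml '.escapeshellarg($this->filePath).' - | grep image');
// shell_exec() returns null when grep finds nothing, hence the string cast.
$tok = strtok((string) $rawImages, "\r\n");
while ($tok !== false) {
    $oImage = simplexml_load_string($tok);
    if ($oImage !== false) {
        $images[] = $oImage;
    }
    $tok = strtok("\r\n");
}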

The output of pdftohtml is an XML document with one element for each text box or image.

$rawImages is a string holding the XML elements of the images, one per line, and I put them as SimpleXMLElement objects in the $images array.

Then I searched the array for the images that are 77 pixels wide and sorted them by vertical position.
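The filtering and sorting step could look like the sketch below. It assumes that each image element carries `top`, `width` and `src` attributes, and that the PDF has a single page (so `top` alone gives the order). The sample elements and file names are made up, and the arrow functions need PHP 7.4 or later.

```php
<?php
// Made-up <image> elements in the shape the pdftohtml XML output is assumed to have.
$lines = [
    '<image top="400" left="500" width="77" height="30" src="input-1_2.png"/>',
    '<image top="120" left="500" width="77" height="30" src="input-1_1.png"/>',
    '<image top="10" left="20" width="200" height="60" src="input-1_3.png"/>',
];
$images = array_map('simplexml_load_string', $lines);

// Keep only the images that are 77 pixels wide.
$semaphores = array_values(array_filter(
    $images,
    fn($img) => (int) $img['width'] === 77
));

// Sort them top to bottom using the "top" attribute.
usort($semaphores, fn($a, $b) => (int) $a['top'] <=> (int) $b['top']);

foreach ($semaphores as $pos => $img) {
    echo $pos, ': ', (string) $img['src'], "\n";
}
```

After sorting, the index `$pos` identifies which semaphore an image belongs to, counting from the top of the page.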

The images are saved in the current directory of the script.

I queried the color of a pixel at a specific position in each image with the convert command from ImageMagick and saved the result in the JSON object.

// trim() removes any stray whitespace around the color string.
$color = trim((string) shell_exec('convert '.escapeshellarg($imagePath).' -format \'%[pixel:p{100,50}]\' info:-'));
switch ($color) {
    case 'srgb(253,78,83)':
        $anagrafica[$this::chekcs[$pos]]='red';
    break;
    case 'srgb(123,196,78)':
        $anagrafica[$this::chekcs[$pos]]='green';
    break;
    case 'srgb(254,211,80)':
        $anagrafica[$this::chekcs[$pos]]='yellow';
    break;
}
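Matching exact color strings breaks as soon as the provider changes a shade slightly. A possible variation, not what the code above does, is to pick the nearest reference color within a tolerance. This is only a sketch: the `classifyColor` helper, the `REFERENCE_COLORS` constant and the tolerance of 30 are assumptions, and it expects the `srgb(r,g,b)` form shown above.

```php
<?php
// Reference colors taken from the switch above.
const REFERENCE_COLORS = [
    'red'    => [253, 78, 83],
    'green'  => [123, 196, 78],
    'yellow' => [254, 211, 80],
];

// Hypothetical helper: returns the name of the closest reference color, or null
// when the string is not in srgb(r,g,b) form or no reference is close enough.
function classifyColor(string $color, int $maxDistance = 30): ?string
{
    if (!preg_match('/^srgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)/', trim($color), $m)) {
        return null;
    }
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach (REFERENCE_COLORS as $name => [$r, $g, $b]) {
        // Sum of the absolute differences of the three channels.
        $distance = abs((int) $m[1] - $r) + abs((int) $m[2] - $g) + abs((int) $m[3] - $b);
        if ($distance < $bestDistance) {
            $best = $name;
            $bestDistance = $distance;
        }
    }
    return $bestDistance <= $maxDistance ? $best : null;
}

var_dump(classifyColor('srgb(250,80,85)'));
```

A `null` result marks an image that matches none of the three colors, which is worth logging rather than silently skipping.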

At this point: is there an easier way to do the trick?
