PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job.
What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.
More specifically, the data I am working with is from grocery store flyers. As it stands now, we have to manually enter every deal in the flyer into a database. A sample flyer is at http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551
What I am hoping to do is have columns for products, price, and predefined options (Loyalty Cards, Coupons, Select Variety... that sort of thing).
Any help would be appreciated, and if I need to be more specific let me know.

After looking at the specific PDF linked to by the OP, I have to say that it does not quite display a typical table format.
It contains many images inside the "cells", and the cells are not all strictly vertically or horizontally aligned.
So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...
Having said that, I'll have to add:
Extracting even 'nice' tables from PDFs in general is extremely difficult...
Standard PDFs do not provide any hints about the semantics of what they draw on a page:
the only distinction the syntax provides is between vector elements (lines, fills, ...), images, and text.
Whether any character is part of a table or part of a line or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)
...but doing so with TabulaPDF works very well!
Having said the above, let me now add this:
For an amazing open source family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages) -- contradicting what I said in my introductory paragraphs! -- check out TabulaPDF. See these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Tabula-Extractor is written in Ruby.
In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs.
To run, Tabula-Extractor requires JRuby 1.7 to be installed.
Installing Tabula-Extractor
I'm using the 'bleeding-edge' version of Tabula-Extractor directly from its GitHub source code repository.
Getting it to work was extremely easy, since on my system JRuby-1.7.4_0 is already present:
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
The Git clone already includes the required libraries, so there is no need to install PDFBox separately.
The command line tool is in the /bin/ subdirectory.
Exploring the command line options:
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs
Usage:
tabula [options] <pdf_file>
where [options] are:
--pages, -p <s>: Comma separated list of ranges, or all. Examples:
--pages 1-3,5-7, --pages 3 or --pages all. Default
is --pages 1 (default: 1)
--area, -a <s>: Portion of the page to analyze
(top,left,bottom,right). Example: --area
269.875,12.75,790.5,561. Default is entire page
--columns, -c <s>: X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
--password, -s <s>: Password to decrypt document. Default is empty
(default: )
--guess, -g: Guess the portion of the page to analyze per page.
--debug, -d: Print detected table areas instead of processing.
--format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
--outfile, -o <s>: Write output to <file> instead of STDOUT (default:
-)
--spreadsheet, -r: Force PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines separating
each cell, as in a PDF of an Excel spreadsheet)
--no-spreadsheet, -n: Force PDF not to be extracted using
spreadsheet-style extraction (if there are ruling
lines separating each cell, as in a PDF of an Excel
spreadsheet)
--silent, -i: Suppress all stderr output.
--use-line-returns, -u: Use embedded line returns in cells. (Only in
spreadsheet mode.)
--version, -v: Print version and exit
--help, -h: Show this message
Extracting the table which the OP wants
I'm not even going to try to extract this ugly table from the OP's monster PDF. I'll leave it as an exercise for those readers who are feeling adventurous enough...
Instead, I'll demo how to extract a 'nice' table. I'll take pages 651-653 from the official PDF-1.7 specification, here represented with screenshots.
I used this command:
~/svn-stuff/git.tabula-extractor/bin/tabula \
-p 651,652,653 -g -n -u -f CSV \
~/Downloads/pdfs/PDF32000_2008.pdf
After importing the generated CSV into LibreOffice Calc, the spreadsheet looks like this:
To me this looks like a perfect extraction of a table that spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
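Since the goal in the question is Excel rather than LibreOffice, the generated CSV can also be converted to an .xlsx file programmatically. Here is a minimal sketch (my addition, not part of Tabula; it assumes Python with pandas and an xlsx writer such as openpyxl installed, and the file names are hypothetical):
import pandas as pd

# Read the CSV produced by tabula-extractor (hypothetical file name)
table = pd.read_csv("extracted_table.csv")

# Write the table to an Excel workbook (requires an xlsx writer such as openpyxl)
table.to_excel("extracted_table.xlsx", index=False, sheet_name="Deals")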
Update
Here is an asciinema screencast (which you can also download and re-play locally in your Linux/Mac OS X/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:

Related

batch file extract numbers from text file with little information

So this is related to my other two posts. I'm dealing with extracting text from a text file and analyzing it, and I've run into some problems. For a while I've been using a method that sets all the text between two other strings as a variable, but here is the situation I have. I need to extract the speed (numbers) from the below string: "etc...,query":{"ping":47855},"cmts":...etc. The problem is that the text cmts sometimes changes to something else, so really I need to extract all the numbers from this:
,query":{"ping":47855},"
One more thing that makes this difficult is that the characters }," are all over the file. Thank you for helping me! -Lucas EDG Programmer.
Here's the full file:
{"_id":53291,"ip":"158.69.22.95","domain":"jectile.com","port":25565,"url":"","date_add":1453897770,"status":1,"scan":1,"uptime":99.53,"last_update":1485436105,"geo":{"country":"US","country_name":"United States","city":"Lake Forest"},"info":{"name":" Jectile | jectile.com [1.8-1.11]\n Shoota (Call of Duty) \/ Zambies (Zombie Survival)","type":"FML","version":"1.10","plugins":[],"players":18,"max_players":420,"players_list":[],"map":"world","software":"BungeeCord 1.8.x, 1.9.x, 1.10.x, 1.11.x","avg_player_day":24.458333,"avg_load_day":5.8234,"platform":"MINECRAFT","icon":true},"counter":{"online":47871,"offline":228,"players":{"date":"2017-01-26","total":0},"last_offline":0,"query":{"ping":47855},"cmts":1},"rating":{"main":19.24,"difference":-0.64,"content_up":0.15,"K":0},"last":{"offline":1485415702,"online":1485436105},"chart":{"14:30":14,"14:40":16,"14:50":15,"15:00":18,"15:10":12,"15:20":13,"15:30":9,"15:40":9,"15:50":11,"16:00":12,"16:10":11,"16:20":11,"16:30":18,"16:40":25,"16:50":23,"17:00":27,"17:10":27,"17:20":23,"17:30":24,"17:40":26,"17:50":33,"18:00":31,"18:10":31,"18:20":32,"18:30":37,"18:40":38,"18:50":39,"19:00":38,"19:10":34,"19:20":33,"19:30":40,"19:40":36,"19:50":37,"20:00":38,"20:10":36,"20:20":38,"20:30":37,"20:40":37,"20:50":37,"21:00":34,"21:10":32,"21:20":33,"21:30":33,"21:40":29,"21:50":28,"22:00":26,"22:10":21,"22:20":24,"22:30":29,"22:40":22,"22:50":23,"23:00":27,"23:10":24,"23:20":26,"23:30":25,"23:40":28,"23:50":27,"00:00":32,"00:10":29,"00:20":33,"00:30":32,"00:40":31,"00:50":33,"01:00":40,"01:10":40,"01:20":40,"01:30":41,"01:40":45,"01:50":48,"02:00":43,"02:10":45,"02:20":46,"02:30":46,"02:40":43,"02:50":42,"03:00":39,"03:10":36,"03:20":44,"03:30":34,"03:40":0,"03:50":32,"04:00":35,"04:10":35,"04:20":33,"04:30":43,"04:40":37,"04:50":26,"05:00":31,"05:10":31,"05:20":27,"05:30":25,"05:40":26,"05:50":18,"06:00":13,"06:10":15,"06:20":17,"06:30":18,"06:40":17,"06:50":15,"07:00":16,"07:10":17,"07:20":16,"07:30":16,"07:40":18,"07:50":19,"08:00":14,"08:10":12,"08:20":12,"08:30":13,"08:40":17,"08:50":20,"09:00":18,"09:10":0,"09:20":0,"09:30":27,"09:40":18,"09:50":20,"10:00":15,"10:10":13,"10:20":12,"10:30":10,"10:40":10,"10:50":11,"11:00":13,"11:10":13,"11:20":16,"11:30":19,"11:40":17,"11:50":13,"12:00":10,"12:10":11,"12:20":12,"12:30":16,"12:40":15,"12:50":16,"13:00":14,"13:10":10,"13:20":13,"13:30":16,"13:40":16,"13:50":17,"14:00":20,"14:10":16,"14:20":16},"query":"ping","max_stat":{"max_online":{"date":1470764061,"players":129}},"status_query":"ok"}
By the way, the reason things change is that it looks at info from different servers.
Very similar to the answer I gave you to your first question:
@Echo Off
Rem Read the first (and only) line of the JSON file into %var%.
Set/P var=<some.json
Rem Remove everything up to and including :{"ping": from the string.
Set var=%var:*:{"ping":=%
Rem Replace }, with &: so that, when the line is expanded, the Set command ends right after the number.
Set var=%var:},=&:%
Echo=%var%
Timeout -1
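For comparison, since the file shown above is valid JSON, the number can also be pulled out by parsing it as JSON rather than by string substitution. A minimal sketch in Python (my addition, not part of the original batch answer; it reads the same some.json file):
import json

# Parse the whole file as JSON instead of doing string surgery
with open("some.json") as f:
    data = json.load(f)

# In the sample file above, the ping value sits at counter -> query -> ping
print(data["counter"]["query"]["ping"])  # 47855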

Output other than .txt

I'm looking to build a simple program that will simply modify existing output files from another program, so I don't have to open that program and enter a bunch of data the long way. The program is very specific to my domain, and its output files use the extension .wcc. However, when I change the extension of one of these output files to .txt, I get half gibberish:
ÿÿ WPointÿÿ WPolygonÿÿ  WQuadrilateralÿÿ  WMemberDataÿÿ
WLoadÿÿ WLStandardMembersÿÿ WLSavedDesignSettingsÿÿ WLSavedFormatSettingsÿÿ  WLSavedViewSettingsÿÿ WLSavedProjectSettingsÿÿ  WLSavedSettingsÿÿ  WLSavedLoadSettingsÿÿ WLSavedDefaultSettingsÿÿ WLineÿÿ WProductÿÿ WBeamDataÿÿ  WColumnDataÿÿ
WJoistDataÿÿ
WWallStudDataÿÿ WSupportingMemberDataÿÿ WSavedAnalysisSettingsÿÿ WSavedGravityDesignSettingsÿÿ WSavedPreferencesSettingsÿÿ WNotchÿÿ WIJoistÿÿ WFloorCWC37 ÀAE LumberS-P-F No.1/No.2 # À# lumwall.cww ÿÿÿÿ1.2.3.1.Mur_1_EX-D ÿÿÿÿÿÿ B Cÿÿ B C €? 4C 4C   Neige #F #F ÈC ÿÿÿ
WLStandardMembersÿÿ "
There are also musical notes and perpendicular signs which I can't copy paste here. I can sorta read the text, but still not enough to make modifications via txt file. What type of file could this be? Is it even possible to do what I'm trying to do? Thanks!
I am surprised that you are trying to open a .wcc file as a text file (its contents - as you will see - don't lend themselves to being converted to such a file type); however, the attempt to open the file as a .txt file seems to be specific to your domain.
I noticed part of your question is as follows: "What type of file could this be?"
You are right in thinking that the .wcc file is a rather obscure file type - we don't think about that file type a lot (or are not conscious of it existing). A .wcc file is a WinCam 2000 Cache file that allows WinCam 2000 movies to be previewed in the slide browser - these were often generated by older WinCam 2000 screen recording and editing programs.
Again, the file extension is very rare these days (a Google search only returns ~700 results). But, it appears you have a program that is producing the file, which - as you are saying - "is quite specific to your domain". You may be out of luck with regard to opening them for modification purposes.
Supposedly, you can convert .wac files to .wav files, which are much more relevant to today's technology (and definitely alterable from code); however, without knowing the purpose of the file, e.g. what you are trying to do with the file domain-side, I can't say that this will suit your needs.
Also, the above comments are "correct": changing a file extension will not convert the file to the file extension's type. Typically a converter - a dedicated piece of software - is needed to convert a file from one format to another.
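If you want to confirm for yourself that the file is a binary format rather than text in an unusual encoding, one quick check is to look at the raw bytes near the start of the file. A minimal sketch in Python (my addition, not specific to .wcc; the file name is hypothetical):
# Peek at the first bytes of the file to see whether it is binary
with open("output.wcc", "rb") as f:
    header = f.read(64)

print(header.hex())   # first 64 bytes as hex
print(header)         # same bytes shown as a (mostly unprintable) byte string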

How can I search in PDF documents/PDX catalog in powershell

I have a vendor that supplies their documentation library as a series of PDF files (and some CHM files) and also includes a .PDX catalog.
I want to write a PowerShell script to front-end it (using either PowerShell forms, or hosting PowerShell in ASP.NET).
I'm in the early stages, I've worked out how to get document information from the PDF stream (the xmpmeta XML metadata block near the end of the PDF file - one of the few streams in the file that's in plaintext) which looks like this:
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <pdf:Producer>GPL Ghostscript 8.64</pdf:Producer>
      <pdf:Keywords>86000056-413</pdf:Keywords>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:ModifyDate>2011-03-03T17:38:34-05:00</xmp:ModifyDate>
      <xmp:CreateDate>2011-01-28T23:12:07+05:30</xmp:CreateDate>
      <xmp:CreatorTool>PScript5.dll Version 5.2</xmp:CreatorTool>
      <xmp:MetadataDate>2011-03-03T17:38:34-05:00</xmp:MetadataDate>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
      <xmpMM:DocumentID>6cb2263d-2d61-11e0-0000-1390d57dcfcb</xmpMM:DocumentID>
      <xmpMM:InstanceID>uuid:1a0e68ba-14ad-4a03-b7a1-0a0e127b8753</xmpMM:InstanceID>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">I/O Subsystem Programming Guide</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Unisys Information Development</rdf:li></rdf:Seq></dc:creator>
      <dc:description><rdf:Alt><rdf:li xml:lang="x-default">ClearPath MCP 13.1,Application Development,Administration,ClearPath MCP</rdf:li></rdf:Alt></dc:description>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
using the following code (PowerShell v3; in v2 you need to select and expand the properties, thus [string]$title = ($rdf.GetElementsByTagName('dc:title')| Select -expand Alt|Select -expand li)."#text"):
$file = ".\Downloads\68698703-007\PDF\86000056-413.pdf"
#determine what line in file the xmpmeta string starts
[int]$startln = (select-string -pattern '^<x:' $file).ToString().Split(":")[2]
#determine what line in file the xmpmeta string ends
[int]$endln = (select-string -pattern '^</x:' $file).ToString().Split(":")[2]
$startln--
#grab the xmpmeta and cast as type xml
[xml]$xmp = (gc $file)["$startln".."$endln"]
[xml]$rdf = $xmp.xmpmeta.InnerXml
#get title/creator/description element text
[string]$title = $rdf.GetElementsByTagName('dc:title').Alt.li."#text"
[string]$creator = $rdf.GetElementsByTagName('dc:creator').Alt.li."#text"
[string]$description = $rdf.GetElementsByTagName('dc:description').Alt.li."#text"
That's crucial, because the filenames are in the format 12345678-123.pdf; the actual title is in the metadata itself, as well as the document category etc.
So, I can produce a list of documents (displaying their proper titles, not the real filenames) and allow them to be launched, but I also want to be able to search in all the documents using the PDX file, and it's by no means plaintext!
I guess I could use one of a number of tools out there to convert each PDF into text, search it, repeat for each document and then return results for each document.
But it strikes me that Adobe Reader already does that, so can I either start AcroRd32.exe with switches that will start the search, with search terms I've passed to the AcroRd32 program, or can I use the Adobe Search.API from within PowerShell?
Any ideas specifically on automating the load of the .PDX in Adobe Reader and firing off the search, or on using Adobe's API in PowerShell?
EDIT:
I can now launch acrobat from command line and search (so could mimic this in powershell) but the search only works when searching a PDF, not a PDX catalog. Both bring up the search pane, but only in a PDF document does the search field get populated and the search executed.
C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\00_home.pdx"
Or
C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\86000056-413.pdf"
Regards,
Graham
This is an old post, but be aware that the searching you do is potentially dangerous, and that there is a better way to find the XMP metadata in a PDF file. XMP was designed specifically to be "findable" by text search. To that end it has well-defined begin and end markers that are there specifically so that you can extract the XMP data without having to parse the PDF format (or any other format the XMP metadata blob might be embedded in).
You can download the XMP specification here: http://www.adobe.com/devnet/xmp.html. Part 1 is the part that explains XMP packets and how a text scanner can find the XMP packet reliably.
Finally, PDF has an additional quirk that allows it to be incrementally updated. This might cause multiple XMP packets to appear in the file (where the last packet is normally the correct one). And, annoyingly, when the PDF is exported from applications like InDesign, images in the PDF (and other objects) might also have their own "object" XMP attached to them.
So consider where your files come from and how many of these strange things you might encounter and want to provision for. But reading the XMP specification is not a bad idea for sure.
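To illustrate that packet-scanning approach, here is a minimal sketch (my addition, not from the original answer; it assumes the packets use the standard <?xpacket begin= / <?xpacket end= wrappers described in the specification, and it simply takes the last packet found):
import re

def last_xmp_packet(pdf_path):
    """Scan a file for XMP packets by their <?xpacket ...?> markers and return the last one."""
    with open(pdf_path, "rb") as f:
        data = f.read()
    # Packets are wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>
    packets = re.findall(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", data, re.DOTALL)
    if not packets:
        return None
    # In an incrementally updated PDF the last packet is normally the current one
    return packets[-1].decode("utf-8", errors="replace")

print(last_xmp_packet("86000056-413.pdf"))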

Getting specific fields from ID3 tags using command line tool?

I'm looking for a way that would let me get specific fields from ID3 tags from mp3 files.
All the tools I have found so far return all fields, and they also format them for "easier reading". I need just some fields, formatted differently (artist\talbum\ttitle\n) for reporting purposes.
Is there any such tool? I would love a tool that would let me output values from ID3v1 and ID3v2 separately.
id3v2 -R sounds like it does what you want. The Debian package name is id3v2, and the upstream is http://id3v2.sourceforge.net/
From the manpage:
-R, --list-rfc822
Lists using an rfc822-style format for output
Example:
$ id3v2 -R 365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3
Filename: 365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3
TALB: Released independently through Luray Caverns
TPE1: Leland W. Sprinkle
TIT2: The Great Stalacpipe Organ
COMM: ()[eng]: � 2004, Copyright resides with the artist, The 365 Days Project, and UbuWeb (http://ubu.com) / PennSound (http://www.writing.upenn.edu/pennsound/). All materials at UbuWeb / PennSound are available for free exchange for noncommerical purposes.
365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3: No ID3v1 tag
The easiest way is to create a bash script.
grep the fields returned by your tool so you get just the ones you want. Then use awk (if you know how to use it), or cut, etc.
If you give us the format used by one of the tools you found, we can help you write it. The simpler the format is, the simpler the script will be.
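To illustrate that scripting idea, here is a sketch of my own (not from the original answers; it assumes the id3v2 tool is installed, that its -R output contains TPE1/TALB/TIT2 lines as in the example above, and Python 3.7+ for capture_output):
import subprocess

def report_line(mp3_path):
    """Print artist<TAB>album<TAB>title for one file, based on `id3v2 -R` output."""
    out = subprocess.run(["id3v2", "-R", mp3_path], capture_output=True, text=True).stdout
    tags = {}
    for line in out.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            tags[key] = value
    # In the -R output, TPE1 is the artist, TALB the album and TIT2 the title
    print("\t".join([tags.get("TPE1", ""), tags.get("TALB", ""), tags.get("TIT2", "")]))

report_line("365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3")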

Compare two websites and see if they are "equal"?

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?
Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
Then use wdiff; it can give you a percentage of how similar the two texts are.
wdiff -nis /tmp/1.html /tmp/2.html
It can also be easier to see the differences using colordiff.
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion
Google [hp1] [hp2]
[hp3] [-Français-] {+Deutschland+}
[ ] Recherche
avancéeOutils
[Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(it actually put google.com into French... funny)
The common % values show how similar the two texts are. Plus you can easily see the differences by word (instead of by line, which can be cluttered).
The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way to do that is to generate hashes for the files using the md5 or sha1 commands and check them against the new server.
If the pages have dynamic content, you will have to download the site using a tool like wget
wget --mirror http://thewebsite/thepages
and then use diff as suggested by Warner or do the hash thing again. I think diff may be the best way to go since even a change of 1 character will mess up the hash.
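For the static-content case, here is a minimal sketch of that hash comparison (my illustration, not part of the original answer; the host names and paths are placeholders):
import hashlib
import urllib.request

def sha1_of(url):
    """Download a page and return the SHA-1 hash of its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return hashlib.sha1(response.read()).hexdigest()

# Hypothetical host names and paths; compare a few static pages on both servers
for path in ["/", "/about.html", "/contact.html"]:
    old = sha1_of("http://old-server.example.com" + path)
    new = sha1_of("http://new-server.example.com" + path)
    print(path, "OK" if old == new else "DIFFERS")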
I've created the following PHP code that does what Weboide suggests here. Thanks Weboide!
The paste is here:
http://pastebin.com/0V7sVNEq
Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:
Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters. Then run that test on the new server and find how they differ.
Use the free and open-source Chrome extension (https://github.com/retest/recheck-web-chrome-extension), which internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii
For both solutions you currently need to manually list all relevant URLs. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered website and show you exactly where they differ (i.e. different font, different meta tags, even different link URLs). And it gives you powerful filters to let you focus on what is relevant to you.
Disclaimer: I have helped create recheck-web.
Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:
diff -r /tmp/directory1 /tmp/directory2
For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
Edit 1
You could potentially use lynx -dump or wget and run a diff on the results.
Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.
However, it is certainly possible to compare the downloaded website after downloading recursively with wget.
wget [option]... [URL]...
-m
--mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP
directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
The next step would then be to do the recursive diff that Warner recommended.
