How to use GNU sort with numeric data in binary format? - linux

Is there any way to use GNU Coreutils sort with 64bit numbers stored in binary file?
If file wasn't binary then sort -n is the solution, but I didn't find any options to use it with binary data.
File is quite large (~100GB) and if it is possible I don't want to make its' text (non-binary) copy.
Sample of data:
$ xxd file
00292e0: 4036 1eb7 6888 d319 de6b 7402 9ca9 f116 #6..h....kt.....
00292f0: db68 7f05 199f 9d36 cf01 cb28 e49f 1116 .h.....6...(....
0029300: 0c7c 8b55 2963 ef0c 277a f2b0 38d7 2b19 .|.U)c..'z..8.+.
0029310: c83b 2614 4327 d838 820c 1bb8 444f 1731 .;&.C'.8....DO.1
0029320: 1695 cab3 cd12 092a 0691 d7e4 5fcc b01d .......*...._...
0029330: b12b 7c1b a209 7c1c 568a 125c 541c d334 .+|...|.V..\T..4
0029340: 09a3 ecbc 8370 e205 9265 7759 a378 4e2f .....p...ewY.xN/

The bsort utility does this.
It is a lightning fast inplace radix sort written in C. One of the test cases for its development was a 100Gb file on a machine with 16Gb ram - took about 22 seconds or so to sort.

sort(1) will not help you here. For a small file it could be possible to split your file into lines and feed it to sort(1), but not for 100G file of course.
The answer to this question on Serverfault has a link of the tool written for solving exactly your task. You can check the github project there (it seems to be written in Go so you will need to install a compiler if you decide to use it).
Quick googling does not find any other popular tool for this task written on some more popular language (and it surprises me a bit as the task itself is just a merge sort that thousands of students implement each year on their CS courses, but that's an off-topic).

Related

Uncommon checksum representation on the FTP web site

I'm sorry if i'm asking something basic. I've found list of files with corresponding checkums at ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/helianthus_annuus/dna/CHECKSUMS.
But in my opinion checksum representation is quite uncommon. Before i saw something like 58c1c7a730ba1e90169ab95efdc39e5c ( got this string as md5sum command output on one of the files from the link). How i can convert checksums from link above to regular format?
The output looks as it is from sum or cksumcommands.
See sum wiki and cksum wiki
You can not use this data to get back a md5sum or shasum.

PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job.
What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.
More specifically the data I am working with is from grocery store flyers. As it stands now we have to manually enter every deal in the flyer into a database. A sample of a flyer is http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551
What I am hoping to do is have columns for products, price, and predefined options (Loyalty Cards, Coupons, Select Variety... that sort of thing).
Any help would be appreciated, and if I need to be more specific let me know.
After looking at the specific PDF linked to by the OP, I have to say that this is not quite displaying a typical table format.
It contains many images inside the "cells", but the cells are not all strictly vertically or horizontally aligned:
So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...
Having said that, I'll have to add:
Extracting even 'nice' tables from PDFs in general is extremely difficult...
Standard PDFs do not provide any hints about the semantics of what they draw on a page:
the only distinction that the syntax provides is the distinctions between vector elements (lines, fills,...), images and text.
Whether any character is part of a table or part of a line or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)
...but doing so with TabulaPDF works very well!
Having said the above now let me add this:
For an amazing open source family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages) -- contradicting what I said in my introductionary paragraphs! -- check out TabulaPDF. See these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Tabula-Extractor is written in Ruby.
In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs.
To run, Tabula-Extractor requires JRuby-1.7 installed.
Installing Tabula-Extractor
I'm using the 'bleeding-edge' version of Tabula-Extractor directly from its GitHub source code repository.
Getting it to work was extremely easy, since on my system JRuby-1.7.4_0 is already present:
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
Included in this Git clone will already be the required libraries, so no need to install PDFBox.
The command line tool is in the /bin/ subdirectory.
Exploring the command line options:
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs
Usage:
tabula [options] <pdf_file>
where [options] are:
--pages, -p <s>: Comma separated list of ranges, or all. Examples:
--pages 1-3,5-7, --pages 3 or --pages all. Default
is --pages 1 (default: 1)
--area, -a <s>: Portion of the page to analyze
(top,left,bottom,right). Example: --area
269.875,12.75,790.5,561. Default is entire page
--columns, -c <s>: X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
--password, -s <s>: Password to decrypt document. Default is empty
(default: )
--guess, -g: Guess the portion of the page to analyze per page.
--debug, -d: Print detected table areas instead of processing.
--format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
--outfile, -o <s>: Write output to <file> instead of STDOUT (default:
-)
--spreadsheet, -r: Force PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines separating
each cell, as in a PDF of an Excel spreadsheet)
--no-spreadsheet, -n: Force PDF not to be extracted using
spreadsheet-style extraction (if there are ruling
lines separating each cell, as in a PDF of an Excel
spreadsheet)
--silent, -i: Suppress all stderr output.
--use-line-returns, -u: Use embedded line returns in cells. (Only in
spreadsheet mode.)
--version, -v: Print version and exit
--help, -h: Show this message
Extracting the table which the OP wants
I'm not even trying to extract this ugly table from the OP's monster PDF. I'll leave it as an excercise to these readers who are feeling adventurous enough...
Instead, I'll demo how to extract a 'nice' table. I'll take pages 651-653 from the official PDF-1.7 specification, here represented with screenshots:
I used this command:
~/svn-stuff/git.tabula-extractor/bin/tabula \
-p 651,652,653 -g -n -u -f CSV \
~/Downloads/pdfs/PDF32000_2008.pdf
After importing the generated CSV into LibreOffice Calc, the spreadsheet looks like this:
To me this looks like the perfect extraction of a table which did spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
Update
Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:

Output other than .txt

I'm looking to build a simple program that will simply modify existing output files from an other program so I don't have to open the program and enter a bunch of data the long way. This program is very specific to my domain and has an extension named .wcc. However, when I change the extension of one of these output files to .txt, I get half gibberish :
ÿÿ WPointÿÿ WPolygonÿÿ  WQuadrilateralÿÿ  WMemberDataÿÿ
WLoadÿÿ WLStandardMembersÿÿ WLSavedDesignSettingsÿÿ WLSavedFormatSettingsÿÿ  WLSavedViewSettingsÿÿ WLSavedProjectSettingsÿÿ  WLSavedSettingsÿÿ  WLSavedLoadSettingsÿÿ WLSavedDefaultSettingsÿÿ WLineÿÿ WProductÿÿ WBeamDataÿÿ  WColumnDataÿÿ
WJoistDataÿÿ
WWallStudDataÿÿ WSupportingMemberDataÿÿ WSavedAnalysisSettingsÿÿ WSavedGravityDesignSettingsÿÿ WSavedPreferencesSettingsÿÿ WNotchÿÿ WIJoistÿÿ WFloorCWC37 ÀAE LumberS-P-F No.1/No.2 # À# lumwall.cww ÿÿÿÿ1.2.3.1.Mur_1_EX-D ÿÿÿÿÿÿ B Cÿÿ B C €? 4C 4C   Neige #F #F ÈC ÿÿÿ
WLStandardMembersÿÿ "
There are also musical notes and perpendicular signs which I can't copy paste here. I can sorta read the text, but still not enough to make modifications via txt file. What type of file could this be? Is it even possible to do what I'm trying to do? Thanks!
I am surprised that you are trying to open a .wcc file as a text file (it's contents - as you will see - don't lend themselves to being converted to such a file type); however, the attempt to open the file as a .txt file seems to be specific to your domain.
I noticed part of your question is as follows: "What type of file could this be?"
You are right in thinking that the .wcc file is a rather obscure file type - we don't think about that file type a lot (or are not conscious of it existing). A .wcc file is a WinCam 2000 Cache file that allows WinCam 2000 movies to be previewed in the slide browser - these were often generated by older WinCam 2000 screen recording and editing programs.
Again, the file extension is very rare these days (a Google search only returns ~700 results). But, it appears you have a program that is producing the file, which - as you are saying - "is quite specific to your domain". You may be out of luck with regard to opening them for modification purposes.
Supposedly, you can covert .wac files to .wav files, which are much more relevant to today's technology (and definitely alterable from code); however, without knowing the purpose of the file, e.g. what you are trying to do with the file domain-side, I can't say that this will suit your needs.
Also, the above comments are "correct": changing a file extension will not convert the file to the file extension type. Typically, converters - like a simple software - are needed to convert files.

Getting specific fields from ID3 tags using command line tool?

I'm looking for a way that would let me get specific fields from ID3 tags from mp3 files.
All tools I have so far found return all fields, and they also format them for "easier reading". I need just some fields, and formatted differently (artist\talbum\ttitle\n) for reporting purposes.
Is there any such tool? I would love tool that would let me output separately values from ID3v1 and ID3v2.
id3v2 -R sounds like it does what you want. Debian package name is id3v2, upstream is http://id3v2.sourceforge.net/
From the manpage:
-R, --list-rfc822
Lists using an rfc822-style format for output
Example:
$ id3v2 -R 365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3
Filename: 365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3
TALB: Released independently through Luray Caverns
TPE1: Leland W. Sprinkle
TIT2: The Great Stalacpipe Organ
COMM: ()[eng]: � 2004, Copyright resides with the artist, The 365 Days Project, and UbuWeb (http://ubu.com) / PennSound (http://www.writing.upenn.edu/pennsound/). All materials at UbuWeb / PennSound are available for free exchange for noncommerical purposes.
365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3: No ID3v1 tag
The easiest way is creating a bash script.
grep the fields returned by your tool so you get just the ones you want. Then you use awk (if you know how to use it), or cut, etc.
If you give us the format used by one of the tools you found, we can help you to write it. The more simple the format is, the more simple the script will be.

Compare two websites and see if they are "equal?"

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?
Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
Then use wdiff, it can give you a percentage of how similar the two texts are.
wdiff -nis /tmp/1.html /tmp/2.html
It can be also easier to see the differences using colordiff.
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion
Google [hp1] [hp2]
[hp3] [-Français-] {+Deutschland+}
[ ] Recherche
avancéeOutils
[Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(he actually put google.com into french... funny)
The common % values are how similar both texts are. Plus you can easily see the differences by word (instead of by line which can be a clutter).
The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content the easiest way to do that is to generate hashes for the files using a md5 or sha1 commands and check then against the new server.
IF the pages have dynamic content you will have to download the site using a tool like wget
wget --mirror http://thewebsite/thepages
and then use diff as suggested by Warner or do the hash thing again. I think diff may be the best way to go since even a change of 1 character will mess up the hash.
I've created the following PHP code that does what Weboide suggest here. Thanks Weboide!
the paste is here:
http://pastebin.com/0V7sVNEq
Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:
Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters. Then running that test on the new server and find how they differ.
Use the free and open source (https://github.com/retest/recheck-web-chrome-extension) Chrome extension, that internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii
For both solutions you currently need to manually list all relevant URLs. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered website and show you exactly where they differ (i.e. different font, different meta tags, even different link URLs). And it gives you powerful filters to let you focus on what is relevant to you.
Disclaimer: I have helped create recheck-web.
Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:
diff -r /tmp/directory1 /tmp/directory2
For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
Edit 1
You could potentially use lynx -dump or a wget and run a diff on the results.
Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.
However, it is certainly possible to compare the downloaded website after downloading recursively with wget.
wget [option]... [URL]...
-m
--mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP
directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
The next step would then be to do the recursive diff that Warner recommended.

Resources