Compare two websites and see if they are "equal"? - linux

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as on the old server. Does anyone know of anything that could assist with this task?

Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
Then use wdiff; it can give you a percentage showing how similar the two texts are.
wdiff -nis /tmp/1.html /tmp/2.html
It can also be easier to see the differences using colordiff.
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion
Google [hp1] [hp2]
[hp3] [-Français-] {+Deutschland+}
[ ] Recherche
avancéeOutils
[Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(Google actually served google.com in French... funny.)
The common % values show how similar the two texts are. Plus, you can easily see the differences word by word (instead of line by line, which can be cluttered).
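If you need to check many pages, a minimal sketch of wrapping this in a loop could look like the following (old.example.com, new.example.com and the urls.txt page list are placeholders for your own setup):
#!/bin/sh
# For each path in urls.txt, dump the rendered text of the old and new page
# and print the wdiff similarity statistics (the last two lines of its output).
while read -r path; do
  w3m -dump "http://old.example.com$path" 2>/dev/null > /tmp/old.txt
  w3m -dump "http://new.example.com$path" 2>/dev/null > /tmp/new.txt
  echo "== $path =="
  wdiff -nis /tmp/old.txt /tmp/new.txt | tail -n 2
done < urls.txt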

The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files using the md5sum or sha1sum commands and check them against the new server.
If the pages have dynamic content, you will have to download the site using a tool like wget
wget --mirror http://thewebsite/thepages
and then use diff as suggested by Warner, or do the hash comparison again. I think diff may be the best way to go, since even a change of a single character will alter the hash.
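A minimal sketch of the hash idea, assuming you have already mirrored the old and new site into /tmp/old-site and /tmp/new-site (the paths are just examples):
# Hash every file in both trees, then compare the two hash lists.
( cd /tmp/old-site && find . -type f -exec md5sum {} + | sort -k 2 ) > /tmp/old.md5
( cd /tmp/new-site && find . -type f -exec md5sum {} + | sort -k 2 ) > /tmp/new.md5
diff /tmp/old.md5 /tmp/new.md5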

I've created the following PHP code that does what Weboide suggests here. Thanks, Weboide!
The paste is here:
http://pastebin.com/0V7sVNEq

Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:
Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters. Then run that test against the new server and see how they differ.
Use the free and open-source Chrome extension (https://github.com/retest/recheck-web-chrome-extension), which internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii
For both solutions, you currently need to list all relevant URLs manually. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered website and show you exactly where the pages differ (e.g. a different font, different meta tags, even different link URLs). And it gives you powerful filters that let you focus on what is relevant to you.
Disclaimer: I have helped create recheck-web.

Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:
diff -r /tmp/directory1 /tmp/directory2
For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
Edit 1
You could potentially use lynx -dump or wget and run a diff on the results.
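For example (a rough sketch; the server names are placeholders and the page list is up to you):
# Dump the rendered text of the same page from both servers, then diff the directories.
lynx -dump http://oldserver/page.html > /tmp/directory1/page.txt
lynx -dump http://newserver/page.html > /tmp/directory2/page.txt
diff -r /tmp/directory1 /tmp/directory2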

Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.
However, it is certainly possible to compare the two sites after downloading them recursively with wget.
wget [option]... [URL]...
-m
--mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
The next step would then be to do the recursive diff that Warner recommended.
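As a rough sketch, with old.example.com and new.example.com standing in for your two servers:
# Mirror both servers into separate directories, then compare them recursively.
wget --mirror -P /tmp/old-server http://old.example.com/
wget --mirror -P /tmp/new-server http://new.example.com/
diff -r /tmp/old-server/old.example.com /tmp/new-server/new.example.com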

Related

How to download a batch of data with the Linux command line?

For example, I want to download data from:
http://nimbus.cos.uidaho.edu/DATA/OBS/
with the link:
http://nimbus.cos.uidaho.edu/DATA/OBS/pr_1979.nc
to
http://nimbus.cos.uidaho.edu/DATA/OBS/pr_2015.nc
How can I write a script to download all of them? With wget? And how do I loop over the links from 1979 to 2015?
wget can take a file as input which contains one URL per line.
wget -ci url_file
-i : input file
-c : resume functionality
So all you need to do is put the URLs in a file and use that file with wget.
A simple loop like Jeff Puckett II's answer will be sufficient for your particular case, but if you happen to deal with more complex situations (random URLs), this method may come in handy.
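For this particular case, generating the URL file is itself a short loop (the url_file name is arbitrary):
# Write one URL per line for 1979..2015, then let wget work through the list.
for i in $(seq 1979 2015); do
  echo "http://nimbus.cos.uidaho.edu/DATA/OBS/pr_${i}.nc"
done > url_file
wget -ci url_file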
Probably something like a for loop iterating over a predefined series.
Untested code:
for i in {1979..2015}; do
wget http://nimbus.cos.uidaho.edu/DATA/OBS/pr_$i.nc
done

PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job.
What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.
More specifically, the data I am working with is from grocery store flyers. As it stands now, we have to manually enter every deal in the flyer into a database. A sample flyer is http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551
What I am hoping to do is have columns for products, price, and predefined options (Loyalty Cards, Coupons, Select Variety... that sort of thing).
Any help would be appreciated, and if I need to be more specific let me know.
After looking at the specific PDF linked to by the OP, I have to say that this is not quite displaying a typical table format.
It contains many images inside the "cells", but the cells are not all strictly vertically or horizontally aligned:
So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...
Having said that, I'll have to add:
Extracting even 'nice' tables from PDFs in general is extremely difficult...
Standard PDFs do not provide any hints about the semantics of what they draw on a page:
the only distinction the syntax provides is between vector elements (lines, fills, ...), images, and text.
Whether any character is part of a table or part of a line or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)
...but doing so with TabulaPDF works very well!
Having said the above now let me add this:
For an amazing open source family of tools that gets better and better from week to week at extracting tabular data from PDFs (unless they are scanned pages) -- contradicting what I said in my introductory paragraphs! -- check out TabulaPDF. See these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Tabula-Extractor is written in Ruby.
In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs.
To run, Tabula-Extractor requires JRuby 1.7 to be installed.
Installing Tabula-Extractor
I'm using the 'bleeding-edge' version of Tabula-Extractor directly from its GitHub source code repository.
Getting it to work was extremely easy, since on my system JRuby-1.7.4_0 is already present:
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
The required libraries are already included in this Git clone, so there is no need to install PDFBox separately.
The command line tool is in the /bin/ subdirectory.
Exploring the command line options:
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs
Usage:
tabula [options] <pdf_file>
where [options] are:
--pages, -p <s>: Comma separated list of ranges, or all. Examples:
--pages 1-3,5-7, --pages 3 or --pages all. Default
is --pages 1 (default: 1)
--area, -a <s>: Portion of the page to analyze
(top,left,bottom,right). Example: --area
269.875,12.75,790.5,561. Default is entire page
--columns, -c <s>: X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
--password, -s <s>: Password to decrypt document. Default is empty
(default: )
--guess, -g: Guess the portion of the page to analyze per page.
--debug, -d: Print detected table areas instead of processing.
--format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
--outfile, -o <s>: Write output to <file> instead of STDOUT (default:
-)
--spreadsheet, -r: Force PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines separating
each cell, as in a PDF of an Excel spreadsheet)
--no-spreadsheet, -n: Force PDF not to be extracted using
spreadsheet-style extraction (if there are ruling
lines separating each cell, as in a PDF of an Excel
spreadsheet)
--silent, -i: Suppress all stderr output.
--use-line-returns, -u: Use embedded line returns in cells. (Only in
spreadsheet mode.)
--version, -v: Print version and exit
--help, -h: Show this message
Extracting the table which the OP wants
I'm not even trying to extract this ugly table from the OP's monster PDF. I'll leave it as an exercise to those readers who are feeling adventurous enough...
Instead, I'll demo how to extract a 'nice' table. I'll take pages 651-653 from the official PDF-1.7 specification, here represented with screenshots:
I used this command:
~/svn-stuff/git.tabula-extractor/bin/tabula \
-p 651,652,653 -g -n -u -f CSV \
~/Downloads/pdfs/PDF32000_2008.pdf
After importing the generated CSV into LibreOffice Calc, the spreadsheet looks like this:
To me this looks like the perfect extraction of a table that was spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
Update
Here is an asciinema screencast (which you can also download and replay locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:

How to use GNU sort with numeric data in binary format?

Is there any way to use GNU Coreutils sort with 64-bit numbers stored in a binary file?
If the file weren't binary, then sort -n would be the solution, but I didn't find any options for using it with binary data.
The file is quite large (~100 GB), and if possible I don't want to make a text (non-binary) copy of it.
Sample of data:
$ xxd file
00292e0: 4036 1eb7 6888 d319 de6b 7402 9ca9 f116 #6..h....kt.....
00292f0: db68 7f05 199f 9d36 cf01 cb28 e49f 1116 .h.....6...(....
0029300: 0c7c 8b55 2963 ef0c 277a f2b0 38d7 2b19 .|.U)c..'z..8.+.
0029310: c83b 2614 4327 d838 820c 1bb8 444f 1731 .;&.C'.8....DO.1
0029320: 1695 cab3 cd12 092a 0691 d7e4 5fcc b01d .......*...._...
0029330: b12b 7c1b a209 7c1c 568a 125c 541c d334 .+|...|.V..\T..4
0029340: 09a3 ecbc 8370 e205 9265 7759 a378 4e2f .....p...ewY.xN/
The bsort utility does this.
It is a lightning-fast in-place radix sort written in C. One of the test cases during its development was a 100 GB file on a machine with 16 GB of RAM; it took about 22 seconds or so to sort.
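As a rough sketch of its usage (the -k/-r flags for key size and record size are taken from my reading of the bsort README, so treat them as an assumption and verify against your installed version; here the file is assumed to contain raw 8-byte records sorted on all 8 bytes):
# In-place radix sort of fixed-size 8-byte binary records.
bsort -k 8 -r 8 file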
sort(1) will not help you here. For a small file it could be possible to split your file into lines and feed it to sort(1), but of course not for a 100 GB file.
The answer to this question on Serverfault has a link to a tool written to solve exactly this task. You can check the GitHub project there (it seems to be written in Go, so you will need to install a compiler if you decide to use it).
Quick googling does not turn up any other popular tool for this task written in a more popular language (which surprises me a bit, as the task itself is just a merge sort that thousands of students implement each year in their CS courses, but that's off-topic).

Getting specific fields from ID3 tags using command line tool?

I'm looking for a way that would let me get specific fields from ID3 tags from mp3 files.
All the tools I have found so far return all fields, and they also format them for "easier reading". I need just some fields, formatted differently (artist\talbum\ttitle\n) for reporting purposes.
Is there any such tool? I would love a tool that would let me output the values from ID3v1 and ID3v2 separately.
id3v2 -R sounds like it does what you want. The Debian package name is id3v2; upstream is http://id3v2.sourceforge.net/
From the manpage:
-R, --list-rfc822
Lists using an rfc822-style format for output
Example:
$ id3v2 -R 365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3
Filename: 365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3
TALB: Released independently through Luray Caverns
TPE1: Leland W. Sprinkle
TIT2: The Great Stalacpipe Organ
COMM: ()[eng]: � 2004, Copyright resides with the artist, The 365 Days Project, and UbuWeb (http://ubu.com) / PennSound (http://www.writing.upenn.edu/pennsound/). All materials at UbuWeb / PennSound are available for free exchange for noncommerical purposes.
365-Days-Project-04-26-sprinkle-leland-w-the-great-stalacpipe-organ.mp3: No ID3v1 tag
The easiest way is to create a bash script.
grep the fields returned by your tool so you get just the ones you want, then use awk (if you know how to use it), or cut, etc.
If you give us the format used by one of the tools you found, we can help you write it. The simpler the format is, the simpler the script will be.
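As a sketch built on the id3v2 -R output shown above ($file stands for the mp3 you want to report on; TPE1, TALB and TIT2 are the artist, album and title frames, as in the example output):
# Print artist<TAB>album<TAB>title for one file from the rfc822-style listing.
id3v2 -R "$file" | awk '
  /^TPE1: / { artist = substr($0, 7) }
  /^TALB: / { album  = substr($0, 7) }
  /^TIT2: / { title  = substr($0, 7) }
  END { printf "%s\t%s\t%s\n", artist, album, title }'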

Is there an Open-Source Web Search Library that does not use a Search Index File?

I'm looking for an open-source web search library that does not use a search index file.
Do you know any?
Thanks,
Kenneth
You mean:
search.cgi
#!/bin/sh
# Pull the search term out of the "s=" parameter of the CGI query string.
arg=`echo "$QUERY_STRING" | sed -e 's/^s=//' -e 's/&.*$//'`
cd /var/www/httpd
find . -type f | xargs egrep -l "$arg" | awk 'BEGIN {
print "Content-type: text/html";
print "";
print "<HTML><HEAD><TITLE>Search Result</TITLE></HEAD>";
print "<BODY><P>Here are your search results, sorry it took so long.</P>";
print "<UL>";
}
{ print "<LI>" $1 "</LI>"; }
END {
print "</UL></BODY>";
}'
Untested...
The original poster clarified in a comment to this reply that what he is looking for is essentially a "grep-like search, but over HTTP", and mentioned that he is looking for something that uses little disk space, as he's working on an embedded system.
I am not aware of any related projects, but you might want to look at HTML parsers and XQuery implementations in your language of choice. You should be able to take care of the "real-life" messiness of HTML with the former, and write a search that's almost as detailed as you might desire with the latter.
I assume that you will be working with a set of URLs that will either be provided or already stored locally, since the idea of actually crawling the whole web, discovering links, etc., on an embedded device is thoroughly unrealistic.
Although with a good HTML/XQuery implementation, you do have the tools to extract all the links...
My original answer, which was really a request for clarification:
Not sure what you mean. How do you picture a search working without an index? Crawling the web for every query? Piping everything through to Google? Or are you referring to a specific kind of search index file that you are trying to avoid?
I guess there is none (at least none popular enough for users here to be aware of).
We went ahead and coded our own search system.
