download images from google with command line [closed] - linux

I would like to download the n-th image that Google gives me from the command line, i.e. with a command like wget.
To search for images of [something] I just go to the page https://www.google.cz/search?q=[something]&tbm=isch, but how do I get the URL of the n-th search result so I can use wget?

First attempt
First you need to set the user agent so Google will serve output for the search. Then we can look for images and select the desired one. To accomplish that, we insert missing newlines (wget returns the Google results page on one single line) and filter out the link. The index of the file is stored in the variable count.
$ count=10
$ imagelink=$(wget --user-agent 'Mozilla/5.0' -qO - "www.google.be/search?q=something\&tbm=isch" | sed 's/</\n</g' | grep '<img' | head -n"$count" | tail -n1 | sed 's/.*src="\([^"]*\)".*/\1/')
$ wget $imagelink
The image will now be in your working directory; you can tweak the last command and specify a desired output file name.
You can summarize it in a shell script:
#! /bin/bash
count=${1}
shift
query="$*"
[ -z "$query" ] && exit 1 # insufficient arguments
imagelink=$(wget --user-agent 'Mozilla/5.0' -qO - "www.google.be/search?q=${query}\&tbm=isch" | sed 's/</\n</g' | grep '<img' | head -n"$count" | tail -n1 | sed 's/.*src="\([^"]*\)".*/\1/')
wget -qO google_image "$imagelink"
Example usage:
$ ls
Documents
Downloads
Music
script.sh
$ chmod +x script.sh
$ bash script.sh 5 awesome
$ ls
Documents
Downloads
google_image
Music
script.sh
Now google_image should contain the fifth Google image result for 'awesome'. If you experience any bugs, let me know and I'll take care of them.
Better code
The problem with this code is that it returns pictures in low resolution. A better solution is as follows:
#! /bin/bash
# function to create all dirs til file can be made
function mkdirs {
    file="$1"
    dir="/"
    # convert to full path
    if [ "${file##/*}" ]; then
        file="${PWD}/${file}"
    fi
    # dir name of following dir
    next="${file#/}"
    # while not filename
    while [ "${next//[^\/]/}" ]; do
        # create dir if doesn't exist
        [ -d "${dir}" ] || mkdir "${dir}"
        dir="${dir}/${next%%/*}"
        next="${next#*/}"
    done
    # last directory to make
    [ -d "${dir}" ] || mkdir "${dir}"
}
# get optional 'o' flag, this will open the image after download
getopts 'o' option
[[ $option = 'o' ]] && shift
# parse arguments
count=${1}
shift
query="$*"
[ -z "$query" ] && exit 1 # insufficient arguments
# set user agent, customize this by visiting http://whatsmyuseragent.com/
useragent='Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0'
# construct google link
link="www.google.cz/search?q=${query}\&tbm=isch"
# fetch link for download
imagelink=$(wget -e robots=off --user-agent "$useragent" -qO - "$link" | sed 's/</\n</g' | grep '<a href.*\(png\|jpg\|jpeg\)' | sed 's/.*imgurl=\([^&]*\)\&.*/\1/' | head -n $count | tail -n1)
imagelink="${imagelink%\%*}"
# get file extension (.png, .jpg, .jpeg)
ext=$(echo "$imagelink" | sed "s/.*\(\.[^\.]*\)$/\1/")
# set default save location and file name (change this!)
dir="$PWD"
file="google image"
# get optional second argument, which defines the file name or dir
if [[ $# -eq 2 ]]; then
    if [ -d "$2" ]; then
        dir="$2"
    else
        file="${2}"
        mkdirs "${dir}"
        dir=""
    fi
fi
# construct image link: add 'echo "${google_image}"'
# after this line for debug output
google_image="${dir}/${file}"
# construct name, append number if file exists
if [[ -e "${google_image}${ext}" ]] ; then
    i=0
    while [[ -e "${google_image}(${i})${ext}" ]] ; do
        ((i++))
    done
    google_image="${google_image}(${i})${ext}"
else
    google_image="${google_image}${ext}"
fi
# get actual picture and store in google_image.$ext
wget --max-redirect 0 -qO "${google_image}" "${imagelink}"
# if 'o' flag supplied: open image
[[ $option = "o" ]] && gnome-open "${google_image}"
# successful execution, exit code 0
exit 0
The comments should be self-explanatory; if you have any questions about the code (such as the long pipeline), I'll be happy to clarify the mechanics. Note that I had to set a more detailed user agent for wget; it may happen that you need to set a different one, but I don't think that will be a problem. If you do have a problem, visit http://whatsmyuseragent.com/ and supply its output in the useragent variable.
When you wish to open the image instead of only downloading it, use the -o flag; see the example below. If you wish to extend the script to also accept a custom output file name, just let me know and I'll add it for you.
Example usage:
$ chmod +x getimg.sh
$ ./getimg.sh 1 dog
$ gnome-open google_image.jpg
$ ./getimg.sh -o 10 donkey

This is an addition to the answer provided by ShellFish. Much respect to them for working this out. :)
Google have recently changed their web code for the image results page, which has, unfortunately, broken Shellfish's code. I was using it every night in a cron job up until about 4 days ago, when it stopped receiving search results. While investigating this, I found that Google have removed elements like imgurl and have shifted a lot more into JavaScript.
My solution is an expansion of Shellfish's great code but has modifications to handle these Google changes and includes some 'enhancements' of my own.
It performs a single Google search, saves the results, bulk-downloads a specified number of images, then builds these into a single gallery image using ImageMagick. Up to 1,000 images can be requested.
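That final gallery step is handled by ImageMagick. The following is not the actual googliser code, just a minimal sketch of that step, assuming the downloaded images ended up in a directory called downloads:
# tile the downloaded images into one gallery image, 5 per row, with small gaps
montage downloads/*.jpg -tile 5x -geometry +4+4 gallery.jpg
montage ships with a standard ImageMagick install.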
This bash script is available at https://git.io/googliser
Thank you.

Python code to download high resolution images from Google. I had posted the original answer here: Python - Download Images from google Image search?
Currently downloads 100 original images given a search query
Code
from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import cookielib
import json
def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)))

query = raw_input("query image")  # you can change the query for the image here
image_type = "ActiOn"
query = query.split()
query = '+'.join(query)
url = "https://www.google.co.in/search?q=" + query + "&source=lnms&tbm=isch"
print url

# add the directory for your image here
DIR = "C:\\Users\\Rishabh\\Pictures\\" + query.split('+')[0] + "\\"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

soup = get_soup(url, header)

ActualImages = []  # contains the link for large original images, type of image
for a in soup.find_all("div", {"class": "rg_meta"}):
    link, Type = json.loads(a.text)["ou"], json.loads(a.text)["ity"]
    ActualImages.append((link, Type))

print "there are in total", len(ActualImages), "images"

# download the images
for i, (img, Type) in enumerate(ActualImages):
    try:
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        if not os.path.exists(DIR):
            os.mkdir(DIR)
        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print cntr
        if len(Type) == 0:
            f = open(DIR + image_type + "_" + str(cntr) + ".jpg", 'wb')
        else:
            f = open(DIR + image_type + "_" + str(cntr) + "." + Type, 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e

As for Shellfish's answer, the following
imagelink=$(wget -e robots=off --user-agent "$useragent" -qO - "$link" | sed 's/\"ou\"/\n\"ou\"/g' | grep '\"ou\"\:\".*\(png\|jpg\|jpeg\).*ow\"' | awk -F'"' '{print $4}' | head -n $count|tail -n1)
will work with the current Google image search (June 2016).

Simple solution, for files < 4 MB only (otherwise you get a TLS error):
wget --user-agent "Mozilla/5.0" -qO - "$@" | grep video.googleusercontent.com | cut -d'"' -f2 | wget --content-disposition -c -i -

Related

How to download URLs from the website and save them in a file (wget, curl)?

How can I use wget to extract the marked links from this page?
Can this be done with curl?
I want to download URLs from this page and save them in a file.
I tried this:
wget -r -p -k https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4
Link Gopher separated the addresses.
EDIT:
How can I download the addresses to a file from the terminal?
Can it be done with wget?
Can it be done with curl?
I want to download the addresses from this page and save them to a file.
These are the links I want to save:
https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4
https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2985/e15e664718ef6c0dba471d59c4a1928a
https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2986/58edb8e0f06dc3da40c255e50b3839cf
Edit 1.
You will need to use something like
Download Serialized DOM
I added that to my Firefox browser and it works, although it is a bit slow; the only way to know it has completed is when the *.html.part file disappears, leaving the corresponding *.html file, which you save using the add-on's button.
Basically, that will save the complete web page (excluding binaries, i.e. images, videos, etc.) as a single text file.
Also, the developer indicates there is a bug that appears only while saving these files, and you MUST allow "Use in private mode" to circumvent it.
Here is a fragment of the full season 44 index page displayed (note the address in the address bar):
Since I don't have your login access I can't reproduce this fully: the service hides the page for an individual video (what you get when you click on a picture) and gives me the index page instead of the address in the address bar (their security processes at work). However, the index page should probably show something different after the ".../sezon-44/5027472/".
Using that saved DOM file as input, the following will extract the necessary references:
#!/bin/sh
###
### LOGIC FLOW => CONFIRMED VALID
###
DBG=1
#URL="https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4"
###
### Completely expanded and populated DOM file,
### as captured by Firefox extension "Download Serialized DOM"
###
### Extension is slow but apparently very functional.
###
INPUT="test_77_Serialized.html"
BASE=$(basename "$0" ".sh")
TMP="${BASE}.tmp"
HARVESTED="${BASE}.harvest"
DISTILLED="${BASE}.urls"
#if [ ! -s "${TMP}" ]
#then
# ### Non-serialized
# wget -O "${TMP}" "${URL}"
#fi
### Each 'more' step is to allow review of outputs to identify patterns which are to be used for the next step.
cp -p ${INPUT} "${TMP}"
test ${DBG} -eq 1 && more ${TMP}
sed 's+\<a\ +\n\<a\ +g' "${TMP}" >"${TMP}.2"
URL_BASE=$( grep 'tiba=' ${TMP}.2 |
sed 's+tiba=+\ntiba=+' |
grep -v 'viewport' |
cut -f1 -d\; |
cut -f2 -d\= |
cut -f1 -d\% )
echo "\n=======================\n${URL_BASE}\n=======================\n"
sed 's+\<a\ +\n\<a\ +g' "${TMP}" | grep '<a ' >"${TMP}.2"
test ${DBG} -eq 1 && more ${TMP}.2
grep 'title="Pierwsza Miłość - Odcinek' "${TMP}.2" >"${TMP}.3"
test ${DBG} -eq 1 && more ${TMP}.3
### FORMAT: Typical entry identified for video files
#<a data-testing="list.item.0" title="Pierwsza Miłość - Odcinek 2984" href="/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4" class="ifodj3-0 yIiYl"></a><div class="sc-1vdpbg2-2 hKhMfx"><img data-src="https://ipla.pluscdn.pl/p/vm2images/9x/9xzfengehrm1rm8ukf7cvzvypv175iin.jpg" alt="Pierwsza Miłość - Odcinek 2984" class="rebvib-0 hIBTLi" src="https://ipla.pluscdn.pl/p/vm2images/9x/9xzfengehrm1rm8ukf7cvzvypv175iin.jpg"></div><div class="sc-1i4o84g-2 iDuLtn"><div class="orrg5d-0 gBnmbk"><span class="orrg5d-1 AjaSg">Odcinek 2984</span></div></div></div></div><div class="sc-1vdpbg2-1 bBDzBS"><div class="sc-1vdpbg2-0 hWnUTt"><
sed 's+href=+\nhref=+' "${TMP}.3" |
sed 's+class=+\nclass=+' |
grep '^href=' >"${TMP}.4"
test ${DBG} -eq 1 && more ${TMP}.4
awk -v base="${URL_BASE}" -v splitter=\" '{
    printf("https://%s", base ) ;
    pos=index( $0, "href=" ) ;
    if( pos != 0 ){
        rem=substr( $0, pos+6 ) ;
        n=split( rem, var, splitter) ;
        printf("%s\n", var[1] ) ;
    } ;
}' "${TMP}.4" >${TMP}.5
more ${TMP}.5
exit
That will give you a report in ${TMP}.5 like this:
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2985/e15e664718ef6c0dba471d59c4a1928a
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2986/58edb8e0f06dc3da40c255e50b3839cf
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2987/2ebc2e7b13268e74d90cc64c898530ee
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2988/2031529377d3be27402f61f07c1cd4f4
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2989/eaceb96a0368da10fb64e1383f93f513
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2990/4974094499083a8d67158d51c5df2fcb
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2991/4c79d87656dcafcccd4dfd9349ca7c23
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2992/26b4d8808ef4851640b9a2dfa8499a6d
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2993/930aaa5b2b3d52e2367dd4f533728020
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2994/fa78c186bc9414f844f197fd2d673da3
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2995/c059c7b2b54c3c25996c02992228e46b
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2996/4a016aeed0ee5b7ed5ae1c6117347e6a
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2997/1e3dca41d84471d5d95579afee66c6cf
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2998/440d069159114621939d1627eda37aec
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2999/f54381d4b61f76bb83f072059c15ea84
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3000/b272901a616147cd9f570750aa450f99
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3001/3aca6bd8e81962dc4a45fcc586cdcc7f
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3002/c6500c6e261bd5d65d0bd3a57cd36288
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3003/35a13bc5e5570ed223c5a0221a8d13f3
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3004/a5cfb71ed30e704730b8891323ff7d92
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3005/d86c1308029d78a6b7090503f8bab88e
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3006/54bba327bc7a1ae7b9b609e7ee11c07c
https://Polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-3007/17d199a0523df8430bcb1f21d4a5b573
NOTE: In the image below, the icon between the "folder" and the "star", in the address bar of that image, is the button for the Download Serialized DOM extension to capture the currently displayed page as a fully-instantiated DOM file.
To save the output of the wget command that you provided above, add the following at the end of your command line:
-O ${vidfileUniqueName}.${fileTypeSuffix}
Before that wget, you will need to define something like the following:
vidfileUniqueName=$(echo "${URL}" | cut -f10 -d\/ )
fileTypeSuffix="mp4|avi|mkv"
You need to choose only one of the suffix types from that list and remove the others.
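Putting those pieces together for a single episode, a rough sketch looks like the following (it uses the first URL from the question; whether wget can actually retrieve the video from that page without login access is a separate issue):
URL="https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4"
vidfileUniqueName=$(echo "${URL}" | cut -f10 -d\/ )   # -> pierwsza-milosc-odcinek-2984
fileTypeSuffix="mp4"   # pick exactly one suffix
wget "${URL}" -O "${vidfileUniqueName}.${fileTypeSuffix}"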

How does Google search PDF's so quickly even though it takes way more time for the PDF to load when I open the link?

It takes time for a PDF to load completely when I click on links in Google, yet Google searches millions of files and returns the exact result. It even returns the part where the words I searched for can be found (briefly, below the result). All of this in a few seconds.
However, it takes much more time to open those links individually.
MY REASONING: Google has already gone through the Internet (as soon as a link is put on the net) and just gives me links from its history rather than doing the search in real time.
But that still sounds unconvincing, as it might make things slightly quicker, but not by much.
Also, as an extension, can the solution to this be used as a hack to open web pages / PDFs quickly, avoiding all the irrelevant parts (like ads/toolbars on some news pages, etc.)? If Google can search them in such a short time, there should be a way for us to get the relevant pages quickly, right?
Thanks in advance.
Example: an image of a PDF which took over 10 s to open, while Google returned the result in 0.58 s (according to Google itself).
In case you wanted to do this yourself, here is one solution that I came up with to the problem of quickly searching every law ever enacted by the U.S. Congress. These laws are in more than 32GB of PDF files, which you can download for free from here and there on the Internet.
For the more than 150 PDF files that I downloaded, I used the naming convention of V<volume>[C<congress>[S<session>]]Y<year>[<description>].pdf. Here are some examples of how my PDF files were named, whitespace and all.
V1C1Y1789-1791.pdf
V6Y1789-1845 Private Laws and Resolutions.pdf
V7Y1789-1845 Indian Treaties.pdf
V50C75S1Y1937.pdf
V51C75S2Y1937.pdf
V52C75S3Y1938.pdf
V53C76S1Y1939.pdf
V54C76S2-3Y1939-1941.pdf
Then I made the following directories to hold the PDF files and their text representations (to be used for fast searching) that I was going to create. I placed all the downloaded PDFs in the first, Originals, directory.
~/Documents/Books/Laws/Originals/
~/Documents/Books/Laws/PDF/
~/Documents/Books/Laws/Text/
The first problem I encountered was that, for the volumes before number 65, the (selectable) text in the PDF files was poorly constructed: often out of order and jumbled around. (I initially discovered this when using a pdfgrep tool.) This problem made the text almost impossible to search through. However, the images of the PDF files seemed quite reasonable.
Using brew, I installed the ocrmypdf tool ("brew install ocrmypdf") to improve the OCR text layer on those problematic PDF files. It worked very well.
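For a single file, that conversion step looks roughly like this (the file name is one of the examples above, and the flags are the same ones used in the mass conversion further below):
# re-OCR the file and write a plain-text sidecar next to the converted PDF
ocrmypdf -f --sidecar "V1C1Y1789-1791.txt" "V1C1Y1789-1791.pdf" "../PDF/V1C1Y1789-1791.pdf"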
To get around some apparent limitations of xargs (long command lines halted file substitutions) and zargs (file substitution halted after a pipe, a redirect, or the first substitution in a string), I created the following Zsh function, which I used to mass-execute ocrmypdf on 92 PDF files.
# Make a function to execute a string of shell code in the first argument on
# one or more files specified in the remaining arguments. Every specification
# of $F in the string will be replaced with the current file. If the last
# argument is "test," show each command before asking the user to execute it.
#
# XonFs <command_string> <file_or_directory>... [test]
#
XonFs()
{
    # If the last argument is a test, identify and remove it.
    #
    local testing
    if [[ ${argv[-1]} == "test" ]]
    then
        testing=1
        unset -v 'argv[-1]'
    fi
    # Get a list of files, from each argument after the first one, and sort
    # the list like the Finder does. The IFS setting makes the output of the
    # sort command be separated by newlines, instead of by whitespace.
    #
    local F IFS=$'\n' answer=""
    local files=($(sort -Vi <<<"${argv[2,-1]}"))
    # Execute the command for each file. But if we are testing, show the
    # command that will be executed before asking the user for permission to
    # do so.
    #
    for F in $files
    do
        # If this is a test, show the user the command that we will execute,
        # using the current filename. Then ask the user if they want to execute
        # the command or not, or just quit the script. If this is not a test,
        # execute the command.
        #
        if (( $testing ))
        then
            # Separate each file execution with a newline. Show the first
            # argument to the function, the command that can be executed, with
            # the F variable expanded as the current file. Then, ask the user
            # whether the command should be executed.
            #
            [[ -n $answer ]] && print
            printf "%s\n" "${(e)1}"
            read -ks "answer?EXECUTE? y/n/q [no] "
            # Report what the user's answer is interpreted to be, and do what
            # they want.
            #
            if [[ "$answer" == [yY] ]]
            then
                print "Yes."
                eval $1
            elif [[ "$answer" == [qQ] ]]
            then
                print "Quit."
                break
            else
                answer="n"
                print "No."
            fi
        else
            eval $1
        fi
    done
}
Thus, I used the following shell commands to put the best versions of the PDF files in the PDF directory. (It took about a day on my computer to complete the ocrmypdf conversions.) During the conversions, I also had text files created from the converted PDF files and placed in the Originals directory.
cd ~/Documents/Books/Laws/Originals/
cp V{65..128}[YC]*.pdf ../PDF
XonFs 'print "$F"; ocrmypdf -f --sidecar "${F:r}.txt" "$F" "../PDF/$F"; print;' V{1..64}[YC]*.pdf
I then used pdftotext to create the text file versions of the original (unconverted) PDF files, as follows. If I remember correctly, the pdftotext tool is installed automatically with the installation of pdfgrep ("brew install pdfgrep").
XonFs 'print "$F"; pdftotext "$F" "${F:r}.txt";' V{65..128}[YC]*.pdf
Next, I created easily and quickly searchable versions of all the text files (direct conversions of the PDF files) and placed these new versions of the text files in the Text directory with the following command.
XonFs 'print "$F"; cat "$F" | tr -cds "[:space:][:alnum:]\!\$%&,.:;?" "\!\$%&,.:;?" | tr -s "\n" "=" | tr "\f" "_" | tr -s "[:space:]" " " | sed -E -e "s/ ?& ?/ and /g" -e '"'s/[ =]*_[ =]*/\\'$'\n/g' -e 's/( ?= ?)+/\\'$'\t/g'"' > "../Text/$F";' *.txt
(OK, the tr and sed commands look crazy, but they basically do the following: delete everything except certain characters and remove some of their redundancies, change all newlines to =, change each formfeed to _, change all whitespace to " ", change each "&" into "and", then change each _ to a newline and each = to a tab. Thus, in the new versions of the text files, a lot of extraneous characters are removed or reduced, newlines separate pages, and tabs represent newlines. Unfortunately, sed seems to require tricky escaping of characters like \n and \t in its replacement specifications.)
Below is the zsh code for a tool I created called greplaw (and a supporting function called error). Since I will be using this tool a lot, I placed this code in my ~/.zshenv file.
# Provide a function to print an error message for the current executor, which
# is identified by the first argument. The second argument, if not null, is a
# custom error message to print. If the third argument exists and is neither
# zero nor null, the script is exited, but only to the prompt if there is one.
# The fourth argument, if present, is the current linenumber to report in the
# error message.
#
# Usage:
# error [executor [messageString [exitIndicator [lineNumber]]]]
#
# Examples:
# error greplaw
# error greplaw "" 1
# error greplaw "No text files found" 0 $LINENO
# error greplaw "No pdf files found" "" $LINENO
# error greplaw "No files found" x $LINENO
# error greplaw HELL eject $LINENO
#
error()
{
print ${1:-"Script"}": "${4:+"Line $4: "}"Error: "${2:-"Unknown"} 1>&2
[[ $3 =~ '^0*$' ]] || { ${missing_variable_ejector:?} } 2>/dev/null
}
# Function to grep through law files: see usage below or execute with -h.
#
greplaw()
{
# Provide a function to print an error message for the current executor.
# If the user did not include any arguments, give the user a little help.
#
local executor=$( basename $0 )
local err() {error $executor $*}
(( $# )) || 1="-h"
# Create variables with any defaults that we have. The color and no color
# variables are for coloring the extra output that this script outputs
# beyond what grep does.
#
local lawFileFilter=() contextLines=1 maxFileMatches=5 grepOutput=1
local maxPageMatches=5 quiet=0 c="%B%F{cyan}" nc="%f%b" grep="pcregrep"
local grepOptions=(-M -i --color) fileGrepOptions=(-M -n -i)
# Print out the usage for the greplaw function roughly in the fashion of a
# man page, with color and bolding. The output of this function should be
# sent to a print -P command.
#
local help()
{
# Make some local variables to make the description more readable in
# the below string. However, insert the codes for bold and color as
# appropriate.
#
local func="%B%F{red}$executor%f%b" name synopsis description examples
local c f g h l o p q number pattern option
# Mass declare our variables for usage categories, function flags, and
# function argument types. Use the man page standards for formatting
# the text in these variables.
#
for var in name synopsis description examples
declare $var="%B%F{red}"${(U)var}"%f%b"
for var in c f g h l o p q
declare $var="%B%F{red}-"$var"%f%b"
for var in number pattern option
declare $var="%U"$var"%u"
# Print the usage for the function, using our easier to use and read
# variables.
#
cat <<greplawUsage
$name
$func
$synopsis
$func [$c $number] [$f $number] [$g] [$h] [$l $pattern]... [$o $option]...
[$p $number] [$q] $pattern...
$description
This function searches law files with a regular expression, which is
specified as one or more arguments. If more than one argument is provided,
they are joined, with a pattern of whitespace, into a singular expression.
The searches are done without regard to case or whitespace, including line
and page breaks. The output of this function is the path of each law file
the expression is found in, including the PDF page as well as the results
of the $grep. If just a page is reported without any $grep results,
the match begins on that page and continues to the next one.
The following options are available:
$c $number
Context of $number lines around each match shown. Default is 1.
$f $number
File matches maximum is $number. Default is 5. Infinite is -1.
$g
Grep output is omitted
$h
Help message is merely printed.
$l $pattern
Law file regex $pattern will be added as a filename filter.
$o $option
Option $option added to the final $grep execution, for output.
$p $number
Page matches maximum is $number. Default is 5. Infinite is -1.
$q
Quiet file and page information: information not from $grep.
$examples
$func bureau of investigation
$func $o --color=always congress has not | less -r
$func $l " " $l Law congress
greplawUsage
}
# Update our defaulted variables according to the supplied arguments,
# until an argument is not understood. If an argument looks invalid,
# complain and eject. Add each law file filter to an array.
#
while (( $# ))
do
case $1 in
(-[Cc]*)
[[ $2 =~ '^[0-9]+$' ]] || err "Bad $1 argument: $2" eject
contextLines=$2
shift 2;;
(-f*)
[[ $2 =~ '^-?[0-9]+$' ]] || err "Bad $1 argument: $2" eject
maxFileMatches=$2
shift 2;;
(-g*)
grepOutput=0
shift;;
(-h*|--h*)
print -P "$( help )"
return;;
(-l*)
lawFileFilter+=$2
shift 2;;
(-o*)
grepOptions+=$2
shift 2;;
(-p*)
[[ $2 =~ '^-?[0-9]+$' ]] || err "Bad $1 argument: $2" eject
maxPageMatches=$2
shift 2;;
(-q*)
quiet=1
shift;;
(*)
break;;
esac
done
# If the user specified that the script and that grep has no output, then
# we will give no output at all, so just eject with an error. Also, make
# sure we have remaining arguments to assemble the search pattern with.
# Assemble it by joining them with the only allowable whitespace in the
# text files: a space, a tab (which represents a newline), or a newline
# (which represents a new PDF page).
#
(( $quiet && ! $grepOutput )) && err "No grep output and quiet: nothing" x
(( $# )) || err "No pattern supplied to grep law files with" eject
local pattern=${(j:[ \t\n]:)argv[1,-1]}
# Quickly searchable text files are searched as representatives of the
# actual PDF law files. Define our PDF and text directories. Note that to
# expand the home directory specification, no quotes can be used.
#
local pdfDirectory=~/Documents/Books/Laws/PDF
local textDirectory=${pdfDirectory:h}"/Text"
# Get a list of the text files, without their directory specifications,
# sorted like the Finder would: this makes the file in order of when the
# laws were created. The IFS setting separates the output of the sort
# command by newlines, instead of by whitespace.
#
local filter fileName fileMatches=0 IFS=$'\n'
local files=( $textDirectory/*.txt )
local fileNames=( $( sort -Vi <<<${files:t} ) )
[[ $#files -gt 1 ]] || err "No text files found" eject $LINENO
# Repeatedly filter the fileNames for each of the law file filters that
# were passed in.
#
for filter in $lawFileFilter
fileNames=( $( grep $filter <<<"${fileNames}" ) )
[[ $#fileNames -gt 0 ]] || err "All law files were filtered out" eject
# For each filename, search for pattern matches. If there are any, report
# the corresponding PDF file, the page numbers and lines of the match.
#
for fileName in $fileNames
do
# Do a case-insensitive, multiline grep of the current file for the
# search pattern. In the grep, have each line prepended with the line
# number, which represents the PDF page number.
#
local pages=() page="" pageMatches=0
local file=$textDirectory"/"$fileName
pages=( $( $grep $fileGrepOptions -e $pattern $file ) )
# If the grep found nothing, move on to the next file. Otherwise, if
# the maximum file matches has been defined and has been exeeded, then
# stop processing files.
#
if [[ $#pages -eq 0 ]]
then
continue
elif [[ ++fileMatches -gt $maxFileMatches && $maxFileMatches -gt 0 ]]
then
break
fi
# For each page with a match, print the page number and the matching
# lines in the page.
#
for page in $pages
do
# If there have been no previous page matches in the current file,
# identify the corresponding PDF file that the matches, in theory,
# come from.
#
if [[ ++pageMatches -eq 1 ]]
then
# Put a blank line between matches for each file, unless
# either minimum output is requested or page matches are not
# reported.
#
if [[ $fileMatches -ne 1 && $pageMatches -ne 0
&& $maxPageMatches -ne 0 ]]
then
(( $quiet )) || print
fi
# Identify and print in color the full location of the PDF
# file (prepended with an open command for easy access),
# unless minimum output is requested.
#
local pdfFile=$pdfDirectory"/"${fileName:r}".pdf"
(( $quiet )) || print -P $c"open "$pdfFile$nc
fi
# If the maximum page matches has been defined and has been
# exceeded, stop processing pages for the current file.
#
if [[ $maxPageMatches -gt 0 && $pageMatches -gt $maxPageMatches ]]
then
break
fi
# Extract and remove the page number specification (an initial
# number before a colon) from the grep output for the page. Then
# extract the lines of the page: tabs are decoded as newlines.
#
local pageNumber=${page%%:*}
page=${page#*:}
local lines=( $( tr '\t' '\n' <<<$page ) )
# Print the PDF page number in yellow. Then grep the lines of the
# page that we have, matching possibly multiple lines without
# regard to case. And have any grep output use color and a line
# before and after the match, for context.
#
(( $quiet )) || print -P $c"Page "$pageNumber$nc
if (( $grepOutput ))
then
$grep -C $contextLines -e $pattern $grepOptions <<<$lines
fi
done
done
}
Yes, if I had to do it again, I would have used Perl...
Here's the usage for greplaw, as a zsh shell function. It's a surprisingly fast search tool.
NAME
greplaw
SYNOPSIS
greplaw [-c number] [-f number] [-g] [-h] [-l pattern]... [-o option]...
[-p number] [-q] pattern...
DESCRIPTION
This function searches law files with a regular expression, which is
specified as one or more arguments. If more than one argument is provided,
they are joined, with a pattern of whitespace, into a singular expression.
The searches are done without regard to case or whitespace, including line
and page breaks. The output of this function is the path of each law file
the expression is found in, including the PDF page as well as the results
of the pcregrep. If just a page is reported without any pcregrep results,
the match begins on that page and continues to the next one.
The following options are available:
-c number
Context of number lines around each match shown. Default is 1.
-f number
File matches maximum is number. Default is 5. Infinite is -1.
-g
Grep output is omitted
-h
Help message is merely printed.
-l pattern
Law file regex pattern will be added as a filename filter.
-o option
Option option added to the final pcregrep execution, for output.
-p number
Page matches maximum is number. Default is 5. Infinite is -1.
-q
Quiet file and page information: information not from pcregrep.
EXAMPLES
greplaw bureau of investigation
greplaw -o --color=always congress has not | less -r
greplaw -l " " -l Law congress
That'll do it... (if I remembered everything correctly and didn't make any typos...)

Is there a way to automatically and programmatically download the latest IP ranges used by Microsoft Azure?

Microsoft has provided a URL where you can download the public IP ranges used by Microsoft Azure.
https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653
The question is, is there a way to download the XML file from that site automatically using Python or other scripts? I am trying to schedule a task to grab the new IP range file and process it for my firewall policies.
Yes! There is!
But of course you need to do a bit more work than expected to achieve this.
I discovered this question when I realized Microsoft's page only works via JavaScript when loaded in a browser, which makes automation via curl practically unusable; I'm not manually downloading this file every time it updates. Instead I came up with this Bash "one-liner" that works well for me.
First, let’s define that base URL like this:
MICROSOFT_IP_RANGES_URL="https://www.microsoft.com/en-us/download/confirmation.aspx?id=41653"
Now just run this curl command, which parses the returned web page HTML with a few chained greps, like this:
curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+'
The returned output is a nice and clean final URL like this:
https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20181224.xml
That is exactly what one needs to automate the direct download of that XML file. If you then need to assign that value to a variable, just wrap the command in $() like this:
MICROSOFT_IP_RANGES_URL_FINAL=$(curl -Lfs "${MICROSOFT_IP_RANGES_URL}" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep "download.microsoft.com/download/" | grep -m 1 -Eo '(http|https)://[^"]+')
And just access that URL via $MICROSOFT_IP_RANGES_URL_FINAL and there you go.
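For a fully unattended download, a minimal follow-up step (assuming the variable above is set) is simply:
# -L follows redirects, -f fails on HTTP errors, -O keeps the server-side file name
curl -LfsO "${MICROSOFT_IP_RANGES_URL_FINAL}"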
If you can use wget, writing the following in a text editor and saving it as scriptNameYouWant.sh will give you a bash script that downloads the XML file you want.
#! /bin/bash
wget http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml
Of course, you can run the command directly from a terminal and have the exact result.
If you want to use Python, one way is:
Again, write the following in a text editor and save it as scriptNameYouWant.py:
import urllib2
page = urllib2.urlopen("http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_21050223.xml").read()
f = open("xmlNameYouWant.xml", "w")
f.write(str(page))
f.close()
I'm not that good with Python or bash though, so I guess there is a more elegant way to do it in either.
EDIT
You can get the XML despite the dynamic URL by using curl. What worked for me is this:
#! /bin/bash
FIRSTLONGPART="http://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_"
INITIALURL="http://www.microsoft.com/EN-US/DOWNLOAD/confirmation.aspx?id=41653"
OUT="$(curl -s $INITIALURL | grep -o -P '(?<='$FIRSTLONGPART').*(?=.xml")'|tail -1)"
wget -nv $FIRSTLONGPART$OUT".xml"
Again, I'm sure that there is a more elegant way of doing this.
Here's a one-liner to get the URL of the JSON file:
$ curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq
I enhanced the last response a bit and I'm able to download the JSON file:
#! /bin/bash
download_link=$(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?\.json' | uniq | grep -v refresh)
if [ $? -eq 0 ]
then
wget $download_link ; echo "Latest file downloaded"
else
echo "Download failed"
fi
Slight mod to ushuz's script from 2020, as that was producing two different entries and uniq doesn't work for me in Windows:
$list=@(curl -sS https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519 | egrep -o 'https://download.*?.json')
$list.Get(2)
That outputs the valid path to the json, which you can punch into another variable for use for grepping or whatever.

Linux script to return domains on a web page

I was tasked with this question:
Write a bash script that takes a URL as its first argument and prints out statistics of the number of links per host/domain in the HTML of the URL.
So for instance given a URL like www.bbc.co.uk it might print something like
www.bbc.co.uk: 45
bbc.com: 1
google.com: 2
Facebook.com: 4
That is, it should analyse the HTML of the page, pull out all the links, examine the href attribute, decide which links are to the same domain (figure that one out of course), and which are foreign, then produce statistics for the local ones and for the remote ones.
Rules: You may use any set of standard Linux commands in your script. You may not use any higher-level programming languages such as C or Python or Perl. You may however use awk, sed, etc.
I came up with the solution as follows:
#!/bin/sh
echo "Enter a url eg www.bbc.com:"
read url
content=$(wget "$url" -q -O -)
echo "Enter file name to store URL output"
read file
echo $content > $file
echo "Enter file name to store filtered links:"
read links
found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq | awk '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
cat out
I was then told that "I must look at the data, and then check that your program deals satisfactorily with all the scenarios. This reports URLs but not the domains."
Is there someone out there who can help me or point me in the right direction so I can achieve my goal? What am I missing, or what is the script not doing? I thought I had made it work as required.
The output of your script is:
7 http://news.bbc.co.uk/
1 http://newsvote.bbc.co.uk/
1 http://purl.org/
8 http://static.bbci.co.uk/
1 http://www.bbcamerica.com/
23 http://www.bbc.com/
179 http://www.bbc.co.uk/
1 http://www.bbcknowledge.com/
1 http://www.browserchoice.eu/
I think they mean that it should look more like:
7 news.bbc.co.uk
1 newsvote.bbc.co.uk
1 purl.org
8 static.bbci.co.uk
1 www.bbcamerica.com
23 www.bbc.com
179 www.bbc.co.uk
1 www.bbcknowledge.com
1 www.browserchoice.eu
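One way to produce output in that form, sticking with the tools already used in the question, is to strip the scheme before counting. A minimal sketch, reusing the $links file from the question's script:
egrep -o '^https?://[^/]+' "$links" | sed -E 's|^https?://||' | sort | uniq -c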

Search MS word files in a directory for specific content in Linux

I have a directory structure full of MS Word files and I have to search the directory for a particular string. Until now I have been using the following commands to search files in a directory:
find . -exec grep -li 'search_string' {} \;
find . -name '*' -print | xargs grep 'search_string'
But, this search doesn't work for MS word files.
Is it possible to do string search in MS word files in Linux?
I'm a translator and know next to nothing about scripting but I was so pissed off about grep not being able to scan inside Word .doc files that I worked out how to make this little shell script to use catdoc and grep to search a directory of .doc files for a given input string.
You need to install the catdoc and docx2txt packages.
#!/bin/bash
echo -e "\n
Welcome to scandocs. This will search .doc AND .docx files in this directory for a given string. \n
Type in the text string you want to find... \n"
read response
find . -name "*.doc" |
while read i; do catdoc "$i" |
grep --color=auto -iH --label="$i" "$response"; done
find . -name "*.docx" |
while read i; do docx2txt < "$i" |
grep --color=auto -iH --label="$i" "$response"; done
All improvements and suggestions welcome!
Here's a way to use "unzip" to print the entire contents to standard output, then pipe to "grep -q" to detect whether the desired string is present in the output. It works for docx format files.
#!/bin/bash
PROG=`basename $0`
if [ $# -eq 0 ]
then
echo "Usage: $PROG string file.docx [file.docx...]"
exit 1
fi
findme="$1"
shift
for file in "$@"
do
unzip -p "$file" | grep -q "$findme"
[ $? -eq 0 ] && echo "$file"
done
Save the script as "inword" and search for "wombat" in three files with:
$ ./inword wombat file1.docx file2.docx file3.docx
file2.docx
Now you know file2.docx contains "wombat". You can get fancier by adding support for other grep options. Have fun.
The more recent versions of MS Word intersperse ascii[0] in between each of the letters of the text, most likely because the text is stored as UTF-16, where each ASCII character is followed by a zero byte. I have written my own MS Word search utilities that insert ascii[0] in between each of the characters in the search field, and that works just fine. Clumsy but OK. A lot of questions remain. Perhaps the junk characters are not always the same. More tests need to be done. It would be nice if someone could write a utility that would take all this into account. On my Windows machine the same files respond well to searches.
We can do it!
In a .doc file the text is generally present and can be found by grep, but that text is broken up and interspersed with field codes and formatting information so searching for a phrase you know is there may not match. A search for something very short has a better chance of matching.
A .docx file is actually a zip archive collecting several files together in a directory structure (try renaming a .docx to .zip then unzipping it!) -- with zip compression it's unlikely that grep will find anything at all.
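As a rough illustration of that structure (document.docx is a placeholder name, and this is a sketch rather than a robust search tool): the body text lives in word/document.xml inside the archive, so you can decompress just that entry and strip the XML tags before grepping:
unzip -p document.docx word/document.xml | sed 's/<[^>]*>/ /g' | grep -i "search_string"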
The open-source command line utility crgrep will search most MS document formats (I'm the author).
Have you tried with awk '/Some|Word|In|Word/' document.docx ?
If it's not too many files, you can write a script that incorporates something like catdoc: http://manpages.ubuntu.com/manpages/gutsy/man1/catdoc.1.html , by looping over each file, performing a catdoc and grep, storing that in a bash variable, and outputting it if it's satisfactory.
If you have installed program called antiword you can use this command:
find -iname "*.doc" |xargs -I {} bash -c 'if (antiword {}|grep "string_to_search") > /dev/null 2>&1; then echo {} ; fi'
Replace "string_to_search" in the above command with your text. This command prints the file name(s) of the files containing "string_to_search".
The command is not perfect because it works strangely on small files (the result can be untrustworthy), because for some reason antiword prints this text:
"I'm afraid the text stream of this file is too small to handle."
if the file is small (whatever that means).
The best solution I came upon was to use unoconv to convert the word documents to html. It also has a .txt output, but that dropped content in my case.
http://linux.die.net/man/1/unoconv
I've found a way of searching Word files (doc and docx) that uses the preprocessor functionality of ripgrep.
This depends on the following being installed:
ripgrep (more information about the preprocessor here)
LibreOffice
docx2txt
this catdoc2 script, which I've added to my $PATH:
#!/bin/bash
temp_dir=$(mktemp -d)
trap "rm $temp_dir/* && rmdir $temp_dir" 0 2 3 15
libreoffice --headless --convert-to "txt:Text (encoded):UTF8" --outdir ${temp_dir} $1 1>/dev/null
cat ${temp_dir}/$(basename -s .doc $1).txt
The command pattern for a one-level recursive search is:
$ rg --pre <preprocessor> --glob <glob with filetype> <search string>
Example:
$ ls *
one:
a.docx
two:
b.docx c.doc
$ rg --pre docx2txt --glob *.docx This
two/b.docx
1:This is file b.
one/a.docx
1:This is file a.
$ rg --pre catdoc2 --glob *.doc This
two/c.doc
1:This is file c.
Here's the full script I use on macOS (Catalina, Big Sur, Monterey).
It's based on Ralph's suggestion, but using built-in textutil for .doc
#!/bin/bash
searchInDoc() {
# in .doc
find "$DIR" -name "*.doc" |
while read -r i; do
textutil -stdout -cat txt "$i" | grep --color=auto -iH --label="$i" "$PATTERN"
done
}
searchInDocx() {
for i in "$DIR"/*.docx; do
#extract
docx2txt.sh "$i" 1> /dev/null
#point, grep, remove
txtExtracted="$i"
txtExtracted="${txtExtracted//.docx/.txt}"
grep -iHn "$PATTERN" "$txtExtracted"
rm "$txtExtracted"
done
}
askPrompts() {
local i
for i in DIR PATTERN; do
#prompt
printf "\n%s to search: \n" "$i"
#read & assign
read -e REPLY
eval "$i=$REPLY"
done
}
makeLogs() {
local i
for i in results errors; do
# extract dir for log name
dirNAME="${DIR##*/}"
# set var
eval "${i}LOG=$HOME/$i-$PATTERN-$dirNAME.log"
local VAR="${i}LOG"
# warn if it already exists
if [ -f "${!VAR}" ]; then
printf "WARNING: %s will be overwritten.\n" "${!VAR}"
fi
# touch file
touch "${!VAR}"
done
}
checkDocx2txt() {
#see if soft exists
if ! command -v docx2txt.sh 1>/dev/null; then
printf "\nWARNING: docx2txt is required.\n"
printf "Use \e[3mbrew install docx2txt\e[0m.\n\n"
exit
else
printf "\n~~~~~~~~~~~~~~~~~~~~~~~~\n"
printf "Welcome to scandocs macOS.\n"
printf "~~~~~~~~~~~~~~~~~~~~~~~~\n"
fi
}
parseLogs() {
# header
printf "\n------\n"
printf "Scandocs finished.\n"
# results
if [ ! -s "$resultsLOG" ]; then
printf "But no results were found."
printf "\"%s\" did not match in \"%s\"" "$PATTERN" "$DIR" > "$resultsLOG"
else
printf "See match results in %s" "$resultsLOG"
fi
# errors
if [ ! -s "$errorsLOG" ]; then
rm -f "$errorsLOG"
else
printf "\nWARNING: there were some errors. See %s" "$errorsLOG"
fi
# footer
printf "\n------\n\n"
}
#the soft
checkDocx2txt
askPrompts
makeLogs
{
searchInDoc
searchInDocx
} 1>"$resultsLOG" 2>"$errorsLOG"
parseLogs
