How to pull down a list of domains using wget and scan them using grep - linux

I have a list of domain names contained in a file named "domains.txt", formatted like this:
www.google.com
www.stackoverflow.com
www.apple.com
etc...
I want to perform a wget command to pull down a copy of each domain listed inside "domains.txt" and save it as a .html page.
I can do this individually using wget www.google.com, but instead of doing each one separately, can I iterate through the list and save each page as a separate .html file?
The second action I want to perform is a scan of these downloaded .html files for keywords, which are listed in a text file named "keywords.txt", formatted like this:
first_keyword
second_keyword
third_keyword
etc...
Ideally, I'd like an output that prints each domain name to a text file, with a "yes" beside it if it has been found to contain any of the keywords in "keywords.txt". If it's possible to print which keywords were found beside each domain, that would be brilliant, but a simple "yes" would be great too. I'm brand new to Linux and scripting, so any help would be greatly appreciated!

I assume the files don't contain the quotes. Otherwise I would need more code to remove the quotes.
domains.txt
www.google.com
www.stackoverflow.com
www.apple.com
keywords.txt
first_keyword
second_keyword
third_keyword
You can try something like this:
outfile=tmp.html
while IFS= read -r domain
do
    # download the page for this domain into a temporary file
    wget -O "$outfile" "$domain"
    # -F treats keywords.txt as fixed strings, -q only sets the exit status
    if grep -q -F -f keywords.txt "$outfile"
    then
        echo "$domain" yes
    else
        echo "$domain" no
    fi
    rm "$outfile"
done < domains.txt
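If you also want to keep each page and record the matching keywords, here is a minimal sketch along the same lines; saving each page as <domain>.html and the output file name results.txt are assumptions, not part of the answer above:
#!/bin/bash
# For each domain, save the page as <domain>.html and record which
# keywords (if any) were found, one line per domain in results.txt.
: > results.txt
while IFS= read -r domain
do
    outfile="${domain}.html"
    wget -q -O "$outfile" "$domain"
    # -o prints each matching keyword, sort -u removes duplicates
    matches=$(grep -o -F -f keywords.txt "$outfile" | sort -u | tr '\n' ' ')
    if [ -n "$matches" ]
    then
        echo "$domain yes ($matches)" >> results.txt
    else
        echo "$domain no" >> results.txt
    fi
done < domains.txt
Keeping the downloaded pages around also means you can re-run just the grep step later without downloading anything again.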

Related

wget: Append text to default file name

When I download a file using wget without the -O argument, it saves the file with a default name. I would like to append some text to that default name, but the -O option completely overrides the default name.
For example:
wget www.google.com --> saves as index.html (this is the default name)
wget -O foo.html www.google.com --> saves as foo.html
I would like to save the file as ${default_name}_sometext.html. For the above example, it would look something like index_sometext.html. Any ideas appreciated!
Edit: For those who are wondering why I might need something like this, I have scraped thousands of URLs for a personal project, and I need to preserve the default name along with some attributes for each file.
#!/bin/bash
domain=google.com
some_text=something
# Parse the default file name out of wget's "Saving to: 'index.html'" line
# and strip the surrounding quote characters.
default_name=$(wget "https://${domain}" 2>&1 | grep Saving | cut -d ' ' -f 3 | sed -e 's/[^A-Za-z0-9._-]//g')
# Append the extra text before the extension: index.html -> index_something.html
mv "$default_name" "${default_name%.*}_${some_text}.${default_name##*.}"
I think this will work for you.
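Since the edit mentions thousands of scraped URLs, here is a sketch of a loop that avoids parsing wget's progress text: each URL is downloaded into an empty temporary directory, so the default name is simply the one file that appears there. The file urls.txt and the suffix variable are assumptions:
#!/bin/bash
suffix=sometext
while IFS= read -r url
do
    tmpdir=$(mktemp -d)
    # Let wget pick its default name inside the empty directory
    wget -q -P "$tmpdir" "$url"
    file=$(ls "$tmpdir")
    # Assumes the default name has an extension, e.g. index.html
    mv "$tmpdir/$file" "./${file%.*}_${suffix}.${file##*.}"
    rmdir "$tmpdir"
done < urls.txt
Note that, just as with -O, URLs whose default names collide (many pages default to index.html) will overwrite each other unless you add something unique to the suffix.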

How to replace text strings (by bulk) after getting the results by using grep

One of my Linux MySQL servers suffered a crash, so I restored a backup; however, this time MySQL is running locally (localhost) instead of remotely (via an IP address).
Thanks to Stack Overflow users I found an excellent command to find the IP address in all .php files in a given directory! The command I am using for this is:
grep -r -l --include="*.php" "100.110.120.130" .
This outputs the matching files with their locations, of course. If there were fewer than 10 results, I would simply change them by hand, but I received over 200 hits.
So now I want to know if there is a safe command which replaces the IP address (example: 100.110.120.130) with the text "localhost" in all .php files in the given directory (/var/www/vhosts/) recursively.
And maybe, if possible and not too much work, also output the changed lines to a file? I don't know if that's even possible.
Maybe someone can provide me with a working solution? To be honest, I don't dare to fool around with this out of the blue. That's why I created a new thread.
The most standard way of replacing a string in multiple files would be to use a tool such as sed. The list of files you've obtained via grep could be read line by line (when output to a file) using a while loop in combination with sed.
$ grep -r -l --include="*.php" "100.110.120.130" . > list.txt
# this will output all matching files to list.txt
Replacing IP in matched files:
while read -r line ; do echo "$line" >> updated.txt ; sed -i 's/100\.110\.120\.130/localhost/g' "${line}" ; done < list.txt
This reads list.txt line by line and runs sed on each file, replacing all occurrences of the IP with "localhost" (the dots are escaped so they match literal dots rather than any character). The echo directly before sed writes each filename that will be modified into updated.txt; it isn't strictly necessary, since list.txt contains the exact same filenames, but it can serve as a means of verification.
To do a dry run before modifying all of the matched files, remove the -i from the sed command and it will print the output to stdout instead of modifying the files in place.
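For example, a dry run over the files in list.txt might look like the sketch below (not part of the answer above); the -n/p combination prints only the lines that would actually change:
# Preview the substitution without touching any file.
while IFS= read -r file
do
    echo "== $file =="
    # -n suppresses normal output; the p flag prints only substituted lines
    sed -n 's/100\.110\.120\.130/localhost/gp' "$file"
done < list.txt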

How to search multiple DOCX files for a string within a Word field?

Is there any Windows app that will search for a string of text within fields in a Word (DOCX) document? Apps like Agent Ransack and its big brother FileLocator Pro can find strings in the Word docs but seem incapable of searching within fields.
For example, I would like to be able to find all occurrences of the string "getProposalTranslations" within a collection of Word documents that have fields with syntax like this:
{ AUTOTEXTLIST \t "<wr:out select='$.shared_quote_info' datasource='getProposalTranslations'/>" }
Note that string doesn't appear within the text of the document itself but rather only within a field. Essentially the DOCX file is just a zip file, I believe, so if there's a tool that can grep within archives, that might work. Note also that I need to be able to search across hundreds or perhaps thousands of files in many directories, so unzipping the files one by one isn't feasible. I haven't found anything on my own and thought I'd ask here. Thanks in advance.
This script should accomplish what you are trying to do. Let me know if that isn't the case. I don't usually write entire scripts because it can hurt the learning process, so I have commented each command so that you might learn from it.
#!/bin/sh
# Create ~/tmp/WORDXML folder if it doesn't exist already
mkdir -p ~/tmp/WORDXML
# Remember the starting directory so relative filenames still resolve after cd
ORIG_DIR=$PWD
# Change directory to ~/tmp/WORDXML
cd ~/tmp/WORDXML || exit 1
# Iterate through each file passed to this script
for FILE in "$@"; do
{
    # Build an absolute path to the current file
    case $FILE in
        /*) FULLPATH=$FILE ;;
        *)  FULLPATH=$ORIG_DIR/$FILE ;;
    esac
    # unzip it into ~/tmp/WORDXML, discarding all unzip output
    unzip "$FULLPATH" > /dev/null 2>&1
    # find all of the xml files
    find . -type f -name '*.xml' | \
    # open them in xmllint to make them pretty. Discard errors.
    xargs xmllint --recover --format 2> /dev/null | \
    # search for the string and report if found
    grep 'getProposalTranslations' && echo " [^ found in file '$FILE']"
    # remove the temporary contents
    rm -rf ~/tmp/WORDXML/*
}; done
# remove the temporary folder
rm -rf ~/tmp/WORDXML
Save the script wherever you like. Name it whatever you like. I'll name it docxfind. Make it executable by running chmod +x docxfind. Then you can run the script like this (assuming your terminal is running in the same directory): ./docxfind filenames...
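If xmllint isn't available, a simpler sketch is to stream each archive's XML parts straight into grep with unzip -p, skipping the pretty-printing step (which is usually fine as long as the string isn't split across XML lines). The ./docs path is an assumption:
#!/bin/sh
# For every .docx under ./docs, grep its XML parts without extracting to disk.
find ./docs -type f -name '*.docx' | while IFS= read -r f
do
    # unzip -p writes the named archive members to stdout
    if unzip -p "$f" 'word/*.xml' | grep -q 'getProposalTranslations'; then
        echo "found in file '$f'"
    fi
done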

Using wget to download images and saving with specified filename

I'm using wget in the Mac terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename it's saved as instead of using the basename. The filename is specified in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use the 2nd column as the -O output name.
I'm sure it's a simple change to my above command but after reading dozens of different posts on here and other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator, it's very easy:
while read -r url name; do wget "${url}" -O "${name}"; done < images.txt
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
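If some lines use a comma instead of a space, one possible tweak (a sketch, assuming the same images.txt) is to add the comma to IFS so read treats it as a separator as well:
# Accepts either "url name" or "url,name" on each line.
while IFS=', ' read -r url name
do
    wget "${url}" -O "${name}"
done < images.txt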

Grep: Copy a link with specific text

I have a text file with many links which aren't on separate lines.
I want to save, probably in another file, all the links which contain a specific word.
How can I do this with grep?
EDIT:
To be more specific, I have a messy txt file with many links. I want to copy to another file all the links starting with https://, ending with .jpg, and containing the string "10x10" anywhere, for example.
You can get all the lines containing a specific word from the file like this:
LINKS=$(grep MYWORD myfile.txt)
Then with LINKS, you can use a delimiter to create an array of links, which you can print to another file.
# Using a space as the delimiter, split each line into an array of links
while IFS=' ' read -ra ind_link
do
    # print every link from the array on its own line
    printf '%s\n' "${ind_link[@]}" >> mynewfile.txt
done <<< "$LINKS"
Something along those lines, I think, is what you are looking for, no?
Also, if you need to refine your search, you can use grep options such as -w to get more specific.
Hope it helps.
Could you give us the specific word and an example of the input file?
You could try to use egrep and/or sed like this (for example):
egrep -o "href=\".*\.html\"" file|sed "s/\"\([^\"]*\)/\1/g"
Another example for all kinds of http/https resource links (without spaces in the URL):
$ echo "<a href=http://titi/toto.jpg >"|egrep -o "https?:\/\/[^\ ]*"
http://titi/toto.jpg
$ echo "<a href=https://titi/toto.htm >"|egrep -o "https?:\/\/[^\ ]*"
https://titi/toto.htm
You have to customize the regexp according to your needs.
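Applied to the edit above (links starting with https://, containing "10x10", and ending in .jpg), a possible one-liner looks like this; myfile.txt and links.txt are assumed names:
# Pull every matching link out of the messy file, one per line.
grep -oE 'https://[^" <>]*10x10[^" <>]*\.jpg' myfile.txt > links.txt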
