formatting whois results to CSV using BASH and grep - linux

I've been trying to get my head round this for a few hours now and thought I'd come here and ask for help.
I have a CSV file of IP addresses from a log file. I want to run each address through WHOIS, pull out the netrange and company name, and then append those results to the end of the CSV.
So far what I have managed to do is get the whois results into a separate CSV:
echo ip, company, > result.csv
for ip in $(grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}" source.csv); do
    whois $ip | grep -i -e 'netrange\|inetnum' -e 'org-name\|orgname' \
        | awk 'BEGIN{FS="NetRange:|inetnum:|OrgName:|org-name:"} {print $2","$3}' \
        | xargs
done >> result.csv
My challenge is how to add my two new columns back into source.csv. I have tried using
paste -d, source.csv result.csv
but all that happens is that the values in result.csv overwrite the first few columns of source.csv.
My source.csv looks something like this:
ip address requests number of visits
66.249.90.77 2149 200
66.249.66.1 216 233
My result.csv
ip range company
66.249.64.0 - 66.249.95.255 Google Inc.
66.249.64.0 - 66.249.95.255 Google Inc.
I would like my final CSV to look like this:
ip requests number of visits ip range company
66.249.90.77 2149 200 66.249.64.0 - 66.249.95.255 Google Inc.
66.249.66.1 2161 233 66.249.64.0 - 66.249.95.255 Google Inc.
If possible I would prefer to accomplish this with bash rather than installing any third-party tools. I have tried the Python package ipwhois, but my Python knowledge is far less than my limited bash knowledge, so I abandoned it rather than keep wasting time!
Any help is much appreciated.

Try putting this in a script file and running it while inside the directory containing your data file.
#!/bin/bash -ue
# Pattern for spacing characters
sp="[[:space:]]*"
# Pattern for non-spacing characters
nsp="[^[:space:]]+"
# Iterate on each line in the data file
while IFS= read -r line
do
    [[ "$line" =~ ^$sp($nsp)$sp($nsp)$sp($nsp)$sp$ ]] || continue
    f1="${BASH_REMATCH[1]}"
    f2="${BASH_REMATCH[2]}"
    f3="${BASH_REMATCH[3]}"
    # Extract information from whois
    whois_data="$(whois "$f1")"
    range="$(grep NetRange <<<"$whois_data" | cut -f2 -d":")"
    company="$(grep OrgName <<<"$whois_data" | cut -f2 -d":")"
    echo "$f1,$f2,$f3,$range,$company"
done <"source.csv"
The output is formatted as comma-separated fields. The range and company variables should also be trimmed (to remove spaces at the beginning and end) before they are used, but this should give you the idea.
Basically, the code does not try to merge two files together. Instead it extracts the fields from each line of the source file, performs the whois lookup, extracts the two fields needed from it, and then outputs all five fields without an intermediate step.
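For the trimming, a small helper based on bash parameter expansion would do. This is just a sketch (the trim name is only an illustration) that could be dropped into the loop above:
# Strip leading and trailing whitespace using parameter expansion
trim() {
    local s="$1"
    s="${s#"${s%%[![:space:]]*}"}"   # remove leading whitespace
    s="${s%"${s##*[![:space:]]}"}"   # remove trailing whitespace
    printf '%s' "$s"
}
range="$(trim "$range")"
company="$(trim "$company")"
echo "$f1,$f2,$f3,$range,$company"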

Find and copy specific files by date

I've been trying to get a script working to backup some files from one machine to another but have been running into an issue.
Basically what I want to do is copy two files, one .log and one (or more) .dmp. Their format is always as follows:
something_2022_01_24.log
something_2022_01_24.dmp
I want to do three things with these files:
find the second-to-last .log file (i.e. if something_2022_01_24.log is the latest, I want to find the one before that, say something_2022_01_22.log)
get a substring with just the date (2022_01_22)
copy every .dmp that matches that date (e.g. something_2022_01_24.dmp, something01_2022_01_24.dmp)
For the first one, from what I could find the best way is to do ls -t *.log | head -2, as it displays the second-to-last file created.
As for the second one I'm more at a loss, because I'm not sure how to parse the output of the first command.
The third one I think I could manage with something of the sort:
[ -f "/var/www/my_folder/*$capturedate.dmp" ] && cp "/var/www/my_folder/*$capturedate.dmp" /tmp/
What do you think, is there any way to do this? How can I compare the substring?
Thanks!
Would you please try the following:
#!/bin/bash
dir="/var/www/my_folder"
second=$(ls -t "$dir/"*.log | head -n 2 | tail -n 1)
if [[ $second =~ .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log ]]; then
    capturedate=${BASH_REMATCH[1]}
    cp -p "$dir/"*"$capturedate".dmp /tmp
fi
second=$(ls -t "$dir"/*.log | head -n 2 | tail -n 1) will pick the second-to-last log file. Please note it assumes that the timestamp of the file has not been modified since it was created and that the filename does not contain special characters such as a newline. This is a simple solution and may need further hardening for robustness.
The regex .*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log will match the log filename. It extracts the date substring (enclosed in the parentheses) and makes it available in the bash variable ${BASH_REMATCH[1]}.
Then the next cp command will do the job. Please be careful not to include the wildcard * within the double quotes, so that the wildcard is properly expanded.
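As a quick illustration of that point:
# Wildcard outside the quotes: the shell expands * before calling cp
cp -p "$dir/"*"$capturedate".dmp /tmp
# Wildcard inside the quotes: cp looks for a file literally named with a * and fails
# cp -p "$dir/*$capturedate.dmp" /tmp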
FYI here are some alternatives to extract the date string.
With sed:
capturedate=$(sed -E 's/.*_([0-9]{4}_[0-9]{2}_[0-9]{2})\.log/\1/' <<< "$second")
With parameter expansion of bash (if something does not include underscores):
capturedate=${second%.log}
capturedate=${capturedate#*_}
With cut command (if something does not include underscores):
capturedate=$(cut -d_ -f2,3,4 <<< "${second%.log}")

Removing content existing in another file in bash [duplicate]

I am attempting to clean file1.txt, which always contains lines in the same format, using file2.txt, which contains a list of IP addresses I want to remove.
I believe the working script I have written can be made faster.
My script:
#!/bin/bash
IFS=$'\n'
for i in $(cat file1.txt); do
    for j in $(cat file2); do
        echo ${i} | grep -v ${j}
    done
done
I have tested the script with the following data set:
Amount of lines in file1.txt = 10,000
Amount of lines in file2.txt = 3
Script execution time:
real 0m31.236s
user 0m0.820s
sys 0m6.816s
file1.txt content:
I3fSgGYBCBKtvxTb9EMz,1.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.2.2.3,45,This IP belongs to office space,1539760502,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.3.2.3,45,This IP belongs to office space,1539760503,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.4.2.3,45,This IP belongs to office space,1539760504,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.5.2.3,45,This IP belongs to office space,1539760505,https://myoffice.com
... lots of other lines in the same format
I3fSgGYBCBKtvxTb9EMz,4.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
file2.txt content:
1.1.2.3
1.2.2.3
... lots of other IPs here
1.2.3.9
How can I improve those timings?
I am confident that the files will grow over time. In my case I will run the script every hour from cron, so I would like to improve it now.
You want to get rid of all lines in file1.txt that contain substrings matching lines in file2.txt. grep to the rescue:
grep -vFwf file2.txt file1.txt
The -w is needed to avoid 11.11.11.11 matching 111.11.11.111.
-F, --fixed-strings, --fixed-regexp Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-w, --word-regexp Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
source: man grep
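A tiny sketch (with made-up sample lines) showing the effect of -w on that example:
printf '%s\n' 'a,111.11.11.111,b' 'a,11.11.11.11,b' | grep -vFw 11.11.11.11
# keeps only the line containing 111.11.11.111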
On a further note, here are a couple of pointers for your script:
Don't use for loops to read files (http://mywiki.wooledge.org/DontReadLinesWithFor).
Don't use cat (See How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?)
Use quotes! (See Bash and Quotes)
This allows us to rewrite it as:
#!/bin/bash
while IFS=$'\n' read -r i; do
    while IFS=$'\n' read -r j; do
        echo "$i" | grep -v "$j"
    done < file2
done < file1
Now the problem is that you read file2 N times, where N is the number of lines in file1. This is not really efficient. Luckily, grep has the solution for us (see above).
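If you ever want exact field matching rather than substring matching, here is a sketch of an awk alternative; it assumes the IP is always the second comma-separated field of file1.txt, and cleaned.txt is just a placeholder output name:
# Load the unwanted IPs into an array, then print only the lines whose IP field is not in it
awk -F, 'NR==FNR { bad[$1]; next } !($2 in bad)' file2.txt file1.txt > cleaned.txt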

cat | sort csv file by name in bash

I have a bunch of CSV files that I want to combine into one file, ordered by name.
I use
cat *.csv | sort -t\ -k2 -n *.csv > output.csv
It works well for names like a001, a002, a010, a100,
but in my files the names are messed up a bit, so they are like a1, a2, a10, a100,
and the command I wrote arranges them like this:
cn201
cn202
cn202
cn203
cn204
cn99
cn98
cn97
cn96
..
cn9
Can anyone please help me?
Thanks
If I understand correctly, you want to use the -V (version-sort) flag instead of -n. This is only available on GNU sort, but that's probably the one you are using.
However, it depends how you want the prefixes to be sorted.
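For instance, assuming the name you sort on is in the second space-separated field (as in your command), a sketch with -V would be:
sort -t' ' -k2,2V *.csv > output.csv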
If you don't have the -V option, sort allows you to be more precise about what characters constitute a sort key.
sort -t\ -k2.3n *.csv > output.csv
The .3 tells sort that the key to sort on starts with the 3rd character of the second field, effectively skipping the cn prefix. You can put the n directly in the field specifier, which saves you two whole characters, but more importantly for more complex sorts, allows you to treat just that key as a number, rather than applying -n globally (which is only an issue if you specify multiple keys with several uses of -k).
The sort version on the live server is 5.97 from 2006, so a few things did not work correctly. However, the code below is my solution:
#!/bin/bash
echo "This script reads all CSVs into a single file (clusters.csv) in this directory"
for filers in *.csv
do
    echo "" >> clusters.csv
    echo "--------------------------------" >> clusters.csv
    echo "$filers" >> clusters.csv   # write the source filename into clusters.csv, not a separate file
    echo "--------------------------------" >> clusters.csv
    cat "$filers" >> clusters.csv
done
Or, if you want to keep it simple, in one command:
awk 'FNR > 1' *.csv > clusters.csv

How to compare two text files for the same exact text using BASH?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File 1:
1name - randomemail#email.com
2Name - superrandomemail#email.com
3Name - 123random#email.com
4Name - random123#email.com
File 2:
email.com
email.com
email.com
anotherwebsite.com
File 2 is File 1's list of domain names, extracted from the email addresses.
These are not the same domain names by any means, and are quite random.
How can I get the results of the domain names that match File 2 from File 1?
Thank you in advance!
Assuming that order does not matter,
grep -F -f FILE2 FILE1
should do the trick. (This works because of a little-known fact: the -F option to grep doesn't just mean "match this fixed string," it means "match any of these newline-separated fixed strings.")
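If you want a domain to match only the host part of an address rather than anywhere on the line, one sketch (relying on the # separator that appears in File 1) is to prefix each pattern on the fly:
# sed adds a leading "#" to every domain in FILE2 before grep reads them as patterns
grep -F -f <(sed 's/^/#/' FILE2) FILE1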
The recipe:
join <(sed 's/^.*#//' file1|sort -u) <(sort -u file2)
It will output the intersection of all domain names in file1 and file2.
See BashFAQ/036 for the list of usual solutions to this type of problem.
Use the vimdiff command; it gives a nice presentation of the differences.
If I understand you correctly, you want to filter for all addresses whose host is mentioned in File 2.
You could then just loop over File 2 and grep for #<line>, accumulating the results in a new file or something similar.
Example:
cat file2 | sort -u | while read host; do grep "#$host" file1; done > filtered
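A slightly tighter sketch of the same loop, dropping the extra cat and using -F since the hosts are literal strings:
sort -u file2 | while read -r host; do
    grep -F "#$host" file1
done > filtered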

using grep in an if statement to get all items, ignoring spaces

This is part of a homework problem in a beginning bash class.
I need to bring in the passwd file, which I have done with my passfile variable, and then I need to be able to extract certain pieces of it and display the different fields. When I manually grep from the CLI using the statement below it works fine: I want all the fields and I get them all.
grep 1000 passfile | cut -c1-
However, when I do this from the script it stops, breaks, or starts over at the first blank space in the user's full name. John D. Doe will return 3 lines when I only want one. I see this by echoing the value of i, as shown below.
for i in `grep 1000 ${passfile} | cut -c1-`; do
    user=`echo $i | cut -d : -f1`
    userID=`echo $i | cut -d : -f3`
For example, if the line reads
jdoe:x:123:1000:John D Doe:/home/jdoe:/bin/bash
I get the following:
i = jdoe:x:123:1000:John
which gives me:
User is jdoe, UID is 509
but then in the next line i starts at R.
i = R. so User is R., UID is R.
next line
i = Johnson:/home/jjohnson:/bin/bash
which returns User is Johnson, UID is /bin/bash
The passwd file holds many users, so I need to use the for loop to process them all. I think if I can get it to ignore the space I can get it to work. But not knowing a whole lot about Linux, I'm not sure if I'm even going down the right path. Thanks in advance for guidance/help.
By default, the shell splits the words of your for loop on spaces, not colons; that is why the full name gets broken up. If you continue to use cut, specify the separator explicitly.
You probably want to use IFS=: and a read statement in a while loop to get the values in:
while IFS=: read user password uid gid comment home shell
do
    ...whatever...
done < /etc/passwd
Or you can pipe the output of grep into the while loop.
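For instance, a sketch of that second approach, reusing the passfile variable and the grep from your question:
grep 1000 "${passfile}" | while IFS=: read -r user password uid gid comment home shell
do
    echo "User is $user, UID is $uid"
done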
Are you allowed to use any external program? If so, I'd recommend awk
uid=1000   # UID itself is a read-only shell variable in bash, so use a lowercase name
awkcmd="\$4==\"$uid\" {print \"user:\",\$1}"
cat "$PASSWORDFILE" | awk -F ":" "$awkcmd"
When parsing structured files with specific field delimiters, such as the passwd file, the appropriate tool for the job is awk.
uid=1000
awk -v uid="$uid" '$4==uid{print "user: "$1}' /etc/passwd
You do not have to use grep or cut or anything else. (Of course, you can also use pure bash while read loops, as demonstrated above.)
