Is there a way to remove a line fully? - linux

I'm using a one-line command to compile and print all of the animal names listed in a log file.
The WILD names are all listed in capital letters under the /wild directory.
The output should appear in the format of one name per line, with no duplicates:
ANT
BAT
CAT
I tried
grep 'wild' animal.txt | awk '{print $7}' | sed 's/[a-z0-9./]//g' | sort -u
It showed roughly what I want, but I also want to remove every whole string that contains special characters such as -, #, ? or %.
Below is a sample of the file animal.txt
191.21.66.100 - - [21/Aug/1995:05:17:57 -0400] "GET /wild/elvpage.htm#ZOO HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:22:35 -0400] "GET /wild/S/s_26s.jpg HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:22:41 -0400] "GET /wild/struct.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:34 -0400] "GET /wild/elvpage.htm HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:36 -0400] "GET /wild/endball.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:37 -0400] "GET /wild/hot.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:38 -0400] "GET /wild/elvhead3.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:38 -0400] "GET /wild/PEGASUS/minpeg1.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:39 -0400] "GET /wild/DOG/DOG.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:39 -0400] "GET /wild/SWAN/SWAN.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:39 -0400] "GET /wild/ATLAS/atlas.gif HTTP/1.0"
191.21.66.100 - - [01/Aug/1995:02:27:40 -0400] "GET /wild/LIZARD/lizard.gif HTTP/1.0"
Below is a sample of my output after running the command:
ATLAS
ATLAS-
CAT_
DOG
%FACT
-KWM
?TIL-
#ZOO

Why not allow only capital A-Z and remove everything else:
grep 'wild' animal.txt | awk '{print $7}' | sed 's/[^A-Z]//g'
From your example input, this will return:
PEGASUS
DOGDOG
SWANSWAN
ATLAS
LIZARD
If you need to, you can further clean up empty lines by appending | sed '/^$/d' and then sort.
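Putting the pieces together, a sketch of the full pipeline (assuming animal.txt sits in the current directory):

```shell
# Keep only field 7, strip every character that is not A-Z, drop the
# lines that end up empty, and deduplicate.
grep 'wild' animal.txt | awk '{print $7}' | sed 's/[^A-Z]//g' | sed '/^$/d' | sort -u
```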

You can use a single GNU sed command:
sed -n 's!.*/wild/\([A-Z][A-Z]\+\)/.*!\1!p' animal.txt
Means:
-n: Do not print every line by default.
s!X!Y!: Substitute X with Y (here with ! as the delimiter).
.*/wild/\([A-Z][A-Z]\+\)/.*: Match a capital letter followed by at least one more capital letter, preceded by /wild/ and followed by a / and anything else. Capture (remember) the capital letters.
\1: Replace everything that was matched with the captured capital-letter sequence.
p: If there was a match, print the line.
Gives:
PEGASUS
DOG
SWAN
ATLAS
LIZARD
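If GNU grep with PCRE support is available, a similar extraction can be sketched in one grep call (an alternative to the sed command above, under the same /wild/<CAPS>/ assumption):

```shell
# -o prints only the matching part; \K discards the /wild/ prefix from
# the output, and the lookahead requires a trailing / without consuming it.
grep -oP '/wild/\K[A-Z]{2,}(?=/)' animal.txt | sort -u
```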

This might work for you (GNU sed):
sed -E '/.*\/wild\/[^A-Z ]*([A-Z]+).*/!d # delete lines with no uppercase letters
s//\1/ # keep only the captured uppercase letters
H # append word to the hold space
$!d # delete all lines but the last
x # swap to the hold space
:a # define a loop label
s/((\n[^\n]+).*)\2/\1/ # remove duplicates
ta # repeat until failure
s/.//' file # remove introduced newline
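If first-seen order is enough and the hold-space juggling feels heavy, the dedup step can also be sketched with awk after the extraction sed from the earlier answer (GNU sed assumed, for \+):

```shell
# Extract the uppercase directory names, then let awk print each value
# only the first time it is seen.
sed -n 's!.*/wild/\([A-Z][A-Z]\+\)/.*!\1!p' animal.txt | awk '!seen[$0]++'
```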

GNU awk to get the result:
grep 'wild' animal.txt | awk '
($0 = $7) {gsub(/\//, " ", $0)};        # keep only field 7 and replace each / with a space, so $0 splits into $1, $2, $3
(NF == 3 && length($2) > 2) {print $2}  # if the line now has exactly three fields and $2 is longer than two characters, print $2
' | sort -u
Answer:
grep 'wild' animal.txt | awk '
($0 = $7) {gsub(/\//, " ", $0)};
(NF == 3 && length($2) > 2) {print $2}' | sort -u

Related

AWK command to print the pattern as below from pattern provided

I have a pattern as below:
Pattern in a unix file:
{1.11.111.111 - 2017-10-06T00:00:00+00:00 111111 1 302 "GET /abcd/z1/bcdfgggg?values" uri="/abcd/v2/nano" 111 111 0 "-" "abcd/2.1.0 (Linux; U; Android 8.1.0; Redmi Note 6 Pro MIUI/V10.2.2.0.bcdwvc)" "1111:1111:111:1111:11:d11e:c11c:111a" cu=0.011 nano=0.011 var="-12345" "1111:1111:111:1111:11:d11e:c11c:111a, 11.111.111.111"}
I am trying to print the below result but the result is not printed as expected.
Code:
cat test.txt | awk -F'"' '{ print $1,$9 }' | awk -F' ' '{ print $3,$6,$24 }'
Actual Result: 2017-10-06T00:00:00+00:00 302
Expected Result: 2017-10-06T00:00:00+00:00 302 cu=0.011
With GNU sed and a regex with three backreferences:
sed -r 's/.* ([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:+]{14}) [0-9]+ [0-9]+ ([0-9]{3}) .*(cu=[0-9.]+).*/\1 \2 \3/' file
Output:
2017-10-06T00:00:00+00:00 302 cu=0.011
See: The Stack Overflow Regular Expressions FAQ
Assuming the log entry will always look like presented by the OP:
pattern='{1.11.111.111 - 2017-10-06T00:00:00+00:00 111111 1 302 "GET /abcd/z1/bcdfgggg?values" uri="/abcd/v2/nano" 111 111 0 "-" "abcd/2.1.0 (Linux; U; Android 8.1.0; Redmi Note 6 Pro MIUI/V10.2.2.0.bcdwvc)" "1111:1111:111:1111:11:d11e:c11c:111a" cu=0.011 nano=0.011 var="-12345" "1111:1111:111:1111:11:d11e:c11c:111a, 11.111.111.111"}'
awk -F ' ' '{print $3,$6,$25}' <<< "$pattern"
Output: 2017-10-06T00:00:00+00:00 302 cu=0.011
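Field positions like $24 or $25 break whenever the user-agent string contains a different number of spaces, so a sketch that finds the cu= token by name may be safer (assumption: the token always has the form cu=<number>):

```shell
# Print the timestamp ($3) and status code ($6), plus the cu= token,
# found by scanning the fields rather than hard-coding its position.
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^cu=/) cu = $i; print $3, $6, cu }' file
```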

Linux Shell Script: How to compare a specific field in a text document with specific text in an if statement

I have a file named transfer.log that has a few Apache logs. I need to count how many GET requests each IP address has logged. I know how to access the file and loop through the lines in the file but I am having trouble with comparing the 6th field in each line with "GET".
#!/bin/bash
while read p;
do
name=( $(awk '{print $6}' p))
echo $name
if [ "$name" == "GET" ]
then
echo "yes"
else
echo "no"
fi
done < transfer.log
Currently, when I run the script, "no" is printed 5 times and I receive an error saying awk cannot open the file "p". When I change p to transfer.log in the variable declaration, echo $name outputs "GET (with the leading quotation mark), but it obviously never changes, because the command reads the entire file instead of the current line p.
I need to know how to assign the 6th column of p to my variable name on each iteration of the while loop. Also, I am confused as to why my loop only iterates 5 times and not 6.
My transfer.log looks like this:
140.211.167.27 - - [15/Oct/2012:23:11:38 +0000] "GET / HTTP/1.1" 200 2963 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"
140.211.167.27 - - [15/Oct/2012:23:11:46 +0000] "GET /systems/ganeti/index HTTP/1.1" 200 5918 "https://wiki.osuosl.org/systems/index" "Mozilla/5.0(X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"
140.211.167.9 - - [15/Oct/2012:23:17:33 +0000] "GET /resources/index HTTP/1.1" 200 3411 "https://wiki.osuosl.org/index" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
140.211.167.25 - - [15/Oct/2012:16:02:07 +0000] "GET /index HTTP/1.1" 200 2673 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
66.249.74.101 - - [15/Oct/2012:02:20:14 +0000] "GET /robots.txt HTTP/1.1" 404 2458 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
128.193.164.34 - - [15/Oct/2012:12:41:18 +0000] "POST /rpc/xmlrpc HTTP/1.0" 200 8328 "-" "PHP XMLRPC 1.0"
Ultimately, I need to count how many GET requests each specific IP address has logged and sort the addresses by least to greatest GET requests.
You can use the following awk command to do the trick:
$ awk '{if($6=="\"GET")ip[$1]++; else ip[$1]+=0}END{for(elem in ip){print elem, ip[elem]}}' input.log | sort -k2nr
140.211.167.27 2
140.211.167.25 1
140.211.167.9 1
66.249.74.101 1
128.193.164.34 0
Explanations:
{if($6=="\"GET")ip[$1]++; else ip[$1]+=0}: on each line of the file, it checks the 6th field; if it is equal to "GET, it increments an array entry whose index is the IP. If the 6th field is not equal to "GET, it adds 0 to the array entry, so that IPs which have only done POSTs are still counted; you can remove this logic if you do not need it.
Then, at the end of the file, it prints each IP plus its number of GETs.
Everything is piped to a sort command that sorts the output on the second field in reverse numerical order.
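The question actually asks for least-to-greatest ordering, which dropping the r from the sort flags gives. A minimal sketch of the same counting idea with an ascending sort (note: unlike the command above, this skips IPs that logged no GETs):

```shell
# Count "GET requests per IP, then sort ascending by count
# ($6 is the quoted method field in this log layout).
awk '$6 == "\"GET" { ip[$1]++ } END { for (e in ip) print e, ip[e] }' transfer.log | sort -k2n
```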
The following line is wrong:
name=( $(awk '{print $6}' p))
You should replace it with:
name=$(echo "$p" | awk '{print $6}')
You passed p, the literal variable name, to awk, which expected a file name there. Also, the outer parentheses were redundant.
I tried parsing the log file; for your reference:
#!/bin/bash
howmanyGET=0
loopcounter=0
while read line;do
#echo "Line # $loopcounter: $line"
((loopcounter++))
name=`echo $line | awk '{print $6}'`
#name=( $(awk '{print $6}' p))
#echo $name
name=${name:1:3}
echo $name
if [ "$name" == "GET" ]
then
echo "yes"
((howmanyGET++))
else
echo "no"
fi
done < transfer.log
echo "GET: $howmanyGET"
echo "loop: $loopcounter"
output here:
$ bash counter.sh
GET
yes
GET
yes
GET
yes
GET
yes
GET
yes
POS
no
GET: 5
loop: 6
Hope this helps.

awk match multiple regular expression strings and digits from a line

I am trying to match multiple items in each line in a httpd log file. The lines look like this:
192.168.0.1 - - [06/Apr/2016:16:35:42 +0100] "-" "100" "GET /breacher/gibborum.do?firstnumber=1238100121135&simple=1238100121135&protocol=http&_super=telco1 HTTP/1.1" 200 161 "-" "NING/1.0"
192.168.0.1 - - [06/Apr/2016:16:35:44 +0100] "-" "00" "GET /breacher/gibborum.do?firstnumber=1237037630256&simple=1237037630256&protocol=http&_super=telco1 HTTP/1.1" 200 136 "-" "NING/1.0"
192.168.0.1 - - [06/Apr/2016:16:35:44 +0100] "-" "00" "GET /breacher/gibborum.do?firstnumber=1238064400578&simple=1238064400578&protocol=http&_super=telco1 HTTP/1.1" 200 136 "-" "NING/1.0"
I am trying to extract the numbers, the timestamp and the value of the _super variable. So far I can extract the numbers and the timestamp with this:
awk '{match ($0, /123([0-9]+)/, arr); print $4, arr[0]}'
Please how do I extract the value at the end of the _super= variable as well?
You could change your script like this (add the gsub and the $9):
awk '{match ($0, /123([0-9]+)/, arr); gsub(/.*_super=/, "",$9); print $4, arr[0], $9}'
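The three-argument match() used here is GNU-awk-specific; a portable POSIX-awk sketch of the same extraction uses plain match() with RSTART/RLENGTH and substr() (access.log is a placeholder file name; the _super value is assumed to contain no spaces or ampersands):

```shell
awk '{
  match($0, /123[0-9]+/)                     # locate the number
  num = substr($0, RSTART, RLENGTH)
  match($0, /_super=[^& ]+/)                 # locate the _super=value token
  val = substr($0, RSTART + 7, RLENGTH - 7)  # strip the 7-character "_super=" prefix
  print $4, num, val
}' access.log
```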

parse httpd log in bash

my httpd log has the following format
123.251.0.000 - - [05/Sep/2014:18:19:24 -0700] "GET /myapp/MyService?param1=value1&param2=value2&param3=value3 HTTP/1.1" 200 15138 "-" "-"
I need to extract the following fields and display on a line:
IP value1 httpResponseCode(eg.200), dataLength
what's the most efficient way to do this in bash?
As you're using Linux, chances are that you also have GNU awk installed. If so:
$ awk 'match ($7, /param1=([^& ]*)/, m) { print $1, m[1], $9",", $10 }' http.log
gives:
123.251.0.000 value1 200, 15138
This works as long as value1 hasn't got an ampersand or space in it, which they shouldn't if the request has been escaped correctly.
$ cat tmp.txt
123.251.0.000 - - [05/Sep/2014:18:19:24 -0700] "GET /myapp/MyService?param1=value1&param2=value2&param3=value3 HTTP/1.1" 200 15138 "-" "-"
$ awk '{ print "IP", $1, $9, $10 }' tmp.txt
IP 123.251.0.000 200 15138

Find text in files and get the needed content

I have many access_log files. This is a line from a file of them.
access_log.20111215:111.222.333.13 - - [15/Dec/2011:05:25:00 +0900] "GET /index.php?uid=01O9m5s23O0p&p=nutty&a=check_promotion&guid=ON HTTP/1.1" 302 - "http://xxx.com/index.php?uid=xxx&p=mypage&a=index&sid=&fid=&rand=1681" "Something/2.0 qqq(xxx;yyy;zzz)" "-" "-" 0
How can I extract the uid "01O9m5s23O0p" from the lines which contain "p=nutty&a=check_promotion", and output it to a new file?
For example, The "output.txt" file should be:
01O9m5s23O0p
01O9m5s0999p
01O9m5s3249p
fFDSFewrew23
SOMETHINGzzz
...
I tried the:
grep "p=nutty&a=check_promotion" access* > using_grep.out
and
fgrep -o "p=nutty&a=check_promotion" access* > using_fgrep.out
but it prints whole line. I just want to get the uid.
Summary:
1) Find the lines which have "p=nutty&a=check_promotion"
2) Extract uid from those lines.
3) Print them to a file.
Do exactly that, in three stages (formatted to avoid horizontal scrolling):
grep 'p=nutty&a=check_promotion' access* \
| grep -o '[[:alnum:]]\{4\}m5s[[:alnum:]]\{4\}p' \
> output.txt
If the lines which contain p=nutty&a=check_promotion are similar in nature, then we can set the delimiters and use awk to extract the uid and place it in a file.
awk -v FS="[?&=]" '
$0~/p=nutty&a=check_promotion/{ print $3 > "output_file"}' input_file
Test:
[jaypal:~/Temp] cat file
access_log.20111215:210.136.161.13 - - [15/Dec/2011:05:25:00 +0900] "GET /index.php?uid=01O9m5s23O0p&p=nutty&a=check_promotion&guid=ON HTTP/1.1" 302 - "http://xxx.com/index.php?uid=xxx&p=mypage&a=index&sid=&fid=&rand=1681" "Something/2.0 qqq(xxx;yyy;zzz)" "-" "-" 0
[jaypal:~/Temp] awk -v FS="[?&=]" '
$0~/p=nutty&a=check_promotion/{ print $3 > "output_file"}' input_file
[jaypal:~/Temp] cat output_file
01O9m5s23O0p
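An alternative sketch using only grep and sed, anchoring on the &p=nutty that immediately follows the uid in these URLs (so the uid=xxx in the referrer field is not picked up):

```shell
# Select the promotion-check lines, cut out "uid=...&p=nutty", then
# strip the prefix and suffix, leaving only the uid values.
grep 'p=nutty&a=check_promotion' access* \
  | grep -o 'uid=[^&]*&p=nutty' \
  | sed 's/^uid=//; s/&p=nutty$//' > output.txt
```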
