Linux sort/uniq apache log

I have a small file (~100 lines) of web requests in Apache's standard log format; there are multiple requests from the same clients. I want my file reduced to ONLY the lines that come from a unique IP, keeping the latest entry for each.
What I have so far:
/home/$: cat all.txt | awk '{ print $1}' | sort -u | "{print the whole line ??}"
The above gives me the IPs (about 30, which is right); now I need the rest of the line (the request) as well.

Use an associative array to keep track of which IPs you've found already:
awk '{
    if (!found[$1]) {
        print;
        found[$1] = 1;
    }
}' all.txt
This will print the first line for each IP. If you want the last one then:
awk '
{ found[$1] = $0 }
END {
    for (ip in found)
        print found[ip]
}
' all.txt
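A quick demo with made-up entries (hypothetical IPs and requests; note that awk's for-in iteration order is unspecified, so the two output lines may come out in either order):
$ printf '%s\n' '10.0.0.1 GET /a' '10.0.0.2 GET /b' '10.0.0.1 GET /c' > all.txt
$ awk '{ found[$1] = $0 } END { for (ip in found) print found[ip] }' all.txt
10.0.0.2 GET /b
10.0.0.1 GET /c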

I hate that uniq doesn't come with the same options as sort, or that sort cannot do what it says. According to the man page this should work:
tac access.log | sort -fb -k1V -u
but alas, it doesn't. Therefore, it seems we're stuck doing something silly like
cat all.txt | awk '{ print $1}' | sort -u | while read ip
do
tac all.txt | grep -h "^$ip" | head -1
done
which is really inefficient, but 'works' (I haven't tested it, so modulo typos).

The following should work:
tac access.log | sort -f -k1,1 -us
This takes the file in reverse order and does a stable sort on the first field, keeping only the first line seen for each unique key; since the input is reversed, that first line is the latest entry for each IP.
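A minimal demonstration with the same made-up entries as above (GNU coreutils assumed, for tac and the -s stable-sort flag):
$ printf '%s\n' '10.0.0.1 GET /a' '10.0.0.2 GET /b' '10.0.0.1 GET /c' > access.log
$ tac access.log | sort -f -k1,1 -us
10.0.0.1 GET /c
10.0.0.2 GET /b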

Related

Linux Sorting based on different column in file names

Can you please help me sort these file names by a couple of conditions?
ls -tr | grep ${DATE}* | sort -k1
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CTL.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CDS.sh
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CTL.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_05drop_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CTL.sql
I want the output sorted by database, then schema, then file number.
A_IS_CRB is the db and CTL, CDS are schemas (there can also be different db names).
I want to process all 7 files for one database and one schema, then proceed with the other 7 files of the same database and a different schema, or of a different database with some schema.
I tried a couple of things:
ls -tr | grep ${DATE}* | sort -k1
ls -tr | grep ${DATE}* | sort -t $'_' -k4 -k5 -k2,2
ls -tr | grep ${DATE}* | grep " awk -F'[0-9]_' '{print $NF}' |awk -F_ '{print $NF}' |sed 's/.sql//' |sed 's/.sh//' | sed 's/\_$//'| uniq" (to grep schema)
No luck; any help is much appreciated. The desired output is:
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CDS.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CTL.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CTL.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CTL.sql
I isolated the names of the files into two arrays, then printed them at the end of processing:
awk '/_CDS/{cds[$0]} /_CTL/{ctl[$0]} END{for(i in cds){print i} for(ii in ctl){print ii}}' your_file
Tell me if this solution is right for you. (Note that awk's for-in iteration order is unspecified, so you may want to pipe each group through sort to keep the files in order.)
Look at this:
sort -t '-' -k2 list
It may be a good solution for you. Tell me if it's what you want.
If I understand what you want, you want to sort by db-schema so you can process files 1-7 with the CDS schema and then 1-7 with the CTL schema. You can do that by using awk's split to isolate the db-schema, outputting the entire record followed by the db-schema to allow sorting, and then using awk again to drop the second (db-schema) sort column, e.g.
awk -F'-' '{split($2,a,"_"); print $0" "substr(a[5],1,3)}' listing |
sort -k2 |
awk '{print $1}'
Example Output
With your input in the listing file, you would receive:
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CDS.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CTL.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CTL.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CTL.sql
Let me know if you need changes to this output.
Edit - Update per Format Change: Delimiting db-name and db-schema with '-'
In the comments you advised that the db-name and db-schema were not fixed, with the db-name containing two '_' delimiters and the db-schema containing none. That made it ambiguous where the db-name ends and the db-schema begins: there is no way to know whether you have a 3-part (two '_') name and a 2-part (one '_') schema, a 4-part (three '_') name and a 1-part (no '_') schema, or any of the other half-dozen combinations between 3-5 part names and 1-3 part schemas.
Adding '-' as the delimiter between the db-name and db-schema provides an unambiguous way to isolate the db-schema from the filename, regardless of the number of '_'-separated parts in either. You can use '-' as the field delimiter for awk, and then $NF is the last field (db-schema plus extension). Then substr($NF, 1, match($NF, /[.]/) - 1) isolates the db-schema alone.
awk -F'-' '{ print $0" "substr($NF,1,match($NF,/[.]/)-1) }' listing |
sort -k2 |
awk '{print $1}'
Short Example Input
$ cat listing
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
Example Use/Output
$ awk -F'-' '{ print $0" "substr($NF,1,match($NF,/[.]/)-1) }' listing |
> sort -k2 |
> awk '{print $1}'
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
If you want to maintain the extension as part of the db-schema, (I see you have both .sql and .sh extensions) then simply use the following as the first awk command
awk -F'-' '{ print $0" "$NF }' listing
Give it a go with your updated names and let me know if there are any hiccups.
Additional Sort By DBSchema, then DBName, then FileNo.
In order to sort by all the additional parameters you list, you will need to change the primary field delimiter to something that lets each piece of information needed for sorting be extracted as its own field. A good choice is simply to use '-' to separate the fields, as:
Option fileno_stuff date time dbname dbschema
That would correspond to an example record of, e.g.:
dboption-03create_table-20200710-092914-FOO_PDA-BAR_CDS.sql
If you make those changes to your listing, you can then append three columns (fileno, dbname, dbschema), allowing you to sort -k4 -k3 -k2n. To append the fields and sort the new data, you could do:
awk -F'-' '{print $0" "substr($2,1,match($2,/[^0-9]+/)-1)+0" "$(NF-1)" "substr($NF,1,match($NF,/[.]/)-1)}' listing |
sort -k4 -k3 -k2n |
awk '{print $1}'
Example Input Listing
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDS.sql
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDT.sql
dboption-01beforeschemasize-20200710-092914_PDA-CDS.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDS.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDT.sql
dboption-02beforetablesize-20200710-092914_PDA-CDS.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDS.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDT.sql
dboption-03create_table-20200710-092914_PDA-CDS.sql
Sorted Output
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDS.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDS.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDS.sql
dboption-01beforeschemasize-20200710-092914_PDA-CDS.sql
dboption-02beforetablesize-20200710-092914_PDA-CDS.sql
dboption-03create_table-20200710-092914_PDA-CDS.sql
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDT.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDT.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDT.sql
When you have a reformatted listing, give it a try and let me know of any issues.
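As an aside (not from the original answer): if every file in the batch shares the same timestamp, as in the example input above, and each schema group uses a single extension, GNU sort alone can produce the same order, because the '-' delimiters put the schema in field 5 and the time_dbname in field 4:
sort -t'-' -k5,5 -k4,4 -k2,2n listing
Here -k5,5 is the schema (plus extension), -k4,4 compares the dbname (the shared time prefix ties), and -k2,2n sorts the leading file number numerically.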

Find number of unique values in a column

I would like to know the count of unique values in a column using Linux commands. The column has values like those below (the data has been edited since an earlier version of the question). I need to ignore the .M, .Q, and .A at the end and just count the unique number of plants.
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.M"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56841-WND-WT.Q"
"series_id":"ELEC.CONS_TOT.COW-GA-2.M"
"series_id":"ELEC.CONS_TOT.COW-GA-94.M"
I've tried the following, but I'm not able to strip those suffixes:
cat ELEC.txt | grep 'series_id' | cut -d, -f1 | wc -l
For the above sample, the expected count is 6, but I get 8.
This should do the job:
grep -Po "ELEC.PLANT.*" FILE | cut -d. -f -4 | sort | uniq -c
You first grep for the "ELEC.PLANT." part,
then cut -d. -f -4 drops the trailing .Q/.A/.M (everything after the fourth .),
then sort | uniq -c removes duplicates and counts the occurrences of each.
EDIT:
for the new data (which now also contains ELEC.CONS_TOT lines) it should only be necessary to broaden the grep:
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort | uniq -c
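If you want a single total instead of per-ID counts, a small variation (a sketch; it prints 6 for the sample above):
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort -u | wc -l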
When you have to do some counting, you can easily do it with awk. Awk is an extremely versatile tool and I strongly recommend you to have a look at it. Maybe start with Awk one-liners explained.
Having that said, you can easily do some conditioned counting here:
What you want, is to count all unique lines which have series_id in it.
awk '/series_id/ && !($0 in a) { c++; a[$0] } END {print c}'
(Note the parentheses around $0 in a: in awk, ! binds tighter than in, so ! $0 in a would parse as (!$0) in a.)
This essentially states: if my line contains "series_id" and I have not yet stored the line in my array a, then I have not encountered this line before, so increase the counter c by 1. At the END of the program, print the count c.
Now you want to clean things up a bit. Your lines of interest essentially look like
"something":"something else"
So we are interested in something else which is in the 4th field if " is a field separator, and we are only interested in that if something is series_id located in field 2.
awk -F'"' '($2=="series_id") && (! $4 in a ) { c++; a[$4] } END {print c}'
Finally, you don't care about the last letter of the fourth field, so we need to make a small substitution:
awk -F'"' '($2=="series_id") { str=$4; gsub(/.$/,"",str); if (! str in a) {c++; a[str] } } END {print c}'
You could also rewrite this differently as:
awk -F'"' '($2 != "series_id" ) { next }
{ str=$4; gsub(/.$/,"",str) }
( str in a ) { next }
{ c++; a[str] }
END { print c }'
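Run against the sample input from the question (saved as ELEC.txt), this prints the expected count:
$ awk -F'"' '($2=="series_id") { str=$4; gsub(/.$/,"",str); if (!(str in a)) {c++; a[str]} } END {print c}' ELEC.txt
6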
My standard way to count unique values is to first produce the list of values (using grep and cut in your case), then add the following commands behind a pipe:
| sort -n | uniq -c
The sort sorts the values numerically, and uniq collapses the now-adjacent duplicates (the -c stands for "count", prefixing each entry with its number of occurrences).
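For instance, a sketch on made-up values; appending | wc -l would turn the per-value counts into the number of unique values (3 here):
$ printf '%s\n' 3 1 3 2 | sort -n | uniq -c
      1 1
      1 2
      2 3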
Do this: cat ELEC.txt | grep 'series_id' | cut -f1-4 -d. | sort -u | wc -l
-f1-4 keeps only the first four .-separated fields of each line, which drops the trailing .M/.Q/.A; sort -u then reduces the list to unique IDs (a plain uniq would only collapse adjacent duplicates) and wc -l counts them.
Here is a possible solution using awk:
awk 'BEGIN{FS="[:.\"]+"} /^"series_id":/{print $6}' \
ELEC.txt |sort -n |uniq -c
The output for the sample you posted will be something like this:
1 56841-WND-WT
2 56855-ALL-ALL
1 56855-WND-ALL
2 56868-LFG-ALL
If you need the entire string, you can print the other fields as well:
awk 'BEGIN{FS="[:.\"]+"; OFS="."} /^"series_id":/{print $3,$4,$5,$6}' \
ELEC.txt |sort -n | uniq -c
And the output will be something like this:
1 ELEC.PLANT.CONS_EG_BTU.56841-WND-WT
2 ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL
1 ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL
2 ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL

sort: write failed: standard output: broken pipe - Linux

I am using following commands in my script:
max_length=`awk '{print length}' $File_Path_Name/$filnm | sort -nr | head -1`;
min_length=`awk '{print length}' $File_Path_Name/$filnm | sort -nr | tail -1`;
where the filnm variable contains the name of the file and File_Path_Name contains the directory path.
While executing this from the script I am getting the error
sort: write failed: standard output: Broken pipe
Any suggestions as to what I am doing wrong?
You don't need to scan the file twice to get the max/min. Try (bash, using process substitution):
$ read max min < <(awk '{print length}' file | sort -nr | sed -n '1p;$p' | paste -s)
or you can avoid sorting as well by calculating max/min within awk
$ awk '{len=length}
       NR==1   {max=min=len}
       max<len {max=len}
       min>len {min=len}
       END     {print max, min}' file
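A quick sanity check of the single-pass version, with made-up lines of length 1, 3, and 2:
$ printf 'a\nabc\nab\n' > file
$ awk '{len=length} NR==1{max=min=len} max<len{max=len} min>len{min=len} END{print max, min}' file
3 1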

Linux usernames /etc/passwd listing

I want to print the longest and shortest username found in /etc/passwd. If I run the code below, it works fine for the shortest (head -1), but the second pipeline (sort -n | tail -1 | awk '{print $2}') never runs. Can anyone help me figure out what's wrong?
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |head -1 | awk '{print $2}'
sort -n |tail -1 | awk '{print $2}'
Here the issue is:
The pipeline ends with the first sort -n | head -1 | awk '{print $2}' command. So the first command receives its input through the pipe and produces output.
The second command receives no input from the pipe. So it waits for input on STDIN, i.e., the keyboard; you could feed it input there and press Ctrl+D to obtain output.
Please run the code like below to get desired output:
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |head -1 | awk '{print $2}'
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n | tail -1 | awk '{print $2}'
All you need is:
$ awk -F: '
NR==1 { min=max=$1 }
length($1) > length(max) { max=$1 }
length($1) < length(min) { min=$1 }
END { print min ORS max }
' /etc/passwd
No explicit loops or pipelines or multiple commands required.
The problem is that you have two pipelines when you really need one. You have grep | while read do ... done | sort | head | awk and then sort | tail | awk: the first sort has an input (the while loop); the second sort doesn't, or rather it does, but that input is STDIN, so the script hangs waiting for the keyboard.
There's various ways to resolve:
save the output of the while loop to a temporary file and use that as an input to both sort commands
repeat your while loop
use awk to do both the head and tail
The first two involve iterating over the password file twice, which may be okay, depending on what you're ultimately trying to do (a sketch of the temp-file approach is below). But a small awk script can give you both the shortest and longest name in a single pass, tracking them as it reads and printing both in the END block.
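As a sketch of the first option (the temp-file path comes from mktemp; otherwise this reuses the loop from the question):
TMP=$(mktemp)
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n > "$TMP"
head -1 "$TMP" | awk '{print $2}'    # shortest
tail -1 "$TMP" | awk '{print $2}'    # longest
rm -f "$TMP"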
While you already have good answers, you can also use POSIX shell to accomplish your goal without any pipe at all, using the parameter expansion and string length provided by the shell itself (see: the POSIX shell specification). For example you could do the following:
#!/bin/sh
sl=32;ll=0;sn=;ln=; ## short len, long len, short name, long name
while read -r line; do ## read each line
u=${line%%:*} ## get user
len=${#u} ## get length
[ "$len" -lt "$sl" ] && { sl="$len"; sn="$u"; } ## if shorter, save len, name
[ "$len" -gt "$ll" ] && { ll="$len"; ln="$u"; } ## if longer, save len, name
done </etc/passwd
printf "shortest (%2d): %s\nlongest (%2d): %s\n" $sl "$sn" $ll "$ln"
Example Use/Output
$ sh cketcpw.sh
shortest ( 2): at
longest (17): systemd-bus-proxy
Using either pipe/head/tail/awk or the shell itself is fine. It's good to have alternatives.
(note: if you have multiple users of the same length, this just picks the first; you could use a temp file if you want to save all names, and -le and -ge for the comparisons.)
If you want both the head and the tail from the same input, you may want something like sed -e 1b -e '$!d' after you sort the data: the 1b prints the first line and branches past the rest of the script for it, while $!d deletes every line except the last, giving you the top and bottom lines.
So your script would be:
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n | sed -e 1b -e '$!d'
Alternatively, a shorter way:
cut -d":" -f1 /etc/passwd | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- | sed -e 1b -e '$!d'

About awk shell and pipe in linux

Everyone, I am dealing with a log file which has about 5 million lines, so I am using awk on Linux.
I have to extract the domains and get the top 100 in the log, so I wrote this:
awk '{print $19}' $1 |
awk '{ split($0, string, "/");print string[1]}' |
awk '{domains[$0]++} END{for(j in domains) print domains[j], j}' |
sort -n | tail -n 100 > $2
It runs in about 13 seconds.
Then I changed the script to this:
awk '{split($19, string, "/"); domains[string[1]]++}
END{for(j in domains) print domains[j], j}' $1 |
sort -n | tail -n 100 > $2
It runs in about 21 seconds.
Why?
I thought a single awk program would reduce the total computation, since it reads each line only once, but the time increased...
If you know the answer, please tell me.
When you pipe commands, they run in parallel for as long as the pipe is full.
So my guess is that in the first version the work is distributed among your CPU cores, while in the second one all the work is done by a single core.
You can verify this with top (or, better, htop).
Out of curiosity, is this faster? (untested):
cut -f 19 -d' ' $1 | cut -f1 -d'/' | sort | uniq -c | sort -nr | head -n 100 > $2
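As an aside (not from the original thread): on large logs, forcing byte-wise collation often speeds up the sort stages considerably, since locale-aware comparison is much slower:
cut -f 19 -d' ' $1 | cut -f1 -d'/' | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr | head -n 100 > $2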
