Linux: sorting file names based on different columns

Can you please help me with sorting these file names by a couple of conditions?
ls -tr | grep ${DATE}* | sort -k1
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CTL.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CDS.sh
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CTL.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_05drop_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CTL.sql
I want the output to be sorted by database, then schema, and then file number.
A_IS_CRB is the database and CTL, CDS are schemas. (There can also be different database names.)
I want to process all 7 files for one database and one schema, then proceed with the next 7 files for the same database and a different schema, or for a different database with some schema.
I tried a couple of things:
ls -tr | grep ${DATE}* | sort -k1
ls -tr | grep ${DATE}* | sort -t $'_' -k4 -k5 -k2,2
ls -tr | grep ${DATE}* | awk -F'[0-9]_' '{print $NF}' | awk -F_ '{print $NF}' | sed 's/.sql//' | sed 's/.sh//' | sed 's/\_$//' | uniq   (to grep the schema)
No luck, any help much appreciated. The desired output is:
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CDS.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CTL.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CTL.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CTL.sql

I isolated the file names into two arrays, then displayed them at the end of processing:
awk '$0 ~ /_CDS/{cds[$0]} $0 ~ /_CTL/{ ctl[$0]} END{for(i in cds){print i} for(ii in ctl){print ii}}' your_file
Tell me if this solution is right for you.
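If the names come straight from ls rather than a file, the same awk could sit at the end of your existing pipeline; a rough sketch (assuming ${DATE} holds the timestamp you filter on, and noting that for (i in ...) does not guarantee the 01-07 order within each array, so a sort may still be wanted):
ls -tr | grep "${DATE}" | awk '$0 ~ /_CDS/{cds[$0]} $0 ~ /_CTL/{ctl[$0]} END{for(i in cds){print i} for(ii in ctl){print ii}}'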

Look at this:
sort -t '-' -k2 list
It may be a good solution for you. Tell me if it's what you want.

If I understand correctly, you want to sort by db-schema so you can process files 1-7 with the CDS schema and then files 1-7 with the CTL schema. You can do that by using awk's split to isolate the db-schema, outputting the entire record followed by the db-schema to allow sorting, and then using awk again to drop the second (db-schema) sort column, e.g.
awk -F'-' '{split($2,a,"_"); print $0" "substr(a[5],1,3)}' listing |
sort -k2 |
awk '{print $1}'
Example Output
With your input in the listing file, you would receive:
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CDS.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CDS.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CDS.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_IS_CRB_CTL.sql
dboption_02beforetablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_03create_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_04Export_DDL_AFTER_CHANGE_20200710-092914_A_IS_CRB_CTL.sh
dboption_05drop_table_20200710-092914_A_IS_CRB_CTL.sql
dboption_06aftertablesize_20200710-092914_A_IS_CRB_CTL.sql
dboption_07afterschemasize_20200710-092914_A_IS_CRB_CTL.sql
Let me know if you need changes to this output.
Edit - Update per format change: delimiting db-name and db-schema with '-'
In the comments you advised that the db-name and db-schema were not fixed, with the db-name possibly having two '_' delimiters and the db-schema none. That made it ambiguous what constituted the db-name and what was the db-schema: there is no way to know whether you have a 3-part (two '_') name and a 2-part (one '_') schema, or a 4-part name (three '_') and a 1-part schema (no '_'), or any of the other combinations between 3-5 part names and 1-3 part schemas.
Adding '-' as the delimiter between the db-name and db-schema now provides an unambiguous way to isolate the db-schema from the filename, regardless of the number of '_'-separated parts in the db-name and db-schema. You can use '-' as the delimiter for awk, and then $NF becomes the last field (db-schema plus extension). Then, using substr($NF, 1, match($NF, /[.]/) - 1), you can isolate the db-schema alone.
awk -F'-' '{ print $0" "substr($NF,1,match($NF,/[.]/)-1) }' listing |
sort -k2 |
awk '{print $1}'
Short Example Input
$ cat listing
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
Example Use/Output
$ awk -F'-' '{ print $0" "substr($NF,1,match($NF,/[.]/)-1) }' listing |
> sort -k2 |
> awk '{print $1}'
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CDS.sql
dboption_01beforeschemasize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_02beforetablesize_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
dboption_03create_table_20200710-092914_A_FOO_IS_CRB-PDO_CTS.sql
If you want to keep the extension as part of the db-schema (I see you have both .sql and .sh extensions), then simply use the following as the first awk command:
awk -F'-' '{ print $0" "$NF }' listing
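Put together, that keep-the-extension variant of the full pipeline would be (a sketch following the same sort-and-strip steps as above; note that with the extension kept, the .sh and .sql keys differ, which can change the ordering within a schema group):
awk -F'-' '{ print $0" "$NF }' listing |
sort -k2 |
awk '{print $1}'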
Give it a go with your updated names and let me know if there are any hiccups.
Additional Sort By DBSchema, then DBName, then FileNo.
In order to sort by all the additional parameters you list, you will need to change the primary field delimiter to something that lets each of the fields be separated and the information needed for sorting be extracted from them. A good choice would simply be to use '-' to separate the fields, as:
Option fileno_stuff date time dbname dbschema
That would correspond to an example record of, e.g.:
dboption-03create_table-20200710-092914-FOO_PDA-BAR_CDS.sql
If you make those changes to your listing, then you can append three columns to it (e.g. fileno, dbname, dbschema), allowing you to sort -k4 -k3 -k2n. To append the fields and sort the new data, you could do:
awk -F'-' '{print $0" "substr($2,1,match($2,/[^0-9]+/)-1)+0" "$(NF-1)" "substr($NF,1,match($NF,/[.]/)-1)}' listing |
sort -k4 -k3 -k2n |
awk '{print $1}'
Example Input Listing
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDS.sql
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDT.sql
dboption-01beforeschemasize-20200710-092914_PDA-CDS.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDS.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDT.sql
dboption-02beforetablesize-20200710-092914_PDA-CDS.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDS.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDT.sql
dboption-03create_table-20200710-092914_PDA-CDS.sql
Sorted Output
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDS.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDS.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDS.sql
dboption-01beforeschemasize-20200710-092914_PDA-CDS.sql
dboption-02beforetablesize-20200710-092914_PDA-CDS.sql
dboption-03create_table-20200710-092914_PDA-CDS.sql
dboption-01beforeschemasize-20200710-092914_A_IS_CRB-CDT.sql
dboption-02beforetablesize-20200710-092914_A_IS_CRB-CDT.sql
dboption-03create_table-20200710-092914_A_IS_CRB-CDT.sql
When you have a reformatted listing, give it a try and let me know of any issues.

Find number of unique values in a column

I would like to know the count of unique values in a column using Linux commands. The column has values like the ones below (the data is edited from previous examples). I need to ignore the .M, .Q and .A at the end and just count the unique number of plants.
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.M"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56841-WND-WT.Q"
"series_id":"ELEC.CONS_TOT.COW-GA-2.M"
"series_id":"ELEC.CONS_TOT.COW-GA-94.M"
I've tried this code, but I'm not able to avoid those suffixes:
cat ELEC.txt | grep 'series_id' | cut -d, -f1 | wc -l
For the above sample, the expected count should be 6, but I get 8.
This should do the job:
grep -Po "ELEC.PLANT.*" FILE | cut -d. -f -4 | sort | uniq -c
You first grep for the "ELEC.PLANT." part,
then remove the .Q, .A, .M suffixes with cut -d. -f -4,
then remove duplicates and count using sort | uniq -c.
EDIT:
for the new data, it should only be necessary to do the following:
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort | uniq -c
When you have to do some counting, you can easily do it with awk. Awk is an extremely versatile tool and I strongly recommend you have a look at it. Maybe start with Awk one-liners explained.
Having that said, you can easily do some conditioned counting here:
What you want, is to count all unique lines which have series_id in it.
awk '/series_id/ && !($0 in a) { c++; a[$0] } END {print c}'
This essentially states: if my line contains "series_id" and I have not stored the line in my array a yet, it means I have not encountered the line before, so increase the counter c by 1. At the END of the program, print the count c.
Now you want to clean things up a bit. Your lines of interest essentially look like
"something":"something else"
So we are interested in something else, which is in the 4th field if " is the field separator, and we are only interested in it if something is series_id, located in field 2.
awk -F'"' '($2=="series_id") && (! $4 in a ) { c++; a[$4] } END {print c}'
Finally, you don't care about the last letter of the fourth field, so we need to make a small substitution:
awk -F'"' '($2=="series_id") { str=$4; gsub(/.$/,"",str); if (! str in a) {c++; a[str] } } END {print c}'
You could also rewrite this differently as:
awk -F'"' '($2 != "series_id" ) { next }
{ str=$4; gsub(/.$/,"",str) }
( str in a ) { next }
{ c++; a[str] }
END { print c }'
My standard way to count unique values is to make sure I have the list of values (using grep and cut in your case), and then add the following commands behind a pipe:
| sort -n | uniq -c
The sort does the sorting (numerically, with -n), while uniq gets the unique entries (the -c stands for "count").
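Combined with the grep and cut pieces from the question, that might look like the sketch below (append | wc -l if you only want the total number of unique values):
grep 'series_id' ELEC.txt | cut -d. -f1-4 | sort -n | uniq -c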
Do this: cat ELEC.txt | grep 'series_id' | cut -f1-4 -d. | uniq | wc -l
-f1-4 keeps only the first four '.'-separated fields, dropping the fourth '.' and everything after it from each line. (If identical entries are not adjacent, add a sort before uniq.)
Here is a possible solution using awk:
awk 'BEGIN{FS="[:.\"]+"} /^"series_id":/{print $6}' \
ELEC.txt |sort -n |uniq -c
The output for the sample you posted will be something like this:
1 56841-WND-WT
2 56855-ALL-ALL
1 56855-WND-ALL
2 56868-LFG-ALL
If you need the entire string, you can print the other fields as well:
awk 'BEGIN{FS="[:.\"]+"; OFS="."} /^"series_id":/{print $3,$4,$5,$6}' \
ELEC.txt |sort -n | uniq -c
And the output will be something like this:
1 ELEC.PLANT.CONS_EG_BTU.56841-WND-WT
2 ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL
1 ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL
2 ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL

How can I show only some words in a line using sed?

I'm trying to use sed to show only the 1st, 2nd, and 8th word in a line.
The problem I have is that the words are random, and the number of spaces between the words is also random... For example:
QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI
Is there a way to get this to output as just the 1st, 2nd, and 8th words:
QST334 FFR67 QD112
Thanks for any advice or hints for the right direction!
Use awk
awk '{print $1,$2,$8}' file
In action:
$ echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" | awk '{print $1,$2,$8}'
QST334 FFR67 QD112
You do not really need to put " " between the two columns as mentioned in another answer. By default, awk uses a single space as the output field separator (OFS), so you just need commas between the desired columns.
So the following is enough:
awk '{print $1,$2,$8}' file
For Example:
echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" |awk '{print $1,$2,$8}'
QST334 FFR67 QD112
However, if you wish to use some other OFS, you can do as follows:
echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" |awk -v OFS="," '{print $1,$2,$8}'
QST334,FFR67,QD112
Note that this will put a comma between the output columns.
Another solution is to use the cut command:
cut --delimiter '<delimiter-character>' --fields <field> <file>
Where:
'<delimiter-character>': the delimiter on which the string should be parsed.
<field>: specifies which column(s) to output; it can be a single column (1), multiple columns (1,3), or a range of them (1-3).
In action:
cut -d ' ' -f 1-3 /path/to/file
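Applied to the sample line from the question, and first squeezing the random runs of spaces with tr -s (since cut treats every single space as a separator), a sketch would be:
echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" | tr -s ' ' | cut -d ' ' -f 1,2,8
QST334 FFR67 QD112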
This might work for you (GNU sed):
sed 's/\s\+/\n/g;s/.*/echo "&"|sed -n "1p;2p;8p"/e;y/\n/ /' file
Convert spaces to newlines. Evaluate each line as a separate file and print only the required lines i.e. fields. Replace remaining newlines with spaces.

Checking 2 files in Unix and finding the sum of a particular column in 3rd file through shell script

I have something I need help with and would appreciate your assistance.
Let's take an example
I have file 1 with data
"eno", "ename", "salary"
"1","john","50000"
"2","steve","30000"
"3","aku","20000"
and I have file 2 with data
"eno", "ename", "incentives"
"1","john","2000"
"2","steve","5000"
"4","akshi","200"
And the expected output I want in a third file is:
"eno", "ename", "t_salary"
"1","john","52000"
"2","steve","35000"
This is the expected result, as eno and ename should be used as the primary key, and the output should be shown like this.
If your files are sorted and the first field is the key, you can join the files and work on the combined fields.
That is,
$ join -t, file1 file2
"eno", "ename", "salary", "ename", "incentives"
"1","john","50000","john","2000"
"2","steve","30000","steve","5000"
and your awk can be
... | awk -F, -v OFS=, 'NR==1{print ...}
NR>1{gsub(/"/,"",$3);
gsub(/"/,"",$5);
print $1,$2,$3+$5}'
Printing the header and quoting the total field are left as an exercise.
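In case it helps, one hedged way that exercise could be filled in (the header fields are written by hand here, and the files are assumed sorted on the first field as noted above):
join -t, file1 file2 |
awk -F, -v OFS=, 'NR==1 { print "\"eno\"","\"ename\"","\"t_salary\""; next }
                        { gsub(/"/,"",$3); gsub(/"/,"",$5); print $1, $2, "\"" ($3+$5) "\"" }'
This should reproduce the three-line expected output shown in the question.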
$ cat tst.awk
BEGIN { FS="\"[[:space:]]*,[[:space:]]*\""; OFS="\",\"" }
{ key = $1 FS $2 }
NR==FNR { sal[key] = $NF; next }
key in sal { $3 = (FNR>1 ? $3+sal[key] : "t_salary") "\""; print }
$ awk -f tst.awk file1 file2
"eno","ename","t_salary"
"1","john","52000"
"2","steve","35000"
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Abbreviating the input files to f1 & f2, and breaking out the swiss-army-knife utils (plus a bashism):
head -n 1 f1 | sed 's/sal/t_&/' ; \
grep -h -f <(tail -qn +2 f1 f2 | tr ',' '\t' | sort -k1,2 | \
rev | uniq -d -f1 | rev | \
cut -f 2) \
f1 f2 | \
tr -s ',"' '\t' | datamash -s -g2,3 sum 4 | sed 's/[^\t]*/"&"/g;s/\t/,/g'
Output:
"eno", "ename", "t_salary"
"1","john","52000"
"2","steve","35000"
The main job is fairly simple:
grep searches for only those lines with duplicate (and therefore addable) fields #1 & #2, and this is piped to...
datamash, which does the adding.
The rest of the code is reformatting needed to please the various text utils, which all seem to have ugly but minor format inconsistencies.
Those revs are only needed because uniq lacks most of sort's field functions.
The trs are because uniq also lacks a field separator switch, and datamash can't sum quoted numbers. The sed at the end is to undo all that tr-ing.
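To see what the datamash step alone is doing, here is a tiny hedged illustration on plain tab-separated, unquoted input (datamash can't sum quoted numbers, hence the stripping above); the output is tab-separated:
printf '1\tjohn\t50000\n1\tjohn\t2000\n2\tsteve\t30000\n2\tsteve\t5000\n' | datamash -s -g1,2 sum 3
1  john  52000
2  steve  35000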

UNIX Shell Script remove one column from the file

I have a file like the following:
Header1:value1|value2|value3|
Header2:value4|value5|value6|
The column number is unknown, and I have a function which can return the column number.
I want to write a script which can remove one column from the file. For example, after removing column 1, I will get:
Header1:value2|value3|
Header2:value5|value6|
I use cut to achieve this, and so far I can produce the values after removing one column, but without the headers. For example:
value2|value3|
value5|value6|
Could anyone tell me how I can add the headers back? Or is there a command that can do that directly? Thanks.
Replace the colon with a pipe, do your cut command, then replace the first pipe with a colon again:
sed 's/:/|/' input.txt | cut ... | sed 's/|/:/'
You may need to adjust the column number for the cut command, to ensure you don't count the header.
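As an illustration, if the first value column is the one to drop (once the ':' becomes a '|', the header is field 1 and the column to remove is field 2), a sketch would be:
sed 's/:/|/' input.txt | cut -d'|' -f1,3- | sed 's/|/:/'
Header1:value2|value3|
Header2:value5|value6|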
Turn the ':' into '|', so that the header is another field, rather than part of the first field. You can do that either in whatever generates the data to begin with, or by passing the data through tr ':' '|' before cut. The rest of your fields will be offset by +1 then, but that should be easy enough to compensate for.
Your problem is that HeaderX is followed by ':', which is not the '|' delimiter you use in cut.
You could first separate your lines into two parts on ':', with something like
"cut -f 1 --delimiter=: YOURFILE", then remove the first column and put the headers back.
awk can handle multiple delimiters. So another alternative is...
jkern@ubuntu:~/scratch$ cat ./data188
Header1:value1|value2|value3|
Header2:value4|value5|value6|
jkern@ubuntu:~/scratch$ awk -F"[:|]" '{ print $1 $3 $4 }' ./data188
Header1value2value3
Header2value5value6
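If you want to keep the ':' and '|' separators in the output (as in the desired result), a hedged variation of the same idea is to print them back explicitly:
awk -F"[:|]" '{ print $1 ":" $3 "|" $4 "|" }' ./data188
Header1:value2|value3|
Header2:value5|value6|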
You can do it with just sed, without cut:
sed 's/:[^|]*|/:/' input.txt
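Run on the sample input, that yields the desired result:
$ sed 's/:[^|]*|/:/' input.txt
Header1:value2|value3|
Header2:value5|value6|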
My solution:
$ sed 's,:,|,' data | awk -F'|' 'BEGIN{OFS="|"}{$2=""; print}' | sed 's,||,:,'
Header1:value2|value3|
Header2:value5|value6|
replace ':' with '|'
-F'|' tells awk to use the '|' symbol as the field separator
in each line, replace the 2nd field (because the header now becomes the first field) with an empty string and print the resulting line with the new field separator ('|')
put the header back by replacing the first '|' with ':'
Not perfect, but it should work.
$ cat file.txt | grep 'Header1' | awk -F"1" '{ print $1 $2 $3 $4}'
This will print all values in separate columns. You can print any number of columns.
Just chiming in with a Perl solution:
(rearrange/remove fields as needed)
-l effectively adds a newline to every print statement
-a autosplit mode splits each line using the -F expression into array @F
-n adds a loop around the -e code
-e your 'one liner' follows this option
$ perl -F[:\|] -lane 'print "$F[0]:$F[1]|$F[2]|$F[3]"' input.txt

Linux sort/uniq Apache log

I have a small file (100 lines) of web requests (Apache standard format); there are multiple requests from clients. I want ONLY the list of requests (lines) in my file that come from a UNIQUE IP and are the latest entry.
So far I have:
/home/$: cat all.txt | awk '{ print $1}' | sort -u | "{print the whole line ??}"
The above gives me the IPs (about 30, which is right); now I need the rest of the line (the request) as well.
Use an associative array to keep track of which IPs you've found already:
awk '{
if (!found[$1]) {
print;
found[$1]=1;
}
}' all.txt
This will print the first line for each IP. If you want the last one, then:
awk '
{ found[$1] = $0 }
END {
for (ip in found)
print found[ip]
}
' all.txt
I hate that uniq doesn't come with the same options as sort, or that sort cannot do what it says. I reckon this should work[1],
tac access.log | sort -fb -k1V -u
but alas, it doesn't.
Therefore, it seems we're stuck doing something silly like
cat all.txt | awk '{ print $1}' | sort -u | while read ip
do
tac all.txt | grep "^$ip" -h | head -1
done
Which is really inefficient, but 'works' (I haven't tested it, so modulo typos).
[1] according to the man-page
The following should work:
tac access.log | sort -f -k1,1 -us
This takes the file in reverse order and does a stable sort using the first field, keeping only unique items.
