linux group by date column and show count [duplicate]

This question already has answers here:
Best way to simulate "group by" from bash? (17 answers)
Closed 6 years ago.
I have a CSV file that looks like this:
aaa bbb ccc, 2015-01-01
fff ggg ddd, 2015-01-01
ggg hhh sss, 2015-01-02
ddd fff aaa, 2015-01-03
sss kkk www, 2015-01-03
I want to group by the second field (the date). I tried:
cat myfile.csv | sort -t, -k2 | uniq -c
but it printed a 1 next to every line, which is wrong.
I want this:
2015-01-01 2
2015-01-02 1
2015-01-03 2

This assumes, as in your example, that the dates are in order:
$ awk -F, 'NR>1 && d!=$2 {print d,c;c=0} {c++; d=$2;} END{print d,c;}' myfile.csv
2015-01-01 2
2015-01-02 1
2015-01-03 2
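If the dates are not guaranteed to be in order, a variant that counts each date into an associative array works regardless of ordering. A minimal sketch, assuming the comma may be followed by a space:
$ awk -F', *' '{count[$2]++} END{for (d in count) print d, count[d]}' myfile.csv | sort
2015-01-01 2
2015-01-02 1
2015-01-03 2
The trailing sort is only there because awk's for (d in count) loop does not guarantee output order.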

Related

How to give numbers to rows that reset every time the value of another column changes?

This is probably a simple question, but I want to create row numbers based on the values in a column.
It would look like this:
1 | AAA
2 | AAA
3 | AAA
1 | BBB
2 | BBB
1 | CCC
2 | CCC
3 | CCC
4 | CCC
1 | DDD
2 | DDD
I don't really know how to word my question since my first language isn't English, but what function or steps would achieve that?
The answer I found, from @BigBen, is this:
=IF(A2=A1,B1+1,1)
https://superuser.com/questions/631644/count-the-number-of-sequential-duplicates-excel
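For readers who want the same reset-on-change counter on the command line rather than in a spreadsheet, here is a sketch of the identical logic in awk, assuming the values sit in the first column of a plain text file called values.txt (a hypothetical name):
awk '{ c = ($1 == prev) ? c + 1 : 1; prev = $1; print c, "|", $1 }' values.txt
It keeps the previous value in prev and resets the counter c to 1 whenever the value changes, mirroring the IF formula above.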

How to sort and ignore spaces?

I'm trying to sort a file but I can't get the results I want.
I have this file:
742550111 aaa aaa aaa aaa aaa 2008 3 1 1
5816470687 aa a dissertation for the 933 2 2 2
Each field is separated by a tab, and I would like to sort on the second column.
When I try sort test.txt -t\t -k 2, the output is the same as in the file.
But the output I want to have is :
5816470687 aa a dissertation for the 933 2 2 2
742550111 aaa aaa aaa aaa aaa 2008 3 1 1
I think that's because sort ignores the spaces between the words.
So I tried this command: LC_ALL=C sort test.txt -t\t -k 2, but it still doesn't work.
Do you have any ideas?
Bash replaces $'\t' with a real tab:
LC_ALL=C sort file -t $'\t' -k 2
Output:
5816470687 aa a dissertation for the 933 2 2 2
742550111 aaa aaa aaa aaa aaa 2008 3 1 1
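A small refinement that is often useful with sort: restrict the key to the second field only with -k2,2, so that later fields can never influence the comparison. A variation on the same command:
LC_ALL=C sort -t $'\t' -k2,2 test.txt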

How to compare two columns in the same file and store the difference in a new file with the unchanged column?

Row Actual Expected
1 AAA BBB
2 CCC CCC
3 DDD EEE
4 FFF GGG
5 HHH HHH
I want to compare Actual and Expected and store the differences in a file, like:
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
I have used awk -F, '{if ($2!=$3) {print $1,$2,$3}}' Sample.csv, but it only compares integer values, not string values.
You can use awk to do this:
awk '{if($2!=$3) print $0}' oldfile > newfile
where
$2 and $3 are the second and third columns
!= means the second and third columns do not match
$0 means the whole line
> newfile redirects the output to a new file
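If the file keeps the Row Actual Expected header line and you also want it in the output, as in the desired result above, a small variation works (a sketch, assuming whitespace-separated columns):
awk 'NR==1 || $2 != $3' Sample.csv > newfile
NR==1 passes the header through; every other line is printed only when the second and third columns differ.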
I prefer an awk solution (it can handle more fields and is easier to understand), but you could use
sed -r '/\t([^ ]*)\t\1$/d' Sample.csv
Assuming the file uses a tab or some other delimiter to separate the columns, tsv-filter from eBay's TSV Utilities supports this type of field comparison directly. For the file above:
$ tsv-filter --header --ff-str-ne 2:3 file.tsv
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
The --ff-str-ne option compares two fields in a row for non-equal strings.
Disclaimer: I'm the author.

printf format specifiers in awk do not work for multiple parameters

I'm trying to write a Bash script named example7 which accepts as parameters a file name (let's call it File 1) and a list of numbers (below we'll call it List 1). The program needs to print the columns of File 1 aligned to the right or left according to the numbers in List 1. (This is obtainable by using awk's printf command.)
Example
Suppose the contents of an F1 file are:
A abcd ddd eee zz tt
ab gggwe 12 88 iii jjj
yaara yyzz 12abcd xyz x y z
After running the program by command:
example7 F1 -8 -7 6 4
Output:
A abcd ddd eee
ab gggwe 12 88
yaara yyzz 12abcd xyz
In the example above, between A and abcd there are 7 spaces, between abcd and ddd there are 6 spaces, and between ddd and eee there is one space.
Another example:
After running the program by command:
example7 F1 -8 -7 6 4 5
Output:
A abcd ddd eee zz
ab gggwe 12 88 iii
yaara yyzz 12abcd xyz x
In the example above, between A and abcd there are 7 spaces, between abcd and ddd there are 6 spaces, between ddd and eee there is one space, between eee and zz there are 3 spaces, between 88 and iii there are two spaces, and between xyz and x there are 4 spaces.
I've tried doing something like this:
file=$1
shift
awk '{printf "%'$1's\n" ,$1}' $file
but it only works for one number and one parameter, and I don't know how to do it for multiple columns and multiple parameters.
Any help will be appreciated.
Set an awk variable to all the remaining parameters, then split it and loop over them.
file=$1
shift
awk -v sizes="$*" '{words = split(sizes, s); for(i = 1; i <= words; i++) printf("%" s[i] "s", $i); print ""; }' "$file"
It's generally wrong to try to substitute a shell variable directly into an awk script. You should prefer to set an awk variable using -v, and then use awk's own string concatenation operation, as I did with s[i].
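Put together as a complete script, here is a sketch of example7 under the same assumptions (bash, with the widths passed exactly as in the examples above):
#!/bin/bash
# example7: print columns of a file, each formatted with the printf width
# given as the corresponding numeric argument (negative width = left-justified)
file=$1
shift
awk -v sizes="$*" '{
    n = split(sizes, s)               # one width per column to print
    for (i = 1; i <= n; i++)
        printf("%" s[i] "s", $i)      # builds e.g. "%-8s" or "%6s"
    print ""                          # terminate the output line
}' "$file"
Called as example7 F1 -8 -7 6 4, it should reproduce the first example's output.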

compare columns from different files and print those that DO NOT match

I have two files, file1 and file2. I want to compare several columns, $1, $2, $3 and $4 of file1, with several columns $1, $2, $3 and $4 of file2, and print those rows of file2 that do not match any row in file1.
E.g.
file1
aaa bbb ccc 1 2 3
aaa ccc eee 4 5 6
fff sss sss 7 8 9
file2
aaa bbb ccc 1 f a
mmm nnn ooo 1 d e
aaa ccc eee 4 a b
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
fff sss sss 7 5 6
I want to have as output:
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
I have seen questions asked here about finding those that do match and printing them, but not vice versa, those that DO NOT match.
Thank you!
Use the following script:
awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
k is the concatenated value of columns 1, 2, 3 and 4, delimited by FS, and will be used as a key in a search array a later. NR==FNR is true while reading file1, so I'm creating the array a, indexed by k, while reading file1.
For the remaining lines of input I check with !(k in a) whether the index does not exist in a. If that evaluates to true, awk prints the line.
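For example, run against the two files above, it should print exactly the rows requested:
$ awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n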
Here is another approach, if the files are sorted and you know the character set used:
$ function f(){ sed 's/ /~/g;s/~/ /4g' $1; }; join -v2 <(f file1) <(f file2) |
sed 's/~/ /g'
mmm nnn ooo 1 d e
aaa ccc eee 4 a b
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
fff sss sss 7 5 6
Create a key field by concatenating the first four fields (with a ~ character, but any unused character will do), use join to find the unmatched entries from file2, and then split the synthetic key field back apart.
However, the best way is to use the awk solution with a slight fix:
$ awk 'NR==FNR{a[$1,$2,$3,$4]; next} !(($1,$2,$3,$4) in a)' file1 file2
No doubt the awk solution from @hek2mgl is better than this one, but for information this is also possible using uniq, sort, and rev:
rev file1 file2 | sort -k3 | uniq -u -f2 | rev
rev reverses each line of both files, right to left.
sort -k3 sorts the lines, skipping the first 2 fields.
uniq -u -f2 prints only the lines that are unique (skipping the first 2 fields when comparing).
Finally, rev turns the lines back around.
Note that this solution sorts the lines of both files, which may or may not be desired.
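For reference, run against the two sample files above, the pipeline should reproduce the requested rows (the keys unique to file2):
$ rev file1 file2 | sort -k3 | uniq -u -f2 | rev
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n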
