In AWK, how to split consecutive rows that have the same string as a "record"?

Let's say I have the text below.
aaaaaaa
aaaaaaa
bbb
bbb
bbb
ccccccccccccc
ddddd
ddddd
Is there a way to modify the text as follows?
1 aaaaaaa
1 aaaaaaa
2 bbb
2 bbb
2 bbb
3 ccccccccccccc
4 ddddd
4 ddddd

You could use something like this in awk:
$ awk '{print ($0!=p?++i:i),$0;p=$0}' file
1 aaaaaaa
1 aaaaaaa
2 bbb
2 bbb
2 bbb
3 ccccccccccccc
4 ddddd
4 ddddd
i is incremented whenever the current line differs from the previous line. p stores the current line ($0) at the end of each record, so on the next record it holds the previous line's value.
Alternatively, as suggested by JID:
awk '$0!=p{p=$0;i++}{print i,$0}' file
When the current line differs from p, replace p and increment i. See the comments for discussion of the pros and cons of either approach :)
A further (and even shorter!) contribution by NeronLeVelu:
$ awk '{print i+=($0!=p),p=$0}' file
This version performs the addition assignment and basic assignment within the print statement. This works because the return value of each assignment is the value that has been assigned.
As pointed out in the comments, if the first line of the file is empty, the behaviour changes slightly. Assuming that the first line should always begin with a 1, the following block can be added to the start of any of the one-liners:
NR==1{p=$0;i=1}
i.e. on the first line, initialise p to the contents of the line (empty or not) and i to 1. Thanks to Wintermute for this suggestion.
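Combining that initialisation block with the first one-liner, a quick sanity check might look like this (runs.txt is a hypothetical sample file):

```shell
# Hypothetical sample input: two runs of repeated lines
printf 'aaaaaaa\naaaaaaa\nbbb\n' > runs.txt

# Initialise p and i on the first line, then number each run of identical lines
awk 'NR==1{p=$0;i=1}{print ($0!=p?++i:i),$0;p=$0}' runs.txt
```

This prints 1 before both aaaaaaa lines and 2 before the bbb line, and behaves correctly even when the first line of the file is empty.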

Related

How to compare two columns in the same file and store the difference in a new file with the unchanged column?

Row Actual Expected
1 AAA BBB
2 CCC CCC
3 DDD EEE
4 FFF GGG
5 HHH HHH
I want to compare actual and expected and store the difference in a file. Like
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
I have used awk -F, '{if ($2!=$3) {print $1,$2,$3}}' Sample.csv, but it only seems to compare integer values, not string values.
You can use AWK to do this:
awk '{if($2!=$3) print $0}' oldfile > newfile
where
$2 and $3 are second and third columns
!= means the second and third columns do not match
$0 means whole line
> newfile redirects to new file
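As a quick sketch, assuming a hypothetical space-separated file (note that the header row is printed too, since the string Actual differs from Expected on that row):

```shell
# Hypothetical sample file; the header also passes the $2!=$3 test
printf 'Row Actual Expected\n1 AAA BBB\n2 CCC CCC\n' > sample.txt

awk '{if($2!=$3) print $0}' sample.txt
```

If you want to keep the header unconditionally, you could add NR==1 to the condition.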
I prefer an awk solution (it can handle more fields and is easier to understand), but you could also use
sed -r '/\t([^ ]*)\t\1$/d' Sample.csv
Assuming the file uses tab or some other delimiter to separate the columns, then tsv-filter from eBay's TSV Utilities supports this type of field comparison directly. For the file above:
$ tsv-filter --header --ff-str-ne 2:3 file.tsv
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
The --ff-str-ne option compares two fields in a row for non-equal strings.
Disclaimer: I'm the author.

linux command to delete the last column of csv

How can I write a linux command to delete the last column of tab-delimited csv?
Example input
aaa bbb ccc ddd
111 222 333 444
Expected output
aaa bbb ccc
111 222 333
It is easier to remove the first field than the last, so we reverse the content, remove the first field, and then reverse it again.
Here is an example for a "CSV"
rev file1 | cut -d "," -f 2- | rev
Replace the "file1" and the "," with your file name and the delimiter accordingly.
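For instance, with a hypothetical comma-separated file1:

```shell
# Hypothetical comma-separated input
printf 'aaa,bbb,ccc,ddd\n111,222,333,444\n' > file1

# Reverse each line, drop the (now) first field, reverse back
rev file1 | cut -d "," -f 2- | rev
```

The last column is removed regardless of how many fields each line has.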
You can use cut for this. You specify a delimiter with option -d and then give the field numbers (option -f) you want to have in the output. Each line of the input gets treated individually:
cut -d$'\t' -f 1-6 < my.csv > new.csv
This follows your description. Your example, however, looks more like you want to strip a column in the middle:
cut -d$'\t' -f 1-3,5-7 < my.csv > new.csv
The $'\t' is a bash notation for the string containing the single tab character.
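As a sketch with the four-column sample from the question, keeping fields 1-3 drops the last column:

```shell
# Hypothetical tab-delimited sample with four columns; keep the first three
printf 'aaa\tbbb\tccc\tddd\n111\t222\t333\t444\n' > my.csv

cut -d$'\t' -f 1-3 < my.csv
```

Note that this requires knowing the number of fields in advance; the rev/cut/rev approach above does not.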
You can use the command below, which deletes the last column of a tab-delimited CSV irrespective of the number of fields:
sed -r 's/(.*)\s+\S+$/\1/'
for example:
echo "aaa bbb ccc ddd 111 222 333 444" | sed -r 's/(.*)\s+\S+$/\1/'

printf format specifiers in awk do not work for multiple parameters

I'm trying to write a Bash script named example7 which accepts as parameters a file name (let's call it file 1) and a list of numbers (below we'll call it list 1). The program needs to print the columns from file 1 after aligning them to the right or left according to the numbers in list 1. (This is achievable using awk's printf command.)
Example
Suppose the contents of an F1 file are:
A abcd ddd eee zz tt
ab gggwe 12 88 iii jjj
yaara yyzz 12abcd xyz x y z
After running the program by command:
example7 F1 -8 -7 6 4
Output:
A abcd ddd eee
ab gggwe 12 88
yaara yyzz 12abcd xyz
In the example above, between A and abcd there are 7 spaces, between abcd and ddd there are 6 spaces, and between ddd and eee there is one space.
Another example:
After running the program by command:
example7 F1 -8 -7 6 4 5
Output:
A abcd ddd eee zz
ab gggwe 12 88 iii
yaara yyzz 12abcd xyz x
In the example above, between A and abcd there are 7 spaces, between abcd and ddd there are 6 spaces, between ddd and eee there is one space, between eee and zz there are 3 spaces, between 88 and iii there are two spaces, and between xyz and x there are 4 spaces.
I've tried doing something like this:
file=$1
shift
awk '{printf "%'$1's\n" ,$1}' $file
but it only works for one number and one parameter, and I don't know how to do it for multiple columns and multiple parameters. Any help will be appreciated.
Set an awk variable to all the remaining parameters, then split it and loop over them.
file=$1
shift
awk -v sizes="$*" '{words = split(sizes, s); for(i = 1; i <= words; i++) printf("%" s[i] "s", $i); print ""; }' "$file"
It's generally wrong to try to substitute a shell variable directly into an awk script. You should prefer to set an awk variable using -v, and then use awk's own string concatenation operation, as I did with s[i].
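A quick demonstration with a hypothetical F1 file and the width list from the question; passing "-8 -7 6 4" produces exactly the spacing described above (7 spaces after A, 6 between abcd and ddd, 1 before eee):

```shell
# Hypothetical input file from the question
printf 'A abcd ddd eee zz tt\n' > F1

# Each field i is printed with format "%<s[i]>s": negative widths
# left-justify, positive widths right-justify
awk -v sizes="-8 -7 6 4" '{words = split(sizes, s);
  for(i = 1; i <= words; i++) printf("%" s[i] "s", $i); print ""}' F1
```

Fields beyond the number of widths supplied are simply dropped, matching the expected output in the question.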

compare columns from different files and print those that DO NOT match

I have two files, file1 and file2. I want to compare several columns - $1,$2 ,$3 and $4 of file1 with several columns $1,$2, $3 and $4 of file2 and print those rows of file2 that do not match any row in file1.
E.g.
file1
aaa bbb ccc 1 2 3
aaa ccc eee 4 5 6
fff sss sss 7 8 9
file2
aaa bbb ccc 1 f a
mmm nnn ooo 1 d e
aaa ccc eee 4 a b
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
fff sss sss 7 5 6
I want to have as output:
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
I have seen questions asked here about finding those that do match and printing them, but not vice versa: those that DO NOT match.
Thank you!
Use the following script:
awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
k is the concatenated value of the columns 1, 2, 3 and 4, delimited by FS (see comments), and will be used as a key in a search array a later. NR==FNR is true while reading file1. I'm creating the array a indexed by k while reading file1.
For the remaining lines of input I check with !(k in a) whether the index does not exist in a. If that evaluates to true, awk will print that line.
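A quick run with sample files like those in the question (truncated here to a few rows for brevity):

```shell
# Hypothetical sample files
printf 'aaa bbb ccc 1 2 3\naaa ccc eee 4 5 6\n' > file1
printf 'aaa bbb ccc 1 f a\nmmm nnn ooo 1 d e\naaa ccc eee 4 a b\n' > file2

# Only rows of file2 whose first four fields match no row of file1 are printed
awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
```

Only the mmm row is printed, since its first four fields appear nowhere in file1.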
Here is another approach, if the files are sorted and you know the character set in use:
$ function f(){ sed 's/ /~/g;s/~/ /4g' $1; }; join -v2 <(f file1) <(f file2) |
sed 's/~/ /g'
mmm nnn ooo 1 d e
aaa ccc eee 4 a b
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
fff sss sss 7 5 6
Create a key field by concatenating the first four fields (with a ~ character, but any unused character can be used), use join to find the unmatched entries from file2, and then split the synthetic key field back apart.
However, the best way is to use the awk solution with a slight fix:
$ awk 'NR==FNR{a[$1,$2,$3,$4]; next} !(($1,$2,$3,$4) in a)' file1 file2
No doubt the awk solution from @hek2mgl is better than this one, but for information, this is also possible using uniq, sort, and rev:
rev file1 file2 | sort -k3 | uniq -u -f2 | rev
rev reverses each line of both files.
sort -k3 sorts the lines on the third field onward (skipping the first two fields, which hold the reversed last two columns).
uniq -u -f2 prints only the lines that are unique, skipping the first two fields while comparing.
Finally, rev reverses the lines back.
This solution sorts the lines of both files. That might be desired or not.
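A minimal sketch with hypothetical two-row files (another caveat: uniq -u keeps any line whose key is unique across the combined input, so rows that appear only in file1 would be printed as well):

```shell
# Hypothetical sample files
printf 'aaa bbb ccc 1 2 3\n' > file1
printf 'aaa bbb ccc 1 f a\nmmm nnn ooo 1 d e\n' > file2

# Reverse lines, sort on the reversed key columns, keep unique keys, reverse back
rev file1 file2 | sort -k3 | uniq -u -f2 | rev
```

The aaa rows share the same first four fields and are both dropped; only the unmatched mmm row survives.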

Add a line counter to lines matching a pattern

I need to prepend a line counter to lines matching specific patterns in a file, while still outputting the lines that do not match this pattern.
For example, if my file looks like this:
aaa 123
bbb 456
aaa 666
ccc 777
bbb 999
and the patterns I want to count are 'aaa' and 'ccc', I'd like to get the following output:
1:aaa 123
bbb 456
2:aaa 666
3:ccc 777
bbb 999
Preferably I'm looking for a Linux one-liner. Shell or tool doesn't matter, as long as it's installed by default in most distros.
With awk:
awk '{if ($1=="aaa" || $1=="ccc") {a++; $0=a":"$0}} {print}' file
1:aaa 123
bbb 456
2:aaa 666
3:ccc 777
bbb 999
Explanation
Loop through the lines, checking whether the first field is aaa or ccc. If so, increment the counter a and prepend it (followed by a colon) to the line ($0). Finally, print the line in all cases: if the pattern was matched, the line will have a at the beginning; otherwise it is just the original line.
Alternatively, the following approach uses Perl:
open FH, "<abc.txt";
$incremental_val = 1;
while (my $line = <FH>) {
    chomp($line);
    if ($line =~ m/^aaa / || $line =~ m/^ccc /) {
        print "$incremental_val : $line\n";
        $incremental_val++;
        next;
    }
    print "$line\n";
}
close FH;
The output will be as follows.
1 : aaa 123
bbb 456
2 : aaa 666
3 : ccc 777
bbb 999
