Comma Separated Cells to Rows Preserve/Aggregate Column - linux

There was a similar question here, but for Excel/VBA: Excel Macro - Comma Separated Cells to Rows Preserve/Aggregate Column.
Because I have a big file (>300 MB) that is not an option, so I am struggling to get it to work in bash.
Based on this data
1 Cat1 a,b,c
2 Cat2 d
3 Cat3 e
4 Cat4 f,g
I would like to convert it to:
1 Cat1 a
1 Cat1 b
1 Cat1 c
2 Cat2 d
3 Cat3 e
4 Cat4 f
4 Cat4 g

cat > data << EOF
1 Cat1 a,b,c
2 Cat2 d
3 Cat3 e
4 Cat4 f,g
EOF
set -f                                  # turn off globbing so the unquoted $C3 below is not expanded
IFS=,                                   # prepare for comma-separated data
while IFS=$' \t' read -r C1 C2 C3; do   # split columns at whitespace (spaces or tabs)
    for X in $C3; do                    # split C3 at commas (due to IFS)
        printf '%s\t%s\t%s\n' "$C1" "$C2" "$X"
    done
done < data

This looks like a job for awk or perl.
awk 'BEGIN { FS = OFS = "\t" }
     { n = split($3, a, ",")
       for (i = 1; i <= n; i++) { $3 = a[i]; print } }'
perl -F'\t' -alne 'foreach (split ",", $F[2]) {
    $F[2] = $_; print join("\t", @F)
}'
Both programs are based on the same algorithm: split the third column at commas, and iterate over the components, printing the original line with each component in the third column in turn.
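For instance, assuming the data file created above is tab-separated (as both programs expect), the awk version can be run directly and prints:
$ awk 'BEGIN { FS = OFS = "\t" } { n = split($3, a, ","); for (i = 1; i <= n; i++) { $3 = a[i]; print } }' data
1	Cat1	a
1	Cat1	b
1	Cat1	c
2	Cat2	d
3	Cat3	e
4	Cat4	f
4	Cat4	g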

Related

Merge Two files of columns but insert columns of second file into columns of first file

Assume two files with the same number of columns.
file_A:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
and
file_B:
A B C D E
A B C D E
A B C D E
A B C D E
A B C D E
I want to merge the two files, interleaving their columns, like
file_C:
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
I have found a solution in the community like this:
paste file_A file_B | awk '{print $1,$6,$2,$7,$3,$8,$4,$9,$5,$10}'
But considering that the number of columns is around 100 for each file, and not necessarily constant, I want to know if there is a better method.
Thanks in advance.
You can use a loop in awk, for example
paste file_A file_B | awk '{
    half = NF/2;
    for (i = 1; i < half; i++)
    {
        printf("%s %s ", $i, $(i+half));
    }
    printf("%s %s\n", $half, $NF);
}'
or
paste file_A file_B | awk '{
    i = 1; j = NF/2 + 1;
    while (j < NF)
    {
        printf("%s %s ", $i, $j);
        i++; j++;
    }
    printf("%s %s\n", $i, $j);
}'
The code assumes that the number of columns in awk's input is even.
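A minimal defensive sketch (an addition, not part of the original answer) that reports and skips any line with an odd field count instead:
paste file_A file_B | awk '
NF % 2 { print "line " NR ": odd number of fields, skipped" > "/dev/stderr"; next }
{
    half = NF / 2
    for (i = 1; i <= half; i++)
        printf "%s %s%s", $i, $(i + half), (i < half ? " " : "\n")
}'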
Use this Perl one-liner after paste to print alternating columns:
paste file_A file_B | perl -F'\t' -lane 'print join "\t", @F[ map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 ];'
Example:
Create tab-delimited input files:
perl -le 'print join "\t", 1..5 for 1..2;' > file_A
perl -le 'print join "\t", "A".."E" for 1..2;' > file_B
head file_A file_B
Prints:
==> file_A <==
1 2 3 4 5
1 2 3 4 5
==> file_B <==
A B C D E
A B C D E
Paste files side by side, also tab-delimited:
paste file_A file_B | perl -F'\t' -lane 'print join "\t", @F[ map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 ];'
Prints:
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
-F'\t' : Split into @F on TAB, rather than on whitespace.
$#F : last index of the array @F with the input fields, split on tab.
0 .. ( $#F - 1 ) / 2 : array of indexes of the array @F, from the start (0) to half of the array. These are all the indexes that correspond to file_A.
map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 : map takes the above array of indexes from 0 to half of the length of @F, and returns a new array with twice the number of elements. Its elements alternate: (a) the index corresponding to file_A ($_) and (b) that index plus half the length of the array ($_ + ( @F/2 )), which is the corresponding index from file_B.
@F[ map { ( $_, $_ + ( @F/2 ) ) } 0 .. ( $#F - 1 ) / 2 ] : a slice of the array @F with the specified indexes, namely alternating fields from file_A and file_B.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perldata: Slices
With one awk script parsing the files:
FNR==NR {
    rec[NR] = $0
    next
}
{
    split(rec[FNR], fields)
    for (i=1; i<=NF; i++) $i = fields[i] FS $i
    print
}
Usage:
awk -f tst.awk file_A file_B
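With the five-line sample files from the question, this should print the same interleaved result:
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E
1 A 2 B 3 C 4 D 5 E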

How to add two columns based on header names and paste results in a third row based on header name?

How can I add two columns (test1 and test2) and print the result in a fourth column, based on the column header names? (comma-delimited CSV file)
Input:
test1 test2 test3 test4
1 2 x
2 4 Y
Output:
test1 test2 test3 test4
1 2 x 3
2 4 Y 6
I tried the below, which works, but I want it to be based on the column headers rather than on positions.
awk -F, '{$3=$1+$2;} {print $1,$2,$3}' OFS=, testing.csv
The best way to deal with this is to create an array that maps the column header strings (i.e. the field names) to the field numbers when reading the header line and then just access the fields by their names from then on:
$ awk '
NR==1 { for (i=1;i<=NF;i++) f[$i]=i }
NR>1 { $(f["test4"]) = $(f["test1"]) + $(f["test2"]) }
1' file
test1 test2 test3 test4
1 2 x 3
2 4 Y 6
I assumed above that you don't really have blank lines between data lines in your input; they are trivially handled if you do.
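For instance, a sketch with such a guard, using NF so that empty lines pass through unchanged:
$ awk '
NR==1      { for (i=1;i<=NF;i++) f[$i]=i }
NR>1 && NF { $(f["test4"]) = $(f["test1"]) + $(f["test2"]) }
1' file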
If your input/output is really CSV then just create a BEGIN section declaring that:
$ cat file
test1,test2,test3,test4
1,2,x,
2,4,Y
$ awk 'BEGIN{FS=OFS=","} NR==1{for (i=1;i<=NF;i++) f[$i]=i} NR>1{$(f["test4"]) = $(f["test1"]) + $(f["test2"])} 1' file
test1,test2,test3,test4
1,2,x,3
2,4,Y,6
Sample input:
cat inputfile
test1 test2 test3 test4
1 2 x
2 4 Y
Here, the header is read from the first line to get the column numbers of test1 and test2, which are stored in the variables t1 and t2; then $4 is reassigned to itself followed by the sum of the columns pointed to by t1 and t2.
awk 'NR==1{for(i=1;i<=NF;i++) if($i=="test1") t1=i; else if($i=="test2") t2=i} NR>1{$4=$4 FS $t1+$t2} {print }' inputfile
test1 test2 test3 test4
1 2 x 3
2 4 Y 6
In case you have blank lines in your input file and want to preserve them, add a non-zero NF check, like NR>1 && NF{$4=$4 FS $t1+$t2}.
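For example (a sketch assembled from the note above), the full command with that guard would be:
awk 'NR==1{for(i=1;i<=NF;i++) if($i=="test1") t1=i; else if($i=="test2") t2=i} NR>1 && NF{$4=$4 FS $t1+$t2} {print}' inputfile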

Finding the rows sharing information

I have a file having a structure like below:
file1.txt:
1 10 20 A
1 10 20 B
1 10 20 E
1 10 20 F
1 12 22 C
1 13 23 X
2 33 45 D
2 48 49 D
2 48 49 E
I am trying to find out which letters share the same information in the 1st, 2nd and 3rd columns.
For example the output should be:
A
B
E
F
D
E
I am only able to count how many lines are unique via:
cut -f1,2,3 file1.txt | sort | uniq | wc -l
5
which does not give me anything related to the 4th column.
How do I get the letters in the fourth column that share the first three columns?
The following awk may help you here.
awk 'FNR==NR{a[$1,$2,$3]++;next} a[$1,$2,$3]>1' Input_file Input_file
Output will be as follows.
1 10 20 A
1 10 20 B
1 10 20 E
1 10 20 F
2 48 49 D
2 48 49 E
To get only the last field's value, change a[$1,$2,$3]>1 to a[$1,$2,$3]>1{print $NF}.
process the file once:
awk '{k=$1 FS $2 FS $3}
     k in a{a[k]=a[k] RS $4; b[k]; next}
     {a[k]=$4}
     END{for(x in b) print a[x]}' file
process the file twice:
awk 'NR==FNR{a[$1,$2,$3]++;next}a[$1,$2,$3]>1{print $4}' file file
With the given example, both one-liners above give same output:
A
B
E
F
D
E
Note the first one may generate the "letters" in a different order.
using best of both worlds...
$ awk '{print $4 "\t" $1,$2,$3}' file | uniq -Df1 | cut -f1
A
B
E
F
D
E
Swap the order of the fields, ask uniq to skip the first field and print only duplicates, then remove the compared fields.
or,
$ rev file | uniq -Df1 | cut -d' ' -f1
A
B
E
F
D
E
If the tag name is not a single character, you need to add | rev at the end, as shown below.
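That is (assuming space-separated input):
rev file | uniq -Df1 | cut -d' ' -f1 | rev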
NB: both scripts assume the data is already sorted on the compared keys, as in the sample input.
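If it is not, a sort on the first three fields could be applied first, e.g. (a sketch reusing the first pipeline):
sort -k1,3 file | awk '{print $4 "\t" $1,$2,$3}' | uniq -Df1 | cut -f1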
Another one-pass:
$ awk '{
    k = $1 FS $2 FS $3      # create array key
    if (k in a) {           # a is the not-yet-printed queue
        print a[k] ORS $NF  # once printed from a...
        b[k] = $NF          # ...move it to b
        delete a[k]         # and delete it from a
    }
    else if (k in b) {      # already-printed queue
        print $NF
    }
    else a[k] = $NF         # store in the not-yet-printed queue a
}' file
A
B
E
F
D
E

How to filter based on another file

Say I have a file in this format
file 1:
kk a 1
rf c 3
df g 7
er e 4
es b 3
and another file 2:
c
g
e
I want to filter the second column based on file 2 and output a file like this:
rf c 3
df g 7
er e 4
What would the Linux command for this be?
awk 'NR==FNR{A[$1];next}($2 in A)' file2 file1
This reads the keys of file2 into the array A, then prints each line of file1 whose second column is in A.
You can use join for this, if both files are sorted or in the correct order, although the output columns come out in a different order:
join --nocheck-order -1 2 -2 1 file1.txt file2.txt
gives
c rf 3
g df 7
e er 4
With perl, you can read the keys file and then check each line for a match
use strict;
use warnings;

my %keys;
open(my $f1, '<', 'file2.txt') or die("Cannot open file2.txt: $!");
while (<$f1>) {
    chomp;
    $keys{$_} = 1;
}
close($f1);

open(my $f2, '<', 'file1.txt') or die("Cannot open file1.txt: $!");
while (<$f2>) {
    my (undef, $col2, undef) = split(' ', $_);
    print if ($keys{$col2});
}
close($f2);
This will give the desired
rf c 3
df g 7
er e 4
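Saved as, say, filter.pl (the file name is just an example), the script would be run with:
perl filter.pl
and prints the matching lines of file1.txt to standard output.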
Not necessarily fast or pretty, but it does the trick:
cut -d ' ' -f 2 file1 | grep -n -F -x -f file2 | cut -d ':' -f 1 | while read lineNo; do sed "${lineNo}!d" file1; done
This numbers the column-2 values of file1, keeps the line numbers whose value is listed in file2, and prints the corresponding lines of file1.

In a *nix environment, how would I group columns together?

I have the following text file:
A,B,C
A,B,C
A,B,C
Is there a way, using standard *nix tools (cut, grep, awk, sed, etc), to process such a text file and get the following output:
A
A
A
B
B
B
C
C
C
You can do:
tr , \\n
and that will generate
A
B
C
A
B
C
A
B
C
which you could sort.
Unless you want to pull the first column, then the second, then the third, in which case you want something like:
awk -F, '{for(i=1;i<=NF;++i) print i, $i}' file.txt | sort -s -k1,1n | awk '{print $2}'
To explain this, the first part generates
1 A
2 B
3 C
1 A
2 B
3 C
1 A
2 B
3 C
the second part will stably sort (so the internal order is preserved)
1 A
1 A
1 A
2 B
2 B
2 B
3 C
3 C
3 C
and the third part will strip the numbers
You could use a shell for-loop combined with cut if you know the number of columns in advance. Here is an example using bash syntax:
for i in {1..3}; do
    cut -d, -f "$i" file.txt
done
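If the column count is not known in advance, it could be derived from the first line instead, e.g. (a sketch assuming all rows have the same number of columns):
n=$(head -n 1 file.txt | awk -F, '{print NF}')
for ((i = 1; i <= n; i++)); do
    cut -d, -f "$i" file.txt
done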
Try:
awk 'BEGIN {FS=","} /([A-C],)+([A-C])?/ {for (i=1;i<=NF;i++) print $i}' YOURFILE | sort
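With the sample file above, this should print:
A
A
A
B
B
B
C
C
C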
