File Manipulation in UNIX - Creating duplicate records based on the count from a column and manipulating only one column - linux

I have a space-delimited file.txt with n columns. The 3rd column is comma-delimited, and I want to create duplicate records in the same file.txt, one per value in that comma-delimited column, by splitting the column on its commas.
--file.txt
I have 0,1,2,3 apples
I have 2,3 bananas
I have 3 oranges
--desiredoutput.txt
I have 0 apples
I have 1 apples
I have 2 apples
I have 3 apples
I have 2 bananas
I have 3 bananas
I have 3 oranges

Awk solution:
awk '$3~/,/{                          # only lines whose 3rd field contains a comma
  split($3, a, ",")                   # split the comma-delimited field into array a
  f=$1 OFS $2                         # save the first two fields
  sub(/^[^ ]+ +[^ ]+ +[^ ]+/,"")      # strip the first three fields from $0
  for (i in a) print f, a[i] $0       # prefix + one value + the remaining fields
  next
}1' file                              # the trailing 1 prints other lines unchanged
The output:
I have 0 apples
I have 1 apples
I have 2 apples
I have 3 apples
I have 2 bananas
I have 3 bananas
I have 3 oranges
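Note that for (i in a) does not guarantee numeric index order in every awk, although it usually works out in practice. A minimal variant that walks the split values in order, assuming the same input layout:
awk '$3~/,/{
  n=split($3, a, ",")                   # n is the number of comma-separated values
  f=$1 OFS $2
  sub(/^[^ ]+ +[^ ]+ +[^ ]+/,"")
  for (i=1; i<=n; i++) print f, a[i] $0 # iterate indices 1..n explicitly
  next
}1' file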

Related

AWK count occurrences of column A based on uniqueness of column B

I have a file with several columns, and I want to count the occurrences of one column's values, counting each value of a second column only once per value of the first.
For example:
column 10 column 15
-------------------------------
orange New York
green New York
blue New York
gold New York
orange Amsterdam
blue New York
green New York
orange Sweden
blue Tokyo
gold New York
I am fairly new to using commands like awk and am looking to gain more practical knowledge.
I've tried some different variations of
awk '{A[$10 OFS $15]++} END {for (k in A) print k, A[k]}' myfile
but, since I don't quite understand the code, the output was not what I expected.
I am expecting output of
orange 3
blue 2
green 1
gold 1
With GNU awk. I assume tab is your field separator.
awk '{count[$10 FS $15]++}END{for(j in count) print j}' FS='\t' file | cut -d $'\t' -f 1 | sort | uniq -c | sort -nr
Output:
3 orange
2 blue
1 green
1 gold
I suppose it could be more elegant.
Single GNU awk invocation version (works with non-GNU awk too, it just doesn't sort the output):
$ gawk 'BEGIN{ OFS=FS="\t" }
NR>1 { names[$2,$1]=$1 }
END { for (n in names) colors[names[n]]++;
PROCINFO["sorted_in"] = "#val_num_desc";
for (c in colors) print c, colors[c] }' input.tsv
orange 3
blue 2
gold 1
green 1
Adjust column numbers as needed to match real data.
Bonus solution that uses sqlite3:
$ sqlite3 -batch -noheader <<EOF
.mode tabs
.import input.tsv names
SELECT "column 10", count(DISTINCT "column 15") AS total
FROM names
GROUP BY "column 10"
ORDER BY total DESC, "column 10";
EOF
orange 3
blue 2
gold 1
green 1

diff 2 files with an output that does not include extra lines

I have 2 files, test and test1, and I would like to diff them without the output having the extra characters 2a3, 4a6, 6a9 shown below.
test:
mangoes
apples
banana
peach
mango
strawberry
test1:
mangoes
apples
blueberries
banana
peach
blackberries
mango
strawberry
star fruit
When I diff both files:
$ diff test test1
2a3
> blueberries
4a6
> blackberries
6a9
> star fruit
How do I get the output as
$ diff test test1
blueberries
blackberries
star fruit
A solution using comm:
comm -13 <(sort test) <(sort test1)
Explanation
comm - compare two sorted files line by line
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
As we only need the lines unique to the second file test1, -13 is used to suppress the unwanted columns.
Process substitution is used to feed the sorted files to comm.
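For the sample files above this yields the three extra lines, though in sorted order rather than in the original file order:
$ comm -13 <(sort test) <(sort test1)
blackberries
blueberries
star fruit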
You can use grep to filter out everything except the changed lines:
$ diff file1 file2 | grep '^[<>]'
> blueberries
> blackberries
> star fruit
If you want to remove the direction markers that show which file a line comes from, use sed:
$ diff file1 file2 | sed -n 's/^[<>] //p'
blueberries
blackberries
star fruit
(But it may be confusing to not see which file differs...)
You can use awk:
awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
NR==FNR means the first file given on the command line (i.e. test) is currently being processed,
a[$0] stores each record as a key in the array named a,
next means read the next line without doing anything else,
!($0 in a) means: if the current line does not exist in a, print it.

compare columns from different files and print those that DO NOT match

I have two files, file1 and file2. I want to compare several columns ($1, $2, $3 and $4) of file1 with the same columns of file2 and print those rows of file2 that do not match any row in file1.
E.g.
file1
aaa bbb ccc 1 2 3
aaa ccc eee 4 5 6
fff sss sss 7 8 9
file2
aaa bbb ccc 1 f a
mmm nnn ooo 1 d e
aaa ccc eee 4 a b
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
fff sss sss 7 5 6
I want to have as output:
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
I have seen questions asked here about finding the rows that DO match and printing them, but not vice versa, those that DO NOT match.
Thank you!
Use the following script:
awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
k is the concatenated value of columns 1, 2, 3 and 4, delimited by FS, and will later be used as a key in the search array a. NR==FNR is true while reading file1, so the array a indexed by k is created while reading file1.
For the remaining lines of input I check with !(k in a) whether the index does not exist in a. If that evaluates to true, awk prints the line.
Here is another approach, if the files are sorted and you know the character set in use:
$ function f(){ sed 's/ /~/g;s/~/ /4g' $1; }; join -v2 <(f file1) <(f file2) |
sed 's/~/ /g'
mmm nnn ooo 1 d e
aaa ccc eee 4 a b
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
fff sss sss 7 5 6
This creates a key field by concatenating the first four fields (with a ~ character, though any unused character will do), uses join to find the entries from file2 that have no match in file1, and then splits the synthetic key field back apart.
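Note that join expects both inputs to be sorted on the join field; the sample files are not sorted on the synthetic key, which is why matched rows such as aaa ccc eee 4 a b still show up above. Sorting inside the process substitutions fixes that. A minimal sketch, reusing the f helper defined above:
join -v2 <(f file1 | sort) <(f file2 | sort) | sed 's/~/ /g'
For the sample files this leaves only the three unmatched rows, in sorted order.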
However, the best way is to use the awk solution with a slight fix:
$ awk 'NR==FNR{a[$1,$2,$3,$4]; next} !(($1,$2,$3,$4) in a)' file1 file2
No doubt the awk solution from #hek2mgl is better than this one, but for the record this is also possible using uniq, sort, and rev:
rev file1 file2 | sort -k3 | uniq -u -f2 | rev
rev reverses each line of both files, right to left.
sort -k3 sorts the lines, skipping the first 2 fields.
uniq -u -f2 prints only the lines that are unique (skipping the first 2 fields while comparing).
Finally, rev reverses the lines back.
This solution sorts the lines of both files, which may or may not be desired.

AWK count number of times a term appear with respect to other columns

Given a CSV file:
id, fruit, binary
1, apple, 1
2, orange, 0
3, pear, 1
4, apple, 0
5, peach, 0
6, apple, 1
How can I calculate, for each unique value in fruit, the number of times the binary value = 1 divided by the number of occurrences of that fruit in the fruit column?
Another way to do it is to sum the value of the binary column for each unique fruit.
For example:
For the fruit apple, it appeared with binary = 1 two times and had a frequency of 3. Hence I will get 2/3.
How can I write this in efficient awk code?
I know that I can do this to get the unique values from the second column:
cut -d , -f2 file.csv | sort | uniq
or
awk '{ a[$2]++ } END { for (b in a) { print b } }' file.csv
So my non-working code looks like this:
cat file.csv | awk '{ a[$2]++ } END { for (b in a) if ($3==1) {sum+=$3} END {print $0 sum}'
and
awk '{ a[$2]++ } END { for (b in a) { sum +=1 } }' file.csv
I need help correcting my syntax and merging the two awk snippets.
This should work for you:
$ cat file.csv
1, apple, 1
2, orange, 0
3, pear, 1
4, apple, 0
5, peach, 0
6, apple, 1
$ awk -F',' '{ $3 == 1 && fruit[$2]++; tfruit[$2]++ } END { for (fr in tfruit) { print fr, fruit[fr], tfruit[fr] } }' file.csv
pear 1 1
apple 2 3
orange 1
peach 1
Almost the same as the other answer, but printing 0 instead of blank.
AMD$ awk -F, 'NR>1{a[$2]+=$3;b[$2]++} END{for(i in a)print i, a[i], b[i]}' File
pear 1 1
apple 2 3
orange 0 1
peach 0 1
Taking , as the field separator. For all lines except the first, update array a: $2 (the fruit name) is taken as the index, adding up the number of times binary is 1 for this fruit. Also increase b[$2] by one; this will be the number of times the fruit is seen. At the end, print the fruit, the binary count, and the number of times the fruit was seen. Hope it is clear.
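To get the ratio itself (e.g. 2/3 for apple) rather than the raw counts, the same two-array idea can do the division in the END block. A minimal sketch, assuming the header line from the question's file and the regex field separator ' *, *' (an assumption, to absorb the spaces after the commas):
awk -F' *, *' 'NR>1 { ones[$2]+=$3; total[$2]++ }   # ones: binary=1 count, total: occurrences
END { for (f in total) printf "%s %d/%d = %.2f\n", f, ones[f], total[f], ones[f]/total[f] }' file.csv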

sed: How do I replace all lines containing a certain string?

Say I have a file containing these lines:
tomatoes
bananas
tomatoes with apples
pears
pears with apples
How do I replace every line containing the word "apples" with just "oranges"? This is what I want to end up with:
tomatoes
bananas
oranges
pears
oranges
Use sed 's/.*apples.*/oranges/' ?
You can use awk:
$ awk '/apples/{$0="oranges"}1' file
tomatoes
bananas
oranges
pears
oranges
This says to search for "apples" and, if found, change the whole record to "oranges".
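sed can also do this without a regex substitution, using its c (change) command, which replaces every line matching the address with the given text. A minimal sketch in the portable form:
sed '/apples/c\
oranges' file
With GNU sed, the one-line form sed '/apples/c oranges' file works as well.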
