Completing a tab delimited file, by copying down the first string - linux

I wasn't sure how to put this one into words. I have a list that I am trying to convert into a tab delimited file. Here is the list in raw form:
|01BFRUITS|
^banana
^apple
^orange
^pear
|01AELECTRONICS|
^television
^radio
^dishwasher
^computer
|01AANIMAL|
^bear
^cat
^dog
^elephant
|01ASHAPE|
^circle
^square
^diamond
^star
After many headaches I learned that GNU sed has -z (cat test.txt | sed -z 's/|\r\n^/\t/g' | tr '^' '\t' | tr -d '|'), which allowed me to create the following output:
01BFRUITS banana
apple
orange
pear
01AELECTRONICS television
radio
dishwasher
computer
01AANIMAL bear
cat
dog
elephant
01ASHAPE circle
square
diamond
star
Now I'm trying to get the output to look like:
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
What type of command can handle that?
As suggested:
$ awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,c2 }' < test.txt
01BFRUITbanana
01BFRUITapple
01BFRUITorange
01BFRUITpear
01AELECTtelevision
01AELECTradioS
01AELECTdishwasher
01AELECTcomputer
01AANIMAbear
01AANIMAcat
01AANIMAdog
01AANIMAelephant
01ASHAPEcircle
01ASHAPEsquare
01ASHAPEdiamond
01ASHAPEstar
This clips the first string and ignores the tab in between. It seems like a good start; I will try to see if I can fix it.
Resolved this by adding OFS to the print:
$ awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,OFS,c2 }' < test.txt
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
Thanks for getting me there, @jhnc.
Edit:
Added | sed -z s/\r\t\t//g to remove the \r\t after c1
cat test.txt | awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,OFS,c2 }' | sed -z s/\\r\\t\\t//g
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
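The clipped first column shown earlier is what a terminal displays when every input line ends in a carriage return, which the `\r\n` in the original sed command suggests is the case here. A minimal sketch, assuming Windows line endings are indeed the cause: strip the `\r` inside awk, so that neither the extra OFS in the print nor the trailing sed is needed. The function name `fill_down` and the inline sample data are illustrative only.

```shell
# Hypothetical helper: copy each header down its rows, dropping a trailing CR first.
fill_down() {
  awk -v OFS='\t' '
    { sub(/\r$/, "") }                          # remove a trailing carriage return, if any
    /^\|/ { h = $0; gsub(/\|/, "", h); next }   # header line: remember it without the pipes
    /^\^/ { print h, substr($0, 2) }            # item line: header, tab, item without the ^
  '
}

# Sample input with CRLF line endings
printf '|01BFRUITS|\r\n^banana\r\n^apple\r\n|01ASHAPE|\r\n^circle\r\n' | fill_down
```

Reading from stdin keeps the helper usable as `fill_down < test.txt`.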

$ awk -F'|' -v OFS="\t" 'NF==3{h=$2; next}{gsub(/^[\^]/,""); print h,$0}' inputfile
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
Or
$ awk -F'[|^]' -v OFS="\t" 'NF==3{h=$2;next}{print h,$2}' inputfile
Or
$ awk -F'[|^]' 'NF==3{h=$2;next}{$0=h"\t"$2}1' inputfile

@jhnc
The print section of the command was missing OFS. I added it and voilà!
EDIT: To account for the \r\t after c1, I've added
| sed -z s/\\r\\t\\t//g
which resulted in
cat TESTCOUNT.txt | awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,OFS,c2 }' | sed -z s/\\r\\t\\t//g
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star

This might work for you (GNU sed):
sed -En 'N;/^(\|(.*)\|)\n\^(.*)/{s//\2\t\3\n\1/;P};D' file
Append the next line to the pattern space (N).
If the first of the two lines begins and ends with | and the second line begins with ^, reformat them as required, append the original first line, and then print only the amended first line (P).
Whatever the result, delete the first line and repeat (D).

Related

Combine 3 files into one

I have 3 files.
File1
Red
Blue
Green
File2
Apple LadyBug Fire Red Set1
Lettuce Grass Frog Green Set1
Jean Ocean Sky Blue Set1
File3
BlueBerries Blue Set2
Rose Red Set2
Tree Green Set2
Output
Red
Apple LadyBug Fire Red Set1
Rose Red Set2
Blue
Jean Ocean Sky Blue Set1
BlueBerries Blue Set2
.
.
.
cat File1 File2 File3 > output4 | sort -u
Or
grep -f File1 File2 File3 > output4
This doesn't work.
I think you are trying to use the lines of file1 as patterns.
Then this should work:
while IFS= read -r line; do
    printf '\n-------\n'
    printf '%s\n' "$line"              # print the key once
    grep -h -- "$line" file2 file3     # then every matching line from both files
done < file1
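The loop above runs grep once per key. A single awk pass can produce the same grouping; this is only a sketch, and it assumes (as in the sample data) that the colour is always the second-to-last whitespace-separated field. The function name `group_by_key` and the temporary files are illustrative.

```shell
# Hypothetical helper: print each key from the first file, then matching lines from the rest.
group_by_key() {
  awk '
    NR == FNR { keys[++n] = $0; next }                     # first file: remember keys in order
    NF >= 2   { rows[$(NF-1)] = rows[$(NF-1)] $0 "\n" }    # other files: bucket by 2nd-to-last field
    END       { for (i = 1; i <= n; i++) printf "%s\n%s", keys[i], rows[keys[i]] }
  ' "$@"
}

# Small demo using a subset of the question's data
dir=$(mktemp -d)
printf 'Red\nBlue\n' > "$dir/File1"
printf 'Apple LadyBug Fire Red Set1\nJean Ocean Sky Blue Set1\n' > "$dir/File2"
printf 'BlueBerries Blue Set2\nRose Red Set2\n' > "$dir/File3"
group_by_key "$dir/File1" "$dir/File2" "$dir/File3"
rm -r "$dir"
```

Matching on a whole field avoids the substring matches a plain grep would allow, for example a key `Blue` matching `BlueBerries`.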

AWK count occurrences of column A based on uniqueness of column B

I have a file with several columns, and I want to count the occurrences of each value in one column, counting only the unique values of a second column that appear with it.
For example:
column 10 column 15
-------------------------------
orange New York
green New York
blue New York
gold New York
orange Amsterdam
blue New York
green New York
orange Sweden
blue Tokyo
gold New York
I am fairly new to using commands like awk and am looking to gain more practical knowledge.
I've tried some different variations of
awk '{A[$10 OFS $15]++} END {for (k in A) print k, A[k]}' myfile
but, not quite understanding the code, the output was not what I expected.
I am expecting output of
orange 3
blue 2
green 1
gold 1
With GNU awk. I assume tab is your field separator.
awk '{count[$10 FS $15]++}END{for(j in count) print j}' FS='\t' file | cut -d $'\t' -f 1 | sort | uniq -c | sort -nr
Output:
3 orange
2 blue
1 green
1 gold
I suppose it could be more elegant.
Single GNU awk invocation version (Works with non-GNU awk too, just doesn't sort the output):
$ gawk 'BEGIN { OFS=FS="\t" }
        NR>1  { names[$2,$1]=$1 }
        END   { for (n in names) colors[names[n]]++;
                PROCINFO["sorted_in"] = "@val_num_desc";
                for (c in colors) print c, colors[c] }' input.tsv
orange 3
blue 2
gold 1
green 1
Adjust column numbers as needed to match real data.
Bonus solution that uses sqlite3:
$ sqlite3 -batch -noheader <<EOF
.mode tabs
.import input.tsv names
SELECT "column 10", count(DISTINCT "column 15") AS total
FROM names
GROUP BY "column 10"
ORDER BY total DESC, "column 10";
EOF
orange 3
blue 2
gold 1
green 1
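For awk implementations without `PROCINFO["sorted_in"]`, the same count can be sketched with the common `!seen[...]++` idiom plus an external sort. This assumes a tab-separated, two-column input with no header row; the function name `count_distinct` is illustrative, and the column numbers would need adjusting for the real file.

```shell
# Hypothetical helper: for each value in column 1, count the distinct values in column 2.
count_distinct() {
  awk -F'\t' -v OFS='\t' '
    !seen[$1, $2]++ { count[$1]++ }            # true only the first time a (col1, col2) pair appears
    END { for (k in count) print k, count[k] }
  ' | sort -t "$(printf '\t')" -k2,2nr -k1,1   # highest count first, ties broken by name
}

# The sample data from the question, tab-separated
printf 'orange\tNew York\ngreen\tNew York\nblue\tNew York\ngold\tNew York\norange\tAmsterdam\nblue\tNew York\ngreen\tNew York\norange\tSweden\nblue\tTokyo\ngold\tNew York\n' | count_distinct
```

Sorting outside awk keeps the awk part portable, at the cost of one extra process.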

Linux, awk and how to count and print consecutive lines in a file?

For example I have a file like:
apple
apple
strawberry
What I want to achieve is to print the repeated line (apple) and count how many times it occurs consecutively (2), like this: apple-2, using awk.
My code so far is below; however, it prints apple1-apple1.
awk '{current = $NF;
getline;
if($NF == current) i++;
printf ("%s-%d",current,i) }' $file
Thank you in advance.
How about uniq -c and awk for filtering:
$ uniq -c foo|awk '$1>1'
2 apple
Given:
$ cat file
apple
apple
strawberry
mango
apple
strawberry
strawberry
strawberry
You can do:
$ awk '$1==last{seen[$1]++}
{last=$1}
END{for (e in seen)
print seen[e]+1, e}' file
2 apple
3 strawberry
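Note that the awk answer above totals repeats per word across the whole file. If each run should be reported separately, in file order, and in the `apple-2` format the question asks for, one possible sketch is the following. The function name `count_runs` is illustrative, and runs of length 1 are assumed to be uninteresting.

```shell
# Hypothetical helper: print "line-count" for every run of identical consecutive lines longer than 1.
count_runs() {
  awk '
    NR > 1 && $0 != prev { if (n > 1) print prev "-" n; n = 0 }   # run ended: report it, reset counter
    { prev = $0; n++ }                                            # extend (or start) the current run
    END { if (n > 1) print prev "-" n }                           # report the final run
  '
}

printf 'apple\napple\nstrawberry\n' | count_runs
```

Because each run is reported as soon as it ends, a word that repeats in two separate places yields two output lines rather than one combined total.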

Linux command or/and script for duplicate lines retrieval

I would like to know if there's an easy way to locate duplicate lines in a text file that contains many entries (about 200.000 or more) and output a file with the duplicates' line numbers, keeping the source file intact. For instance, I got a file with tweets like this:
1. i got red apple
2. i got red apple in my stomach
3. i got green apple
4. i got red apple
5. i like blue bananas
6. i got red apple
7. i like blues music
8. i like blue bananas
9. i like blue bananas
I want the output to be a separate file like this:
4
6
8
9
where numbers will indicate the lines with duplicate entries (excluding the first occurrence of the duplicates). Also note that the matching pattern must be exactly the same sentence (like line 1 is different than line 2, 5 is different than 7 and so on).
Everything I could find with sort | uniq doesn't seem to match the whole sentence but only the first word of the sentence so I'm considering if an awk script would be better for this task or if there is another type of command that can do that.
I also need the first file to be intact (not sorted or reordered in any way) and get only the line numbers as shown above because I want to manually delete these lines from two files. The first file contains the tweets and the second the hashtags of these tweets, so I want to delete the lines that contain duplicate tweets in both files, keeping the first occurrence.
You can try this awk:
awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
As per comment,
awk '$0 in a{print NR} {a[$0]++}' file
Output:
$ awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
4
8
$ awk '$0 in a{print NR} {a[$0]++}' file
4
6
8
9
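Since the stated goal is to delete the same line numbers from a second file, the list of numbers can be fed straight back into awk instead of deleting by hand. A rough sketch follows; the helper names and the file names `tweets.txt`, `hashtags.txt`, and `dups.txt` are placeholders, not names from the question.

```shell
# Hypothetical helpers; all file names below are placeholders.
dup_line_numbers() {   # $1 = file to scan; prints the line number of every repeat after the first
  awk 'seen[$0]++ { print NR }' "$1"
}
drop_lines() {         # $1 = file of line numbers, $2 = file to filter; prints $2 without those lines
  awk 'FILENAME == ARGV[1] { drop[$1]; next } !(FNR in drop)' "$1" "$2"
}

dir=$(mktemp -d)
printf 'a\nb\na\nc\nb\n' > "$dir/tweets.txt"
printf '#1\n#2\n#3\n#4\n#5\n' > "$dir/hashtags.txt"
dup_line_numbers "$dir/tweets.txt" > "$dir/dups.txt"   # lines 3 and 5 are repeats
drop_lines "$dir/dups.txt" "$dir/hashtags.txt"         # hashtags with lines 3 and 5 removed
rm -r "$dir"
```

Neither helper modifies its input; redirect the output of `drop_lines` to a new file to keep the originals intact.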
You could use a Python script to do the same:
# Print the 1-based line number of every repeat after the first occurrence.
with open("file") as f:
    lines = f.readlines()
count = len(lines)
ignore = []  # indices already reported as duplicates
for i in range(count):
    if i in ignore:
        continue
    for j in range(i + 1, count):
        if lines[i] == lines[j]:
            ignore.append(j)
            print(j + 1)
output :
4
6
8
9
Here is a method combining a few command line tools:
nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' |
cut -f 1
This
numbers the lines with nl, left adjusted with no leading zeroes (-n ln)
sorts them (ignoring the first field, i.e., the line number) with sort
finds duplicate lines, ignoring the first field with uniq; the --all-repeated=prepend adds an empty line before each group of duplicate lines
removes all the empty lines and the first one of each group of duplicates with sed
removes everything but the line number with cut
This is what the output looks like at the different stages:
$ nl -n ln file
1 i got red apple
2 i got red apple in my stomach
3 i got green apple
4 i got red apple
5 i like blue bananas
6 i got red apple
7 i like blues music
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2
3 i got green apple
1 i got red apple
4 i got red apple
6 i got red apple
2 i got red apple in my stomach
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
7 i like blues music
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend
1 i got red apple
4 i got red apple
6 i got red apple
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}'
4 i got red apple
6 i got red apple
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' | cut -f 1
4
6
8
9

how to sort this in bash

Hello I have a file containing these lines:
apple
12
orange
4
rice
16
How can I use bash to sort it by the numbers?
Suppose each number is the price for the above object.
I want them formatted like this:
12 apple
4 orange
16 rice
or
apple 12
orange 4
rice 16
Thanks
A solution using paste + sort to get each product sorted by its price:
$ paste - - < file|sort -k 2nr
rice 16
apple 12
orange 4
Explanation
From paste man:
Write lines consisting of the sequentially corresponding lines from
each FILE, separated by TABs, to standard output. With no FILE, or
when FILE is -, read standard input.
paste reads the stream from stdin (your < file) and treats each - as a separate input that happens to be the same stream, so with - - consecutive lines are joined in pairs, giving two columns.
sort uses the key -k 2nr to sort paste's output by the second column in reverse numerical order.
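The pairing and the numeric key can be seen in isolation with a small sketch. It sorts ascending rather than descending and names the tab delimiter explicitly; the function name `pair_and_sort` is illustrative.

```shell
# Hypothetical helper: join every two input lines with a tab, then sort by the second column.
pair_and_sort() {
  paste - - | sort -t "$(printf '\t')" -k2,2n   # n = compare the key as a number, not as text
}

printf 'apple\n12\norange\n4\nrice\n16\n' | pair_and_sort
```

Without the `n` modifier the key is compared as text, so 12 and 16 would sort before 4.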
You can use awk:
awk '!(NR%2){printf "%s %s\n" ,$0 ,p}{p=$0}' inputfile
(slightly adapted from this answer)
If you want to sort the output afterwards, you can use sort (quite logically):
awk '!(NR%2){printf "%s %s\n" ,$0 ,p}{p=$0}' inputfile | sort -n
this would give:
4 orange
12 apple
16 rice
Another solution using awk
$ awk '/[0-9]+/{print prev, $0; next} {prev=$0}' input
apple 12
orange 4
rice 16
while read -r line1 && read -r line2;do
printf '%s %s\n' "$line1" "$line2"
done < input_file
If you want the lines sorted by price, pipe the result to sort -k2n (the n makes the comparison numeric):
while read -r line1 && read -r line2;do
printf '%s %s\n' "$line1" "$line2"
done < input_file | sort -k2n
You can do this using paste and awk
$ paste - - <lines.txt | awk '{printf("%s %s\n",$2,$1)}'
12 apple
4 orange
16 rice
An awk-based solution that needs no external paste or sort, no regex, no modulo arithmetic, and no awk/bash loops:
{m,g}awk '(_*=--_) ? (__ = $!_)<__ : ($++NF = __)_' FS='\n'
12 apple
4 orange
16 rice
