Combine 3 files into one - linux

I have 3 files.
File1
Red
Blue
Green
File2
Apple LadyBug Fire Red Set1
Lettuce Grass Frog Green Set1
Jean Ocean Sky Blue Set1
File3
BlueBerries Blue Set2
Rose Red Set2
Tree Green Set2
Output
Red
Apple LadyBug Fire Red Set1
Rose Red Set2
Blue
Jean Ocean Sky Blue Set1
BlueBerries Blue Set2
.
.
.
cat File1 File2 File3 > output4 | sort -u
Or
grep -f File1 File2 File3 > output4
Neither of these works.

I think you are trying to use file1 as the pattern file. Then this should work:
while IFS= read -r line; do
    echo -e "\n-------"
    echo "$line"
    for foo in file2 file3; do
        grep -h "$line" "$foo"
    done
done < file1
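As a single-pass alternative, here is an awk sketch (assuming the colour is always the second-to-last field of file2 and file3, as in the sample data):
awk 'NR==FNR { order[++n]=$1; want[$1]=1; next }                  # file1: remember the colours and their order
     ($(NF-1) in want) { group[$(NF-1)] = group[$(NF-1)] ORS $0 } # file2/file3: collect lines under their colour
     END { for (i=1; i<=n; i++) print order[i] group[order[i]] }' file1 file2 file3
This prints each colour from file1 followed by its matching lines from file2 and then file3.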

Related

Completing a tab-delimited file by copying down the first string

I wasn't sure how to put this one into words. I have a list that I am trying to convert into a tab-delimited file. Here is the list in raw form:
|01BFRUITS|
^banana
^apple
^orange
^pear
|01AELECTRONICS|
^television
^radio
^dishwasher
^computer
|01AANIMAL|
^bear
^cat
^dog
^elephant
|01ASHAPE|
^circle
^square
^diamond
^star
After many headaches I learned that GNU sed has a -z option (cat test.txt | sed -z 's/|\r\n^/\t/g' | tr '^' '\t' | tr -d '|'), which allowed me to create the following output:
01BFRUITS banana
apple
orange
pear
01AELECTRONICS television
radio
dishwasher
computer
01AANIMAL bear
cat
dog
elephant
01ASHAPE circle
square
diamond
star
Now I'm trying to get the output to look like:
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
What type of command can handle that?
As suggested:
$ awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,c2 }' < test.txt
01BFRUITbanana
01BFRUITapple
01BFRUITorange
01BFRUITpear
01AELECTtelevision
01AELECTradioS
01AELECTdishwasher
01AELECTcomputer
01AANIMAbear
01AANIMAcat
01AANIMAdog
01AANIMAelephant
01ASHAPEcircle
01ASHAPEsquare
01ASHAPEdiamond
01ASHAPEstar
The first string is clipped and the tab in between appears to be ignored; the trailing \r from the Windows line endings sends the cursor back to the start of the line, so the second field overwrites the beginning of the first. Still, this seems like a good start. I will see if I can fix it.
Resolved this by adding OFS to the print:
$ awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,OFS,c2 }' < test.txt
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
Thanks for getting me there, @jhnc.
Edit: added | sed -z 's/\r\t\t//g' to remove the \r\t\t left after c1 (see also the note after the output below):
cat test.txt | awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,OFS,c2 }' | sed -z s/\\r\\t\\t//g
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
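Side note: the stray \r\t\t comes from Windows (CRLF) line endings in test.txt combined with the extra OFS argument in the print. Stripping the carriage returns up front avoids the second sed pass entirely; a sketch along those lines, which should work with any POSIX awk:
awk -v OFS='\t' '{ sub(/\r$/,"") }                               # drop any trailing carriage return
    /^\|/ { c1=$0; gsub(/\|/,"",c1) }                            # header line: remember it without the pipes
    /^\^/ { c2=$0; sub(/^\^/,"",c2); print c1, c2 }' test.txt    # data line: print header TAB value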
$ awk -F'|' -v OFS="\t" 'NF==3{h=$2; next}{gsub(/^[\^]/,""); print h,$0}' inputfile
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
Or
$ awk -F'[|^]' -v OFS="\t" 'NF==3{h=$2;next}{print h,$2}' inputfile
Or
$ awk -F'[|^]' 'NF==3{h=$2;next}{$0=h"\t"$2}1' inputfile
This might work for you (GNU sed):
sed -En 'N;/^(\|(.*)\|)\n\^(.*)/{s//\2\t\3\n\1/;P};D' file
It appends the following line to the pattern space. If the first of the now two lines begins and ends with | and the second line begins with ^, it reformats them as required (header, tab, value), re-appends the original header line, and prints only the amended first line. Whatever the result, it then deletes the first line and repeats, so the retained header is reused for the next ^ line; for example, |01BFRUITS| is kept around and paired in turn with ^banana, ^apple, and so on.

AWK count occurrences of column A based on uniqueness of column B

I have a file with several columns, and I want to count the occurrences of one column based on a second column's value being unique to the first column.
For example:
column 10 column 15
-------------------------------
orange New York
green New York
blue New York
gold New York
orange Amsterdam
blue New York
green New York
orange Sweden
blue Tokyo
gold New York
I am fairly new to using commands like awk and am looking to gain more practical knowledge.
I've tried some different variations of
awk '{A[$10 OFS $15]++} END {for (k in A) print k, A[k]}' myfile
but, since I don't quite understand the code, the output was not what I expected.
I am expecting output of
orange 3
blue 2
green 1
gold 1
With GNU awk, assuming tab is your field separator:
awk '{count[$10 FS $15]++}END{for(j in count) print j}' FS='\t' file | cut -d $'\t' -f 1 | sort | uniq -c | sort -nr
Output:
3 orange
2 blue
1 green
1 gold
I suppose it could be more elegant.
Single GNU awk invocation version (works with non-GNU awk too, it just doesn't sort the output):
$ gawk 'BEGIN{ OFS=FS="\t" }
    NR>1 { names[$2,$1]=$1 }
    END  { for (n in names) colors[names[n]]++;
           PROCINFO["sorted_in"] = "@val_num_desc";
           for (c in colors) print c, colors[c] }' input.tsv
orange 3
blue 2
gold 1
green 1
Adjust column numbers as needed to match real data.
Bonus solution that uses sqlite3:
$ sqlite3 -batch -noheader <<EOF
.mode tabs
.import input.tsv names
SELECT "column 10", count(DISTINCT "column 15") AS total
FROM names
GROUP BY "column 10"
ORDER BY total DESC, "column 10";
EOF
orange 3
blue 2
gold 1
green 1
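For completeness, a plain POSIX awk sketch of the same idea (assuming the real file is tab-separated and the values really do sit in $10 and $15 as in your attempt; the output is unsorted). It counts a colour only the first time each colour/city pair is seen:
awk -F'\t' -v OFS='\t' '$10 != "" && !seen[$10 FS $15]++ { count[$10]++ }
    END { for (c in count) print c, count[c] }' myfile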

comparing files by fields in bash

I have two arbitrary files:
==> file1 <==
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 eggplant
==> file2 <==
11110 abcdefg
11111 apple-pie
11112 banana-cake
11113 chocolate
11115 egg
11116 fruit
For the sake of comparing these files, I only care about the number in the first column; the words after it are unimportant.
I want to be able to readily identify numbers that are missing from each file.
For example, file 1 has no 11116 and file 2 has no 11114.
If I sort the files together I can get a complete list:
$ sort file*
11110 abcdef
11110 abcdefg
11111 apple
11111 apple-pie
11112 banana
11112 banana-cake
11113 carrot
11113 chocolate
11114 date
11115 egg
11115 eggplant
11116 fruit
I can get a list of all the numbers by running it through uniq and comparing only the first five characters (the number):
$ sort file* | uniq -w5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 egg
11116 fruit
That's a list of all numbers 11110-11116.
I can get a list of uniques and duplicates by asking uniq to filter those for me:
duplicates (numbers that appear in both files):
$ sort file* | uniq -dw5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11115 egg
unique numbers, or numbers that only appear in one file:
$ sort file* | uniq -uw5
11114 date
11116 fruit
I would like something that has output resembling:
# shows numbers that do not exist in this file
$ sort file* | <is missing>
==> file1 <==
11116 fruit
==> file2 <==
11114 date
It could do the reverse and show which numbers are missing from the OTHER file; either case is workable:
# shows numbers that exist ONLY in this file
$ sort file* | <has unique>
==> file1 <==
11114 date
==> file2 <==
11116 fruit
The first field will contain ~30 alphanumeric characters.
The files in question contain thousands of entries and the majority of entries are expected to be in both files.
The arbitrary data to the right of the number is relevant and needs to remain.
I had the idea of:
generate a complete list of numbers
compare that list with file1 searching for unique entries
compare that list with file2 searching for unique entries
But I can't work out how to do that on a single line:
sort file* | uniq -w5 | sort file1 | uniq -uw5
sort file* | uniq -w5 | sort file2 | uniq -uw5
However, the output of the first uniq doesn't get merged in with the resorting of file1/2...
The solution I came up with was to create the output of all the numbers:
$ sort file* | uniq -w5
and then run that against each file individually; that does work, I just couldn't piece it together on one line:
$ sort all file1 | uniq -uw5
11116 fruit
$ sort all file2 | uniq -uw5
11114 date
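For the record, the two steps can be glued together with process substitution, something along these lines:
$ sort <(sort file* | uniq -w5) file1 | uniq -uw5
11116 fruit
$ sort <(sort file* | uniq -w5) file2 | uniq -uw5
11114 date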
I am now working on incorporating join, thanks Kamil.
Edit: I never got any further myself; @Shawn gave it to me in one very short line:
join -j1 -v1 file1 file2
Once I have two compiled lists in the format I require, a join performed on the files spits out the required answer. From my code examples above:
$ join -j1 -v1 file1 file2
11114 date
$ join -j1 -v2 file1 file2
11116 fruit
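Worth noting: join expects both inputs to be sorted on the join field, so for files that are not already sorted something like this should work:
$ join -j1 -v1 <(sort file1) <(sort file2)    # lines whose ID appears only in file1
$ join -j1 -v2 <(sort file1) <(sort file2)    # lines whose ID appears only in file2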
A real-world example:
I thought I would generate a real-world example of what I have been working on. Take five arbitrary files:
lorem1.txt
lorem2.txt
lorem3.txt
lorem4.txt
lorem5.txt
and make a backup of them. I have modified one bit in lorem2.txt and removed lorem4.txt from the backup (consider it a new file, or, for whatever reason, just a missing file):
test$ tree
.
├── data
│   ├── lorem1.txt
│   ├── lorem2.txt
│   ├── lorem3.txt
│   ├── lorem4.txt
│   └── lorem5.txt
└── data-backup
├── lorem1.txt
├── lorem2.txt
├── lorem3.txt
└── lorem5.txt
2 directories, 9 files
mad#test$ md5deep data/* | sort > hash1
mad#test$ md5deep data-backup/* | sort > hash2
mad#test$ head hash*
==> hash1 <==
44da5caec444b6f00721f499e97c857a /test/data/lorem1.txt
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
a00edd450c533091e0f62a06902545a4 /test/data/lorem5.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data/lorem3.txt
==> hash2 <==
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
44da5caec444b6f00721f499e97c857a /test/data-backup/lorem1.txt
a00edd450c533091e0f62a06902545a4 /test/data-backup/lorem5.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data-backup/lorem3.txt
Running our joins:
join 1
mad#test$ join -j1 -v1 hash*
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
From our two sets of hash files, joining them and verifying against the first file, we see that the matching hashes of lorem2.txt and lorem4.txt are missing from the second file (lorem2 because we changed a bit, and lorem4 because we didn't copy it, or deleted it from the backup).
Doing the reverse join we can see lorem2 exists, it's just that the hash is incorrect:
join 2
mad#test$ join -j1 -v2 hash*
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
Using my sort and uniq examples from earlier I could get similar results, but the join above is much better. join 1 shows us files we need to revisit; join 2 specifically shows us which hashes are incorrect.
Sorting by name and showing unique names (which was way outside the scope of the original question) can reveal files that are missing from the backup. In this example, I rewrite the backup filenames so they mimic the original paths, merge them with the original hash list, and sort and uniq on the filenames only, not the hashes:
test$ sort -k2 hash1 <(sed 's/data-backup/data/g' hash2) | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
if we had a file that contained all the hashes:
test$ sort -k2 hash1 allhashes | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
Thanks again to everyone who helped me formulate this. It has turned into a real life- and time-saver.
Using GNU awk, you can make use of this approach:
awk 'ARGIND < ARGC-1 {
         a[ARGIND][$1] = 1
         next
     }
     {
         for (i=1; i<ARGC-1; i++)
             if (!a[i][$1])
                 print ARGV[i] ":", $0
     }' file1 file2 <(sort file1 file2)
file2: 11114 date
file1: 11116 fruit
Only in file1 (comm extracts the IDs unique to f1 from the sorted first columns, and grep -f uses them as a pattern file, so it still works when more than one ID differs):
grep -wFf <(comm -23 <(cut -d' ' -f1 f1 | sort) <(cut -d' ' -f1 f2 | sort)) f1
This awk version takes only one pass through each file. It assumes that there are no duplicate IDs within a file.
awk '
NR == FNR {f1[$1] = $0; next}
!($1 in f1) {printf "only in %s: %s\n", FILENAME, $0}
$1 in f1 {delete f1[$1]}
END {for (id in f1) printf "only in %s: %s\n", ARGV[1], f1[id]}
' file1 file2
outputs:
only in file2: 11116 fruit
only in file1: 11114 date
You can use diff between the two files. However, if you diff them directly, all the lines will be listed:
$ diff file1 file2
1,6c1,6
< 11110 abcdef
< 11111 apple
< 11112 banana
< 11113 carrot
< 11114 date
< 11115 eggplant
---
> 11110 abcdefg
> 11111 apple-pie
> 11112 banana-cake
> 11113 chocolate
> 11115 egg
> 11116 fruit
But you only care about the leading numbers.
$ diff <(cut -d' ' -f1 file1) <(cut -d' ' -f1 file2)
5d4
< 11114
6a6
> 11116
If the files are not sorted, then add a sort:
$ diff <(cut -d' ' -f1 file1 | sort) <(cut -d' ' -f1 file2 | sort)
5d4
< 11114
6a6
> 11116
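comm can also give a two-column split of the IDs directly (a sketch, assuming the first fields are already sorted, as they are here):
$ comm -3 <(cut -d' ' -f1 file1) <(cut -d' ' -f1 file2)
11114
        11116
IDs found only in file1 land in the first column, and IDs found only in file2 in the second (tab-indented).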

Linux, awk and how to count and print consecutive lines in a file?

For example I have a file like:
apple
apple
strawberry
What I want to achieve is to print the consecutive line (apple) and count how many times it is repeated consecutively (2), like this: apple-2, using awk.
My code so far is below; however, it produces apple1-apple1 instead.
awk '{current = $NF;
getline;
if($NF == current) i++;
printf ("%s-%d",current,i) }' $file
Thank you in advance.
How about uniq -c and awk for filtering:
$ uniq -c foo|awk '$1>1'
2 apple
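To get the exact apple-2 format asked for, the uniq -c output can be reshaped with awk (a small sketch that assumes one word per line, as in the example):
$ uniq -c foo | awk '$1>1 { printf "%s-%d\n", $2, $1 }'
apple-2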
Given:
$ cat file
apple
apple
strawberry
mango
apple
strawberry
strawberry
strawberry
You can do:
$ awk '$1==last{seen[$1]++}
{last=$1}
END{for (e in seen)
print seen[e]+1, e}' file
2 apple
3 strawberry

Linux command or/and script for duplicate lines retrieval

I would like to know if there's an easy way to locate duplicate lines in a text file that contains many entries (about 200,000 or more) and output a file with the duplicates' line numbers, keeping the source file intact. For instance, I have a file with tweets like this:
1. i got red apple
2. i got red apple in my stomach
3. i got green apple
4. i got red apple
5. i like blue bananas
6. i got red apple
7. i like blues music
8. i like blue bananas
9. i like blue bananas
I want the output to be a separate file like this:
4
6
8
9
where the numbers indicate the lines with duplicate entries (excluding the first occurrence of each duplicate). Also note that the match must be the exact same sentence (line 1 is different from line 2, 5 is different from 7, and so on).
Everything I could find with sort | uniq doesn't seem to match the whole sentence, only its first word, so I'm wondering whether an awk script would be better for this task or whether there is another command that can do it.
I also need the first file to be intact (not sorted or reordered in any way) and get only the line numbers as shown above because I want to manually delete these lines from two files. The first file contains the tweets and the second the hashtags of these tweets, so I want to delete the lines that contain duplicate tweets in both files, keeping the first occurrence.
You can try this awk:
awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
As per comment,
awk '$0 in a{print NR} {a[$0]++}' file
Output:
$ awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
4
8
$ awk '$0 in a{print NR} {a[$0]++}' file
4
6
8
9
You could use a Python script to do the same:
f = open("file")
lines = f.readlines()
count = len(lines)
ignore = []
for i in range(count):
    if i in ignore:
        continue
    for j in range(count):
        if j <= i:
            continue
        if lines[i] == lines[j]:
            ignore.append(j)
            print j + 1
Output:
4
6
8
9
Here is a method combining a few command line tools:
nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' |
cut -f 1
This:
- numbers the lines with nl, left-adjusted with no leading zeroes (-n ln)
- sorts them with sort, ignoring the first field (the line number)
- finds duplicate lines with uniq, ignoring the first field; --all-repeated=prepend adds an empty line before each group of duplicate lines
- removes all the empty lines and the first line of each group of duplicates with sed
- removes everything but the line number with cut
This is what the output looks like at the different stages:
$ nl -n ln file
1 i got red apple
2 i got red apple in my stomach
3 i got green apple
4 i got red apple
5 i like blue bananas
6 i got red apple
7 i like blues music
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2
3 i got green apple
1 i got red apple
4 i got red apple
6 i got red apple
2 i got red apple in my stomach
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
7 i like blues music
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend
1 i got red apple
4 i got red apple
6 i got red apple
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}'
4 i got red apple
6 i got red apple
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' | cut -f 1
4
6
8
9
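Once the line numbers are in a file of their own, say dupes.txt, they can be turned into a small sed script and applied to both files the question mentions, keeping the original order of the remaining lines (a sketch; dupes.txt, tweets.txt and hashtags.txt are placeholder names):
$ sed 's/$/d/' dupes.txt > delete.sed      # each number N becomes the sed command "Nd"
$ sed -i.bak -f delete.sed tweets.txt      # delete those lines in place, keeping a .bak copy
$ sed -i.bak -f delete.sed hashtags.txt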
