I would like to merge two files (one is space delimited and the other tab delimited), keeping only the records that match between the two files:
File 1: space delimited
A B C D E F G H
s e id_234 4 t 5 7 9
r d id_45 6 h 3 9 10
f w id_56 2 y 7 3 0
s f id_67 2 y 10 3 0
File 2: tab delimited
I L M N O P
s e 4 u id_67 88
d a 5 d id_33 67
g r 1 o id_45 89
I would like to match File 1 field 3 ("C") with file 2 field 5 ("O"), and merge the files like this:
File 3: tab delimited
I L M N O P A B D E F G H
s e 4 u id_67 88 s f 2 y 10 3 0
g r 1 o id_45 89 r d 6 h 3 9 10
There are entries in file 1 that don't appear in file 2, and vice versa, but I only want to keep the intersection (the common ids).
I don't really care about the order.
I would prefer not to use join because these are really big unsorted files, and join requires the input to be sorted on the common field first, which takes a very long time and a lot of memory.
I have tried with awk, but unsuccessfully:
awk > file3 'NR == FNR {
f2[$3] = $2; next
}
$5 in f2 {
print $0, f2[$2]
}' file2 file1
Can someone please help me?
Thank you very much
Hmm... you'll ideally be looking to avoid an n^2 solution, which is what the awk-based approach seems to require: for each record in file1 you have to scan file2 to see if it occurs. That's where the time is going.
I'd suggest writing a Python (or similar) script for this, building a map of id -> file position for one of the files, and then querying that while scanning the other file. That'd get you an n log n runtime, which, to me at least, looks to be about the best you could do here (using a hash for the index still leaves you with the expensive problem of seeking to the stored file position).
In fact, here's the Python script to do that:
f1 = open("file1.txt")
f1_index = {}

# Generate an index of id -> file offset for file1
fpos = f1.tell()
line = f1.readline()
while line:
    key = line.split()[2]        # the id is in field 3
    f1_index[key] = fpos
    fpos = f1.tell()
    line = f1.readline()

# Now scan file2 and output matches
f2 = open("file2.txt")
line = f2.readline()
while line:
    key = line.split()[4]        # the id is in field 5
    if key in f1_index:
        # Found a matching line: seek to the file1 position and read
        # the line back in
        f1.seek(f1_index[key], 0)
        line2 = f1.readline().split()
        del line2[2]             # <- remove the redundant id_XX
        print("\t".join(line.strip().split() + line2))
    line = f2.readline()
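For what it's worth, the same lookup-table idea also works in a single awk pass, at the cost of holding all of file1 in memory rather than just an index of offsets. A sketch along the lines of the attempt in the question (untested against the real data; adjust the field numbers if the layout differs):

awk -v OFS='\t' '
NR == FNR {                      # first argument: file1 (space delimited)
    rest = ""
    for (i = 1; i <= NF; i++)    # rebuild the record without the id field
        if (i != 3)
            rest = (rest == "" ? $i : rest OFS $i)
    f1[$3] = rest
    next
}
$5 in f1 {                       # second argument: file2 (tab delimited)
    print $0, f1[$5]             # file2 line first, then the file1 fields
}' file1 file2 > file3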
If sorting the two files (on the columns you want to match on) is a possibility (and wouldn't break the content somehow), join is probably a better approach than trying to accomplish this with bash or awk. Since you state you don't really care about the order, this would probably be an appropriate method.
It would look something like this:
join -1 3 -2 5 -o '2.1,2.2,2.3,2.4,2.5,2.6,1.1,1.2,1.4,1.5,1.6,1.7,1.8' <(sort -k3,3 file1) <(sort -k5,5 file2)
I wish there was a better way to tell it which columns to output, because that's a lot of typing, but that's the way it works. You could probably also leave off the -o ... stuff, and then just post-process the output with awk or something to get it into the order you want...
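Leaving off -o and fixing the order afterwards might look like this (a sketch, assuming join's default output of the join field first, then the remaining fields of file1, then those of file2):

join -1 3 -2 5 <(sort -k3,3 file1) <(sort -k5,5 file2) |
  awk -v OFS='\t' '{print $9,$10,$11,$12,$1,$13,$2,$3,$4,$5,$6,$7,$8}'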
Related
This sounds simple on its face but is actually somewhat more complex. I would like to use a Unix utility to delete consecutive duplicate lines, keeping the original, while preserving other duplicates that do not occur immediately after the original. For example, if we have the lines:
O B
O B
C D
T V
O B
I want the output to be:
O B
C D
T V
O B
Although the first and last lines are the same, they are not consecutive and therefore I want to keep them as unique entries.
You can do:
cat file1 | uniq > file2
or more succinctly:
uniq file1 file2
assuming file1 contains
O B
O B
C D
T V
O B
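For that input, uniq collapses only the two consecutive O B lines and leaves the later, non-consecutive one alone:
$ uniq file1
O B
C D
T V
O B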
For more details, see man uniq. In particular, note that the uniq command accepts two arguments with the following syntax: uniq [OPTION]... [INPUT [OUTPUT]].
Finally, if you want to remove all duplicates (and sort the file along the way), you could do:
sort -u file1 > file2
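For the same sample input, that would give (all duplicates gone, lines sorted):
$ sort -u file1
C D
O B
T V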
I am looking for a fast way to extract lines of a file based on a list of line numbers read from a different file in bash.
Define three files:
position_file: Containing a single column of integers
full_data_file: Containing a single column of data
extracted_data_file: Containing those lines in full_data_file whose line numbers match the integers in position_file
My current way of doing this is
while read position; do
awk -v pos="$position" 'NR==pos {print; exit}' < full_data_file >> extracted_data_file
done < position_file
The problem is that this is painfully slow and I'm trying to do this for a large number of rather large files. I was hoping someone might be able to suggest a faster way.
Thank you for your help.
The right way to do it with the awk command:
Input files:
$ head pos.txt data.txt
==> pos.txt <==
2
4
6
8
10
==> data.txt <==
a
b
c
d
e
f
g
h
i
j
awk 'NR==FNR{ a[$1]; next }FNR in a' pos.txt data.txt > result.txt
$ cat result.txt
b
d
f
h
j
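If the one-liner is opaque, here is the same logic spelled out (functionally identical):

awk '
NR == FNR {     # true only while reading the first file (pos.txt)
    a[$1]       # remember each wanted line number as an array key
    next        # skip the rest of the script for pos.txt lines
}
FNR in a        # while reading data.txt: print lines whose number was stored
' pos.txt data.txt > result.txt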
I have a tsv file like
1 2 3 4 5 ...
a b c d e ...
x y z j k ...
How can I merge two contiguous columns, say the 2nd and the 3rd, to get
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
I need the code to work with text files with different numbers of columns, so I can't use something like awk 'BEGIN{FS="\t"} {print $1"\t"$2"-"$3"\t"$4"\t"$5}' file
awk is the first tool I thought about for the task and one I'm trying to learn, so I'm very interested in answers using it, but any solution with any other tool would be greatly appreciated.
With a simple sed command for a tsv file:
sed 's/\t/-/2' file
The output:
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
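Since sed's numeric flag selects which occurrence to replace, the same idea generalizes to any pair of adjacent columns (a sketch, assuming GNU sed as in the command above; col is the left column of the pair):

col=2
sed "s/\t/-/$col" file   # replace the col-th tab, i.e. merge columns col and col+1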
The following awk may help as well, in case you are not worried about the extra space that will be created when the 3rd field is nullified.
awk '{$2=$2"-"$3;$3=""} 1' Input_file
With awk:
awk -v OFS='\t' -v col=2 '{
$(col)=$(col)"-"$(col+1); # merge col and col+1
for (i=col+1;i<NF;i++) $(i)=$(i+1); # shift columns right of col+1 by one to the left
NF--; # remove the last field
}1' file # print the record
Output:
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
Can someone help me write a command that will insert some text in multiple places (a given column and row) of a file that already contains data? For example: old_data is a file that contains:
A
And I wish to get new_data that will contain:
A 1
I have read something about the awk and sed commands, but I don't think I understand how to use them to get what I want.
I would like to add that I want to use this command as part of a script:
for b in ./*/ ; do (cd "$b" && command); done
If we imagine the content of old_data as a matrix of elements {A(n,m)}, where n is the row number and m the column number of this matrix, I wish to manipulate the matrix so that I can add new elements. A in old_data has coordinates (1,1). In new_data, therefore, I wish to assign 1 to the matrix element with coordinates (1,3).
If we compare the contents of old_data and new_data, we see that element (1,2) corresponds to a space (it is empty).
It's not at all clear to me what you are asking for, but I suspect you are saying that you would like a way to insert some given text into a particular row and column. Perhaps:
$ cat input
A
B
C
D
$ row=2 column=2 text="This is some new data"
$ awk 'NR==row {$column = new_data " " $column}1' row=$row column=$column new_data="$text" input
A
B This is some new data
C
D
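The trailing row=$row column=$column new_data="$text" assignments are evaluated by awk just before it reads input, which is why they are already set when the NR==row test runs; an equivalent way to write it with -v would be:

awk -v row="$row" -v column="$column" -v new_data="$text" 'NR==row {$column = new_data " " $column}1' input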
This bash & unix tools code works:
# make the input files.
echo {A..D} | tr ' ' '\n' > abc ; echo {1..4} | tr ' ' '\n' > 123
# print as per previous OP spec
head -1q abc 123 ; paste abc 123 123 | tail -n +2
Output:
A
1
B 2 2
C 3 3
D 4 4
Version #3 (using commas as more visible separators), as per the newest OP spec:
# for the `sed` code change the `2` to whatever column needs deleting.
paste -d, abc 123 123 | sed 's/[^,]*//2'
Output:
A,,1
B,,2
C,,3
D,,4
The same, with tab delimiters (less visually obvious):
paste abc 123 123 | sed 's/[^\t]*//2'
A 1
B 2
C 3
D 4
So I have a huge list of items.
I need to grep every line containing a number of 1300 and above.
How can I do this? Will grep do this? Thanks
While grep technically can, it's probably not the best tool for the job. If the list is in a fixed format, you might be better off using something like awk.
Sample input:
a b c 1100 d e f
g h i 1200 j k l
m n o 1300 p q r
s t u 1400 v w x
Sample code:
awk -F' ' '($4 >= 1300) { print $0 }' input_file
Sample output:
m n o 1300 p q r
s t u 1400 v w x
awk goes through every line, splitting it into tokens delimited by a space (as dictated by the parameter -F' '; by default it already splits on whitespace, but showing it explicitly here lets you change it to match however your file is formatted). The condition then says: for every line whose field 4 is greater than or equal to 1300, print the line (print $0).
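If the threshold changes often, it can be passed in as a variable instead of being hard-coded (same logic, just parameterized):

awk -F' ' -v min=1300 '($4 >= min) { print $0 }' input_file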
Yes, you can do this with grep, something along the lines of:
$ grep -E '(1[3-9][0-9]{2}|[2-9][0-9]{3}|[1-9][0-9]{4,})'
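If the numbers can also appear embedded in longer digit strings, word boundaries tighten it up a little (a sketch, assuming GNU grep's \b extension and a file argument named input_file):

grep -E '\b(1[3-9][0-9]{2}|[2-9][0-9]{3}|[1-9][0-9]{4,})\b' input_file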