Removing Only Sequential Duplicates from Text File? [duplicate] - linux

This sounds simple on its face but is actually somewhat more complex. I would like to use a Unix utility to delete consecutive duplicate lines, leaving only the first occurrence, while preserving duplicates that do not occur immediately after the original. For example, if we have the lines:
O B
O B
C D
T V
O B
I want the output to be:
O B
C D
T V
O B
Although the first and last lines are the same, they are not consecutive and therefore I want to keep them as unique entries.

You can do:
cat file1 | uniq > file2
or more succinctly:
uniq file1 file2
assuming file1 contains
O B
O B
C D
T V
O B
For more details, see man uniq. In particular, note that the uniq command accepts two arguments with the following syntax: uniq [OPTION]... [INPUT [OUTPUT]].
Finally, if you want to remove all duplicates (and sort the file along the way), you can do:
sort -u file1 > file2
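To see the difference on the sample above (a quick demonstration; file1 holds the five lines shown):
$ uniq file1
O B
C D
T V
O B
$ sort -u file1
C D
O B
T V
uniq drops only the consecutive duplicate, while sort -u sorts the file and drops every duplicate.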

Related

Bash script: filter columns based on a character

My text file should consist of two columns separated by a tab (represented by \t) as shown below. However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the second value that appears after the space in column 1; e.g. in C\sx\t3 I can discard the x after the space and store the columns as C\t3.
I have tried a couple of things, but with no luck.
I tried to cut the columns based on \t into independent columns, then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=($(cut -d$'\t' -f1 "$file" | cut -d' ' -f1))
col2=($(cut -d$'\t' -f2 "$file"))
myArr=()
for ((idx = 0; idx < ${#col1[@]}; idx++)); do
    echo "${col1[$idx]} ${col2[$idx]}"
    # I will append to myArr here
done
The output appends the list of col2 to col1 as A B C D E 1 2 3 4 5. On top of that, my file is very large (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search for a literal space followed by any number of non-tab characters and replace it with nothing:
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there, just make it 's/ +[^\t]+//'.
Assuming that when you say a space you mean a blank character, then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
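Spelled out with comments, that one-liner reads as follows (a sketch equivalent to the command above):
awk '
  BEGIN { FS = OFS = "\t" }    # read and write tab-separated fields
  { sub(/ .*/, "", $1) }       # in field 1, delete everything from the first blank onward
  1                            # always-true condition: print the (possibly modified) record
' file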
Solution using Perl regular expressions (for me they are easier than sed's, and more portable, since there are several different versions of sed):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5
This code captures the run of non-whitespace characters at the front of the line and the run of non-whitespace characters after the \t.
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
Take advantage of FS and OFS and let them do all the hard work for you:
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'
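Coming back to the cut-based idea in the question: the same two extractions can be glued together without a shell loop by using process substitution and paste (a sketch, assuming bash and the tab-delimited layout described; $file is the input file from the question):
paste <(cut -d$'\t' -f1 "$file" | cut -d' ' -f1) \
      <(cut -d$'\t' -f2 "$file")
paste joins the two streams line by line with a tab, so there is no per-row loop over the 5,300,000 records.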

Fast extraction of lines based on line numbers

I am looking for a fast way to extract lines of a file based on a list of line numbers read from a different file in bash.
Define three files:
position_file: Containing a single column of integers
full_data_file: Containing a single column of data
extracted_data_file: Containing those lines in full_data_file whose line numbers match the integers in position_file
My current way of doing this is
while read position; do
awk -v pos="$position" 'NR==pos {print; exit}' < full_data_file >> extracted_data_file
done < position_file
The problem is that this is painfully slow and I'm trying to do this for a large number of rather large files. I was hoping someone might be able to suggest a faster way.
Thank you for your help.
The right way with an awk command:
Input files:
$ head pos.txt data.txt
==> pos.txt <==
2
4
6
8
10
==> data.txt <==
a
b
c
d
e
f
g
h
i
j
awk 'NR==FNR{ a[$1]; next }FNR in a' pos.txt data.txt > result.txt
$ cat result.txt
b
d
f
h
j
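For reference, the same command written out with comments (a sketch of how the two-file NR==FNR idiom works):
awk '
  # While reading the first file (pos.txt), NR and FNR are equal.
  # Store each wanted line number as an array key; no value is needed.
  NR == FNR { a[$1]; next }
  # For the second file (data.txt), FNR is the line number within that file.
  # A bare true condition prints the line, so this prints the data.txt lines
  # whose numbers are among the stored keys.
  FNR in a
' pos.txt data.txt > result.txt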

Multiple text insertion in Linux

Can someone help me write a command that will insert some text at multiple places (given column and row) in a file that already contains data? For example: old_data is a file that contains:
A
And I wish to get new_data that will contain:
A 1
I have read something about the awk and sed commands, but I don't think I understand how to use them to get what I want.
I would also like to add that I intend to use this command as part of the script
for b in ./*/ ; do (cd "$b" && command); done
If we imagine the content of old_data as a matrix of elements {A(n,m)}, where n is the row number and m the column number, I wish to manipulate the matrix so that I can add new elements. A in old_data has coordinates (1,1). In new_data, therefore, I wish to assign 1 to the matrix element with coordinates (1,3).
If we compare the content of old_data and new_data, we see that element (1,2) corresponds to a space (it is empty).
It's not at all clear to me what you are asking for, but I suspect you are saying that you would like a way to insert some given text into a particular row and column. Perhaps:
$ cat input
A
B
C
D
$ row=2 column=2 text="This is some new data"
$ awk 'NR==row {$column = new_data " " $column}1' row=$row column=$column new_data="$text" input
A
B This is some new data
C
D
This bash & unix tools code works:
# make the input files.
echo {A..D} | tr ' ' '\n' > abc ; echo {1..4} | tr ' ' '\n' > 123
# print as per previous OP spec
head -1q abc 123 ; paste abc 123 123 | tail -n +2
Output:
A
1
B 2 2
C 3 3
D 4 4
Version #3 (using commas as more visible separators), as per the newest OP spec:
# for the `sed` code change the `2` to whatever column needs deleting.
paste -d, abc 123 123 | sed 's/[^,]*//2'
Output:
A,,1
B,,2
C,,3
D,,4
The same, with tab delimiters (less visually obvious):
paste abc 123 123 | sed 's/[^\t]*//2'
A 1
B 2
C 3
D 4
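Since the question mentions running the command inside a per-directory loop, the awk answer above slots in like this (a sketch; row, column and text are the placeholder values from that answer, and old_data/new_data are the file names from the question):
for b in ./*/ ; do
  (
    cd "$b" &&
    awk 'NR==row { $column = new_data " " $column } 1' \
        row="$row" column="$column" new_data="$text" old_data > new_data
  )
done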

Compare 2 files and remove duplicate lines only once

I need to remove the duplicate values from file 1 by comparing with file 2. When I try to do so, I run into an issue: since the values in file 2 (c, g) also appear under [b] in file 1, those get deleted as well, but my requirement is to delete only the ones under [a]. Thanks
$ less file1
[a]
c
g
d
[b]
c
g
h
and
$ less file2
[a]
c
g
d
You can use this awk command:
awk '/^\[.*?\]/{s=$0} FNR==NR{seen[s,$0]++; next} !seen[s,$0]' file2 file1
[b]
c
g
h
This awk uses an associative array seen whose composite key combines the current [...] section header with the record itself, i.e. s,$0.
While going through file2 it saves those keys in the array, and while traversing file1 it prints only the records that are not already in seen, thus avoiding the duplicates.
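Written out with comments, the logic is (a sketch equivalent to the one-liner above):
awk '
  # Remember the most recent [section] header in s.
  /^\[.*\]/ { s = $0 }
  # First file given (file2): record every (section, line) pair seen.
  NR == FNR { seen[s, $0]++; next }
  # Second file (file1): print a line only if that (section, line) pair
  # did not occur in file2.
  !seen[s, $0]
' file2 file1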

in Linux: merge two very big files

I would like to merge two files (one space delimited, the other tab delimited), keeping only the records that match between the two files:
File 1: space delimited
A B C D E F G H
s e id_234 4 t 5 7 9
r d id_45 6 h 3 9 10
f w id_56 2 y 7 3 0
s f id_67 2 y 10 3 0
File 2: tab delimited
I L M N O P
s e 4 u id_67 88
d a 5 d id_33 67
g r 1 o id_45 89
I would like to match File 1 field 3 ("C") with file 2 field 5 ("O"), and merge the files like this:
File 3: tab delimited
I L M N O P A B D E F G H
s e 4 u id_67 88 s f 2 y 10 3 0
g r 1 o id_45 89 r d 6 h 3 9 10
There are entries in file 1 that don't appear in file 2, and vice versa, but I only want to keep the intersection (the common ids).
I don't really care about the order.
I would prefer not to use join because these are really big unsorted files, and join requires the inputs to be sorted on the common field first, which takes a very long time and a lot of memory.
I have tried with awk, but unsuccessfully:
awk > file3 'NR == FNR {
f2[$3] = $2; next
}
$5 in f2 {
print $0, f2[$2]
}' file2 file1
Can someone please help me?
Thank you very much
Hmm... you'll ideally be looking to avoid an n^2 solution, which is what the awk-based approach seems to require. For each record in file1 you have to scan file2 to see if it occurs. That's where the time is going.
I'd suggest writing a python (or similar) script for this and building a map id->file position for one of the files and then querying that whilst scanning the other file. That'd get you an nlogn runtime which, to me at least, looks to be the best you could do here (using a hash for the index leaves you with the expensive problem of seeking to the file pos).
In fact, here's the Python script to do that:
f1 = file("file1.txt")
f1_index = {}
# Generate index for file1
fpos = f1.tell()
line = f1.readline()
while line:
    id = line.split()[2]
    f1_index[id] = fpos
    fpos = f1.tell()
    line = f1.readline()
# Now scan file2 and output matches
f2 = file("file2.txt")
line = f2.readline()
while line:
    id = line.split()[4]
    if id in f1_index:
        # Found a matching line, seek to file1 pos and read
        # the line back in
        f1.seek(f1_index[id], 0)
        line2 = f1.readline().split()
        del line2[2]  # <- Remove the redundant id_XX
        new_line = "\t".join(line.strip().split() + line2)
        print new_line
    line = f2.readline()
If sorting the two files (on the columns you want to match on) is a possibility (and wouldn't break the content somehow), join is probably a better approach than trying to accomplish this with bash or awk. Since you state you don't really care about the order, then this would probably be an appropriate method.
It would look something like this:
join -1 3 -2 5 -o '2.1,2.2,2.3,2.4,2.5,2.6,1.1,1.2,1.4,1.5,1.6,1.7,1.8' <(sort -k3,3 file1) <(sort -k5,5 file2)
I wish there was a better way to tell it which columns to output, because that's a lot of typing, but that's the way it works. You could probably also leave off the -o ... stuff, and then just post-process the output with awk or something to get it into the order you want...
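If the smaller file fits in memory, another option that avoids both the sort and the repeated scans is a single-pass hash join in awk (a sketch, assuming the layouts shown above: file2 is tab-delimited with the id in field 5, file1 is space-delimited with the id in field 3; the header lines would need separate handling):
awk '
  # First pass: file2 (tab-delimited); remember each line under its id (field 5).
  NR == FNR { f2[$5] = $0; next }
  # Second pass: file1 (space-delimited); print only ids that also appear in file2,
  # appending the file1 fields minus the id itself, tab-separated.
  $3 in f2 {
    rest = ""
    for (i = 1; i <= NF; i++)
      if (i != 3) rest = rest "\t" $i
    print f2[$3] rest
  }
' FS='\t' file2 FS=' ' file1 > file3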
