I am looping through tab-delimited lines in a txt file. This txt file is the output of an xml/xslt process and contains duplicates. I am looking for a solution that works on the txt file, but solutions using XSLT are just as appreciated. Please see the example txt file below.
txtfile.txt: line 3 is a duplicate of line 1
hello@example.com running 1111
puppy@kennel.com running 9876
hello@example.com running 1111
husky@siberia.com shutdown 1234
puppy@kennel.com running 9876
hello@example.com running 1111
My question is: can duplicate lines be skipped in a loop so that the loop only processes unique lines? In this case, how would I make the loop process lines 1, 2, and 4 and skip lines 3, 5, and 6?
My current working loop which reads duplicates:
while read name status num
do
    echo "<tag1>"
    echo "<tag2>$name</tag2>"
    echo "<tag3>$status</tag3>"
    echo "<tag2>$num</tag2>"
    echo "</tag1>"
done < txtfile.txt
In my txtfile there are hundreds of lines and nearly half are duplicates, so this is a huge problem for me! Any ideas/solutions appreciated. Thanks in Advance.
You can read that file via sort -u to eliminate duplicate lines:
sort -u /your/file | while read ...
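For example, the loop from the question could be fed from sort -u like this (a sketch that keeps the OP's tags and fields; note that sort -u also reorders the lines):

sort -u txtfile.txt | while read name status num
do
    echo "<tag1>"
    echo "<tag2>$name</tag2>"
    echo "<tag3>$status</tag3>"
    echo "<tag2>$num</tag2>"
    echo "</tag1>"
done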
I would suggest using awk:
$ awk '!a[$0]++{print "<tag1>\n<tag2>" $1 "</tag2>\n<tag3>" $2 "</tag3>\n<tag2>" $3 "</tag2>\n</tag1>"}' file
<tag1>
<tag2>hello@example.com</tag2>
<tag3>running</tag3>
<tag2>1111</tag2>
</tag1>
<tag1>
<tag2>puppy@kennel.com</tag2>
<tag3>running</tag3>
<tag2>9876</tag2>
</tag1>
<tag1>
<tag2>husky@siberia.com</tag2>
<tag3>shutdown</tag3>
<tag2>1234</tag2>
</tag1>
The condition !a[$0]++ evaluates to true the first time each line is seen and false thereafter. When the condition is true, the output is printed.
The basic principle is that the contents of the line, $0, are used as a key in the array a. If there's a chance that the spacing may differ between records, you could use !a[$1,$2,$3]++ instead, which will count lines as being the same as long as the 3 fields are the same, regardless of the spacing between them.
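Applied to the one-liner above, that whitespace-insensitive variant would look like this (the same command, with only the array key changed):

awk '!a[$1,$2,$3]++{print "<tag1>\n<tag2>" $1 "</tag2>\n<tag3>" $2 "</tag3>\n<tag2>" $3 "</tag2>\n</tag1>"}' file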
Related question:
I have 2 files, file 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC
File 2 is a FASTA file (hg19.fasta) that looks like this. Although it breaks across the lines, this continuous string of A, C, G, and T's represents a continuous sequence (i.e. a chromosome). This file is the entire human reference genome build 19, so a > header followed by sequence lines, shown twice here, essentially occurs 23 times, once for each of the 23 chromosomes:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC
>2 dna:chromosome chromosome:GRCh37:1:1:2492300:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
I want to 1) find out how many times each 23bp sequence appears in the second file, without overlapping any others and including sequences that break across lines, and 2) append this number as a new column next to the sequence so the new file looks like this:
Desired Output:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
My attempt:
I imagine solving the first part will be some variation on grep; so far I've managed:
grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""
which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!
Any help is appreciated as always! =)
Updates and Edits:
The actual size of these files is relatively massive (30 MB or ~500,000 lines for each tsv/BED file), and the FASTA file, the entire human reference genome build 19, is ~60,000,000 lines. The perl solution proposed by @choroba works, but doesn't scale well to these sizes.
Unfortunately, because of the need to identify matches across the lines, the awk and bash/grep solutions mentioned below won't work.
I want multiple non-overlapping hits in the same chromosome to count as the actual number of hits. I.e. If you search for a sequence and get 2 hits in a single chromosome and 1 in another chromosome, the total count should be 3.
Ted Lyngmo is very kindly helping me develop a solution in C++ that allows this to be run in a realistic timeframe; there's more detail in his post in this thread, and the link to the GitHub repo for this is here =)
If the second file is significantly bigger than the first one, I would try this awk script:
awk 'v==1 {a[$7];next}               # Load the patterns from the first file into array a
     v==2 {                          # For each line of the second file
         for(i in a){                # Loop through all patterns
             a[i]+=split($0,b,i)-1   # Accumulate the number of matches of pattern i on this line
         }
     }
     v==3 {print $0,a[$7]}           # Re-read the first file to add the number of pattern matches
    ' v=1 file1 v=2 file2 v=3 file1
I'd reach for a programming language like Perl.
#!/usr/bin/perl
use warnings;
use strict;
my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed, '<', $bed_file or die "$bed_file: $!";
my $seq;
while (<$fasta>) {
    $seq .= "\n", next if /^>/;
    chomp;
    $seq .= $_;
}
while (<$bed>) {
    chomp;
    my $short_seq = (split /\t/, $_)[-1];
    my $count = () = $seq =~ /\Q$short_seq\E/g;
    print "$_\t$count\n";
}
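Saved as, say, count_seqs.pl, the script takes the FASTA file as its first argument and the BED file as its second, and writes the extended rows to stdout (count_seqs.pl, input.bed and output.bed below are placeholder names):

perl count_seqs.pl hg19.fasta input.bed > output.bed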
To count overlapping sequences, change the regex to a lookahead.
my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;
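The difference only shows up when a sequence can overlap itself; here is a tiny illustration with a made-up string (not from the question's data):

perl -e '$s = "ATATAT"; my $plain = () = $s =~ /ATAT/g; my $over = () = $s =~ /(?=ATAT)/g; print "$plain $over\n"'   # prints "1 2"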
Since grep -c seems to give you the correct count (matching lines, not counting multiple occurrences on the same line) you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:
#!/bin/bash
# read the fields into the array `v`:
while read -ra v
do
    # print the first 7 elements in the array + the output from `grep -c`:
    echo "${v[@]:0:7}" "$(grep -Fc "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile
outfile will now contain
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
Benchmarks
This table compares the three different solutions presented as answers here, with timings to finish different numbers of tsv/bed records against the full hg19.fa file (excluding the records containing only Ns); hg19.fa contains 57,946,726 such records. As a baseline I've used two versions of a C++ program (called hgsearch/hgsearchmm). hgsearch reads the whole hg19.fa file into memory and then searches it in parallel; hgsearchmm uses a memory-mapped file instead and then searches that (also in parallel).
search \ beds   1          2           100        1000        10000
awk             1m0.606s   17m19.899s  -          -           -
perl            13.263s    15.618s     4m48.751s  48m27.267s  -
bash/grep       2.088s     3.670s      3m27.378s  34m41.129s  -
hgsearch        8.776s     9.425s      30.619s    3m56.508s   38m43.984s
hgsearchmm      1.942s     2.146s      21.715s    3m28.265s   34m56.783s
The tests were run on an Intel Core i9 (12 cores/24 hardware threads) in WSL/Ubuntu 20.04 (SSD disk).
The sources for the scripts and baseline programs used can be found here
I am working on a project that requires me to take some .bed files as input, extract one column from each file, keep only the entries matching certain parameters, and count how many of them there are for each file. I am extremely inexperienced with bash so I don't know most of the commands, but this line of code should do the trick.
for FILE in *; do cat $FILE | awk '$9>1.3'| wc -l ; done>/home/parallels/Desktop/EP_Cell_Type.xls
I saved those values in a .xls since I need to do some graphs with them.
Now I would like to take the filenames (as listed by ls) and save them in the first column of my .xls, while my parameters should be in the 2nd column of my excel file.
I managed to save everything in one column with the command:
ls>/home/parallels/Desktop/EP_Cell_Type.xls | for FILE in *; do cat $FILE | awk '$9>1.3'-x| wc -l ; done >>/home/parallels/Desktop/EP_Cell_Type.xls
My sample files are: A549.bed, GM12878.bed, H1.bed, HeLa-S3.bed, HepG2.bed, Ishikawa.bed, K562.bed, MCF-7.bed, SK-N-SH.bed and are contained in a folder with those files only.
The output is the list of all filenames and the values in the same column, like this:
Column 1
A549.bed
GM12878.bed
H1.bed
HeLa-S3.bed
HepG2.bed
Ishikawa.bed
K562.bed
MCF-7.bed
SK-N-SH.bed
4536
8846
6754
14880
25440
14905
22721
8760
28286
but what I need should be something like this:
Filenames     #BS
A549.bed      4536
GM12878.bed   8846
H1.bed        6754
HeLa-S3.bed   14880
HepG2.bed     25440
Ishikawa.bed  14905
K562.bed      22721
MCF-7.bed     8760
SK-N-SH.bed   28286
Assuming OP's awk program (correctly) finds all of the desired rows, an easier (and faster) solution can be written completely in awk.
One awk solution that keeps track of the number of matching rows and then prints the filename and line count:
awk '
FNR==1 { if ( count >= 1 )                      # first line of new file? if line counter > 0
             printf "%s\t%d\n", prevFN, count   # then print previous FILENAME + tab + line count
         count=0                                # then reset our line counter
         prevFN=FILENAME                        # and save the current FILENAME for later printing
       }
$9>1.3 { count++ }                              # if field #9 > 1.3 then increment line counter
END    { if ( count >= 1 )                      # flush last FILENAME/line counter to stdout
             printf "%s\t%d\n", prevFN, count
       }
' * # * ==> pass all files as input to awk
For testing purposes I replaced $9>1.3 with /do/ (match any line containing the string 'do') and ran against a directory containing an assortment of scripts and data files. This generated the following tab-delimited output:
bigfile.txt 7
blocker_tree.sql 4
git.bash 2
hist.bash 4
host.bash 2
lines.awk 2
local.sh 3
multi_file.awk 2
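To get the two-column file the OP was after, the same program (saved as, say, count_rows.awk, a made-up filename) could be pointed at just the .bed files and redirected into the spreadsheet path from the question; note the result is tab-separated text that Excel can open, not a native .xls file:

awk -f count_rows.awk *.bed > /home/parallels/Desktop/EP_Cell_Type.xls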
I have a text file that contains numerous lines that have partially duplicated strings. I would like to remove lines where a string match occurs twice, such that I am left only with lines with a single match (or no match at all).
An example output:
g1: sample1_out|g2039.t1.faa sample1_out|g334.t1.faa sample1_out|g5678.t1.faa sample2_out|g361.t1.faa sample3_out|g1380.t1.faa sample4_out|g597.t1.faa
g2: sample1_out|g2134.t1.faa sample2_out|g1940.t1.faa sample2_out|g45.t1.faa sample4_out|g1246.t1.faa sample3_out|g2594.t1.faa
g3: sample1_out|g2198.t1.faa sample5_out|g1035.t1.faa sample3_out|g1504.t1.faa sample5_out|g441.t1.faa
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
In this case I would like to remove lines 1, 2, and 3 because sample1 is repeated multiple times on line 1, sample2 appears twice on line 2, and sample5 is repeated twice on line 3. Line 4 would pass because it contains only one instance of each sample.
I am okay repeating this operation multiple times using different 'match' strings (e.g. sample1_out , sample2_out etc in the example above).
Here is one in GNU awk:
$ awk -F"[| ]" '{              # pipe or space is the field separator
    delete a                   # delete previous hash
    for(i=2;i<=NF;i+=2)        # iterate every other field, ie right side of space
        if($i in a)            # if it has been seen already
            next               # skip this record
        else                   # well, else
            a[$i]              # hash this entry
    print                      # output if you make it this far
}' file
Output:
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
The following sed command will accomplish what you want: it matches lines where some text that follows a space and ends at a | occurs again later on the line, and -n together with !p prints only the lines that don't match.
sed -ne '/.* \(.*\)|.*\1.*/!p' file.txt
grep: grep -vE '(sample[0-9]).*\1' file - the backreference \1 requires a second occurrence of the same sampleN on the line, and -v drops those lines.
Inspired by Glenn's answer: use -i with sed to make the changes directly in the file.
sed -r '/(sample[0-9]).*\1/d' txt_file
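For example (GNU sed; the .bak suffix keeps a backup copy of the original file):

sed -r -i.bak '/(sample[0-9]).*\1/d' txt_file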
What is the best way to remove all lines from a text file starting at the first empty line in Bash? External tools (awk, sed...) can be used!
Example
1: ABC
2: DEF
3:
4: GHI
Lines 3 and 4 should be removed and the remaining content should be saved in a new file.
With GNU sed (Q quits without printing the current line, unlike q):
sed '/^$/Q' "input_file.txt" > "output_file.txt"
With AWK:
$ awk '/^$/{exit} 1' test.txt > output.txt
Contents of output.txt
$ cat output.txt
ABC
DEF
Walkthrough: for lines that match ^$ (start-of-line, end-of-line), exit (the whole script). For all lines, print the whole line -- of course, we won't get to this part after a line has made us exit.
Bet there are some more clever ways to do this, but here's one using bash's 'read' builtin. The question asks us to keep lines before the blank in one file and send lines after the blank to another file. You could send some of standard out one place and some another if you are willing to use 'exec' and reroute stdout mid-script, but I'm going to take a simpler approach and use a command line argument to let me know where the post-blank data should go:
#!/bin/bash
# script takes as its argument the name of the file to send data to once a
# blank line is found
found_blank=0
while IFS= read -r stuff; do
    if [ -z "$stuff" ] ; then
        found_blank=1
    fi
    if [ "$found_blank" -eq 1 ] ; then
        echo "$stuff" >> "$1"     # post-blank lines are appended to the named file
    else
        echo "$stuff"             # pre-blank lines go to stdout
    fi
done
run it like this:
$ ./delete_from_empty.sh rest_of_stuff < demo
output is:
ABC
DEF
and 'rest_of_stuff' has the blank line followed by
GHI
if you want the before-blank lines to go somewhere else besides stdout, simply redirect:
$ ./delete_from_empty.sh after_blank < input_file > before_blank
and you'll end up with two new files: after_blank and before_blank.
Perl version
perl -e '
    open my $fh,  ">", "stuff";
    open my $efh, ">", "rest_of_stuff";
    while (<>) {
        if ($_ !~ /\w+/) {
            $fh = $efh;
        }
        print $fh $_;
    }
' demo
This creates two output files and iterates over the demo data. When it hits a blank line, it flips the output from one file to the other.
Creates
stuff:
ABC
DEF
rest_of_stuff:
<blank line>
GHI
Another awk would be:
awk -vRS= '1;{exit}' file
By setting the record separator RS to an empty string, we define the records as paragraphs separated by a sequence of empty lines. It is now easy to adapt this to select the nth block (with n supplied on the command line, e.g. via -v n=2) as:
awk -vRS= '(FNR==n){print;exit}' file
There is a problem with this method when processing files with DOS line endings (CRLF). There will be no empty lines, as there will always be a CR in the line. This problem applies to all the presented methods.
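One workaround, sketched here for that CRLF case, is to strip the carriage returns before awk sees the data:

tr -d '\r' < file | awk -v RS= '1;{exit}'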
I need to remove lines with a duplicate value. For example, I need to remove lines 2 and 4 in the block below because they contain "Value04" - I cannot remove all lines containing Value03, because there are lines with that data that are NOT duplicates and must be kept. I can use any editor: Excel, vim, or any other Linux command line tool.
In the end there should be no duplicate "UserX" values. User1 should only appear 1 time. But if User1 exists twice, I need to remove the entire line containing "Value04" and keep the one with "Value03"
Value01,Value03,User1
Value02,Value04,User1
Value01,Value03,User2
Value02,Value04,User2
Value01,Value03,User3
Value01,Value03,User4
Your ideas and thoughts are greatly appreciated.
Edit: For clarity and leaving words out from the editing process.
The following Awk command removes all but the first occurrence of a value in the third column:
$ awk -F',' '{
    if (!seen[$3]) {
        seen[$3] = 1
        print
    }
}' textfile.txt
Output:
Value01,Value03,User1
Value01,Value03,User2
Value01,Value03,User3
Value01,Value03,User4
same thing in Perl:
perl -F, -nae 'print unless $c{$F[2]}++;' textfile.txt
this uses autosplit mode: "-F, -a" splits on commas and places the result into the @F array, so $F[2] is the third field (the UserX value)