Print data from the second line onwards to another file using a loop - linux

I have a file whose contents I want to print to another file, except for the first line.
The data in list.txt is:
Vik
Ram
Raj
Pooja
OFA
JAL
The output should go into a new file, fd.txt, containing everything except the first line 'Vik':
Ram
Raj
Pooja
OFA
JAL
My code, which is not working:
find $_filepath -type d > list.txt
for i in 2 3 4 5 .. N
do
echo $i
done<list.txt >>fd.txt

tail -n +2 outputs all lines starting from the second one.
from https://superuser.com/questions/1071448/tail-head-all-line-except-x-last-first-lines
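As a sketch, using the file names from the question, the loop can be replaced entirely by tail:
find "$_filepath" -type d > list.txt   # unchanged from the question
tail -n +2 list.txt >> fd.txt          # append everything except the first line to fd.txt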

Related

Count the number of times a substring appears in a file and place it in a new column

Question:
I have 2 files, file 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC
File 2 is a FASTA file (hg19.fasta) that looks like this. Although it breaks across lines, each continuous run of A, C, G and T's represents one continuous string (i.e. a chromosome). This file is the entire human reference genome build 19, so a > header followed by sequence lines essentially occurs 23 times, once for each of the 23 chromosomes:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC
>2 dna:chromosome chromosome:GRCh37:1:1:2492300:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
I want to 1) find out how many times each 23 bp sequence appears in the second file, without overlapping any others and including sequences that break across lines, and 2) append this number as a new column next to the sequence, so the new file looks like this:
Desired Output:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
My attempt:
I imagine solving the first part will be some variation on grep, so far I've managed:
grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""
which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!
Any help is appreciated as always! =)
Updates and Edits:
The actual size of these files is relatively massive (30 MB or ~500,000 lines for each tsv/BED file) and the FASTA file is the entire human reference genome build 19, which is ~60,000,000 lines. The Perl solution proposed by @choroba works, but doesn't scale well to these sizes.
Unfortunately, because of the need to identify matches across lines, the awk and bash/grep solutions mentioned below won't work.
I want multiple non-overlapping hits in the same chromosome to count as the actual number of hits. I.e. If you search for a sequence and get 2 hits in a single chromosome and 1 in another chromosome, the total count should be 3.
Ted Lyngmo is very kindly helping me develop a solution in C++ that allows this to be run in a realistic timeframe; there's more detail in his post in this thread, and the link to the GitHub for this is here =)
If the second file is significantly bigger than the first one, I would try this awk script:
awk 'v==1 {a[$7];next}             # Get the patterns from the first file into the array a
     v==2 {                        # For each line of the second file
         for(i in a){              # Loop through all patterns
             a[i]+=split($0,b,i)-1 # Accumulate the number of pattern matches on this line
         }
     }
     v==3 {print $0,a[$7]}         # Re-read the first file to add the number of pattern matches
' v=1 file1 v=2 file2 v=3 file1
I'd reach for a programming language like Perl.
#!/usr/bin/perl
use warnings;
use strict;
my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed, '<', $bed_file or die "$bed_file: $!";
my $seq;
while (<$fasta>) {
    $seq .= "\n", next if /^>/;
    chomp;
    $seq .= $_;
}
while (<$bed>) {
    chomp;
    my $short_seq = (split /\t/, $_)[-1];
    my $count = () = $seq =~ /\Q$short_seq\E/g;
    print "$_\t$count\n";
}
To count overlapping sequences, change the regex to a lookahead.
my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;
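Assuming the script above is saved as count_matches.pl (a name chosen here just for illustration), it takes the FASTA file first and the BED file second, and the output can be redirected to a new file:
perl count_matches.pl hg19.fasta file1.bed > file1_with_counts.bed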
Since grep -c seems to give you the correct count (matching lines, not counting multiple occurrences on the same line) you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:
#!/bin/bash
# read the fields into the array `v`:
while read -ra v
do
    # print the 7 first elements in the array + the output from `grep -c`:
    echo "${v[@]:0:7}" "$(grep -Fc "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile
outfile will now contain
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
Benchmarks
This table compares the three solutions presented as answers here, with timings to finish different numbers of tsv/bed records against the full hg19.fa file (excluding the records containing only N's). hg19.fa contains 57,946,726 such records. As a baseline I've used two versions of a C++ program (called hgsearch/hgsearchmm). hgsearch reads the whole hg19.fa file into memory and then searches it in parallel. hgsearchmm uses a memory-mapped file instead and then searches that (also in parallel).
search \ beds    1           2            100          1000         10000
awk              1m0.606s    17m19.899s   -            -            -
perl             13.263s     15.618s      4m48.751s    48m27.267s   -
bash/grep        2.088s      3.670s       3m27.378s    34m41.129s   -
hgsearch         8.776s      9.425s       30.619s      3m56.508s    38m43.984s
hgsearchmm       1.942s      2.146s       21.715s      3m28.265s    34m56.783s
The tests were run on an Intel Core i9 (12 cores / 24 hardware threads) in WSL/Ubuntu 20.04 (SSD disk).
The sources for the scripts and baseline programs used can be found here

How to save different outputs into two columns of the same file in bash

I am working on a project that requires me to take some .bed files as input, extract one column from each file, keep only values meeting a certain threshold and count how many of them there are for each file. I am extremely inexperienced with bash so I don't know most of the commands, but this line of code should do the trick:
for FILE in *; do cat $FILE | awk '$9>1.3'| wc -l ; done>/home/parallels/Desktop/EP_Cell_Type.xls
I saved those values in a .xls since I need to do some graphs with them.
Now I would like to take the filenames with ls and save them in the first column of my .xls, while my values should be in the 2nd column of my Excel file.
I managed to save everything in one column with the command:
ls>/home/parallels/Desktop/EP_Cell_Type.xls | for FILE in *; do cat $FILE | awk '$9>1.3'-x| wc -l ; done >>/home/parallels/Desktop/EP_Cell_Type.xls
My sample files are: A549.bed, GM12878.bed, H1.bed, HeLa-S3.bed, HepG2.bed, Ishikawa.bed, K562.bed, MCF-7.bed, SK-N-SH.bed, and they are contained in a folder with those files only.
The output is the list of all filenames and the values all in the same column, like this:
Column 1
A549.bed
GM12878.bed
H1.bed
HeLa-S3.bed
HepG2.bed
Ishikawa.bed
K562.bed
MCF-7.bed
SK-N-SH.bed
4536
8846
6754
14880
25440
14905
22721
8760
28286
but what I need should be something like this:
Filenames      #BS
A549.bed       4536
GM12878.bed    8846
H1.bed         6754
HeLa-S3.bed    14880
HepG2.bed      25440
Ishikawa.bed   14905
K562.bed       22721
MCF-7.bed      8760
SK-N-SH.bed    28286
Assuming OP's awk program (correctly) finds all of the desired rows, an easier (and faster) solution can be written completely in awk.
One awk solution that keeps track of the number of matching rows and then prints the filename and line count:
awk '
FNR==1 { if ( count >= 1 )                      # first line of new file? if line counter > 0
             printf "%s\t%d\n", prevFN, count   # then print previous FILENAME + tab + line count
         count=0                                # then reset our line counter
         prevFN=FILENAME                        # and save the current FILENAME for later printing
       }
$9>1.3 { count++ }                              # if field #9 > 1.3 then increment line counter
END    { if ( count >= 1 )                      # flush last FILENAME/line counter to stdout
             printf "%s\t%d\n", prevFN, count
       }
' *                                             # * ==> pass all files as input to awk
For testing purposes I replaced $9>1.3 with /do/ (match any line containing the string 'do') and ran against a directory containing an assortment of scripts and data files. This generated the following tab-delimited output:
bigfile.txt 7
blocker_tree.sql 4
git.bash 2
hist.bash 4
host.bash 2
lines.awk 2
local.sh 3
multi_file.awk 2
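To produce the real two-column result in the spreadsheet file from the question, the same program could be saved to a file (countpeaks.awk is only an illustrative name) and its tab-delimited output redirected:
awk -f countpeaks.awk *.bed > /home/parallels/Desktop/EP_Cell_Type.xls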

Extract information (subset) from a main file using a list of identifiers saved in another file

I have one file containing a list of names (referred to as file 1):
Apple
Bat
Cat
I have another file (referred to as file 2) containing names and details:
Apple bla blaa
aaaaaaaaaggggggggggttttttsssssssvvvvvvv
ssssssssiiuuuuuuuuuueeeeeeeeeeennnnnnnn
sdasasssssssssssssssssssssswwwwwwwwwwww
Aeroplane dsafgeq dasfqw dafsad
vvvvvvvvvvvvvvvvuuuuuuuuuuuuuuuuuuuuuus
fcsadssssssssssssssssssssssssssssssssss
ddddddddddddddddwwwwwwwwwwwwwwwwwwwwwww
sdddddddddddddddddddddddddddddwwwwwwwww
Bat sdasdas dsadw dasd
sssssssssssssssssssssssssssssssssssswww
ssssssssssssssssswwwwwwwwwwwwwwwwwwwwwf
aaaaaaaaaawwwwwwwwwwwwwwwwwwwwwwddddddd
sadddddddddddddddddd
Cat dsafw fasdsa dawwdwaw
sssssssssssssssssssssssssssssssssssssss
wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwssss
I need to extract info out of file 2 using the list of names in file 1.
The output file should be something like below:
Apple bla blaa
aaaaaaaaaggggggggggttttttsssssssvvvvvvv
ssssssssiiuuuuuuuuuueeeeeeeeeeennnnnnnn
sdasasssssssssssssssssssssswwwwwwwwwwww
Bat sdasdas dsadw dasd
sssssssssssssssssssssssssssssssssssswww
ssssssssssssssssswwwwwwwwwwwwwwwwwwwwwf
aaaaaaaaaawwwwwwwwwwwwwwwwwwwwwwddddddd
sadddddddddddddddddd
Cat dsafw fasdsa dawwdwaw
sssssssssssssssssssssssssssssssssssssss
wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwssss
Are there any commands for doing this using Linux (Ubuntu)? I am a new Linux user.
This might work for you (GNU sed):
sed 's#.*#/^&/bb#' file1 |
sed -e ':a' -f - -e 'd;:b;n;/^[A-Z]/!bb;ba' file2
Generate a string of sed commands from the first file and pipe them into another sed script which is run against the second file.
The first sed creates a regexp for each line of file1 which, when matched, jumps to a piece of common code. If none of the regexps are matched, the lines are deleted. If a regexp is matched, further lines are printed until a new delimiter (a line starting with a capital letter) is found, at which point the code jumps back to the start and the process is repeated.
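For the sample file1 above, the first sed would generate a script along these lines, which the second sed then runs against file2:
/^Apple/bb
/^Bat/bb
/^Cat/bb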
$ awk 'NR==FNR{a[$1];next} NF>1{f=($1 in a)} f' file1 file2
Apple bla blaa
aaaaaaaaaggggggggggttttttsssssssvvvvvvv
ssssssssiiuuuuuuuuuueeeeeeeeeeennnnnnnn
sdasasssssssssssssssssssssswwwwwwwwwwww
Bat sdasdas dsadw dasd
sssssssssssssssssssssssssssssssssssswww
ssssssssssssssssswwwwwwwwwwwwwwwwwwwwwf
aaaaaaaaaawwwwwwwwwwwwwwwwwwwwwwddddddd
sadddddddddddddddddd
Cat dsafw fasdsa dawwdwaw
sssssssssssssssssssssssssssssssssssssss
wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwssss
Provided that each section is separated by an empty line, this solution using a shell loop and awk works OK:
while read -r pat; do
    pat="^\\\<${pat}\\\>"
    awk -vpattern=$pat '$0 ~ pattern{p=1}$0 ~ /^$/{p=0}p==1' file2
done <file1
For this solution to work, the file needs to look like this:
Apple bla blaa
1 aaaaaaaaaggggggggggttttttsssssssvvvvvvv
2 ssssssssiiuuuuuuuuuueeeeeeeeeeennnnnnnn
3 sdasasssssssssssssssssssssswwwwwwwwwwww
Aeroplane dsafgeq dasfqw dafsad
4 vvvvvvvvvvvvvvvvuuuuuuuuuuuuuuuuuuuuuus
5 fcsadssssssssssssssssssssssssssssssssss
6 ddddddddddddddddwwwwwwwwwwwwwwwwwwwwwww
7 sdddddddddddddddddddddddddddddwwwwwwwww
Bat sdasdas dsadw dasd
8 sssssssssssssssssssssssssssssssssssswww
9 ssssssssssssssssswwwwwwwwwwwwwwwwwwwwwf
10 aaaaaaaaaawwwwwwwwwwwwwwwwwwwwwwddddddd
11 sadddddddddddddddddd
Cat dsafw fasdsa dawwdwaw
12 sssssssssssssssssssssssssssssssssssssss
13 wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwssss
PS: Numbering has been applied by me in order to be able to "check" that awk will return the correct results per section. Numbering is not required in your real file.
If there are no empty lines separating each section, then it is much harder to achieve the correct result.

Need help with a vlookup in Linux using awk

I have two data files. One has 1,600 rows and the other has 2 million rows (both are tab-delimited files). I need to do a vlookup between these two files. Please see the example below for the expected output and kindly let me know if it's possible. I've tried using awk, but couldn't get the expected result.
File 1(small file)
BC1 10 100
BC2 20 200
BC3 30 300
File 2(large file)
BC1 XYZ
BC2 ABC
BC3 DEF
Expected Output:
BC1 10 100 XYZ
BC2 20 200 ABC
BC3 30 300 DEF
I also tried the join command. It is taking forever to complete. Please help me find a solution. Thanks
Commands for your output:
awk '{print $1}' *file | sort | uniq -d > out.txt
for i in $(cat out.txt)
do
grep "$i" large_file >> temp.txt
done
sort -g -k1,1 temp.txt > out1.txt
sort -g -k1,1 out.txt > out2.txt
paste out1.txt out2.txt | awk '{print $1 $2 $3 $5}'
Commands for Vlookup
Store 1st and 2nd column in file1 file2 respectively
cat file1 file2 | sort | uniq -d ### for records which are present in both files
cat file1 file2 | sort | uniq -u ### for records which are unique and not present in bulk file
This awk script will scan each file line by line and try to match the number in the BC column. Once matched, it will print all the columns.
If one of the files does not contain one of the numbers, that number will be skipped in both files and the script will search for the next one. It will loop until one of the files ends.
The script also accepts any number of columns per file and any number of files, as long as the first column is BC followed by a number.
This awk script assumes that the files are ordered by ascending number in the BC column (as in your example). Otherwise it will not work.
To execute the script, run this command:
awk -f vlookup.awk smallfile bigfile
The vlookup.awk file will have this content:
BEGIN {
    files=1;lines=0;maxlines=0;filelines[1]=0;
    #Number of columns for SoD, PRN, reference file
    col_bc=1;
    #Initialize variables
    bc_now=0;
    new_bc=0;
    end_of_process=0;
    aux="";
    text_result="";
}
{
    if(FILENAME!=ARGV[1])exit;
    no_bc=0;
    new_bc=0;
    #Save number of columns
    NFields[1]=NF;
    #Copy reference file data
    for(j=0;j<=NF;j++)
    {
        file[1,j]=$j;
    }
    #Read lines from the other files
    for(i=2;i<ARGC;i++)
    {
        ret=getline < ARGV[i];
        if(ret==0) exit; #END OF FILE reached
        #Copy columns to file variable
        for(j=0;j<=NF;j++)
        {
            file[i,j]=$j;
        }
        #Save number of columns
        NFields[i]=NF;
    }
    #Check whether all files are at the same number
    for(i=1;i<ARGC;i++)
    {
        bc[i]=file[i,col_bc];
        bc[i]=sub("BC","",file[i,col_bc]);
        if(bc[i]>bc_now) {bc_now=bc[i];new_bc=1;}
    }
    #One or more files have a new number
    if (new_bc==1)
    {
        for(i=1;i<ARGC;i++)
        {
            while(bc_now!=file[i,col_bc])
            {
                #Read next line from file
                if(i==1) ret=getline; #File 1 is the reference file
                else ret=getline < ARGV[i];
                if(ret==0) exit; #END OF FILE reached
                #Copy columns to file variable
                for(j=0;j<=NF;j++)
                {
                    file[i,j]=$j;
                }
                #Save number of columns
                NFields[i]=NF;
                #Check if the current file's data has gone past the current number
                if(file[i,col_bc]>bc_now)
                {
                    no_bc=1;
                    break;
                }
                #No more data lines to compare, end of comparison
                if(FILENAME!=ARGV[1])
                {
                    exit;
                }
            }
            #If the number is not in a file, the realignment process must be restarted at the next available number (exit the for loop)
            if (no_bc==1) {break;}
        }
        #If the number is not in a file, the realignment process must be restarted at the next available number (continue the while loop)
        if (no_bc==1) {next;}
    }
    #Number is aligned
    for(i=1;i<ARGC;i++)
    {
        for(j=2;j<=NFields[i];j++) {
            #Join columns in the text_result variable
            aux=sprintf("%s %s",text_result,file[i,j]);
            text_result=sprintf("%s",aux);
        }
    }
    printf("BC%d%s\n",bc_now,text_result)
    #Reset text variables
    aux="";
    text_result="";
}
I also tried the join command. It is taking forever to complete.
Please help me find a solution.
It's improbable that you'll find a solution (scripted or not) that's faster than the compiled join command. If you can't wait for join to complete, you need more powerful hardware.
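For reference, join needs both inputs sorted on the join field; a minimal sketch for the sample files above (assuming the key is the first column of both files) would be:
sort -k1,1 file1 > file1.sorted
sort -k1,1 file2 > file2.sorted
join file1.sorted file2.sorted
With the sample data this prints the join key followed by the remaining columns of file1 and then file2, e.g. BC1 10 100 XYZ.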

Looping through unique lines in a txt file with Bash

I am looping through tab-delimited lines in a txt file. This txt file is the output of an xml/xslt process and has duplicates. Below I am looking for a solution with the txt file, but solutions using XSLT are just as appreciated. Please see example txt file.
txtfile.txt: line 3 is a duplicate of line 1
hello@example.com running 1111
puppy@kennel.com running 9876
hello@example.com running 1111
husky@siberia.com shutdown 1234
puppy@kennel.com running 9876
hello@example.com running 1111
My question is: Can duplicate lines be skipped in a loop so that the loop only processes unique lines? In this case, how to configure to loop lines 1, 2, 4 and skip lines 3, 5, 6?
My current working loop which reads duplicates:
while read name status num
do
    echo "<tag1>"
    echo "<tag2>"$name"</tag2>"
    echo "<tag3>"$status"</tag3>"
    echo "<tag2>"$num"</tag2>"
    echo "</tag1>"
done < txtfile.txt
In my txtfile there are hundreds of lines and nearly half are duplicates, so this is a huge problem for me! Any ideas/solutions appreciated. Thanks in Advance.
You can read that file via sort -u to eliminate duplicate lines:
sort -u /your/file | while read ...
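Combined with the loop from the question, that might look like this:
sort -u txtfile.txt | while read name status num
do
    echo "<tag1>"
    echo "<tag2>"$name"</tag2>"
    echo "<tag3>"$status"</tag3>"
    echo "<tag2>"$num"</tag2>"
    echo "</tag1>"
done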
I would suggest using awk:
$ awk '!a[$0]++{print "<tag1>\n<tag2>" $1 "</tag2>\n<tag3>" $2 "</tag3>\n<tag2>" $3 "</tag2>\n</tag1>"}' file
<tag1>
<tag2>hello@example.com</tag2>
<tag3>running</tag3>
<tag2>1111</tag2>
</tag1>
<tag1>
<tag2>puppy@kennel.com</tag2>
<tag3>running</tag3>
<tag2>9876</tag2>
</tag1>
<tag1>
<tag2>husky@siberia.com</tag2>
<tag3>shutdown</tag3>
<tag2>1234</tag2>
</tag1>
The condition !a[$0]++ evaluates to true the first time each line is seen and false thereafter. When the condition is true, the output is printed.
The basic principle is that the content of the line, $0, is used as a key in the array a. If there's a chance that the spacing may differ between records, you could use !a[$1,$2,$3]++ instead, which will count lines as being the same as long as the 3 fields are the same, regardless of the spacing between them.
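A sketch of that variant, where only the deduplication key changes:
awk '!a[$1,$2,$3]++{print "<tag1>\n<tag2>" $1 "</tag2>\n<tag3>" $2 "</tag3>\n<tag2>" $3 "</tag2>\n</tag1>"}' file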
