Perl: append numbers from one file to strings of second file

I would like to append numbers attached to (Unicode) strings in one file, to matched strings in a second file. Somehow I can't wrap my head around how to do this. Here is what my two files look like.
File 1:
दौरा, 2
प्रोत्साहन, 1
प्रगति, 4
File 2:
दौरा
dorA
प्रोत्साहन
prua2ts3Ahan
prua2ts2Ahan
prua2tsAhan
prua2t2s3Ahan
prua2t2s2Ahan
prua2t2sAhan
prOts3Ahan
prOts2Ahan
prOtsAhan
prOt2s3Ahan
prOt2s2Ahan
prOt2sAhan
प्रगति
praGat2I
praGatI
pragat2I
pragatI
The desired result would look like this:
Output:
dorA, 2
prua2ts3Ahan, 1
prua2ts2Ahan, 1
prua2tsAhan, 1
prua2t2s3Ahan, 1
prua2t2s2Ahan, 1
prua2t2sAhan, 1
prOts3Ahan, 1
prOts2Ahan, 1
prOtsAhan, 1
prOt2s3Ahan, 1
prOt2s2Ahan, 1
prOt2sAhan, 1
praGat2I, 4
praGatI, 4
pragat2I, 4
pragatI, 4
I have a hash created from File 1 that has the strings as keys and the numbers as values. Now it's a matter of matching these keys in File 2, collecting the lines that follow each match, and appending the values to those lines. Can someone point me in the right direction?

Your description of the solution is correct. Now just translate it to code:
#!/usr/bin/perl
use warnings;
use strict;

# Build a lookup table from file.1: Unicode headword => number.
my %hash;
open my $F1, '<:encoding(UTF-8)', 'file.1' or die $!;
while (<$F1>) {
    chomp;
    my ($word, $num) = split /, /;
    $hash{$word} = $num;
}

# Walk file.2: remember the most recent headword and append its
# number to every transliteration line that follows it.
open my $F2, '<:encoding(UTF-8)', 'file.2' or die $!;
my $word;
while (<$F2>) {
    chomp;
    if (exists $hash{$_}) {
        $word = $_;                     # a headword: remember it, print nothing
    } elsif ($_) {
        print "$_, $hash{$word}\n";     # a transliteration: append the number
    } else {
        print "\n";                     # preserve blank lines
    }
}


parsing table in perl

Suppose I have the following table in text format (on Linux).
index score rank mark
0 100 0 4
1 873 23 89
2 500 15 90
3 224 30 115
and so on (more rows and columns).
#!/usr/bin/perl
open ($fh, '<', "data.txt")
    or die "Couldn't open file file.txt, $!";
while ($line = <$fh>) {
    chomp $line;
    @field = split "\t", $line;
    print "@field\n";
}
close $fh;
What I need is to delete some instances and paste them into another file, like:
a) delete all instances with rank <= 15 (less than or equal) and paste them into another file
b) compare rank & mark; if the difference is > 10, then delete those and paste them into another file.
Please help.
Thank you.
Not an answer but some annotations, which do not fit into a comment.
Use strict and warnings, especially if you are just starting to learn Perl.
Declare variables with my.
Don't declare variables you do not need ($line).
In Perl you can declare variables in list form.
If you do not chomp the input, you can output it unmodified.
Splitting with a regular expression is more robust than relying on a tab.
Example:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    if (/^index/) { print; next; }
    my ($index, $score, $rank, $mark) = split /\s+/;
    print unless int($rank) < 15;
}
__DATA__
index score rank mark
0 100 0 4
1 873 23 89
2 500 15 90
3 224 30 115
Output:
index score rank mark
1 873 23 89
2 500 15 90
3 224 30 115
The DATA change is just for a minimal reproducible example. Keep your file handling.
Start by adding some abstractions:
use strict;
use warnings;
use autodie;          # Will crash the program if open, close, or print fails.
use feature 'say';    # needed for say()

sub read_line {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;    # end of file: return an empty list
    chomp $line;
    return split " ", $line;
}

sub write_line {
    my ($fh, @fields) = @_;
    say $fh join("\t", @fields);
}
Now you can open the various files for reading/writing and copy the header into output files:
open(my $input, "<", "data.txt");
open(my $default_out, ">", "data_new.txt");
open(my $low_rank, ">", "lowrank.txt");
open(my $large_diff, ">", "large_diff.txt");

my @header = read_line($input);
write_line($default_out, @header);
write_line($low_rank, @header);
write_line($large_diff, @header);
The main loop is now just a matter of reading lines (making sure to stop when read_line returns an empty list at end of file) and following the rules:
while (my @fields = read_line($input)) {
    my ($index, $score, $rank, $mark) = @fields;
    if ($rank <= 15) {
        write_line($low_rank, @fields);
    } elsif (...) { # fill in the condition for large_diff yourself
        write_line($large_diff, @fields);
    } else {
        write_line($default_out, @fields);
    }
}

Perl/Linux filtering large file with content of another file

I'm filtering a 580 MB file using the content of another smaller file.
File1 (smaller file)
chr start End
1 123 150
2 245 320
2 450 600
File2 (large file)
chr pos RS ID A B C D E F
1 124 r2 3 s 4 s 2 s 2
1 165 r6 4 t 2 k 1 r 2
2 455 t2 4 2 4 t 3 w 3
3 234 r4 2 5 w 4 t 2 4
I would like to capture lines from File2 if the following criteria are met.
File2.Chr == File1.Chr && File2.Pos > File1.Start && File2.Pos < File1.End
I've tried using awk, but it runs very slowly; is there a better way to accomplish this?
Thank you.
Here is the code that I’m using:
#!/usr/bin/perl -w
use strict;
use warnings;
my $bed_file = "/data/1000G/Hotspots.bed";#File1 smaller file
my $SNP_file = "/data/1000G/SNP_file.txt";#File2 larger file
my $final_file = "/data/1000G/final_file.txt"; #final output file
open my $in_fh, '<', $bed_file
or die qq{Unable to open "$bed_file" for input: $!};
while ( <$in_fh> ) {
    my $line_str = $_;
    my @data = split(/\t/, $line_str);
    next if /\b(?:track)\b/;  # skip header line
    my $chr = $data[0]; $chr =~ s/chr//g; print "chr is $chr\n";
    my $start = $data[1]-1; print "start is $start\n";
    my $end = $data[2]+1; print "end is $end\n";
    my $cmd1 = "awk '{if(\$1==chr && \$2>$start && \$2</$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file"; print "cmd1\n";
    my $cmd2 = `awk '{if(\$1==chr && \$2>$start && \$2</$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file`; print "cmd2\n";
}
Read the small file into a data structure and check every line of the other file against it.
Here I read it into an array, each element being an arrayref with fields from a line. Then each line of the data file is checked against the arrayrefs in this array, comparing fields per requirements.
use warnings 'all';
use strict;

my $ref_file = 'reference.txt';
open my $fh, '<', $ref_file or die "Can't open $ref_file: $!";
my @ref = map { chomp; [ split ] } grep { /\S/ } <$fh>;

my $data_file = 'data.txt';
open $fh, '<', $data_file or die "Can't open $data_file: $!";

# Drop header lines
my $ref_header = shift @ref;
my $data_header = <$fh>;

while (<$fh>)
{
    next if not /\S/;   # skip empty lines
    my @line = split;

    foreach my $refline (@ref)
    {
        next if $line[0] != $refline->[0];

        if ($line[1] > $refline->[1] and $line[1] < $refline->[2]) {
            print "@line\n";
        }
    }
}
close $fh;
This prints the correct lines for the provided samples. It allows multiple reference lines to match a given data line. If that should never happen, add last in the if block to exit the foreach once a match is found.
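A minimal sketch of that variant, changing only the inner if block of the code above:
if ($line[1] > $refline->[1] and $line[1] < $refline->[2]) {
    print "@line\n";
    last;    # stop checking further reference lines for this data line
}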
A few comments on the code. Let me know if more detail would be useful.
When reading the reference file, <$fh> is used in list context so it returns all lines, and grep filters out the empty ones. The map first chomps the newline and then makes an arrayref by [ ], with elements being the fields on the line obtained by split. The output list is assigned to @ref.
When we reuse $fh it is first closed (if it was open) so there is no need for a close.
I store the header lines just so, perhaps to print or check. We really only need to exclude them.
Another way, this time storing the smaller file in a Hash of Arrays (HoA) based on the 'chr' field:
use strict;
use warnings;

my $small_file = 'small.txt';
my $large_file = 'large.txt';

open my $small_fh, '<', $small_file or die $!;

my %small;
while (<$small_fh>){
    next if $. == 1;
    my ($chr, $start, $end) = split /\s+/, $_;
    push @{ $small{$chr} }, [$start, $end];
}
close $small_fh;

open my $large_fh, '<', $large_file or die $!;

while (my $line = <$large_fh>){
    my ($chr, $pos) = (split /\s+/, $line)[0, 1];
    if (defined $small{$chr}){
        for (@{ $small{$chr} }){
            if ($pos > $_->[0] && $pos < $_->[1]){
                print $line;
            }
        }
    }
}
Put them into a SQLite database and do a join. This will be much faster, less buggy, and use less memory than trying to write something yourself. It is also more flexible: you can just run SQL queries on the data instead of writing a new script and reparsing the files each time.
You can import them by parsing and inserting yourself, or you can convert them to CSV and use SQLite's CSV import ability. Converting to CSV with data this simple can be as easy as s{ +}{,}g, or you can use the full-blown and very fast Text::CSV_XS.
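For instance, a rough sketch of that quick-and-dirty conversion as Perl one-liners, assuming the inputs are single-space separated and named file1.txt and file2.txt (placeholder names); the resulting CSV files can then be loaded from the sqlite3 shell with .mode csv and .import:
# Drop the header line and turn runs of spaces into commas.
perl -ne 'next if $. == 1; s{ +}{,}g; print' file1.txt > file1.csv
perl -ne 'next if $. == 1; s{ +}{,}g; print' file2.txt > file2.csv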
Your tables look like this (you'll want to use better names for the tables and fields).
create table file1 (
    chr integer not null,
    start integer not null,
    end integer not null
);

create table file2 (
    chr integer not null,
    pos integer not null,
    rs integer not null,
    id integer not null,
    a char not null,
    b char not null,
    c char not null,
    d char not null,
    e char not null,
    f char not null
);
Create some indexes on the columns you'll be searching on. Indexes will slow down the import, so make sure you do this after the import.
create index chr_file1 on file1 (chr);
create index chr_file2 on file2 (chr);
create index pos_file2 on file2 (pos);
create index start_file1 on file1 (start);
create index end_file1 on file1 (end);
And do the join.
select *
from file2
join file1 on file1.chr == file2.chr
where file2.pos between file1.start and file1.end;
1,124,r2,3,s,4,s,2,s,2,1,123,150
2,455,t2,4,2,4,t,3,w,3,2,450,600
You can do this in Perl via DBI and the DBD::SQLite driver.
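For illustration, here is a minimal sketch of that route using DBI with DBD::SQLite, assuming the two tables above already exist and are populated in a database file called genome.db (the file name is just a placeholder):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the (assumed) SQLite database file.
my $dbh = DBI->connect("dbi:SQLite:dbname=genome.db", "", "",
                       { RaiseError => 1 });

# Run the same join as above.
my $sth = $dbh->prepare(q{
    select *
    from file2
    join file1 on file1.chr == file2.chr
    where file2.pos between file1.start and file1.end
});
$sth->execute;

# Print each joined row as CSV, matching the output shown above.
while (my @row = $sth->fetchrow_array) {
    print join(",", @row), "\n";
}
$dbh->disconnect;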
As said before, calling awk at each iteration is very slow. A full awk solution would be possible; since a Perl solution has already been posted, here is a Python one, hoping the OP won't mind:
create a dictionary from the small file: chr => list of (start, end) pairs
iterate through the big file, match on chr, and test whether the position lies between one of the start/end pairs.
Code:
with open("smallfile.txt") as f:
next(f) # skip title
# build a dictionary with chr as key, and list of start,end as values
d = collections.defaultdict(list)
for line in f:
toks = line.split()
if len(toks)==3:
d[int(toks[0])].append((int(toks[1]),int(toks[2])))
with open("largefile.txt") as f:
next(f) # skip title
for line in f:
toks = line.split()
chr_tok = int(toks[0])
if chr_tok in d:
# key is in dictionary
pos = int(toks[1])
if any(lambda x : t[0]<pos<t[1] for t in d[chr_tok]):
print(line.strip())
We could be slightly faster by sorting the list of tuples and applying bisect to avoid the linear search. That is only necessary if the list of tuples in the "small" file is big.
awk power with a single pass. Your code iterates over file2 as many times as there are lines in file1, so the execution time grows linearly. Please let me know if this single-pass solution is slower than the other solutions.
awk 'NR==FNR {
    i = b[$1];            # get the next index for the chr
    a[$1][i][0] = $2;     # store start
    a[$1][i][1] = $3;     # store end
    b[$1]++;              # increment the next index
    next;
}
{
    p = 0;
    if ($1 in a) {
        for (i in a[$1]) {
            if ($2 > a[$1][i][0] && \
                $2 < a[$1][i][1])
                p = 1     # set p if $2 in range
        }
    }
}
p {print}'
One-Liner
awk 'NR==FNR {i = b[$1];a[$1][i][0] = $2; a[$1][i][1] = $3; b[$1]++;next; }{p = 0;if ($1 in a){for(i in a[$1]){if($2>a[$1][i][0] && $2<a[$1][i][1])p=1}}}p' file1 file2

How to create new string by replacing with string from another file in awk/linux

I have two files as shown below:
FileA:
AGCTTTTCATTC...
FileB:
POS ID REF ALT
2 . G C
8 . C -
12 . - T
The POS column in FileB gives the string position in FileA: i.e., 2 under POS means the second letter "G" in the sequence in FileA.
Based on the POS column in FileB, FileA should be modified into FileA' and FileB' as shown below.
FileA' is formed by inserting the string under the REF column at the respective positions from the POS column.
FileB' is formed by replacing the string under the ALT column at the respective positions from the POS column.
FileA':
AGCTTTTCATT-TC...
FileB':
ACCTTTT-ATTTTC...
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
open my $DNA, '<', 'FileA' or die $!;
open my $POS, '<', 'FileB' or die $!;
my $dna = <$DNA>;
my $ref_out = $dna;
my $alt_out = $dna;
while (<$POS>) {
    my ($pos, $id, $ref, $alt) = split;
    next unless $pos =~ /^\d+$/;  # Skip the header.
    substr $ref_out, $pos - 1, 1, $ref;
    substr $alt_out, $pos - 1, 1, $alt;
}
open my $REF, '>', "FileA'" or die $!;
print {$REF} $ref_out;
close $REF;
open my $ALT, '>', "FileB'" or die $!;
print {$ALT} $alt_out;
close $ALT;
You just keep two copies of the input string and modify it via substr for each line of FileB.
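For readers unfamiliar with the four-argument form of substr used above, here is a tiny standalone illustration (the string and replacement are made up):
my $seq = "AGCTTTTCATTC";
my $old = substr $seq, 1, 1, "X";   # replace the 2nd character, return the old one
print "$old $seq\n";                # prints: G AXCTTTTCATTC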

Perl cutting string

I have an array of numbers and a file of strings that look like the below, and I have written a piece of Perl code named "string cutter". I can get the cut strings, but I can't get the first "n" characters specified by the array of numbers. Any idea? I do not know why substr does not work.
string Cutter
file 1:
1234567890
0987654321
1234546789
ABCDEFGHIJ
JIHGFEDCBA
file 2: array of given length
2, 3, 4, 2, 1
Current Result:
34567890
7654321
546789
CDEFGHIJ
IHGFEDCBA
Expected result (perhaps \t delimited):
12 34567890
098 7654321
1234 546789
AB CDEFGHIJ
J IHGFEDCBA
My code:
#!/usr/bin/perl
use warnings;
use strict;

if (@ARGV != 2) {
    die "Invalid usage\n"
      . "Usage: perl program.pl [num_list] [string_file]\n";
}
my ($number_f, $string_f) = @ARGV;

open my $LIST, '<', $number_f or die "Cannot open $number_f: $!";
my @numbers = split /, */, <$LIST>;
close $LIST;

open my $DATA, '<', $string_f or die "Cannot open $string_f: $!";
while (my $string = <$DATA>) {
    substr $string, 0, shift @numbers, q();  # Replace the first n characters with an empty string.
    print $string;
}
Many thanks
perldoc -f substr:
Extracts a substring out of EXPR and returns it
So you should do it this way:
my $prefix = substr $string, 0, shift @numbers, q();
print $prefix . " " . $string;
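Put back into the original loop, the relevant part would look roughly like this (a sketch, using a tab as the separator since the question mentions "perhaps \t delimited"):
while (my $string = <$DATA>) {
    # 4-argument substr removes the first n characters and returns them.
    my $prefix = substr $string, 0, shift @numbers, q();
    print "$prefix\t$string";
}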

unix - breakdown of how many records in file with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many records (as defined by a specific EOL such as "^%!") had how many occurrences of a specific delimiter, sorted descending by the number of occurrences.
For example, with this sample file:
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
Suggested input: delimiter, EOL, and filename as arguments.
bash/perl some_script_name ",|" "^%!" samplefile
Desired output:
occs count
3 1
2 1
1 2
0 2
This is because the 1st record had one delimiter, 2nd record had 2, 3rd record had 0, 4th record had 3, 5th record had 1, 6th record had 0.
Bonus pts if you can make the delimiter and EOL argument accept hex input (ie 2C7C) or normal character input (ie ,|) .
Script:
#!/usr/bin/perl
use strict;
use warnings;

$/ = $ARGV[1];                      # records are terminated by the EOL string
open my $fh, '<', $ARGV[2] or die $!;
my @records = <$fh> and close $fh;

my $delim = qr/\Q$ARGV[0]\E/;
my %counts;
for (@records) {
    chomp;                          # strip the record separator
    next unless /\S/;               # ignore the empty trailing record, if any
    my $occs = () = /$delim/g;      # count delimiter occurrences
    $counts{$occs}++;
}
print "$_\t$counts{$_}\n" for sort { $b <=> $a } keys %counts;
Test:
perl script.pl ',|' '^%!' samplefile
Output:
3 1
2 1
1 2
0 2
This is what perl lives for:
#!perl -w
use 5.12.0;

my ($delim, $eol, $file) = @ARGV;
open my $fh, "<$file" or die "error opening $file $!";
$/ = $eol; # input record separator

my %counts;
while (<$fh>) {
    my $matches = () = $_ =~ /(\Q$delim\E)/g; # "goatse" operator
    $counts{$matches}++;
}

say "occs\tcount";
foreach my $num (reverse sort keys %counts) {
    say "$num\t$counts{$num}";
}
(if you haven't got 5.12, remove the "use 5.12" line and replace the say with print)
A solution in awk:
BEGIN {
    RS = "\\^%!"
    FS = ",\\|"
    max_occ = 0
}
{
    if (match($0, "^ *$")) { # This is here to deal with the final separator.
        next
    }
    if (NF - 1 > max_occ) {
        max_occ = NF - 1
    }
    count[NF - 1] = count[NF - 1] + 1
}
END {
    printf("occs count\n")
    for (i = 0; i <= max_occ; i++) {
        printf("%s %s\n", i, count[i])
    }
}
Well, there's one more empty record at the end of the file, which has 0. So, here's a script to do what you wanted. Adding headers and otherwise tweaking the printf output is left as an exercise for you. :)
Basically, read the whole file in, split it into records, and for each record use a /g regex to count the sub-delimiters. Since /g returns a list of all matches, use @{[ ]} to make an arrayref and then deref that in scalar context to get a count. There has to be a more elegant solution to that particular part of the problem, but whatever; it's perl line noise. ;) (A tidier counting idiom is sketched after the listing below.)
user@host[/home/user]
$ ./test.pl ',|' '^%!' test.in
3 1
2 1
1 2
0 3
user@host[/home/user]
$ cat test.in
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
user@host[/home/user]
$ cat test.pl
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
$delim = quotemeta $delim;
$subdelim = quotemeta $subdelim;

my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
foreach( split(/$delim/, join(q{}, <F>)) ){
    $counts{ scalar(@{[ m/.*?($subdelim)/g ]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
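As hinted above, there is a tidier way to get that count than building a throwaway arrayref with @{[ ]}: assign the match list to an empty list and take the assignment in scalar context (the "goatse" idiom used in an earlier answer). A minimal sketch of how the counting line inside the foreach could read instead:
my $occs = () = m/$subdelim/g;   # occurrences of the (already quotemeta'd) sub-delimiter in $_
$counts{$occs}++;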
And here's a modified version of the script above which only keeps records that contain at least one non-space character. That removes the last, empty record, but also has the consequence of removing any other empty records. It also uses $/ and \Q\E to save a couple of explicit function calls (thanks, Alex). And, like the previous one, it works with strict + warnings.
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/ = $delim;

my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
foreach ( grep(/\S/, <F>) ){
    $counts{ scalar(@{[ m/.*?(\Q$subdelim\E)/g ]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
If you really only want to remove the last record unconditionally, I'm partial to using pop:
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/ = $delim;

my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
my @lines = <F>;
pop @lines;
$counts{ scalar(@{[ m/.*?(\Q$subdelim\E)/g ]}) }++ foreach (@lines);
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
