parsing table in perl [closed] - linux

Suppose I have the following table in text format (on Linux).
index score rank mark
0 100 0 4
1 873 23 89
2 500 15 90
3 224 30 115
and so on (more rows and columns).
#!/usr/bin/perl
open ($fh, '<', "data.txt")
or die "Couldn't open file file.txt, $!";
while ($line= <$fh>) {
chomp $line;
@field = split "\t",$line;
print "@field\n";
}
close $fh;
What I need is to delete some instances and write them to another file, like:
a) delete all instances with rank <= (less than or equal to) 15 and write them to another file
b) compare rank and mark; if the difference is > 10, then delete those rows and write them to another file.
Please help.
Thank you.

Not an answer, but some annotations which do not fit into a comment.
Use strict and warnings if you are learning Perl.
Declare variables with my.
Don't declare variables you do not need ($line).
In Perl you can declare variables in list form.
If you do not chomp the input you can output it unmodified.
Splitting with a regular expression is more robust than relying on a tab.
Example:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
if (/^index/) { print; next; }
my ($index, $score, $rank, $mark) = split /\s+/;
print unless int($rank) < 15;
}
__DATA__
index score rank mark
0 100 0 4
1 873 23 89
2 500 15 90
3 224 30 115
Output:
index score rank mark
1 873 23 89
2 500 15 90
3 224 30 115
The DATA change is just for a minimal reproducible example. Keep your file handling.
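To connect this back to requirement (a), here is a minimal sketch with real file handles that writes the removed rows to a second file. The file names data.txt and lowrank.txt are taken from the question, and the <= 15 cutoff follows the question's wording rather than the < 15 used above:
#!/usr/bin/perl
use strict;
use warnings;

open my $in,  '<', 'data.txt'    or die "Couldn't open data.txt: $!";
open my $out, '>', 'lowrank.txt' or die "Couldn't open lowrank.txt: $!";

while (my $line = <$in>) {
    next if $line =~ /^index/;                      # skip the header row
    my ($index, $score, $rank, $mark) = split ' ', $line;
    print $out $line if $rank <= 15;                # rows removed under rule (a)
}

close $in;
close $out;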

Start by adding some abstractions:
use strict;
use warnings;
use autodie; # Will crash the program if open, close, or print fails.
use feature 'say'; # needed for say() in write_line below
sub read_line {
my ($fh) = @_;
my $line = <$fh>;
return unless defined $line; # end of file: return an empty list
chomp $line;
return split " ", $line;
}
sub write_line {
my ($fh, @fields) = @_;
say $fh join("\t", @fields);
}
Now you can open the various files for reading/writing and copy the header into output files:
open(my $input, "<", "data.txt");
open(my $default_out, ">", "data_new.txt");
open(my $low_rank, ">", "lowrank.txt");
open(my $large_diff, ">", "large_diff.txt");
my @header = read_line($input);
write_line($default_out, @header);
write_line($low_rank, @header);
write_line($large_diff, @header);
The main loop is now just a matter of reading lines (making sure to stop when read_line returns an empty list at end of file) and following the rules:
while (my @fields = read_line($input)) {
my ($index, $score, $rank, $mark) = @fields;
if ($rank <= 15) {
write_line($low_rank, @fields);
} elsif (...) { # fill in the condition for large_diff yourself
write_line($large_diff, @fields);
} else {
write_line($default_out, @fields);
}
}
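The elsif condition is deliberately left for you to fill in; purely as an illustration, if "difference" in rule (b) is taken to mean the absolute difference between rank and mark, that placeholder could become (this reading of the question is an assumption, not part of the answer above):
} elsif (abs($rank - $mark) > 10) { # assumed: rank and mark differ by more than 10
write_line($large_diff, @fields);
}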


Need to open a file and replace multiple strings

I have a really big XML file. It has certain incrementing numbers inside, which I would like to replace with a different incrementing number. I've looked around, and here is what someone suggested here before. Unfortunately I can't get it to work :(
In the code below all instances of 40960 should be replaced with 41984, all instances of 40961 with 41985, etc. Nothing happens. What am I doing wrong?
use strict;
use warnings;
my $old = 40960;
my $new = 41984;
my $string;
my $file = 'file.txt';
rename($file, $file.'.bak');
open(IN, '<'.$file.'.bak') or die $!;
open(OUT, '>'.$file) or die $!;
$old++;
$new++;
for (my $i = 0; $i < 42; $i++) {
while(<IN>) {
$_ =~ s/$old/$new/g;
print OUT $_;
}
}
close(IN);
close(OUT);
Other answers give you better solutions to your problem. Mine concentrates on explaining why your code didn't work.
The core of your code is here:
$old++;
$new++;
for (my $i = 0; $i < 42; $i++) {
while(<IN>) {
$_ =~ s/$old/$new/g;
print OUT $_;
}
}
You increment the values of $old and $new outside of your loops. And you never change those values again. So you're only making the same substitution (changing 40961 to 41985) 42 times. You never try to change any other numbers.
Also, look at the while loop that reads from IN. On your first iteration (when $i is 0) you read all of the data from IN and the file pointer is left at the end of the file. So when you go into the while loop again on your second iteration (and all subsequent iterations) you read no data at all from the file. You need to reset the file pointer to the start of your file at the end of each iteration.
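For reference, rewinding a filehandle is done with seek; in the question's code that would mean adding this at the end of each outer iteration (the restructured version below avoids needing it at all):
seek IN, 0, 0; # move the read position back to the start of IN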
Oh, and the basic logic is wrong. If you think about it, you'll end up writing each line to the output file 42 times. You need to do all possible substitutions before writing the line. So your inner loop needs to be the outer loop (and vice versa).
Putting those suggestions together, you need something like this:
my $change = 1024;
while (<IN>) {
# Reset $old for each line so every line sees the full 40960..41001 range
my $old = 40960;
# Easier way to write your loop
for my $i ( 1 .. 42 ) {
my $new = $old + $change;
# Use \b to mark word boundaries
s/\b$old\b/$new/g;
$old++;
}
# Print each output line only once
print OUT $_;
}
Here's an example that works line by line, so the size of the file is immaterial. The example assumes you want to replace things like "45678", but not "fred45678". The example also assumes that there is a range of numbers, and you want them replaced with a new range offset by a constant.
#!/usr/bin/perl
use strict;
use warnings;
use constant MIN => 40000;
use constant MAX => 90000;
use constant DIFF => +1024;
sub repl { $_[0] >= MIN && $_[0] <= MAX ? $_[0] + DIFF : $_[0] }
while (<>) {
s/\b(\d+)\b/repl($1)/eg;
print;
}
exit(0);
Invoked with the file you want to transform as an argument, it produces altered output on stdout. With the following input ...
foo bar 123
40000 50000 60000 99999
fred60000
fred 60000 fred
... it produces this output.
foo bar 123
41024 51024 61024 99999
fred60000
fred 61024 fred
There are a couple of classic Perlisms here (the /e modifier evaluates the replacement as code, and \b matches only at word boundaries), but the example shouldn't be hard to follow if you RTFM appropriately.
Here is an alternative way which reads the input file into a string and does all the substitutions at once:
use strict;
use warnings;
{
my $old = 40960;
my $new = 41984;
my ($regexp) = map { qr/$_/ } join '|', map { $old + $_ } 0..41;
my $file = 'file.txt';
rename($file, $file.'.bak');
open(IN, '<'.$file.'.bak') or die $!;
my $str = do {local $/; <IN>};
close IN;
$str =~ s/($regexp)/do_subst($1, $old, $new)/ge;
open(OUT, '>'.$file) or die $!;
print OUT $str;
close OUT;
}
sub do_subst {
my ( $old, $old_base, $new_base ) = @_;
my $i = $old - $old_base;
my $new = $new_base + $i;
return $new;
}
Note: Can probably be made more efficient by using Regexp::Assemble
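A rough sketch of what that could look like, assuming the same 0..41 range as above (Regexp::Assemble is a CPAN module; its add and re methods build one optimized alternation):
use Regexp::Assemble;

my $old = 40960;
my $ra  = Regexp::Assemble->new;
$ra->add( map { $old + $_ } 0 .. 41 );   # add all 42 numbers as patterns
my $regexp = $ra->re;                    # compiled, optimized alternation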

How to only grep and sorted value from other column [closed]

I have an input file1.txt like below:
file1.txt :
test 1 Vertical 564N,
test 2 Vertical 551N,
test 3 Hydrophone 127N, 223D, 344D,
test 4 Hydrophone 350D,
test 6 Hydrophone 407D,
How can I grab only the matched values from the 4th column onwards and then sort them into the following output?
Output:
N D
564 223
551 344
127 350
407
In GNU awk:
$ awk '
BEGIN {
FS=",? *" #
OFS="\t" # output fields separated by tabs
}
{
for(i=4;i<=NF;i++) # process fields 4 and above
if($i~/N/) # hash n and d accordingly
n[++j]=$i
else if($i~/D/)
d[++k]=$i
}
END {
n[j=0]="N" # headers
d[k=0]="D"
while(n[j]!=""||d[k]!="") # output them until they run out
print substr(n[j],1,length(n[j++])-1),substr(d[k],1,length(d[k++])-1)
}' file
Output:
N D
564 223
551 344
127 350
407
And a perl version with output that's formatted like yours:
#!/usr/bin/perl
use warnings;
use strict;
use List::MoreUtils qw/zip/;
my %nums;
while (<>) {
my @f = split /\s+/;
for my $nd (@f[3..$#f]) {
if ($nd =~ /^(\d+)([ND])/) {
push @{$nums{$2}}, $1;
}
}
}
print "N D\n";
my @pairs = zip @{$nums{N}}, @{$nums{D}};
while (@pairs) {
my ($n, $d) = (shift @pairs, shift @pairs);
printf "%-3s %-3s\n", $n//"", $d//"";
}
Edit: After some golfing and playing around:
#!/usr/bin/perl
use warnings;
use strict;
use List::MoreUtils qw/zip6/;
my %nums = (N => [ "N" ], D => [ "D" ]);
while (<>) {
my @f = split /\s+/;
for my $nd (@f[3..$#f]) {
push @{$nums{$2}}, $1 if $nd =~ m/^(\d+)([ND])/;
}
}
printf "%-3s %-3s\n", $_->[0]//"", $_->[1]//"" for zip6 #{$nums{N}}, #{$nums{D}};
Here's a Perl solution
The input file is read and the first three columns discarded from each line. What remains is searched for fields that look like 999N or 999D and they are split into a hash of arrays with N and D as its keys
After sorting the arrays, they are displayed as columns beneath appropriate headers
use strict;
use warnings 'all';
use List::Util 'max';
open my $fh, '<', 'file1.txt' or die $!;
my %data;
while ( <$fh> ) {
my #fields = split ' ', $_, 4;
push @{ $data{$2} }, $1 while $fields[3] =~ /(\d+)([ND])/g;
}
$_ = [ sort { $a <=> $b } @$_ ] for values %data;
my $last = max map $#$_, values %data;
my $fmt = "%-3s %-3s\n";
printf $fmt, qw/ N D /;
for my $i ( 0 .. $last ) {
printf $fmt, map { $_->[$i] // '' } @data{qw/ N D /};
}
output
N D
127 223
551 344
564 350
407
This should help.
Python code
from itertools import izip_longest
#Python3
#from itertools import zip_longest
N = []
D = []
with open("file1.txt") as infile:
for line in infile: #Iterate Each line
val = line.split()[3:] #Get Required value using slicing
for i in val:
if i.endswith("N,"):
N.append(int(i.rstrip("N,")))
else:
D.append(int(i.rstrip("D,")))
for i in izip_longest(sorted(N, reverse=True), sorted(D), fillvalue=""): #Sort and iterate.
print(i)
Output
564, 223
551, 344
127, 350
'', 407

Perl: append numbers from one file to strings of second file

I would like to append numbers attached to (Unicode) strings in one file, to matched strings in a second file. Somehow I can't wrap my head around how to do this. Here is what my two files look like.
File 1:
दौरा, 2
प्रोत्साहन, 1
प्रगति, 4
File 2:
दौरा
dorA
प्रोत्साहन
prua2ts3Ahan
prua2ts2Ahan
prua2tsAhan
prua2t2s3Ahan
prua2t2s2Ahan
prua2t2sAhan
prOts3Ahan
prOts2Ahan
prOtsAhan
prOt2s3Ahan
prOt2s2Ahan
prOt2sAhan
प्रगति
praGat2I
praGatI
pragat2I
pragatI
The desired result would look like this:
Output:
dorA, 2
prua2ts3Ahan, 1
prua2ts2Ahan, 1
prua2tsAhan, 1
prua2t2s3Ahan, 1
prua2t2s2Ahan, 1
prua2t2sAhan, 1
prOts3Ahan, 1
prOts2Ahan, 1
prOtsAhan, 1
prOt2s3Ahan, 1
prOt2s2Ahan, 1
prOt2sAhan, 1
praGat2I, 4
praGatI, 4
pragat2I, 4
pragatI, 4
I have a hash created from File 1 that has the strings as keys and the numbers as values. Now it's a matter of matching these keys in File 2, collecting all following lines after the match, and appending the values to those following lines. Can someone point me in the right direction?
Your description of the solution is correct. Now just translate it to code:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
open my $F1, '<:encoding(UTF-8)', 'file.1' or die $!;
while (<$F1>) {
chomp;
my ($word, $num) = split /, /;
$hash{$word} = $num;
}
open my $F2, '<:encoding(UTF-8)', 'file.2' or die $!;
my $word;
while (<$F2>) {
chomp;
if (exists $hash{$_}) {
$word = $_;
} elsif ($_) {
print "$_, $hash{$word}\n";
} else {
print "\n";
}
}

unix - breakdown of how many records in file with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many records (as defined by a specific EOL such as "^%!") had how many occurrences of a specific character, sorted descending by the number of occurrences.
For example, with this sample file:
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
Suggested input: delimiter, EOL, and filename as arguments.
bash/perl some_script_name ",|" "^%!" samplefile
Desired output:
occs count
3 1
2 1
1 2
0 2
This is because the 1st record had one delimiter, 2nd record had 2, 3rd record had 0, 4th record had 3, 5th record had 1, 6th record had 0.
Bonus pts if you can make the delimiter and EOL argument accept hex input (ie 2C7C) or normal character input (ie ,|) .
Script:
#!/usr/bin/perl
use strict;
$/ = $ARGV[1];
open my $fh, '<', $ARGV[2] or die $!;
my @records = <$fh> and close $fh;
$/ = $ARGV[0];
my %counts;
$counts{(split $_)-1}++ for #records;
delete $counts{-1};
print "$_\t$counts{$_}\n" for (reverse sort keys %counts);
Test:
perl script.pl ',|' '^%!' samplefile
Output:
3 1
2 1
1 2
0 2
This is what perl lives for:
#!perl -w
use 5.12.0;
my ($delim, $eol, $file) = @ARGV;
open my $fh, "<$file" or die "error opening $file $!";
$/ = $eol; # input record separator
my %counts;
while (<$fh>) {
my $matches = () = $_ =~ /(\Q$delim\E)/g; # "goatse" operator
$counts{$matches}++;
}
say "occs\tcount";
foreach my $num (reverse sort keys %counts) {
say "$num\t$counts{$num}";
}
(if you haven't got 5.12, remove the "use 5.12" line and replace the say with print)
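On the bonus request (accepting the delimiter and EOL either as hex like 2C7C or as literal characters): a possible sketch that could sit right after the @ARGV line above, assuming an even-length, all-hex argument should be treated as hex (a literal delimiter that happens to look like hex would be misread under this assumption):
# e.g. "2C7C" becomes ",|" and "5E2521" becomes "^%!"; a literal ",|" is left alone
for ($delim, $eol) {
    $_ = pack 'H*', $_ if /\A(?:[0-9A-Fa-f]{2})+\z/;
}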
A solution in awk:
BEGIN {
RS="\\^%!"
FS=",\\|"
max_occ = 0
}
{
if(match($0, "^ *$")) { # This is here to deal with the final separator.
next
}
if(NF - 1 > max_occ) {
max_occ = NF - 1
}
count[NF - 1]=count[NF - 1] + 1
}
END {
printf("occs count\n")
for(i = 0; i <= max_occ; i++) {
printf("%s %s\n", i, count[i])
}
}
Well, there's one more empty record at the end of the file which has 0. So, here's a script to do what you wanted. Adding headers and otherwise tweaking the printf output is left as an exercise for you. :)
Basically, read the whole file in, split it into records, and for each record, use a /g regex to count the sub-delimiters. Since /g returns an array of all matches, use @{[]} to make an arrayref then deref that in scalar context to get a count. There has to be a more elegant solution to that particular part of the problem, but whatever; it's perl line noise. ;)
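As an aside, a shorter way to get that count without the @{[]} detour is to force list context with an empty-list assignment (the same trick as the "goatse" example above); a sketch, applied to $_ with the raw sub-delimiter protected by \Q...\E as in the modified versions below:
my $count = () = m/\Q$subdelim\E/g;   # number of sub-delimiter matches in $_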
user@host[/home/user]
$ ./test.pl ',|' '^%!' test.in
3 1
2 1
1 2
0 3
user@host[/home/user]
$ cat test.in
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
user@host[/home/user]
$ cat test.pl
#!/usr/bin/perl
my( $subdelim, $delim, $in,) = @ARGV;
$delim = quotemeta $delim;
$subdelim = quotemeta $subdelim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $?\n};
foreach( split(/$delim/, join(q{}, <F>)) ){
$counts{ scalar(@{[m/.*?($subdelim)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
And here's a modified version which only keeps records which contain at least one non-space character. That removes the last record, but also has the consequence of removing any other empty records. It also uses $/ and \Q\E to reduce a couple of explicit function calls (thanks, Alex). And, like the previous one, it works with strict + warnings.
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $?\n};
foreach ( grep(/\S/, <F>) ){
$counts{ scalar(@{[m/.*?(\Q$subdelim\E)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
If you really only want to remove the last record unconditionally, I'm partial to using pop:
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $?\n};
my @lines = <F>;
pop @lines;
$counts{ scalar(@{[m/.*?(\Q$subdelim\E)/g]}) }++ foreach (@lines);
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);

Using Perl or Linux built-in command-line tools how quickly map one integer to another?

I have a text file mapping of two integers, separated by commas:
123,456
789,555
...
It's 120Megs... so it's a very long file.
I repeatedly need to search for a value in the first column and return the second, e.g., look up 789 --returns--> 555, and I need to do it FAST, using regular Linux built-ins.
I'm doing this right now and it takes several seconds per look-up.
If I had a database I could index it. I guess I need an indexed text file!
Here is what I'm doing now:
my $lineFound=`awk -F, '/$COLUMN1/ { print $2 }' ../MyBigMappingFile.csv`;
Is there any easy way to pull this off with a performance improvement?
The hash suggestions are the natural way an experienced Perler would do this, but it may be suboptimal in this case. It scans the entire file and builds a large, flat data structure in linear time. Cruder methods can short-circuit with a worst case of linear time, and usually less in practice.
I first made a big mapping file:
my $LEN = shift;
for (1 .. $LEN) {
my $rnd = int rand( 999 );
print "$_,$rnd\n";
}
With $LEN passed on the command line as 10000000, the file came out to 113MB. Then I benchmarked three implementations. The first is the hash lookup method. The second slurps the file and scans it with a regex. The third reads line-by-line and stops when it matches. Complete implementation:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw{timethese};
my $FILE = shift;
my $COUNT = 100;
my $ENTRY = 40;
slurp(); # Initial file slurp, to get it into the hard drive cache
timethese( $COUNT, {
'hash' => sub { hash_lookup( $ENTRY ) },
'scalar' => sub { scalar_lookup( $ENTRY ) },
'linebyline' => sub { line_lookup( $ENTRY ) },
});
sub slurp
{
open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
undef $/;
my $s = <$fh>;
close $fh;
return $s;
}
sub hash_lookup
{
my ($entry) = @_;
my %data;
open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
while( <$fh> ) {
my ($name, $val) = split /,/;
$data{$name} = $val;
}
close $fh;
return $data{$entry};
}
sub scalar_lookup
{
my ($entry) = @_;
my $data = slurp();
my ($val) = $data =~ /^ $entry , (\d+) $/xm;
return $val;
}
sub line_lookup
{
my ($entry) = @_;
my $found;
open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
while( <$fh> ) {
my ($name, $val) = split /,/;
if( $name == $entry ) {
$found = $val;
last;
}
}
close $fh;
return $found;
}
Results on my system:
Benchmark: timing 100 iterations of hash, linebyline, scalar...
hash: 47 wallclock secs (18.86 usr + 27.88 sys = 46.74 CPU) @ 2.14/s (n=100)
linebyline: 47 wallclock secs (18.86 usr + 27.80 sys = 46.66 CPU) @ 2.14/s (n=100)
scalar: 42 wallclock secs (16.80 usr + 24.37 sys = 41.17 CPU) @ 2.43/s (n=100)
(Note I'm running this off an SSD, so I/O is very fast, and perhaps makes that initial slurp() unnecessary. YMMV.)
Interestingly, the hash implementation is just as fast as linebyline, which isn't what I expected. By using slurping, scalar may end up being faster on a traditional hard drive.
However, by far the fastest is a simple call to grep:
$ time grep '^40,' int_map.txt
40,795
real 0m0.508s
user 0m0.374s
sys 0m0.046s
Perl could easily read that output and split apart the comma in hardly any time at all.
Edit: Never mind about grep. I misread the numbers.
120 meg isn't that big. Assuming you've got at least 512MB of ram, you could easily read the whole file into a hash and then do all of your lookups against that.
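A minimal sketch of that approach, using the column layout and file path from the question (the 789 lookup is just an example key):
#!/usr/bin/perl
use strict;
use warnings;

my %map;
open my $fh, '<', '../MyBigMappingFile.csv' or die "Can't open mapping file: $!";
while (<$fh>) {
    chomp;
    my ($from, $to) = split /,/;
    $map{$from} = $to;               # assumes the left-hand values are unique
}
close $fh;

# every lookup after the one-time load is a constant-time hash access
print defined $map{789} ? $map{789} : 'not found', "\n";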
use:
sed -n "/^$COLUMN1/{s/.*,//p;q}" file
This optimizes your code in three ways:
1) No needless splitting each line in two on ",".
2) You stop processing the file after the first hit.
3) sed is faster than awk.
This should more than halve your search time.
HTH Chris
It all depends on how often the data change and how often in the course of a single script invocation you need to look up.
If there are many lookups during each script invocation, I would recommend parsing the file into a hash (or array if the range of keys is narrow enough).
If the file changes every day, creating a new SQLite database might or might not be worth your time.
If each script invocation needs to look up just one key, and if the data file changes often, you might get an improvement by slurping the entire file into a scalar (minimizing memory overhead) and doing a pattern match on that (instead of parsing each line).
#!/usr/bin/env perl
use warnings; use strict;
die "Need key\n" unless #ARGV;
my $lookup_file = 'lookup.txt';
my ($key) = @ARGV;
my $re = qr/^$key,([0-9]+)$/m;
open my $input, '<', $lookup_file
or die "Cannot open '$lookup_file': $!";
my $buffer = do { local $/; <$input> };
close $input;
if (my ($val) = ($buffer =~ $re)) {
print "$key => $val\n";
}
else {
print "$key not found\n";
}
On my old slow laptop, with a key towards the end of the file:
C:\Temp> dir lookup.txt
...
2011/10/14 10:05 AM 135,436,073 lookup.txt
C:\Temp> tail lookup.txt
4522701,5840
5439981,16075
7367284,649
8417130,14090
438297,20820
3567548,23410
2014461,10795
9640262,21171
5345399,31041
C:\Temp> timethis lookup.pl 5345399
5345399 => 31041
TimeThis : Elapsed Time : 00:00:03.343
This example loads the file into a hash (which takes about 20s for 120M on my system). Subsequent lookups are then nearly instantaneous. This assumes that each number in the left column is unique. If that's not the case then you would need to push numbers on the right with the same number on the left onto an array or something.
use strict;
use warnings;
my ($csv) = @ARGV;
my $start=time;
open(my $fh, $csv) or die("$csv: $!");
$|=1;
print("loading $csv... ");
my %numHash;
my $p=0;
while(<$fh>) { $p+=length; my($k,$v)=split(/,/); $numHash{$k}=$v }
print("\nprocessed $p bytes in ",time()-$start, " seconds\n");
while(1) { print("\nEnter number: "); chomp(my $i=<STDIN>); print($numHash{$i}) }
Example usage and output:
$ ./lookup.pl MyBigMappingFile.csv
loading MyBigMappingFile.csv...
processed 125829128 bytes in 19 seconds
Enter number: 123
322
Enter number: 456
93
Enter number:
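If the left-hand numbers can repeat, a sketch of the hash-of-arrays variant mentioned above; only the loading line changes, and a lookup then prints every stored value (this is an adaptation of the code above, not part of its tested output):
while(<$fh>) { $p+=length; my($k,$v)=split(/,/); push @{$numHash{$k}}, $v }
# at lookup time, print every value stored for the entered key:
print(@{ $numHash{$i} || [] });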
Does it help if you cp the file to /dev/shm and query the mapping there with awk/sed/perl/grep/ack/whatever?
Don't tell me you are working on a machine with 128MB of RAM. :)
