perl : String extraction based on pattern match [closed] - string

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
A file FILE1 has several thousands of lines with a terminating pattern _Pattern1.
A second file too has several thousands of lines with the same terminating pattern _Pattern1.
I must now:
Read FILE1 line by line
Find out if the line has any string terminating with _Pattern1
Extract the string and store it into a variable
Open FILE2 and read it line by line
Find out if the line just read from FILE2 contains the string stored in the variable above
How is this to be done in perl?
EDIT2:
Allright, with a bit of googling and referring to the links enlisted below, I solved my problem.
Here is the code snippet.
#!/usr/bin/perl
use strict;
use warnings;
my $OriginalHeader=$ARGV[0]; ## Source file
my $GeneratedHeader=$ARGV[1];## File to compare against
my $DeltaHeader=$ARGV[2]; ## File to store misses
my $MatchingPattern="_Pos";
my $FoundPattern;
open FILE1, $OriginalHeader or die $!;
open FILE2, $GeneratedHeader or die $!;
open (FILE3, ">$DeltaHeader") or die $!;
my $lineFromOriginalHeader;
my $lineFromGeneratedHeader;
my $TotalMacrosExamined = 0;
my $TotalMacrosMissed = 0;
while($lineFromOriginalHeader=<FILE1>)
{
if($lineFromOriginalHeader =~ /$MatchingPattern/)
{
my $index = index($lineFromOriginalHeader,$MatchingPattern);
my $BackIndex = $index;
my $BackIndexStart = $index;
$BackIndex = $BackIndex - 1;
## Use this while loop to extract the substring.
while (1)
{
my $ExtractedChar = substr($lineFromOriginalHeader,$BackIndex,1);
if ($ExtractedChar =~ / /)
{
$FoundPattern = substr($lineFromOriginalHeader,$BackIndex + 1,$BackIndexStart + 3 -
$BackIndex);
print "Identified $FoundPattern \n";
$TotalMacrosExamined = $TotalMacrosExamined + 1;
##Skip the next line
$lineFromOriginalHeader = <FILE1>;
last;
}
else
{
$BackIndex = $BackIndex - 1;
}
} ##while(1)
## We now look for $FoundPattern in FILE2
while ($lineFromGeneratedHeader = <FILE2>)
{
if (index($lineFromGeneratedHeader,$FoundPattern)!= -1)
{
##Pattern found. Reset file pointer and break out of while loop
seek FILE2,0,0;
last;
}
else
{
if (eof(FILE2) == 1)
{
print FILE3 "Generated header misses $FoundPattern\n";
$TotalMacrosMissed = $TotalMacrosMissed + 1;
seek FILE2,0,0;
last;
}
}
} ##while(1)
}
else
{
##NOP
}
} ##while (linefromoriginalheader)
close FILE1;
close FILE2;
close FILE3;
print "Total number of bitfields examined = $TotalMacrosExamined\n";
print "Number of macros obsolete = $TotalMacrosMissed\n";

Just a first cut at making your code more Perly. Actually plenty more could be done, including $some_var is usually used vs $SomeVar in Perl, but I didn't get that far.
#!/usr/bin/perl
use strict;
use warnings;
my ($OriginalHeader, $GeneratedHeader, $DeltaHeader) = #ARGV;
my $MatchingPattern=qr/(\S*_Pos)/; # all non-whitespace terminated by _Pos
open my $file1, '<', $OriginalHeader or die $!;
open my $file2, '<', $GeneratedHeader or die $!;
open my $file3, '>', $DeltaHeader or die $!;
my $TotalMacrosExamined = 0;
my $TotalMacrosMissed = 0;
while(my $lineFromOriginalHeader=<$file1>) {
next unless $lineFromOriginalHeader =~ $MatchingPattern;
my $FoundPattern = $1; # matched string
print "Identified $FoundPattern \n";
$TotalMacrosExamined++;
##Skip the next line
<$file1>;
## We now look for $FoundPattern in FILE2
my $match_found = 0;
while (my $lineFromGeneratedHeader = <$file2>) {
if (index($lineFromGeneratedHeader,$FoundPattern)!= -1) {
##Pattern found. Close the file and break out of while loop
$match_found++;
last;
}
}
unless ($match_found) {
print $file3 "Generated header misses $FoundPattern\n";
$TotalMacrosMissed++;
}
seek $file2,0,0;
}
print "Total number of bitfields examined = $TotalMacrosExamined\n";
print "Number of macros obsolete = $TotalMacrosMissed\n";

Having programmed in C all my life, I googled usage of the perl constructs below and wrote a C like program. This works flawlessly for me. :-)
Edit : This is to clarify why I must skip a line in the algo below. The pattern which is retrieved and later searched for in the second file occurs on two consecutive lines. It therefore suffices that its first occurence is reliably detected. Also a nitpick, It is always guaranteed that the substring containing the pattern is always the second substring on the line.
e.g #define Something_Pos (Some Value)
#!/usr/bin/perl
use strict;
use warnings;
my $OriginalHeader=$ARGV[0];
my $GeneratedHeader=$ARGV[1];
my $DeltaHeader=$ARGV[2];
my $MatchingPattern="_Pos";
my $FoundPattern;
open FILE1, $OriginalHeader or die $!;
open FILE2, $GeneratedHeader or die $!;
open (FILE3, ">$DeltaHeader") or die $!;
my $lineFromOriginalHeader;
my $lineFromGeneratedHeader;
my $TotalMacrosExamined = 0;
my $TotalMacrosMissed = 0;
while($lineFromOriginalHeader=<FILE1>)
{
if($lineFromOriginalHeader =~ /$MatchingPattern/)
{
my $index = index($lineFromOriginalHeader,$MatchingPattern);
my $BackIndex = $index;
my $BackIndexStart = $index;
$BackIndex = $BackIndex - 1;
## Use this while loop to extract the substring.
while (1)
{
my $ExtractedChar = substr($lineFromOriginalHeader,$BackIndex,1);
if ($ExtractedChar =~ / /)
{
$FoundPattern = substr($lineFromOriginalHeader,$BackIndex + 1,$BackIndexStart + 3 -
$BackIndex);
print "Identified $FoundPattern \n";
$TotalMacrosExamined = $TotalMacrosExamined + 1;
##Skip the next line
$lineFromOriginalHeader = <FILE1>;
last;
}
else
{
$BackIndex = $BackIndex - 1;
}
} ##while(1)
## We now look for $FoundPattern in FILE2
while ($lineFromGeneratedHeader = <FILE2>)
{
##print "Read the following line from FILE2: $lineFromGeneratedHeader\n";
if (index($lineFromGeneratedHeader,$FoundPattern)!= -1)
{
##Pattern found. Close the file and break out of while loop
seek FILE2,0,0;
last;
}
else
{
if (eof(FILE2) == 1)
{
print FILE3 "Generated header misses $FoundPattern\n";
$TotalMacrosMissed = $TotalMacrosMissed + 1;
seek FILE2,0,0;
last;
}
}
} ##while(1)
}
else
{
}
} ##while (linefromoriginalheader)
close FILE1;
close FILE2;
close FILE3;
print "Total number of bitfields examined = $TotalMacrosExamined\n";
print "Number of macros obsolete = $TotalMacrosMissed\n";

Related

Need to open a file and replace multiple strings

I have a really big xml file. It has certain incrementing numbers inside, which i would like to replace with a different incrementing number. I've looked and here is what someone suggested here before. Unfortunately i cant get it to work :(
In the code below all instances of 40960 should be replaced with 41984, all instances of 40961 with 41985 etc. Nothing happens. What am i doing wrong?
use strict;
use warnings;
my $old = 40960;
my $new = 41984;
my $string;
my $file = 'file.txt';
rename($file, $file.'.bak');
open(IN, '<'.$file.'.bak') or die $!;
open(OUT, '>'.$file) or die $!;
$old++;
$new++;
for (my $i = 0; $i < 42; $i++) {
while(<IN>) {
$_ =~ s/$old/$new/g;
print OUT $_;
}
}
close(IN);
close(OUT);
Other answers give you better solutions to your problem. Mine concentrates on explaining why your code didn't work.
The core of your code is here:
$old++;
$new++;
for (my $i = 0; $i < 42; $i++) {
while(<IN>) {
$_ =~ s/$old/$new/g;
print OUT $_;
}
}
You increment the values of $old and $new outside of your loops. And you never change those values again. So you're only making the same substitution (changing 40961 to 41985) 42 times. You never try to change any other numbers.
Also, look at the while loop that reads from IN. On your first iteration (when $i is 0) you read all of the data from IN and the file pointer is left at the end of the file. So when you go into the while loop again on your second iteration (and all subsequent iterations) you read no data at all from the file. You need to reset the file pointer to the start of your file at the end of each iteration.
Oh, and the basic logic is wrong. If you think about it, you'll end up writing each line to the output file 42 times. You need to do all possible substitutions before writing the line. So your inner loop needs to be the outer loop (and vice versa).
Putting those suggestions together, you need something like this:
my $old = 40960;
my $change = 1024;
while (<IN>) {
# Easier way to write your loop
for my $i ( 1 .. 42 ) {
my $new = $old + $change;
# Use \b to mark word boundaries
s/\b$old\b/$new/g;
$old++;
}
# Print each output line only once
print OUT $_;
}
Here's an example that works line by line, so the size of file is immaterial. The example assumes you want to replace things like "45678", but not "fred45678". The example also assumes that there is a range of numbers, and you want them replaced with a new range offset by a constant.
#!/usr/bin/perl
use strict;
use warnings;
use constant MIN => 40000;
use constant MAX => 90000;
use constant DIFF => +1024;
sub repl { $_[0] >= MIN && $_[0] <= MAX ? $_[0] + DIFF : $_[0] }
while (<>) {
s/\b(\d+)\b/repl($1)/eg;
print;
}
exit(0);
Invoked with the file you want to transform as an argument, it produces altered output on stdout. With the following input ...
foo bar 123
40000 50000 60000 99999
fred60000
fred 60000 fred
... it produces this output.
foo bar 123
41024 51024 61024 99999
fred60000
fred 61024 fred
There are a couple of classic Perlisms here, but the example shouldn't be hard to follow if you RTFM appropriately.
Here is an alternative way which reads the input file into a string and does all the substitutions at once:
use strict;
use warnings;
{
my $old = 40960;
my $new = 41984;
my ($regexp) = map { qr/$_/ } join '|', map { $old + $_ } 0..41;
my $file = 'file.txt';
rename($file, $file.'.bak');
open(IN, '<'.$file.'.bak') or die $!;
my $str = do {local $/; <IN>};
close IN;
$str =~ s/($regexp)/do_subst($1, $old, $new)/ge;
open(OUT, '>'.$file) or die $!;
print OUT $str;
close OUT;
}
sub do_subst {
my ( $old, $old_base, $new_base ) = #_;
my $i = $old - $old_base;
my $new = $new_base + $i;
return $new;
}
Note: Can probably be made more efficient by using Regexp::Assemble

Perl daemon doesn't go through entire loop

I'm trying to make a program that works like a very simple server in Perl.
The program itself is meant to work as a library catalogue, giving the user options of searching for books by title or author, and borrowing or returning books. The list of books is provided in a separate file.
Basically, it's supposed to take requests (files) from "Requests" folder, process them, and then give answers (also files) in "Answers" folder. After the process is over, it deletes the old requests and repeats the process (answers are deleted by the client after they are accepted).
It's meant to run as a daemon, but for some reason only the loop responsible for deleting the request files works in the background - the requests are not processed into answers, but are just deleted. Whenever a new request appears, it's almost immediately deleted.
I'm learning to use daemons and tried to emulate what is in this thread.
#!/usr/bin/perl
use warnings;
use strict;
use Proc::Daemon;
#FUNCTIONS DEFINTIONS
sub FindAuthor
{
#try to find book by this author in the catalogue
}
sub FindTitle
{
#try to find book with this title in the catalogue
}
sub CheckIfCanBeReturned
{
#check if the book is borrowed and by whom
}
#attempt at daemonization
Proc::Daemon::Init;
my $continueWork = 1;
$SIG{TERM} = sub { $continueWork = 0 };
while ( $continueWork )
{
sleep(2);
my #RequestFilesArray = `ls /home/Ex5/Requests`;
#list all requests currently in the Request folder
for ( my $b = 0; $b < #RequestFilesArray; $b++)
{
my $cut = `printf "$RequestFilesArray[$b]" | wc -m`;
$cut = $cut - 1;
$RequestFilesArray[$b] = substr $RequestFilesArray[$b], 0, $cut;
}
#the requests are formatted in such way,
#that the first 2 letters indicate what the client wants to be done
#and the rest is taken parameters used in the processing
for (my $i = 0; $i < #RequestFilesArray; $i++)
{
my $UserRequest = `tail -1 Requests/$RequestFilesArray[$i]`;
my $fix = `printf "$UserRequest" | wc -m`;
$fix = $fix - 1;
$UserRequest = substr $UserRequest, 0, $fix;
my $RequestType = substr $UserRequest, 0, 2;
my $RequestedValue = substr $UserRequest, 3;
my $RequestNumber = $i;
if ($RequestType eq "fa")
{
#FIND BY AUTHOR
my #results = FindAuthor ($RequestedValue);
my $filename = "/home/Ex5/Answers/" . $RequestFilesArray[$RequestNumber];
open (my $answerFile, '>', $filename) or die "$!";
for (my $a = 0; $a < #results; $a++)
{
print $answerFile $results[$a],"\n";
}
close $answerFile;
}
elsif ($RequestType eq "ft")
{
#FIND BY TITLE
my #results = FindTitle ($RequestedValue);
my $filename = "/home/Ex5/Answers/" . $RequestFilesArray[$RequestNumber];
open ( my $answerFile, '>', $filename) or die "$!";
for (my $a = 0; $a < #results; $a++)
{
print $answerFile $results[$a],"\n";
}
close $answerFile;
}
elsif ($RequestType eq "br")
{
#BOOK RETURN
my $result = CheckIfCanBeReturned ($RequestedValue, $RequestFilesArray[$RequestNumber]);
my $filename = "/home/Ex5/Answers/" . $RequestFilesArray[$RequestNumber];
open ( my $answerFile, '>', $filename) or die "$!";
print $answerFile $result;
close $answerFile;
}
elsif ($RequestType eq "bb")
{
#BOOK BORROW
my $result = CheckIfCanBeBorrowed ($RequestedValue, $RequestFilesArray[$RequestNumber]);
my $filename = "/home/Ex5/Answers/" . $RequestFilesArray[$RequestNumber];
open ( my $answerFile, '>', $filename) or die "$!";
print $answerFile $result;
close $answerFile;
}
else
{
print "something went wrong with this request";
}
}
#deleting processed requests
for ( my $e = 0; $e < #RequestFilesArray; $e++)
{
my $removeReq = "/home/Ex5/Requests/" . $RequestFilesArray[$e];
unlink $removeReq;
}
#$continueWork =0;
}
You have written way too much code before attempting to test it. You have also started shell processes at every opportunity rather than learning the correct way to achieve things in Perl
The first mistake is to use ls to discover what jobs are waiting. ls prints multiple files per line, and you treat the whole of each line as a file name, using the bizarre printf "$RequestFilesArray[$b]" | wc -m instead of length $RequestFilesArray[$b]
Things only get worse after that
I suggest the following
Start again from scratch
Write your program in Perl. Perl isn't a shell language
Advance in very small increments, making sure that your code compiles and does what it is supposed to every three or four lines. It does wonders for the confidence to know that you're enhancing working code rather than creating a magical sequence of random characters
Learn how to debug. You appear to be staring at your code hoping for inspiration to strike in the manner of someone staring at their car engine in the hope of seeing why it won't start
Delete request files as part of processing the request, and only once the request has been processed and the answer file successfully written. It shouldn't be done in a separate loop
Taking what you provided, here's some pseudocode I've devised for you that you can use as somewhat of a template. This is by NO MEANS exhaustive. I think the advice #Borodin gave is sound and prudent.
This is all untested and much of the new stuff is pseudocode. However, hopefully, there are some breadcrumbs from which to learn. Also, as I stated above, your use of Proc::Daemon::Init is suspect. At the very least, it is so minimally used that it is gobbling up whatever error(s) is/are occurring and you've no idea what's wrong with the script.
#!/usr/bin/perl -wl
use strict;
use File::Basename;
use File::Spec;
use Proc::Daemon;
use Data::Dumper;
# turn off buffering
$|++;
#FUNCTIONS DEFINTIONS
sub FindAuthor
{
#try to find book by this author in the catalogue
}
sub FindTitle
{
#try to find book with this title in the catalogue
}
sub CheckIfCanBeReturned
{
#check if the book is borrowed and by whom
}
sub tail
{
my $file = shift;
# do work
}
sub find_by
{
my $file = shift;
my $val = shift;
my $by = shift;
my #results;
my $xt = 0;
# sanity check args
# do work
if ( $by eq 'author' )
{
my #results = FindByAuthor(blah);
}
elsif ( $by eq 'blah' )
{
#results = blah();
}
#...etc
# really should use File::Spec IE
my $filename = File::Spec->catfile('home', 'Ex5', 'Answers', $file);
# might be a good idea to either append or validate you're not clobbering
# an existent file here because this is currently clobbering.
open (my $answerFile, '>', $filename) or die "$!";
for ( #results )
{
print $answerFile $_,"\n";
}
close $answerFile;
# have some error checking in place and set $xt to 1 if an error occurs
return $xt;
}
#attempt at daemonization
# whatever this is is completely broken methinks.
#Proc::Daemon::Init;
my $continueWork++;
my $r_dir = '/home/user/Requests';
$SIG{TERM} = sub { $continueWork = 0 };
# going with pseudocode
while ( $continueWork )
{
#list all requests currently in the Request folder
my #RequestFilesArray = grep(/[^\.]/, <$r_dir/*>);
#the requests are formatted in such way,
#that the first 2 letters indicate what the client wants to be done
#and the rest is taken parameters used in the processing
for my $request_file ( #RequestFilesArray )
{
my $result = 0;
$request_file = basename($request_file);
my $cut = length($request_file) - 1;
my $work_on = substr $request_file, 0, $cut;
my $UserRequest = tail($request_file);
my $fix = length($UserRequest) - 1;
$UserRequest = substr $UserRequest, 0, $fix;
my $RequestType = substr $UserRequest, 0, 2;
my $RequestedValue = substr $UserRequest, 3;
if ($RequestType eq "fa")
{
#FIND BY AUTHOR
$result = find_by($request_file, $RequestedValue, 'author');
}
elsif ($RequestType eq "ft")
{
#FIND BY TITLE
$result = find_by($request_file, $RequestedValue, 'title');
}
elsif ($RequestType eq "br")
{
#BOOK RETURN
$result = CheckIfCanBeReturned ($RequestedValue, $request_file) or handle();
}
elsif ($RequestType eq "bb")
{
#BOOK BORROW
$result = CheckIfCanBeBorrowed ($RequestedValue, $request_file) or handle();
}
else
{
print STDERR "something went wrong with this request";
}
}
#deleting processed requests
if ( $result == 1 )
{
unlink $work_on;
}
sleep(2);
}
Take special note to my "mild" attempt and DRYing up your code by using the find_by subroutine. You had a LOT of duplicate code in your original script, which I moved into a single sub routine. DRY eq 'Don't Repeat Yourself'.

Perl Conditions

Trying to iterate through two files. Everything works although once I get to the negation of my if statement it messes everything up. The only thing that will print is the else statement
Please disregard any unused variables, when defined. Will clean it up after.
#!/usr/bin/perl
#
# Packages and modules
#
use strict;
use warnings;
use version; our $VERSION = qv('5.16.0'); # This is the version of Perl to be used
use Text::CSV 1.32; # We will be using the CSV module (version 1.32 or higher)
# to parse each line
#
# readFile.pl
# Authors: schow04#mail.uoguelph + anilam#mail.uoguelph.ca
# Project: Lab Assignment 1 Script (Iteration 0)
# Date of Last Update: Monday, November 16, 2015.
#
# Functional Summary
# readFile.pl takes in a CSV (comma separated version) file
# and prints out the fields.
# There are three fields:
# 1. name
# 2. gender (F or M)
# 3. number of people with this name
#
# This code will also count the number of female and male
# names in this file and print this out at the end.
#
# The file represents the names of people in the population
# for a particular year of birth in the United States of America.
# Officially it is the "National Data on the relative frequency
# of given names in the population of U.S. births where the individual
# has a Social Security Number".
#
# Commandline Parameters: 1
# $ARGV[0] = name of the input file containing the names
#
# References
# Name files from http://www.ssa.gov/OACT/babynames/limits.html
#
#
# Variables to be used
#
my $EMPTY = q{};
my $SPACE = q{ };
my $COMMA = q{,};
my $femalecount = 0;
my $malecount = 0;
my $lines = 0;
my $filename = $EMPTY;
my $filename2 = $EMPTY;
my #records;
my #records2;
my $record_count = -1;
my $top_number = 0;
my $male_total = 0;
my $male_count = 0;
my #first_name;
my #gender;
my #first_name2;
my #number;
my $count = 0;
my $count2 = 0;
my $csv = Text::CSV->new({ sep_char => $COMMA });
#
# Check that you have the right number of parameters
#
if ($#ARGV != 1) {
print "Usage: readTopNames.pl <names file> <course names file>\n" or
die "Print failure\n";
exit;
}
$filename = $ARGV[0];
$filename2 = $ARGV[1];
#
# Open the input file and load the contents into records array
#
open my $names_fh, '<', $filename
or die "Unable to open names file: $filename\n";
#records = <$names_fh>;
close $names_fh or
die "Unable to close: $ARGV[0]\n"; # Close the input file
open my $names_fh2, '<', $filename2
or die "Unable to open names file: $filename2\n";
#records2 = <$names_fh2>;
close $names_fh2 or
die "Unable to close: $ARGV[1]\n"; # Close the input file
#
# Parse each line and store the information in arrays
# representing each field
#
# Extract each field from each name record as delimited by a comma
#
foreach my $class_record (#records)
{
chomp $class_record;
$record_count = 0;
$count = 0;
foreach my $name_record ( #records2 )
{
if ($csv->parse($name_record))
{
my #master_fields = $csv->fields();
$record_count++;
$first_name[$record_count] = $master_fields[0];
$gender[$record_count] = $master_fields[1];
$number[$record_count] = $master_fields[2];
if($class_record eq $first_name[$record_count])
{
if($gender[$record_count] eq 'F')
{
print("$first_name[$record_count] ($record_count)\n");
}
if($gender[$record_count] eq 'M')
{
my $offset = $count - 2224;
print("$first_name[$record_count] ($offset)\n");
}
}
} else {
warn "Line/record could not be parsed: $records[$record_count]\n";
}
$count++;
}
}
#
# End of Script
#
Adam (187)
Alan (431)
Alejandro (1166)
Alex (120)
Alicia (887)
Ambrose (305)
Caleb (794)
Sample output from running the following code.
This is correct: Although if a name is not found in the second file it is supposed to say:
Adam (187)
Alan (431)
Name (0)
Alejandro (1166)
Alex (120)
Alicia (887)
Ambrose (305)
Caleb (794)
That is what the else is supposed to find. Whether the if statement returned nothing.
else {
print("$first_name[$record_count] (0)\n");
}
The output that i get when i add that else, to account for the negation is literally:
Elzie (0)
Emer (0)
Enna (0)
Enriqueta (0)
Eola (0)
Eppie (0)
Ercell (0)
Estellar (0)
It's really tough to help you properly without better information, so I've written this, which looks for each name from the names file in the master data file and displays the associated values
There's never a reason to write a long list of declarations like that at the top of a program, and you've written way too much code before you started debugging. You should write no more than three or four lines of code before you test that it works and carry on adding to it. You've ended up with 140 lines — mostly of them comments — that don't do what you want, and you're now lost as to what you should fix first
I haven't been able to fathom what all your different counters are for, or why you're subtracting a magic 2224 for male records, so I've just printed the data directly from the master file
I hope you'll agree that it's far clearer with the variables declared when they're required instead of making a huge list at the top of your program. I've dropped the arrays #first_name, #gender and #number because you were only ever using the latest value so they had no purpose
#!/usr/bin/perl
use strict;
use warnings;
use v5.16.0;
use autodie;
use Text::CSV;
STDOUT->autoflush;
if ( #ARGV != 2 ) {
die "Usage: readTopNames.pl <names file> <master names file>\n";
}
my ( $names_file, $master_file ) = #ARGV;
my #names = do {
open my $fh, '<', $names_file;
<$fh>;
};
chomp #names;
my #master_data = do {
open my $fh, '<', $master_file;
<$fh>;
};
chomp #master_data;
my $csv = Text::CSV->new;
for my $i ( 0 .. $#names ) {
my $target_name = $names[$i];
my $found;
for my $j ( 0 .. $#master_data ) {
my $master_rec = $master_data[$j];
my $status = $csv->parse($master_rec);
unless ( $status ) {
warn qq{Line/record "$master_rec" could not be parsed\n};
next;
}
my ( $name, $gender, $count ) = $csv->fields;
if ( $name eq $target_name ) {
$found = 1;
printf "%s %s (%d)\n", $name, $gender, $count;
}
}
unless ( $found ) {
printf "%s (%d)\n", $target_name, 0;
}
}
output
Adam F (7)
Adam M (5293)
Alan F (9)
Alan M (2490)
Name (0)
Alejandro F (6)
Alejandro M (2593)
Alex F (157)
Alex M (3159)
Alicia F (967)
Ambrose M (87)
Caleb F (14)
Caleb M (9143)
4 changes proposed:
foreach my $class_record (#records)
{
chomp $class_record;
$record_count = 0;
$count = 0;
# add found - modification A
my $found = 0;
foreach my $name_record ( #records2 )
{
# should not be here
#$record_count++;
if ($csv->parse($name_record))
{
my #master_fields = $csv->fields();
$record_count++;
$first_name[$record_count] = $master_fields[0];
$gender[$record_count] = $master_fields[1];
$number[$record_count] = $master_fields[2];
if($class_record eq $first_name[$record_count])
{
if($gender[$record_count] eq 'F')
{
print("$first_name[$record_count] ($record_count)\n");
}
if($gender[$record_count] eq 'M')
{
my $offset = $count - 2224;
print("$first_name[$record_count] ($offset)\n");
}
# modification B - set found =1
$found = 1;
#last; # no need to keep looping
next; # find next one if try to find more than 1
}
} else {
warn "Line/record could not be parsed: $records[$record_count]\n";
}
$count++;
}
# modification C -
if($found){
}else{
print "${class_record}(0)\n";
}
}

unix - breakdown of how many records in file with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many records (as defined by a specific EOL such as "^%!") had how many occurrences of a specfic character. (sorted descending by the number of occurrences)
For example, with this sample file:
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
Suggested input: delimiter EOL and filename as arguments.
bash/perl some_script_name ",|" "^%!" samplefile
Desired output:
occs count
3 1
2 1
1 2
0 2
This is because the 1st record had one delimiter, 2nd record had 2, 3rd record had 0, 4th record had 3, 5th record had 1, 6th record had 0.
Bonus pts if you can make the delimiter and EOL argument accept hex input (ie 2C7C) or normal character input (ie ,|) .
Script:
#!/usr/bin/perl
use strict;
$/ = $ARGV[1];
open my $fh, '<', $ARGV[2] or die $!;
my #records = <$fh> and close $fh;
$/ = $ARGV[0];
my %counts;
$counts{(split $_)-1}++ for #records;
delete $counts{-1};
print "$_\t$counts{$_}\n" for (reverse sort keys %counts);
Test:
perl script.pl ',|' '^%!' samplefile
Output:
3 1
2 1
1 2
0 2
This is what perl lives for:
#!perl -w
use 5.12.0;
my ($delim, $eol, $file) = #ARGV;
open my $fh, "<$file" or die "error opening $file $!";
$/ = $eol; # input record separator
my %counts;
while (<$fh>) {
my $matches = () = $_ =~ /(\Q$delim\E)/g; # "goatse" operator
$counts{$matches}++;
}
say "occs\tcount";
foreach my $num (reverse sort keys %counts) {
say "$num\t$counts{$num}";
}
(if you haven't got 5.12, remove the "use 5.12" line and replace the say with print)
A solution in awk:
BEGIN {
RS="\\^%!"
FS=",\\|"
max_occ = 0
}
{
if(match($0, "^ *$")) { # This is here to deal with the final separator.
next
}
if(NF - 1 > max_occ) {
max_occ = NF - 1
}
count[NF - 1]=count[NF - 1] + 1
}
END {
printf("occs count\n")
for(i = 0; i <= max_occ; i++) {
printf("%s %s\n", i, count[i])
}
}
Well, there's one more empty record at the end of the file which has 0. So, here's a script to do what you wanted. Adding headers and otherwise tweaking the printf output is left as an excercise for you. :)
Basically, read the whole file in, split it into records, and for each record, use a /g regex to count the sub-delimiters. Since /g returns an array of all matches, use #{[]} to make an arrayref then deref that in scalar context to get a count. There has to be a more elegant solution to that particular part of the problem, but whatever; it's perl line noise. ;)
user#host[/home/user]
$ ./test.pl ',|' '^%!' test.in
3 1
2 1
1 2
0 3
user#host[/home/user]
$ cat test.in
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
user#host[/home/user]
$ cat test.pl
#!/usr/bin/perl
my( $subdelim, $delim, $in,) = #ARGV;
$delim = quotemeta $delim;
$subdelim = quotemeta $subdelim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $?\n};
foreach( split(/$delim/, join(q{}, <F>)) ){
$counts{ scalar(#{[m/.*?($subdelim)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
And here's a modified version which only keeps fields which contain at least one non-space character. That removes the last field, but also has the consequence of removing any other empty fields. It also uses $/ and \Q\E to reduce a couple of explicit function calls (thank, Alex). And, like the previous one, it works with strict + warnings;
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = #ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $?\n};
foreach ( grep(/\S/, <F>) ){
$counts{ scalar(#{[m/.*?(\Q$subdelim\E)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
If you really only want to remove the last record unconditionally, I'm partial to using pop:
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = #ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $?\n};
my #lines = <F>;
pop #lines;
$counts{ scalar(#{[m/.*?(\Q$subdelim\E)/g]}) }++ foreach (#lines);
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);

Why is my word frequency counter example written in Perl failing to produce useful output?

I am very new to Perl, and I am trying to write a word frequency counter as a learning exercise.
However, I am not able to figure out the error in my code below, after working on it. This is my code:
$wa = "A word frequency counter.";
#wordArray = split("",$wa);
$num = length($wa);
$word = "";
$flag = 1; # 0 if previous character was an alphabet and 1 if it was a blank.
%wordCount = ("null" => 0);
if ($num == -1) {
print "There are no words.\n";
} else {
print "$length";
for $i (0 .. $num) {
if(($wordArray[$i]!=' ') && ($flag==1)) { # start of a new word.
print "here";
$word = $wordArray[$i];
$flag = 0;
} elsif ($wordArray[$i]!=' ' && $flag==0) { # continuation of a word.
$word = $word . $wordArray[$i];
} elsif ($wordArray[$i]==' '&& $flag==0) { # end of a word.
$word = $word . $wordArray[$i];
$flag = 1;
$wordCount{$word}++;
print "\nword: $word";
} elsif ($wordArray[$i]==" " && $flag==1) { # series of blanks.
# do nothing.
}
}
for $i (keys %wordCount) {
print " \nword: $i - count: $wordCount{$i} ";
}
}
It's neither printing "here", nor the words. I am not worried about optimization at this point, though any input in that direction would also be much appreciated.
This is a good example of a problem where Perl will help you work out what's wrong if you just ask it for help. Get used to always adding the lines:
use strict;
use warnings;
to the top of your Perl programs.
Fist off,
$wordArray[$i]!=' '
should be
$wordArray[$i] ne ' '
according to the Perl documentation for comparing strings and characters. Basically use numeric operators (==, >=, …) for numbers, and string operators for text (eq, ne, lt, …).
Also, you could do
#wordArray = split(" ",$wa);
instead of
#wordArray = split("",$wa);
and then #wordArray wouldn't need to do the wonky character checking and you never would have had the problem. #wordArray will be split into the words already and you'll just have to count the occurrences.
You seem to be writing C in Perl. The difference is not just one of style. By exploding a string into a an array of individual characters, you cause the memory footprint of your script to explode as well.
Also, you need to think about what constitutes a word. Below, I am not suggesting that any \w+ is a word, rather pointing out the difference between \S+ and \w+.
#!/usr/bin/env perl
use strict; use warnings;
use YAML;
my $src = '$wa = "A word frequency counter.";';
print Dump count_words(\$src, 'w');
print Dump count_words(\$src, 'S');
sub count_words {
my $src = shift;
my $class = sprintf '\%s+', shift;
my %counts;
while ($$src =~ /(?<sequence> $class)/gx) {
$counts{ $+{sequence} } += 1;
}
return \%counts;
}
Output:
---
A: 1
counter: 1
frequency: 1
wa: 1
word: 1
---
'"A': 1
$wa: 1
=: 1
counter.";: 1
frequency: 1
word: 1

Resources