Sorting files with unordered multi-part key - linux

Using any combination of Linux tools (without resorting to a full-featured programming language), how can I sort this list
A,C 1
C,B 2
B,A 3
into
A,B 3
A,C 1
B,C 2

Not applying for any beauty contest, this seems to come close:
#!/bin/bash
while read one two; do
    one=`echo $one | sed -e 's/,/\n/g' | sort | sed -e '
        1 {h; d}
        $! {H; d}
        H; g; s/\n/,/g;
    '`
    echo $one $two
done | sort

Change the internal field separator, then compare the first two letters with ">":
(
    IFS=" ,";
    while read a b n; do
        if [ "$a" \> "$b" ]; then
            echo "$b,$a $n";
        else
            echo "$a,$b $n";
        fi;
    done;
) <<EOF | sort
A,C 1
C,B 2
B,A 3
EOF
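
For comparison, the same normalize-then-sort idea also fits in a Perl one-liner (a sketch; input.txt stands in for the real file). It autosplits each line into @F, sorts the comma-separated key parts, and then lets the outer sort order the lines:

perl -lane 'print join(",", sort split /,/, $F[0]), " $F[1]"' input.txt | sort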

In case somebody is interested: I was not really satisfied with any of the suggestions, probably because I had hoped for a few-line solution and, as far as I know, none exists.
Anyway, I wrote a utility called ljoin (for left join, as in databases) which does exactly what I was asking for (of course :D)
#!/usr/bin/perl

=head1 NAME

ljoin.pl - Utility to left join files by specified key column(s)

=head1 SYNOPSIS

ljoin.pl [OPTIONS] <INFILE1>..<INFILEN> <OUTFILE>

To successfully join rows one must supply at least one input file and exactly one output file. Input files can be real file names or a pattern, like [ABC].txt or *.in etc.

=head1 DESCRIPTION

This utility merges multiple files into one, using the specified column as a key.

=head2 OPTIONS

=over 4

=item --field-separator=<separator>, -fs <separator>

Specifies what string should be used to separate columns in the plain file. The default value for this option is the tab symbol.

=item --no-sort-fields, -no-sf

Do not sort columns when creating a key for merging files.

=item --complex-key-separator=<separator>, -ks <separator>

Specifies what string should be used to separate multiple values in a multikey column. For example "A B" in one file can be presented as "B A", meaning that this application should somehow understand that this is the same key. The default value for this option is the space symbol.

=item --no-sort-complex-keys, -no-sk

Do not sort complex column values when creating a key for merging files.

=item --include-primary-field, -i

Specifies whether the key which is used to find matching lines in multiple files should be included in the output file. The first column in the output file will be the key in any case, but in the case of a complex column the value of the first column will be sorted. The default value for this option is false.

=item --primary-field-index=<index>, -f <index>

Specifies the index of the column which should be used for matching lines. You can use multiple instances of this option to specify a multi-column key made of more than one column, like this: "-f 0 -f 1"

=item --help, -?

Get help and documentation

=back

=cut
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

my $fieldSeparator      = "\t";
my $complexKeySeparator = " ";
my $includePrimaryField = 0;
my $containsTitles      = 0;
my $sortFields          = 1;
my $sortComplexKeys     = 1;
my @primaryFieldIndexes;

GetOptions(
    "field-separator|fs=s"       => \$fieldSeparator,
    "sort-fields|sf!"            => \$sortFields,
    "complex-key-separator|ks=s" => \$complexKeySeparator,
    "sort-complex-keys|sk!"      => \$sortComplexKeys,
    "contains-titles|t!"         => \$containsTitles,
    "include-primary-field|i!"   => \$includePrimaryField,
    "primary-field-index|f=i@"   => \@primaryFieldIndexes,
    "help|?!"                    => sub { pod2usage(0) }
) or pod2usage(2);

pod2usage(0) if $#ARGV < 1;
push @primaryFieldIndexes, 0 if $#primaryFieldIndexes < 0;

# Remember which column indexes form the key (fixed: use the index values
# themselves as hash keys, not the loop counter)
my %primaryFieldIndexesHash;
foreach my $index (@primaryFieldIndexes)
{
    $primaryFieldIndexesHash{$index} = 1;
}

print "fieldSeparator = $fieldSeparator\n";
print "complexKeySeparator = $complexKeySeparator\n";
print "includePrimaryField = $includePrimaryField\n";
print "containsTitles = $containsTitles\n";
print "primaryFieldIndexes = @primaryFieldIndexes\n";
print "sortFields = $sortFields\n";
print "sortComplexKeys = $sortComplexKeys\n";
my $fieldsCount = 0;
my %keys_hash   = ();
my %files       = ();
my %titles      = ();

# Read columns into memory
foreach my $argnum (0 .. ($#ARGV - 1))
{
    # Find files matching the specified pattern
    my $filePattern  = $ARGV[$argnum];
    my @matchedFiles = glob($filePattern);
    foreach my $inputPath (@matchedFiles)
    {
        open INPUT_FILE, $inputPath or die $!;
        my %lines;
        my $lineNumber = -1;
        while (my $line = <INPUT_FILE>)
        {
            $lineNumber++;   # fixed: the counter was never incremented
            next if $containsTitles && $lineNumber == 0;
            # Don't use chomp. It doesn't handle Unix input files on Windows and vice versa
            $line =~ s/[\r\n]+$//g;
            # Skip lines that don't have columns
            next if $line !~ m/\Q$fieldSeparator\E/;
            # Split fields and count them (store the maximum number of columns for later use)
            my @fields = split($fieldSeparator, $line);
            $fieldsCount = $#fields + 1 if $#fields + 1 > $fieldsCount;
            # Sort complex key
            my @multipleKey;
            for (my $i = 0; $i <= $#primaryFieldIndexes; $i++)
            {
                my @complexKey = split($complexKeySeparator, $fields[$primaryFieldIndexes[$i]]);
                @complexKey = sort(@complexKey) if $sortComplexKeys;
                push @multipleKey, join($complexKeySeparator, @complexKey);
            }
            # Sort multiple keys and create the key string
            @multipleKey = sort(@multipleKey) if $sortFields;
            my $fullKey = join $fieldSeparator, @multipleKey;
            $lines{$fullKey}     = \@fields;
            $keys_hash{$fullKey} = 1;
        }
        close INPUT_FILE;
        $files{$inputPath} = \%lines;
    }
}
# Open the output file
my $outputPath = $ARGV[$#ARGV];
open OUTPUT_FILE, ">" . $outputPath or die $!;
my @keys = sort keys(%keys_hash);

# Leave blank places for the key columns
for (my $pf = 0; $pf <= $#primaryFieldIndexes; $pf++)
{
    print OUTPUT_FILE $fieldSeparator;
}

# Print column headers
foreach my $argnum (0 .. ($#ARGV - 1))
{
    my $filePattern  = $ARGV[$argnum];
    my @matchedFiles = glob($filePattern);
    foreach my $inputPath (@matchedFiles)
    {
        print OUTPUT_FILE $inputPath;
        for (my $f = 0; $f < $fieldsCount - $#primaryFieldIndexes - 1; $f++)
        {
            print OUTPUT_FILE $fieldSeparator;
        }
    }
}
print OUTPUT_FILE "\n";

# Print merged columns
foreach my $key (@keys)
{
    print OUTPUT_FILE $key;
    foreach my $argnum (0 .. ($#ARGV - 1))
    {
        my $filePattern  = $ARGV[$argnum];
        my @matchedFiles = glob($filePattern);
        foreach my $inputPath (@matchedFiles)
        {
            my $lines = $files{$inputPath};
            for (my $i = 0; $i < $fieldsCount; $i++)
            {
                next if exists $primaryFieldIndexesHash{$i} && !$includePrimaryField;
                print OUTPUT_FILE $fieldSeparator;
                print OUTPUT_FILE $lines->{$key}->[$i] if exists $lines->{$key}->[$i];
            }
        }
    }
    print OUTPUT_FILE "\n";
}
close OUTPUT_FILE;
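
Assuming the script is saved as ljoin.pl, the example at the top of this question could then be handled with an invocation along these lines (hypothetical; the flags are those documented in the POD above, with space as the field separator, comma as the complex-key separator, and column 0 as the key):

./ljoin.pl -fs " " -ks "," -f 0 input.txt output.txt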

Related

Split string into fixed-length chunks and write each to a separate line in Raku

I have a file test.txt:
Stringsplittingskills
I want to read this file and write it to another file out.txt with three characters on each line, like
Str
ing
spl
itt
ing
ski
lls
What I did
my $string = "test.txt".IO.slurp;
my $start = 0;
my $elements = $string.chars;

# open file in writing mode
my $file_handle = "out.txt".IO.open: :w;

while $start < $elements {
    my $line = $string.substr($start, 3);
    if $line.chars == 3 {
        $file_handle.print("$line\n")
    } elsif $line.chars < 3 {
        $file_handle.print("$line")
    }
    $start = $start + 3;
}

# close file handle
$file_handle.close
This runs fine when the length of the string is not a multiple of 3. When the string length is a multiple of 3, it inserts an extra newline at the end of the output file. How can I avoid inserting a newline at the end when the string length is a multiple of 3?
I tried another shorter approach,
my $string = "test.txt".IO.slurp;
my $file_handle = "out.txt".IO.open: :w;
for $string.comb(3) -> $line {
    $file_handle.print("$line\n")
}
It still suffers from the same issue.
I looked here and here, but am still unable to solve it.
spurt "out.txt", "test.txt".IO.comb(3).join("\n")
Another approach using substr-rw.
subset PositiveInt of Int where * > 0;

sub break( Str $str is copy, PositiveInt $length )
{
    my $i = $length;
    while $i < $str.chars
    {
        $str.substr-rw( $i, 0 ) = "\n";
        $i += $length + 1;
    }
    $str;
}

say break("12345678", 3);
Output
123
456
78
The correct answer is of course to use .comb and .join.
That said, this is how you might fix your code.
You could change the if line to check if it is at the end, and use else.
if $start+3 < $elements {
    $file_handle.print("$line\n")
} else {
    $file_handle.print($line)
}
Personally I would change it so that only the addition of \n is conditional.
while $start < $elements {
    my $line = $string.substr($start,3);
    $file_handle.print( $line ~ ( "\n" x ($start+3 < $elements) ));
    $start += 3;
}
This works because < returns either True or False.
Since True == 1 and False == 0, the x operator repeats the \n at most once.
'abc' x 1; # 'abc'
'abc' x True; # 'abc'
'abc' x 0; # ''
'abc' x False; # ''
If you were very cautious you could use x+?.
(Which is actually 3 separate operators.)
'abc' x 3; # 'abcabcabc'
'abc' x+? 3; # 'abc'
infix:« x »( 'abc', prefix:« + »( prefix:« ? »( 3 ) ) );
I would probably use loop if I were going to structure it like this.
loop ( my $start = 0; $start < $elements ; $start += 3 ) {
    my $line = $string.substr($start,3);
    $file_handle.print( $line ~ ( "\n" x ($start+3 < $elements) ));
}
Or instead of adding a newline to the end of each line, you could add it to the beginning of every line except the first.
while $start < $elements {
    my $line = $string.substr($start,3);
    my $nl = "\n";
    # clear $nl the first time through
    once $nl = "";
    $file_handle.print($nl ~ $line);
    $start = $start + 3;
}
At the command-line prompt, three one-liner solutions are shown below.
Using comb and batch (retains incomplete set of 3 letters at end):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.join.put for .comb.batch(3);'
Str
ing
spl
itt
ing
ski
lls
X
Simplifying (no batch, only comb):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.put for .comb(3);'
Str
ing
spl
itt
ing
ski
lls
X
Alternatively, using comb and rotor (discards incomplete set of 3 letters at end):
~$ echo 'StringsplittingskillsX' | perl6 -ne '.join.put for .comb.rotor(3);'
Str
ing
spl
itt
ing
ski
lls

Perl/Linux filtering large file with content of another file

I'm filtering a 580 MB file using the content of another smaller file.
File1 (smaller file)
chr start End
1 123 150
2 245 320
2 450 600
File2 (large file)
chr pos RS ID A B C D E F
1 124 r2 3 s 4 s 2 s 2
1 165 r6 4 t 2 k 1 r 2
2 455 t2 4 2 4 t 3 w 3
3 234 r4 2 5 w 4 t 2 4
I would like to capture lines from File2 if the following criteria are met:
File2.Chr == File1.Chr && File2.Pos > File1.Start && File2.Pos < File1.End
I've tried using awk but it runs very slowly, so I was wondering if there is a better way to accomplish the same thing?
Thank you.
Here is the code that I’m using:
#!/usr/bin/perl -w
use strict;
use warnings;

my $bed_file   = "/data/1000G/Hotspots.bed";   # File1, smaller file
my $SNP_file   = "/data/1000G/SNP_file.txt";   # File2, larger file
my $final_file = "/data/1000G/final_file.txt"; # final output file

open my $in_fh, '<', $bed_file
    or die qq{Unable to open "$bed_file" for input: $!};

while ( <$in_fh> ) {
    my $line_str = $_;
    my @data = split(/\t/, $line_str);
    next if /\b(?:track)\b/; # skip header line
    my $chr = $data[0]; $chr =~ s/chr//g; print "chr is $chr\n";
    my $start = $data[1]-1; print "start is $start\n";
    my $end = $data[2]+1; print "end is $end\n";
    my $cmd1 = "awk '{if(\$1==$chr && \$2>$start && \$2<$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file"; print "cmd1\n";
    my $cmd2 = `awk '{if(\$1==$chr && \$2>$start && \$2<$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file`; print "cmd2\n";
}
Read the small file into a data structure and check every line of the other file against it.
Here I read it into an array, each element being an arrayref with fields from a line. Then each line of the data file is checked against the arrayrefs in this array, comparing fields per requirements.
use warnings 'all';
use strict;

my $ref_file = 'reference.txt';
open my $fh, '<', $ref_file or die "Can't open $ref_file: $!";
my @ref = map { chomp; [ split ] } grep { /\S/ } <$fh>;

my $data_file = 'data.txt';
open $fh, '<', $data_file or die "Can't open $data_file: $!";

# Drop header lines
my $ref_header  = shift @ref;
my $data_header = <$fh>;

while (<$fh>)
{
    next if not /\S/; # skip empty lines
    my @line = split;
    foreach my $refline (@ref)
    {
        next if $line[0] != $refline->[0];
        if ($line[1] > $refline->[1] and $line[1] < $refline->[2]) {
            print "@line\n";
        }
    }
}
close $fh;
This prints out correct lines from the provided samples. It allows for multiple lines to match. If this somehow can't be, add last in the if block to exit the foreach once a match is found.
A few comments on the code. Let me know if more can be useful.
When reading the reference file, <$fh> is used in list context so it returns all lines, and grep filters out the empty ones. The map first chomps the newline and then makes an arrayref with [ ], the elements being the fields on the line obtained by split. The output list is assigned to @ref.
When we reuse $fh it is first closed (if it was open), so there is no need for an explicit close.
I store the header lines just so, perhaps to print or check. We really only need to exclude them.
Another way, this time storing the smaller file in a Hash of Arrays (HoA) based on the 'chr' field:
use strict;
use warnings;

my $small_file = 'small.txt';
my $large_file = 'large.txt';

open my $small_fh, '<', $small_file or die $!;

my %small;
while (<$small_fh>){
    next if $. == 1;
    my ($chr, $start, $end) = split /\s+/, $_;
    push @{ $small{$chr} }, [$start, $end];
}
close $small_fh;

open my $large_fh, '<', $large_file or die $!;

while (my $line = <$large_fh>){
    my ($chr, $pos) = (split /\s+/, $line)[0, 1];
    if (defined $small{$chr}){
        for (@{ $small{$chr} }){
            if ($pos > $_->[0] && $pos < $_->[1]){
                print $line;
            }
        }
    }
}
Put them into a SQLite database, do a join. This will be much faster and less buggy and use less memory than trying to write something yourself. And it's more flexible, now you can just do SQL queries on the data, you don't have to keep writing new scripts and reparsing the files.
You can import them by parsing and inserting yourself, or you can convert them to CSV and use SQLite's CSV import ability. Converting to CSV with that simple data can be as easy as s{ +}{,}g or you can use the full blown and very fast Text::CSV_XS.
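For instance, that substitution as a throwaway one-liner might look like this (a sketch; file names are illustrative):

perl -pe 's{ +}{,}g' File2 > file2.csv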
Your tables look like this (you'll want to use better names for the tables and fields).
create table file1 (
    chr integer not null,
    start integer not null,
    "end" integer not null  -- "end" is a reserved word in SQLite, so quote it
);

create table file2 (
    chr integer not null,
    pos integer not null,
    rs integer not null,
    id integer not null,
    a char not null,
    b char not null,
    c char not null,
    d char not null,
    e char not null,
    f char not null
);
Create some indexes on the columns you'll be searching on. Indexes will slow down the import, so make sure you do this after the import.
create index chr_file1 on file1 (chr);
create index chr_file2 on file2 (chr);
create index pos_file2 on file2 (pos);
create index start_file1 on file1 (start);
create index end_file1 on file1 (end);
And do the join.
select *
from file2
join file1 on file1.chr = file2.chr
where file2.pos between file1.start and file1."end";
1,124,r2,3,s,4,s,2,s,2,1,123,150
2,455,t2,4,2,4,t,3,w,3,2,450,600
You can do this in Perl via DBI and the DBD::SQLite driver.
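Here is a minimal DBI sketch of that last step (an illustration, not the answer's code: the database file name genome.db is an assumption, and the tables are presumed already imported as described above):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the SQLite database created during the import step.
my $dbh = DBI->connect("dbi:SQLite:dbname=genome.db", "", "",
                       { RaiseError => 1 });

# Prepare and run the join; "end" is quoted because it is a reserved word.
my $sth = $dbh->prepare(q{
    select file2.*
    from file2
    join file1 on file1.chr = file2.chr
    where file2.pos between file1.start and file1."end"
});
$sth->execute;

# Print each matching row as comma-separated values.
while ( my @row = $sth->fetchrow_array ) {
    print join(",", @row), "\n";
}
$dbh->disconnect;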
As said before, calling awk on each iteration is very slow. A full awk solution would be possible, and I just saw a Perl solution, so here's my Python solution, as the OP wouldn't mind one:
create a dictionary from the small file: chr => list of (start, end) couples
iterate through the big file and, when the chr matches, check whether the position lies between one of the start/end tuples.
Code:
import collections

with open("smallfile.txt") as f:
    next(f)  # skip title
    # build a dictionary with chr as key, and a list of (start, end) tuples as values
    d = collections.defaultdict(list)
    for line in f:
        toks = line.split()
        if len(toks) == 3:
            d[int(toks[0])].append((int(toks[1]), int(toks[2])))

with open("largefile.txt") as f:
    next(f)  # skip title
    for line in f:
        toks = line.split()
        chr_tok = int(toks[0])
        if chr_tok in d:
            # key is in dictionary
            pos = int(toks[1])
            if any(t[0] < pos < t[1] for t in d[chr_tok]):
                print(line.strip())
We could be slightly faster by sorting the list of tuples and applying bisect to avoid the linear search. That is necessary only if the list of tuples is big in the "small" file.
awk power with a single pass. Your code iterates over file2 as many times as there are lines in file1, so the execution time grows linearly. Please let me know if this single-pass solution is slower than the other solutions. (Note that the a[$1][i][0] arrays-of-arrays syntax requires GNU awk 4.0 or later.)
awk 'NR==FNR {
    i = b[$1];        # get the next index for the chr
    a[$1][i][0] = $2; # store start
    a[$1][i][1] = $3; # store end
    b[$1]++;          # increment the next index
    next;
}
{
    p = 0;
    if ($1 in a) {
        for (i in a[$1]) {
            if ($2 > a[$1][i][0] && \
                $2 < a[$1][i][1])
                p = 1 # set p if $2 is in range
        }
    }
}
p {print}' file1 file2
One-Liner
awk 'NR==FNR {i = b[$1];a[$1][i][0] = $2; a[$1][i][1] = $3; b[$1]++;next; }{p = 0;if ($1 in a){for(i in a[$1]){if($2>a[$1][i][0] && $2<a[$1][i][1])p=1}}}p' file1 file2

Perl Conditions

I'm trying to iterate through two files. Everything works, although once I get to the negation of my if statement it messes everything up. The only thing that will print is the else statement.
Please disregard any unused variables; I will clean them up after.
#!/usr/bin/perl
#
# Packages and modules
#
use strict;
use warnings;
use version; our $VERSION = qv('5.16.0'); # This is the version of Perl to be used
use Text::CSV 1.32;                       # We will be using the CSV module (version 1.32 or higher)
                                          # to parse each line
#
# readFile.pl
# Authors: schow04@mail.uoguelph + anilam@mail.uoguelph.ca
# Project: Lab Assignment 1 Script (Iteration 0)
# Date of Last Update: Monday, November 16, 2015.
#
# Functional Summary
#    readFile.pl takes in a CSV (comma separated version) file
#    and prints out the fields.
#    There are three fields:
#        1. name
#        2. gender (F or M)
#        3. number of people with this name
#
#    This code will also count the number of female and male
#    names in this file and print this out at the end.
#
#    The file represents the names of people in the population
#    for a particular year of birth in the United States of America.
#    Officially it is the "National Data on the relative frequency
#    of given names in the population of U.S. births where the individual
#    has a Social Security Number".
#
# Commandline Parameters: 1
#    $ARGV[0] = name of the input file containing the names
#
# References
#    Name files from http://www.ssa.gov/OACT/babynames/limits.html
#
#
# Variables to be used
#
my $EMPTY = q{};
my $SPACE = q{ };
my $COMMA = q{,};
my $femalecount = 0;
my $malecount = 0;
my $lines = 0;
my $filename = $EMPTY;
my $filename2 = $EMPTY;
my @records;
my @records2;
my $record_count = -1;
my $top_number = 0;
my $male_total = 0;
my $male_count = 0;
my @first_name;
my @gender;
my @first_name2;
my @number;
my $count = 0;
my $count2 = 0;
my $csv = Text::CSV->new({ sep_char => $COMMA });
#
# Check that you have the right number of parameters
#
if ($#ARGV != 1) {
    print "Usage: readTopNames.pl <names file> <course names file>\n"
        or die "Print failure\n";
    exit;
}
$filename  = $ARGV[0];
$filename2 = $ARGV[1];

#
# Open the input files and load the contents into the records arrays
#
open my $names_fh, '<', $filename
    or die "Unable to open names file: $filename\n";
@records = <$names_fh>;
close $names_fh
    or die "Unable to close: $ARGV[0]\n"; # Close the input file

open my $names_fh2, '<', $filename2
    or die "Unable to open names file: $filename2\n";
@records2 = <$names_fh2>;
close $names_fh2
    or die "Unable to close: $ARGV[1]\n"; # Close the input file
#
# Parse each line and store the information in arrays
# representing each field
#
# Extract each field from each name record as delimited by a comma
#
foreach my $class_record (@records)
{
    chomp $class_record;
    $record_count = 0;
    $count = 0;
    foreach my $name_record ( @records2 )
    {
        if ($csv->parse($name_record))
        {
            my @master_fields = $csv->fields();
            $record_count++;
            $first_name[$record_count] = $master_fields[0];
            $gender[$record_count]     = $master_fields[1];
            $number[$record_count]     = $master_fields[2];
            if ($class_record eq $first_name[$record_count])
            {
                if ($gender[$record_count] eq 'F')
                {
                    print("$first_name[$record_count] ($record_count)\n");
                }
                if ($gender[$record_count] eq 'M')
                {
                    my $offset = $count - 2224;
                    print("$first_name[$record_count] ($offset)\n");
                }
            }
        } else {
            warn "Line/record could not be parsed: $records[$record_count]\n";
        }
        $count++;
    }
}
#
# End of Script
#
Adam (187)
Alan (431)
Alejandro (1166)
Alex (120)
Alicia (887)
Ambrose (305)
Caleb (794)
Sample output from running the code above.
This is correct, although if a name is not found in the second file it is supposed to say:
Adam (187)
Alan (431)
Name (0)
Alejandro (1166)
Alex (120)
Alicia (887)
Ambrose (305)
Caleb (794)
That is what the else is supposed to handle: the case where the if statement matched nothing.
else {
    print("$first_name[$record_count] (0)\n");
}
The output that I get when I add that else to account for the negation is literally:
Elzie (0)
Emer (0)
Enna (0)
Enriqueta (0)
Eola (0)
Eppie (0)
Ercell (0)
Estellar (0)
It's really tough to help you properly without better information, so I've written this, which looks for each name from the names file in the master data file and displays the associated values.
There's never a reason to write a long list of declarations like that at the top of a program, and you've written way too much code before you started debugging. You should write no more than three or four lines of code before you test that they work and carry on adding to them. You've ended up with 140 lines, most of them comments, that don't do what you want, and you're now lost as to what you should fix first.
I haven't been able to fathom what all your different counters are for, or why you're subtracting a magic 2224 for male records, so I've just printed the data directly from the master file.
I hope you'll agree that it's far clearer with the variables declared when they're required instead of in a huge list at the top of your program. I've dropped the arrays @first_name, @gender and @number because you were only ever using the latest value, so they had no purpose.
#!/usr/bin/perl
use strict;
use warnings;
use v5.16.0;
use autodie;
use Text::CSV;

STDOUT->autoflush;

if ( @ARGV != 2 ) {
    die "Usage: readTopNames.pl <names file> <master names file>\n";
}

my ( $names_file, $master_file ) = @ARGV;

my @names = do {
    open my $fh, '<', $names_file;
    <$fh>;
};
chomp @names;

my @master_data = do {
    open my $fh, '<', $master_file;
    <$fh>;
};
chomp @master_data;

my $csv = Text::CSV->new;

for my $i ( 0 .. $#names ) {
    my $target_name = $names[$i];
    my $found;

    for my $j ( 0 .. $#master_data ) {
        my $master_rec = $master_data[$j];
        my $status     = $csv->parse($master_rec);
        unless ( $status ) {
            warn qq{Line/record "$master_rec" could not be parsed\n};
            next;
        }
        my ( $name, $gender, $count ) = $csv->fields;
        if ( $name eq $target_name ) {
            $found = 1;
            printf "%s %s (%d)\n", $name, $gender, $count;
        }
    }
    unless ( $found ) {
        printf "%s (%d)\n", $target_name, 0;
    }
}
output
Adam F (7)
Adam M (5293)
Alan F (9)
Alan M (2490)
Name (0)
Alejandro F (6)
Alejandro M (2593)
Alex F (157)
Alex M (3159)
Alicia F (967)
Ambrose M (87)
Caleb F (14)
Caleb M (9143)
4 changes proposed:
foreach my $class_record (@records)
{
    chomp $class_record;
    $record_count = 0;
    $count = 0;
    # add found - modification A
    my $found = 0;
    foreach my $name_record ( @records2 )
    {
        # should not be here
        #$record_count++;
        if ($csv->parse($name_record))
        {
            my @master_fields = $csv->fields();
            $record_count++;
            $first_name[$record_count] = $master_fields[0];
            $gender[$record_count]     = $master_fields[1];
            $number[$record_count]     = $master_fields[2];
            if ($class_record eq $first_name[$record_count])
            {
                if ($gender[$record_count] eq 'F')
                {
                    print("$first_name[$record_count] ($record_count)\n");
                }
                if ($gender[$record_count] eq 'M')
                {
                    my $offset = $count - 2224;
                    print("$first_name[$record_count] ($offset)\n");
                }
                # modification B - set found = 1
                $found = 1;
                #last; # no need to keep looping
                next;  # find the next one if trying to find more than 1
            }
        } else {
            warn "Line/record could not be parsed: $records[$record_count]\n";
        }
        $count++;
    }
    # modification C
    if ($found) {
    } else {
        print "${class_record}(0)\n";
    }
}

Why is my word frequency counter example written in Perl failing to produce useful output?

I am very new to Perl, and I am trying to write a word frequency counter as a learning exercise.
However, after working on it I am still not able to figure out the error in my code below. This is my code:
$wa = "A word frequency counter.";
@wordArray = split("",$wa);
$num = length($wa);
$word = "";
$flag = 1; # 0 if previous character was an alphabet and 1 if it was a blank.
%wordCount = ("null" => 0);

if ($num == -1) {
    print "There are no words.\n";
} else {
    print "$length";
    for $i (0 .. $num) {
        if (($wordArray[$i]!=' ') && ($flag==1)) { # start of a new word.
            print "here";
            $word = $wordArray[$i];
            $flag = 0;
        } elsif ($wordArray[$i]!=' ' && $flag==0) { # continuation of a word.
            $word = $word . $wordArray[$i];
        } elsif ($wordArray[$i]==' ' && $flag==0) { # end of a word.
            $word = $word . $wordArray[$i];
            $flag = 1;
            $wordCount{$word}++;
            print "\nword: $word";
        } elsif ($wordArray[$i]==" " && $flag==1) { # series of blanks.
            # do nothing.
        }
    }
    for $i (keys %wordCount) {
        print " \nword: $i - count: $wordCount{$i} ";
    }
}
It's neither printing "here", nor the words. I am not worried about optimization at this point, though any input in that direction would also be much appreciated.
This is a good example of a problem where Perl will help you work out what's wrong if you just ask it for help. Get used to always adding the lines:
use strict;
use warnings;
to the top of your Perl programs.
First off,
$wordArray[$i]!=' '
should be
$wordArray[$i] ne ' '
according to the Perl documentation for comparing strings and characters. Basically, use numeric operators (==, >=, …) for numbers, and string operators (eq, ne, lt, …) for text.
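A quick toy illustration of the difference (my example, not from the original answer): under ==, two non-numeric strings both numify to 0 and so compare as equal.

use strict;
use warnings;

# Numeric comparison: "a" and "b" both numify to 0, so this prints
# (and warns about non-numeric arguments under 'use warnings').
print "numeric: same\n" if "a" == "b";

# String comparison sees two different letters, so this prints nothing.
print "string: same\n" if "a" eq "b";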
Also, you could do
@wordArray = split(" ",$wa);
instead of
@wordArray = split("",$wa);
and then @wordArray wouldn't need the wonky character checking and you never would have had the problem. @wordArray will already be split into words, and you'll just have to count the occurrences.
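Putting those pieces together, a minimal counter along those lines might look like this (a sketch, not the poster's code; note that split " " splits on any run of whitespace):

use strict;
use warnings;

my $wa = "A word frequency counter.";

# Tally each whitespace-separated word.
my %wordCount;
$wordCount{$_}++ for split " ", $wa;

for my $word (sort keys %wordCount) {
    print "word: $word - count: $wordCount{$word}\n";
}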
You seem to be writing C in Perl. The difference is not just one of style: by exploding a string into an array of individual characters, you cause the memory footprint of your script to explode as well.
Also, you need to think about what constitutes a word. Below, I am not suggesting that any \w+ is a word, rather pointing out the difference between \S+ and \w+.
#!/usr/bin/env perl
use strict; use warnings;
use YAML;

my $src = '$wa = "A word frequency counter.";';

print Dump count_words(\$src, 'w');
print Dump count_words(\$src, 'S');

sub count_words {
    my $src = shift;
    my $class = sprintf '\%s+', shift;
    my %counts;
    while ($$src =~ /(?<sequence> $class)/gx) {
        $counts{ $+{sequence} } += 1;
    }
    return \%counts;
}
Output:
---
A: 1
counter: 1
frequency: 1
wa: 1
word: 1
---
'"A': 1
$wa: 1
=: 1
counter.";: 1
frequency: 1
word: 1

Calculating the Mean from a Perl Script

I'm still in here. ;)
I got this code from a very expert guy, and I'm too shy to ask him these basic questions... anyway, this is my question now: this Perl script prints the median of a space-delimited column of numbers, and I added some stuff to get its size. Now I'm trying to get the sum of the same column. I tried, but got no results. Did I not take the right column? ./stats.pl 1 columns.txt
#!/usr/bin/perl
use strict;
use warnings;

my $index    = shift;
my $filename = shift;
my $columns  = [];

open (my $fh, "<", $filename) or die "Unable to open $filename for reading\n";
for my $row (<$fh>) {
    my @vals = split /\s+/, $row;
    push @{$columns->[$_]}, $vals[$_] for 0 .. $#vals;
}
close $fh;

my @column  = sort {$a <=> $b} @{$columns->[$index]};
my $offset  = int($#column / 2);
my $length  = 2 - @column % 2;
my @medians = splice(@column, $offset, $length);

my $median;
$median += $_ for @medians;
$median /= @medians;
print "MEDIAN = $median\n";

################################################
my @elements = @{$columns->[$index]};
my $size = @elements;
print "SIZE = $size\n";
exit 0;
#################################################
my $sum = @{$columns->[$index]};
for (my $size=0; $size < $sum; $size++) {
    my $mean = $sum/$size;
};
print "$mean\n";
thanks in advance.
OK, some pointers to get you going:
You can put all the numbers into an array:
my @result = split /\s+/, $line;
#average
use List::Util qw(sum);
my $sum = sum(@result);
You can then access individual columns with $result[$index], where $index is the number of the column you want to access.
Also note that:
$total = $line + $total;
$count = $count + 1;
can be rewritten as:
$total += $line;
$count += 1;
Finally, make sure that you are reading the file:
put a "debugging" print into the while loop:
print $line, "\n";
This should get you going :)
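
Pulling those pointers together, the sum and mean for the selected column could be computed like this (a sketch that reuses the $columns and $index variables from the script above):

use List::Util qw(sum);

# @column holds the values of the selected column, as in the median code.
my @column = @{ $columns->[$index] };

my $sum  = sum(@column);    # total of the column
my $mean = $sum / @column;  # an array in scalar context is its element count

print "SUM = $sum\n";
print "MEAN = $mean\n";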
