Resolving Out of Memory error when executing Perl script - linux

I'm attempting to build an n-gram language model based on the top 100K words found in the English-language Wikipedia dump. I've already extracted the plain text with a modified XML parser written in Java, but need to convert it to a vocab file.
In order to do this, I found a Perl script that is said to do the job, but it lacks instructions on how to execute it. Needless to say, I'm a complete newbie to Perl and this is the first time I've encountered a need for it.
When I run this script on a 7.2GB text file, I get an Out of Memory error on two separate dual-core machines with 4GB RAM, running Ubuntu 10.04 and 10.10.
When I contacted the author, he said this script ran fine on a MacBook Pro with 4GB RAM, and the total in-memory usage was about 78 MB when executed on a 6.6GB text file with perl 5.12. The author also said that the script reads the input file line by line and creates a hashmap in memory.
The script is:
#! /usr/bin/perl
use FindBin;
use lib "$FindBin::Bin";
use strict;
require 'english-utils.pl';
## Create a list of words and their frequencies from an input corpus document
## (format: plain text, words separated by spaces, no sentence separators)
## TODO should words with hyphens be expanded? (e.g. three-dimensional)
my %dict;
my $min_len = 3;
my $min_freq = 1;
while (<>) {
    chomp($_);
    my @words = split(" ", $_);
    foreach my $word (@words) {
        # Check validity against regexp and acceptable use of apostrophe
        if ((length($word) >= $min_len) && ($word =~ /^[A-Z][A-Z\'-]+$/)
            && (index($word,"'") < 0 || allow_apostrophe($word))) {
            $dict{$word}++;
        }
    }
}
# Output words which occur $min_freq times or more
foreach my $dictword (keys %dict) {
    if ( $dict{$dictword} >= $min_freq ) {
        print $dictword . "\t" . $dict{$dictword} . "\n";
    }
}
I'm executing this script from the command line via mkvocab.pl corpus.txt
The included extra script is simply a regex script that tests the placement of apostrophes and whether they match English grammar rules.
I thought the memory problem was due to the different Perl versions, as 5.10 was installed on my machine. So I upgraded to 5.14, but the error still persists. According to free -m, I have approximately 1.5GB of free memory on my system.
As I am completely unfamiliar with the syntax and structure of the language, can you point out the problem areas, along with why the issue exists and how to fix it?

Loading a 7.2GB file into a hash could be possible if there is a lot of repetition among the words, e.g. "the" occurring 17,000 times, etc. It seems to be rather a lot, though.
Your script assumes that the lines in the file are of reasonable length. If your file does not contain line breaks, you will load the whole file into memory in $_, then double that memory load with split, and then add quite a lot more into your hash. That would strain any system.
One idea may be to use space " " as your input record separator. It will do approximately what you are already doing with split, except that it will leave other whitespace characters alone, and will not trim excess whitespace as prettily. For example:
$/ = " ";
while (<>) {
for my $word ( split ) { # avoid e.g. "foo\nbar" being considered one word
if (
(length($word) >= $min_len) &&
($word =~ /^[A-Z][A-Z\'-]+$/) &&
(index($word,"'") < 0 || allow_apostrophe($word))
) {
$dict{$word}++;
}
}
}
This will allow even very long lines to be read in bite size chunks, assuming you do have spaces between the words (and not tabs or newlines).

Try running
dos2unix corpus.txt
It is possible that you are reading the entire file as one line...
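If you want to confirm that this is what is happening before converting anything, a quick diagnostic one-liner (a sketch; it simply counts lines and tracks the longest one) will tell you whether the file is effectively a single huge line:
perl -ne '$n++; $max = length if length > $max; END { print "lines: $n, longest: $max\n" }' corpus.txt
Note that if the file really is one 7.2GB line, even this will pull that line into memory while measuring it, so keep an eye on it with top.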

Related

Perl string replacements of file paths in text file

I'm trying to match file paths in a text file and replace them with their share file path. E.g. I want to replace the string "X:\Group_14\Project_Security" with "\\Project_Security$".
I'm having a problem getting my head around the syntax, as I have to use a backslash (\) to escape another backslash (\\), but this does not seem to work for matching a path in a text file.
open INPUT, '< C:\searchfile.txt';
open OUTPUT, '> C:\logsearchfiletest.txt';
@lines = <INPUT>;
%replacements = (
    "X:\\Group_14\\Project_Security" => "\\\\Project_Security\$",
    ...
    (More Paths as above)
    ...
);
$pattern = join '|', keys %replacements;
for (@lines) {
    s/($pattern)/@{[$replacements{$1}]}/g;
    print OUTPUT;
}
I'm not totally sure what's happening, as "\\\\Project_Security\$" correctly appears as "\\Project_Security$".
So I think the issue lies with "X:\\Group_14\\Project_Security" not evaluating to "X:\Group_14\Project_Security" correctly, and therefore not matching within the text file?
Any advice on this would be appreciated, Cheers.
If all the file paths and replacements are in a similar format to your example, you should just be able to do the following rather than using a hash for looking up replacements:
for my $line (@lines) {
    $line =~ s/.+\\(.+)$/\\\\$1\$/;
    print OUTPUT $line;
}
Some notes:
Always use the 3-argument open
Always check for errors on open, print, or close
Sometimes it is easier to use a loop than clever coding
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
use Data::Dumper;
# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;
# Set maximum depth for Data::Dumper, zero means unlimited
local $Data::Dumper::Maxdepth = 0;
# conditional compile DEBUGging statements
# See http://lookatperl.blogspot.ca/2013/07/a-look-at-conditional-compiling-of.html
use constant DEBUG => $ENV{DEBUG};
# --------------------------------------
# place file names in variables so they are easily changed
my $search_file = 'C:\\searchfile.txt';
my $log_search_file = 'C:\\logsearchfiletest.txt';
my %replacements = (
"X:\\Group_14\\Project_Security" => "\\\\Project_Security\$",
# etc
);
# use the 3-argument open as a security precaution
open my $search_fh, '<', $search_file or die "could not open $search_file: $OS_ERROR\n";
open my $log_search_fh, '>', $log_search_file or die "could not open $log_search_file: $OS_ERROR\n";
while( my $line = <$search_fh> ){
    # scan for replacements
    while( my ( $pattern, $replacement ) = each %replacements ){
        $line =~ s/\Q$pattern\E/$replacement/g;
    }
    print {$log_search_fh} $line or die "could not print to $log_search_file: $OS_ERROR\n";
}
# always close the file handles and always check for errors
close $search_fh or die "could not close $search_file: $OS_ERROR\n";
close $log_search_fh or die "could not close $log_search_file: $OS_ERROR\n";
I see you've posted my rusty Perl code here, how embarrassing. ;) I made an update earlier today to my answer in the original PowerShell thread that gives a more general solution that also handles regex metacharacters and doesn't require you to manually escape each of 600 hash elements: PowerShell multiple string replacement efficiency. I added the perl and regex tags to your original question, but my edit hasn't been approved yet.
[As I mentioned, since I've been using PowerShell for everything in recent times (heck, these days I prepare breakfast with PowerShell...), my Perl has gotten a tad dusty, which I see hasn't gone unnoticed here. :P I fixed several things that I noticed could be coded better when I looked at it a second time, which are noted at the bottom. I don't bother with error messages and declarations and other verbosity for limited-use quick-and-dirty scripts like this, and I don't particularly recommend it. As the Perl motto goes, "making easy things easy and hard things possible". Well, this is a case of making easy things easy, and one of Perl's main advantages is that it doesn't force you to be "proper" when you're trying to do something quick and simple. But I did close the filehandles. ;)]
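For the curious, the general idea of that more robust approach can be sketched roughly as follows (an illustration of the technique, not the exact code from that thread): let quotemeta escape the regex metacharacters instead of hand-writing doubled backslashes in 600 hash keys, and build one alternation from the escaped keys:
my %replacements = (
    'X:\Group_14\Project_Security' => '\\\\Project_Security$',
    # ... remaining paths, written single-quoted so the backslashes stay literal
);
# quotemeta escapes backslashes, '$', and other regex metacharacters for us
my $pattern = join '|', map { quotemeta } keys %replacements;
while ( my $line = <$search_fh> ) {
    $line =~ s/($pattern)/$replacements{$1}/g;
    print {$log_search_fh} $line;
}
Because the keys are single-quoted, they can be written exactly as they appear in the text file; quotemeta takes care of the escaping when the pattern is built.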

Issue with filepath name, possible corrupt characters

Perl and HTML, CGI on Linux.
The issue is with a file path name being passed in a form field to a CGI script on the server.
The issue is with the Linux file path, not the PC side.
I am using two programs:
1) A program written years ago: dynamic HTML generated in a Perl program and presented to the user as a form. I modified it by inserting the code needed to allow the user to select a file from their PC to be placed on the Linux machine.
Because this program already knew the filepath needed on the Linux side, I pass this filepath in a hidden form field to program 2.
2) A CGI program on the Linux side, run when the form from (1) is posted.
Strange issue.
The filepath that I pass, has a very strange issue.
I can extract it using
my $filepath = $query->param("serverfpath");
The above does populate $filepath with what looks like exactly the correct path.
But it fails, and not in a way that takes me to the file open error block, but such that the call to the CGI script gives an error.
However, if I populate $filepath with EXACTLY the same string, via hard coding it, it works, and my file successfully uploads.
For example:
$fpath1 = $query->param("serverfpath");
$fpath2 = "/opt/webhost/ims/DOCURVC/data";
A comparison of $fpath1 and $fpath2 reveals that they are exactly equal.
A length check of $fpath1 and $fpath2 reveals that they are exactly the same length.
I have tried many methods of cleaning the data in $fpath1.
I chomp it.
I remove any non-standard characters.
$fpath1 =~ s/[^A-Za-z0-9\-\.\/]//g;
and this:
my $safe_filepath_characters = "a-zA-Z0-9_.-/";
$fpath1 =~ s/[^$safe_filepath_characters]//g;
But no matter what I do, using $fpath1 causes an error, using $fpath2 works.
What could be wrong with the data in $fpath1 that would cause it to compare as equal to $fpath2, look exactly the same, show exactly the same length, and yet not work the same?
For the file open block below:
$upload_dir = $fpath1
causes complete failure of the CGI to load, as if it cannot find the CGI script (which I know is sometimes caused by a syntax error in the CGI script).
$upload_dir = $fpath2
gives me a successful file upload.
$upload_dir = ""
does not cause the call to the CGI to fail; it executes the else block of the if below, as expected.
Here is the file open block:
if (open ( UPLOADFILE, ">$upload_dir/$filename" ))
{
    binmode UPLOADFILE;
    while ( <$upload_filehandle> )
    {
        print UPLOADFILE;
    }
    close UPLOADFILE;
    $msgstr = "Done with Upload: upload_dir=$upload_dir filename=$filename";
}
else
{
    $msgstr = "ERROR opening for upload: upload_dir=$upload_dir filename=$filename";
}
What other tests should I be performing on $fpath1 to find out why it does not work the same as its hard-coded equivalent $fpath2?
I did try character replacement, a single character at a time, from $fpath2 into $fpath1.
Even after replacing a single character, $fpath1 still produced the same error, although the character looked exactly the same.
Is your CGI perhaps running perl with the -T (taint mode) switch (e.g., #!/usr/bin/perl -T)? If so, any value coming from untrusted sources (such as user input, URIs, and form fields) is not allowed to be used in system operations, such as open, until it has been untainted by using a regex capture. Note that using s/// to modify it in-place will not untaint the value.
$fpath1 =~ /^([A-Za-z0-9\-\.\/]*)$/;
$fpath1 = $1;
die "Illegal character in fpath1" unless defined $fpath1;
should work if taint mode is your issue.
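If you want to confirm whether taint mode is in fact the culprit, Scalar::Util can report taintedness directly. A small diagnostic sketch:
use Scalar::Util qw( tainted );
# under -T, a form value should report tainted; the hard-coded one should not
print "fpath1 tainted: ", ( tainted($fpath1) ? "yes" : "no" ), "\n";
print "fpath2 tainted: ", ( tainted($fpath2) ? "yes" : "no" ), "\n";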
But it fails, and not in a way that takes me to the file open error block, but such that the call to the CGI script gives an error.
Premature end of script headers? Try running the CGI from the command line:
perl your_upload_script.cgi serverfpath=/opt/webhost/ims/DOCURVC/data
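It is also worth dumping both strings byte by byte; even though your eq and length checks pass, a hex dump removes all doubt about invisible characters such as a trailing \r or a non-breaking space:
# unpack 'H*' shows the raw bytes as hex, so hidden characters become visible
printf "fpath1: %s\n", unpack( 'H*', $fpath1 );
printf "fpath2: %s\n", unpack( 'H*', $fpath2 );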

Issue with Spaces when Scripting Scala with Process

I am attempting to write an automated script to pngcrush my images (for a website I am working on), and I used Scala as a scripting language to write it. Everything is going well, except that I am having a problem with spaces when I execute the command. I read that you need to use
Seq(a,b,c,d)
where a, b, c, d are strings (that are meant to be separated by a single space) to deal with how Scala/Java handle strings.
The relevant code I have for generating the command to be executed is here. The result variable contains the literal path to every file:
for (fileName <- result) {
val string = Seq("pngcrush","brute","-d","\"" + folder.getPath + "/\"","-e",fileName.getName) ++ fileName.getCanonicalPath.replace(" ","\\ ").split(" ").toSeq
I then use
string!
to execute the command. The problem is that the last section of the command (after the "-e" flag) isn't handled properly, because it cannot deal with directories that have spaces. An example output is shown below:
List(pngcrush, brute, -d, "/tmp/d75f7d89-9ed5-4ff9-9181-41ae2fd82da8/", -e, users_off.png, /Users/mdedetrich/3dot/blublocks/src/main/webapp/img/sidebar/my\, group/users_off.png)
And if I run reduceLeft to join it back with spaces, I get what the proper string should be:
pngcrush brute -d "/tmp/1eaca157-0e14-430c-b0a4-677491d70583/" -e users_off.png /Users/mdedetrich/3dot/blublocks/src/main/webapp/img/sidebar/my\ group/users_off.png
Which is what the correct command should be (running the string manually in a terminal works fine). However, when I attempt to run this through the Scala script, I get this:
Could not find file: users_off.png
Could not find file: /Users/mdedetrich/3dot/blublocks/src/main/webapp/img/sidebar/my\
Could not find file: group/users_off.png
CPU time decoding 0.000, encoding 0.000, other 0.000, total 0.000 seconds
Any idea what I am doing incorrectly? It seems to be a problem with Scala not parsing strings that have spaces (and splitting it with Seq is not working either). I have tried both using a literal string with spaces and Seq, neither of which seem to work.
Why are you doing this:
replace(" ","\\ ").split(" ")
That is what's splitting the argument, not Process. Why don't you just use the following?
val string = Seq("pngcrush",
  "brute",
  "-d", "\"" + folder.getPath + "/\"",
  "-e", fileName.getName,
  fileName.getCanonicalPath)

Splitting A File On Delimiter

I have a file on a Linux system that is roughly 10GB. It contains 20,000,000 binary records, but each record is separated by an ASCII delimiter "$". I would like to use the split command or some combination thereof to chunk the file into smaller parts. Ideally I would be able to specify that the command should split every 1,000 records (therefore every 1,000 delimiters) into separate files. Can anyone help with this?
The only unorthodox part of the problem seems to be the record separator. I'm sure this is fixable in awk pretty simply - but I happen to hate awk.
I would transfer it into the realm of 'normal' problems first:
tr '$' '\n' < large_records.txt | split -l 1000
This will by default create xaa, xab, xac... files; look at man split for more options
I love awk :)
BEGIN { RS="$"; chunk=1; count=0; size=1000 }
{
    print $0 > "/tmp/chunk" chunk;
    if (++count >= size) {
        chunk++;
        count = 0;
    }
}
(note that the redirection operator in awk only truncates/creates the file on its first invocation - subsequent references are treated as append operations - unlike shell redirection)
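Incidentally, the input-record-separator trick from the first answer on this page works here too. A rough Perl sketch (the chunk file names are illustrative):
local $/ = '$';    # read one '$'-terminated record at a time
my ( $chunk, $count ) = ( 1, 0 );
open my $out, '>', "chunk$chunk" or die $!;
while ( my $record = <> ) {
    print {$out} $record;
    if ( ++$count >= 1000 ) {
        close $out;
        $chunk++;
        $count = 0;
        open $out, '>', "chunk$chunk" or die $!;
    }
}
close $out;
Unlike the tr and awk versions, this keeps the '$' delimiter attached to each record, which may matter for binary data.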
Note that by default Unix split will exhaust its suffixes once it reaches the limit of the default suffix length of 2; use the -a option to allow longer suffixes if you expect more output files. More info at: https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html

Using Awk to process a file where each record has different fixed-width fields

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type, from this you then know which fields should follow. A file might look something like this:
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.
Is there a good way to extract the fields from such a file using Awk?
Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.
Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC'-type row, you can pre-process your text file using simple if-then logic. I have presumed you require fields 1, 5 and 7 on one row; a sample awk script would be:
BEGIN {
    field1 = ""
    field5 = ""
    field7 = ""
}
{
    record_type = substr($0,1,2)
    if (record_type == "AA")
    {
        field1 = substr($0,3,6)
    }
    else if (record_type == "BB")
    {
        field5 = substr($0,9,6)
        field7 = substr($0,21,18)
    }
    else if (record_type == "CC")
    {
        print field1 "|" field5 "|" field7
    }
}
Create an awk script file called program.awk and pop that code into it. Execute the script using:
awk -f program.awk < my_multi_line_file.txt
You can maybe use two passes:
1step.awk
/^AA/{printf "2 6 6 12" }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8" }
{printf "\n%s\n", $0}
2step.awk
NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}
And then
awk -f 1step.awk sample | awk -f 2step.awk
You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:
awk '/^AA/ { manually process record AA out of $0 }
/^BB/ { manually process record BB out of $0 }
/^CC/ { manually process record CC out of $0 }' file ...
The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.
I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
Could you use Perl and then select an unpack template based on the first two chars of the line?
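To flesh out the unpack suggestion, here is a rough Perl sketch; the templates below are inferred from the sample records shown in the question, so adjust the widths to the real layout:
#!/usr/bin/perl
use strict;
use warnings;

# one unpack template per record type, keyed on the first two characters
my %template = (
    AA => 'A2 A6 A6 A12',        # AA, Field1, Field2, LongerField3
    BB => 'A2 A6 A6 A6 A18 A6',  # BB, Field4, Field5, Field6, VeryVeryLongField7, Field8
    CC => 'A2 A7',               # CC, Field99
);

while ( my $line = <> ) {
    chomp $line;
    my $type = substr( $line, 0, 2 );
    next unless exists $template{$type};    # skip unknown record types
    my ( $tag, @fields ) = unpack( $template{$type}, $line );
    print join( '|', $tag, @fields ), "\n";
}
The A (ASCII, space-trimmed) template letter strips trailing padding, which is usually what you want with fixed-width exports.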
Better to use a fully featured scripting language like Perl or Ruby.
What about two scripts? E.g. the first script inserts field separators based on the first characters, then the second processes it.
Or, first of all, define a function in your AWK script which splits the lines into variables based on the record type. I would go this way, for the sake of reusability.
