Using Perl 6 to process a large text file, and it's too slow (2014-09)

The code is hosted at https://github.com/yeahnoob/perl6-perf and is as follows:
use v6;
my $file = open "wordpairs.txt", :r;
my %dict;
my $line;
repeat {
    $line = $file.get;
    my ($p1, $p2) = $line.split(' ');
    if ?%dict{$p1} {
        %dict{$p1} = "{%dict{$p1}} {$p2}".words;
    } else {
        %dict{$p1} = $p2;
    }
} while !$file.eof;
It runs fine when "wordpairs.txt" is small.
But when "wordpairs.txt" is about 140,000 lines (each line contains two words), it runs very, very slowly and does not finish even after 20 seconds.
What's the problem with it? Is there a fault in the code?
Thanks for any help!
The following content was added 2014-09-04. Thanks for the many suggestions from the SE answers and IRC (#perl6 on freenode).
The code (as of 2014-09-04):
my %dict;

grammar WordPairs {
    token word-pair { (\S*) ' ' (\S*) "\n" }
    token TOP       { <word-pair>* }
}

class WordPairsActions {
    method word-pair($/) { %dict{$0}.push($1) }
}

my $match = WordPairs.parse(slurp, :actions(WordPairsActions));
say ?$match;
Running time (as of now):
$ time perl6 countpairs.pl wordpairs.txt
True
The pairs count of the key word "her" in wordpairs.txt is 1036
real 0m24.043s
user 0m23.854s
sys 0m0.181s
$ perl6 --version
This is perl6 version 2014.08 built on MoarVM version 2014.08
This timing is still not reasonable (the equivalent, proper Perl 5 code takes only about 160 ms), but it is much better than my original Perl 6 code. :)
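For reference, the Perl 5 code is roughly of this shape (simplified here; the exact script is in the repository linked above):

use strict;
use warnings;

my %dict;
open my $fh, '<', 'wordpairs.txt' or die "cannot open: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($p1, $p2) = split ' ', $line;
    push @{ $dict{$p1} }, $p2;    # collect every second word under its key
}
close $fh;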
PS. The whole thing, including the original test code, patch and sample text, is on GitHub.

I've tested this with code very similar to Christoph's using a file containing 10,000 lines. It takes around 15 seconds, which, as you say, is significantly slower than Perl 5. I suspect the code is slow because something it uses hasn't seen as much optimisation effort as other parts of Rakudo and MoarVM have received recently. I'm sure the performance of the code will improve dramatically over the next few months as whatever is slow sees more attention.
When trying to determine why some Perl 6 code is slow I suggest running perl6 on MoarVM with --profile to see whether it helps you find the bottleneck. Unfortunately, with this code it'll point to rakudo internals rather than anything you can improve.
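For example, with the script name from the question (the profiler's output format and location depend on your Rakudo version):
$ perl6 --profile countpairs.pl wordpairs.txt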
It's certainly worth talking to #perl6 on irc.freenode.net as they'll have the knowledge to offer an alternative solution and will be able to improve its performance in the future.

Rakudo isn't exactly known for its stellar performance.
Using more idiomatic code might or might not help:
my %dict;
for open('wordpairs.txt', :r).lines {
    my ($key, @words) = .words;
    push %dict{$key}, @words;
}
You could also check the other backends (Rakudo runs on MoarVM, Parrot and JVM) to see if it is equally slow everywhere.
It would be interesting to know if it's IO or processing that's slow, e.g. via
my %dict;
say 'start IO';
my @lines = eager open('wordpairs.txt', :r).lines;
say 'done IO';
say 'start processing';
for @lines { ... }
say 'done processing';
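One way to fill in that sketch, timing each phase with now (using the filename from the question and the idiomatic loop above):

my %dict;

my $t0 = now;
my @lines = eager open('wordpairs.txt', :r).lines;
say "IO took {now - $t0}s";

$t0 = now;
for @lines {
    my ($key, @words) = .words;
    push %dict{$key}, @words;
}
say "processing took {now - $t0}s";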
I believe there's also a profiler available, if you want to dig into the issue yourself.


How can I autoflush a Perl 6 filehandle?

There are a couple of answers for Perl 6 from back in its Parrot days, and they don't seem to work currently:
This is Rakudo version 2017.04.3 built on MoarVM version 2017.04-53-g66c6dda
implementing Perl 6.c.
The answer to Does perl6 enable “autoflush” by default? says it's enabled by default (but that was 2011).
Here's a program I was playing with:
$*ERR.say: "1. This is an error";
$*OUT.say: "2. This is standard out";
And its output, which is an unfortunate order:
2. This is standard out
1. This is an error
So maybe I need to turn it on. There's How could I disable autoflush? which mentions an autoflush method:
$*ERR.autoflush = True;
$*ERR.say: "1. This is an error";
$*OUT.say: "2. This is standard out";
But that doesn't work:
No such method 'autoflush' for invocant of type 'IO::Handle'
I guess I could fake this myself by making my own IO class that flushes after every output. For what it's worth, it's the lack of this feature that prevented me from using Perl 6 for a particular task today.
As a secondary question, why doesn't Perl 6 have this now, especially when it looks like it used to? How would you persuade a Perl 5 person this isn't an issue?
There was an output refactor very recently. With my local version of Rakudo I can't get it to give the wrong order any more (2017.06-173-ga209082 built on MoarVM version 2017.06-48-g4bc916e).
There's now a :buffer argument to IO handles that you can set to a number (or pass as :!buffer) to control this.
I assume the default when the output is a TTY is not to buffer.
This might not have worked yet when you asked the question, but:
$*ERR.out-buffer = False;
$*ERR.say: "1. This is an error";
$*OUT.say: "2. This is standard out";
It's a bit hard to find, but it's documented here.
Works for me in Rakudo Star 2017.10.
Rakudo doesn't support autoflush (yet). There's a note in 5to6-perlvar under the $OUTPUT_AUTOFLUSH entry.
raiph posted a comment elsewhere with a #perl6 IRC log search showing that people keep recommending autoflush and other people keep pointing out that it's not implemented. Since it's not a documented method (although flush is), I suppose we'll have to live without it for a bit.
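In the meantime, a crude workaround is to call flush (which is documented on IO::Handle) after each write; a sketch:

$*ERR.say: "1. This is an error";
$*ERR.flush;
$*OUT.say: "2. This is standard out";
$*OUT.flush;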
If you're primarily interested in STDOUT and STDERR, the following seems to reopen them without buffering (auto-flushed):
$*OUT = $*OUT.open(:!buffer);
$*ERR = $*ERR.open(:!buffer);
This isn't thoroughly tested yet, and I'm surprised this works. It's a funny API that lets you re-open an open stream.

How to check the available free space in Haxe?

I can't find a way to check the free space available on a device using Haxe, OpenFL, Lime or another library.
I would like to avoid downloading data that would exceed the size recommended for an app on each device.
What do you do to check that?
Try creating a file of that size! Then either delete it or reopen and write (not append) over its contents.
I don't know whether all platforms Haxe supports will work fine with this trick, but this algorithm is reported to work in many places and languages (I personally tested it in Ruby and saw the same suggestion for C++/.NET). To check whether X bytes of disk space are available:
open a new file for writing
seek X-1 bytes from the beginning
write a byte of data (whatever you want, 0, 42...)
close the file (probably unrelated to the task at hand, but don't forget to do that anyway)
If there's insufficient disk space, you'll likely get an exception at some point in this algorithm. You'll have to find out what errors to expect and process them properly.
Using ihx, I've found that this works and requires nothing but the Haxe standard library:
haxe interactive shell v0.3.4
type "help" for help
>> import sys.io.*;
>> var f = File.write('loca', true)
sys.io.FileOutput : { __f => #abstract }
>> f.seek(39999, FileSeek.SeekBegin)
Void : null
>> f.writeByte(0)
Void : null
>> f.close()
Void : null
After these manipulations, I had a file named loca of exactly 40000 bytes in my working directory.
By the way, be careful when doing things like this in ihx, since it re-runs the entire session, with the last entered line appended, each time.
Ongoing experimentation:
However, when there's insufficient disk space, it may not fail with errors. In this case you'll have to check the real size with sys.FileSystem.stat(path).size. And don't forget to delete the file if there's not enough space.
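Putting those pieces together, a helper along these lines might work (an untested sketch; SpaceCheck and hasFreeSpace are made-up names, and errors from seek/write are not handled):

import sys.io.File;
import sys.io.FileSeek;
import sys.FileSystem;

class SpaceCheck {
    // Try to claim `bytes` at `path`, then verify the real size with stat().
    public static function hasFreeSpace(path:String, bytes:Int):Bool {
        var f = File.write(path, true);
        f.seek(bytes - 1, FileSeek.SeekBegin);
        f.writeByte(0);
        f.close();
        var ok = FileSystem.stat(path).size >= bytes;
        FileSystem.deleteFile(path); // always remove the probe file
        return ok;
    }
}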

Ubuntu 14.04 arbtt-stats "index too large" error

I've recently installed arbtt, which seems to be an interesting, rule-based, automatic time tracker. http://arbtt.nomeata.de/#what
I've got it working for the most part, but after 30 minutes or so of gathering stats, I end up with the following error.
Processing data [=>......................................................................................................................................................................................] 1%
arbtt-stats: Prelude.(!!): index too large
Does anyone have any suggestions on ways I can troubleshoot this issue, or better yet, solve it? I have zero experience with the language used to write the rules (Haskell, I believe). All I've done up to this point is follow the documentation as closely as possible.
This error ultimately renders the tool useless, since it doesn't gather data for longer than 30 minutes. To fix it, I have to delete the log and start from scratch. I'm primarily interested in the notion of a customizable, rule-based time tracker, but I'm by no means tied to using arbtt.
Based on the comments below, I'm including some more information below.
When I try to run arbtt-recover I get a long list of errors that look like this. All of them seem to be related to an Unsupported TimeLogEntry.
Trying at position 1726098.
Failed to read value at position 1726098:
Unsupported TimeLogEntry version tag 0
As for the configuration file, here is what I have so far.
$idle > 30 ==> tag inactive,
-- A rule that matches on a list of strings
current window $program == ["Chrome", "Firefox"] ==> tag Web,
current window $program == ["skype"] ==> tag Skype,
current window $program == ["jetbrains-phpstorm"] ==> tag PhpStorm,
( current window $title =~ m!Inbox! ||
current window $title =~ m!Outlook! ) ==> tag Emails,
( current window $title =~ m!AdWords! ||
current window $title =~ m!Analytics! ) ==> tag Adwords,
It goes on further, but I'm fairly confident I've followed the same syntax for all the other lines; the rest are in the same format but are project/client specific. If required, I'm happy to include the rest of the file.
As discussed in the comments: this is a case of a corrupt ~/.arbtt/capture.log. You can usually fix it by
running arbtt-recover
and then moving ~/.arbtt/capture.log.recovered to ~/.arbtt/capture.log.
The second, manual step is required to avoid accidentally deleting too much data. You can verify that the recovered file is better by running arbtt-stats against it, passing --logfile=~/.arbtt/capture.log.recovered (see the command sequence below).
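Concretely, the sequence looks roughly like this (assuming the default log location under ~/.arbtt):
$ arbtt-recover
$ arbtt-stats --logfile=~/.arbtt/capture.log.recovered   # sanity-check the recovered log
$ mv ~/.arbtt/capture.log.recovered ~/.arbtt/capture.log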
Data corruption happens, for example, when there is an unclean shutdown, or for other undetermined reasons. But the log file format is such that even after a corruption (e.g. a partial write of one sample), further samples will be written correctly and should be picked up by arbtt-recover, so you should not have lost more than a few samples.

Perl program structure for parsing

I've got a question about program architecture.
Say you've got 100 different log files with different formats and you need to parse them and put that information into an SQL database.
My view of it is:
use a general config file like:
program1->name1("apache", /var/log/apache.log) (module name, path to logfile1)
program2->name2("exim", /var/log/exim.log) (module name, path to logfile2)
....
sqldb->configuration
use something like a module (one file per program), e.g. type1.module (regexp, log structure (some variables), sql (tables and functions))
fork or thread processes (I don't know which is better on Linux these days) for the different programs.
So the question is: is my view of this correct? Should I use one module per program (web/MTA/iptables),
or is there a better way? I think some regexps would be the same, like date/time/IP/URL. What should I do about that? Or what have I missed?
Example: exim4 MTA mainlog entry:
2011-04-28 13:16:24 1QFOGm-0005nQ-Ig <= exim@mydomain.org.ua H=localhost (exim.mydomain.org.ua) [127.0.0.1]:51127 I=[127.0.0.1]:465 P=esmtpsa X=TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32 CV=no A=plain_server:spam S=763 id=1303985784.4db93e788cb5c@mydomain.org.ua T="test" from <exim@exim.mydomain.org.ua> for test@domain.ua
The interesting fields in that line are already parsed and will be put into the sqldb.incoming table. Right now I have a structure in Perl that holds every parsed variable, like $exim->{timestamp} or $exim->{host}->{ip}.
My program will do something like tail -f on the file and parse it line by line.
Flexibility: let's say I want to add support for the Apache server (just timestamp, user IP and the file downloaded). All I need to know is which logfile to parse, what the regexp should be and what the SQL structure should be. So I'm planning to have this as a module: just fork or thread the main process with parameters (logfile, filetype). Maybe later I'd add some options for what not to parse (maybe some log level is low and you just don't see much there).
I would do it like this:
Create a config file that is formatted like this: appname:logpath:logformatname
Create a collection of Perl classes that inherit from a base parser class.
Write a script which loads the config file and then loops over its contents, passing each iteration to its appropriate handler object.
If you want an example of steps 1 and 2, we have one in our project. See MT::FileMgr and MT::FileMgr::* here.
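To make the shape of steps 1-3 concrete, here is a bare-bones sketch (all class names, the config entry and the Apache pattern are made up for illustration, not real formats):

#!/usr/bin/perl
use strict;
use warnings;

# --- base parser class: format-specific parsers override parse_line() ---
package Parser::Base;
sub new        { my ($class, %args) = @_; bless { %args }, $class }
sub parse_line { die "subclass must implement parse_line()" }

# --- one subclass per log format ---
package Parser::Apache;
our @ISA = ('Parser::Base');
sub parse_line {
    my ($self, $line) = @_;
    # toy pattern: timestamp, client IP, requested file
    return unless $line =~ /^(\S+)\s+(\S+)\s+GET\s+(\S+)/;
    return { timestamp => $1, ip => $2, file => $3 };
}

# --- dispatcher: reads the appname:logpath:logformatname config ---
package main;
while (my $entry = <DATA>) {
    chomp $entry;
    next unless $entry;
    my ($app, $logpath, $format) = split /:/, $entry;
    my $class  = 'Parser::' . ucfirst $format;
    my $parser = $class->new(app => $app, logpath => $logpath);
    # here you would tail $logpath, call $parser->parse_line($_) per line,
    # and insert the resulting hashref into the database
    print "would parse $logpath for $app using $class\n";
}

__DATA__
apache:/var/log/apache.log:apache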
The log-monitoring tool wots could do a lot of the heavy lifting for you here. It runs as a daemon, watching as many log files as you could want, running any combination of perl regexes over them and executing something when matches are found.
I would be inclined to modify wots itself (which its licence freely allows) to support a database write method - have a look at its existing handle_* methods.
Most of the hard work has already been done for you, and you can tackle the interesting bits.
I think File::Tail is a nice fit.
You can make an array of File::Tail objects and poll them with select like this:
while (1) {
    ($nfound, $timeleft, @pending) =
        File::Tail::select(undef, undef, undef, $timeout, @files);
    unless ($nfound) {
        # timeout - do something else here, if you need to
    } else {
        foreach (@pending) {
            # here you can handle log messages depending on filename
            print $_->{"input"}." (".localtime(time).") ".$_->read;
        }
    }
}
(adapted from the File::Tail documentation)
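The @files array in that snippet holds one File::Tail object per log file, and $timeout is the select timeout in seconds; setting them up looks something like this (paths are illustrative):

use File::Tail;

my $timeout = 60;
my @files   = map { File::Tail->new(name => $_) }
              ('/var/log/exim4/mainlog', '/var/log/apache.log');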

Is there an Open-Source Web Search Library that does not use a Search Index File?

I'm looking for an open-source web search library that does not use a search index file.
Do you know any?
Thanks,
Kenneth
You mean:
search.cgi
#!/bin/sh
# pull the "s=" parameter out of the CGI query string
arg=`echo "$QUERY_STRING" | sed -e 's/^s=//' -e 's/&.*$//'`
cd /var/www/httpd
find . -type f | xargs egrep -l "$arg" | awk 'BEGIN {
    print "Content-type: text/html";
    print "";
    print "<HTML><HEAD><TITLE>Search Result</TITLE></HEAD>";
    print "<BODY><P>Here are your search results, sorry it took so long.</P>";
    print "<UL>";
}
{ print "<LI>" $1 "</LI>"; }
END {
    print "</UL></BODY>";
}'
Untested...
The original poster clarified in a comment to this reply that what he is looking for is essentially "greplike search but through HTTP", and mentioned that he is looking for something that uses little disk as he's working with an embedded system.
I am not aware of any related projects, but you might want to look at HTML parsers and XQuery implementations in your language of choice. You should be able to take care of the "real-life" messiness of HTML with the former, and write a search that's almost as detailed as you might desire with the latter.
I assume that you will be working with a set of URLs that will either be provided or already stored locally, since the idea of actually crawling the whole web, discovering links, etc. on an embedded device is thoroughly unrealistic.
Although with a good HTML/XQuery implementation, you do have the tools to extract all the links.
My original answer, which was really a request for clarification:
Not sure what you mean. How do you picture a search working without an index? Crawling the web for every query? Piping through to google? Or are you referring to a specific kind of search index file that you are trying to avoid?
I guess there is none (at least none popular enough for users here to be aware of).
We went ahead and coded our own search system.
