Multithreaded reading line by line a file in Crystal

I’m beginning with Crystal Lang and I’d like to know whether we can read a file line by line with multiple threads, as in C# with Parallel (and its MaxDegreeOfParallelism option).
Thanks

If I understand C#'s Parallel correctly, it just implements concurrent (and eventually multithreaded) execution of a number of similar tasks. This is certainly possible in Crystal, even without multithreading. In the stdlib, HTTP::Server uses this, and there are several shards for job processing, for example. Once multithreading lands, this will give us the option to run tasks truly in parallel.
Issue #6468 suggests how to structure such concurrent tasks, and potentially also how to configure how many tasks are executed in parallel.
I'm not sure what you mean by "multithreaded reading a file line by line". Sharing a file descriptor for simultaneous access from multiple threads sounds like a dangerous idea in any language. Are you certain C#'s Parallel can do that?

Related

Use cases for ithreads (interpreter threads) in Perl and rationale for using or not using them?

If you want to learn how to use Perl interpreter threads, there's good documentation in perlthrtut (threads tutorial) and the threads pragma manpage. It's definitely good enough to write some simple scripts.
However, I have found little guidance on the web on why and what to sensibly use Perl's interpreter threads for. In fact, there's not much talk about them, and if people talk about them it's quite often to discourage people from using them.
These threads, available when perl -V:useithreads is useithreads='define'; and unleashed by use threads, are also called ithreads, and maybe more appropriately so as they are very different from threads as offered by the Linux or Windows operating systems or the Java VM in that nothing is shared by default and instead a lot of data is copied, not just the thread stack, thus significantly increasing the process size. (To see the effect, load some modules in a test script, then create threads in a loop pausing for key presses each time around, and watch memory rise in task manager or top.)
[...] every time you start a thread all data structures are copied to
the new thread. And when I say all, I mean all. This e.g. includes
package stashes, global variables, lexicals in scope. Everything!
-- Things you need to know before programming Perl ithreads (Perlmonks 2003)
When researching the subject of Perl ithreads, you'll see people discouraging you from using them ("extremely bad idea", "fundamentally flawed", or "never use ithreads for anything").
The Perl thread tutorial highlights that "Perl Threads Are Different", but it doesn't much bother to explain how they are different and what that means for the user.
A useful but very brief explanation of what ithreads really are is from the Coro manpage under the heading WINDOWS PROCESS EMULATION. The author of that module (Coro - the only real threads in perl) also discourages using Perl interpreter threads.
Somewhere I read that compiling perl with threads enabled will result in a significantly slower interpreter.
There's a Perlmonks page from 2003 (Things you need to know before programming Perl ithreads), in which the author asks: "Now you may wonder why Perl ithreads didn't use fork()? Wouldn't that have made a lot more sense?" This seems to have been written by the author of the forks pragma. Not sure the info given on that page still holds true in 2012 for newer Perls.
Here are some guidelines for usage of threads in Perl I have distilled from my readings (maybe erroneously so):
Consider using non-blocking IO instead of threads, like with HTTP::Async, or AnyEvent::Socket, or Coro::Socket.
Consider using Perl interpreter threads on Windows only, not on UNIX because on UNIX, forks are more efficient both for memory and execution speed.
Create threads at the beginning of the program, not when memory consumption is already considerable - see "ideal way to reduce these costs" in perlthrtut.
Minimize communication between threads because it's slow (all answers on that page).
So far my research. Now, thanks for any more light you can shed on this issue of threads in Perl. What are some sensible use cases for ithreads in Perl? What is the rationale for using or not using them?

The short answer is that they're quite heavy (you can't launch 100+ of them cheaply), and they exhibit unexpected behaviours (somewhat mitigated by recent CPAN modules).
You can safely use Perl ithreads by treating them as independent Actors.
1. Create a Thread::Queue::Any for "work".
2. Launch multiple ithreads and "result" Queues, passing them the ("work" + own "result") Queues by closure.
3. Load (require) all the remaining code your application requires (not before threads!)
4. Add work for the threads into the Queue as required.
In "worker" ithreads:
1. Bring in any common code (for any kind of job)
2. Blocking-dequeue a piece of work from the Queue
3. Demand-load any other dependencies required for this piece of work.
4. Do the work.
5. Pass the result back to the main thread via the "result" queue.
6. Back to 2.
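The worker loop above is not specific to Perl. As a rough illustration only, here is the same shape sketched in Python with the standard queue and threading modules. The squaring "work", the STOP sentinel and the single shared result queue are assumptions made for brevity (the answer gives each worker its own result queue), and this is not a translation of any Thread::Queue::Any code.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel telling a worker to exit


def worker(work_q, result_q):
    while True:
        job = work_q.get()              # blocking dequeue of one piece of work
        if job is STOP:
            break
        result_q.put((job, job * job))  # stand-in for "do the work"


def run_jobs(jobs, n_workers=3):
    work_q = queue.Queue()
    result_q = queue.Queue()
    # Launch the workers first, then feed them work.
    threads = [
        threading.Thread(target=worker, args=(work_q, result_q))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for job in jobs:
        work_q.put(job)
    for _ in threads:
        work_q.put(STOP)                # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have exited, so the result queue is complete.
    results = {}
    while not result_q.empty():
        job, value = result_q.get()
        results[job] = value
    return results


squares = run_jobs(range(5))
```

The workers only ever touch the two queues, which is what keeps them independent of each other.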
If some "worker" threads start to get a little beefy, and you need to limit "worker" threads to some number then launch new ones in their place, then create a "launcher" thread first, whose job it is to launch "worker" threads and hook them up to the main thread.
What are the main problems with Perl ithreads?
They're a little inconvenient for "shared" data, as you need to do the sharing explicitly (not a big issue).
You need to look out for the behaviour of objects with DESTROY methods as they go out of scope in some thread (if they're still required in another!)
The big one: Data/variables that aren't explicitly shared are CLONED into new threads. This is a performance hit and probably not at all what you intended. The work around is to launch ithreads from a pretty much "pristine" condition (not many modules loaded).
IIRC, there are modules in the Threads:: namespace that help with making dependencies explicit and/or cleaning up cloned data for new threads.
Also, IIRC, there's a slightly different model using ithreads called "Apartment" threads, implemented by Thread::Apartment, which has a different usage pattern and another set of trade-offs.
The upshot:
Don't use them unless you know what you're doing :-)
Fork may be more efficient on Unix, but the IPC story is much simpler for ithreads. (This may have been mitigated by CPAN modules since I last looked :-)
They're still better than Python's threads.
There might, one day, be something much better in Perl 6.

I have used perl's "threads" on several occasions. They're most useful for launching some process and continuing on with something else. I don't have a lot of experience in the theory of how they work under the hood, but I do have a lot of practical coding experience with them.
For example, I have a server thread that listens for incoming network connections and spits out a status response when someone asks for it. I create that thread, then move on and create another thread that monitors the system, checking five items, sleeping a few seconds, and looping again. It might take 3-4 seconds to collect the monitor data, then it gets shoved into a shared variable, and the server thread can read that when needed and immediately return the last known result to whomever asks. The monitor thread, when it finds that an item is in a bad state, kicks off a separate thread to repair that item. Then it moves on, checking the other items while the bad one is repaired, and kicking off other threads for other bad items or joining finished repair threads. The main program all the while is looping every few seconds, making sure that the monitor and server threads aren't joinable/still running. All of this could be written as a bunch of separate programs utilizing some other form of IPC, but perl's threads make it simple.
Another place where I've used them is in a fractal generator. I would split up portions of the image using some algorithm and then launch as many threads as I have CPUs to do the work. They'd each stuff their results into a single GD object, which didn't cause problems because they were each working on different portions of the array, and then when done I'd write out the GD image. It was my introduction to using perl threads, and was a good introduction, but then I rewrote it in C and it was two orders of magnitude faster :-). Then I rewrote my perl threaded version to use Inline::C, and it was only 20% slower than the pure C version. Still, in most cases where you'd want to use threads due to being CPU intensive, you'd probably want to just choose another language.
As mentioned by others, fork and threads really overlap for a lot of purposes. Coro, however, doesn't really allow for multi-CPU use or parallel processing like fork and threads do; you'll only ever see your process using 100% of one CPU. I'm over-simplifying this, but I think the easiest way to describe Coro is that it's a scheduler for your subroutines. If you have a subroutine that blocks, you can hop to another and do something else while you wait. For example, say you have an app that calculates results and writes them to a file. One block might calculate results and push them into a channel. When it runs out of work, another block starts writing them to disk. While that block is waiting on disk, the other block can start calculating results again if it gets more work. Admittedly I haven't done much with Coro; it sounds like a good way to speed some things up, but I'm a bit put off by not being able to do two things at once.
My own personal preference if I want to do multiprocessing is to use fork if I'm doing lots of small or short things, threads for a handful of large or long-lived things.

PERL parallel multi threading

I am writing a PERL script involving multithreading. It has a GUI and the number of threads to be used will be taken as user input. Depending on this number, the script should generate threads which all access the same sub. I want the n threads to work in parallel. But when I create a loop, the parallel processing is lost. Any idea as to how to overcome this issue?

I believe that the simplest way to answer would be to recommend you to look at something like POE. The framework cookbook webpage provides many examples that surely will be a good starting point for your original issue.
Depending on your GUI platform, you may also want to spend time on event loops provided by the framework itself.

You probably need to call the threads->yield() function occasionally in the processing loops. yield() gives a "hint" that the current thread should give up the CPU to other threads.

C# TPL Tasks - How many at one time

I'm learning how to use the TPL for parallelizing an application I have. The application processes ZIP files, extracting all of the files held within them and importing the contents into a database. There may be several thousand ZIP files waiting to be processed at a given time.
Am I right in kicking off a separate task for each of these ZIP files or is this an inefficient way to use the TPL?
Thanks.

This seems like a problem better suited for worker threads (separate thread for each file) managed with the ThreadPool rather than the TPL. TPL is great when you can divide and conquer on a single item of data but your zip files are treated individually.
Disk I/O is going to be your bottleneck, so I think you will need to throttle the number of jobs running simultaneously. It's simple to manage this with worker threads, but I'm not sure how much control you have (if any) over how much parallelism the parallel for/foreach runs at once, which could choke your process and actually slow it down.

Anytime that you have a long running process, you can typically gain additional performance on multi-processor systems by making different threads for each input task. So I would say that you are most likely going down the right path.

I would have thought that this would depend on if the process is limited by CPU or disk. If the process is limited by disk I'd thought that it might be a bad idea to kick off too many threads since the various extractions might just compete with each other.
This feels like something you might need to measure to get the correct answer for what's best.

I have to disagree with certain statements here guys.
First of all, I do not see any difference between ThreadPool and Tasks in coordination or control, especially since tasks run on the ThreadPool. You have easy control over tasks, and exceptions are nicely propagated to the caller when you await a task or Task.WhenAll(tasks), etc.
Second, I/O does not have to be the only bottleneck here. Depending on the data and the level of compression, the zipping is most likely going to take more time than reading the file from the disk.
There are many ways to approach it, but I would go for a number of tasks equal to the number of CPU cores, or a little less.
Load the file paths into a ConcurrentQueue and then let the running tasks dequeue file paths, load the files, zip them, and save them.
From there you can tweak the number of cores and play with load balancing.
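To make the throttling idea concrete, here is a minimal Python sketch rather than C#/TPL code. A ThreadPoolExecutor keeps its own internal queue of pending items, so capping max_workers plays the role of "N tasks dequeuing from a ConcurrentQueue". The helper names make_archive, extract_archive and process_all are made up for this example, the archives are built in memory so the sketch runs without any files on disk, and it extracts (as in the question) instead of compressing.

```python
import io
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor


def make_archive(files):
    """Build a small in-memory ZIP from {name: text} (test data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def extract_archive(data):
    """Return {member name: decoded contents} for one archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


def process_all(archives, max_workers=None):
    # Default to roughly one worker per CPU core, as the answer suggests.
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order; at most max_workers run at once.
        return list(pool.map(extract_archive, archives))


archives = [make_archive({f"file{i}.txt": f"contents {i}"}) for i in range(4)]
extracted = process_all(archives, max_workers=2)
```

Whether two workers or eight is the right cap depends on the disk and the data, so max_workers is the number to measure and tune.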
I do not know if ZIP supports file partitioning during compression, but in some advanced/complex cases it could be good idea especially on large files...
Wow, this is a 6-year-old question, bummer! I had not noticed... :)

Can parallel operations speed the availability of a file from a hard disk in R?

I have a huge datafile (~4GB) that I am passing through R (to do some string clean up) on its way into a MySQL database. Each row/line is independent of the others. Is there any speed advantage to be had by using parallel operations to finish this process? That is, could one thread start by skipping no lines and scan every second line, and another start with a skip of 1 line and read every second line? If so, would it actually speed up the process, or would the two threads fighting for the 10K Western Digital hard drive (not SSD) negate any possible advantages?

The answer is maybe. At some point, disk access will become limiting. Whether this happens with 2 cores running or 8 depends on the characteristics of your hardware setup. It'd be pretty easy to just try it out, while watching your system with top. If your %wa is consistently above zero, it means that the CPUs are waiting for the disk to catch up and you're likely slowing the whole process down.

Why not just use some of the standard Unix tools to split the file into chunks and call several R command-line expressions in parallel working on a chunk each? No need to be fancy if simple can do.

The bottleneck will likely be the HDD. It doesn't matter how many processes are trying to access it, it can only read/write one thing at a time.
This assumes the "string clean up" uses minimal CPU. awk or sed are generally better for this than R.

You probably want to read from the disk in one linear forward pass, as the OS and the disk optimize heavily for that case. But you could parcel out blocks of lines to worker threads/processes from where you're reading the disk. (If you can do process parallelism rather than thread parallelism, you probably should - way less hassle all 'round.)
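A sketch of that "one reader, many workers" layout, written in Python instead of R. The clean_block function is a hypothetical stand-in for the string cleanup, and block_size, workers and max_pending are illustrative values, not recommendations. Threads are used here to keep the example self-contained; for CPU-heavy cleanup in CPython a ProcessPoolExecutor would be the closer match to the process parallelism suggested above.

```python
import io
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def clean_block(lines):
    # Hypothetical cleanup: trim whitespace and lower-case each line.
    return [line.strip().lower() for line in lines]


def process_in_blocks(stream, block_size=1000, workers=4, max_pending=8):
    """Read `stream` once, front to back, and hand blocks of lines to workers."""
    results = []
    pending = deque()  # futures in submission order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            block = list(islice(stream, block_size))  # next block_size lines
            if not block:
                break
            pending.append(pool.submit(clean_block, block))
            # Cap how many blocks are held in memory at the same time.
            if len(pending) >= max_pending:
                results.extend(pending.popleft().result())
        while pending:
            results.extend(pending.popleft().result())
    return results


sample = io.StringIO("  Alpha \n BETA\n  gamma  \n")
cleaned = process_in_blocks(sample, block_size=2, workers=2)
```

Only the main thread touches the file, so the disk sees a single sequential read; the workers never share the file handle.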
Can you describe the string cleanup that's required? R is not the first thing I would reach for for string bashing.

Ruby is another easy scripting language for file manipulation and cleanup. But it is still a question of the ratio of processing time to reading time. If the point is to do things like select out columns or rearrange things, you are far better off going with ruby, awk or sed; even for simple computations those would be better. But if for each line you are, say, fitting a regression model or performing a simulation, you would be better off doing the tasks in parallel. The question cannot have a definite answer because we don't know the exact parameters. But it sounds like for most simple cleanup jobs it would be better to use a language well suited for it, like ruby, and run it in a single thread.

Threading run time without adding extra lines in program

Is there any thread library which can parse through code, find blocks of code which can be threaded, and add the required threading instructions accordingly?
Also, I want to check the performance of a multithreaded program compared to its single-threaded version. For this I would need to monitor the CPU usage (how much each processor is being used). Is there any tool available to do this?

I'd say the decision whether or not a given block of code can be rewritten to be multi-threaded is way too hard for an automated process to make. To make matters worse, multi-threaded code typically accesses resources outside its own scope, such as pulling data over the network, loading large files, waiting for events, executing database queries, etc.; without detailed information about all these external factors, it is impossible to decide where to go multithreaded, simply because not all the required information is in the code.
Also, a lot of code that is multi-threadable in theory will not run faster if multi-threaded, but will in fact slow down.

Some compilers (such as recent versions of the Intel compiler and gcc) can automatically parallelize simple loops, but anything beyond that is too complex. On the other hand, there are task libraries that use thread pools, and will automatically scale the number of threads to the available processors, and divide the work between them. Of course, using such a library will require rewriting your code to do so.
Structuring your application to make best use of multithreading is not a simple matter, and requires careful thought about which parts of your application can best make use of it. This is not something that can be automated.

Consider multi-threading as an approach to make full utilization of available resources. This is when it works the best. Consider an application which has multiple modules/areas which are multi-threadable. If all of them are made multi-threaded, the available resources might go down substantially. This could at times be detrimental to the application itself. Thus, multi-threading has to be used very carefully.
As Chris mentioned, there are a lot of profilers which do profiling for given combination of OS/language.

The first thing you need to do is profile your code in a single thread and see if the areas you think are good candidates for multithreading are actually a problem. It's easy to waste a lot of time multithreading working code only to end up with a buggy mess that's slower than the original implementation if you don't carefully consider the problem first.
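As one way to do that single-threaded profiling step, here is a small Python sketch using the standard cProfile and pstats modules. The functions slow_part, fast_part and work are invented for the example; the same idea applies with whatever profiler your own language and OS provide.

```python
import cProfile
import io
import pstats


def slow_part():
    # Deliberately expensive: the kind of hotspot worth parallelizing.
    return sum(i * i for i in range(200_000))


def fast_part():
    # Cheap: threading this would add overhead for no gain.
    return sum(range(100))


def work():
    return slow_part() + fast_part()


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time so the costliest call paths come first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats()
print(report.getvalue())
```

If the report shows that the suspected candidate accounts for only a small share of the total time, multithreading it is unlikely to pay off.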
