I know that calls to a functor using thrust::for_each with data in thrust::host_vectors have a parallel execution policy, but do they actually execute in parallel?
If not, what would be the correct way to invoke these knowing that the system I'm running this on is virtualized so that all cores appear to be on the same machine?
[EDIT]
I realize that there is such a thing as thrust::omp::par; however, I can't seem to find a full Thrust example using OpenMP.
In general, thrust operations dispatched on the "host" are not run in parallel. They use a single host thread.
If you want to run thrust operations in parallel on the CPU (using multiple CPU threads) then the recommended practice would be to use the thrust OpenMP backend.
A fully worked example is here.
Another worked example is here.
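In case those links are unavailable, here is a minimal sketch of the idea, assuming Thrust with its OpenMP backend is installed; the functor and vector size are placeholders:

```cpp
// Sketch: dispatch a Thrust algorithm to the OpenMP backend so it runs
// across CPU threads. Compile with OpenMP enabled, e.g.:
//   g++ -O2 -fopenmp -I<path-to-thrust> example.cpp
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/system/omp/execution_policy.h>
#include <cstdio>

struct scale_by_two
{
    void operator()(float &x) const { x *= 2.0f; }
};

int main()
{
    thrust::host_vector<float> v(1 << 20, 1.0f);

    // thrust::omp::par routes this call to the OpenMP backend, which
    // splits the range across the available CPU threads.
    thrust::for_each(thrust::omp::par, v.begin(), v.end(), scale_by_two());

    std::printf("v[0] = %f\n", static_cast<float>(v[0]));
    return 0;
}
```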
I am working on a program where I am required to download a large number of JSON files from different URLs.
Currently, my program creates multiple threads, and in each thread it calls the libcurl curl_easy_perform() function, but I am running into issues where the program occasionally fails with a "double free" error. It seems to be some sort of Heisenbug, but I have been able to catch it in GDB, which confirms the error originates in libcurl (backtraced).
While I would love suggestions on the issue I am having, my actual question is this: would it be better to change the structure of my code to use the libcurl multi interface on one thread instead of calling the easy interface across multiple threads? What are the trade-offs of using one over the other?
Note: By "better", I mean is it faster and less taxing on my CPU? Is it more reliable as the multi interface was designed for this?
EDIT:
The three options I have as I understand it are these:
1) Reuse the same easy_handle in a single thread. The connections won't need to be re-established, making it faster.
2) Call curl_easy_perform() in each individual thread. They all run in parallel, again making it faster.
3) Call curl_multi_perform() in a single thread. This is non-blocking, so I imagine all of the files are downloaded in parallel, making it faster?
Which of these options is the most time efficient?
curl_easy_perform is a blocking operation. That means if you run it in one thread, you have to download the files sequentially. In a multithreaded application you can run many operations in parallel, which usually means a faster download time (if speed is not limited by the network or the destination server).
But there is a non-blocking variant that may work better for you if you want to go the single-threaded way: curl_multi_perform.
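For illustration, a hedged sketch of a single-threaded multi-interface loop (the URLs are placeholders, and checking per-transfer results via curl_multi_info_read is omitted for brevity):

```cpp
#include <curl/curl.h>
#include <cstdio>

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();

    const char *urls[] = { "https://example.com/a.json",   // placeholder URLs
                           "https://example.com/b.json" };
    CURL *easies[2];

    for (int i = 0; i < 2; ++i) {
        easies[i] = curl_easy_init();
        curl_easy_setopt(easies[i], CURLOPT_URL, urls[i]);
        curl_multi_add_handle(multi, easies[i]);
    }

    int running = 0;
    do {
        curl_multi_perform(multi, &running);    // drive all transfers a step
        if (running) {
            int numfds = 0;
            curl_multi_wait(multi, nullptr, 0, 1000, &numfds);  // wait for activity
        }
    } while (running);

    for (int i = 0; i < 2; ++i) {
        curl_multi_remove_handle(multi, easies[i]);
        curl_easy_cleanup(easies[i]);
    }
    curl_multi_cleanup(multi);
    curl_global_cleanup();
    return 0;
}
```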
From the curl documentation:
You can do any amount of calls to curl_easy_perform while using the same easy_handle. If you intend to transfer more than one file, you are even encouraged to do so. libcurl will then attempt to re-use the same connection for the following transfers, thus making the operations faster, less CPU intense and using less network resources. Just note that you will have to use curl_easy_setopt between the invokes to set options for the following curl_easy_perform.
In short, reusing the same handle gives you a few of the benefits you want compared with separate curl_easy_perform calls on fresh handles.
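And a sketch of the handle-reuse pattern the quote describes (placeholder URLs, minimal error handling; note that curl_global_init is not thread-safe and should be called once before any threads start):

```cpp
#include <curl/curl.h>
#include <cstdio>

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);  // once, before any threads exist
    CURL *handle = curl_easy_init();
    if (!handle) return 1;

    const char *urls[] = {
        "https://example.com/a.json",       // placeholder URLs
        "https://example.com/b.json",
    };

    for (const char *url : urls) {
        // Set options between invocations, then reuse the same handle;
        // libcurl will try to reuse the underlying connection.
        curl_easy_setopt(handle, CURLOPT_URL, url);
        CURLcode res = curl_easy_perform(handle);
        if (res != CURLE_OK)
            std::fprintf(stderr, "curl: %s\n", curl_easy_strerror(res));
    }

    curl_easy_cleanup(handle);
    curl_global_cleanup();
    return 0;
}
```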
Does fork always create a process on a separate processor?
Is there a way I could control the forking to a particular processor? For example, if I have 2 processors and want the fork to create a parallel process, but on the same processor that contains the parent: does Node.js provide any method for this? I am looking for control over the allocation of the processes. ... Is this even a good idea?
Also, what is the maximum number of processes that can be forked, and why?
I've no Node.js wisdom to impart, simply some info on what OSes generally do.
Any modern OS will schedule processes / threads on CPUs and cores according to the prevailing burden on the machine. The whole point is that they're very good at this, so one is going to have to try very hard to come up with scheduling / core-affinity decisions that beat the OS. Almost no one bothers. Unless you're running on very specific hardware (which perhaps one might get to understand very well), you'd be making a lot of complex decisions for every single different machine the code runs on.
If you do want to try, then I'm assuming you'll have to dig deep below Node.js to make calls to the underlying C library. Most OSes (including Linux) provide means for a process to control core affinity (it's exposed in Linux's glibc).
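For example, on Linux the glibc call in question is sched_setaffinity; a minimal sketch, which a native addon could wrap, might look like this:

```cpp
// Hedged sketch: pin the calling process to CPU 0 using the glibc wrapper
// sched_setaffinity (Linux-specific; g++ defines _GNU_SOURCE by default,
// which <sched.h> needs for cpu_set_t and the CPU_* macros).
#include <sched.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                    // allow only CPU 0

    // pid 0 means "the calling process"; a parent could instead pass a
    // child's pid after forking it.
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("process %d pinned to CPU 0\n", static_cast<int>(getpid()));
    return 0;
}
```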
I am using Node.js for a CPU-intensive task, which basically generates a large amount of data and stores it in a file. I am streaming the data to output files as it is generated, for a single type of data.
Aim: I want to generate this data for multiple types of data in parallel (utilizing my multi-core CPU to its best), without each process having its own heap memory, thus providing larger process memory and increased speed of execution.
I was planning to use node-fibers, which is also used by Meteor JS for its own callback handling. But I am not sure if this will achieve what I want, as in one of the videos on Meteor fibers, Chris Mather mentions at the end that eventually everything is single-threaded and node-fibers somehow manages the same single-threaded event loop to provide its functionality.
So,
1) Does this mean that if I use node fibers, I won't be running my task in parallel, thus not utilizing my CPU cores?
2) Will node webworker-threads help me achieve the functionality I desire? Its home page says that webworker threads run on separate/parallel CPU processes, thus providing multi-threading in the real sense. Is that so?
3) As an ending question: does this mean that Node.js is not advisable for such CPU-intensive tasks?
Note: I don't want to use asynchronous code-structuring libs that are presented as threads but in fact just add syntactic sugar over the same async code, as the tasks are largely CPU-intensive. I have already used async capabilities to the max.
// Update 1 (based on the answer about clusters)
Sorry, I forgot to mention this, but the problems I faced with clusters are:
It is complex to load-balance the amount of work I have in a way that makes sure a particular set of parallel tasks executes before certain other tasks.
I am not sure if clusters really do what I want, referring to these lines on the webworker-threads npm homepage:
The "can't block the event loop" problem is inherent to Node's evented model. No matter how many Node processes you have running as a Node-cluster, it won't solve its issues with CPU-bound tasks.
Any light on how to handle this would be helpful.
Rather than trying to implement multiple threads, you should find it much easier to use multiple processes with Node.js
See, for example, the cluster module. This allows you to easily run the same js code in multiple processes, e.g. one per core, and collect their results / be notified once they're completed.
If cluster does more than you need, then you can also just call fork directly.
If you must have thread-parallelism rather than process-, then you may want to look at writing an async native module. Then you have access to the libuv thread pool (though starving it may reduce I/O performance) or can fork your own threads as you wish (but then you're on your own for synchronising with the rest of Node).
After update 1
For load balancing, if what cluster does isn't working for you, then you can just do it yourself with fork, as I mentioned. The source for cluster is available.
For the other point, it means if the task is truly CPU-bound then there's no advantage Node will give you over other technologies, other than being simpler if everything else is using Node. The only option you have is to make sure you're using all the available CPU resources, which a worker pool will give you. If you're already using Node then the easiest options are using the ones it's already got (cluster or libuv). If they're not sufficient then yeah, you'll have to find something else.
Regardless of technology, it remains true that multi-process parallelism is a lot easier than multi-thread parallelism.
Note: despite what you say, you definitely do want to use async code precisely because the work is CPU-intensive; otherwise your tasks will block all I/O. You do not want this to happen.
I don't understand. Isn't this the whole idea of multi-threading?
Edit: Question modified from "Why two threads within the same process cannot run simultaneously on two processors?".
In the article you link to, it lists this as a limitation of user-level threads (that are implemented by an application itself, without being backed by OS-level threads).
That's correct, but it does not apply to "real" threads. The OS is free to schedule them across multiple processors.
Now that most operating systems have robust support for multithreading, I believe that those user-level threads are a thing of the past.
So, yes, the whole point of multi-threading is to be able to run code in parallel on as many CPUs as you want to assign to it. And "user-level threads" were a workaround for platforms without proper native thread support; they were limited in the way you describe (no multiple CPUs for a single application process).
I haven't been able to write a program in Lua that will load more than one CPU. Since Lua supports the concept via coroutines, I believe it's achievable.
The reason for my failure can be one of:
It's not possible in Lua
I'm not able to write it ☺ (and I hope that's the case)
Can someone more experienced (I discovered Lua two weeks ago) point me in right direction?
The point is to write a number-crunching script that puts a high load on ALL cores...
For demonstration purposes, to show off the power of Lua.
Thanks...
Lua coroutines are not the same thing as threads in the operating system sense.
OS threads are preemptive. That means that they will run at arbitrary times, stealing timeslices as dictated by the OS. They will run on different processors if they are available. And they can run at the same time where possible.
Lua coroutines do not do this. Coroutines may have the type "thread", but there can only ever be a single coroutine active at once. A coroutine will run until the coroutine itself decides to stop running by issuing a coroutine.yield command. And once it yields, it will not run again until another routine issues a coroutine.resume command to that particular coroutine.
Lua coroutines provide cooperative multithreading, which is why they are called coroutines. They cooperate with each other. Only one thing runs at a time, and you only switch tasks when the tasks explicitly say to do so.
You might think that you could just create OS threads, create some coroutines in Lua, and then just resume each one in a different OS thread. This would work so long as each OS thread was executing code in a different Lua instance. The Lua API is reentrant; you are allowed to call into it from different OS threads, but only if you are calling from different Lua instances. If you try to multithread through the same Lua instance, Lua will likely do unpleasant things.
All of the Lua threading modules that exist create alternate Lua instances for each thread. Lua-lltreads just makes an entirely new Lua instance for each thread; there is no API for thread-to-thread communication outside of copying parameters passed to the new thread. LuaLanes does provide some cross-connecting code.
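To illustrate the one-instance-per-OS-thread rule described above, here is a hedged C++ sketch against the Lua C API; the chunk, thread count, and build flags are placeholders:

```cpp
// Sketch: one private Lua instance per OS thread; nothing is shared.
// Build against the Lua C API, e.g.: g++ example.cpp -llua -lpthread
#include <thread>
#include <vector>
#include <cstdio>

extern "C" {
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}

static void run_chunk(int id)
{
    lua_State *L = luaL_newstate();      // this thread's own Lua instance
    luaL_openlibs(L);

    // A placeholder number-crunching chunk; real code would do actual work.
    if (luaL_dostring(L, "local s = 0; for i = 1, 1e7 do s = s + i end") != 0)
        std::fprintf(stderr, "thread %d: %s\n", id, lua_tostring(L, -1));

    lua_close(L);
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)          // e.g. one thread per core
        threads.emplace_back(run_chunk, i);
    for (auto &t : threads)
        t.join();
    std::puts("all Lua states finished");
    return 0;
}
```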
It is not possible with the core Lua libraries (if you don't count creating multiple processes and communicating via input/output), but I think there are Lua bindings for different threading libraries out there.
The answer from jpjacobs to one of the related questions links to LuaLanes, which seems to be a multi-threading library. (I have no experience, though.)
If you embed Lua in an application, you will usually want to have the multithreading somehow linked to your application's multithreading.
In addition to LuaLanes, take a look at llthreads
In addition to the already suggested LuaLanes, llthreads and the other things mentioned here, there is a simpler way.
If you're on a POSIX system, try doing it the old-fashioned way with posix.fork() (from luaposix). You know: split the task into batches, fork the same number of processes as the number of cores, crunch the numbers, collate the results.
Also, make sure that you're using LuaJIT 2 to get the max speed.
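For illustration, the same split/fork/collect pattern sketched in C++ (the posix.fork() version from luaposix is analogous; worker count and the per-worker batch are placeholders, and results would normally flow back through pipes or files):

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    const int workers = 4;               // e.g. one per core

    for (int i = 0; i < workers; ++i) {
        pid_t pid = fork();
        if (pid == 0) {                  // child: crunch one batch
            long long s = 0;
            for (long long n = 0; n < 100000000LL; ++n)
                s += n;
            std::printf("worker %d done (s=%lld)\n", i, s);
            _exit(0);                    // leave without running parent code
        }
    }

    for (int i = 0; i < workers; ++i)    // parent: wait for every worker
        wait(nullptr);
    std::puts("all workers finished");
    return 0;
}
```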
It's very easy: just create multiple Lua interpreters and run Lua programs inside all of them.
Lua multithreading is a shared-nothing model. If you need to exchange data, you must serialize it into strings and pass it from one interpreter to the other with either a C extension, sockets, or any other kind of IPC.
Serializing data via IPC-like transport mechanisms is not the only way to share data across threads.
If you're programming in an object-oriented language like C++, then it's quite possible for multiple threads to access shared objects via object pointers; it's just not safe to do so unless you provide some kind of guarantee that no two threads will attempt to simultaneously read and write the same data.
There are many options for how you might provide that guarantee; lock-free and wait-free mechanisms are becoming increasingly popular.
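For instance, a minimal sketch of the simplest such guarantee, a std::mutex guarding the shared data (the names are illustrative):

```cpp
#include <mutex>
#include <thread>
#include <vector>
#include <cstdio>

struct SharedCounter {
    std::mutex m;
    long value = 0;

    void add(long n) {
        std::lock_guard<std::mutex> lock(m);   // one thread at a time
        value += n;
    }
};

int main()
{
    SharedCounter counter;
    std::vector<std::thread> threads;

    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&counter] {
            for (int n = 0; n < 100000; ++n)
                counter.add(1);
        });

    for (auto &t : threads)
        t.join();

    std::printf("final value: %ld\n", counter.value);  // 400000, never garbage
    return 0;
}
```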