I have about 50 instances of a class which send SNMP requests to different devices (one instance per device) every two minutes and save the results in their instance attributes (self.variables). What should I use - multiprocessing or multithreading?
You should use neither. Instead, embrace I/O multiplexing. You can easily handle 50 connections sending one message every 120 seconds in a single thread.
There are built-in facilities in Python 3 for this: https://docs.python.org/3.4/library/selectors.html
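A minimal single-threaded sketch with the selectors module; the device addresses and request payload below are placeholders, and a real poller would build proper SNMP PDUs (for example with pysnmp) rather than sending a raw datagram:

import selectors
import socket
import time

DEVICES = [("192.0.2.%d" % i, 161) for i in range(1, 51)]  # placeholder addresses
sel = selectors.DefaultSelector()

for addr in DEVICES:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    sel.register(sock, selectors.EVENT_READ, data=addr)    # remember which device owns this socket

while True:
    # Fire one request per device, then collect replies until the next cycle.
    for key in list(sel.get_map().values()):
        key.fileobj.sendto(b"placeholder-snmp-request", key.data)

    deadline = time.monotonic() + 120
    while True:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break
        for key, _events in sel.select(timeout):
            reply, _ = key.fileobj.recvfrom(65535)
            # parse `reply` here and store it on the instance that owns key.data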
I would like to cache a large amount of data in a Flask application. Currently it runs on K8s pods with the following gunicorn.ini:
bind = "0.0.0.0:5000"
workers = 10
timeout = 900
preload_app = True
To avoid caching the same data in those 10 workers, I would like to know if Python supports a way to multi-thread instead of multi-process. This would be very easy in Java, but I am not sure if it is possible in Python. I know that you can share a cache between Python instances using the file system or other methods. However, it would be a lot simpler if it were all shared in the same process space.
Edited:
There are a couple of posts suggesting that threads are supported in Python: this comment by Filipe Correia, or this answer in the same question.
Based on the above comment, the Gunicorn design document talks about workers and threads:
Since Gunicorn 19, a threads option can be used to process requests in multiple threads. Using threads assumes use of the gthread worker.
Based on how Java works, to share some data among threads I would need one worker and multiple threads. Based on this other link, I know it is possible. So I assume I can change my gunicorn configuration as follows:
bind = "0.0.0.0:5000"
workers = 1
threads = 10
timeout = 900
preload_app = True
This should give me 1 worker and 10 threads, which should be able to process the same number of requests as the current configuration. However, the question is: would the cache still be instantiated once and shared among all the threads? How, or where, should I instantiate the cache to make sure it is shared among all the threads?
would like to ... multi-thread instead of multi-process.
I'm not sure you really want that. Python is rather different from Java.
workers = 10
One way to read that is "ten cores", sure. But another way is "wow, we get ten GILs!"
The global interpreter lock must be held before the interpreter interprets a new bytecode instruction. Ten interpreters offer significant parallelism, executing ten instructions simultaneously.
Now, there are workloads dominated by async I/O, or where the interpreter calls into a C extension to do the bulk of the work. If a C thread can keep running, doing useful work in the background, and the interpreter gathers the result later, terrific. But that's not most workloads.
tl;dr: You probably want ten GILs, rather than just one.
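As a rough sketch of that point (worker counts and iteration counts are arbitrary), CPU-bound pure-Python work barely speeds up with threads under a single GIL, while processes, each with their own GIL, do scale:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # pure-Python CPU-bound work; the GIL is held the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(burn, [2_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:   %.2fs" % timed(ThreadPoolExecutor))    # roughly serial
    print("processes: %.2fs" % timed(ProcessPoolExecutor))   # roughly parallel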
To avoid caching the same data in those 10 workers
Right! That makes perfect sense.
Consider pushing the cache into a storage layer, or a daemon like Redis.
Or access a memory-resident cache, in the context of your own process, via mmap or shmat.
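Of those, pushing the cache into Redis is usually the simplest. A minimal sketch, assuming the redis-py client and a server on localhost; the key name, TTL, and loader are placeholders:

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_expensive_data():
    return {"rows": list(range(1000))}            # placeholder for the real loader

def get_expensive_data():
    cached = r.get("expensive-data")              # all 10 workers read the same key
    if cached is not None:
        return json.loads(cached)
    data = load_expensive_data()
    r.setex("expensive-data", 900, json.dumps(data))   # cache for 15 minutes
    return data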
When running Flask under Gunicorn, you are certainly free to set threads greater than 1, though it's likely not what you want.
YMMV. Measure and see.
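If you do go with workers = 1 and threads = 10, anything created at module import time exists once in that worker process and is shared by all of its threads, which answers the "instantiated once" question. A minimal sketch (names are placeholders), with a lock guarding the first fill:

import threading
from flask import Flask

app = Flask(__name__)
_cache = {}                       # one copy per worker process, shared by its threads
_cache_lock = threading.Lock()

def load_from_source(key):
    return "value-for-%s" % key   # placeholder for the real, expensive loader

def get_cached(key):
    with _cache_lock:             # guard the first, expensive fill
        if key not in _cache:
            _cache[key] = load_from_source(key)
        return _cache[key]

@app.route("/item/<key>")
def item(key):
    return {"key": key, "value": get_cached(key)}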
I'm just playing around with the examples, and I tried to use FutureProducer with tokio::spawn, and I'm getting about 11 ms per produce.
1000 messages in 11000ms (11 seconds).
Meanwhile, ThreadedProducer produced 1,000,000 (1 million) messages in about 4.5 seconds (dev) and 2.6 seconds (--release)! That is an insane difference between the two; maybe I missed something, or I'm doing something wrong.
Why use FutureProducer if such a big speed difference exists?
Maybe someone can shed some light to help me understand and learn about the FutureProducer.
The Kafka topic name is "my-topic" and it has 3 partitions.
Maybe my code is not written in a suitable way (for FutureProducer); I need to produce 1,000,000 messages in less than 10 seconds using FutureProducer.
My attempts are written in the following gists (I updated this question to add new gists).
Note: after I wrote my question, I kept trying different ideas until I succeeded on the 7th attempt:
1- spawn blocking:
https://gist.github.com/arkanmgerges/cf1e43ce0b819ebdd1b383d6b51bb049
2- threaded producer
https://gist.github.com/arkanmgerges/15011348ef3f169226f9a47db78c48bd
3- future producer
https://gist.github.com/arkanmgerges/181623f380d05d07086398385609e82e
4- os threads with base producer
https://gist.github.com/arkanmgerges/1e953207d5a46d15754d58f17f573914
5- os thread with future producer
https://gist.github.com/arkanmgerges/2f0bb4ac67d91af0d8519e262caed52d
6- os thread with spawned tokio tasks for the future producer
https://gist.github.com/arkanmgerges/7c696fef6b397b9235564f1266443726
7- tokio multithreading using #[tokio::main] with FutureProducer
https://gist.github.com/arkanmgerges/24e1a1831d62f9c5e079ee06e96a6329
In my 5th example, I needed to use OS threads (thanks to the discussion with @BlackBeans), and inside each OS thread I used a Tokio runtime with 4 worker threads, which blocks in that OS thread.
The example used 100 OS threads, each with a Tokio runtime of 4 worker threads.
Each OS thread produces 10,000 messages.
The code is not optimized, and I ran it as a dev build.
In my 7th attempt I made a new example using #[tokio::main], which by default uses block_on; when I spawn a new task, it can be put on another OS thread (I made a separate test to check this with #[tokio::main]) under the main scheduler (inside block_on). It could produce 1 million messages in 2.93 seconds (dev build) and 2.29 seconds (release build).
I think I went through a similar journey: starting with the FutureProducer because it seemed a good place to start, and getting totally terrible performance; switching to ThreadedProducer, very fast.
I know Kafka quite well, but I'm a noob at Rust. The FutureProducer is broken, as far as I can see, since every await you call will flush and wait for a confirmation.
That is simply not how Kafka is intended to be used; what makes Kafka fast is that you can keep pumping messages and only occasionally, and asynchronously, get acks for the current offsets.
I like how you managed to improve throughput by using many threads, but that is more complex than it should be, and I suppose it is also much more demanding on both the broker and the client.
If there were at least a batch variant, the performance would be bearable, but as I see it now, it is suitable for low volume only.
Have you had any insights since you tried this?
I'm looking to run a pseudo "multithreading" process whose worker count is independent of the core/thread count. Is there an approach similar to this where it's not CPU/thread based?
with Pool(int(Multithread_Count)) as p:        # Multithread_Count is currently 24
    # p.map(Cas_off, Input_Casoff)             # earlier attempt, OCT10
    # print(p.map(Cas_off, Input_Casoff))
    print(p.starmap(Cas_off, zipped))          # run Cas_off over the argument tuples in zipped
So can I create 64 instances of a "worker" without being limited by the thread count?
For reference, this is part of a larger program which takes 6000 files and processes them in batches of 24, since there are 24 threads and each thread is a worker.
It takes about 4 hours using multithreading.
Best regards,
James
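For what it's worth, the pool size passed to Pool (or ThreadPool) is not capped by the core count, so 64 workers on a 24-thread machine is allowed; whether that actually helps depends on whether the work is I/O-bound or CPU-bound. A minimal sketch, assuming I/O-bound work; the worker function and data below are placeholders standing in for Cas_off and zipped:

import time
from multiprocessing.pool import ThreadPool

def process_file(path, option):
    # stand-in for the real worker; simulate I/O-bound work
    time.sleep(0.01)
    return path, option

if __name__ == "__main__":
    zipped = [("file_%d.txt" % i, "opt") for i in range(6000)]   # placeholder arguments
    with ThreadPool(64) as p:        # 64 workers, independent of the 24 hardware threads
        results = p.starmap(process_file, zipped)
    print(len(results))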
I wrote a simple Node.js program that sends out 1000 HTTP requests and records when these requests come back, just increasing a counter by 1 on each response. The endpoint is very lightweight and returns a simple HTTP response without any heavy HTML. I measured that it gives me around 200-300 requests per second for 3 seconds. On the other hand, when I start this same process 3 times (4 processes in total = the number of my available cores), I notice that it performs 4x faster, so I receive approximately 300 * 4 requests per second back.
I want to understand what happens when epoll gets triggered after the kernel notifies the poll about a new file descriptor being ready (a new TCP payload arrived). Does V8 take this file descriptor and read/manipulate the data, and where is the actual bottleneck? Is it in computing and unpacking the payload? It seems that only 1 core is working on sending/receiving these requests for this 1 process, and when I start multiple processes (as many as I have cores), it performs faster.
where is the actual bottleneck?
Sounds like you're interested in profiling your application. See https://nodejs.org/en/docs/guides/simple-profiling/ for the official documentation on that.
What I can say up front is that V8 does not deal with file descriptors or epoll, that's all Node territory. I don't know Node's internals myself (only V8), but I do know that the code is on https://github.com/nodejs/node, so you can look up how any given feature is implemented.
I'm looking at picking up stock data using URLs from Nasdaq. For 4000 stocks, I'm thinking of doing each in a thread, so 4000 URL threads. Has anyone tried this? Does it overload the Windows stack?
I suggest using concurrent.futures.ThreadPoolExecutor() with the default max workers, as too many threads will cause a lot of overhead.
max_workers defaults to min(32, os.cpu_count() + 4) on Python 3.8+ (older versions used os.cpu_count() * 5), which is good in your case, I believe.
A client using asyncio is also an option but it's quite a bit more complicated.
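A minimal sketch of the ThreadPoolExecutor suggestion; the URL list below is a placeholder:

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ["https://example.com/stock/%d" % i for i in range(4000)]   # placeholder URLs

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

with ThreadPoolExecutor() as pool:        # default max_workers, not 4000 threads
    futures = [pool.submit(fetch, u) for u in urls]
    for fut in as_completed(futures):
        try:
            url, body = fut.result()
        except Exception as exc:          # a failed download shouldn't kill the whole run
            print("failed:", exc)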