I'm looking at pulling stock data from Nasdaq URLs. For 4000 stocks I'm thinking of fetching each one in its own thread, so 4000 URL threads. Has anyone tried this? Does it overload the Windows stack?
I suggest using concurrent.futures.ThreadPoolExecutor() with the default max_workers; spawning thousands of threads creates significant overhead.
max_workers defaults to os.cpu_count() * 5 on Python 3.5-3.7 and to min(32, os.cpu_count() + 4) from 3.8 onward, which should be fine in your case.
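For example, a minimal sketch of the pool approach (the URL pattern here is a placeholder, not a real Nasdaq endpoint):

import concurrent.futures
import urllib.request

symbols = ["AAPL", "MSFT", "GOOG"]  # your ~4000 tickers

def fetch(symbol):
    # placeholder URL -- substitute the endpoint you actually use
    url = f"https://example.com/quote/{symbol}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return symbol, resp.read()

# the default max_workers keeps the live thread count modest instead of 4000
with concurrent.futures.ThreadPoolExecutor() as pool:
    for symbol, data in pool.map(fetch, symbols):
        print(symbol, len(data))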
A client using asyncio is also an option but it's quite a bit more complicated.
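If you do try asyncio, the shape is roughly this, assuming the third-party aiohttp package (again with a placeholder URL):

import asyncio
import aiohttp

symbols = ["AAPL", "MSFT", "GOOG"]

async def fetch(session, symbol):
    # placeholder URL
    async with session.get(f"https://example.com/quote/{symbol}") as resp:
        return symbol, await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, s) for s in symbols))

results = asyncio.run(main())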
I would like to cache a large amount of data in a Flask application. Currently it runs on K8s pods with the following gunicorn.ini:
bind = "0.0.0.0:5000"
workers = 10
timeout = 900
preload_app = True
To avoid caching the same data in those 10 workers, I would like to know whether Python supports a way to multi-thread instead of multi-process. This would be very easy in Java, but I am not sure it is possible in Python. I know you can share a cache between Python processes using the file system or other methods, but it would be a lot simpler if it were all shared in the same process space.
Edited:
There are a couple of posts suggesting that threads are supported in Python, for example this comment by Filipe Correia, or this answer in the same question.
Based on the above comment, the Gunicorn design document talks about workers and threads:
Since Gunicorn 19, a threads option can be used to process requests in multiple threads. Using threads assumes use of the gthread worker.
Based on how Java works, to share some data among threads I would need one worker and multiple threads. Based on this other link
I know it is possible. So I assume I can change my gunicorn configuration as follows:
bind = "0.0.0.0:5000"
workers = 1
threads = 10
timeout = 900
preload_app = True
This should give me 1 worker and 10 threads, which should be able to process the same number of requests as the current configuration. However, the question is: would the cache still be instantiated once and shared among all the threads? How or where should I instantiate the cache to make sure it is shared among all the threads?
would like to ... multi-thread instead of multi-process.
I'm not sure you really want that. Python is rather different from Java.
workers = 10
One way to read that is "ten cores", sure.
But another way is "wow, we get ten GILs!"
The global interpreter lock must be held before the interpreter executes a new bytecode instruction, so ten interpreters offer significant parallelism: ten instructions can execute simultaneously.
Now, there are workloads dominated by async I/O, or where the interpreter calls into a C extension to do the bulk of the work. If a C thread can keep running, doing useful work in the background, and the interpreter gathers the result later, terrific. But that's not most workloads.
tl;dr: You probably want ten GILs, rather than just one.
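A toy benchmark shows why extra threads inside a single worker don't add CPU parallelism (illustrative only, not your workload):

import time
from concurrent.futures import ThreadPoolExecutor

def burn():
    # pure-Python CPU work; the GIL lets only one thread execute it at a time
    total = 0
    for i in range(5_000_000):
        total += i
    return total

start = time.time()
for _ in range(4):
    burn()
print("sequential:", time.time() - start)

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda _: burn(), range(4)))
print("4 threads: ", time.time() - start)  # roughly the same wall time, GIL-bound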
To avoid caching the same data in those 10 workers
Right! That makes perfect sense.
Consider pushing the cache into a storage layer, or a daemon like Redis. Or access a memory-resident cache, in the context of your own process, via mmap or shmat.
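For the Redis route, a minimal sketch assuming a reachable Redis service and the redis-py client (host name, key, and TTL are illustrative):

import json
import redis

r = redis.Redis(host="redis", port=6379)  # assumed service name in the cluster

def get_cached(key, loader, ttl=900):
    # return the cached value, or compute it once and share it across all workers
    raw = r.get(key)
    if raw is not None:
        return json.loads(raw)
    value = loader()
    r.setex(key, ttl, json.dumps(value))
    return value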
When running Flask under Gunicorn, you are certainly free to set threads greater than 1, though it's likely not what you want.
YMMV. Measure and see.
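For completeness: with workers = 1 and threads = 10, a module-level object is created once per worker process and shared by all of its request threads, so an in-process cache can look like this (a sketch; the names are illustrative, and it only helps while there is a single worker):

import threading

_cache = {}               # one dict per worker process, shared by its threads
_lock = threading.Lock()  # requests run concurrently, so guard the dict

def get_cached(key, loader):
    # loader is a zero-argument callable that produces the value on a cache miss
    with _lock:
        if key not in _cache:
            _cache[key] = loader()
        return _cache[key]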
I'm just playing around with the examples, and I tried to use FutureProducer with tokio::spawn; I'm getting about 11 ms per produce: 1000 messages in 11000 ms (11 seconds).
Meanwhile ThreadedProducer produced 1000000 (1 million) messages in about 4.5 seconds (dev) and 2.6 seconds (--release)! That is an insane difference between the two, so maybe I missed something or I'm not doing something right.
Why use FutureProducer if this big speed difference exists?
Maybe someone can shed some light to help me understand and learn about the FutureProducer.
Kafka topic name is "my-topic" and it has 3 partitions.
Maybe my code is not written in a suitable way (for the future producer); I need to produce 1000000 messages in less than 10 seconds using FutureProducer.
My attempts are written in the following gists (I updated this question to add new gists)
Note:
After I wrote my question I tried to solve my issue with different ideas until I succeeded on the 7th attempt:
1- spawn blocking:
https://gist.github.com/arkanmgerges/cf1e43ce0b819ebdd1b383d6b51bb049
2- threaded producer
https://gist.github.com/arkanmgerges/15011348ef3f169226f9a47db78c48bd
3- future producer
https://gist.github.com/arkanmgerges/181623f380d05d07086398385609e82e
4- os threads with base producer
https://gist.github.com/arkanmgerges/1e953207d5a46d15754d58f17f573914
5- os thread with future producer
https://gist.github.com/arkanmgerges/2f0bb4ac67d91af0d8519e262caed52d
6- os thread with spawned tokio tasks for the future producer
https://gist.github.com/arkanmgerges/7c696fef6b397b9235564f1266443726
7- tokio multithreading using #[tokio::main] with FutureProducer
https://gist.github.com/arkanmgerges/24e1a1831d62f9c5e079ee06e96a6329
In my 5th example, I needed to use OS threads (thanks for the discussion with #BlackBeans), and inside each OS thread I used a Tokio runtime with 4 worker threads, which blocks in the OS thread. The example used 100 OS threads, each with a Tokio runtime of 4 worker threads, and each OS thread produced 10000 messages. The code is not optimized and I ran it as a dev build.
In my 7th attempt I used #[tokio::main], which by default uses block_on; when I spawn a new task it can be put on a new OS thread (I made a separate test to check this with #[tokio::main]) under the main scheduler (inside block_on). It produced 1 million messages in 2.93 seconds (dev build) and 2.29 seconds (release build).
I think I went through a similar journey: starting with the FutureProducer because it seemed a good place to start, with totally terrible performance, then switching to ThreadedProducer, which was very fast.
I know Kafka quite well, but I'm a noob at Rust. The FutureProducer is broken as far as I can see, since every await you call will flush and wait for a confirmation.
That is simply not how Kafka is intended to be used: what makes Kafka fast is that you can keep pumping messages and only occasionally, and asynchronously, get acks for the current offsets.
I like how you managed to improve throughput by using many threads, but that is more complex than it should be, and I suppose also much more demanding on both the broker and the client.
If there were at least a batch variant, the performance would be bearable, but as I see it now it is suitable for low volume only.
Did you have any insights since you tried this?
As you probably already know, Python 3 effectively executes one thread at a time within a single process, which seems to go well with TinyDB (JSON, pure Python) as well as Bottle (web server).
In a world where you want something in pre-production or early production with low traffic (<100 people a week), what do you think of the idea of running a Bottle website with the built-in HTTP server (Python) and TinyDB as the database?
The two things I was wondering about are:
a) Data isolation (or concurrency): since everything is single-threaded, the CRUD operations will simply be queued and executed one after the other, so there won't be any concurrent access. But given the low traffic, should I care?
b) Wait time: while 10 people who want the same table stored in RAM are being queued, they will have to wait. Will that be humanly noticeable, Python being fast (milliseconds)? I don't really know how to load-test 50 people connecting to the website at the same time and requesting the same resources.
I am open to any feedback, let me know.
If you are going to have such low traffic plus very fast RAM-only operations, then it seems like a worthwhile option, and it's also easily testable.
import bottle

a = bottle.Bottle()

@a.get('/')
def root():
    return {'cheese': '🧀'}

if __name__ == '__main__':
    a.run()
and our test_file:
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def test(index):
    requests.get('http://localhost:8080/').raise_for_status()

pool = ThreadPoolExecutor(max_workers=10)

for i in (1, 10, 50, 100, 1000):
    t = time.time()
    # note: pool.map returns a lazy iterator, so without consuming it (e.g. list(...))
    # this mostly measures submission time rather than completion of all requests
    pool.map(test, range(i))
    print(i, 'took', time.time() - t)
print('🥳')
on my Mac this is the output:
1 took 0.00046515464782714844
10 took 0.003888845443725586
50 took 0.0003077983856201172
100 took 0.0006000995635986328
1000 took 0.0058138370513916016
🥳
which is indeed, not noticeable.
That said, every I/O, CPU, or persistence concern added later will break these assumptions, so for the price of a small overhead it might be better to use a bigger, concurrent DB and the like.
😀
So I have a Python 3.7 program that uses the threading library to run tasks concurrently:
import threading
import time

def myFunc(stName, ndName, ltName):
    ## logic here

names = open('names.txt').read().splitlines()  ## more than 30k names

for i in names:
    processThread = threading.Thread(target=myFunc, args=(i, name2nd, lName,))
    processThread.start()
    time.sleep(0.4)
I have to open multiple windows to complete the tasks with different inputs, but eventually I ran into a very laggy situation where I can't even browse my OS X machine. I tried to use the multiprocessing library to solve the issue, but unfortunately multiprocessing does not seem to work correctly on OS X.
Anyone can advise ?
This behavior is to be expected. If myFunc is a CPU-intensive task that takes time, you are potentially starting up to 30k threads doing this task, which will use all of the machine's resources.
Another potential issue with your code is that threads are expensive in terms of memory (each thread reserves roughly 8 MB for its stack by default). Creating 30k threads could require up to 240 GB of memory, which your machine probably doesn't have, and will lead to memory errors.
Finally, another issue with that code is that your main routine starts all those threads but does not wait for any of them to finish, which means the last threads started will most likely not run to completion.
I would recommend using a ThreadPoolExecutor to solve all those issues:
from concurrent.futures import ThreadPoolExecutor

def myFunc(stName, ndName, ltName):
    ## logic here

names = open('names.txt').read().splitlines()  ## more than 30k names

num_workers = 8
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    for i in names:
        executor.submit(myFunc, i, name2nd, lName)
You can play with num_workers to find a balance between amount of resources being used by this program and speed of execution that fits you.
I am trying to work out how to bulk-index records into Elasticsearch using the bulk function, and I need to use threads to get some performance out of it. But I am stuck trying to work out how to limit the threads to 5 concurrent ones so it's not too heavy on Elasticsearch.
I was thinking of just looping over the db and filling a list, then when it hits e.g. 50 records, pushing it to a thread for processing and continuing. But that method will spawn too many threads, and I cannot see an obvious way to limit the threads without waiting for all of them to finish before adding another one.
I have done this in Golang before, where you can just add threads and when the limit is hit it will simply wait before adding more to the queue, but this seems a little more elusive in Python so far.
I am open to alternatives, but this seems like the cleanest way to go so far; there might be better methods though, like db -> queue with a limit, then threads just consuming from the queue..?
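Roughly what I have in mind, as a sketch (bulk_index here is a placeholder for the real Elasticsearch bulk call, and the batch size of 50 is arbitrary):

from concurrent.futures import ThreadPoolExecutor

def bulk_index(batch):
    # placeholder: call the Elasticsearch bulk API with this batch here
    pass

def batches(rows, size=50):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

rows = range(1000)  # stand-in for rows pulled from the db
# at most 5 batches are indexed concurrently; the rest wait in the pool's queue
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(bulk_index, b) for b in batches(rows)]
    for f in futures:
        f.result()  # surface any indexing errors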
Looking forward to some responses.