Where to set the locks when using pandas apply with multithreading? - python-3.x

I am trying to asynchronously read and write from a pandas df with an apply function. For this purpose I am using the multiprocessing.dummy package. Since I am reading and writing simultaneously (multithreaded) on my df, I am using multiprocessing.Lock() so that no more than one thread can edit the df at a given time. However, I am a bit confused about where to put lock.acquire() and lock.release() with an apply function in pandas. I have tried the code below, but doing it this way makes the entire process synchronous, which defeats the whole purpose of multithreading.
self._lock.acquire()
to_df[col_name] = to_df.apply(lambda row: getattr(Object(row['col_1'],
                                                          row['col_2'],
                                                          row['col_3']),
                                                   someattribute), axis=1)
self._lock.release()
Note: In my case I have to use getattr. someattribute is simply a @property on Object. Object takes 3 arguments, which come from columns col_1, col_2 and col_3 of my df.
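For illustration, a minimal sketch of what such an Object might look like (the names and the property body here are hypothetical; only the @property part matters):

class Object:
    """Hypothetical stand-in for the Object described above."""
    def __init__(self, col_1, col_2, col_3):
        self.col_1 = col_1
        self.col_2 = col_2
        self.col_3 = col_3

    @property
    def someattribute(self):
        # placeholder computation; the real property is presumably expensive
        return (self.col_1, self.col_2, self.col_3)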

There are 2 possible solutions: 1 - locks, 2 - queues. The code below is just a skeleton, it may contain typos/errors and cannot be used as is.
First. Locks where they are actually needed:
def method_to_process_url(df):
    lock.acquire()
    url = df.loc[some_idx, some_col]
    lock.release()

    info = process_url(url)

    lock.acquire()
    # add info to df
    lock.release()
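The same skeleton reads a little cleaner with the lock used as a context manager (standard threading.Lock behaviour; some_idx, some_col, result_col and process_url are placeholders, as above):

import threading

lock = threading.Lock()

def method_to_process_url(df):
    with lock:                                 # acquired and released automatically
        url = df.loc[some_idx, some_col]

    info = process_url(url)                    # the slow part runs without holding the lock

    with lock:
        df.loc[some_idx, result_col] = info    # write the result back under the lock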
Second. Queues instead of locks:
import queue
import threading

def method_to_process_url(url_queue, info_queue):
    while True:
        url = url_queue.get()
        info = process_url(url)
        info_queue.put(info)
url_queue = queue.Queue()
# add all urls to process to the url_queue
info_queue = queue.Queue()

# working_thread_1
threading.Thread(
    target=method_to_process_url,
    kwargs={'url_queue': url_queue, 'info_queue': info_queue},
    daemon=True).start()
# more working threads

counter = 0
while counter < amount_of_urls:
    info = info_queue.get()
    # add info to df
    counter += 1
In the second case you may even start a separate thread for every URL without a url_queue (reasonable if the number of URLs is on the order of thousands or less). counter is a simple way to stop the program once all URLs are processed.
I would use the second approach if you ask me. It is more flexible in my opinion.
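To connect the skeleton back to the original pandas/getattr case, here is a minimal self-contained sketch along the same lines (Object, its three columns and someattribute are placeholders mirroring the question; worker threads only compute values and only the main thread writes to the DataFrame, so no lock is needed):

import queue
import threading
import pandas as pd

class Object:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    @property
    def someattribute(self):              # stands in for the slow property
        return self.a + self.b + self.c

def worker(task_queue, result_queue):
    while True:
        idx, row = task_queue.get()
        value = getattr(Object(row['col_1'], row['col_2'], row['col_3']), 'someattribute')
        result_queue.put((idx, value))
        task_queue.task_done()

df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [4, 5, 6], 'col_3': [7, 8, 9]})
task_queue, result_queue = queue.Queue(), queue.Queue()

for _ in range(4):                        # 4 worker threads
    threading.Thread(target=worker, args=(task_queue, result_queue), daemon=True).start()

for idx, row in df.iterrows():
    task_queue.put((idx, row))
task_queue.join()                         # wait until every row has been processed

while not result_queue.empty():           # only the main thread touches the DataFrame
    idx, value = result_queue.get()
    df.loc[idx, 'new_col'] = value

print(df)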

Related

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (10 GB compressed, 100 GB uncompressed) which contains a number of reports separated by demarcation lines, and I have to parse it.
Parsing and processing the data takes a long time, so this is a CPU-bound problem (not an IO-bound one). I am therefore planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data with the child processes efficiently. I am using subprocess.Popen to stream the uncompressed data into the parent process.
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() so that child-process-1 reads/parses one report and then releases the lock, child-process-2 reads/parses the next report, then control switches back to child-process-1 for the report after that, and so on. But when I pass process.stdout as an argument to the child processes, I get a pickling error.
I have tried creating a multiprocessing.Queue() and a multiprocessing.Pipe() to send data to the child processes, but this is way too slow (in fact it is way slower than doing it serially in a single thread).
Any thoughts/examples about sending data to child processes efficiently would help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
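A rough sketch of that idea, assuming reports are separated by a known delimiter line (the delimiter, the process_report placeholder and the worker count are illustrative, not taken from the question):

import subprocess
import multiprocessing as mp

NWORKERS = 4
DELIM = "xxxxx\n"            # hypothetical report separator

def process_report(lines):
    return len(lines)        # stand-in for the real parsing/processing

def worker(worker_id, nworkers=NWORKERS):
    """Each worker decompresses the whole file itself and keeps only
    every nworkers-th report, skipping the rest."""
    p = subprocess.Popen("gunzip --keep --stdout big-file.gz",
                         shell=True, text=True, stdout=subprocess.PIPE)
    total = 0
    report_no = 0
    report = []
    for line in p.stdout:
        if line == DELIM:
            if report_no % nworkers == worker_id:
                total += process_report(report)
            report = []
            report_no += 1
        elif report_no % nworkers == worker_id:   # don't buffer skipped reports
            report.append(line)
    if report and report_no % nworkers == worker_id:
        total += process_report(report)           # last report may lack a trailing delimiter
    return total

if __name__ == "__main__":
    with mp.Pool(NWORKERS) as pool:
        results = pool.map(worker, range(NWORKERS))
    print(sum(results))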
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed

seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overheads of passing chunks of stdout to worker processes are all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp

NWORKERS = 3
DELIM = "xxxxx\n"

def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout

# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None

def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total

if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now

    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)

    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)

Dask: Submit continuously, work on all submitted data

I have 500 continuously growing DataFrames, and I would like to submit operations on the data (independent for each DataFrame) to dask. My main question is: can dask hold the continuously submitted data, so that I can submit a function over all of the submitted data - not just the newly submitted part?
But let's explain it with an example:
Creating a dask_server.py:
from dask.distributed import Client, LocalCluster

HOST = '127.0.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'

def run_cluster():
    cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT, n_workers=8)
    print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))
    client = Client(cluster)
    print(client)
    print("Press Enter to quit ...")
    input()

if __name__ == '__main__':
    run_cluster()
Now I can connect from my my_stream.py and start to submit and gather data:
DASK_CLIENT_IP = '127.0.0.1'
dask_con_string = 'tcp://%s:%s' % (DASK_CLIENT_IP, DASK_CLIENT_PORT)
dask_client = Client(dask_con_string)

def my_dask_function(lines):
    return lines['a'].mean() + lines['b'].mean()

def async_stream_redis_to_d(max_chunk_size=1000):
    while 1:
        # This is a redis queue, but can be any queueing/file-stream/syslog or whatever
        lines = self.queue_IN.get(block=True, max_chunk_size=max_chunk_size)

        futures = []
        df = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])
        futures.append(dask_client.submit(my_dask_function, df))

        result = dask_client.gather(futures)
        print(result)

        time.sleep(0.1)

if __name__ == '__main__':
    max_chunk_size = 1000
    thread_stream_data_from_redis = threading.Thread(target=streamer.async_stream_redis_to_d, args=[max_chunk_size])
    #thread_stream_data_from_redis.setDaemon(True)
    thread_stream_data_from_redis.start()
    # Lets go
This works as expected and it is really quick!!!
But next, I would like to actually append the lines first, before the computation takes place - and I wonder whether this is possible. So in the example here, I would like to calculate the mean over all lines that have ever been submitted, not only over the ones submitted in the last chunk.
Questions / Approaches:
Is this cumulative calculation possible?
Bad Alternative 1: I cache all lines locally and submit all of the data to the cluster every time a new row arrives. I tried it and it works, but it is slow, because the amount of data resubmitted grows with every new row.
Golden Option: Python program 1 pushes the data. Then it would be possible to connect another client (from another Python program) to that accumulated data and move the analysis logic away from the inserting logic. I think published datasets are the way to go, but are they applicable for these high-speed appends?
Maybe related: Distributed Variables, Actors Worker
Assigning a list of futures to a published dataset seems ideal to me. This is relatively cheap (everything is metadata) and you'll be up to date to within a few milliseconds.
client.datasets["x"] = list_of_futures

def worker_function(...):
    futures = get_client().datasets["x"]
    data = get_client().gather(futures)
    ... work with data
As you mention, there are other systems like PubSub or Actors. From what you say, though, I suspect that futures + published datasets are a simpler and more pragmatic option.
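A minimal sketch of what the producer side could look like under that approach (this is my reading of the answer, not code from it; the scheduler address and column names come from the question, everything else is illustrative):

import pandas as pd
from dask.distributed import Client

client = Client('tcp://127.0.0.1:8711')       # scheduler address from dask_server.py above
all_futures = []

def append_chunk(lines):
    """Scatter a new chunk to the cluster and republish the full list of futures."""
    df = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])
    all_futures.append(client.scatter(df))
    try:                                      # republishing is cheap: only metadata moves
        client.unpublish_dataset("x")
    except KeyError:
        pass
    client.datasets["x"] = all_futures

# Any other program can then read the accumulated data:
#   futures = Client('tcp://127.0.0.1:8711').datasets["x"]
#   full_df = pd.concat(client.gather(futures))
#   print(full_df['a'].mean() + full_df['b'].mean())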

Memory efficient massive http requests

I need to make an unlimited number of HTTP requests to a web API, one after another, and it has to work efficiently and reasonably fast. (I need it for a utility, so it should work no matter how many times I use it; it should also be usable on a web server where several people use it at the same time.)
Right now I'm using threading with a queue, but after a while I start getting errors like:
'cant start a new thread'
'MemoryError'
or it may work for a while, but pretty slowly.
this is a part of my code:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()

for url in get_urls():
    q.put(url.strip())
q.join()
*get_urls() is a simple function that returns a list of urls (unknown length).
This is my receiveJson (thread target):
def receiveJson():
    while True:
        url = q.get()
        res = requests.get(url).json()
        q.task_done()
The problem comes from your threads never ending; notice that there is no exit condition in your receiveJson function. The simplest way to signal that they should end is usually to enqueue None:
def receiveJson():
    while True:
        url = q.get()
        if url is None:   # Exit condition allows thread to complete
            q.task_done()
            break
        res = requests.get(url).json()
        q.task_done()
and then you can change the other code as follows:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()

for url in get_urls():
    q.put(url.strip())
for i in range(concurrent):
    q.put(None)   # Add a None for each thread to be able to get and complete
q.join()
There are other ways of doing this, but this one requires the least amount of change to your code. If this is happening often, it might be worth looking into the concurrent.futures.ThreadPoolExecutor class to avoid the cost of starting threads over and over.
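For reference, a minimal sketch of what the ThreadPoolExecutor variant could look like (fetch_json, the timeout and the worker count are illustrative, not taken from the original code):

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_json(url):
    return requests.get(url, timeout=10).json()

def fetch_all(urls, workers=25):
    # The executor manages the thread pool and the work queue for us,
    # so there is no need for sentinels or q.task_done() bookkeeping.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch_json, urls))

# results = fetch_all(get_urls())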

How can I use multithreading (or multiprocessing?) for faster data upload?

I have a list of issues (jira issues):
listOfKeys = [id1,id2,id3,id4,id5...id30000]
I want to get the worklogs of these issues; for this I used the jira-python library and this code:
listOfWorklogs = pd.DataFrame()  # I used the pandas (pd) lib
lst = {}  # helper dictionary where the worklogs will be stored
for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        i += 1
    else:
        for j in range(len(worklogs)):
            lst = {
                'self': worklogs[j].self,
                'author': worklogs[j].author,
                'started': worklogs[j].started,
                'created': worklogs[j].created,
                'updated': worklogs[j].updated,
                'timespent': worklogs[j].timeSpentSeconds
            }
            listOfWorklogs = listOfWorklogs.append(lst, ignore_index=True)
########### Below there is the recording to the .xlsx file ################
So I simply go through the worklogs of each issue in a simple loop, which is equivalent to hitting the link
https://jira.mycompany.com/rest/api/2/issue/issueid/worklogs and retrieving the information from it.
The problem is that there are more than 30,000 such issues, and the loop is very slow (approximately 3 sec per issue).
Can I somehow start multiple loops / processes / threads in parallel to speed up the process of getting worklogs (maybe without jira-python library)?
I recycled a piece of code I made into your code, I hope it helps:
from multiprocessing import Manager, Process, cpu_count

def insert_into_list(worklog, queue):
    lst = {
        'self': worklog.self,
        'author': worklog.author,
        'started': worklog.started,
        'created': worklog.created,
        'updated': worklog.updated,
        'timespent': worklog.timeSpentSeconds
    }
    queue.put(lst)
    return

# Number of cpus in the pc
num_cpus = cpu_count()
# Manager and queue to hold the results
manager = Manager()
# The queue has controlled insertion, so processes don't step on each other
queue = manager.Queue()

listOfWorklogs = pd.DataFrame()
lst = {}
for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        i += 1
    else:
        # This loop replaces your "for j in range(len(worklogs))" loop
        index = 0  # reset for every issue
        while index < len(worklogs):
            processes = []
            elements = min(num_cpus, len(worklogs) - index)
            # Create a process for each cpu
            for j in range(elements):
                process = Process(target=insert_into_list, args=(worklogs[j + index], queue))
                processes.append(process)
            # Run the processes
            for j in range(elements):
                processes[j].start()
            # Wait for them to finish
            for j in range(elements):
                processes[j].join(timeout=10)
            index += num_cpus

# Dump the queue into the dataframe
while queue.qsize() != 0:
    listOfWorklogs = listOfWorklogs.append(queue.get(), ignore_index=True)
This should work and reduce the time by a factor of a little less than the number of CPUs in your machine. You can try changing that number manually for better performance. In any case I find it very strange that it takes about 3 seconds per operation.
PS: I couldn't try the code because I have no examples, it probably has some bugs
I have run into some troubles:
1) indents in the code where the first "for" loop appears and the first "if" instruction begins (this instruction and everything below it should be included in the loop, right?)
for i in range(len(listOfKeys)-99):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        ....
2) cmd, the conda prompt and Spyder did not allow your code to work, for this reason:
Python Multiprocessing error: AttributeError: module '__main__' has no attribute '__spec__'
After researching on Google, I had to set __spec__ = None a bit higher up in the code (but I'm not sure if this is correct) and the error disappeared.
By the way, the code in Jupyter Notebook ran without this error, but listOfWorklogs is empty and that is not right.
3) when I corrected the indents and set __spec__ = None, a new error occurred at this place:
processes[i].start()
an error like this:
"PicklingError: Can't pickle <class 'jira.resources.PropertyHolder'>: attribute lookup PropertyHolder on jira.resources failed"
If I remove the parentheses from the start and join methods, the code runs, but I don't get any entries in listOfWorklogs.
I ask again for your help!)
How about thinking about it not from a technical standpoint but a logical one? You know your code works, but at a rate of 3 sec per issue it would take about 25 hours to complete. If you have the ability to split up the number of Jira issues that are passed into the script (maybe by date or issue key, etc.), you could create multiple different .py files with basically the same code and just pass each one a different list of Jira tickets. Then you could run, say, 4 of them at the same time and bring the total time down to about 6.25 hours.
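Since the slow part here is waiting on the Jira REST API (I/O-bound) rather than CPU work, a thread pool is another option that sidesteps the PicklingError entirely - threads share memory, so nothing needs to be pickled. A minimal sketch of that alternative (not the answer above; jira and listOfKeys are assumed to exist as in the question, and the worker count is a guess to be tuned against the server's rate limits):

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def fetch_worklogs(key):
    """Fetch all worklogs for one issue and return them as plain dicts."""
    return [{
        'self': w.self,
        'author': w.author,
        'started': w.started,
        'created': w.created,
        'updated': w.updated,
        'timespent': w.timeSpentSeconds,
    } for w in jira.worklogs(key)]

rows = []
with ThreadPoolExecutor(max_workers=20) as executor:   # 20 concurrent requests, adjust as needed
    for worklogs in executor.map(fetch_worklogs, listOfKeys):
        rows.extend(worklogs)

listOfWorklogs = pd.DataFrame(rows)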

How to change the number of multiprocessing pool workers on the go

I want to change the number of workers in the pool that are currently used.
My current idea is
while True:
    current_connection_number = get_connection_number()
    forced_break = False
    with mp.Pool(current_connection_number) as p:
        for data in p.imap_unordered(fun, some_infinite_generator):
            yield data
            if current_connection_number != get_connection_number():
                forced_break = True
                break
    if not forced_break:
        break
The problem is that it just terminates the workers and so the last items that were gotten from some_infinite_generator and weren't processed yet are lost. Is there some standard way of doing this?
Edit: I've tried printing inside some_infinite_generator and it turns out p.imap_unordered requests 1565 items with just 2 pool workers, even before anything is processed. How do I limit the number of items requested from the generator? If I use the code above and change the number of connections after just 2 items, I will lose 1563 items.
The problem is that the Pool consumes the generator internally, in a separate thread. You have no way to control that logic.
What you can do is feed the Pool.imap_unordered method one portion of the generator at a time and have that portion consumed before scaling according to the available connections.
import itertools

CHUNKSIZE = 100

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

chunks = grouper(CHUNKSIZE, some_infinite_generator)

while True:
    current_connection_number = get_connection_number()
    with mp.Pool(current_connection_number) as p:
        while current_connection_number == get_connection_number():
            # feed the pool one finite chunk at a time so the connection
            # count can be re-checked after every chunk
            for data in p.imap_unordered(fun, next(chunks)):
                yield data
It's a bit less optimal as the scaling happens every chunk instead of every iteration but with a bit of fine tuning of the CHUNKSIZE value you can easily get it right.
The grouper recipe.
