So I am running a DNN that is based on the Iris model located here: https://www.tensorflow.org/get_started/estimator and the TextLineReader advice located here: https://www.tensorflow.org/api_guides/python/reading_data
It is having a memory leak problem, and I have narrowed down the leak to these few lines of code:
import numpy as np
import tensorflow as tf
def main():
    filename_queue = tf.train.string_input_producer(file_path)
    defaults = [[0.],[0.],[0.],[0.],[0]]
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    for i in range(50000):
        columns = tf.decode_csv(value, record_defaults=defaults)
if __name__ == "__main__":
    main()
Where the .csv file referred to by file_path contains 1 line:
5.9,3.0,4.2,1.5,1
When I run the program this is my system usage over 60 seconds:
Interestingly, all of the memory gets deallocated when I kill the program, or when the OOM manager does.
Anyway, I have to use batch processing in my program because of the size of the training dataset, so I have to perform the decoding of the .csv file in batches as well.
Is there a way to circumvent this leak, or is this a bug that should be reported?
Any information or suggestions are welcome.
The leak comes from calling the tf.decode_csv function inside the loop: every call allocates some space (each call adds another operation to the graph) that isn't deallocated until the program returns. The solution is to call tf.decode_csv once, outside of the for loop, when getting a batch. As unintuitive as this sounds, I have been able to verify that it still shuffles the data with consecutive reads.
More importantly, this gives insight into the nature of what I believe are called graph operations in TensorFlow. One call, nowhere near the session, and it still works. I guess it is more like setting up a pipeline and then feeding data through that pipeline.
My code actually runs faster now too without all those mallocs!
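To make the "set up the pipeline once, then feed data through it" idea concrete, here is a toy model in plain Python. It is not TensorFlow's API: ToyGraph, add_op and run are hypothetical names invented for this sketch. It only illustrates why registering the same operation inside a loop keeps growing the graph, while registering it once and reusing the handle does not.

```python
class ToyGraph:
    """Toy stand-in for a deferred-execution graph (not TensorFlow's API)."""

    def __init__(self):
        self.ops = []  # every op ever registered stays alive with the graph

    def add_op(self, fn):
        """Register an operation and return a handle (its index)."""
        self.ops.append(fn)
        return len(self.ops) - 1

    def run(self, handle, value):
        """Feed a value through a previously registered operation."""
        return self.ops[handle](value)


def parse_csv_line(line):
    """Parse one comma-separated line into floats."""
    return [float(field) for field in line.split(",")]


# Leaky pattern: a new op is registered on every iteration.
leaky = ToyGraph()
for _ in range(1000):
    handle = leaky.add_op(parse_csv_line)

# Fixed pattern: register the op once, then reuse it for every record.
fixed = ToyGraph()
decode = fixed.add_op(parse_csv_line)
rows = [fixed.run(decode, "5.9,3.0,4.2,1.5,1") for _ in range(1000)]

print(len(leaky.ops), len(fixed.ops))
```

The leaky graph ends up holding one entry per loop iteration, while the fixed one holds a single entry no matter how many records pass through it.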
Related
I am trying to implement a bounded-buffer-like solution where the data generator and the model work as two separate processes. The data generator preprocesses the data and stores it in a shared queue (with a predefined max size to limit memory usage). The model, on the other hand, consumes data from this queue at its own pace until the queue is empty. Below is a snippet of my implementation.
'''
self._buffer is an object of multiprocessing.Queue
'''
def produce(self):
    for obj in self._generator:
        self._buffer.put(obj=obj, block=True, timeout=None)
    # Sentinel: tells the consumer that no more data will arrive
    self._buffer.put(obj=None)
def consume(self):
    while True:
        dat = self._buffer.get(block=True, timeout=None)
        if dat is None:
            break
        else:
            # Train model on `dat`
            pass
def run(self):
    pt = multiprocessing.Process(target=self.produce)
    ct = multiprocessing.Process(target=self.consume)
    pt.start()
    ct.start()
    pt.join()
    ct.join()
However, the solution above does not work. I used torch.multiprocessing as instructed by the documentation. I also set torch.multiprocessing.set_start_method('spawn') in order to avoid "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
But now I get "TypeError: cannot pickle 'generator' object". How can this be fixed?
Since you work with PyTorch, you should use the Dataset and DataLoader approach. It handles the problems with multiprocessing, shared memory and so on for you.
You can have map-style datasets or iterable-style datasets. It is best to read the official documentation to see which is which and how they work.
In your case you are probably fine with an iterable-style dataset. I have used both approaches for similar cases. An iterable-style dataset is what you need if you don't know how many samples you will be processing. In other cases I used a map-style dataset, where I knew the total number of samples beforehand (e.g. processing all images in a directory) and could use a sequential sampler to give me the elements in order.
Regarding one of your problems: errors like TypeError: cannot pickle 'generator' object happen when you have objects which can't be serialized. pickle is used for serialization. In your case self._generator seems to be an object which can't be serialized for some reason. Without code it is not possible to say why. I have had cases where I used wrapped C++ packages created with pybind whose objects were not serializable, or where I had mutex variables somewhere.
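As a rough illustration of the serialization point: a live generator object cannot be pickled, so one common workaround is to hand the producer a generator function plus its arguments and let it build the generator on its own side. The sketch below is an assumption about how that could look, not the asker's code; make_batches and produce are hypothetical names, and the demo runs in a single process with queue.Queue so that it stays self-contained.

```python
import pickle
import queue


def make_batches(n):
    """Hypothetical data generator: yields n small 'batches'."""
    for i in range(n):
        yield [i, i * i]


def produce(factory, n, buffer):
    """Build the generator here, inside the producer, and fill the buffer."""
    for batch in factory(n):
        buffer.put(batch)
    buffer.put(None)  # sentinel: no more data


# A live generator object cannot be pickled ...
try:
    pickle.dumps(make_batches(3))
    generator_picklable = True
except TypeError:
    generator_picklable = False

# ... but its arguments can, so only those need to cross the process boundary.
args_roundtrip = pickle.loads(pickle.dumps(3))

# In-process demonstration with queue.Queue; with multiprocessing this would be
# Process(target=produce, args=(make_batches, 3, mp_queue)).
buffer = queue.Queue(maxsize=8)
produce(make_batches, args_roundtrip, buffer)
received = []
while (item := buffer.get()) is not None:
    received.append(item)
print(generator_picklable, received)
```

The design choice is that only picklable things (a module-level function and plain arguments) are passed to the producer, while the unpicklable generator only ever exists inside it.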
I am trying to use multiprocessing module to initialize each column of a dataframe using a separate CPU core in Python 3.6 but my code doesn't work. Does anybody know the issue with this code? I appreciate your help.
My laptop has Windows 10 and its CPU is Core i7 8th Gen:
import time
import pandas as pd
import numpy as np
import multiprocessing
df=pd.DataFrame(index=range(10),columns=["A","B","C","D"])
def multiprocessing_func(col):
    for i in range(0,df.shape[0]):
        # np.random is a module and cannot be called directly; draw one random float
        df.iloc[i,col]=np.random.rand()
    print("column "+str(col)+ " is completed" )
if __name__ == '__main__':
    starttime = time.time()
    processes = []
    for i in range(0,df.shape[1]):
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print('That took {} seconds'.format(time.time() - starttime))
When you start a Process, it is basically a copy of the parent process. (I'm skipping over some details here, but they shouldn't matter for the explanation).
Unlike threads, processes don't share data. (Processes can use shared memory, but this is not automatic. To the best of my knowledge, the mechanisms in multiprocessing for sharing data cannot handle a dataframe.)
So what happens is that each of the worker processes is modifying its own copy of the dataframe, not the dataframe in the parent process.
For this to work, you'd have to send the new data back to the parent process. You could do that by e.g. return-ing it from the worker function, and then putting the returned data into the original dataframe.
It only makes sense to use multiprocessing like this if the work of generating the data takes significantly longer than launching a new worker process, sending the data back to the parent process and putting it into the dataframe. Since you are basically filling the columns with random data, I don't think that is the case here.
So I don't see why you would use multiprocessing here.
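For completeness, here is a minimal sketch of the "return the data to the parent" idea, under a few assumptions: compute_column and assemble are hypothetical helpers, the standard-library random module stands in for the real calculation, and a plain dict of lists stands in for the dataframe so that the example has no third-party dependencies.

```python
import random
from multiprocessing import Pool

N_ROWS = 10
COLUMNS = ["A", "B", "C", "D"]


def compute_column(col_index):
    """Worker: build one column's values and return them to the parent."""
    rng = random.Random(col_index)  # seeded per column so results are repeatable
    return [rng.random() for _ in range(N_ROWS)]


def assemble(names, column_values):
    """Parent side: map each column name to the list its worker returned."""
    return dict(zip(names, column_values))


if __name__ == "__main__":
    with Pool(processes=len(COLUMNS)) as pool:
        results = pool.map(compute_column, range(len(COLUMNS)))
    table = assemble(COLUMNS, results)
    # With pandas installed, pd.DataFrame(table) would turn this into a dataframe.
    for name, values in table.items():
        print(name, len(values))
```

The workers never touch the parent's data. They only return lists, and the parent alone decides where those lists go.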
Edit: Based on your comment that it takes days to calculate each column, I would propose the following.
Use Process like you have been doing, but have each of the worker processes save the numbers it produces in a file whose filename includes the value of i. Have the workers return a status code so you can determine whether they have succeeded or failed. In case of failure, also return some kind of index of the amount of data successfully completed, so you don't have to re-calculate that again.
The file format should be simple and preferably readable, e.g. one number per line.
Wait for all processes to finish, read the files and fill the dataframe.
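The three steps above could be sketched roughly as follows. Everything here is illustrative: compute_value is a hypothetical stand-in for the expensive calculation, the column_<i>.txt naming is an assumption, and the workers are called in-process so that the example stays self-contained.

```python
import os
import tempfile


def compute_value(col, row):
    """Hypothetical stand-in for the expensive per-cell calculation."""
    return col * 100 + row


def worker(col, n_rows, out_dir):
    """Write one number per line to column_<col>.txt.

    Returns (ok, rows_done) so the parent can tell success from failure
    and resume from rows_done instead of starting over.
    """
    rows_done = 0
    path = os.path.join(out_dir, "column_{}.txt".format(col))
    try:
        with open(path, "w") as handle:
            for row in range(n_rows):
                handle.write("{}\n".format(compute_value(col, row)))
                rows_done += 1
        return True, rows_done
    except Exception:
        return False, rows_done


def load_columns(n_cols, out_dir):
    """Parent side: read every column file back into a dict of lists."""
    table = {}
    for col in range(n_cols):
        path = os.path.join(out_dir, "column_{}.txt".format(col))
        with open(path) as handle:
            table[col] = [float(line) for line in handle]
    return table


with tempfile.TemporaryDirectory() as out_dir:
    # Called in-process here; in practice each call would run in its own
    # worker process (e.g. via multiprocessing.Pool.apply_async).
    statuses = [worker(col, 3, out_dir) for col in range(2)]
    table = load_columns(2, out_dir)
print(statuses, table)
```

Because the files survive a crash, a failed column can be resumed from the reported row count rather than recomputed from scratch.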
I am trying to parallelize a function on my pandas dataframe and I'm running into an issue where it seems that the multiprocessing library is hanging. I am doing this all within a Jupyter notebook with myFunction() existing in a separate .py file. Can someone point out what I am doing wrong here?
Surprisingly, this piece of code has worked previously on my Windows 7 machine with the same version of python. I have just copied the file over to my Mac laptop.
I also use tqdm so I can monitor the progress, the behavior is the same with or without it.
#This function handles the multiprocessing
from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd
import tqdm
def parallelize_dataframe(df, func):
    num_partitions = cpu_count()*2 # number of partitions to split dataframe
    num_cores = cpu_count() # number of cores on your machine
    df_split = np.array_split(df, num_partitions)
    # Shut the pool down once every partition has been processed
    with Pool(num_cores) as pool:
        return pd.concat(list(tqdm.tqdm_notebook(pool.imap(func, df_split),total=num_partitions)))
#My function that I am applying to the dataframe is in another file
#myFunction retrieves a JSON from an API for each ID in myDF and converts it to a dataframe
from myFuctions import myFunction
#Code that calls the parallelize function
finalDF = parallelize_dataframe(myDF,myFunction)
The expected result is a concatenation of a list of dataframes that have been retrieved by myFunction(). This has worked in the past, but now the process seems to hang indefinitely without any error messages.
Q: Can someone point out what I am doing wrong here?
You expected macOS to use the same mechanism for creating processes as Windows did in the past.
The multiprocessing module does not do the same set of things on each of the supported operating systems. Its documentation even flags some start methods as unsafe, and the default behaviour has changed on macOS- and Linux-based systems.
Next steps to try to move forward:
Re-read the multiprocessing documentation on how to set up the call signatures explicitly (this avoids a hidden dependency of the code's behaviour on "new" default values).
Test whether you can avoid the cases where multiprocessing spawns a full copy of the Python interpreter process as many times as you instruct (memory allocations can quickly become devastatingly large if many replicas get instantiated beyond the local machine's RAM, just because the number of CPU cores grows).
Test whether the "worker" code is dominated by remote API-call latency rather than by computation. In that case asyncio/await-based tools will mask the latency better than multiprocessing, whose expensive full-copy concurrency of many Python processes (which just sit waiting for remote API answers) is inefficient for I/O-latency-dominated use cases.
Last but not least, performance-sensitive code runs best outside any mediating ecosystem such as the interactivity-focused Jupyter notebook.
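To illustrate the latency-masking point, here is a small asyncio sketch. It makes no real network calls: fetch_record is a hypothetical coroutine in which asyncio.sleep stands in for the remote API latency, so treat it as a shape to adapt rather than a drop-in replacement for myFunction.

```python
import asyncio
import time


async def fetch_record(record_id, latency=0.05):
    """Hypothetical API call: asyncio.sleep stands in for network latency."""
    await asyncio.sleep(latency)
    return {"id": record_id, "value": record_id * 2}


async def fetch_all(record_ids):
    """Start every request at once and wait for all of them together."""
    return await asyncio.gather(*(fetch_record(rid) for rid in record_ids))


start = time.perf_counter()
records = asyncio.run(fetch_all(range(20)))
elapsed = time.perf_counter() - start
print(len(records), "records in", round(elapsed, 2), "s")
```

All twenty simulated requests wait concurrently inside one process, so the total time is close to one latency period rather than twenty, and no extra interpreter copies are needed.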
I have a thread in my python program that acquires images from a webcam and puts them in a multiprocessing queue. A separate process then takes these images from the queue and does some processing. However, if I try to empty the queue from the image acquisition (producer) thread I do not free any memory, and the program eventually uses all the available memory and crashes the machine (Python 3.6.6 / Ubuntu 18.04 64bit / Linux 4.15.0-43-generic)
I have a simple working example that reproduces the problem.
import multiprocessing
import time
import numpy as np
queue_mp = multiprocessing.Queue(maxsize=500)
def producer(q):
    while True:
        # Generate object to put in queue
        dummy_in = np.ones((1000,1000))
        # If the queue is full, get the oldest object (FIFO),
        # to make space for the latest incoming object.
        if q.full():
            __ = q.get()
        q.put(dummy_in)
def consumer(q):
    while True:
        # Get object from queue
        dummy_out = q.get()
        # Do some processing on the object, which we simulate here by time.sleep
        time.sleep(3)
producer_process = multiprocessing.Process(target=producer,
                                           args=(queue_mp,),
                                           daemon=False)
consumer_process = multiprocessing.Process(target=consumer,
                                           args=(queue_mp,),
                                           daemon=False)
# Start producer and consumer processes
producer_process.start()
consumer_process.start()
I can rewrite my code to avoid this problem, but I'd like to understand what is happening. Is there a general rule that producers and consumers of a multiprocessing queue must be running in separate processes?
If anyone understands why this happens, or what exactly is happening behind the scenes of multiprocessing queues that would explain this memory behavior I would appreciate it. The docs did not go into a lot of detail.
I figured out what was happening, so I'll post it here for the benefit of anyone who stumbles across this question.
My memory problem resulted from a numpy bug in numpy version 1.16.0. Reverting to numpy version 1.13.3 resolved the problem.
To answer the basic question: No, there is no need to worry which thread/process is doing the consuming (get) and which thread/process is doing the producing (put) for multiprocessing queues. There is nothing special about multiprocessing queues with respect to garbage collection. As kindall explains in response to a similar question:
When there are no longer any references to an object, the memory it occupies is freed immediately and can be reused by other Python objects
I hope that helps someone. In any case, the numpy bug should be resolved in the 1.16.1 release.
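If you want to see the "freed as soon as the last reference goes away" behaviour for yourself, a weak reference makes it observable. This is a small sketch that assumes CPython (which uses reference counting); Frame is a hypothetical stand-in for an image buffer.

```python
import weakref


class Frame:
    """Hypothetical stand-in for a large image buffer."""

    def __init__(self, size):
        self.data = bytearray(size)


frame = Frame(1_000_000)
probe = weakref.ref(frame)   # observes the object without keeping it alive
alive_before = probe() is not None

del frame                    # drop the last strong reference
alive_after = probe() is not None

print(alive_before, alive_after)
```

On CPython the weak reference goes dead right after the del, with no explicit garbage collection call in between.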
I'm attempting to download around 3,000 files (each being maybe 3 MB in size) from Amazon S3 using requests_futures, but the download slows down badly after about 900, and actually starts to run slower than a basic for-loop.
It doesn't appear that I'm running out of memory or CPU bandwidth. It does, however, seem like the Wifi connection on my machine slows to almost nothing: I drop from a few thousand packets/sec to just 3-4. The weirdest part is that I can't load any websites until the Python process exits and I restart my wifi adapter.
What in the world could be causing this, and how can I go about debugging it?
If it helps, here's my Python code:
import requests
from requests_futures.sessions import FuturesSession
from concurrent.futures import ThreadPoolExecutor, as_completed
# get a nice progress bar
from tqdm import tqdm
def download_threaded(urls, thread_pool, session):
    futures_session = FuturesSession(executor=thread_pool, session=session)
    futures_mapping = {}
    for i, url in enumerate(urls):
        future = futures_session.get(url)
        futures_mapping[future] = i
    results = [None] * len(futures_mapping)
    with tqdm(total=len(futures_mapping), desc="Downloading") as progress:
        for future in as_completed(futures_mapping):
            try:
                response = future.result()
                result = response.text
            except Exception as e:
                result = e
            i = futures_mapping[future]
            results[i] = result
            progress.update()
    return results
s3_paths = [] # some big list of file paths on Amazon S3
def make_s3_url(path):
    return "https://{}.s3.amazonaws.com/{}".format(BUCKET_NAME, path)
urls = map(make_s3_url, s3_paths)
with ThreadPoolExecutor() as thread_pool:
    with requests.session() as session:
        results = download_threaded(urls, thread_pool, session)
Edit with various things I've tried:
time.sleep(0.25) after every future.result() (performance degrades sharply around 900)
4 threads instead of the default 20 (performance degrades more gradually, but still degrades to basically nothing)
1 thread (performance degrades sharply around 900, but recovers intermittently)
ProcessPoolExecutor instead of ThreadPoolExecutor (performance degrades sharply around 900)
calling raise_for_status() to throw an exception on any error status (4xx or 5xx), then catching this exception by printing it as a warning (no warnings appear)
use ethernet instead of wifi, on a totally different network (no change)
creating futures in a normal requests session instead of using a FuturesSession (this is what I did originally; I found requests_futures while trying to fix the issue)
running the download on only a narrow range of files around the failure point (e.g. file 850 through file 950) -- performance is just fine here, print(response.status_code) shows 200 all the way, and no exceptions are caught.
For what it's worth, I have previously been able to download ~1500 files from S3 in about 4 seconds using a similar method, albeit with files an order of magnitude smaller
Things I will try when I have time today:
Using a for-loop
Using Curl in the shell
Using Curl + Parallel in the shell
Using urllib2
Edit: it looks like the number of threads is stable, but when the performance starts to go bad the number of "Idle Wake Ups" appears to spike from a few hundred to a few thousand. What does that number mean, and can I use it to solve this problem?
Edit 2 from the future: I never ended up figuring out this problem. Instead of doing it all in one application, I just chunked the list of files and ran each chunk with a separate Python invocation in a separate terminal window. Ugly but effective! The cause of the problem will forever be a mystery, but I assume it was some kind of problem deep in the networking stack of my work machine at the time.
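For anyone wanting to reproduce the chunking workaround, splitting the list could look roughly like this. The helper name chunked, the chunk size of 800 and the file names are all made up for illustration; the real S3 paths are not shown above.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of `items`, each at most `chunk_size` long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Hypothetical file list; the real S3 paths are not shown in the question.
paths = ["file_{:04d}.txt".format(i) for i in range(3000)]
chunks = list(chunked(paths, 800))
print([len(chunk) for chunk in chunks])
```

Each chunk can then be handed to a separate Python invocation, which keeps every run below the point where the slowdown appeared.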
This isn't a surprise.
You don't get any extra parallelism once you have more threads than cores.
You can prove this to yourself by simplifying the problem to a single core with multiple threads.
What happens? You can only have one thread running at a time, so the operating system context-switches between the threads to give each one a turn. One thread works while the others sleep until they are woken up in turn to do their bit. In that case you can't do better than a single thread.
You may do worse, because context switching and the memory allocated for each thread (1MB each) have a price, too.
Read up on Amdahl's Law.
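As a quick reference, Amdahl's Law says that if a fraction p of the work can be parallelized across n workers, the best possible speedup is 1 / ((1 - p) + p / n). A small sketch, where the function name amdahl_speedup and the 90% figure are just illustrative choices:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup when `parallel_fraction` of the work scales across `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for workers in (1, 2, 4, 8, 1000):
    print(workers, round(amdahl_speedup(0.9, workers), 2))
```

Even with 90% of the work parallelizable, the speedup can never exceed 10x, however many workers are added, because the serial 10% always remains.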