I'm writing code for my thesis that collects a series of .csv files, each representing a temperature image. To work with these .csv files more efficiently, I want to store them in a dataframe.
I have multiple videos of 10000+ frames each, and each video should be stored in a separate dataframe. My code works for smaller numbers of files, but when I run it on a batch of videos (say 10), it crashes with a MemoryError after a couple of videos. I already tried calling gc.collect() and deleting the dataframe after df.to_hdf to prevent Python from keeping the intermediate lists alive in memory, but memory usage still keeps increasing until it completely fills the RAM and the program crashes.
The relevant piece of the code is included here:
import gc
import pandas as pd

dfs = {}
for l in range(k * number_of_csv, len(i)):
    # read one frame and store it in the dict, keyed by its (zero-padded) file name
    df = pd.read_csv(i[l], sep=',', header=None)
    dfs['{:0>8}'.format(i[l])] = df
# combine all frames of this chunk into one dataframe and write it to disk
dfs = pd.concat(dfs)
dfs.to_hdf('experiment_data_series_' + str(k) + '.h5', key='raw', mode='w')
del dfs
gc.collect()
In short: it builds a dataframe from all the csv files and then stores it in the h5 file.
Can anyone spot what is missing to keep this from over-consuming memory?
I already inserted a chunking procedure, so that the number of .csv files stored in a single h5 file is always < 20000 (usually each h5 file is 3-4 GB).
I suspect that Python allocates some memory for the storing operation but doesn't release it afterwards.
I'd appreciate the help.
kind regards,
Cedric
Since you have stated that each video should be stored in a separate dataframe, I don't think you need to concatenate them all into dfs. Simply read in each .csv file and write it to your HDFStore under its own key; think of it as a dict of DataFrame objects.
Consider the following code:
# create an HDF5 file which will contain all your data;
# this statement automatically opens the h5 file for you
hdf = pd.HDFStore('experiment_data_series.h5')

for l in range(k * number_of_csv, len(i)):
    df = pd.read_csv(i[l], sep=',', header=None)
    # store this dataframe in the HDF5 file under its own key
    hdf['raw_{}'.format(l)] = df

# when you are finished writing to the HDF5 file remember to close it
hdf.close()
Later on you can open the file again and view some information:
hdf = pd.HDFStore('experiment_data_series.h5', mode='r')
print(hdf.info())
print(hdf.keys())
# you can get one dataframe by its key
df = hdf.select('raw_0', auto_close=True)
I think this should work because you won't have all of your data loaded in memory at once, i.e., you are using the disk instead.
Related
For a project I need to append, frequently but non-periodically, about one thousand or more data files (tabular data) to one existing CSV or parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the resulting file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/parquet file. Sometimes all the data may be appended into one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times, from different users and different sources, I am also concerned that the resulting CSV/parquet may contain too many part-files for the namenode to handle.
I have done some small tests appending data to existing CSV/parquet files and noticed that for each append a new file was generated, for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file inside 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable even to count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the result to a new file doesn't seem very efficient here.
NameNode memory overflow
Then increase the heapsize of the namenode
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
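For example, a minimal sketch of reducing the number of part-files per write by coalescing before each append (the output path and partition counts are illustrative assumptions, not values from the question):

# coalesce to a few partitions before appending, so each ETL run writes
# a handful of larger part-files instead of many tiny ones
out_path = '/user/applepy/pyspark_partition/result.parquet'  # assumed output path
df.coalesce(4).write.mode('append').parquet(out_path)

# periodically you could also compact the whole dataset into fewer, larger files
compacted = spark.read.parquet(out_path)
compacted.repartition(16).write.mode('overwrite').parquet(out_path + '_compacted')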
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.
I don't know which part of the code I should share, since what I do is basically as below (I will share a simplified version of the code for reference):
Task: I need to search file A and then match the values in file A against a column in the files of set B (more than 100 csv files, each containing more than 1 million rows), then, after matching, combine the results into a single CSV.
Extract the column values from file A and turn them into a list of values.
Load each file of set B in PySpark and use .isin to match against the list of values from file A.
Concatenate the results into a single csv file.
"""
import glob
import pandas as pd

first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance, args=(int,))]["columnA"].values.tolist()

combine = []
for file in glob.glob("directory/*.csv"):  # this loops at least 100 times
    second = spark.read.csv(file)
    # keep only the rows whose columnB value is in the reference list
    # (hundreds of thousands of rows are expected to match)
    second = second.filter(second["columnB"].isin(list_values))
    combine.append(second.toPandas())

total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a way to better perform this task? Currently it takes more than 30 hours just to run the code, and it ends in failure with the above error. Is there something like parallel programming with which I could speed up the process, or a way to clear the above error?
Also, when I test it with only 2 CSV files it takes less than a minute to complete, but when I loop over the whole folder with 100 files it takes more than 30 hours.
There are several things I think you can try to optimize, given that your configuration and resources stay unchanged (a short PySpark sketch follows these points):
Repartition when you read your CSV. I didn't study the source code of how Spark reads a csv, but based on my experience and cases on SO, when you use Spark to read a csv all the data may end up in a single partition, which can cause a Java OOM error and also does not fully utilize your resources. Check the partitioning of the data and make sure there is no data skew before you do any transformation or action.
Rethink how to do the filtering based on another dataframe's column values. Your current approach uses a Python list to collect the reference values and then uses .isin() to check whether the main dataframe column contains a value from that list. If the reference list is very long, having EACH ROW scan the whole list is definitely costly. Instead, you can use a leftsemi .join() to achieve the same goal. And if the reference dataset is small and you want to avoid data shuffling, you can use broadcast to copy the reference dataframe to every node.
If you can achieve it in Spark SQL, don't do it with pandas. In your last step you concatenate all the data after the filtering; you can achieve the same goal with .unionAll() or .unionByName(). Even if you call pd.concat() inside the Spark session, all the pandas operations run on the driver node and are not distributed, so they can cause a Java OOM error and degrade performance too.
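A minimal sketch along those lines (file paths, column names and partition counts are assumptions carried over from the question, not a tested pipeline):

import glob
from pyspark.sql.functions import broadcast

# reference values from file A as a (small) Spark dataframe instead of a Python list
ref = spark.createDataFrame([(v,) for v in list_values], ["columnA"])

result = None
for path in glob.glob("directory/*.csv"):          # assumed file pattern
    part = spark.read.csv(path, header=True)
    part = part.repartition(8)                     # avoid one huge partition per file
    # leftsemi join keeps the rows of `part` whose columnB appears in the reference;
    # broadcast() ships the small reference table to every executor, avoiding a shuffle
    part = part.join(broadcast(ref), part["columnB"] == ref["columnA"], "leftsemi")
    result = part if result is None else result.unionByName(part)

# coalesce(1) so the combined output ends up in a single csv part-file
result.coalesce(1).write.csv("combined_output", header=True)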
I make great efforts to save my parquet files with an index of dtype datetime64[ns]. But when I then read multiple of the parquet files into a Dask DataFrame, it converts the index to dtype object (str). Why?
I cannot use the parse_dates argument in my read_parquet call as it only works on columns. I read each individual underlying parquet file with pandas and checked the dtype of the index; they are consistent.
My code is simple:
import dask.dataframe as dd

try:
    df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve),
                         engine='pyarrow')
except Exception as ex:
    self.build_error(ex, end_date)
df = df[list(cols_to_retrieve)]
df = df.compute()
What is the recommended approach to fixing Dask's tendency to change dtypes?
It sounds like you might have found a bug. You should write a little code that creates the file(s) with the datetime type but no longer has that type after loading, and post it at https://github.com/dask/dask/issues/new
i.e., we need an MCVE, so that we can debug and fix it, and turn your example into a test against future regressions. It also often reveals something you might have been doing wrong, if there is in fact no bug.
That said, you could also try the fastparquet parquet engine, just in case it happens to cope with your particular data.
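A rough sketch of what such an MCVE could look like (file names are placeholders; it just round-trips a datetime64[ns] index through pandas and Dask and prints the dtypes for comparison):

import pandas as pd
import dask.dataframe as dd

# build a small frame with a datetime64[ns] index and write it with pyarrow
idx = pd.date_range("2021-01-01", periods=5, freq="D")
pdf = pd.DataFrame({"value": range(5)}, index=idx)
pdf.to_parquet("sample_0.parquet", engine="pyarrow")
pdf.to_parquet("sample_1.parquet", engine="pyarrow")

print(pd.read_parquet("sample_0.parquet").index.dtype)   # expected: datetime64[ns]

ddf = dd.read_parquet(["sample_0.parquet", "sample_1.parquet"], engine="pyarrow")
print(ddf.index.dtype)                                    # compare with the pandas dtype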
We have a csv file that is being used like a database, and an ETL script that takes input Excel files and transforms them into the same format to append to the csv file.
The script reads the csv file into a dataframe, appends the new input dataframe to the end, and then uses to_csv to overwrite the old csv file.
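For reference, a minimal sketch of the pattern described (file names and the transform step are placeholders, not the actual script):

import pandas as pd

# read the existing "database" csv and the new (already transformed) input data
existing = pd.read_csv("database.csv")            # assumed file name
new_rows = pd.read_excel("input.xlsx")            # assumed to match the csv's columns

# append the new rows and overwrite the old csv with the combined data
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_csv("database.csv", index=False)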
The problem is, since we updated to a new version of Python (downloaded with Anaconda), the output csv file grows larger and larger every time we append data to it. The more lines of the original csv that are read into the script (and written out again with the newly appended data), the more the output file size is inflated. The actual number of rows and the data in the csv files are fine; it's just the file size itself that is unusually large.
Does anyone know if updating to a new version of Python could have broken this process?
Is Python storing data in the csv file that we cannot see?
Any ideas or help is appreciated! Thank you.
I am using RandomForestClassifier in Python to predict whether each pixel in an input image is inside a cell or outside it, as a pre-processing stage to improve the image. The problem is that the training set is 8.36 GB and the test data is another 8.29 GB, so whenever I run my program I get an out-of-memory error. Would extending the memory not work? Is there a way to read the csv files that contain the data in more than one step, freeing the memory after each step?
Hopefully you are using pandas to process this csv file, as it would be nearly impossible in native Python. As for your memory problem, here is a great article explaining how to process large csv files by chunking the data in pandas:
http://pythondata.com/working-large-csv-files-python/
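In short, the idea is pandas' chunksize option; a minimal sketch (the file name, chunk size and per-chunk processing are placeholder assumptions):

import pandas as pd

# read the large csv in pieces instead of all at once;
# each chunk is an ordinary DataFrame of up to `chunksize` rows
for chunk in pd.read_csv("training_data.csv", chunksize=100_000):
    process(chunk)   # placeholder for your per-chunk work (feature extraction, etc.)
    del chunk        # the chunk can be freed before the next one is read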