pandas read_csv memory consumption - python-3.x

I am reading huge pandas (version 0.18.1, on purpose) DataFrames stored in CSV format (~30 GB in total). Working with read_csv, however, memory consumption grows to roughly double the size of the initial CSV files, i.e. ~60 GB. I am aware of the chunksize parameter, but it was much slower and didn't really reduce memory usage. I tried it with a 4 GB DataFrame: having read it, the script still consumed ~7 GB of RAM. Here's my code:
import pandas

df = None
for chunk in pandas.read_csv(fn, chunksize=50000):
    if df is None:
        df = chunk
    else:
        df = pandas.concat([df, chunk])
This is only a short version. I am also aware that specifying the dtype saves memory. So here's my question: what's the best way (in terms of performance and memory) to read huge pandas DataFrames?

Depending on the types of operations you want to do on the dataframes, you might find dask useful. One of its key features is allowing operations on larger-than-memory dataframes. For example, to do a groupby on a larger-than-memory dataframe:
import dask.dataframe as dd

df = dd.read_csv(fn)
# 'key' is a placeholder for the name of your grouping column
df_means = df.groupby(key).mean().compute()
Note the addition of compute() at the end, as compared to a typical pandas groupby operation.
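If your column dtypes are known up front, you can pass them to dask's read_csv as well, which keeps the memory per partition predictable. A minimal sketch (the column names, dtypes, and blocksize are assumptions, not from the question):

import dask.dataframe as dd

# hypothetical column names and dtypes
df = dd.read_csv(fn, dtype={'id': 'int64', 'label': 'object'}, blocksize='64MB')
df_means = df.groupby('label')['id'].mean().compute()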

You are using chunksize incorrectly. It is not meant for simply appending to a dataframe chunk by chunk; it is meant to break a large dataset into pieces so you can process it one piece at a time, so that only the chunk currently being processed needs to stay in memory.
Using dtypes and usecols is the best way to reduce memory usage.
It is hard to say more because you didn't provide any details about the dataset, such as row count, row size, column data types, number of columns, and whether the data is clean and structured. If the data in your columns is not consistent, it can cause unexpected upcasting and a memory spike, so you may need to pre-process it before loading the dataframe.
Consider using the category data type for any object/string columns with low cardinality and low selectivity.
Use dtypes to lower the precision of your numeric columns.
Use chunksize to process data in chunks, not just to append it. Or use dask.
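Putting those pieces together, here is a minimal sketch of chunked processing with explicit dtypes, usecols, and a category conversion (the column names, dtypes, and aggregation are placeholders, not from the question):

import pandas as pd

# hypothetical column names and dtypes; adjust to your file
dtypes = {'id': 'int32', 'value': 'float32', 'status': 'object'}
usecols = list(dtypes)

totals = None
for chunk in pd.read_csv(fn, usecols=usecols, dtype=dtypes, chunksize=500000):
    chunk['status'] = chunk['status'].astype('category')  # low-cardinality strings
    part = chunk.groupby('status')['value'].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

Only the current chunk and the small running aggregate stay in memory, which is the point of chunksize.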

Related

Need to release the memory used by unused spark dataframes

I am not caching or persisting the Spark dataframe. If I have to do many additional things in the same session, aggregating and modifying the content of the dataframe as part of the process, then when and how would the initial dataframe be released from memory?
Example:
I load a dataframe DF1 with 10 million records. Then I do some transformation on the dataframe which creates a new dataframe DF2. Then there are a series of 10 steps I do on DF2. Through all of this, I do not need DF1 anymore. How can I be sure that DF1 no longer exists in memory and is not hampering performance? Is there any approach by which I can directly remove DF1 from memory? Or does DF1 get automatically removed based on a Least Recently Used (LRU) approach?
That's not how Spark works. DataFrames are lazy: the only things stored in memory are the structures and the list of transformations you have applied to your dataframes. The data itself is not stored in memory (unless you cache it and apply an action).
Therefore, I do not see any problem in your question.
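For completeness: if you do cache an intermediate DataFrame, you can release it explicitly once it is no longer needed. A minimal sketch (the path and column name are made up):

df1 = spark.read.parquet('/path/to/input')   # hypothetical source
df1.cache()

df2 = df1.groupBy('some_column').count()     # hypothetical transformation
df2.cache()
df2.count()                                  # action that materializes df2's cache

df1.unpersist()                              # drop df1's cached blocks; df2 stays cached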
Prompted by a question from A Pantola in the comments, I'm returning here to post a better answer to this question. Note there are MANY possible correct answers for how to optimize RAM usage, and the right one depends on the work being done!
First, write the dataframe to DBFS, something like this:
spark.createDataFrame(data=[('A', 0)], schema=['LETTERS', 'NUMBERS']) \
    .repartition("LETTERS") \
    .write.partitionBy("LETTERS") \
    .parquet(f"/{tmpdir}", mode="overwrite")
Now,
df = spark.read.parquet(f"/{tmpdir}")
Assuming you don't set up any caching on the above df, then each time Spark finds a reference to df it will read the parquet files in parallel and compute whatever is specified.
Note the above solution will minimize RAM usage, but may require more CPU on every read. Also, the above solution will have a cost of writing to parquet.
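As a usage sketch (assuming the write and read above have already run), each action on df then triggers a fresh parallel read of the parquet files rather than holding the data in RAM:

df.groupBy("LETTERS").count().show()   # re-reads the parquet files and aggregates
df.filter(df.NUMBERS > 0).count()      # another independent re-read for this action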

Repartition by dates for high concurrency and big output files

I'm running a Spark job on AWS Glue. The job transforms the data and saves the output to parquet files, partitioned by date (year, month, day directories). The job must be able to handle terabytes of input data and uses hundreds of executors, each with a 5.5 GB memory limit.
Input covers over 2 years of data. The output parquet files for each date should be as big as possible, optionally split into 500 MB chunks. Creating multiple small files for each day is not wanted.
A few tested approaches:
repartitioning by the same columns as in write results in Out Of Memory errors on executors:
df = df.repartition(*output_partitions)
(df
    .write
    .partitionBy(output_partitions)
    .parquet(output_path))
repartitioning with an additional column holding a random value results in multiple small output files being written (corresponding to the spark.sql.shuffle.partitions value):
df = df.repartition(*output_partitions, "random")
(df
    .write
    .partitionBy(output_partitions)
    .parquet(output_path))
setting the number of partitions in the repartition function, for example to 10, gives 10 quite big output files, but I'm afraid it will cause Out Of Memory errors when the actual data (TBs in size) is loaded:
df = df.repartition(10, *output_partitions, "random")
(df
    .write
    .partitionBy(output_partitions)
    .parquet(output_path))
(df in the code snippets is a regular Spark DataFrame)
I know I can limit the output file size with the maxRecordsPerFile write option. But that limits the output created from a single memory partition, so in the first place I would still need to have the partitions created by date.
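For reference, a minimal sketch of that maxRecordsPerFile option combined with repartitioning by the date columns (the column names and the record limit are placeholders, not part of the original question):

(df
    .repartition('year', 'month', 'day')
    .write
    .option('maxRecordsPerFile', 5000000)
    .partitionBy('year', 'month', 'day')
    .parquet(output_path))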
So the question is how to repartition data in memory to:
split it over multiple executors to prevent Out Of Memory errors,
save the output for each day to a limited number of big parquet files,
write the output files in parallel (using as many executors as possible)?
I've read these sources but did not find a solution:
https://mungingdata.com/apache-spark/partitionby/
https://stackoverflow.com/a/42780452
https://stackoverflow.com/a/50812609

Replacing empty string with null leads to INCREASE in dataframe size?

I'm having trouble understanding the following phenomenon: in Spark 2.2, in Scala, I witness a significant increase in the persisted DataFrame size after replacing literal empty string values with lit(null).
This is the function I use to replace empty string values:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, when}

def nullifyEmptyStrings(df: DataFrame): DataFrame = {
  var in = df
  for (e <- df.columns) {
    in = in.withColumn(e, when(length(col(e)) === 0, lit(null: String)).otherwise(col(e)))
  }
  in
}
I observe that the persisted (DISK_ONLY) size of my initial dataframe before running this function is 1480MB, and afterwards is 1610MB. The number of partitions remains unchanged.
Any thoughts? The nulling works fine by the way, but my main reason for introducing this was to reduce shuffle size, and it seems I only increase it this way.
I'm going to answer this myself, as we have now done some investigation that might be useful to share.
Testing on large (10s of millions of rows) DataFrames with entirely String columns, we observe that replacing empty Strings with nulls results in a slight decrease of the overall disk footprint when serialized to parquet on S3 (1.1-1.5%).
However, dataframes cached with either MEMORY_ONLY or DISK_ONLY were 6% and 8% larger, respectively. I can only speculate how Spark internally represents the NULL value when the column is of StringType, but whatever it is, it's bigger than an empty string. If there's any way to inspect this I'll be glad to hear it.
The phenomenon is identical in PySpark and Scala.
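For reference, a PySpark sketch equivalent to the Scala function above:

from pyspark.sql import functions as F

def nullify_empty_strings(df):
    # replace zero-length strings with null in every column
    for c in df.columns:
        df = df.withColumn(c, F.when(F.length(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c)))
    return df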
Our goal in using nulls was to reduce shuffle size in a complex join action. Overall, we experienced the opposite. However we'll keep using nulls because the automatic pushdown of isNotNull filters makes writing joins much cleaner in Spark SQL.
Same results here. Perhaps one should also check the number of partitions, as huge partitions with many distinct values may store columns as raw strings as opposed to a dictionary.

Python, How read very large files into a dataframe

I have a data set containing 2 billion rows in 9 columns; one column contains integers and the others contain strings. The total csv file is around 80 GB. I'm trying to load the data into a dataframe using read_csv, but the file is too big to read into my memory (I get a memory error). I have around 150 GB of RAM available so it should be no problem. After doing some digging here on the forum I found these 2 possible solutions:
Here they give a solution to do it chunk by chunk, but this process takes a very long time and still gives me a memory error, because the data ends up taking more space than the available 150 GB of RAM:
# the column names in dtype are placeholders; the original shorthand was "dtype=int or string"
df = pd.read_csv('path_to_file', iterator=True, chunksize=100000,
                 dtype={'col1': 'int64', 'col2': str})
dataframe = pd.concat(df, ignore_index=True)
Here they give a solution of specifying the data type for each column using dtype. Specifying them still gives me a memory error (specifying the integer column as int and the other columns as string):
# again, the column names are placeholders for the real per-column dtype mapping
df = pd.read_csv('path_to_file', dtype={'col1': 'int64', 'col2': str})
I also have an hdf file built from the same data, but this one only contains the integers. Reading in this hdf file (equal in size to the csv file) in both of the ways specified above still gives me a memory error (exceeding 150 GB of memory).
Is there a quick and memory efficient way of loading this data into a dataframe to process it?
Thanks for the help!

Memory efficient cartesian join in PySpark

I have a large dataset of string ids that can fit into memory on a single node in my Spark cluster. The issue is that it consumes most of the memory of a single node.
These ids are about 30 characters long. For example:
ids
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
HbDckDXCye20kwu0gfeGpLGWnJ2yif
o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
I am looking to write to file a list of all of the pairs of ids. For example:
id1,id2
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,HbDckDXCye20kwu0gfeGpLGWnJ2yif
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
HbDckDXCye20kwu0gfeGpLGWnJ2yif,O2LWk4MAbcrOCWo3IVM0GInelSXfcG
# etc...
So I need to cross join the dataset on itself. I was hoping to do this on PySpark using a 10 node cluster, but it needs to be memory efficient.
PySpark will handle your dataset easily and memory-efficiently, but it will take time to process 10^8 * 10^8 records (the estimated size of the cross join result). See sample code:
from pyspark.sql.types import StructType, StructField, StringType

df = spark.read.csv('input.csv', header=True,
                    schema=StructType([StructField('id', StringType())]))
df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2')).show()
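To actually write the pairs out (and, if desired, drop self-pairs), something along these lines should work; the output path and the self-pair filter are assumptions, not part of the original answer:

from pyspark.sql import functions as F

pairs = (df.withColumnRenamed('id', 'id1')
           .crossJoin(df.withColumnRenamed('id', 'id2'))
           .filter(F.col('id1') != F.col('id2')))   # optional: drop each id paired with itself
pairs.write.csv('output_pairs', header=True)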
