Python: How to read very large files into a dataframe - python-3.x

I have a data set containing 2 billion rows in 9 columns; one column contains integers and the others contain strings. The total CSV file is around 80 GB. I'm trying to load the data into a dataframe using read_csv, but the file is too big to read into memory (I get a memory error). I have around 150 GB of RAM available, so it should be no problem. After doing some digging here on the forum I found these 2 possible solutions:
Here they give a solution to do it chunk by chunk, but this process takes a very long time and it still gives me a memory error, because the data file takes more space than the available 150 GB of RAM.
df = pd.read_csv('path_to_file', iterator=True, chunksize=100000,
                 dtype={'int_column': 'int64', 'string_column': str})  # one dtype per column (placeholder column names)
dataframe = pd.concat(df, ignore_index=True)
Here they give a solution of specifying the data type for each column using dtype. Specifying them still gives me a memory error (with the integer column as int and the other columns as string).
df = pd.read_csv('path_to_file', dtype={'int_column': 'int64', 'string_column': str})  # placeholder column names
I also have an HDF file built from the same data, but this one only contains integers. Reading in this HDF file (equal in size to the CSV file) in both of the ways specified above still gives me a memory error (exceeding 150 GB of memory).
Is there a quick and memory efficient way of loading this data into a dataframe to process it?
Thanks for the help!

Related

Create 1GB partitions Spark SQL

I'm trying to split my data into 1 GB files when writing to S3 using Spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round it, and use that number to write to S3:
# Vacuum to leave 1 week of history
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.vacuum(168)
deltaTable.generate("symlink_format_manifest")
# Reading delta table and rewriting with coalesce to reach 1GB per file
df = spark.read.format('delta').load(f"s3a://{delta_table}")
coalesce_number = define_coalesce(delta_table)  # this function calculates the size of the delta table in GB
df.coalesce(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.generate("symlink_format_manifest")
I'm trying it this way because our Delta is the open-source one and we don't have the OPTIMIZE method built in.
I did some searching and found the spark.sql.files.maxPartitionBytes configuration in Spark, but some people said that it was not solving their problems, and that this config affects partitioning when reading, not when writing.
Any suggestions?
I understand your problem and what you are trying to do, but I am not sure what the output of your current solution is. If partitions are still not equal to 1 GB, you may try to replace coalesce with repartition. Coalesce does not guarantee that partitions are equal after the operation, so your formula may not work. If you know how many partitions you need on output, use repartition(coalesce_number) and it should create equal partitions with round-robin distribution.
If the problem is with the function that calculates the dataset size (and hence the number of partitions), I know two solutions:
You can cache the dataset and then take its size from the statistics. Of course this may be problematic and you have to spend some resources to do that. Something similar is done in the first answer here: How spark get the size of a dataframe for broadcast?
You can calculate the count and divide it by the number of records you want in a single partition. The size of a single record depends on your schema; it may be tricky to estimate, but it is a viable option to try (see the sketch below).
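A minimal sketch of that count-based estimate, combined with the repartition suggestion above (not from the answer; target_records_per_partition is an assumed knob you tune towards ~1 GB files):

target_records_per_partition = 5_000_000  # assumption: pick so one partition is roughly 1 GB on disk
df = spark.read.format("delta").load(f"s3a://{delta_table}")
num_partitions = df.count() // target_records_per_partition + 1
(df
    .repartition(num_partitions)
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(f"s3a://{delta_table}"))

This mirrors the rewrite from the question, only swapping coalesce for repartition so rows are spread evenly across the computed number of partitions.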
Finally solved my problem. Since we are using Delta, I had the idea of trying to read the manifest file to find the names of all the parquet files. After that, I sum the sizes of the parquets listed in the manifest, connecting to S3 with the boto library:
from boto.s3.connection import S3Connection

def define_repartition(delta_table_path):
    # bucket is assumed to be defined elsewhere (the S3 bucket holding the table)
    conn = S3Connection()
    bk = conn.get_bucket(bucket)
    # read the symlink manifest to get the list of parquet files backing the table
    manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
    parquets = [data[0].replace(f's3a://{bucket}/', '') for data in manifest.select('value').collect()]
    # sum the S3 object sizes and convert bytes to GB
    size = 0
    for parquet in parquets:
        key = bk.lookup(parquet)
        size = size + key.size
    return round(size / 1073741824)  # 1073741824 bytes = 1 GiB
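A hedged usage note, mirroring the rewrite pipeline from the question (the answer itself stops at the function):

coalesce_number = define_repartition(delta_table)
df = spark.read.format('delta').load(f"s3a://{delta_table}")
df.repartition(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")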
Thank you all for the help. Regards from Brazil. :)

How to specify file size using repartition() in spark

I'm using pyspark and I have a large data source that I want to repartition, specifying the file size per partition explicitly.
I know using the repartition(500) function will split my parquet into 500 files with almost equal sizes.
The problem is that new data gets added to this data source every day. On some days there might be a large input, and on some days there might be smaller inputs. So when looking at the partition file size distribution over a period of time, it varies between 200KB to 700KB per file.
I was thinking of specifying the max size per partition so that I get more or less the same file size per file per day irrespective of the number of files.
This will help me when running my job on this large dataset later on to avoid skewed executor times and shuffle times etc.
Is there a way to specify it using the repartition() function or while writing the dataframe to parquet?
You could consider writing your result with the parameter maxRecordsPerFile.
storage_location = "..."  # output path (elided in the original)
estimated_records_with_desired_size = 2000
result_df.write.option(
    "maxRecordsPerFile",
    estimated_records_with_desired_size) \
    .parquet(storage_location, compression="snappy")
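One hedged way to pick estimated_records_with_desired_size (not from the answer): divide a target file size by the average on-disk bytes per row of an existing output, assuming the compression ratio stays roughly stable. The bucket name, prefix, and 128 MB target below are assumptions.

import boto3

s3 = boto3.client("s3")
# list_objects_v2 returns at most 1000 keys; use a paginator for larger outputs
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="existing_output/")
total_bytes = sum(o["Size"] for o in resp.get("Contents", []) if o["Key"].endswith(".parquet"))

total_rows = spark.read.parquet("s3a://my-bucket/existing_output/").count()
bytes_per_row = total_bytes / total_rows
target_file_bytes = 128 * 1024 * 1024  # aim for ~128 MB files
estimated_records_with_desired_size = int(target_file_bytes / bytes_per_row)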

Repartition by dates for high concurrency and big output files

I'm running a Spark job on AWS Glue. The job transforms the data and saves the output to parquet files, partitioned by date (year, month, day directories). The job must be able to handle terabytes of input data and uses hundreds of executors, each with a 5.5 GB memory limit.
Input covers over 2 years of data. The output parquet files for each date should be as big as possible, optionally split into 500 MB chunks. Creating multiple small files for each day is not wanted.
A few tested approaches:
repartitioning by the same columns as in write results in Out Of Memory errors on executors:
df = df.repartition(*output_partitions)
(df
    .write
    .partitionBy(output_partitions)
    .parquet(output_path))
repartitioning with an additional random-value column results in multiple small output files being written (corresponding to the spark.sql.shuffle.partitions value); a bounded-salt variant of this idea is sketched after the links below:
df = df.repartition(*output_partitions, "random")
(df
    .write
    .partitionBy(output_partitions)
    .parquet(output_path))
setting the number of partitions in the repartition function, for example to 10, gives 10 quite big output files, but I'm afraid it will cause Out Of Memory errors when the actual data (TBs in size) is loaded:
df = df.repartition(10, *output_partitions, "random")
(df
    .write
    .partitionBy(output_partitions)
    .parquet(output_path))
(df in the code snippets is a regular Spark DataFrame)
I know I can limit the output file size with maxRecordsPerFile write option. But this limits the output created from a single memory partition, so in the first place, I would need to have partitions created by date.
So the question is how to repartition the data in memory to:
split it over multiple executors to prevent Out Of Memory errors,
save the output for each day to a limited number of big parquet files,
write output files in parallel (using as many executors as possible)?
I've read those sources but did not find a solution:
https://mungingdata.com/apache-spark/partitionby/
https://stackoverflow.com/a/42780452
https://stackoverflow.com/a/50812609
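Not an answer from the thread, but a minimal sketch of the direction hinted at by the second and third attempts above: a bounded random "salt" caps how many in-memory partitions (and hence output files) each day is spread across. files_per_day is an assumed tuning knob, not something from the question.

import pyspark.sql.functions as F

files_per_day = 8  # assumption: tune so each file lands near the desired ~500 MB
(df
    .withColumn("salt", F.floor(F.rand() * files_per_day).cast("int"))
    .repartition(*output_partitions, "salt")   # shuffle by (date columns, salt)
    .drop("salt")                              # keep the salt out of the written schema
    .write
    .partitionBy(output_partitions)
    .parquet(output_path))

Because the shuffle key includes the date columns, each day's data lands in at most files_per_day partitions, which bounds the number of files per date directory while still spreading work across executors.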

Spark coalescing on the number of objects in each partition

We are starting to experiment with spark on our team.
After we do a reduce job in Spark, we would like to write the result to S3; however, we would like to avoid collecting the Spark result.
For now, we are writing the files with Spark's forEachPartition on the RDD; however, this resulted in a lot of small files. We would like to be able to aggregate the data into a couple of files, partitioned by the number of objects written to each file.
So for example, our total data is 1M objects (this is constant); we would like to produce files of 400K objects, while our current partitioning produces files of around 20K objects (this varies a lot for each job). Ideally we want to produce 3 files, containing 400K, 400K and 200K objects, instead of 50 files of 20K objects.
Does anyone have a good suggestion?
My thought process is to let each partition handle which index it should write to, by assuming that each partition will produce roughly the same number of objects.
So for example, partition 0 will write to the first file, while partition 21 will write to the second file, since it assumes that the starting index for its objects is 20,000 * 21 = 420,000, which is bigger than the file size.
Partition 41 will write to the third file, since its starting index (20,000 * 41 = 820,000) is bigger than 2 * the file size limit.
This will not always result in a perfect 400K file size though; it is more of an approximation.
I understand that there is coalesce, but as I understand it, coalesce reduces the number of partitions based on the number of partitions wanted. What I want is to coalesce the data based on the number of objects in each partition. Is there a good way to do that?
What you want to do is repartition the data into three partitions; the data will be split into approximately 333k records per partition. The split will only be approximate; it will not be exactly 333,333 per partition. I do not know of a way to get the 400k/400k/200k split you want.
If you have a DataFrame `df`, you can repartition it into n partitions as
df.repartition(n)
Since you want a maximum number of records per partition, I would recommend this (you don't specify Scala or pyspark, so I'm going with Scala; you can do the same in pyspark):
val maxRecordsPerPartition = ???
val numPartitions = (df.count() / maxRecordsPerPartition).toInt + 1
df
  .repartition(numPartitions)
  .write
  .format("json")
  .save("/path/file_name.json")
This will guarantee that each partition holds fewer than maxRecordsPerPartition records.
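For completeness, a rough PySpark equivalent of the Scala snippet above (the 400k cap and the output path are illustrative, not from the thread):

max_records_per_partition = 400_000  # e.g. the question's 400K objects-per-file target
num_partitions = df.count() // max_records_per_partition + 1
(df
    .repartition(num_partitions)
    .write
    .format("json")
    .save("/path/file_name.json"))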
We have decided to just go with the number of files being generated and make sure that each file contains fewer than 1 million line items.

pandas read_csv memory consumption

I am reading huge pandas (version 0.18.1, on purpose) DataFrames stored in CSV format (summing up to ~30 GB). Working with read_csv, however, memory consumption grows to double the size of the initial CSV files, i.e. ~60 GB. I am aware of the chunksize parameter. This however was way slower and didn't really reduce memory usage. I tried it with a 4 GB DataFrame: having read the DataFrame, the script still consumed ~7 GB of RAM. Here's my code:
import pandas

df = None
for chunk in pandas.read_csv(fn, chunksize=50000):
    if df is None:
        df = chunk
    else:
        df = pandas.concat([df, chunk])
This is only a short version. I am also aware that specifying the dtype saves memory. So here's my question: what's the best way (performance, memory) to read huge pandas DataFrames?
Depending on the types of operations you want to do on the dataframes, you might find dask useful. One of its key features is allowing operations on larger-than-memory dataframes. For example, to do a groupby on a larger-than-memory dataframe:
import dask.dataframe as dd
df = dd.read_csv(fn)
df_means = df.groupby(key).mean().compute()
Note the addition of compute() at the end, as compared to a typical pandas groupby operation.
You are using chunksize incorrectly. It is not meant for simply appending to the dataframe in chunks. You have to break up the dataset into pieces so that you can process a large dataset one piece at a time. This way, only the chunk that is being processed needs to stay in memory.
Using dtypes and usecols is the best way to reduce memory usage.
It is hard to say, because you didn't provide any details about the dataset, such as the row count, row size, column data types, number of columns, or whether it is clean and structured data. If the data in your columns is not consistent, it can cause unexpected upcasting and a memory spike, so you may need to pre-process it before loading the dataframe.
Consider using the category data type for any object/string columns with low cardinality and low selectivity.
Use dtypes to lower the precision of your numeric columns.
Use chunksize to process data in chunks, not just to append them, or use dask; a per-chunk processing sketch follows this list.
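A minimal sketch of what "process data in chunks" can look like in practice (the file name, column names, and dtypes below are assumptions, not from the question): each chunk is aggregated as soon as it is read, so only small per-chunk results stay in memory.

import pandas as pd

totals = []
for chunk in pd.read_csv("huge.csv",
                         usecols=["key", "value"],                       # read only the columns you need
                         dtype={"key": "category", "value": "float32"},  # cheap dtypes
                         chunksize=1_000_000):
    # reduce the chunk immediately instead of concatenating raw chunks
    totals.append(chunk.groupby("key", observed=True)["value"].sum())

# combine the small per-chunk aggregates into the final result
result = pd.concat(totals).groupby(level=0).sum()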
