Spark iteration logic to write a dataset filtered by date to Parquet format failing with OOM - apache-spark

I have a scenario where I have a dataset with a date column, and I later iterate over a list of dates to save the dataset into multiple partition files in Parquet format. For each date in the list, I filter the dataset by that date and write the result into the partition folder for that date.
I am able to write for a certain number of iterations, but after that the job fails with a Spark out-of-memory exception.
What is the best way to optimise this so that the data is persisted without hitting OOM?
The code looks roughly like the following (pd.write_part_file is my own helper that writes the filtered dataset out to a partition folder in Parquet):

dataset = ...  # dataset with some transformations
for date in date_list:
    pd.write_part_file("part-data-file", dataset.filter(dataset["archive_date"] == date))
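For comparison, a single-pass write using partitionBy, which several of the related questions below discuss, would look roughly like this; the output path is just a placeholder:

dataset.write.partitionBy("archive_date").mode("overwrite").parquet("part-data-file")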

Related

Write a large dataset (around 100 GB) with just one partition to Hive using Spark

I am trying to write a large dataset to a Hive table partitioned by date using Spark. The dataset contains only a single date, so there is just one partition. It is taking a long time to write to the table, and it also causes shuffling during the write. My code does not contain any joins; it has only some map, filter and union operations. How can I efficiently write this kind of data to the Hive table? See the image of the Spark UI here.
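A minimal sketch of the kind of write being described, assuming a Hive table partitioned by a date column; the table name and partition column are placeholders, and df stands for the single-day dataset after the map/filter/union transformations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# append the single day of data into a date-partitioned table
(df.write
   .mode("append")
   .partitionBy("date")
   .format("parquet")
   .saveAsTable("db.events_by_date"))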

Pyspark: how to filter by date and read parquet files which are partitioned by date

I have a huge dataset of partitioned parquet files stored in AWS S3 in the data-store/year=<>/month=<>/day=<>/hour=<>/ folder format, e.g. data-store/year=2020/month=06/day=01/hour=05.
I want to read the files only for a specific date range, e.g. 2020/06/01 to 2020/08/30, or something like all the records with a date greater than or equal to 2020/06/01.
How can I do this effectively so that only the required data is loaded into Spark memory?
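A minimal sketch of one way to do this, assuming the year/month/day/hour folders are Hive-style partitions; the bucket name is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3a://my-bucket/data-store/")

# records on or after 2020/06/01; because the predicate uses only partition
# columns, Spark prunes the directories it has to read
after_june = df.filter((col("year") > 2020) |
                       ((col("year") == 2020) & (col("month") >= 6)))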

How to speed up saving partitioned data with only one partition?

The Spark save operation is quite slow when the dataframe df is partitioned by date (year, month, day) and df contains data from exactly one day, say 2019-02-14.
If I save the df by:
df.write.partitionBy("year", "month", "day").parquet("/path/")
it will be slow, because all the data belongs to one partition and is therefore processed by a single task (??).
If I save df with an explicit partition path:
df.write.parquet("/path/year=2019/month=02/day=14/")
it works well, but it creates the _metadata, _common_metadata and _SUCCESS files in "/path/year=2019/month=02/day=14/"
instead of "/path/". Dropping the partition columns is also required to keep the same fields as when using partitionBy.
So, how can I speed up saving data with only one partition without changing the location of the metadata files, which can be updated in each operation?
Is it safe to use an explicit partition path instead of partitionBy?
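As a concrete illustration of the explicit-path variant together with dropping the partition columns, a sketch only, reusing the example path above:

# drop the partition columns so the written files have the same fields as with partitionBy
one_day = df.drop("year", "month", "day")
one_day.write.parquet("/path/year=2019/month=02/day=14/")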

PySpark: how to read in partitioning columns when reading parquet

I have data stored in parquet files and a Hive table partitioned by year, month and day. Thus, each parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have a list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then I try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, my data then does not include the information about year, month and day, as this is not part of the data per se; rather, the information is stored in the path to the file.
I could use the SQL context and send a Hive query with a SELECT statement and a WHERE clause on the year, month and day columns to select only the data from the partitions I am interested in. However, I'd rather avoid constructing a SQL query in Python, as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read in data stored as parquet, where the information about year, month and day is not present in the parquet file itself but only in the path to the file? (Either send a Hive query using sqlContext.sql('...'), use read.parquet, ... anything really.)
Can I somehow extract the partitioning columns when using the approach I outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine that there are partitions under it. However, it wouldn't know what to name the partitions without a directory structure like /year=2018/month=10, for example.
Therefore, if you have Hive, going via the metastore would be better, because the partitions are named there, Hive stores extra useful information about your table, and you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the DataFrame API instead, e.g.
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a way that is optimal for parquet, so you'd have to load the files one by one and add the dates.
Alternatively, you can move the files into a directory structure that fits parquet partitioning
(e.g. .../table/year=2018/month=10/day=29/file.parquet).
Then you can read the parent directory (table) and filter on year, month and day; Spark will only read the relevant directories, and you'd also get these as columns in your dataframe. A sketch of that layout in use follows below.
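A minimal sketch of reading the restructured layout, assuming the hdfs path from the question and Hive-style year=/month=/day= folders:

from pyspark.sql.functions import col

# read the parent directory; Spark discovers the year/month/day partitions
df = spark.read.parquet("hdfs://data/table_name/")

# only the matching directories are scanned, and year, month and day
# become regular columns in the dataframe
oct_2018 = df.filter((col("year") == 2018) & (col("month") == 10))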

Reading Parquet columns as RDD rows

Is there a way to read columns from a Parquet file as rows in a Spark RDD, materializing the full contents of each column as a list within an RDD tuple?
The idea is that, for cases where I need to run a non-distributable, in-memory-only algorithm (processing a full column of data) on a set of executors, I would like to be able to parallelize the processing by shipping the full contents of each column to the executors. My initial implementation, which involved reading the Parquet file as a DataFrame, converting it to an RDD and transposing the rows via aggregateByKey, has turned out to be too expensive in terms of time (probably due to the extensive shuffling required); a sketch of that approach is below.
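Roughly, the initial implementation described above looks like this; the file path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/data.parquet")
cols = df.columns

# explode each row into (column_name, value) pairs, then regroup by column,
# so each RDD element ends up as (column_name, [full column contents])
pairs = df.rdd.flatMap(lambda row: [(c, row[c]) for c in cols])
columns_as_rows = pairs.aggregateByKey(
    [],                          # start with an empty list per column
    lambda acc, v: acc + [v],    # append values within a partition
    lambda a, b: a + b)          # merge partial lists across partitions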
If possible, I would prefer to use an existing implementation, rather than rolling my own implementations of ParquetInputFormat, ReadSupport, and/or RecordMaterializer.
Suggestions for alternative approaches are welcome as well.
