Pyspark: how to filter by date and read parquet files which are partitioned by date - apache-spark

I have a huge dataset of partitioned parquet files stored in AWS S3 in data-store/year=<>/month=<>/day=<>/hour=<>/ folder format, e.g. data-store/year=2020/month=06/day=01/hour=05.
I want to read the files only for a specific date range, e.g. 2020/06/01 to 2020/08/30, or something like all the records with a date greater than or equal to 2020/06/01.
How can I do this efficiently so that only the required data is loaded into Spark memory?
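One way to get this behaviour is to read from the dataset root and filter on the partition columns, letting Spark's partition pruning skip the non-matching directories. A minimal sketch, assuming a bucket name and Spark 3.0+ (for make_date); the folder layout and column names come from the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read from the dataset root; year/month/day/hour become regular columns.
df = spark.read.parquet("s3://my-bucket/data-store/")  # bucket name assumed

# Filters that reference only the partition columns can be evaluated against
# the directory names at planning time, so only matching partitions are scanned.
df_range = df.where(
    F.make_date("year", "month", "day").between("2020-06-01", "2020-08-30")
)

# Verify: the physical plan should show the predicate under PartitionFilters.
df_range.explain()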

Related

pyspark csv format - mergeschema

I have a large dump of data that spans multiple terabytes. The files contain activity data on a daily basis.
Day 1 can have 2 columns and Day 2 can have 3 columns. The file dump is in CSV format. Now I need to read all these files and load them into a table. The problem is that the format is CSV, and I am not sure how to merge the schemas so as not to lose any columns. I know this can be achieved in parquet through mergeSchema, but I can't convert these files one by one into parquet as the data is huge. Is there any way to merge schemas with CSV as the format?

Spark iteration logic to write the dataset filtered by date to parquet format failing OOM

I have a scenario where I have a dataset with a date column. I later iterate over a list of dates, filter the dataset by each date, and write the filtered result to a separate date-partition folder in parquet format.
I was able to write for a certain number of iterations, but after that it fails with Spark out-of-memory exceptions.
What is the best way to optimise this so that the data is persisted without OOM?
dataset = dataset with some transformations
for date in date_list:
    pd.write_part_file("part-data-file", dataset.filter(archive_date == date))
The code looks roughly like the above.
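One common way to avoid the per-date loop (and the repeated scans of the full dataset it causes) is to let Spark write all date partitions in a single job with partitionBy. A minimal sketch, assuming the date column is called archive_date and a hypothetical output path:

# One write job produces a folder per archive_date value, and the full
# dataset is scanned once instead of once per date in the loop.
(dataset
    .write
    .mode("overwrite")
    .partitionBy("archive_date")
    .parquet("/output/part-data"))  # output path assumed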

PySpark: how to read in partitioning columns when reading parquet

I have data stored in parquet files and a Hive table partitioned by year, month, and day. Thus, each parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have a list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, my data then does not include the information about year, month and day, as this is not part of the data per se; rather, the information is stored in the path to the file.
I could use the SQL context and send a Hive query with a SELECT statement and a WHERE clause on the year, month and day columns to select only the data from the partitions I am interested in. However, I'd rather avoid constructing a SQL query in Python, as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read in data stored as parquet, where information about year, month and day is not present in the parquet file but only in the path to the file? (Either send a Hive query using sqlContext.sql('...'), use read.parquet, or anything else, really.)
Can I somehow extract the partitioning columns when using the approach I outlined above?
Reading the parent directory of the year partitions should be enough for a DataFrame to determine that there are partitions under it. However, Spark wouldn't know what to name the partition columns without a directory structure like /year=2018/month=10, for example.
Therefore, if you have Hive, going via the metastore is better: the partitions are named there, Hive stores extra useful information about your table, and you are not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the DataFrame API instead, e.g.
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a layout that is optimal for parquet, so you'd have to load the files one by one and add the dates yourself.
Alternatively, you can move the files to a directory structure fit for parquet
(e.g. .../table/year=2018/month=10/day=29/file.parquet);
then you can read the parent directory (table) and filter on year, month, and day. Spark will only read the relevant directories, and you'd also get these as columns in your dataframe, as in the sketch below.
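For that path-based route, a minimal sketch (the table location is taken from the question's paths; the key=value directory names are what let Spark discover the partition columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table root after renaming directories to the key=value layout;
# year, month and day are discovered as partition columns.
df = spark.read.parquet("hdfs://data/table_name/")

# Filtering on the partition columns prunes directories at planning time,
# so only the year=2018/month=10/day=29 directory is actually read.
subset = df.where((df.year == 2018) & (df.month == 10) & (df.day == 29))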

parquet column pruning in spark

I know parquet supports reading only user-selected columns. But when I use
spark.read.parquet(path).select(cols)
to read the data, it looks like it still reads the whole file. Is there any way in Spark to read only the selected column chunks?
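Column pruning should already apply as long as the select happens before any action is triggered; a minimal sketch (path and column names are assumptions) showing how to confirm it in the plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Select the needed columns before triggering an action; the parquet reader
# then fetches only those column chunks.
df = spark.read.parquet("/data/events/")  # path assumed
pruned = df.select("user_id", "event_time")  # column names assumed

# In the physical plan, ReadSchema should list only the selected columns.
pruned.explain()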

parquet to spark dataframe with location column

I am trying to read a parquet file into a DataFrame using Spark. My requirement is to create another column in the dataframe using the path of the parquet file.
Eg: I have parquet files in the following path in hdfs:
/event/2018-01-01/abc/xyz=10/parquet1.parquet
/event/2018-01-01/abc/xyz=10/parquet2.parquet
I want to read all the files in /event/2018-01-01/abc and create a column "dt" in the dataframe containing the date from the path. How do I extract the date from the path and create it as a column in the Spark dataframe?
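One way to get this, as a minimal sketch (the regex is an assumption based on the example layout), is to expose each row's source path with input_file_name() and pull the date out of it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read all files under the directory from the question.
df = spark.read.parquet("/event/2018-01-01/abc")

# input_file_name() returns the full path of the file each row came from;
# extract the yyyy-MM-dd segment with a regex and cast it to a date.
df_with_dt = df.withColumn(
    "dt",
    F.to_date(F.regexp_extract(F.input_file_name(), r"/event/(\d{4}-\d{2}-\d{2})/", 1)),
)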
