How to store a Spark DataFrame as a dynamically partitioned Hive table in Parquet format?

The raw data currently lives in Hive. I want to join several partitioned Hive tables, each terabytes in size, and write the result as a partitioned Hive table in Parquet format.
I am considering loading all partitions of the Hive tables as Spark DataFrames and then doing the join, group by, and so on. Is this the right approach?
Finally, I need to save the data. Can a Spark DataFrame be saved as a dynamically partitioned Hive table in Parquet format? How should the metadata be handled?

If one of the data sets is much smaller than the others, consider broadcasting it so that the join avoids shuffling the larger tables.
Depending on the nature of the data, you could also group by first and join afterwards. Each machine then only needs to process a specific subset of the data, which reduces the amount of data transferred while the tasks run.
Hive supports storing data in Parquet format directly: https://cwiki.apache.org/confluence/display/Hive/Parquet. Have you given that a try?
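
If you go the Hive route, the dynamic-partition insert could look roughly like the sketch below. It only builds the HiveQL strings in plain Python. The table and column names are made up for illustration, and it assumes the target table already exists, declared with PARTITIONED BY (dt ...) and STORED AS PARQUET. The one rule it encodes is that Hive expects the dynamic partition columns last in the SELECT list.

```python
def dynamic_partition_insert(target, source, data_cols, partition_cols):
    """Build HiveQL statements for a dynamic-partition insert.

    Hive expects the partition columns to come last in the SELECT list,
    in the same order as in the PARTITION clause.
    """
    select_list = ", ".join(list(data_cols) + list(partition_cols))
    partition_clause = ", ".join(partition_cols)
    return [
        "SET hive.exec.dynamic.partition=true",
        "SET hive.exec.dynamic.partition.mode=nonstrict",
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_clause}) "
        f"SELECT {select_list} FROM {source}",
    ]


# Hypothetical table and column names, for illustration only.
statements = dynamic_partition_insert(
    target="my_db.joined_parquet",
    source="my_db.joined_staging",
    data_cols=["user_id", "amount"],
    partition_cols=["dt"],
)
for stmt in statements:
    print(stmt)
```

The statements can be run in Hive itself, or one at a time through spark.sql(...) on a Hive-enabled Spark session.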

Related

Spark partitioning of related data into row groups

With Apache Spark we can partition a DataFrame into separate files when saving in Parquet format.
In the way Parquet files are written, each partition file contains multiple row groups, each of which includes column statistics for that group (e.g., min/max values and the number of NULL values).
Now, in some situations it would seem ideal to organize the Parquet file such that related data appears together in one or more row groups. This would be a secondary level of partitioning within each partition file (which constitutes the first level).
This is possible using, for example, pyarrow, but how can we do this with a distributed SQL engine such as Spark?
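Besides partitioning, you can order your data so that related rows are grouped together in a limited set of files. Statement from Databricks:
Z-Ordering is a technique to colocate related information in the same
set of files
In plain Spark, do the ordering on the DataFrame before calling write; the writer itself has no orderBy:
(
df
.repartition("col_2")  # bring all rows of one col_2 value into the same task
.sortWithinPartitions(df.col_2, df.col_1.desc())  # order rows inside each task
.write
.partitionBy("col_2")
.parquet("/path/to/output")  # placeholder output path
)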
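
Why ordering helps can be shown with a toy model in plain Python (no Spark involved). It assumes row groups are simply consecutive chunks of rows and that a reader skips any group whose min/max range does not overlap the filter. The values and the group size are made up.

```python
def row_group_stats(values, group_size):
    """Split values into consecutive row groups and record (min, max) per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_scan(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for g_min, g_max in stats if g_max >= lo and g_min <= hi)


# Toy column: 12 values in an interleaved (unsorted) order.
unsorted_vals = [1, 9, 5, 12, 2, 10, 6, 11, 3, 8, 4, 7]
sorted_vals = sorted(unsorted_vals)

unsorted_stats = row_group_stats(unsorted_vals, group_size=4)
sorted_stats = row_group_stats(sorted_vals, group_size=4)

# Filter: 1 <= value <= 4
print(groups_to_scan(unsorted_stats, 1, 4))  # 3: every group overlaps the filter
print(groups_to_scan(sorted_stats, 1, 4))    # 1: only the first group overlaps
```

With unsorted data every group spans almost the whole value range, so the statistics cannot exclude anything. After sorting, the ranges are disjoint and most groups can be skipped.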

Apache Hive: CREATE TABLE statement without schema over parquet can fail to infer partition column

I have a partitioned Parquet dataset at the following path:
/path/to/partitioned/parq/
with partitions like:
/path/to/partitioned/parq/part_date=2021_01_01_01_01_01
/path/to/partitioned/parq/part_date=2021_01_02_01_01_01
/path/to/partitioned/parq/part_date=2021_01_03_01_01_01
When I run a Spark SQL CREATE TABLE statement like:
CREATE TABLE IF NOT EXISTS
my_db.my_table
USING PARQUET
LOCATION '/path/to/partitioned/parq'
The partition column part_date shows up in my dataset, but DESCRIBE EXTENDED indicates there are no partitions, and SHOW PARTITIONS my_db.my_table returns no partition data.
This seems to happen intermittently: sometimes Spark infers the partitions, other times it doesn't. It causes issues downstream, where we add a partition and run MSCK REPAIR TABLE my_db.my_table, which fails with a message saying it cannot be run on non-partitioned tables.
I see that if you do declare a schema, you can force the PARTITIONED BY part of the clause, but we do not have the luxury of a schema, just the underlying files.
Why is Spark intermittently unable to determine the partition columns from Parquet data laid out this way?
Unfortunately, with Hive you need to specify the schema, even though the Parquet files obviously carry it themselves.
You need to add a PARTITIONED BY clause to the DDL.
Then use an ALTER TABLE statement to add each partition separately with its location.
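
Writing those ALTER TABLE statements by hand gets tedious, so here is a minimal sketch that generates them from the directory names in the question. It is plain Python that only builds strings. It assumes the table was already created with an explicit schema and PARTITIONED BY (part_date STRING), and that every directory is named column=value.

```python
def add_partition_statements(table, base_path, partition_dirs):
    """Build one ALTER TABLE ... ADD PARTITION statement per directory.

    Each directory name is expected to look like `column=value`.
    """
    statements = []
    for dir_name in partition_dirs:
        column, value = dir_name.split("=", 1)
        location = f"{base_path.rstrip('/')}/{dir_name}"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({column}='{value}') LOCATION '{location}'"
        )
    return statements


ddl = add_partition_statements(
    table="my_db.my_table",
    base_path="/path/to/partitioned/parq/",
    partition_dirs=[
        "part_date=2021_01_01_01_01_01",
        "part_date=2021_01_02_01_01_01",
    ],
)
for stmt in ddl:
    print(stmt)
```

Each resulting string can be passed to spark.sql(...) or run in Hive. Once the table is declared as partitioned, MSCK REPAIR TABLE should also be usable again for directories that follow the column=value naming.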

Write large data set around 100 GB having just one partition to hive using spark

I am trying to write a large dataset to a partitioned Hive table (partitioned by date) using Spark. The dataset contains just one date, so it produces just one partition. The write takes a long time and also causes shuffling. My code does not contain any joins, only some map functions, filters, and unions. How can I efficiently write this kind of data to a Hive table?

PySpark: how to read in partitioning columns when reading parquet

I have data stored in Parquet files and a Hive table partitioned by year, month, and day. Thus, each Parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have a list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, my data then does not include the year, month and day information, since it is not part of the data itself but is stored in the file path.
I could use the SQL context and send a Hive query with a SELECT statement and a WHERE clause on the year, month and day columns to select only the partitions I am interested in. However, I'd rather avoid constructing SQL queries in Python, as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read data stored as Parquet when the year, month and day information is not present in the Parquet file but only in the file path? (Either send a Hive query using sqlContext.sql('...'), or use read.parquet, ... anything really.)
Can I somehow extract the partitioning columns when using the
approach I outlined above?
Pointing the reader at the parent directory of the year partitions should be enough for a DataFrame to detect that there are partitions under it. However, without a directory structure such as /year=2018/month=10, it would not know what to name the partition columns.
Therefore, if you have Hive, going via the metastore is better: the partitions are named there, Hive stores extra useful information about your table, and your Spark code does not have to know the direct path to the files on disk.
Not sure why you think you need to read or write SQL, though.
Use the DataFrame API instead, e.g.:
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a layout that Spark's partition discovery understands, so you would have to load the directories one by one and add the date columns yourself.
Alternatively, you can move the files into a Hive-style directory structure
(e.g. .../table/year=2018/month=10/day=29/file.parquet).
You can then read the parent directory (table) and filter on year, month, and day. Spark will only read the relevant directories, and you also get these as columns in your DataFrame.
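
Both options need the same small piece of path handling, sketched below in plain Python. It assumes the last three path components are always year, month, and day, as in the question. The helper names are made up, and actually moving the files or attaching the columns to a DataFrame is left out.

```python
def to_hive_style(path, partition_names=("year", "month", "day")):
    """Rewrite the trailing path components as name=value directories."""
    parts = path.rstrip("/").split("/")
    n = len(partition_names)
    prefix, values = parts[:-n], parts[-n:]
    named = [f"{name}={value}" for name, value in zip(partition_names, values)]
    return "/".join(prefix + named)


def partition_values(path, partition_names=("year", "month", "day")):
    """Extract the partition values from the trailing path components."""
    values = path.rstrip("/").split("/")[-len(partition_names):]
    return dict(zip(partition_names, values))


paths_to_files = [
    "hdfs://data/table_name/2018/10/29",
    "hdfs://data/table_name/2018/10/30",
]
for p in paths_to_files:
    print(to_hive_style(p), partition_values(p))
```

partition_values gives you the constants to add as columns when loading one directory at a time. to_hive_style gives the target directory if you decide to restructure the layout instead.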

Spark DataFrame Repartition and Parquet Partition

I am using repartition on columns before storing the data in Parquet. But
I see that the number of partitioned Parquet files is not the same as the
number of RDD partitions. Is there no correlation between RDD partitions
and Parquet partitions?
When I write the data to a Parquet partition using RDD
repartition and then read the data back from the Parquet partition, is
there any condition under which the number of RDD partitions will be the
same during read and write?
How is bucketing a DataFrame using a column id different from
repartitioning a DataFrame via the same column id?
When considering the performance of joins in Spark, should we be
looking at bucketing or repartitioning (or maybe both)?
You're asking about a couple of things here: partitioning, bucketing, and balancing of data.
Partitioning:
Partitioning data is often used to distribute load horizontally; it has performance benefits and helps organize data logically.
Partitioning a table changes how the persisted data is structured: subdirectories are created that reflect the partitioning scheme.
This can dramatically improve query performance, but only if the partitioning scheme reflects common filters.
In Spark, this is done with df.write.partitionBy(column*), which groups data by the partitioning columns into the same subdirectory.
Bucketing:
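Bucketing is another technique for decomposing data sets into more manageable parts. Based on the columns provided, the entire data set is hashed into a user-defined number of buckets (files).
Similar in spirit to Hive's DISTRIBUTE BY, except that the result is persisted (Hive's own bucketed tables are declared with CLUSTERED BY ... INTO n BUCKETS).
In Spark, this is done with df.write.bucketBy(n, column*), which groups rows by the hash of the bucketing columns into the same bucket. The number of buckets is controlled by n, and bucketed output has to be saved as a table with saveAsTable.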
Repartition:
It returns a new DataFrame balanced across the given number of in-memory partitions, based on the given partitioning expressions. The resulting DataFrame is hash partitioned.
Spark manages data in these partitions, which helps parallelize distributed data processing with minimal network traffic between executors.
In Spark, this is done with df.repartition(n, column*), which groups rows by the partitioning columns into the same in-memory partition. Note that no data is persisted to storage; this is just internal balancing of data, based on constraints similar to bucketBy.
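
To make the hashing idea concrete, here is a toy model in plain Python of what hash partitioning by a column does. zlib.crc32 is only a stand-in for the hash function (Spark uses its own Murmur3-based hash), and the rows are made up. The point is that equal keys always land in the same partition.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, n):
    """Assign each row to one of n partitions by hashing its key column.

    zlib.crc32 is a stand-in here; Spark uses its own Murmur3-based hash.
    """
    partitions = defaultdict(list)
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % n
        partitions[index].append(row)
    return dict(partitions)


rows = [{"id": i % 5, "value": i} for i in range(20)]
parts = hash_partition(rows, key="id", n=3)

# Every id ends up in exactly one partition, so equal keys are colocated.
for index, members in sorted(parts.items()):
    print(index, sorted({r["id"] for r in members}))
```

Note that nothing guarantees the partitions are the same size: with few distinct keys, some partitions can be empty or much larger than others.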
Tl;dr
1) I am using repartition on columns to store the data in Parquet. But I see that the number of partitioned Parquet files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?
repartition correlates with bucketBy, not partitionBy. The number of files written depends on how many in-memory partitions (write tasks) hold rows for each partition value, which in turn is influenced by settings such as spark.sql.shuffle.partitions and spark.default.parallelism.
2) When I write the data to a Parquet partition using RDD repartition and then read the data back from the Parquet partition, is there any condition under which the number of RDD partitions will be the same during read and write?
At read time, the number of partitions is not tied to the number used for the write. It is derived from the file sizes and from settings such as spark.sql.files.maxPartitionBytes and spark.default.parallelism, so the two match only by coincidence.
3) How is bucketing a DataFrame using a column id different from repartitioning a DataFrame via the same column id?
They work similarly, except that bucketing is a write operation and is used for persistence.
4) When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?
Repartitioning both datasets happens in memory. If one or both datasets are persisted, look into bucketBy as well.
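
The file-count mismatch from question 1 can be reproduced with a small model in plain Python. It assumes each write task emits one file per distinct partition value it holds, and it ignores empty tasks and settings such as spark.sql.files.maxRecordsPerFile. The dates and task counts are made up.

```python
def count_output_files(task_partitions, partition_col):
    """Each write task emits one file per distinct partition value it holds."""
    return sum(len({row[partition_col] for row in task}) for task in task_partitions)


dates = ["2021-01-01", "2021-01-02", "2021-01-03"]

# Case 1: 4 in-memory partitions, each holding rows for all 3 dates.
mixed = [[{"dt": d} for d in dates] for _ in range(4)]

# Case 2: rows repartitioned by dt first, so each date sits in one partition.
by_date = [[{"dt": d}] * 4 for d in dates]

print(count_output_files(mixed, "dt"))    # 12 files: 4 tasks x 3 dates
print(count_output_files(by_date, "dt"))  # 3 files: one per date
```

So the number of files follows from how the rows are spread over tasks at write time, not from the number of RDD partitions alone.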
