How Spark stores Parquet Table? - apache-spark

I have a fact table which is 10Tb (Parquet) which contains 100+ columns. When I have created another table with just 10 columns from the fact table and size is 2TB.
I was expecting the size should be in some GBs because I am storing just few (10) columns?
My question is when we have more columns does Parquet format stores in more efficient manner?

Parquet is a column based storage. Say if I have a table with fields userId, name, address, state, phone no.
In non-parquet storage If I do a select * where state = "TN" it will go through every record in my table (i.e all the columns of each row) and output the records that match my where condition. However in parquet format all the columns are stored together so I don't need to go through all the other columns. The same select query will directly go to column 'state' and output records that match the where condition. Parquet is good for faster retrieval (to get results faster). It doesn't matter how many columns are present in total.
Parquet uses snappy compression. Since all the columns are stored together it makes compression very effective.

Related

Performance of pyspark + hive when a table has many partition columns

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why, and I don't why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case details are important
This is using s3 not HDFS
The data in the table is very small, and there are not a large number of partitons
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet
Except all other factors which you did not mention: storage type, configuration, cluster capacity, the number of files in each case, your partitioning schema does not correspond to the use-case.
Partitioning schema should be chosen based on how the data will be selected or how the data will be written or both. In your case partitioning by year, month, day separately is over-partitioning. Partitions in Hive are hierarchical folders and all of them should be traversed (even if using metadata only) to determine the data path, in case of single date partition, only one directory level is being read. Two additional folders: year+month+day instead of date do not help with partition pruning because all columns are related and used together always in the where.
Also, partition pruning probably does not work at all with 3 partition columns and predicate like this: where date = concat(year, month, day)
Use EXPLAIN and check it and compare with predicate like this where year='some year' and month='some month' and day='some day'
If you have one more column in the WHERE clause in the most of your queries, say category, which does not correlate with date and the data is big, then additional partition by it makes sense, you will benefit from partition pruning then.

External Table in Azure synapse very slow performance

I have a parquet file and created a new External table, but the performance is very slow as compare to a normal table in the synapse. Can you please let me know how to over come this.
Very broad question. So I'll give broad answer:
Use normal table. Hard to beat performance of "normal table" with external tables. "normal table" means a table created in a Dedicated SQL pool using CREATE TABLE. If you're querying data from one or more tables repeatedly and each query is different (group-by, join, selected columns) then you can't get beat performance of "normal" table with external tables.
Understand and apply basic best practices:
Use parquet format, which you're doing.
Pick right partition column and partition your data by storing partitions to different folders or file names.
If a query targets a single large file, you'll benefit from splitting it into multiple smaller files.
Try to keep your CSV (if using csv) file size between 100 MB and 10 GB.
Use correct data type.
Manually create statistics for CSV files
Use CETAS to enhance query performance and joins
...and many more.
a) The first step is to partition your Parquet File using a relevant partition column, such as Year, Month, and Date.
b) I recommend using a View rather than an external table as a second recommendation. External Tables don't support Partition Prunning and won't use the partition columns to eliminate unnecessary files during the read.
c) Assure that data types are enforced, and that string types are being used appropriately.
d) If possible, convert your Parquet file to Delta format. Synapse is able to read Partition columns from Delta without the need for the filepath() and filename() functions. External tables do not support Delta, only views.
Note: External tables doesn't support Parquet partition columns.
SELECT *,
CAST(fct.filepath(1) AS SMALLINT) AS SalesOrderPathYear,
CAST(fct.filepath(2) AS TINYINT) AS SalesOrderPathMonth,
CAST(fct.filepath(3) AS DATE) AS SalesOrderPathDate
FROM
OPENROWSET
(
BULK 'conformed/facts/factsales/*/*/*/*.parquet',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'Parquet'
) AS fct
WITH
(
ColA as String(10),
ColB as Integer,
ColC as ...
)
Ref: https://www.serverlesssql.com/certification/mastering-dp-500-exam-querying-partitioned-sources-in-azure-storage/

spark parquet partitioning which remove the partition column

If am using df.write.partitionby(col1).parquet(path) .
the data will remove the partition column on the data.
how to avoid it ?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Actually spark does not remove the column but it uses that column in a way to organize the files so that when you read the files it adds that as a column and display that to you in a table format. If you check the schema of the table or the schema of the dataframe you would still see that as a column in the table.
Also you are partitioning your data so you know how that data from table is queried frequently and based on that information you might have decided to partition the data so that your reads becomes faster and more efficient.

How does Parquet file size changes with the count in Spark Dataset

I came across a scenario where I had spark dataset with 24 columns, of which I was grouping by First 22 columns and summing up last two columns.
I removed the group by from the query and I have all 24 columns selected now.
Initial count of the dataset was 79,304.
After I removed group by the count has increased to 138,204 which is understood because I have removed the group by.
But I was not clear with the behavior that The initial size of parquet file was 2.3MB but later it got reduced to 1.5MB . Can anyone please help me understand this.
Also not every time the size reduces,
I had a similar scenario for 22 columns
count before was 35,298,226 and after removing group by was 59,874,208
and here the size has increased from 466.5MB to 509.8MB
When dealing with parquet sizes it's not about the number of rows it's about the data it self.
Parquet is columnar oriented format and therefore it store data column wise and compress the data column wise. Therefore it's not about the number of rows but rather the diversity of it's columns.
Parquet will do better compression as the diversity of the most diverse column in the table. So if you have one column dataframe it will be compress good as the distance between the values of the column.

Enum equivalent in Spark Dataframe/Parquet

I have a table with hundreds of millions of rows, that I want to store in a dataframe in Spark and persist to disk as a parquet file.
The size of my Parquet file(s) is now in excess of 2TB and I want to make sure I have optimized this.
A large proportion of these columns are string values, that can be lengthy, but also often have very few values. For example I have a column with only two distinct values (a 20charcter and a 30 character string) and I have another column with a string that is on average 400characters long but only has about 400 distinct values across all entries.
In a relational database I would usually normalize those values out into a different table with references, or at least define my table with some sort of enum type.
I cannot see anything that matches that pattern in DF or parquet files. Is the columnar storage handling this efficiently? Or should I look into something to optimize this further?
Parquet doesn't have a mechanism for automatically generating enum-like types, but you can use the page dictionary. The page dictionary stores a list of values per parquet page to allow the rows to just reference back to the dictionary instead of rewriting the data. To enable the dictionary for the parquet writer in spark:
spark.conf.set("parquet.dictionary.enabled", "true")
spark.conf.set("parquet.dictionary.page.size", 2 * 1024 * 1024)
Note that you have to write the file with these options enabled, or it won't be used.
To enable filtering for existence using the dictionary, you can enable
spark.conf.set("parquet.filter.dictionary.enabled", "true")
Source: Parquet performance tuning:
The missing guide

Resources