Spark parquet reading - apache-spark

Our Spark program reads in parquet files. These files are partitioned by date in a directory-like structure (e.g. month=201703/day=20170313/). The filenames themselves contain a number which reflects the Kafka partition they originally came from (e.g. data.232.parquet). The data of a specific user always ends up in the same partition, so it would make sense for one Spark executor to read in all the parquet files of the same partition across all the dates (to avoid shuffling down the road).
How can we accomplish this? Maybe we also have to put the partition number in the directory hierarchy? But even then it's not clear to me how we can tell Spark to use this information.

Related

How can parquet columns be skipped when reading from HDFS?

We all know parquet is column-oriented, so we can read only the columns we want and reduce IO.
But what if the parquet file is stored in HDFS? Should we download the entire file first and then apply the column filter locally?
For example, if we use spark to read a parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Do we still have to download the entire parquet file, is that right?
Or maybe there is a way we can filter the columns just before the network transfer?
Actually "predicate pushdown" which is a feature of Spark SQL will try to use column filters to reduce the amount of information that is processed by spark. Technically the entire hdfs block is still read into memory, but it uses smart logic to only return relevant results. This is normally called out in the physical plan. You can read this by using .explain() on your query to see if the feature is being used. (Not all versions of hdfs support this.)

Is one parquet file under the parquet folder a partition?

I saved my dataframe as parquet format
df.write.parquet('/my/path')
When checking on HDFS, I can see that there are 10 part-xxx.snappy.parquet files under the parquet directory /my/path.
My question is: does one part-xxx.snappy.parquet file correspond to a partition of my dataframe?
Yes, the part-** files are created based on the number of partitions in the dataframe when it is written to HDFS.
To check the number of partitions in the dataframe:
df.rdd.getNumPartitions()
To control the number of files written to the filesystem we can use .repartition() or .coalesce(), or choose the number dynamically based on our requirements.
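For example, a minimal sketch of both options, where df is the dataframe from the question and the target count of 10 is a placeholder:
# repartition() does a full shuffle and can increase or decrease the partition count;
# coalesce() only merges existing partitions, so it is cheaper but can only decrease it.
df.repartition(10).write.parquet('/my/path')
# or
df.coalesce(10).write.parquet('/my/path')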
Yes, this creates one file per Spark-partition.
Note that you can also partition files by some attribute:
df.write.partitionBy("key").parquet("/my/path")
In such a case Spark is going to create up to one file per Spark partition for each parquet partition directory. A common way to reduce the number of files in this case is to repartition the data by the key before writing (this effectively creates one file per partition).
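A minimal sketch of that pattern, assuming the same placeholder column "key" and output path as above:
# Co-locate all rows with the same key in a single Spark partition before writing,
# so each key=... output directory ends up with one file.
df.repartition("key").write.partitionBy("key").parquet("/my/path")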

What caused the different file patterns in the Hive table partitions?

We have Spark jobs but also occasionally run Hive queries on the current Hadoop cluster.
I have seen that the same Hive table has different partition file patterns, like below:
i.e. the table is partitioned by date, so
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2019-12-01/
gives the result:
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
However, if I list the files for a different partition date:
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2020-01-01/
it lists files with a different name pattern:
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000007_0
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000008_0
....
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000010_0
From what I can tell, the difference is not only that in one partition the data files come with a part- prefix while in the other they look like 00000n_0; there are also a lot more of the part- files, but each file is quite small.
I also found that aggregations over the part- files are a lot slower than over the 00000n_0 files.
What could be the possible cause of the file pattern difference, and what configuration could change it from one to the other?
When Spark Streaming writes data into Hive it creates lots of small files named part- in Hive, and their number keeps increasing. This causes performance issues when querying the Hive table: Hive takes too much time to return results due to the large number of small files in the partition.
When a Spark job writes data into Hive it looks like:
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
But here the different file pattern is due to compaction logic on the partition's files, which compacts the small files into larger ones. The n in 00000n_0 is the reducer number.
A sample compaction script, which compacts the small files into bigger ones within a partition, for an example table under a sample database:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=268435456; -- 256MB reducer size

CREATE TABLE sample.example_tmp
STORED AS parquet
LOCATION '/user/hive/warehouse/sample.db/example_tmp'
AS
SELECT * FROM sample.example;

INSERT OVERWRITE TABLE sample.example PARTITION (part_date)
SELECT * FROM sample.example_tmp;

DROP TABLE IF EXISTS sample.example_tmp PURGE;
The above script compacts the small files into a few larger files within the partition, and the filenames will look like 00000n_0.
What could be the possible cause of the file pattern difference, and what configuration could change it from one to the other?
Someone might have run compaction logic on the partition using Hive, or might have reloaded the partition data using Hive. This is not an issue; the data remains the same.

Understanding file distribution and partitioning in HDFS when using Hive

On the one hand, the HDFS docs say:
HDFS is designed to support very large files. Applications that are
compatible with HDFS are those that deal with large data sets. These
applications write their data only once but they read it one or more
times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. A typical block
size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64
MB chunks, and if possible, each chunk will reside on a different
DataNode.
Meaning every file will be split between nodes.
On the other hand, when I use Hive or Spark SQL, I manage the partitions in such a way that there is a folder for each partition, and all the files inside belong to this partition. For example:
/Sales
/country=Spain
/city=Barcelona
/2019-08-28.parquet
/2019-08-27.parquet
/city=Madrid
/2019-08-28.parquet
/2019-08-27.parquet
Let's say that each file's size is 1GB and the HDFS block size is 128 MB.
So I am confused. I don't understand if city=Barcelona/2019-08-28.parquet is saved on only one node as a whole (even together with city=Barcelona/2019-08-27.parquet), or whether each file is distributed across 8 nodes.
If each file is distributed, then what is the benefit of the partitions?
If the data is distributed according to the partitions I define, how does HDFS know to do that? Does it look for folders with a name in the form of key=value and make sure they will be saved intact?
You are confused between "how HDFS stores the files that we dump into it" and "how Hive/Spark creates different directories in case of partitioning".
Let me try to provide you a perspective.
HDFS works as you have mentioned.
HDFS breaks the files up into a number of blocks depending on the block size and the size of the file being stored. The metadata (directories, permissions, etc.) is an abstraction, in the sense that the file (2019-08-27.parquet) that you see as a single unit is in fact distributed among nodes. The NameNode maintains the metadata.
However, when we partition, it creates different directories on HDFS. This ultimately helps when you want to query the data with conditions on the partitioning column: only the relevant directories are searched for the requested data. If you query your partitioned data and call explain to look at the plan, you will notice the Partition Filters in the FileScan phase.
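A minimal sketch of that check against the /Sales layout from the question (the session setup is an assumption for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark discovers the country/city partition columns from the directory names.
sales = spark.read.parquet("/Sales")

# Filtering on a partitioning column lets Spark prune whole directories;
# the FileScan node in the plan should show PartitionFilters on city.
sales.filter(sales.city == "Barcelona").explain(True)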
The partitioned data is still stored on HDFS in the same way that you mentioned.
Hope this helps!

Need fewer parquet files

I am doing the following process
rdd.toDF.write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
However, under each partition there are too many parquet files, and each of them is very small, which makes my subsequent steps very slow when loading all the parquet files. Is there a better way to produce fewer parquet files under each partition and increase the size of each individual parquet file?
You can repartition before save:
rdd.toDF.repartition("Some Column").write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
I used to have this problem.
Actually you can't directly control how the files are partitioned, because it depends on what the executors do.
The way to work around it is to use the coalesce method to set however many partitions you want, but it's not an efficient approach, and you also need to set enough driver memory to handle this operation.
df.coalesce(numPartitions).write.partitionBy("yyyyy").parquet("xxxx")
I also faced this issue. The problem is that if you use coalesce, each partition gets the same number of parquet files. Different partitions have different sizes, so ideally I would need a different coalesce value for each partition.
It's going to be quite expensive if you open a lot of small files. Let's say you open 1k files and each file's size is far below the value of your parquet.block.size.
Here are my suggestions:
Create a job that first merges your input parquet files into a smaller number of files whose sizes are near or equal to parquet.block.size (a sketch of such a merge job follows this list). The default block size is 128 MB, though it's configurable via parquet.block.size. Spark works best when your parquet files are close to, but not above, the value of parquet.block.size; the block size is the size of a row group being buffered in memory.
Or update your spark job to just read limited number of files
Or if you have a big machine and/or resources, just do the right tuning.
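A minimal sketch of such a merge job (the paths and the target partition count are placeholders you would tune so each output file lands near parquet.block.size):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the many small files, then rewrite them as a handful of larger ones.
small_files = spark.read.parquet("/path/to/small-files")

# Pick a partition count so each output file is close to parquet.block.size (128 MB by default).
small_files.repartition(16).write.mode("overwrite").parquet("/path/to/merged")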
Hive queries have a way to merge small files into larger ones; this is not available in Spark SQL. Also, reducing spark.sql.shuffle.partitions won't help with the Dataframe API.
I tried the solution below and it generated a smaller number of parquet files (from 800 parquet files down to 29).
Suppose the data is loaded to a dataframe df
Create a temporary table in hive.
df.createOrReplaceTempView("tempTable")
spark.sql("CREATE TABLE test_temp LIKE test")
spark.sql("INSERT INTO TABLE test_temp SELECT * FROM tempTable")
The test_temp will contain small parquet files.
Populate final hive table from temporary table
spark.sql("INSERT INTO test SELECT * FROM test_temp")
The final table will contain fewer files. Drop the temporary table after populating the final table.
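For example, a minimal sketch of that cleanup step, assuming the same table name as above:
# Remove the intermediate table once the final table has been populated.
spark.sql("DROP TABLE IF EXISTS test_temp")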

Resources