I am saving a dataframe with saveAsTable. In the Spark UI I see that all stages are completed, but the SQL tab still shows the query as active. When I check HDFS under the table location (hadoop fs -ls /user/hive/warehouse/database.db/mytablename | wc -l) I see that files are still being appended there. Is there any way to speed up this writing step?
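If the long tail is spent writing a very large number of small files (a pattern that comes up again below), one thing worth trying is to reduce the number of output files before the write. A minimal sketch, assuming a DataFrame df and that 50 is a reasonable file count for the data volume:

// Sketch: fewer write tasks => fewer, larger files and fewer filesystem
// round-trips. `df`, the table name, and the count 50 are placeholders.
df.coalesce(50)
  .write
  .mode("overwrite")            // or "append", matching the original job
  .saveAsTable("database.mytablename")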
Related
We have Spark jobs but also occasionally run Hive queries on the same Hadoop cluster.
I have seen the same Hive table show two different partition file patterns, like below.
For example, if the table is partitioned by date, then
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2019-12-01/
gives results like
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
However, listing a different partition date
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2020-01-01/
lists files with a different name pattern:
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000007_0
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000008_0
....
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000010_0
From what I can tell, the difference is not only that one partition's data files come with the part- prefix while the other's look like 00000n_0; there are also far more of the part- files, and each of them is quite small.
I have also found that aggregations over the part- files are a lot slower than over the 00000n_0 files.
What could be the possible cause of the file pattern difference, and what configuration would change one into the other?
When Spark Streaming writes data into Hive, it creates lots of small files named part- in Hive, and their number keeps increasing. This causes performance issues when querying the Hive table: Hive takes too long to return results because of the large number of small files in the partition.
When a Spark job writes data into Hive, it looks like this:
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
The different file pattern here, however, comes from compaction logic that was run on the partition's files to merge the small files into larger ones. The n in 00000n_0 is the reducer number.
A sample compaction script, which compacts the small files into a big file within a partition, for the example table under the sample database:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=268435456; --256MB reducer size.
CREATE TABLE example_tmp
STORED AS parquet
LOCATION '/user/hive/warehouse/sample.db/example_tmp'
AS
SELECT * FROM sample.example;
INSERT OVERWRITE TABLE sample.example PARTITION (part_date) SELECT * FROM sample.example_tmp;
DROP TABLE IF EXISTS sample.example_tmp PURGE;
The above script compacts the small files into a few big files within each partition, and the resulting file names will look like 00000n_0.
What could be the possible cause of the file pattern difference, and what configuration would change one into the other?
Most likely someone ran compaction logic on the partition using Hive, or reloaded the partition data using Hive. This is not an issue; the data remains the same.
This question is regarding Spark 1.6.
When a dataframe is written to HDFS in SaveMode.Append mode, I want to know which files were newly created.
One way to do this is to keep track of the files in HDFS before and after the job; is there a better way?
Also, MapReduce prints job statistics at the end; do we have something similar for every Spark action?
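For what it's worth, here is a minimal sketch of the before/after approach mentioned above, using the Hadoop FileSystem API. The paths and names are illustrative, and listStatus is non-recursive, so a partitioned table would need a recursive walk instead:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val fs = FileSystem.get(new Configuration())
val tableDir = new Path("/user/hive/warehouse/mydb.db/mytable")

// Snapshot the direct children of the table directory.
def snapshot(dir: Path): Set[String] =
  fs.listStatus(dir).map(_.getPath.toString).toSet

val before = snapshot(tableDir)
df.write.mode(SaveMode.Append).saveAsTable("mydb.mytable")
val after = snapshot(tableDir)

(after -- before).foreach(println)   // files created by this append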
I have a Spark job which runs every 30 minutes and writes its output to HDFS (/tmp/data/1497567600000). The job runs continuously in the cluster.
How can I create a Hive table on top of this data? I have seen one solution on StackOverflow which creates a Hive table on top of data partitioned by a date field, like this:
CREATE EXTERNAL TABLE `mydb.mytable`
(`col1` string,
`col2` decimal(38,0),
`create_date` timestamp,
`update_date` timestamp)
PARTITIONED BY (`my_date` string)
STORED AS ORC
LOCATION '/tmp/out/'
and the solution suggests altering the table as follows:
ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
But in my case, I have no idea how the output directories are being written, so I clearly can't create the partitions as suggested above.
How can I handle this case, where the output directories are written under arbitrary timestamps and are not in the format /tmp/data/timestamp=1497567600000?
How can I make Hive pick up the data under the directory /tmp/data?
I can suggest two solutions:
If you can change your Spark job, then you can partition your data by hour (e.g. /tmp/data/1, /tmp/data/2), add Hive partitions for each hour, and just write to the relevant partition.
You can write a bash script responsible for adding the Hive partitions (a Scala sketch of the same steps follows this list), which can be achieved by:
listing HDFS subdirectories using the command hadoop fs -ls /tmp/data
listing Hive partitions for the table using the command hive -e 'show partitions table;'
comparing the above lists to find the missing partitions
adding the new Hive partitions with the command shown above: ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
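A rough Scala sketch of those four steps, assuming sqlContext is a HiveContext and reusing the illustrative paths and names from the question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// 1. List HDFS subdirectories (each one is a candidate partition value).
val hdfsDirs = fs.listStatus(new Path("/tmp/data"))
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .toSet

// 2. List the partitions Hive already knows about.
val hivePartitions = sqlContext.sql("SHOW PARTITIONS mydb.mytable")
  .collect()
  .map(_.getString(0).stripPrefix("my_date="))
  .toSet

// 3 + 4. Register every partition that is on HDFS but not in Hive.
for (d <- hdfsDirs -- hivePartitions) {
  sqlContext.sql(
    s"ALTER TABLE mydb.mytable ADD PARTITION (my_date='$d') " +
    s"LOCATION '/tmp/data/$d'")
}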
I am trying to understand how I could improve (or increase) the parallelism of tasks that run for a particular spark job.
Here is my observation...
scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size()
25
$ hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
200
$ hadoop fs -du -h -s hdfs://somefile
2.2 G
I notice that the number of part files created in HDFS during the save operation depends on whatever repartition / coalesce was applied beforehand, meaning the number of part files can be tweaked via that parameter.
But how do I control the read side's partitions.size()? I want it to be 200 (without having to repartition during the read operation) so that more tasks run for this job.
This has a major impact on the time it takes to perform query operations in this job.
On a side note, I do understand that 200 parquet part files for the above 2.2 GB seems overkill for a 128 MB block size; ideally it should be around 18 parts.
Please advise.
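A sketch of both knobs, assuming Spark 2.x (the spark.sql.files.maxPartitionBytes setting is an assumption beyond what the thread states; the paths are the illustrative ones from above):

// Read side: a smaller maximum split size yields more read partitions.
// 11 MB is illustrative (2.2 GB / 200 ≈ 11 MB per split).
spark.conf.set("spark.sql.files.maxPartitionBytes", 11L * 1024 * 1024)
val df = spark.read.parquet("hdfs://somefile")
println(df.rdd.partitions.length)   // should now be much closer to 200

// Write side, as observed above: repartition/coalesce sets the file count.
df.repartition(200).write.parquet("hdfs://someotherpath")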
I am working with a large dataset that is partitioned by two columns, plant_name and tag_id. The second partition column, tag_id, has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
I would expect a fast response, as this resolves to a single partition. In Hive and Presto this takes seconds; in Spark, however, it runs for hours.
The actual data is held in an S3 bucket, and when I submit the SQL query, Spark first fetches all the partitions from the Hive metastore (200000 of them), and then calls refresh() to force a full status listing of all these files in the S3 object store (actually calling listLeafFilesInParallel).
It is these two operations that are so expensive. Are there any settings that can get Spark to prune the partitions earlier, either during the call to the metastore or immediately afterwards?
Yes, Spark supports partition pruning.
Spark lists the partition directories (sequentially, or in parallel via listLeafFilesInParallel) to build a cache of all partitions the first time around. Queries in the same application that scan data take advantage of this cache, so the slowness you see is likely this cache being built. Subsequent queries that scan data use the cache to prune partitions.
These are the logs which show partitions being listed to populate the cache:
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver
These are the logs showing that pruning is happening:
App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
Refer to convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
Just a thought:
The Spark API documentation for HadoopFsRelation (https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/sources/HadoopFsRelation.html) says:
"...when reading from Hive style partitioned tables stored in file systems, it's able to discover partitioning information from the paths of input directories, and perform partition pruning before start reading the data..."
So I guess listLeafFilesInParallel may not be the problem.
A similar issue is already tracked in the Spark JIRA: https://issues.apache.org/jira/browse/SPARK-10673
Since setting spark.sql.hive.verifyPartitionPath to false had no effect on performance, I suspect that the issue might be caused by unregistered partitions. Please list the partitions of the table and verify that all of them are registered. Otherwise, recover your partitions as shown in this link:
Hive doesn't read partitioned parquet files generated by Spark
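For reference, one general way in Hive to re-register partitions that exist on disk but are missing from the metastore is MSCK REPAIR TABLE (a standard Hive command, though not necessarily what the linked answer uses). A one-line sketch, assuming sqlContext is a HiveContext and borrowing the table name from the question:

// Sketch: ask Hive to scan the table location and register any partitions
// that are on disk but missing from the metastore. Table name illustrative.
sqlContext.sql("MSCK REPAIR TABLE tag_data")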
Update:
I assume appropriate Parquet block and page sizes were set while writing the data.
Create a fresh Hive table with the partitions mentioned and Parquet as the file format, and load it from the non-partitioned table using the dynamic partition approach (https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions).
Run a plain Hive query and then compare by running the Spark program.
Disclaimer: I am not a Spark/Parquet expert. The problem sounded interesting, hence the response.
A similar question popped up here recently:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-reads-all-leaf-directories-on-a-partitioned-Hive-table-td35997.html#a36007
This question is old but I thought I'd post the solution here as well.
spark.sql.hive.convertMetastoreParquet=false
will make Spark use the Hive Parquet serde instead of Spark's built-in Parquet serde. Hive's Parquet serde will not do a listLeafFiles over all partitions; it directly reads only the selected partitions. On tables with many partitions and files this is much faster (and cheaper, too). Feel free to try it out! :)
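In code, that is just a one-line setting before the query; a sketch reusing the names from the question above:

// Sketch: use Hive's Parquet serde so only the selected partitions are read.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
val df = sqlContext.sql(
  "select * from tag_data where plant_name='PLANT01' and tag_id='1000'")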