Performance consideration when reading from hive view Vs hive table via DataFrames - apache-spark

We have a view that unions multiple hive tables. If i use spark SQL in pyspark and read that view will there be any performance issue as against reading directly from the table.
In hive we had something called full table scan if we don't limit the where clause to an exact table partition. Is spark intelligent enough to directly read the table that has the data that we are looking for rather than searching through the entire view ?
Please advise.

You are talking about partition pruning.
Yes spark supports it spark automatically omits large data read when partition filters are specified.
Partition pruning is possible when data within a table is split across multiple logical partitions. Each partition corresponds to a particular value of a partition column and is stored as a subdirectory within the table root directory on HDFS. Where applicable, only the required partitions (subdirectories) of a table are queried, thereby avoiding unnecessary I/O
After partitioning the data, subsequent queries can omit large amounts of I/O when the partition column is referenced in predicates. For example, the following query automatically locates and loads the file under peoplePartitioned/age=20/and omits all others:
val peoplePartitioned = spark.read.format("orc").load("peoplePartitioned")
peoplePartitioned.createOrReplaceTempView("peoplePartitioned")
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20")
more detailed info is provided here
You can also see this in the logical plan if you run an explain(True) on your query:
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20").explain(True)
it will show which partitions are read by spark

Related

what caused different pattern in hive table partition?

We have spark job but also randomly run hive query in current hadoop cluster
I have seen the same hive table has different partition pattern like below:
i.e. if the table is partition by date, so
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2019-12-01/
gave result
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
however if find data from different partition date
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2020-01-01/
list files with different name patter
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000007_0
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000008_0
....
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000010_0
What I can tell the difference not only in one partition the data files come with part- prefix and the other is like 00000n_0, also there are a lot more amount of files for part- file but each file is quite small.
I also found aggregation on part- files are a lot slower than 00000n_0 files
what could be the possible cause of the file pattern difference and what could be the configuration to change from one to another?
When spark streaming writes data in Hive it creates lots of small files named as part- in Hive and which keep on the increase. This will give performance issue while querying on Hive table. Hive takes too much time to give result due to large no of small files in the partition.
When spark job write data in Hive it looks like -
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
But here different file pattern is due to compaction logic on the partition's file to compact the small file into a large. Here n in 00000n_0 is the no of reducer.
Sample compaction script, which compacts the small file into a big file within partition for example table under-sample database -
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=268435456; --256MB reducer size.
CREATE TABLE example_tmp
STORED AS parquet
LOCATION '/user/hive/warehouse/sample.db/example_tmp'
AS
SELECT * FROM example
INSERT OVERWRITE table sample.example PARTITION (part_date) select * from sample.example_tmp;
DROP TABLE IF EXISTS sample.example_tmp PURGE;
The above script will compact the small files into some big file within the partition. And filename will be 00000n_0
what could be the possible cause of the file pattern difference and what could be the configuration to change from one to another?
There might be someone run compaction logic on the partition using Hive. Or might be reload the partition data using Hive. This is not an issue, data remains the same.

Efficient Filtering on a huge data frame in Spark

I have a Cassandra table with 500 million rows. I would like to filter based on a field which is a partition key in Cassandra using spark.
Can you suggest the best possible/efficient approach to filter in Spark/Spark SQL based on the list keys which is also a pretty large.
Basically i need only those rows from the Cassandra table which are present in the list of keys.
We are using DSE and its features.
The approach i am using is taking lot of time roughly around an hour.
Have you checked repartitionByCassandraReplica and joinWithCassandraTable ?
https://github.com/datastax/spark-cassandra-connector/blob/75719dfe0e175b3e0bb1c06127ad4e6930c73ece/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the java drive to execute a single
query for every partition required by the source RDD so no un-needed
data will be requested or serialized. This means a join between any
RDD and a Cassandra Table can be performed without doing a full table
scan. When performed between two Cassandra Tables which share the same
partition key this will not require movement of data between machines.
In all cases this method will use the source RDD's partitioning and
placement for data locality.
The method repartitionByCassandraReplica can be used to relocate data
in an RDD to match the replication strategy of a given table and
keyspace. The method will look for partition key information in the
given RDD and then use those values to determine which nodes in the
Cluster would be responsible for that data.

Does Spark support Partition Pruning with Parquet Files

I am working with a large dataset, that is partitioned by two columns - plant_name and tag_id. The second partition - tag_id has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
I would expect a fast response as this resolves to a single partition. In Hive and Presto this takes seconds, however in Spark it runs for hours.
The actual data is held in a S3 bucket, and when I submit the sql query, Spark goes off and first gets all the partitions from the Hive metastore (200000 of them), and then calls refresh() to force a full status list of all these files in the S3 object store (actually calling listLeafFilesInParallel).
It is these two operations that are so expensive, are there any settings that can get Spark to prune the partitions earlier - either during the call to the metadata store, or immediately afterwards?
Yes, spark supports partition pruning.
Spark does a listing of partitions directories (sequential or parallel listLeafFilesInParallel) to build a cache of all partitions first time around. The queries in the same application, that scan data takes advantage of this cache. So the slowness that you see could be because of this cache building. The subsequent queries that scan data make use of the cache to prune partitions.
These are the logs which shows partitions being listed to populate the cache.
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver
These are the logs showing pruning is happening.
App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
Refer convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
Just a thought:
Spark API documentation for HadoopFsRelation says,
( https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/sources/HadoopFsRelation.html )
"...when reading from Hive style partitioned tables stored in file
systems, it's able to discover partitioning information from the paths
of input directories, and perform partition pruning before start
reading the data..."
So, i guess "listLeafFilesInParallel" could not be a problem.
A similar issue is already in spark jira: https://issues.apache.org/jira/browse/SPARK-10673
In spite of "spark.sql.hive.verifyPartitionPath" set to false and, there is no effect in performance, I suspect that the
issue might have been caused by unregistered partitions. Please list out the partitions of the table and verify if all
the partitions are registered. Else, recover your partitions as shown in this link:
Hive doesn't read partitioned parquet files generated by Spark
Update:
I guess appropriate parquet block size and page size were set while writing the data.
Create a fresh hive table with partitions mentioned, and file-format as parquet, load it from non-partitioned table using dynamic partition approach.
( https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions )
Run a plain hive query and then compare by running a spark program.
Disclaimer: I am not a spark/parquet expert. The problem sounded interesting, and hence responded.
similar question popped up here recently:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-reads-all-leaf-directories-on-a-partitioned-Hive-table-td35997.html#a36007
This question is old but I thought I'd post the solution here as well.
spark.sql.hive.convertMetastoreParquet=false
will use the Hive parquet serde instead of the spark inbuilt parquet serde. Hive's Parquet serde will not do a listLeafFiles on all partitions, but only and directly read from the selected partitions. On tables with many partitions and files, this is much faster (and cheaper, too). Feel free to try it ou! :)

Spark SQL and Cassandra JOIN

My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.
Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.
On the other hand, users can define "datasets" and I have another table which contains, as a partition key the "dataset name" and a clustering column which is a timestamp referring to the other table (so a "dataset" is a list of partition keys).
Of course what I would like to do looks like an anti-pattern for Cassandra as I'd like to join two tables.
However using Spark SQL I can run such a query and perform the JOIN.
SELECT * from datasets JOIN data
WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'
Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
Edit: fix the answer with regard to join optimization
is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate push down and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate push down for the join operation itself unless you use the RDD API with joinWithCassandraTable()
See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala

What is the metastore for in Spark?

I am using SparkSQL in python. I have created a partitioned table (~few hundreds of partitions) stored it into Hive Internal Table using the hiveContext. The hive warehouse is located in S3.
When I simply do "df = hiveContext.table("mytable"). It would take over a minute to going through all the partitions the first time. I thought the metastore stored all the metadata. Why would spark still need to going through each partition? Is it possible to avoid this step so my startup can be faster?
The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore. For Hive partitioned tables, the partition information needs to be stored in the metastore. Depending on how the table is created will dictate how this behaves. From the information provided, it sounds like you created a SparkSQL table.
SparkSQL stores the table schema (which includes partition information) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.

Resources