Hive partitions to Spark partitions - apache-spark

We need to work on a big dataset with partitioned data, for efficiency reasons. Data source resides in Hive, but with a different partition criteria. In other words, we need to retrieve data from Hive to Spark, and re-partition in Spark.
But there is an issue in Spark that causes reordering/redistributing partitioning when data is persisted (either to parquet or ORC). Therefore, our new partitioning in Spark is lost.
As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (for read)?

Partition Discovery --> might be what you are looking for:
" Passing the path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. "

Related

Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive-External table which is created on top of S3 (Files are in CSV format). When I run the hive query it shows all records and partitions.
However when I use the same table in Spark ( where the Spark SQL has a where condition on the partition column) it does not show that a partition filter is applied. However for a Hive Managed table , Spark is able to use the information of partitions and apply the partition filter.
Is there any flag or setting that can help me make use of partitions of Hive external tables in Spark ? Thanks.
Update :
For some reason, only the spark plan is not showing the Partition Filters. However, when you look at the data loaded its only loading the data needed from the partitions.
Ex: Where rating=0 , loads only one file of 1 MB, when I don't have filter its reads all 3 partition for 3 MB
tl; dr set the following before the running sql for external table
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because of extenal/managed table.
The behaviour depends on two factors
1. Where the table was created(Hive or Spark)
2. File format (I believe it is ORC in this case, from the screen capture)
Where the table was created(Hive or Spark)
If the table was create using Spark APIs, it is considered as Datasource table.
If the table was created usng HiveQL, it is considered as Hive native table.
The metadata of both these tables are store in Hive metastore, the only difference is in the provider field of TBLPROPERTIES of the tables(describe extended <tblName>). The value of the property is orcor empty in Spark table and hive for a Hive.
How spark uses this information
When provider is not hive(datasource table), Spark uses its native way of processing the data.
If provider is hive, Spark uses Hive code to process the data.
Fileformat
Spark gives config flag to instruct the engine to use Datasource way of processing the data for the floowing file formats = Orc and Parquet
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
.doc("When set to true, the built-in Parquet reader and writer are used to process " +
"parquet tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
I also ran into this kind of problem having multiple joins of internal and external tables.
None of the tricks work including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
anyone who knows how to solve this problem.

Spark Connect Hive to HDFS vs Spark connect HDFS directly and Hive on the top of it?

Summary of the problem:
I have a perticular usecase to write >10gb data per day to HDFS via spark streaming. We are currently in the design phase. We want to write the data to HDFS (constraint) using spark streaming. The data is columnar.
We have 2 options(so far):
Naturally, I would like to use hive context to feed data to HDFS. The schema is defined and the data is feeded in batches or row wise.
There is another option. We can directly write data to HDFS thanks to spark streaming API. We are also considering this because we can query data from HDFS through hive then in this usecase. This will leave options open to use other technologies in future for the new usecases that may come.
What is best?
Spark Streaming -> Hive -> HDFS -> Consumed by Hive.
VS
Spark Streaming -> HDFS -> Consumed by Hive , or other technologies.
Thanks.
So far I have not found a discussion on the topic, my research may be short. If there is any article that you can suggest, I would be most happy to read it.
I have a particular use case to write >10gb data per day and data is columnar
that means you are storing day-wise data. if thats the case hive has partition column as date, so that you can query the data for each day easily. you can query the raw data from BI tools like looker or presto or any other BI tool. if you are querying from spark then you can use hive features/properties. Moreover if you store the data in columnar format in parquet impala can query the data using hive metastore.
If your data is columnar consider parquet or orc.
Regarding option2:
if you have hive an option NO need to feed data in to HDFS and create an external table from hive and access it.
Conclusion :
I feel both are same. but hive is preferred considering direct query on raw data using BI tools or spark. From HDFS also we can query data using spark. if its there in the formats like json or parquet or xml there wont be added advantage for option 2.
It depends on your final use cases. Please consider below two scenarios while taking decision:
If you have RT/NRT case and all your data is full refresh then I would suggest to go with second approach Spark Streaming -> HDFS -> Consumed by Hive. It will be faster than your first approach Spark Streaming -> Hive -> HDFS -> Consumed by Hive. Since there is one less layer in it.
If your data is incremental and also have multiple update, delete operations then It will be difficult to use HDFS or Hive over HDFS with spark. Since Spark does not allow to update or delete data from HDFS. In that case, both your approaches will be difficult to implement. Either you can go with Hive managed table and do update/delete using HQL (only supported in Hortonwork Hive version) or you can go with NOSQL database like HBase or Cassandra so that spark can do upsert & delete easily. From program perspective, it will be also easy in compare to both your approaches.
If you dump data in NoSQL then you can use hive over it for normal SQL or reporting purpose.
There are so many tools & approaches are available but go with that which fit in your all cases. :)

read/write bucketed tables in Spark

I have a number of tables (with 100 million-ish rows) that are stored as external Hive tables using Parquet format. The Spark job needs to join several of them together, using a single column, with almost no filtering. The join column has unique values about 2/3X fewer than the number of rows.
I can see that there are shuffles happening by the join key; and I have been trying to utilize bucketing/partitioning to improve join performance. My thought is that if Spark can be made aware that each of these tables has been bucketed using the same column, it can load the dataframes and join them without shuffling. I have tried using Hive bucketing, but the shuffles don't go away. (From Spark's documentation it looks like Hive bucketing is not supported as of Spark 2.3.0 at least, which I found out later.) Can I use Spark's bucketing feature to do this? If yes, would I have to disable Hive support and just read the files directly? Or could I rewrite the tables once using Spark's bucketing scheme and still be able to read them as Hive tables?
EDIT: For writing out the Hive bucketed tables I was using something like:
customerDF
.write
.option("path", "/some/path")
.mode("overwrite")
.format("parquet")
.bucketBy(200, "customer_key")
.sortBy("customer_key")
.saveAsTable("table_name")
The writing part seems to work. However, reading from two tables written that way and joining them didn't work as I expected. That is, Spark was repartitioning both tables again into 200 partitions.
I don't have code for doing Spark bucketing right now but will update if I figure it out.

Retrieve Cassandra partition data in Apache Spark

I have my data well organized by partition key on Cassandra. I would like to retrieve this data in Spark and keep the same partitions.
My goal is to avoid a very large shuffle.
PS : I am using Cassandra 2.1 and Spark 1.5
The Spark Cassandra Connector reads C* Token Ranges into Spark Partitions. This means all of the values for any given Cassandra Partition key will be in the same Spark Partition.
https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data

What is the metastore for in Spark?

I am using SparkSQL in python. I have created a partitioned table (~few hundreds of partitions) stored it into Hive Internal Table using the hiveContext. The hive warehouse is located in S3.
When I simply do "df = hiveContext.table("mytable"). It would take over a minute to going through all the partitions the first time. I thought the metastore stored all the metadata. Why would spark still need to going through each partition? Is it possible to avoid this step so my startup can be faster?
The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore. For Hive partitioned tables, the partition information needs to be stored in the metastore. Depending on how the table is created will dictate how this behaves. From the information provided, it sounds like you created a SparkSQL table.
SparkSQL stores the table schema (which includes partition information) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.

Resources