Table bucketing is not leveraged in a join (Spark on Glue) - apache-spark

I have a dimension table that is bucketed into 8000 buckets on the key (market_cd, account_id). When joining it with a fact table on the same key (market_cd, account_id), I expect the join to avoid a shuffle (exchange) on the dimension side, but that's not happening.
As shown in the picture, the dimension table still goes through a shuffle.
Parameters added:
--conf spark.sql.shuffle.partitions=8000 --conf spark.sql.sources.bucketing.enabled=true
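For reference, the common setup in which the exchange disappears is when both sides are written as bucketed tables on the join keys with the same bucket count and are read back through the metastore. A minimal sketch of that setup (Scala; the DataFrame and table names are placeholders, not from the question):
// Hypothetical names; both tables bucketed on the same keys with the same bucket count.
dimDf.write
  .bucketBy(8000, "market_cd", "account_id")
  .sortBy("market_cd", "account_id")
  .saveAsTable("dim_bucketed")
factDf.write
  .bucketBy(8000, "market_cd", "account_id")
  .sortBy("market_cd", "account_id")
  .saveAsTable("fact_bucketed")
// Read back through the metastore so Spark sees the bucketing metadata,
// then check the physical plan for the absence of Exchange on the join.
spark.table("fact_bucketed")
  .join(spark.table("dim_bucketed"), Seq("market_cd", "account_id"))
  .explain(true)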

Related

Number of partitions scanned (=1000) on table 'table' exceeds limit (=100)

All queries for table fail with this error.
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Number of partitions scanned (=1000) on table 'table' exceeds limit (=100). This is controlled on the metastore server by metastore.limit.partition.request.)
spark.table("table").
filter($"dt" === "2023-01-01").
show
I have these configs for spark-shell
--conf spark.sql.hive.convertMetastoreOrc=false \
--conf spark.sql.hive.metastorePartitionPruning=true \
Spark seems to be scanning the whole table despite the filter and the configs. Why does this happen? My table:
CREATE EXTERNAL TABLE table(
columns ...
)
PARTITIONED BY (dt date)
STORED AS ORC
TBLPROPERTIES ('external.table.purge'='true', 'orc.compress'='ZLIB')
You should set the hive.metastore.limit.partition.request property (see the docs), as the error says. The default value is -1 (unlimited), but your Spark config is probably setting that value somewhere.
To set Hive properties in Spark you need to prefix them with spark.sql. (as you already do in your example).
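For illustration only, a sketch of overriding the limit from the Spark side. Note that, as the error message says, this limit is controlled on the metastore server, so a client-side override may have no effect and the value may instead have to be raised in the server's hive-site.xml; the spark.hadoop. prefix and its effectiveness here are assumptions to verify against your deployment.
import org.apache.spark.sql.SparkSession

// Sketch only: spark.hadoop.* entries are forwarded to the Hadoop/Hive client
// configuration; verify that your metastore honours a client-side override.
val spark = SparkSession.builder()
  .config("spark.hadoop.hive.metastore.limit.partition.request", "-1")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport()
  .getOrCreate()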

Hive: insert into table via Hue produces a different number of files than pyspark

I have a Cloudera cluster on which I am accumulating large amounts of data in a Hive table stored as Parquet. The table is partitioned by an integer batch_id. My workflow for inserting a new batch of rows is to first insert the rows into a staging table, then insert them into the large accumulating table. I am using a PySpark script run in local mode to do this. The script is essentially:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hc = HiveContext(sc)
hc.sql(
"""
INSERT INTO largeAccumulatorTable
PARTITION (batch_id = {0})
SELECT * FROM stagingBatchId{0}
"""
.format(batch_id)
)
I execute it using this shell script:
#!/bin/bash
spark-submit \
--master local[*] \
--num-executors 8 \
--executor-cores 1 \
--executor-memory 2G \
spark_insert.py
I have noticed that the resulting Parquet files in the large accumulating table are very small (some just a few KB) and numerous. I want to avoid this: I want the Parquet files to be large and few. I've tried setting different Hive configuration values at runtime in PySpark to no avail:
Set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
Set mapred.map.tasks to a small number
Set num-executors to a small number
Use local[1] master instead of local[*]
Set mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize to high values
None of these changes had any effect on the number or sizes of Parquet files. However, when I open Cloudera Hue and enter this simple statement:
INSERT INTO largeAccumulatorTable
PARTITION (batch_id = XXX)
SELECT * FROM stagingBatchIdXXX
It works exactly as I would hope, producing a small number of Parquet files that are all about 100 MB.
What am I doing wrong in Pyspark? How can I make it achieve the same result as in Hue? Thanks!
Spark's default number of shuffle partitions is 200. Based on your data size, try reducing or increasing that configuration value, e.g. sqlContext.sql("set spark.sql.shuffle.partitions=20").
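As an illustration of that advice, here is a sketch in Scala (the PySpark calls are essentially the same). It uses the table names from the question as placeholders; whether insertInto needs dynamic-partition mode enabled depends on your Hive setup, so treat the details as assumptions to verify.
import org.apache.spark.sql.functions.lit

// Reduce shuffle parallelism for this session (only matters when the plan actually shuffles).
spark.conf.set("spark.sql.shuffle.partitions", "20")

// Alternatively, collapse the staging data to a few partitions so each task
// writes one Parquet file; insertInto expects the partition column to be present
// and may require hive.exec.dynamic.partition.mode=nonstrict.
val batchId = 42 // placeholder
spark.table(s"stagingBatchId$batchId")
  .withColumn("batch_id", lit(batchId))
  .coalesce(4)
  .write
  .insertInto("largeAccumulatorTable")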

Does Spark know the partitioning key of a DataFrame?

I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
Running Spark 2.0.1 with a local SparkSession. I have a CSV dataset that I am saving as a parquet file on disk like so:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val df0 = spark
.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("inferSchema", false)
.load("SomeFile.csv"))
val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)
df.write
.mode(SaveMode.Overwrite)
.format("parquet")
.option("inferSchema", false)
.save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte. This should group multiple numerocarte to same partition. I don't want to do partitionBy("numerocarte") at the write time because I don't want one partition per card. It would be millions of them.
After that in another script I read this SomeFile.parquet parquet file and do some operations on it. In particular I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = spark.read
.format("parquet")
.option("header", true)
.option("inferSchema", false)
.load("SomeFile.parquet")
val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))
df2.withColumn("NewColumnName",
sum(col("dollars").over(w))
After read I can see that the repartition worked as expected and DataFrame df2 has 42 partitions and in each of them are different cards.
Questions:
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
If it knows, then there will be no shuffle in the window function. True?
If it does not know, it will do a shuffle in the window function. True?
If it does not know, how do I tell Spark the data is already partitioned by the right column?
How can I check a partitioning key of DataFrame? Is there a command for this? I know how to check number of partitions but how to see partitioning key?
When I print number of partitions in a file after each step, I have 42 partitions after read and 200 partitions after withColumn which suggests that Spark repartitioned my DataFrame.
If I have two different tables repartitioned with the same column, would the join use that information?
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you save data which has been shuffled does not mean that it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you have loaded the data, but you can check queryExecution for the Partitioner.
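For example, a quick way to inspect this (a small sketch, assuming a Spark 2.x Dataset):
// Shows the planner's view of the partitioning (e.g. HashPartitioning or
// UnknownPartitioning) rather than just the number of partitions.
df2.queryExecution.executedPlan.outputPartitioning
// RDD-level partitioner, if any.
df2.rdd.partitioner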
In practice:
If you want to support efficient pushdowns on the key, use partitionBy method of DataFrameWriter.
If you want a limited support for join optimizations use bucketBy with metastore and persistent tables.
See How to define partitioning of DataFrame? for detailed examples.
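For instance, the two options look roughly like this (an illustrative sketch with placeholder path and table names):
// Pushdown-friendly layout: one directory per key value.
df.write.partitionBy("numerocarte").parquet("/tmp/by_carte")
// Join/aggregation-friendly layout: a fixed number of buckets in a metastore table.
df.write.bucketBy(42, "numerocarte").sortBy("numerocarte").saveAsTable("cartes_bucketed")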
I am answering my own question for future reference with what worked.
Following the suggestion of @user8371915, bucketBy works!
I am saving my DataFrame df:
df.write
.bucketBy(250, "userid")
.saveAsTable("myNewTable")
Then when I need to load this table:
val df2 = spark.sql("SELECT * FROM myNewTable")
val w = Window.partitionBy("userid")
val df3 = df2.withColumn("newColumnName", sum(col("someColumn")).over(w))
df3.explain
I confirm that when I do window functions on df2 partitioned by userid there is no shuffle! Thanks @user8371915!
Some things I learned while investigating it
myNewTable looks like a normal parquet file but it is not. You could read it normally with spark.read.format("parquet").load("path/to/myNewTable") but the DataFrame created this way will not keep the original partitioning! You must use spark.sql select to get a correctly partitioned DataFrame.
You can look inside the table with spark.sql("describe formatted myNewTable").collect.foreach(println). This will tell you what columns were used for bucketing and how many buckets there are.
Window functions and joins that take advantage of partitioning often also require a sort. You can sort the data in your buckets at write time using .sortBy(), and the sort will also be preserved in the Hive table: df.write.bucketBy(250, "userid").sortBy("someColumnName").saveAsTable("myNewTable")
When working in local mode the table myNewTable is saved to a spark-warehouse folder in my local Scala SBT project. When saving in cluster mode with Mesos via spark-submit, it is saved to the Hive warehouse. For me it was located in /user/hive/warehouse.
When doing spark-submit you need to add two options to your SparkSession: .config("hive.metastore.uris", "thrift://address-to-your-master:9083") and .enableHiveSupport(). Otherwise the Hive tables you created will not be visible.
If you want to save your table to a specific database, run spark.sql("USE yourDatabase") before bucketing.
Update 05-02-2018
I encountered some problems with Spark bucketing and the creation of Hive tables. Please refer to the question, replies, and comments in Why is Spark saveAsTable with bucketBy creating thousands of files?

spark cross join memory leak

I have two tables to be cross-joined:
table 1: queries, 300M rows
table 2: product descriptions, 3000 rows
The following query does a cross join, calculates scores for each (query, product) pair, and picks the top 3 matches:
query_df.repartition(10000).registerTempTable('queries')
product_df.coalesce(1).registerTempTable('products')
CREATE TABLE matches AS
SELECT *
FROM
(SELECT *,
row_number() over (partition BY a.query_id
ORDER BY 0.40 + 0.15*score_a + 0.20*score_b + 0.5*score_c DESC) AS rn
FROM
(SELECT /*+ MAPJOIN(b) */ a.query_id,
b.product_id,
func_a(a.qvec,b.pvec) AS score_a,
func_b(a.qvec,b.pvec) AS score_b,
func_c(a.qvec,b.pvec) AS score_c
FROM queries a
CROSS JOIN products b) a) a
WHERE rn <= 3
My Spark cluster setup looks like the following:
MASTER="yarn-client" /opt/mapr/spark/spark-1.6.1/bin/pyspark --num-executors 22 --executor-memory 30g --executor-cores 7 --driver-memory 10g --conf spark.yarn.executor.memoryOverhead=10000 --conf spark.akka.frameSize=2047
Now the issue is that, as expected, the job fails after a couple of stages due to a memory leak caused by the extremely large temp data produced. I'm looking for help/suggestions on optimizing the above operation in such a way that the job runs both the match and filter operations for one query_id before picking the next query_id, in a parallel fashion - similar to a sort within a for loop over the queries table. If the job is slow but successful, I'm OK with it, since I can request a bigger cluster.
The above query works fine for a smaller query table, say one with 10000 records.
In the scenario where you want to join table A (big) with table B (small), the best practice is to leverage broadcast join.
A clear overview is given in https://stackoverflow.com/a/39404486/1203837.
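A rough sketch of that approach (Scala, Spark 2.x API; on Spark 1.6 the equivalent is a plain join with a broadcast hint). The names funcA/funcB/funcC stand in for the question's scoring functions registered as Spark UDFs, and the DataFrame names and weights simply mirror the question, so treat them all as placeholders:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{broadcast, col, lit, row_number}

// Broadcast the small products table so the 300M-row queries table is never
// shuffled for the join; then rank matches per query_id and keep the top 3.
val scored = queries.crossJoin(broadcast(products))
  .withColumn("score_a", funcA(col("qvec"), col("pvec"))) // funcA/B/C: placeholder UDFs
  .withColumn("score_b", funcB(col("qvec"), col("pvec")))
  .withColumn("score_c", funcC(col("qvec"), col("pvec")))
val w = Window
  .partitionBy("query_id")
  .orderBy((lit(0.40) + col("score_a") * 0.15 + col("score_b") * 0.20 + col("score_c") * 0.5).desc)
val top3 = scored.withColumn("rank", row_number().over(w)).where(col("rank") <= 3)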
Hope this helps.
Cartesian joins (cross joins) in Spark are extremely expensive. I would suggest joining the tables with an inner join and saving the output first, then using that DataFrame for further aggregation.
One small note: a map join or broadcast join can fail if the smaller table is not small enough. Unless you are sure about the size of the small table, refrain from using a broadcast join.

How to process Kafka partitions separately and in parallel with Spark executors?

I use Spark 2.1.1.
I read messages from 2 Kafka partitions using Structured Streaming. I am submitting my application to a Spark Standalone cluster with one worker and 2 executors (2 cores each).
./bin/spark-submit \
--class MyClass \
--master spark://HOST:IP \
--deploy-mode cluster \
/home/ApplicationSpark.jar
I want messages from each Kafka partition to be processed independently by a separate executor. But what is happening now is that the executors read and .map the partition data separately, yet after the mapping the single unbounded table that is formed is used commonly and holds data from both partitions.
When I run the structured query on the table, it therefore has to deal with data from both partitions (a larger amount of data).
select product_id, max(order_time), max(product_price), min(product_price)
from OrderRecords
group by WINDOW(order_time, "120 seconds"), product_id
where the Kafka topic is partitioned on product_id.
Is there any way to run the same structured query in parallel, but separately, on the data from the Kafka partition to which each executor is mapped?
But what is happening now is that the executors read and .map the partition data separately, yet after the mapping the single unbounded table that is formed is used commonly and holds data from both partitions. Hence when I run the structured query on the table, it has to deal with data from both partitions (a larger amount of data).
That's the key to understanding what can be executed, and how, without causing a shuffle and sending data across partitions (possibly even over the wire).
The definitive answer depends on what your queries are. If they work on groups of records where the groups are spread across multiple topic partitions, and hence across two different Spark executors, you'd have to be extra careful with your algorithm/transformation to do the processing on the separate partitions (using only what's available in each partition) and only aggregate the results afterwards.
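If you really need one query per topic partition, one option is to start a separate streaming query pinned to each partition. This is only a sketch under assumptions: the topic name, bootstrap servers, and value parsing are placeholders, and it relies on the Kafka source's assign option for subscribing to specific partitions.
// Each query is pinned to a single Kafka partition via "assign", so its state and
// aggregation only ever see that partition's data; use {"orders":[1]} for the other.
val partition0 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder
  .option("assign", """{"orders":[0]}""")          // topic "orders", partition 0 only
  .load()
// ...parse the value column into order_time, product_id, product_price here,
// then run the windowed aggregation and start the query.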
