I was trying to optimize my DataBricks delta tables in order to improve query performance.
I am having few doubts to be clarified, asking here since I did not get answer from the documentation.
I am having few tables which are not partitioned but adding data incrementally. Is there any way to only optimize only the incrementally adding data at each week.
(I understood optimization happens at partitions and if it is not partitioned, optimization can be made on complete table data, currently we are doing like that. but just to confirm that there are no other work arounds.)
I have tables which are partitioned, and i try to optimize the table by providing a subquery to fetch the partition to be optimized. Below is the query used where load_date is the partition column.
OPTMIZE database.table where
load_date > (select to_date(max(load_date)) as load_date
from audit.delta_optimization_audit
where source = 'abc' and job_status = 'success')
But the optimization failed with ERROR
org.apache.spark.sql.AnalysisException: Subquery is not supported in partition predicates.
What and all conditions can be added in the OPTIMIZE WHERE clause? Is subquery not allowed in WHERE clause for OPTIMIZE command?
If there are multiple partitions like YEAR then MONTH then DATE, How should I provide Partitions in WHERE clause to OPTIMIZE data incrementally?
Is it useful to run OPTIMIZE query on a table which is not partitioned ?
Any Leads Appreciated. Thanks in Advance!
Related
I am trying multiple ways to optimize executions of large datasets using partitioning. In particular I'm using a function commonly used with traditional SQL databases called nTile.
The objective is to place a certain number of rows into a bucket using a combination of buckettind and repartitioning. This allows Apache Spark to process data more efficient when processing partitioned datasets or should I say bucketted datasets.
Below is two examples. The first example shows how I've used ntile to split a dataset into two buckets followed by repartitioning the data into 2 partitions on the bucketted nTile called skew_data.
I then follow with the same query but without any bucketing or repartitioning.
The problem is query without the bucketting is faster then the query with bucketting, even the query without bucketting places all the data into one partition whereas the query with bucketting splits the query into 2 partitions.
Can someone let me know why that is.
FYI
I'm running the query on a Apache Spark cluster from Databricks.
The cluster just has one single node with 2 cores and 15Gb memory.
First example with nTile/Bucketting and repartitioning
allin = spark.sql("""
SELECT
t1.make
, t2.model
, NTILE(2) OVER (ORDER BY t2.sale_price) AS skew_data
FROM
t1 INNER JOIN t2
ON t1.engine_size = t2.engine_size2
""")
.repartition(2, col("skew_data"), rand())
.drop('skew_data')
The above code splits the data into partitions as follows, with the corresponding partition distribution
Number of partitions: 2
Partitioning distribution: [5556767, 5556797]
The second example: with no nTile/Bucketting or repartitioning
allin_NO_nTile = spark.sql("""
SELECT
t1.make
,t2.model
FROM
t1 INNER JOIN t2
ON t1.engine_size = t2.engine_size2
""")
The above code puts all the data into a single partition as shown below:
Number of partitions: 1
Partitioning distribution: [11113564]
My question is, why is it that the second query(without nTile or repartitioning) is faster than query with nTile and repartitioning?
I have gone to great lengths to write this question out as fully as possible, but if you need further explanation please don't hesitate to ask. I really want to get to the bottom of this.
I abandoned my original approached and used the new PySpark function called bucketBy(). If you want to know how to apply bucketBy() to bucket data go to
https://www.youtube.com/watch?v=dv7IIYuQOXI&list=PLOmMQN2IKdjvowfXo_7hnFJHjcE3JOKwu&index=39
I am trying to extract data from a big table in SAP HANA, which is around 1.5tb in size, and the best way is to run in parallel across nodes and threads. Spark JDBC is the perfect candidate for the task, but in order to actually extract in parallel it requires partition column, lower/upper bound and number of partitions option to be set. To make the operation of the extraction easier, I considered adding an added partition column which would be the row_number() function and use MIN(), MAX() as lower/upper bounds respectively. And then the operations team just would be required to provide the number of partitions to have.
The problem is that HANA runs out of memory and it is very likely that row_number() is too costly on the engine. I can only imagine that over 100 threads run the same query during every fetch to apply the where filters and retrieve the corresponding chunk.
So my question is, if I disable the predicate pushdown option, how does spark behave? is it only read by one executor and then the filters are applied on spark side? Or does it do some magic to split the fetching part from the DB?
What could you suggest for extracting such a big table using the available JDBC reader?
Thanks in advance.
Before executing your primary query from Spark, run pre-ingestion query to fetch the size of the Dataset being loaded, i.e. as you have mentioned Min(), Max() etc.
Expecting that the data is uniformly distributed between Min and Max keys, you can partition across executors in Spark by providing Min/Max/Number of Executors.
You don't need(want) to change your primary datasource by adding additional columns to support data ingestion in this case.
We have a view that unions multiple hive tables. If i use spark SQL in pyspark and read that view will there be any performance issue as against reading directly from the table.
In hive we had something called full table scan if we don't limit the where clause to an exact table partition. Is spark intelligent enough to directly read the table that has the data that we are looking for rather than searching through the entire view ?
Please advise.
You are talking about partition pruning.
Yes spark supports it spark automatically omits large data read when partition filters are specified.
Partition pruning is possible when data within a table is split across multiple logical partitions. Each partition corresponds to a particular value of a partition column and is stored as a subdirectory within the table root directory on HDFS. Where applicable, only the required partitions (subdirectories) of a table are queried, thereby avoiding unnecessary I/O
After partitioning the data, subsequent queries can omit large amounts of I/O when the partition column is referenced in predicates. For example, the following query automatically locates and loads the file under peoplePartitioned/age=20/and omits all others:
val peoplePartitioned = spark.read.format("orc").load("peoplePartitioned")
peoplePartitioned.createOrReplaceTempView("peoplePartitioned")
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20")
more detailed info is provided here
You can also see this in the logical plan if you run an explain(True) on your query:
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20").explain(True)
it will show which partitions are read by spark
I have a need of joining tables using Spark SQL or Dataframe API. Need to know what would be optimized way of achieving it.
Scenario is:
All data is present in Hive in ORC format (Base Dataframe and Reference files).
I need to join one Base file (Dataframe) read from Hive with 11-13 other reference file to create a big in-memory structure (400 columns) (around 1 TB in size)
What can be best approach to achieve this? Please share your experience if some one has encounter similar problem.
My default advice on how to optimize joins is:
Use a broadcast join if you can (see this notebook). From your question it seems your tables are large and a broadcast join is not an option.
Consider using a very large cluster (it's cheaper that you may think). $250 right now (6/2016) buys about 24 hours of 800 cores with 6Tb RAM and many SSDs on the EC2 spot instance market. When thinking about total cost of a big data solution, I find that humans tend to substantially undervalue their time.
Use the same partitioner. See this question for information on co-grouped joins.
If the data is huge and/or your clusters cannot grow such that even (3) above leads to OOM, use a two-pass approach. First, re-partition the data and persist using partitioned tables (dataframe.write.partitionBy()). Then, join sub-partitions serially in a loop, "appending" to the same final result table.
Side note: I say "appending" above because in production I never use SaveMode.Append. It is not idempotent and that's a dangerous thing. I use SaveMode.Overwrite deep into the subtree of a partitioned table tree structure. Prior to 2.0.0 and 1.6.2 you'll have to delete _SUCCESS or metadata files or dynamic partition discovery will choke.
Hope this helps.
Spark uses SortMerge joins to join large table. It consists of hashing each row on both table and shuffle the rows with the same hash into the same partition. There the keys are sorted on both side and the sortMerge algorithm is applied. That's the best approach as far as I know.
To drastically speed up your sortMerges, write your large datasets as a Hive table with pre-bucketing and pre-sorting option (same number of partitions) instead of flat parquet dataset.
tableA
.repartition(2200, $"A", $"B")
.write
.bucketBy(2200, "A", "B")
.sortBy("A", "B")
.mode("overwrite")
.format("parquet")
.saveAsTable("my_db.table_a")
tableb
.repartition(2200, $"A", $"B")
.write
.bucketBy(2200, "A", "B")
.sortBy("A", "B")
.mode("overwrite")
.format("parquet")
.saveAsTable("my_db.table_b")
The overhead cost of writing pre-bucketed/pre-sorted table is modest compared to the benefits.
The underlying dataset will still be parquet by default, but the Hive metastore (can be Glue metastore on AWS) will contain precious information about how the table is structured. Because all possible "joinable" rows are colocated, Spark won't shuffle the tables that are pre-bucketd (big savings!) and won't sort the rows within the partition of table that are pre-sorted.
val joined = tableA.join(tableB, Seq("A", "B"))
Look at the execution plan with and without pre-bucketing.
This will not only save you a lot of time during your joins, it will make it possible to run very large joins on relatively small cluster without OOM. At Amazon, we use that in prod most of the time (there are still a few cases where it is not required).
To know more about pre-bucketing/pre-sorting:
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
https://data-flair.training/blogs/bucketing-in-hive/
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
https://databricks.com/session/hive-bucketing-in-apache-spark
Partition the source use hash partitions or range partitions or you can write custom partitions if you know better about the joining fields. Partition will help to avoid repartition during joins as spark data from same partition across tables will exist in same location.
ORC will definitely help the cause.
IF this is still causing spill, try using tachyon which will be faster than disk
After days thinking about it I'm still stuck with this problem: I have one table where "timestamp" is the partition key. This table contains billions of rows.
I also have "timeseries" tables that contain timestamps related to specific measurement processes.
With Spark I want to analyze the content of the big table. Of course it is not efficient to do a full table scan, and with a rather fast lookup in the timeseries table I should be able to target only, say, 10k partitions.
What is the most efficient way to achieve this?
Is SparkSQL smart enough to optimize something like this
sqlContext.sql("""
SELECT timeseries.timestamp, bigtable.value1 FROM timeseries
JOIN bigtable ON bigtable.timestamp = timeseries.timestamp
WHERE timeseries.parameter = 'xyz'
""")
Ideally I would expect Cassandra to fetch the timestamps from the timeseries table and then use that to query only that subset of partitions from bigtable.
If you add an "Explain" call to your query you'll see what the Catalyst planner will do for your query but I know it will not do the optimizations you want.
Currently Catalyst has no support for pushing down joins to DataSources which means the structure of your query is most likely got to look like.
Read Data From Table timeseries with predicate parameter = 'xyz'
Read Data From Table bigtable
Join these two results
Filter on bigtable.timestamp == timeseries.timestamp
The Spark Cassandra Connector will be given the predicate from the timeseries table read and will be able to optimize it if is a clustering key or a partition key. See the Spark Cassandra Connector Docs. If it doesn't fit into one of those pushdown categories it will require a Full Table Scan followed by a filter in Spark.
Since the Read Data From Table bigtable has no restrictions on it, Spark will instruct the Connector to read the entire table (Full Table Scan).
I can only take a guess on the optimizations done by the driver, but I'd surely expect a query such as that to restrict the JOIN on the WHERE, which means that your simple query will be optimized.
What I will do as well is point you in the general direction of optimizing Spark SQL. Have a look at Catalyst for Spark SQL, which is a tool for greatly optimizing queries all the way down to the physical level.
Here is a breakdown of how it works:
Deep Dive into Spark SQL Catalyst Optimizer
And the link to the git-repo: Catalyst repo