Spark: does DataFrameWriter have to be a blocking step? - apache-spark

I have data partitioned by a column (say, id) and I have this dataset saved some place. Every now and then, I get a smaller incremental dataset with the same structure and I essentially have to upsert my existing data based on my id with a date column deciding which record is the newest. (I don't write it in the same place, I save the whole new blob some place else.)
There are two ways I've been doing this: either grouping in a window and taking the row with the highest date, or via dropDuplicates, relying on the fact that my data is ordered. (I'd rather use the former, but I've been trying various things.)
The one big issue is that each id group is not negligible (a few gigabytes), so I was hoping Spark (with n workers) would understand that since I'm reading id-partitioned data and writing id-partitioned data, it would process n ids at once and continually write them to my storage, taking new ids as it's finished with the previous ones.
Unfortunately, what seems to be happening is that Spark processes all my id groups in one big job (and spills to disk, naturally) before writing anything out. It gets really, really slow.
The question is thus: Is there a way to force Spark to process these groups and write them as soon as they're ready? Again, they are partitioned, so no other task will affect my partition.
Here's a bit of code that reproduces the problem:
# generate dummy data first
import random
from typing import List
from datetime import datetime, timedelta
from pyspark.sql.functions import desc, col, row_number
from pyspark.sql.window import Window
from pyspark.sql.dataframe import DataFrame
def gen_data(n: int) -> List[tuple]:
    names = 'foo, bar, baz, bak'.split(', ')
    return [(random.randint(1, 25), random.choice(names), datetime.today() - timedelta(days=random.randint(1, 100)))
            for j in range(n)]
def get_df(n: int) -> DataFrame:
    return spark.createDataFrame(gen_data(n), ['id', 'name', 'date'])
n = 10_000
df = get_df(n)
dd = get_df(n*10)
df.write.mode('overwrite').partitionBy('id').parquet('outputs/first')
dd.write.mode('overwrite').partitionBy('id').parquet('outputs/second')
d1 and d2 are both partitioned by id and so is the resulting dataset, but it's not reflected in the plan:
w = Window().partitionBy('id').orderBy(desc('date'))
d1 = spark.read.parquet('outputs/first')
d2 = spark.read.parquet('outputs/second')
d1.union(d2).\
withColumn('rn', row_number().over(w)).filter(col('rn') == 1).drop('rn').\
write.mode('overwrite').partitionBy('id').parquet('outputs/window')
I also tried to explicitly state the partition key (otherwise the code is the same):
d1 = spark.read.parquet('outputs/first').repartition('id')
d2 = spark.read.parquet('outputs/second').repartition('id')
d1.union(d2).\
withColumn('rn', row_number().over(w)).filter(col('rn') == 1).drop('rn').\
write.mode('overwrite').partitionBy('id').parquet('outputs/window2')
Here's the same using dropDuplicates:
d1 = spark.read.parquet('outputs/first')
d2 = spark.read.parquet('outputs/second')
d1.union(d2).\
dropDuplicates(subset=['id']).\
write.mode('overwrite').partitionBy('id').parquet('outputs/window3')
I also tried emphasising that my union is still partitioned using something like this, but again to no avail:
df.union(d2).repartition('id').\
    withColumn...
I could list all partitions (ids), load them one by one while leveraging partition pruning, deduplicating and writing. But that seems like extra boilerplate that shouldn't be necessary. Or is it possible to do this via foreach?
Update (2018-03-27):
Turns out, the information about partitioning is indeed present in the window functionality in one way or another, because when I filter at the very end, partition pruning on the inputs does take place:
d1 = spark.read.parquet('outputs/first')
d2 = spark.read.parquet('outputs/second')
w = Window().partitionBy('id', 'name').orderBy(desc('date'))
d1.union(d2).withColumn('rn', row_number().over(w)).filter(col('rn') == 1).filter(col('id') == 12).explain(True)
Results in
== Physical Plan ==
*(4) Filter (isnotnull(rn#387) && (rn#387 = 1))
+- Window [row_number() windowspecdefinition(id#187, name#185, date#186 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#387], [id#187, name#185], [date#186 DESC NULLS LAST]
+- *(3) Sort [id#187 ASC NULLS FIRST, name#185 ASC NULLS FIRST, date#186 DESC NULLS LAST], false, 0
+- Exchange hashpartitioning(id#187, name#185, 200)
+- Union
:- *(1) FileScan parquet [name#185,date#186,id#187] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/.../spark_perf_partitions/outputs..., PartitionCount: 1, PartitionFilters: [isnotnull(id#187), (id#187 = 12)], PushedFilters: [], ReadSchema: struct<name:string,date:timestamp>
+- *(2) FileScan parquet [name#191,date#192,id#193] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/.../spark_perf_partitions/outputs..., PartitionCount: 1, PartitionFilters: [isnotnull(id#193), (id#193 = 12)], PushedFilters: [], ReadSchema: struct<name:string,date:timestamp>
So it indeed only reads two partitions, one per input. So I could, instead of looping, just run the code with one filter at a time (the filter sitting between the window function and .write). Tedious and not very practical, but potentially faster than spilling everything to disk.
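For illustration, a single filtered run would look roughly like this (the id value 12 and the output path are just examples):
d1 = spark.read.parquet('outputs/first')
d2 = spark.read.parquet('outputs/second')
w = Window().partitionBy('id').orderBy(desc('date'))
# the filter on the partition column sits between the window function and the write,
# so partition pruning kicks in on both inputs
d1.union(d2).\
    withColumn('rn', row_number().over(w)).filter(col('rn') == 1).drop('rn').\
    filter(col('id') == 12).\
    write.mode('overwrite').partitionBy('id').parquet('outputs/window_id12')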

Yes, this is exactly how Spark partitioning works: it computes the whole lineage and then writes to disk in partitioned form. There are several advantages to that. One important one is the parallel write: once the computation is done, Spark can write all the partitions to disk in parallel, which significantly improves performance.
If you want to write as and when the data is ready, you could filter the dataframe by the different ids, run the computation in a loop, and write each result out. However, in my experience this approach requires several passes over the same dataframe, resulting in a huge performance loss.
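A minimal sketch of such a loop, assuming the set of distinct ids is small enough to collect to the driver (names and paths follow the question's example):
ids = [r.id for r in d1.union(d2).select('id').distinct().collect()]
for i in ids:
    # partition pruning limits each read to a single id directory
    chunk = spark.read.parquet('outputs/first').filter(col('id') == i).union(
        spark.read.parquet('outputs/second').filter(col('id') == i))
    chunk.withColumn('rn', row_number().over(w)).filter(col('rn') == 1).drop('rn').\
        write.mode('overwrite').parquet('outputs/looped/id={}'.format(i))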

Related

Spark Performance issue - Writing partitions to S3 as individual files

I'm running a Spark job that scans a large file and splits it into smaller files. The file is in JSON Lines format and I'm trying to partition it by a certain column (id) and save each partition as a separate file to S3. The file size is about 12 GB, but there are about 500,000 distinct values of id. The query is taking almost 15 hours. What can I do to improve performance? Is Spark a poor choice for such a task? Please note that I do have the liberty of making sure that the source has a fixed number of rows per id.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from awsglue.utils import getResolvedOptions
from awsglue.transforms import *
from pyspark.sql.functions import udf, substring, instr, locate
from datetime import datetime, timedelta
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Get parameters that were passed to the job
args = getResolvedOptions(sys.argv, ['INPUT_FOLDER', 'OUTPUT_FOLDER', 'ID_TYPE', 'DATASET_DATE'])
id_type = args["ID_TYPE"]
output_folder = "{}/{}/{}".format(args["OUTPUT_FOLDER"], id_type, args["DATASET_DATE"])
input_folder = "{}/{}/{}".format(args["INPUT_FOLDER"], id_type, args["DATASET_DATE"])
INS_SCHEMA = StructType([
    StructField("camera_capture_timestamp", StringType(), True),
    StructField(id_type, StringType(), True),
    StructField("image_uri", StringType(), True)
])
data = spark.read.format("json").load(input_folder, schema=INS_SCHEMA)
data = data.withColumn("fnsku_1", F.col("fnsku"))
data.coalesce(1).write.partitionBy(["fnsku_1"]).mode('append').json(output_folder)
I have tried repartition instead of coalesce too.
I'm using AWS Glue
Please consider the following as one possible option. It would be awesome to see if it helped :)
First, if you coalesce, as @Lamanus said in the comments, you reduce the number of partitions and hence the number of writer tasks, which shuffles all the data to 1 task. That is the first thing to improve.
To overcome the issue, i.e. write one file per partition while keeping the parallelization level, you can change the logic to the following:
import org.apache.spark.sql.SparkSession

object TestSoAnswer extends App {

  private val testSparkSession = SparkSession.builder()
    .appName("Demo groupBy and partitionBy").master("local[*]")
    .getOrCreate()
  import testSparkSession.implicits._

  // Input dataset with 5 partitions
  val dataset = testSparkSession.sparkContext.parallelize(Seq(
    TestData("a", 0), TestData("a", 1), TestData("b", 0), TestData("b", 1),
    TestData("c", 1), TestData("c", 2)
  ), 5).toDF("letter", "number")

  dataset.as[TestData].groupByKey(row => row.letter)
    .flatMapGroups {
      case (_, values) => values
    }.write.partitionBy("letter").mode("append").json("/tmp/test-parallel-write")
}

case class TestData(letter: String, number: Int)
How does it work?
First, the code performs a shuffle to collect all rows related to a specific key (same as for the partitioning) into the same partitions. That way, it performs the write for all the rows belonging to a key at once. Some time ago I wrote a blog post about the partitionBy method. Roughly, internally it will sort the records within the given partition and later write them one by one into the file.
That way we get a plan like this one, where only one shuffle, i.e. the processing-heavy operation, is present:
== Physical Plan ==
*(2) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, TestData, true])).letter, true, false) AS letter#22, knownnotnull(assertnotnull(input[0, TestData, true])).number AS number#23]
+- MapGroups TestSoAnswer$$$Lambda$1236/295519299#55c50f52, value#18.toString, newInstance(class TestData), [value#18], [letter#3, number#4], obj#21: TestData
+- *(1) Sort [value#18 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value#18, 200), true, [id=#15]
+- AppendColumnsWithObject TestSoAnswer$$$Lambda$1234/1747367695#6df11e91, [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, TestData, true])).letter, true, false) AS letter#3, knownnotnull(assertnotnull(input[0, TestData, true])).number AS number#4], [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#18]
+- Scan[obj#2]
The output of TestSoAnswer executed twice looks like this:
test-parallel-write % ls
_SUCCESS letter=a letter=b letter=c
test-parallel-write % ls letter=a
part-00170-68245d8b-b155-40ca-9b5c-d9fb746ac76c.c000.json part-00170-cd90d64f-43c6-4582-aae6-fe443b6617f4.c000.json
test-parallel-write % ls letter=b
part-00161-68245d8b-b155-40ca-9b5c-d9fb746ac76c.c000.json part-00161-cd90d64f-43c6-4582-aae6-fe443b6617f4.c000.json
test-parallel-write % ls letter=c
part-00122-68245d8b-b155-40ca-9b5c-d9fb746ac76c.c000.json part-00122-cd90d64f-43c6-4582-aae6-fe443b6617f4.c000.json
You can also control the number of records written per file with this configuration.
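If I remember correctly, one such knob is the maxRecordsPerFile write option (available since Spark 2.2). A PySpark sketch against the question's DataFrame, where the 10000 threshold is just an example value:
data.write \
    .option("maxRecordsPerFile", 10000) \
    .partitionBy("fnsku_1") \
    .mode("append") \
    .json(output_folder)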
Edit: I didn't see the comment of @mazaneicha, but indeed, you can try repartition("partitioning column")! It's even clearer than the grouping expression.
Best,
Bartosz.
If you're not going to use Spark for anything other than to split the file into smaller versions of itself, then I would say Spark is a poor choice. You'd be better off doing this within AWS, following an approach such as the one given in this Stack Overflow post.
Assuming you have an EC2 instance available, you'd run something like this:
aws s3 cp s3://input_folder/12GB.json - | split -l 1000 - output.
aws s3 cp output.* s3://output_folder/
If you're looking to do some further processing of the data in Spark, you'll want to repartition the data into chunks between 128 MB and 1 GB. With the default (snappy) compression, you typically end up with about 20% of the original file size, i.e. roughly 12/5 ≈ 2.4 GB in your case. That works out to somewhere between ~3 partitions (1 GB each) and ~20 partitions (128 MB each), so:
data = spark.read.format("json").load(input_folder, schema=INS_SCHEMA)
dataPart = data.repartition(12)
This is not actually a particularly large data set for Spark and should not be cumbersome to deal with.
Saving as parquet gives you a good recovery point, and re-reading the data will be very fast. The total file size will be about 2.5 GB.
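A minimal sketch of that checkpoint step, assuming a made-up sub-path under the output folder:
dataPart.write.mode("overwrite").parquet(output_folder + "/checkpoint")
dataPart = spark.read.parquet(output_folder + "/checkpoint")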

Cross Join in Apache Spark with dataset is very slow

I posted this question on the Spark user forum but received no response, so I'm asking it here again.
We have a use case where we need to do a Cartesian join, and for some reason we are not able to get it to work with the Dataset API.
We have two datasets:
one dataset with 2 string columns, say c1 and c2. It is a small dataset with ~1 million records. The two columns are both strings of 32 characters, so it should be less than 500 MB.
We broadcast this dataset.
the other dataset is a little bigger, with ~10 million records
val ds1 = spark.read.format("csv").option("header", "true").load(<s3-location>).select("c1", "c2")
ds1.count
val ds2 = spark.read.format("csv").load(<s3-location>).toDF("c11", "c12", "c13", "c14", "c15", "ts")
ds2.count
ds2.crossJoin(broadcast(ds1)).filter($"c1" <= $"c11" && $"c11" <= $"c2").count
If I implement it using the RDD API, where I broadcast the data in ds1 and then filter the data in ds2, it works fine.
I have confirmed the broadcast is successful.
2019-02-14 23:11:55 INFO CodeGenerator:54 - Code generated in 10.469136 ms
2019-02-14 23:11:55 INFO TorrentBroadcast:54 - Started reading broadcast variable 29
2019-02-14 23:11:55 INFO TorrentBroadcast:54 - Reading broadcast variable 29 took 6 ms
2019-02-14 23:11:56 INFO CodeGenerator:54 - Code generated in 11.280087 ms
Query Plan:
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Cross, ((c1#68 <= c11#13) && (c11#13 <= c2#69))
:- *Project []
: +- *Filter isnotnull(_c0#0)
: +- *FileScan csv [_c0#0,_c1#1,_c2#2,_c3#3,_c4#4,_c5#5] Batched: false, Format: CSV, Location: InMemoryFileIndex[], PartitionFilters: [], PushedFilters: [IsNotNull(_c0)], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string,_c4:string,_c5:string>
+- BroadcastExchange IdentityBroadcastMode
+- *Project [c1#68, c2#69]
+- *Filter (isnotnull(c1#68) && isnotnull(c2#69))
+- *FileScan csv [c1#68,c2#69] Batched: false, Format: CSV, Location: InMemoryFileIndex[], PartitionFilters: [], PushedFilters: [IsNotNull(c1), IsNotNull(c2)], ReadSchema: struct
then the stage does not progress.
I updated the code to broadcast ds1 and then do the join in mapPartitions over ds2.
val ranges = spark.read.format("csv").option("header", "true").load(<s3-location>).select("c1", "c2").collect
val rangesBC = sc.broadcast(ranges)
then used this rangesBC in the mapPartitions method to identify the range each row in ds2 belongs to; this job completes in 3 hrs, while the other job does not complete even after 24 hrs. This kind of implies that the query optimizer is not doing what I want it to do.
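For reference, the broadcast-plus-mapPartitions approach looks roughly like this (sketched here in PySpark; the ranges and column names follow the example above):
ranges = [(r["c1"], r["c2"]) for r in ds1.select("c1", "c2").collect()]
ranges_bc = spark.sparkContext.broadcast(ranges)

def tag_partition(rows):
    # emit each row of ds2 together with every broadcast range that contains its c11 value
    for row in rows:
        for c1, c2 in ranges_bc.value:
            if c1 <= row["c11"] <= c2:
                yield (row, (c1, c2))

matched = ds2.rdd.mapPartitions(tag_partition)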
What am I doing wrong? Any pointers will be helpful. Thank you!
I have run into this issue recently and found that Spark has a strange partitioning behavior when cross joining large dataframes. If your input dataframes contain a few million records, then the cross-joined dataframe has a number of partitions equal to the product of the input dataframes' partitions, that is:
Partitions of crossJoinDF = (Partitions of ds1) * (Partitions of ds2).
If ds1 or ds2 contains a few hundred partitions, then the cross-joined dataframe would have partitions in the range of ~10,000. These are way too many partitions, which results in excessive overhead in managing many small tasks, making any computation (in your case, the filter) on the cross-joined dataframe very slow to run.
So how do you make the computation faster? First check if this is indeed the issue for your problem:
scala> val crossJoinDF = ds2.crossJoin(ds1)
// This should return immediately because of Spark's lazy evaluation
scala> val crossJoinDFPartitions = crossJoinDF.rdd.partitions.size
Check the number of partitions on the cross-joined dataframe. If crossJoinDFPartitions > 10,000, then you do indeed have the same issue, i.e. the cross-joined dataframe has way too many partitions.
To make your operations on cross joined dataframe faster, reduce the number of partitions on the input DataFrames. For example:
scala> val ds1 = ds1.repartition(40)
scala> ds1.rdd.partitions.size
res80: Int = 40
scala> val ds2 = ds2.repartition(40)
scala> ds2.rdd.partitions.size
res81: Int = 40
scala> val crossJoinDF = ds1.crossJoin(ds2)
scala> crossJoinDF.rdd.partitions.size
res82: Int = 1600
scala> crossJoinDF.count()
The count() action should result in the execution of the cross join. The count should now return in a reasonable amount of time. The exact number of partitions you choose will depend on the number of cores available in your cluster.
The key takeaway here is to make sure that your cross-joined dataframe has a reasonable number of partitions (<< 10,000). You might also find this post useful, which explains this issue in more detail.
I do not know if you are on bare metal, on AWS with spot, on-demand or dedicated instances, or on VMs with Azure, et al. My take:
Appreciate that 10M x 1M is a lot of work, even if .filter applies to the resultant cross join. It will take some time. What were your expectations?
Spark is all about scaling in a linear way, in general.
Data centers with VMs do not have dedicated hardware and hence do not have the fastest performance.
Then:
I ran 10M x 100K in a simulated set-up on Databricks Community Edition, with .86 core and 6 GB on the Driver. That ran in 17 mins.
I ran the 10M x 1M from your example on a 4-node AWS EMR non-dedicated cluster (with some EMR oddities like reserving the Driver on a valuable instance!); it took 3 hours for partial completion.
So, to answer your question:
- You did nothing wrong.
- You just need more resources to allow more parallelisation.
- I did add some explicit partitioning, as you can see.

Why is my parquet partitioned data slower than non-partitioned one?

My understanding is: if I partition my data on a column I will query by, it should be faster. However, when I tried it, it seems to be slower instead. Why?
I have a users dataframe which I tried both partitioning by yearmonth and leaving unpartitioned.
So I have 1 dataset partitioned by creation_yearmonth.
questionsCleanedDf.repartition("creation_yearmonth") \
.write.partitionBy('creation_yearmonth') \
.parquet('wasb://.../parquet/questions.parquet')
I have another that is not partitioned:
questionsCleanedDf \
.write \
.parquet('wasb://.../parquet/questions_nopartition.parquet')
Then I tried creating a dataframe from these 2 parquet files and running the same query
questionsDf = spark.read.parquet('wasb://.../parquet/questions.parquet')
and
questionsDf = spark.read.parquet('wasb://.../parquet/questions_nopartition.parquet')
The query
spark.sql("""
SELECT * FROM questions
WHERE creation_yearmonth = 201606
""")
It seems like the non-partitioned one is consistently faster or has similar times (~2-3 s), while the partitioned one is slightly slower (~3-4 s).
I tried to do an explain:
For the partitioned dataset:
== Physical Plan ==
*FileScan parquet [id#6404,title#6405,tags#6406,owner_user_id#6407,accepted_answer_id#6408,view_count#6409,answer_count#6410,comment_count#6411,creation_date#6412,favorite_count#6413,creation_yearmonth#6414] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 1, PartitionFilters: [isnotnull(creation_yearmonth#6414), (creation_yearmonth#6414 = 201606)], PushedFilters: [], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...
PartitionCount: 1. I thought that since in this case it can just go directly to the partition, it should be faster?
For the non-partitioned one:
== Physical Plan ==
*Project [id#6440, title#6441, tags#6442, owner_user_id#6443, accepted_answer_id#6444, view_count#6445, answer_count#6446, comment_count#6447, creation_date#6448, favorite_count#6449, creation_yearmonth#6450]
+- *Filter (isnotnull(creation_yearmonth#6450) && (creation_yearmonth#6450 = 201606))
+- *FileScan parquet [id#6440,title#6441,tags#6442,owner_user_id#6443,accepted_answer_id#6444,view_count#6445,answer_count#6446,comment_count#6447,creation_date#6448,favorite_count#6449,creation_yearmonth#6450] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions_nopartition.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_yearmonth), EqualTo(creation_yearmonth,201606)], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...
Also very surprising: the original dataset has dates as strings, so I need to do a query like:
spark.sql("""
SELECT * FROM questions
WHERE CAST(creation_date AS date) BETWEEN '2017-06-01' AND '2017-07-01'
""").show(20, False)
I expected this to be even slower, but it turns out it performs the best (~1-2 s). Why is that? I thought that in this case it needs to cast each row?
The explain output here:
== Physical Plan ==
*Project [id#6521, title#6522, tags#6523, owner_user_id#6524, accepted_answer_id#6525, view_count#6526, answer_count#6527, comment_count#6528, creation_date#6529, favorite_count#6530]
+- *Filter ((isnotnull(creation_date#6529) && (cast(cast(creation_date#6529 as date) as string) >= 2017-06-01)) && (cast(cast(creation_date#6529 as date) as string) <= 2017-07-01))
+- *FileScan parquet [id#6521,title#6522,tags#6523,owner_user_id#6524,accepted_answer_id#6525,view_count#6526,answer_count#6527,comment_count#6528,creation_date#6529,favorite_count#6530] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/filtered/questions.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_date)], ReadSchema: struct<id:string,title:string,tags:array<string>,owner_user_id:string,accepted_answer_id:string,v...
Overpartitioning can actually reduce performance:
If a column has only a few rows matching each value, the number of directories to process can become a limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting data in multi-megabyte blocks.
This excerpt was taken from the documentation of a different Hadoop component, Impala, but the argument it presents should be valid for all components of the Hadoop stack.
I think that regardless of the partitioning scheme used, the advantages of partitioning will not be apparent until the table grows way beyond 900 MB.

Will a persisted dataframe be recalculated many times?

I've got the following structured query:
val A = 'load somedata from HDFS'.persist(StorageLevel.MEMORY_AND_DISK_SER)
val B = A.filter('condition 1')
val C = A.filter('condition 2')
val D = A.filter('condition 3')
val E = A.filter('condition 4')
val F = A.filter('condition 5')
val G = A.filter('condition 6')
val H = A.filter('condition 7')
val I = B.union(C).union(D).union(E).union(F).union(G).union(H)
I persist dataframe A, so that when I use B/C/D/E/F/G/H, A should be calculated only once. But the DAG of this job suggests otherwise: it seems that stages 6-12 are all executed and dataframe A is calculated 7 times?
Why would this happen?
Maybe the DAG is just misleading? I found that there are no lines at the top of stages 7-12, whereas stage 6 does have two lines coming from other stages.
I didn't list all the operations. After the union operation, I save the I dataframe to HDFS. Will this action on the I dataframe actually trigger the persist? Or must I run an action such as count on the A dataframe to trigger the persist before reusing A?
Doing the following line won't persist your dataset.
val A = 'load somedata from HDFS'.persist(StorageLevel.MEMORY_AND_DISK_SER)
Caching/persistence is lazy when used with the Dataset API, so you have to trigger the caching using the count operator or similar, which in turn submits a Spark job.
After that, all the following operators, including filter, should use InMemoryTableScan with the green dot in the plan.
In your case, even after the union, dataset I is not cached since you have not triggered the caching (you have merely marked it for caching).
After union operation, I save the I dataframe to HDFS. Will this action on the I dataframe make the persist operation be done really?
Yes. Only actions (like saving to an external storage) can trigger the persistence for future reuse.
Or must I do an action operation such as count on the A dataframe to trigger the persist operation before reuse A dataframe?
That's the point! In your case, since you want to reuse the A dataframe across the filter operators, you should persist it first, run count (to trigger the caching), and only then apply the filters.
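A quick sketch of that ordering (in PySpark; the load path and the filter conditions are placeholders):
from pyspark import StorageLevel
from pyspark.sql.functions import col

A = spark.read.parquet("hdfs:///some/data")   # stands in for 'load somedata from HDFS'
A = A.persist(StorageLevel.MEMORY_AND_DISK)
A.count()                                     # action that actually materializes the cache

B = A.filter(col("x") > 1)                    # hypothetical conditions; these scans now hit the cache
C = A.filter(col("x") > 2)
I = B.union(C)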
In your case, no filter will benefit from any performance increase due to persist; as written, that persist has practically no impact on performance and just makes a code reviewer think otherwise.
If you want to see when and if your dataset is cached, you can check out Storage tab in web UI or ask CacheManager about it.
val nums = spark.range(5).cache
nums.count
scala> spark.sharedState.cacheManager.lookupCachedData(nums)
res0: Option[org.apache.spark.sql.execution.CachedData] =
Some(CachedData(Range (0, 5, step=1, splits=Some(8))
,InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Range (0, 5, step=1, splits=8)
))

How to load only the data of the last partition

I have some data partitioned this way:
/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1
When using this data, I'd like to load the last version only of each month.
A simple way to do this is to do load("/data/year=2016/month=11/version=3") instead of doing load("/data").
The drawback of this solution is the loss of partitioning information such as year and month, which means it would not be possible to apply operations based on the year or the month anymore.
Is it possible to ask Spark to load the last version only of each month? How would you go about this?
Well, Spark supports predicate push-down, so if you provide a filter following the load, it will only read in the data fulfilling the criteria in the filter. Like this:
spark.read.option("basePath", "/data").load("/data").filter('version === 3)
And you get to keep the partitioning information :)
Just an addition to the previous answers, for reference.
I have the below ORC-format table in Hive, which is partitioned on the year, month & day columns.
hive (default)> show partitions test_dev_db.partition_date_table;
OK
year=2019/month=08/day=07
year=2019/month=08/day=08
year=2019/month=08/day=09
If I set the below properties, I can read the latest partition data in Spark SQL as shown below:
spark.sql("SET spark.sql.orc.enabled=true");
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SET spark.sql.orc.filterPushdown=true")
spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day='07' """).explain(True)
We can see PartitionCount: 1 in the plan, and it's obvious that it has filtered down to the latest partition.
== Physical Plan ==
*(1) FileScan orc test_dev_db.partition_date_table[emp_id#212,emp_name#213,emp_salary#214,emp_date#215,year#216,month#217,day#218] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxxx/dev/hadoop/database/test_dev..., **PartitionCount: 1**, PartitionFilters: [isnotnull(year#216), isnotnull(month#217), isnotnull(day#218), (year#216 = 2019), (month#217 = 0..., PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
whereas the same will not work if I use the below query:
Even if we create a dataframe using spark.read.format("orc").load(hdfs absolute path of table), create a temporary view, and run Spark SQL on it, it will still scan all the partitions available for that table unless we put a specific filter condition on top of it.
spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day in (select max(day) from test_dev_db.partition_date_table)""").explain(True)
It still scanned all three partitions; here PartitionCount: 3.
== Physical Plan ==
*(2) BroadcastHashJoin [day#282], [max(day)#291], LeftSemi, BuildRight
:- *(2) FileScan orc test_dev_db.partition_date_table[emp_id#276,emp_name#277,emp_salary#278,emp_date#279,year#280,month#281,day#282] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 3, PartitionFilters: [isnotnull(year#280), isnotnull(month#281), (year#280 = 2019), (month#281 = 08)], PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
To filter the data based on the max partition using Spark SQL, we can use the approach below. This partition-pruning technique limits the number of files and partitions that Spark reads when querying the Hive ORC table data.
rdd=spark.sql("""show partitions test_dev_db.partition_date_table""").rdd.flatMap(lambda x:x)
newrdd=rdd.map(lambda x : x.replace("/","")).map(lambda x : x.replace("year=","")).map(lambda x : x.replace("month=","-")).map(lambda x : x.replace("day=","-")).map(lambda x : x.split('-'))
max_year=newrdd.map(lambda x : (x[0])).max()
max_month=newrdd.map(lambda x : x[1]).max()
max_day=newrdd.map(lambda x : x[2]).max()
Prepare your query to filter the Hive partition table using these max values:
query = "select * from test_dev_db.partition_date_table where year ='{0}' and month='{1}' and day ='{2}'".format(max_year,max_month,max_day)
>>> spark.sql(query).show();
+------+--------+----------+----------+----+-----+---+
|emp_id|emp_name|emp_salary| emp_date|year|month|day|
+------+--------+----------+----------+----+-----+---+
| 3| Govind| 810000|2019-08-09|2019| 08| 09|
| 4| Vikash| 5500|2019-08-09|2019| 08| 09|
+------+--------+----------+----------+----+-----+---+
spark.sql(query).explain(True)
If you look at the plan of this query, you can see that it has scanned only one partition of the given Hive table; here PartitionCount is 1.
== Optimized Logical Plan ==
Filter (((((isnotnull(day#397) && isnotnull(month#396)) && isnotnull(year#395)) && (year#395 = 2019)) && (month#396 = 08)) && (day#397 = 09))
+- Relation[emp_id#391,emp_name#392,emp_salary#393,emp_date#394,year#395,month#396,day#397] orc
== Physical Plan ==
*(1) FileScan orc test_dev_db.partition_date_table[emp_id#391,emp_name#392,emp_salary#393,emp_date#394,year#395,month#396,day#397] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 1, PartitionFilters: [isnotnull(day#397), isnotnull(month#396), isnotnull(year#395), (year#395 = 2019), (month#396 = 0..., PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
I think you have to use Spark's window functions to find the latest version and filter out the rest.
import org.apache.spark.sql.functions.{col, first}
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("year", "month").orderBy(col("version").desc)

spark.read.load("/data")
  .withColumn("maxVersion", first("version").over(windowSpec))
  .select("*")
  .filter(col("maxVersion") === col("version"))
  .drop("maxVersion")
Let me know if this works for you.
Here's a general Scala function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, col, row_number}

/**
 * Given a DataFrame, use keys (e.g. last modified time) to show the most up to date record
 *
 * @param dF DataFrame to be parsed
 * @param groupByKeys These are the columns you would like to groupBy and expect to be duplicated,
 *                    hence why you're trying to obtain records according to a latest value of keys.
 * @param keys The sequence of keys used to rank the records in the table
 * @return DataFrame with records that have rank 1, this means the most up to date version of those records
 */
def getLastUpdatedRecords(dF: DataFrame, groupByKeys: Seq[String], keys: Seq[String]): DataFrame = {
  val part = Window.partitionBy(groupByKeys.head, groupByKeys.tail: _*).orderBy(array(keys.head, keys.tail: _*).desc)
  val rowDF = dF.withColumn("rn", row_number().over(part))
  val res = rowDF.filter(col("rn") === 1).drop("rn")
  res
}
