Inconsistent Persistence of DataFrames in Spark 1.5 - apache-spark

We recently switched to Spark 1.5.0 from 1.4.1 and have noticed some inconsistent behavior in persisting DataFrames.
df1 = sqlContext.read.parquet("df1.parquet")
df1.count()
161,100,982
df2 = sqlContext.read.parquet("df2.parquet")
df2.count()
67,498,706
join_df = df1.join(df2, "id")
join_df.count()
160,608,147
join_df.write.parquet("join.parquet")
join_parquet = sqlContext.read.parquet("join.parquet")
join_parquet.count()
67,698,892
join_df.write.json("join.json")
join_json = sqlContext.read.parquet("join.json")
join_json.count()
67,695,663
The first major issue is that there is an order of magnitude difference between the count of the join DataFrame and the persisted join DataFrame. Secondly, persisting the same DataFrame into 2 different formats yields different results.
Does anyone have any idea on what could be going on here?

I had similar problem in Spark 1.6.1:
When I created 2 different DataFrames from single RDD and persisted to 2 Parquets, I found out, that Rows in parquets are not the same.
Later I found out that somewhere among definition of pipeline of operations for this RDD, I had rdd.reduceByKey() operation, that returned undeterministic result for multiple calls.
Apparently that for each DataFrame one different reduceByKey() was called and resulted in small differences in some Rows.
Might be something similar in your case.

Related

Unexpected behavior during Join (only works if rename column 'year' as 'year' ) otherwise fails with "package.TreeNodeException: execute tree"

I have a spark data frame which after multiple transformations needs to be joined with the one of its parent data frames. This join fails unless i rename a column 'year' as 'year'. I have faced such behavior before as well when after 6-7 transformations the data frame required to be joined with the output of the 3rd transformation.
I couldn't understand why this is happening so i tried random things like persisting, tried using spark sql API instead of pyspark, but still got the same issue. in case of spark sql as well the join worked after renaming the column with the same name
I cannot share the code because of some restrictions but the general code flow is like
DF = spark.read(.......)
subset DF
df1 = transformation1 on DF
df2 = transformation2 on df1
Subset df2
df3 = transformation3 on df2
#this fails
final = df2.alias('a').join( df3.alias('b'),[conditon],'left').select('a.*')
#this succeeds
final = df2.withColumnRenamed('Year','Year').alias('a).join( df3.alias('b'),[conditon],'left').select('a.*')
I cannot provide the stack trace but something like this pops up
package.TreeNodeException: execute tree:
Exhange hashpartitioning(.....)
remaining logical plan
I have just recently started with spark and don't really understand what is happening here, so any help would be appreciated
Also this is my first time posting, So any pointers on how to better format the problem is welcome.
Bugs. I simply rename. It is painful.
See How to resolve the AnalysisException: resolved attribute(s) in Spark. Other scenarios as well.
Also How to rename duplicated columns after join?. Many things on SO in this regard.
Still with latest release Spark 2.4 as well.

Efficient pyspark join

I've read a lot about how to do efficient joins in pyspark. The ways to achieve efficient joins I've found are basically:
Use a broadcast join if you can. (I usually can't because the dataframes are too large)
Consider using a very large cluster. (I'd rather not because of $$$).
Use the same partitioner.
The last one is the one i'd rather try, but I can't find a way to do it in pyspark. I've tried:
df.repartition(numberOfPartitions,['parition_col1','partition_col2'])
but it doesn't help, it still takes way too long until I stop it, because spark get's stucked in the last few jobs.
So, how can I use the same partitioner in pyspark and speed up my joins, or even get rid of the shuffles that takes forever ? Which code do I need to use ?
PD: I've checked other articles, even on stackoverflow, but I still can't see code.
you can also use a two-pass approach, in case it suits your requirement.First, re-partition the data and persist using partitioned tables (dataframe.write.partitionBy()). Then, join sub-partitions serially in a loop, "appending" to the same final result table.
It was nicely explained by Sim. see link below
two pass approach to join big dataframes in pyspark
based on case explained above I was able to join sub-partitions serially in a loop and then persisting joined data to hive table.
Here is the code.
from pyspark.sql.functions import *
emp_df_1.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
emp_df_2.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
So, if you are joining on an integer emp_id, you can partition by the ID modulo some number and this way you can re distribute the load across the spark partitions and records having similar keys will be grouped together and reside on same partition.
you can then read and loop through each sub partition data and join both the dataframes and persist them together.
counter =0;
paritioncount = 4;
while counter<=paritioncount:
query1 ="SELECT * FROM UDB.temptable_1 where par_id={}".format(counter)
query2 ="SELECT * FROM UDB.temptable_2 where par_id={}".format(counter)
EMP_DF1 =spark.sql(query1)
EMP_DF2 =spark.sql(query2)
df1 = EMP_DF1.alias('df1')
df2 = EMP_DF2.alias('df2')
innerjoin_EMP = df1.join(df2, df1.emp_id == df2.emp_id,'inner').select('df1.*')
innerjoin_EMP.show()
innerjoin_EMP.write.format('orc').insertInto("UDB.temptable")
counter = counter +1
I have tried this and this is working fine. This is just an example to demo the two-pass approach. your join conditions may vary and the number of partitions also depending on your data size.
Thank you #vikrantrana for your answer, I will try it if I ever need it. I say these because I found out the problem wasn't with the 'big' joins, the problem was the amount of calculations prior to the join. Imagine this scenario:
I read a table and I store in a dataframe, called df1. I read another table, and I store it in df2. Then, I perfome a huge amount of calculations and joins to both, and I end up with a join between df1 and df2. The problem here wasn't the size, the problem was spark's execution plan was huge and it couldn't maintain all the intermediate tables in memory, so it started to write to disk and it took so much time.
The solution that worked to me was to persist df1 and df2 in disk before the join (I also persisted other intermediate dataframes that were the result of big and complex calculations).

Spark: Apply multiple transformations without recalculating or caching

Is it possible to take the output of a transformation (RDD/Dataframe) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?
Long version
Consider the case.
I have a very large dataset that doesn't fit in memory. Now I do some transformations on it which prepare the data to be worked on efficiently (grouping, filtering, sorting....):
DATASET --(TF1: transformation with group by, etc)--> DF1
DF1 --(TF2: more_transformations_some_columns)--> output
DF1 --(TF3: more_transformations_other_columns)--> output2
I was wondering if there is any way (or planned in dev) to tell Spark that, after TF1, it must reuse the same results (at partition level, without caching everything!) to serve both TF2 and TF3.
This can be conceptually imagined as a cache() at each partition, with automatic unpersist() when the partition was consumed by the further transformations.
I searched for a long time but couldn't find any way of doing it.
My attempt:
DF1 = spark.read()... .groupBy().agg()...
DF2 = DF1.select("col1").cache() # col1 fits in mem
DF3 = DF1.select("col1", transformation(other_cols)).write()... # Force evaluation of col1
Unfortunately, DF3 cannot guess it could to the caching of col1. So apparently it isn't possible to ask spark to only cache a few columns. That would already alleviate the problem.
Any ideas?
I don't think it is possible to cache just some of the columns,
but will this solve your problem?
DF1 = spark.read()... .groupBy().agg()...
DF3 = DF1.select("col1", transformation(other_cols)).cache()
DF3.write()
DF2 = DF3.select("col1")

reducebykey and aggregatebykey in spark Dataframe

I am using spark 2.0 to read the data from parquet file .
val Df = sqlContext.read.parquet("c:/data/parquet1")
val dfSelect= Df.
select(
"id",
"Currency",
"balance"
)
val dfSumForeachId=dfSelect.groupBy("id").sum("balance")
val total=dfSumForeachId.agg(sum("sum(balance)")).first().getDouble(0)
In order to get a total balance value is this the best way of getting it using an action first() on a dataframe ?
In spark 2.0 is it fine to use groupby key ,does it have the same performance issue like groupbykey on rdd like does it need to shuffle the whole data over the network and then perform aggregation or the aggregation is performed locally like reducebykey in earlier version of the spark
Thanks
Getting the data by using first is a perfectly valid way of getting the data. That said, doing:
val total = dfSelect.agg(sum("balance")).first().getDouble(0)
would probably give you better performance for getting the total.
group by key and reduce by key work exactly the same as previous versions for the same reasons. group by key makes no assumption on the action you want to do and therefore cannot know how to do partial aggregations as reduce by key does.
When you do dataframe groupby and sum you are actually doing reduce by key with the + option and the second aggregation you did is a reduce with the +. That said dataframe does it more efficiently because, knowing exactly what is done it can perform many optimizations such as whole stage code generation.

Can I put back a partitioner to a PairRDD after transformations?

It seems that the "partitioner" of a pairRDD is reset to None after most transformations (e.g. values() , or toDF() ). However my understanding is that the partitioning may not always be changed for these transformations.
Since cogroup and maybe other examples perform more efficiently when the partitioning is known to be co-partitioned, I'm wondering if there's a way to tell spark that the rdd's are still co-partitioned.
See the simple example below where I create two co-partitioned rdd's, then cast them to DFs and perform cogroup on the resulting rdds. A similar example could be done with values, and then adding the right pairs back on.
Although this example is simple, my real case is maybe I load two parquet dataframes with the same partitioning.
Is this possible and would it result in a performance benefit in this case?
data1 = [Row(a=1,b=2),Row(a=2,b=3)]
data2 = [Row(a=1,c=4),Row(a=2,c=5)]
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
rdd1 = rdd1.map(lambda x: (x.a,x)).partitionBy(2)
rdd2 = rdd2.map(lambda x: (x.a,x)).partitionBy(2)
print(rdd1.cogroup(rdd2).getNumPartitions()) #2 partitions
rdd3 = rdd1.toDF(["a","b"]).rdd
rdd4 = rdd2.toDF(["a","c"]).rdd
print(rdd3.cogroup(rdd4).getNumPartitions()) #4 partitions (2 empty)
In the scala api most transformations include the
preservesPartitioning=true
option. Some of the python RDD api's retain that capability: but for example the
groupBy
is a significant exception. As far as Dataframe API's the partitioning scheme seems to be mostly outside of end user control - even on the scala end.
It is likely then that you would have to:
restrict yourself to using rdds - i.e. refrain from the DataFrame/Dataset approach
be choosy on which RDD transformations you choose: take a look at the ones that do allow either
retaining the parent's partitioning schem
using preservesPartitioning=true

Resources