Spark: where doesn't work properly

I have 2 datasets and I want to create a joined dataset, so I did:
Dataset<Row> join = ds1.join(ds2, "id");
However, to improve performance I tried to replace the join with .where(cond) (I also tried .filter(cond)), like this:
Dataset<Row> join = ds1.where(col("id").equalTo(ds2.col("id")));
This also works, except when one of the datasets is empty (in that case it returns the non-empty dataset), which is not the expected result.
So my question is: why doesn't .where work properly in that case, and is there another optimized solution for joining 2 datasets without using join()?

Join and where conditions are two different things. Your code for the where condition will fail due to an attribute-resolution issue: a where (or filter) condition is specific to that DataFrame, and if you mention a second DataFrame in the condition it won't iterate over it the way a join does. Please check whether your code returns a correct result at all.
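A minimal sketch of the difference, in Scala (assuming two Datasets ds1 and ds2 that both contain an id column; the names are illustrative): a join relates rows across both datasets, while where/filter can only restrict rows of the dataset it is called on.
import org.apache.spark.sql.functions.col
// join: matches rows of ds1 against rows of ds2 on id;
// an empty ds2 correctly yields an empty inner-join result
val joined = ds1.join(ds2, Seq("id"), "inner")
// where/filter: a per-row predicate on ds1 only; it cannot express
// a relation to another dataset, which is why the empty case misbehaves
val filtered = ds1.where(col("id") === 42)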

Absolutely. One of the key points when you join two RDDs is the partitioner used for each of them. If the first and the second RDD have the same partitioner, the join will perform as well as it can, since no extra shuffle is needed. If the partitioners differ, the first RDD's partitioner is used to repartition the second RDD.
Also try to use a "light" key, e.g. an encoded or hashed form of a String instead of the raw value, and use the same partitioner for both RDDs.
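A minimal sketch of co-partitioning two pair RDDs before a join (assuming rdd1 and rdd2 are pair RDDs keyed by a String; the partition count is illustrative):
import org.apache.spark.HashPartitioner
// Give both pair RDDs the same partitioner so the join can run
// without reshuffling either side
val partitioner = new HashPartitioner(64)
val left = rdd1.partitionBy(partitioner)
val right = rdd2.partitionBy(partitioner)
// keys with equal hashes already sit in matching partitions on both sides
val joined = left.join(right)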

Related

As RDDs are immutable - what will be the use case for emptyRDD

rdd = sparkContext.emptyRDD()
What is the need for this method? What can we do with this empty RDD?
Can anyone give a use case or an idea of where we could use this empty RDD?
A few cases:
If your method must return an RDD (and not a null value) even when nothing matches, an emptyRDD is appropriate.
If you want to loop, unioning 0 to n RDDs into a single one: start with the empty RDD, then do rdd = rdd.union(anotherOne) on each iteration, as sketched below.
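A minimal sketch of that union-in-a-loop pattern, in Scala (the element type and the input RDDs rddA, rddB, rddC are illustrative):
import org.apache.spark.rdd.RDD
// Start from an empty RDD and union the inputs into it one by one;
// this also behaves correctly when the list of inputs is empty
val inputRdds: Seq[RDD[Int]] = Seq(rddA, rddB, rddC)
var combined: RDD[Int] = sparkContext.emptyRDD[Int]
for (r <- inputRdds) {
  combined = combined.union(r)
}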
Honestly I have never used it, but I guess it is there because some transformations need an RDD as an argument, whether it is empty or not. Suppose you need to perform an outer join and the RDD you are joining against depends on a condition that may leave it empty, like:
full_rdd.fullOuterJoin(another_full_rdd if condition else sparkContext.emptyRDD())
If the condition is not satisfied, the result contains pairs of type (key, (full_rdd[key], None)). I think it is the most elegant way to perform a full join based on a condition. But, as I said, I have never needed something like that; I hope someone else finds better examples.
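The same idea in Scala (a sketch; condition, fullRdd and anotherFullRdd are illustrative, assumed to be pair RDDs of the same type):
import org.apache.spark.rdd.RDD
// Join against the other RDD only when the condition holds,
// otherwise against an empty RDD of the same pair type
val rightSide: RDD[(String, Int)] =
  if (condition) anotherFullRdd else sparkContext.emptyRDD[(String, Int)]
val result = fullRdd.fullOuterJoin(rightSide)
// When condition is false, every output pair looks like (key, (Some(leftValue), None))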

Spark Dataset join performance

I receive a Dataset and I am required to join it with another table. The simplest solution that came to my mind was to create a second Dataset for the other table and perform a joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
  val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
  dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing the join at a later stage. Or will the query optimizer directly perform the join without referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take, etc. operations, try to apply them before joining the two datasets. Spark can't always push these kinds of filters down, so you have to reduce the amount of target records yourself as much as possible. The Spark (Catalyst) optimizer documentation is an excellent source of further information.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartitioning should be based on the keys that participate in the join, i.e.:
val repartitionedDogs = dogs.repartition(1024, $"key_col1", $"key_col2")
repartitionedDogs.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to broadcast the smaller dataset if you are sure it fits in memory (or raise spark.sql.autoBroadcastJoinThreshold so Spark chooses a broadcast join on its own). This gives a certain performance boost, since it ensures that the smaller dataset is co-located with the partitions of the larger one on every node; see the sketch after these suggestions.
If you can't apply any of the above, Spark has no way to know which records should be excluded and will therefore scan all the available rows from both datasets.
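A minimal sketch of points a and c combined, in Scala (the age predicate and key columns are illustrative assumptions; spark is assumed to be the active SparkSession):
import org.apache.spark.sql.functions.broadcast
import spark.implicits._
// (a) reduce the inputs before joining
val filteredDogs = dogs.filter($"age" < 5) // illustrative predicate
// (c) hint Spark to broadcast the smaller dataset so the big side is not shuffled
val joined = filteredDogs.join(broadcast(cats), Seq("key_col1", "key_col2"), "inner")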
You need to run an explain and see whether predicate pushdown is used; then you can judge whether your concern is correct or not.
In general, though, if no complex datatypes are used and there are no obvious datatype mismatches, pushdown takes place. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html
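A minimal sketch of that check (the key column and filter predicate are illustrative; for file-based sources, pushed-down predicates show up as PushedFilters in the scan nodes of the plan):
import spark.implicits._
// Build the join, then inspect the physical plan
val cats = spark.table("dev_db.cat").as[Cat]
val joined = dogs.join(cats, Seq("key_col1"), "inner").filter($"weight" > 10) // illustrative filter
joined.explain(true) // look for PushedFilters under the table/file scans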

How to use groupByKey() on multiple RDDs?

I have multiple RDDs with one common field CustomerId.
For example:
debitcardRdd has data as (CustomerId, debitField1, debitField2, ......)
creditcardRdd has data as (CustomerId, creditField1, creditField2, ....)
netbankingRdd has data as (CustomerId, nbankingField1, nbankingField2, ....)
We perform different transformations on each individual RDD; however, we need to perform a transformation on the data from all 3 RDDs, grouped by CustomerId.
Example: (CustomerId, debitField1, creditField2, bankingField1, ....)
Is there any way we can group the data from all the RDDs based on the same key?
Note: in Apache Beam this can be done using CoGroupByKey; I'm just checking whether such an alternative is available in Spark.
Just cogroup:
debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId)
)
In contrast to the other answer, the .keyBy is imho not actually required here if the RDDs are already pair RDDs keyed by CustomerId, and note that cogroup, although not well documented, extends to several RDDs (there are overloads for up to three other RDDs):
val rddREScogX = rdd1.cogroup(rdd2, rdd3, ...)
Points should go to the first answer.
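A minimal sketch of wiring the cogroup result back into combined records, in Scala (the case classes and field names are illustrative assumptions, not from the question):
case class Debit(CustomerId: String, debitField1: String)
case class Credit(CustomerId: String, creditField1: String)
case class NetBanking(CustomerId: String, nbankingField1: String)
// Key each RDD by CustomerId and cogroup them
val grouped = debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId))
// cogroup yields (CustomerId, (Iterable[Debit], Iterable[Credit], Iterable[NetBanking]));
// flatten it into whatever combined shape the downstream step needs
val combined = grouped.map { case (customerId, (debits, credits, netbanking)) =>
  (customerId,
   debits.map(_.debitField1).toSeq,
   credits.map(_.creditField1).toSeq,
   netbanking.map(_.nbankingField1).toSeq)
}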

How to compute the largest value in a column using withColumn?

I'm trying to compute the largest value of the following DataFrame in Spark 1.6.1:
val df = sc.parallelize(Seq(1,2,3)).toDF("id")
A first approach would be to select the maximum value, and it works as expected:
df.select(max($"id")).show
The second approach could be to use withColumn as follows:
df.withColumn("max", max($"id")).show
But unfortunately it fails with the following error message:
org.apache.spark.sql.AnalysisException: expression 'id' is neither
present in the group by, nor is it an aggregate function. Add to group
by or wrap in first() (or first_value) if you don't care which value
you get.;
How can I compute the maximum value in a withColumn function without any Window or groupBy? If not possible, how can I do it in this specific case using a Window?
The right approach is to compute the aggregate as a separate query and combine it with the actual result. Unlike the window functions suggested in many answers here, it won't require a shuffle to a single partition and is applicable to large datasets.
It could be done with withColumn using a separate action:
import org.apache.spark.sql.functions.{lit, max}
df.withColumn("max", lit(df.agg(max($"id")).as[Int].first))
but it is much cleaner to use either explicit:
import org.apache.spark.sql.functions.broadcast
df.crossJoin(broadcast(df.agg(max($"id") as "max")))
or implicit cross join:
spark.conf.set("spark.sql.crossJoin.enabled", true)
df.join(broadcast(df.agg(max($"id") as "max")))
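For the toy DataFrame above (ids 1, 2, 3), all of these variants attach a constant max column; a usage sketch (row order may vary):
df.crossJoin(broadcast(df.agg(max($"id") as "max"))).show()
// +---+---+
// | id|max|
// +---+---+
// |  1|  3|
// |  2|  3|
// |  3|  3|
// +---+---+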
There are a few categories of functions in Apache Spark:
Aggregate functions, e.g. max, for when we want to aggregate multiple rows into one.
Non-aggregate functions, e.g. abs, isnull, for when we want to transform one column into another.
Collection functions, e.g. explode, for when one row expands into multiple rows.
Implicit aggregation
These are used when we want to aggregate several rows into one.
The following code internally has an aggregation.
df.select(max($"id")).explain
== Physical Plan ==
*HashAggregate(keys=[], functions=[max(id#3)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_max(id#3)])
      +- *Project [value#1 AS id#3]
         +- Scan ExistingRDD[value#1]
We can also use multiple aggregate functions in a select:
df.select(max($"id"), min($"id")).explain
Aggregate functions cannot be mixed with non-aggregate functions directly.
The following code will report an error:
df.select(max($"id"), $"id")
df.withColumn("max", max($"id"))
This is because max($"id") produces fewer values (a single row) than $"id" (one value per input row).
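A quick sketch of the distinction (not a fix for the per-row max column, which the next part covers): an aggregate is fine on its own, or inside a global aggregation, because the result then has exactly one row.
// works: the whole result is a single aggregated row
df.select(max($"id")).show()
df.groupBy().agg(max($"id").as("max")).show()
// fails: one aggregated value cannot be lined up against many id rows
// df.select(max($"id"), $"id")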
Aggregate with over
In this case the analytic (window) function is applied to, and presented for, all rows in the result set.
We can use
df.select(max($"id").over, $"id").show
Or
df.withColumn("max", max($"id").over).show
This is Spark 2.0 here.
With withColumn and window functions it could be as follows:
df.withColumn("max", max('id) over)
Note the empty over, which assumes an "empty" window (and is equivalent to over()).
If you however need a more complete WindowSpec you can do the following (again, this is 2.0):
import org.apache.spark.sql.expressions._
// the trick that has performance cost (!)
val window = Window.orderBy()
df.withColumn("max", max('id) over window).show
Please note that the code has a serious performance issue as reported by Spark itself:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

SparkSQL DataFrame order by across partitions

I'm using Spark SQL to run a query over my dataset. The result of the query is pretty small, but it is still partitioned.
I would like to coalesce the resulting DataFrame and order the rows by a column. I tried
DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1")
result.toJSON().saveAsTextFile("output")
I also tried
DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1")
result.toJSON().saveAsTextFile("output")
The output file is ordered in chunks (i.e. each partition is ordered internally, but the DataFrame is not ordered as a whole). For example, instead of
1, value
2, value
4, value
4, value
5, value
5, value
...
I get
2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
What is the correct way to get an absolute ordering of my query result?
Why isn't the data frame being coalesced into a single partition?
I want to mention a couple of things here.
1 - The source code shows that the orderBy statement internally calls the sorting API with global ordering set to true, so the lack of ordering at the level of the output suggests that the ordering was lost while writing to the target. My point is that a call to orderBy always requires global order.
2 - Using a drastic coalesce, as in forcing a single partition in your case, can be really dangerous. I would recommend you do not do that. The source code suggests that calling coalesce(1) can potentially cause upstream transformations to run in a single partition, which would be brutal performance-wise.
3 - You seem to expect the orderBy statement to be executed in a single partition. I do not think I agree with that; that would make Spark a really silly distributed framework.
Community, please let me know if you agree or disagree with these statements.
How are you collecting data from the output anyway?
Maybe the output actually contains sorted data, but the transformations/actions you performed in order to read from the output are responsible for the lost order.
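A quick sketch of how to check both points before blaming the write (Scala syntax; variable names follow the question):
val result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1")
result.rdd.partitions.length // how many partitions actually reach the output
result.explain()             // the plan should contain a global Sort for the orderBy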
The orderBy will produce new partitions after your coalesce. To have a single, fully ordered output partition, reorder the operations:
DataFrame result = spark.sql("my sql").orderBy("col1").coalesce(1)
result.write().json("results.json")
As #JavaPlanet mentioned, for really big data you don't want to coalesce into a single partition. It will drastically reduce your level of parallelism.
