Is it possible to perform a right anti join in Dataflow?
I can see joins in Dataflow, but I didn't find a right anti join.
Any help is appreciated.
AFAIK, anti joins are currently not supported in Data Flows. Only the following joins are supported.
If your source tables are in SQL, it is better to perform the right anti join at the source, for example in a stored procedure, and read its result into the Dataflow.
If not, you can try the workaround below using the Exists transformation.
The Exists transformation takes a left stream and a right stream and returns the records from the left stream that are not present in the right stream (a left anti join).
To achieve a right anti join, swap the incoming streams.
Sample demo:
Left source data:
Right source data:
Exists transformation:
Result:
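For comparison, here is a minimal PySpark sketch of the same idea, since Data Flows run on Spark under the hood: a right anti join is just a left anti join with the two streams swapped. The DataFrames and the key column are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical left and right streams, keyed on "id"
left = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "l_val"])
right = spark.createDataFrame([(2, "x"), (4, "y")], ["id", "r_val"])

# Right anti join = left anti join with the streams swapped:
# keep the rows of `right` that have no match in `left`.
right_anti = right.join(left, on="id", how="left_anti")
right_anti.show()  # expect only the row with id=4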
I'm re-writing some SQL code, and there is a section of it which makes use of sub-queries. I could write this as a join, but wanted to know if it can be done in a similar sub-query manner in PySpark. There is a significant performance benefit from using sub-queries in the SQL code, but I want to know whether this would be irrelevant in PySpark due to optimisation of the DAG. So it would be helpful if someone could explain the relative performance tradeoff, if there is one.
The logic is pretty simple: I have df_a and I want to pull a column from df_b where df_a and df_b have a matching index on a certain key. The below isn't working but is intended to show the intent.
df_a.select("df_a.key_1", "df_a.key_2", df_b.select("df_b.key_2").where(col("df_b.key_1")=="df_a.key_3"))
If you already have working SQL code, you can just use spark.sql(s), where s is your query as a string; it can contain subqueries too. Just make sure to create a view of your DataFrame so you can reference it inside the spark.sql query. Here is a toy example:
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")  # put your SQL query here; it may contain subqueries
As for your question regarding the optimisation tradeoff: in theory the Catalyst optimizer used by Spark should take care of any optimisation in your query, but as always, if you know exactly what kind of optimisation you need, it is generally better to do it by hand rather than relying on Catalyst.
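For instance, here is a minimal sketch of the pattern from the question, using toy stand-ins for df_a and df_b with the column names assumed from the post; it shows both a correlated scalar subquery and the equivalent DataFrame-API join, so you can compare the plans Catalyst produces with .explain().

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for df_a and df_b (column names taken from the question)
df_a = spark.createDataFrame([(1, "a", 10), (2, "b", 20)], ["key_1", "key_2", "key_3"])
df_b = spark.createDataFrame([(10, "x"), (30, "y")], ["key_1", "key_2"])

df_a.createOrReplaceTempView("df_a")
df_b.createOrReplaceTempView("df_b")

# Correlated scalar subquery, close in spirit to the original SQL
sub = spark.sql("""
    SELECT a.key_1,
           a.key_2,
           (SELECT max(b.key_2) FROM df_b b WHERE b.key_1 = a.key_3) AS b_key_2
    FROM df_a a
""")

# The same result written as a DataFrame-API join
joined = (df_a.alias("a")
          .join(df_b.alias("b"), F.col("a.key_3") == F.col("b.key_1"), "left")
          .select("a.key_1", "a.key_2", F.col("b.key_2").alias("b_key_2")))

# Compare the physical plans for the two formulations
sub.explain()
joined.explain()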
I was asked this question recently while describing a use case that involved multiple joins, in addition to some processing, which I had implemented in Spark. The question was: could the joins not have been done while importing the data into HDFS using Sqoop? I wanted to understand, from an architectural standpoint, whether it is advisable to implement the joins in Sqoop even when it is possible.
It is possible to do joins in Sqoop imports.
From an architecture point of view, it depends on your use case; Sqoop is mainly a utility for fast imports/exports. All the ETL can be done through Spark/Pig/Hive/Impala.
Although it is doable, I would recommend against it: it will hurt your job's time efficiency, and it will put load on your source system for computing the joins/aggregations. Also, Sqoop was primarily designed to be an ingestion tool for structured sources.
It depends on the infrastructure of your data pipeline: if you are already using Spark for some other purpose, then it is better to use that same Spark for importing the data as well. Sqoop supports joins and will be sufficient if you only need to import data and nothing else. Hope this answers your query.
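As an illustration of the Spark route, here is a minimal sketch that reads two tables over JDBC and performs the join in Spark instead of at the source. The connection URL, credentials, table names, and join key are all placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical JDBC source; swap in your own URL, driver, and credentials
jdbc_opts = {
    "url": "jdbc:mysql://dbhost:3306/sales",
    "user": "etl_user",
    "password": "***",
    "driver": "com.mysql.cj.jdbc.Driver",
}

orders = spark.read.format("jdbc").options(dbtable="orders", **jdbc_opts).load()
customers = spark.read.format("jdbc").options(dbtable="customers", **jdbc_opts).load()

# Join in Spark rather than pushing the work onto the source database
enriched = orders.join(customers, on="customer_id", how="inner")
enriched.write.mode("overwrite").parquet("hdfs:///data/enriched_orders")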
You can use:
a view in the DBMS you are reading from, optionally using sqoop eval to set parameters in the database there;
free-form SQL for Sqoop in which the JOIN is defined.
However, views with JOINs cannot be used for incremental imports.
The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.
The Sqoop import tool supports joins. This can be achieved using the --query option (don't use this option together with --table / --columns).
I was wondering if there is a performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset)) and using a left anti join. So far, the only difference I can see is that with the left anti join, the two datasets can have different columns.
Your title and your explanation differ.
But if you have the same structure, you can use both methods to find missing data. EXCEPT is a specific implementation that enforces the same structure and is a subtract operation, whereas LEFT ANTI JOIN allows different structures, as you say, but can give the same result.
Use cases differ: 1) a left anti join applies to many situations pertaining to missing data, such as customers with no orders (yet) or orphans in a database; 2) except is for subtracting things, e.g. splitting data into test and training sets in machine learning.
Performance should not be a real deal breaker, as they are different use cases in general and therefore difficult to compare. Except will involve the same data source, whereas a left anti join will involve different data sources.
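To make the comparison concrete, here is a small PySpark sketch with toy DataFrames and made-up column names showing both operations; calling .explain() on each lets you compare the physical plans.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b")], ["id", "val"])

# except: both sides must have the same schema; behaves like a set subtract
subtracted = df1.exceptAll(df2)  # df1.subtract(df2) gives the distinct variant

# left anti join: only the join key has to line up, schemas may differ
anti = df1.join(df2.select("id"), on="id", how="left_anti")

subtracted.explain()
anti.explain()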
Is there some tooling in Spark to handle bad records, meaning records that are null after a left join or that were not joined properly?
It would be great if there were something like this, but specifically for checking data quality after joins.
No, there is not. The Databricks reference you make is different from the left join situation you mean ("not joined properly", whatever that may mean). I think you also mean at least an outer join, and that behaviour is deliberate.
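There is no built-in "bad join" detector, but a common hand-rolled check is to look for nulls in the right-hand columns after a left outer join. A minimal sketch, with made-up DataFrames and column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame([(1, 100), (2, 200), (3, 300)], ["customer_id", "amount"])
customers = spark.createDataFrame([(1, "alice"), (2, "bob")], ["customer_id", "name"])

joined = orders.join(customers, on="customer_id", how="left")

# Rows with no match on the right side come back with null right-hand columns
unmatched = joined.filter(F.col("name").isNull())

print("unmatched rows:", unmatched.count())
unmatched.show()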
In Spark, is there a way of adding a column to a DataFrame by means of a join, but in a way that guarantees that the left hand side remains completely unchanged?
This is what I have looked at so far:
leftOuterJoin... but that risks duplicating rows, so one would have to be super-careful to make sure that there are no duplicate keys on the right. Not exactly robust or performant, if the only way to guarantee safety is to dedupe before the join.
There is a data structure that seems to guarantee no duplicate keys: PairRDD. That has a nice method for looking up a key in the key-value table: YYY.lookup("key"). Thus one might expect to be able to do .withColumn("newcolumn", udf((key:String) => YYY.lookup(key)).apply(keyColumn)), but it seems that UDFs cannot do this, because they apparently cannot access the sqlContext, which is apparently needed for the lookup. If there were a way of using withColumn I would be extremely happy, because it has the right semantics.
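To make the first option concrete, this is roughly what the dedupe-then-join version would look like in PySpark (the DataFrames and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "left_val"])
right = spark.createDataFrame([(1, "x"), (1, "x_dup"), (3, "y")], ["key", "new_col"])

# Drop duplicate keys on the right so the left side cannot fan out
right_unique = right.dropDuplicates(["key"])

result = left.join(right_unique, on="key", how="left")
assert result.count() == left.count()  # left-hand side row count is unchanged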
Many thanks in advance!