When I try to join two data frames using
DataFrame joindf = dataFrame.join(df, df.col(joinCol)); //.equalTo(dataFrame.col(joinCol)));
my program throws the exception below:
org.apache.spark.sql.AnalysisException: join condition 'url' of type
string is not a boolean.;
Here the value of joinCol is url.
I need input on what could possibly cause this exception.
The join variants that take a Column as the second argument expect it to evaluate to a boolean expression.
If you want a simple equi-join based on a column name, use the variant that takes the column name as a String:
String joinCol = "foo";
dataFrame.join(df, joinCol);
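If you do want the Column-based variant, the second argument has to be a boolean comparison rather than a bare column. In Scala that would look roughly like the sketch below (=== corresponds to .equalTo in the Java API); both frames are assumed to contain the join column from the question:
// Sketch only: dataFrame and df are assumed to both have a "url" column, as in the question.
val joinCol = "url"
val joindf = dataFrame.join(df, dataFrame.col(joinCol) === df.col(joinCol))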
What that means is that the join condition should evaluate to a boolean expression. Let's say we want to join two dataframes on id; we can do it like this:
With Python:
df1.join(df2, df1['id'] == df2['id'], 'left')  # the 3rd parameter is the join type, here a left join
With Scala:
df1.join(df2, df1("id") === df2("id")) // creates an inner join (the default) on the id column
You cannot use df.col(joinCol) on its own because it is not a boolean expression. To join two dataframes you need to identify the columns you want to join on.
Let's say you have DataFrames emp and dept; joining them looks like this in Scala:
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
This example is taken from Spark SQL Join DataFrames
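A self-contained sketch of that example (the employee and department rows below are made up so the snippet runs on its own):
// Sketch only: the data and column values are invented; assumes a SparkSession named `spark`.
import spark.implicits._

val empDF = Seq((1, "Smith", 10), (2, "Rose", 20), (3, "Jones", 40))
  .toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "Finance"), (20, "Marketing"), (30, "Sales"))
  .toDF("dept_id", "dept_name")

// The row with emp_dept_id = 40 has no matching dept_id and is dropped by the inner join.
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .show(false)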
Related
I am running some queries with joins using Spark SQL 3.1, where the same columns in both tables can contain null values, like:
select ...
from a
join b
on a.col_with_nulls = b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
However, the on condition does not match rows where the values are null. I have also tried:
select ...
from a
join b
on a.col_with_nulls is not distinct from b.col_with_nulls
and a.col_without_nulls = b.col_without_nulls
as suggested in other solutions here, but I keep getting the same result. Any ideas?
You can use <=> (or eqNullSafe() in the DataFrame API) to treat nulls as equal in the join.
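A minimal sketch of the null-safe condition in the DataFrame API (the dfA/dfB names are assumptions; in SQL the same condition is written with <=>):
// Sketch only: dfA and dfB are assumed to be the dataframes behind tables a and b.
// In SQL the equivalent condition is: a.col_with_nulls <=> b.col_with_nulls
val dfA = spark.table("a")
val dfB = spark.table("b")

dfA.join(
  dfB,
  dfA("col_with_nulls").eqNullSafe(dfB("col_with_nulls")) &&
    dfA("col_without_nulls") === dfB("col_without_nulls")
)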
I have two tables, A and B, with hundreds of columns. I am trying to apply a left outer join on the two tables, but they have different keys.
I created a new column in B with the same key name as in A, and then I was able to apply the left outer join. However, how do I join the two tables if I am unable to make the column names consistent?
This is what I have tried:
a = spark.table('a').rdd
a = spark.table('a')
b = b.withColumn("acct_id",col("id"))
b = b.rdd
a.leftOuterJoin(b).collect()
If you have dataframes, why are you creating RDDs from them? Is there a specific need?
Try the command below on the dataframes:
a.join(b, a.column_name==b.column_name, 'left').show()
Here are a few commands you can use to investigate your dataframe:
##Get column names of dataframe
a.columns
##Get column names with their datatype of dataframe
a.dtypes
##What is the type of object (eg. dataframe, rdd etc.)
type(a)
DataFrames are faster than RDDs, and you already have dataframes, so I suggest:
from pyspark.sql.functions import col

a = spark.table('a')
b = spark.table('b').withColumn("acct_id", col("id"))
result = a.join(b, a['id'] == b['acct_id'], 'left')
How can we split a dataframe, operate on each individual split, and union the results of all the individual dataframes back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe based on channel and operate on each individual split, which adds a new column called bucket. Then I need to union the results back together.
account,channel,number_of_views
groupBy only allows simple aggregate operations. On each split dataframe I need to do feature extraction.
Currently, all feature transformers in spark-mllib support only a single dataframe.
You can randomly split like this:
val Array(training_data, validation_data, test_data) = raw_data_rating_before_split.randomSplit(Array(0.6, 0.2, 0.2))
This will create 3 dataframes; do what you want on each, and then you can join or union them back, for example:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join multiple dataframes at the same time.
Is this what you want, or something else?
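For the per-channel split asked about in the question (rather than a random split), one option is to filter once per channel value, apply the feature extraction to each piece, and union the pieces back together. A rough sketch, where the channel values and the addBucket transformation are assumptions:
// Sketch only: `channels`, `addBucket`, and `df` are assumptions, not from the original answer.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def addBucket(split: DataFrame): DataFrame =
  split.withColumn("bucket", lit("placeholder")) // stand-in for the real feature extraction

val channels = Seq("web", "mobile", "email") // assumed channel values
val bucketedDF = channels
  .map(c => addBucket(df.filter(df("channel") === c)))
  .reduce(_ union _)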
I have two dataframes, let's say A and B. They have different schemas.
I want to get the records from dataframe A that join with B on a key, and I also want the records that didn't get joined.
Can this be done in a single query?
Going over the same data twice would reduce performance. DataFrame A is much bigger than B.
DataFrame B's size will be around 50 GB-100 GB.
Hence I can't broadcast B in that case.
I am okay with getting a single DataFrame C as a result, which can have a partition column "Joined" with values "Yes" or "No", signifying whether the data in A got joined with B or not.
What if A has duplicates and I don't want them?
I was thinking that I'll do a reduceByKey later on the C dataframe. Any suggestions around that?
I am using Hive tables to store the data in ORC file format on HDFS.
I am writing the code in Scala.
Yes, you just need to do a left-outer join:
import sqlContext.implicits._
val A = sc.parallelize(List(("id1", 1234),("id1", 1234),("id3", 5678))).toDF("id1", "number")
val B = sc.parallelize(List(("id1", "Hello"),("id2", "world"))).toDF("id2", "text")
val joined = udf((id: String) => id match {
  case null => "No"
  case _ => "Yes"
})

val C = A
  .distinct
  .join(B, 'id1 === 'id2, "left_outer")
  .withColumn("joined", joined('id2))
  .drop('id2)
  .drop('text)
This will yield a dataframe C:[id1: string, number: int, joined: string] that looks like this:
[id1,1234,Yes]
[id3,5678,No]
Note that I have added a distinct to filter out duplicates in A, and that the last column in C indicates whether or not the row was joined.
EDIT: Following a remark from the OP, I have added the drop lines to remove the columns from B.
I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically I just want to take the values from df, remove any that are found in df_re, and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
import org.apache.spark.sql.functions.col
import sqlContext.implicits._

val df1 = Seq((1, 2), (2, 123), (3, 101)).toDF("uniq_id", "payload")
val df2 = Seq((2, 432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
    df2.as("df2"),
    col("df1.uniq_id") === col("df2.uniq_id"),
    "left_outer"
  ).filter($"df2.uniq_id".isNull)
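As a side note (not part of the original answer): on Spark 2.0 and later the same result can be written directly as a left_anti join, which keeps only the rows of df1 with no match in df2:
// Alternative sketch using an anti-join instead of left_outer + filter (Spark 2.0+).
df1.join(df2, df1("uniq_id") === df2("uniq_id"), "left_anti")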