pyspark RDD - Left outer join on specific key - apache-spark

I have two tables, A and B, with hundreds of columns. I am trying to apply a left outer join on the two tables, but they have different keys.
I created a new column in B with the same key name as in A, and was then able to apply the left outer join. However, how do I join both tables if I am unable to make the column names consistent?
This is what I have tried:
a = spark.table('a').rdd
b = spark.table('b')
b = b.withColumn("acct_id", col("id"))
b = b.rdd
a.leftOuterJoin(b).collect()

If you have DataFrames, why are you creating RDDs from them? Is there any specific need?
Try the command below on the DataFrames:
a.join(b, a.column_name==b.column_name, 'left').show()
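Applied to the tables in the question, the left outer join would look like the sketch below (assuming, as the withColumn call in the question suggests, that a's key column is acct_id and b's key column is id):
a = spark.table('a')
b = spark.table('b')
# Left outer join on differently named key columns, no renaming needed
a.join(b, a.acct_id == b.id, 'left').show()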
Here are a few commands you can use to investigate your DataFrame:
## Get the column names of the DataFrame
a.columns
## Get the column names with their data types
a.dtypes
## What is the type of the object (e.g. DataFrame, RDD, etc.)
type(a)

DataFrames are faster than RDDs, and you already have DataFrames, so I suggest:
from pyspark.sql.functions import col

a = spark.table('a')
b = spark.table('b').withColumn("acct_id", col("id"))
result = a.join(b, on="acct_id", how="left")  # call result.rdd only if you really need an RDD afterwards

Related

Join two DataFrames where the join key is different and only select some columns

What I would like to do is:
Join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B.
I tried something like what I put below, with different quotation marks, but it is still not working. I feel that in PySpark there should be a simple way to do this.
A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)
I know you could write
A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")
to do this, but I would like to do it more like the pseudocode above.
Your pseudocode is basically correct. This slightly modified version would work if the id column existed in both DataFrames:
A_B = A.join(B, on="id").select("A.*", "B.b1", "B.b2")
From the docs for pyspark.sql.DataFrame.join():
If on is a string or a list of strings indicating the name of the join
column(s), the column(s) must exist on both sides, and this performs
an equi-join.
Since the keys are different, you can just use withColumn() (or withColumnRenamed()) to create a column with the same name in both DataFrames:
A_B = A.withColumn("id", col("a_id")).join(B.withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
If your DataFrames have long complicated names, you could also use alias() to make things easier:
A_B = long_data_frame_name1.alias("A").withColumn("id", col("a_id"))\
.join(long_data_frame_name2.alias("B").withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
Try this solution:
A_B = A.join(B, col('B.id') == col('A.id')).select([col('A.' + xx) for xx in A.columns]
                                                   + [col('B.other1'), col('B.other2')])
The list comprehensions below, inside the select(), do the trick of selecting all columns from A and two columns from B:
[col('A.' + xx) for xx in A.columns] : all columns of A
[col('B.other1'), col('B.other2')] : selected columns of B
I think the easier solution is just to join table A to table B with only the columns you want from B. Here is sample code to do this:
joined_tables = table_A.join(table_B.select('col1', 'col2', 'col3'), ['id'])
The code above joins all columns from table_A with columns "col1", "col2", and "col3" from table_B.

spark: apply explode to a list of columns in a dataFrame, but not to all columns

Suppose I have a list of column names, val expFields = List("f1", "f2"), and a DataFrame df, and I'd like to explode the columns in the expFields list of df. That means I'd like to apply "explode" to a selected set of columns and return a new DataFrame. I don't want to manually specify the column names like df.withColumn("f1", explode(col("f1"))).withColumn("f2", explode(col("f2"))); I'd like to use the expFields list to specify these columns. How do I do that in Spark?
Just fold over the list of columns:
expFields.foldLeft(df)((acc, c) => acc.withColumn(c, explode(col(c))))
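For readers working in PySpark rather than Scala, a roughly equivalent sketch (assuming a DataFrame df whose columns f1 and f2 hold arrays) folds over the list with functools.reduce:
from functools import reduce
from pyspark.sql.functions import col, explode

exp_fields = ["f1", "f2"]

# Replace each listed column with its exploded version, one column at a time
exploded_df = reduce(lambda acc, c: acc.withColumn(c, explode(col(c))), exp_fields, df)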

efficiently get joined and not joined data of a dataframe against other dataframe

I have two DataFrames, let's say A and B. They have different schemas.
I want to get the records from DataFrame A which join with B on a key, and I also want the records which didn't get joined.
Can this be done in a single query? Going over the same data twice would reduce performance, and DataFrame A is much bigger in size than B.
DataFrame B's size will be around 50 GB-100 GB, so I can't broadcast B in this case.
I am okay with getting a single DataFrame C as the result, which can have a partition column "Joined" with values "Yes" or "No", signifying whether or not the data in A got joined with B.
What about the case where A has duplicates? I don't want those.
I was thinking that I'll do a reduceByKey later on the C DataFrame. Any suggestions around that?
I am using Hive tables to store the data in ORC file format on HDFS, and I am writing code in Scala.
Yes, you just need to do a left-outer join:
import sqlContext.implicits._
import org.apache.spark.sql.functions.udf

val A = sc.parallelize(List(("id1", 1234), ("id1", 1234), ("id3", 5678))).toDF("id1", "number")
val B = sc.parallelize(List(("id1", "Hello"), ("id2", "world"))).toDF("id2", "text")

// Flag whether the right-hand side matched: a null id2 means no match
val joined = udf((id: String) => id match {
  case null => "No"
  case _    => "Yes"
})

val C = A
  .distinct
  .join(B, 'id1 === 'id2, "left_outer")
  .withColumn("joined", joined('id2))
  .drop('id2)
  .drop('text)
This will yield a DataFrame C: [id1: string, number: int, joined: string] that looks like this:
[id1,1234,Yes]
[id3,5678,No]
Note that I have added a distinct to filter out duplicates in A, and that the last column in C refers to whether or not it was joined.
EDIT: Following a remark from the OP, I have added the drop lines to remove the columns of B from the result.

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark DataFrame column based on values in another DataFrame column. It seems that withColumn() only works within a single DataFrame, and subqueries won't be fully available until version 2. I suppose I could try to join the tables, but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically I just want to take the values from df and remove any that are found in df_re, then return the whole DataFrame with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
import sqlContext.implicits._
import org.apache.spark.sql.functions.col

val df1 = Seq((1, 2), (2, 123), (3, 101)).toDF("uniq_id", "payload")
val df2 = Seq((2, 432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)
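Since the question itself is in PySpark, here is a sketch of the same approach in Python (assuming both df and df_re carry the UNIQUE_ID column shown in the question; the alias names "d" and "r" are just illustrative):
from pyspark.sql.functions import col

# Left outer join on UNIQUE_ID, then keep only the rows of df with no match in df_re
deduped = (
    df.alias("d")
      .join(df_re.alias("r"), col("d.UNIQUE_ID") == col("r.UNIQUE_ID"), "left_outer")
      .filter(col("r.UNIQUE_ID").isNull())
      .select("d.*")
)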

Joining two dataframes in Spark

When I try to join two DataFrames using
DataFrame joindf = dataFrame.join(df, df.col(joinCol)); //.equalTo(dataFrame.col(joinCol)));
my program throws the exception below:
org.apache.spark.sql.AnalysisException: join condition 'url' of type string is not a boolean.;
Here the joinCol value is url.
I need input on what could possibly cause this exception.
The join variants which take a Column as the second argument expect that it can be evaluated as a boolean expression.
If you want a simple equi-join based on a column name, use a version which takes the column name as a String:
String joinCol = "foo";
dataFrame.join(df, joinCol);
What that means is that the join condition should evaluate to a boolean expression. Let's say we want to join two DataFrames based on id; what we can do is:
With Python:
df1.join(df2, df1['id'] == df2['id'], 'left')  # the third parameter is the join type, in this case a left join
With Scala:
df1.join(df2, df1("id") === df2("id"))  // create an inner join based on the id column
You cannot use df.col(joinCol) on its own, as it is not a boolean expression. In order to join two DataFrames you need to identify the columns you want to join on.
Let's say you have DataFrames empDF and deptDF; joining these two DataFrames should look like the code below in Scala:
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
This example is taken from Spark SQL Join DataFrames
