Join Data Explode on Spark Databricks - apache-spark

I have data similar to:

FieldA  FieldB  ExplodedField
1       A       1
1       A       2
1       A       3
2       B       3
2       B       5
I would like to join the data so the output looks like this:

FieldA  FieldB  ExplodedField
1       A       1
1       A       1,2
1       A       1,2,3
2       B       3
2       B       3,5
How would you implement this in Spark? Note that the input dataset is very large.

Use a window (partitionBy + orderBy) together with the collect_list and concat_ws functions.
Example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val df = Seq((1,"A",1),(1,"A",2),(1,"A",3))
  .toDF("FieldA","FieldB","ExplodedField")
  .withColumn("mid", monotonically_increasing_id())

val win = Window.partitionBy("FieldA","FieldB").orderBy("mid")
df.withColumn("ExplodedField", concat_ws(",", collect_list(col("ExplodedField")).over(win)))
  .drop("mid")
  .show()
/*
+------+------+-------------+
|FieldA|FieldB|ExplodedField|
+------+------+-------------+
| 1| A| 1|
| 1| A| 1,2|
| 1| A| 1,2,3|
+------+------+-------------+
*/
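The running collect_list over the window produces, for each row, the comma-joined prefix of its group so far. As a quick sanity check, the same logic can be sketched in plain Python (not Spark; the tuples and column names just mirror the question's data, with input order standing in for the "mid" ordering column):

```python
# Pure-Python sketch of the running collect_list + concat_ws logic.
rows = [
    (1, "A", 1), (1, "A", 2), (1, "A", 3),
    (2, "B", 3), (2, "B", 5),
]

def running_concat(rows):
    """For each (FieldA, FieldB) group, replace ExplodedField with the
    comma-joined list of all values seen so far in that group."""
    seen = {}  # (FieldA, FieldB) -> values collected so far
    out = []
    for a, b, v in rows:
        seen.setdefault((a, b), []).append(v)
        out.append((a, b, ",".join(str(x) for x in seen[(a, b)])))
    return out

for row in running_concat(rows):
    print(row)
```

Each group accumulates independently, which is exactly what partitionBy gives you in the windowed Spark version.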

Related

transition matrix from pyspark dataframe

I have two columns (such as):

from  to
1     2
1     3
2     4
4     2
4     2
4     3
3     3
And I want to create a transition matrix (where the values in each column add up to 1):

      1.    2.   3.  4.
1.    0     0    0   0
2.    0.5*  0    0   2/3
3.    0.5   0.5  1   1/3
4.    0     0.5  0   0
where 1 -> 2 would be: (the number of times 1 (in 'from') is next to 2 (in 'to')) / (total times 1 points to any value).
You can create this kind of transition matrix using a window and pivot.
First some dummy data:
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)
df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
| 3| 3|
| 4| 2|
| 1| 2|
...
To create a pct column, first group the data by unique combinations of from/to and get the counts. On that aggregated dataframe, create a new column, pct, that uses a Window to find the total number of records for each from group, which serves as the denominator.
Lastly, pivot the table to make the to values the columns and the pct data the values of the matrix.
from pyspark.sql import functions as F, Window
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0.2| 0.2|0.25|0.35|
| 2|0.27|0.31|0.19|0.23|
| 3|0.46|0.17|0.21|0.17|
| 4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+
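For reference, the count-then-normalize step can be sketched in plain Python (not Spark) on the asker's seven (from, to) pairs; Counter stands in for the groupBy count, and the per-from totals play the role of the window denominator:

```python
from collections import Counter

# The asker's seven (from, to) transitions.
pairs = [(1, 2), (1, 3), (2, 4), (4, 2), (4, 2), (4, 3), (3, 3)]

counts = Counter(pairs)                # count per unique (from, to) pair
totals = Counter(f for f, _ in pairs)  # denominator: total per 'from' group

# pct plays the role of the window-normalised 'pct' column before the pivot.
pct = {(f, t): c / totals[f] for (f, t), c in counts.items()}
```

Pivoting then just arranges pct into a grid keyed by from (columns) and to (rows).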

Query a second dataframe based on the values of the first dataframe [spark] [pyspark]

I have a specific requirement where I need to query a dataframe based on a range condition.
The values of the range come from the rows of another dataframe and so I will have as many queries as the rows in this different dataframe.
Using collect() in my scenario seems to be the bottleneck because it brings every row to the driver.
Sample example:
I need to execute a query on table 2 for every row in table 1
Table 1:

ID1  Num1  Num2
1    10    3
2    40    4
Table 2:

ID2  Num3
1    9
2    39
3    22
4    12
For the first row in table 1, I create the range [10-3, 10+3] = [7,13] => this becomes the range for the first query.
For the second row in table 1, I create the range [40-4, 40+4] = [36,44] => this becomes the range for the second query.
I am currently doing collect() and iterating over the rows to get the values. I use these values as ranges in my queries for Table 2.
Output of Query 1:

ID2  Num3
1    9
4    12

Output of Query 2:

ID2  Num3
2    39
Since the number of rows in table 1 is very large, doing a collect() operation is costly.
And since the condition is a numeric range rather than an equality, I assume a plain join won't work.
Any help in optimizing this task is appreciated.
Depending on what you want your output to look like, you could solve this with a join. Consider the following code:
case class FirstType(id1: Int, num1: Int, num2: Int)
case class Bounds(id1: Int, lowerBound: Int, upperBound: Int)
case class SecondType(id2: Int, num3: Int)
val df = Seq((1, 10, 3), (2, 40, 4)).toDF("id1", "num1", "num2").as[FirstType]
df.show
+---+----+----+
|id1|num1|num2|
+---+----+----+
| 1| 10| 3|
| 2| 40| 4|
+---+----+----+
val df2 = Seq((1, 9), (2, 39), (3, 22), (4, 12)).toDF("id2", "num3").as[SecondType]
df2.show
+---+----+
|id2|num3|
+---+----+
| 1| 9|
| 2| 39|
| 3| 22|
| 4| 12|
+---+----+
val bounds = df.map(x => Bounds(x.id1, x.num1 - x.num2, x.num1 + x.num2))
bounds.show
+---+----------+----------+
|id1|lowerBound|upperBound|
+---+----------+----------+
| 1| 7| 13|
| 2| 36| 44|
+---+----------+----------+
val test = bounds.join(df2, df2("num3") >= bounds("lowerBound") && df2("num3") <= bounds("upperBound"))
test.show
+---+----------+----------+---+----+
|id1|lowerBound|upperBound|id2|num3|
+---+----------+----------+---+----+
| 1| 7| 13| 1| 9|
| 2| 36| 44| 2| 39|
| 1| 7| 13| 4| 12|
+---+----------+----------+---+----+
Here, I do the following:
Create 3 case classes to be able to use typed datasets later on
Create the 2 dataframes
Create an auxiliary dataframe called bounds, which contains the lower/upper bounds
Join the second dataframe onto that auxiliary one
As you can see, the test dataframe contains the result. Restricted to the id2 and num3 columns, each unique combination of id1, lowerBound and upperBound gives you one of the per-row query outputs you wanted.
You could, for example, group by these 3 columns and then do whatever you want with the output. Note that a plain test.groupBy("id1", "lowerBound", "upperBound") returns a RelationalGroupedDataset; if you want a KeyValueGroupedDataset, and its mapValues method for applying an operation to the rows of each set of bounds, use groupByKey instead.
Hope this helps!
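The bounds construction and the non-equi join above can be sketched in plain Python as a nested-loop join (Spark evaluates the same range predicate, just distributed across the cluster). The names mirror the Scala example:

```python
# (id1, num1, num2) rows of table 1 and (id2, num3) rows of table 2.
table1 = [(1, 10, 3), (2, 40, 4)]
table2 = [(1, 9), (2, 39), (3, 22), (4, 12)]

# Derive (id1, lowerBound, upperBound) from each table-1 row.
bounds = [(id1, n1 - n2, n1 + n2) for id1, n1, n2 in table1]

# Join on lowerBound <= num3 <= upperBound.
joined = [
    (id1, lo, hi, id2, n3)
    for id1, lo, hi in bounds
    for id2, n3 in table2
    if lo <= n3 <= hi
]
```

Grouping `joined` by its first three fields recovers the per-query outputs from the question.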

Mapping key and list of values to key value using pyspark

I have a dataset which consists of two columns, C1 and C2. The columns are associated by a many-to-many relation.
What I would like to do is find, for each C2 value, the C1 value which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...").\
    map(lambda line: (line.split(",")[0], [line.split(",")[1]])).\
    reduceByKey(lambda x, y: x + y)
What this does is gather, for each C1 value, all the C2 matches; the length of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(Key ,Value_list[value1,value2,...]) -->(value1 , key ),(value2 , key)...
How could this be done using spark? Any advice would be really helpful.
Thanks in advance!
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
       .count()
       .join(df, 'C1')
       .groupBy(F.col('C2').alias('Out1'))
       .agg(
           F.max(
               F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
           ).alias('c')
       )
       .select('Out1', 'c.Out2', 'c.matches')
       .orderBy('Out1')
      )
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
We can also get the desired result using the DataFrame API with a window function.
from pyspark.sql import *
import pyspark.sql.functions as fun
from pyspark.sql.window import Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)
output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
    .groupby(fun.col("c2").alias("out1")) \
    .agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))
output.show()
# output
+----+----+-------+
|out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+
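The max(struct(count, C1)) trick in the first answer relies on structs comparing field by field, so the C1 with the highest overall count wins for each C2. The same selection can be sketched in plain Python (not Spark; the pair list mirrors the question's sample):

```python
from collections import Counter

# (C1, C2) pairs from the question.
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]

# Association count per C1 value -- the 'matches' column.
matches = Counter(c1 for c1, _ in data)

# For each C2, keep the (matches, C1) pair with the highest count,
# mimicking the struct ordering used by F.max(F.struct(...)).
result = {}
for c1, c2 in data:
    if c2 not in result or matches[c1] > result[c2][0]:
        result[c2] = (matches[c1], c1)
```

For C2 = 9, C1 = 1 (3 matches) beats C1 = 2 (2 matches), matching the "1 wins because 3>2" note in the question.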

Join two dataframes in pyspark by one column

I have two dataframes that I need to join by one column, keeping just the rows from the first dataframe whose id is contained in the same column of the second dataframe:
df1:
id a b
2 1 1
3 0.5 1
4 1 2
5 2 1
df2:
id c d
2 fs a
5 fa f
Desired output:
df:
id a b
2 1 1
5 2 1
I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.
df2("id") is not valid Python syntax for selecting columns; you'd need either df2[["id"]] or a select: df2.select("id"). For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
If you only need to check whether id exists in df2, and don't need any columns from df2 in your output, then isin() is a more efficient solution (it is similar to EXISTS and IN in SQL).
df1 = spark.createDataFrame([(2, 1, 1), (3, 5, 1), (4, 1, 2), (5, 2, 1)], "id: int, a: int, b: int")
df2 = spark.createDataFrame([(2, 'fs', 'a'), (5, 'fa', 'f')], ['id', 'c', 'd'])
Collect df2.id as a list and pass it to df1 under isin():
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() IF:
You don't need to return data from the reference dataframe/table
You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)
You just want to check the existence of a particular value
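The semantics of the isin() filter, keeping only the df1 rows whose id appears in the collected df2 id list and returning no df2 columns, can be sketched in plain Python:

```python
# (id, a, b) rows of df1, and the collected df2.id values (deduplicated).
df1_rows = [(2, 1, 1), (3, 5, 1), (4, 1, 2), (5, 2, 1)]
df2_ids = {2, 5}

# Membership test per row -- no df2 columns appear in the output, and
# duplicate ids in df2 cannot multiply rows the way a join would.
filtered = [row for row in df1_rows if row[0] in df2_ids]
```

This also shows why duplicates in the reference table are harmless here: set membership is checked once per df1 row.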

NTILE function not working in Spark SQL 1.5

I'm testing the NTILE function on a simple dataset like this:
(id: string, value: double)
A 10
B 3
C 4
D 4
E 4
F 30
C 30
D 10
A 4
H 4
Running the following query against HIVE (on MapReduce)
SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp
works fine with the following result:
(id, sum_val, quartile)
B 3 1
H 4 1
E 4 2
D 14 2
A 14 3
F 30 3
C 34 4
Running the same query against Hive on Spark (v 1.5) still works fine.
Running the same query against Spark SQL 1.5 (CDH 5.5.1)
val result = sqlContext.sql("SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp")
result.collect().foreach(println)
I get the following wrong result:
[B,3.0,0]
[E,4.0,0]
[H,4.0,0]
[A,14.0,0]
[D,14.0,0]
[F,30.0,0]
[C,34.0,0]
IMPORTANT: the result is NOT deterministic because "sometimes" correct values are returned
Running the same algorithm directly on the dataframe
val x = sqlContext.sql("select id, sum(value) as sum_val from testntile group by id")
val w = Window.partitionBy("id").orderBy("sum_val")
val resultDF = x.select( x("id"),x("sum_val"), ntile(4).over(w) )
still returns a wrong result.
Am I doing something wrong? Any ideas? Thanks in advance for your answers.
If you use Window.partitionBy("id").orderBy("sum_val"), you are partitioning by id before applying the ntile function. That way every partition contains a single element, so ntile assigns the same value to every id.
To achieve your first result, remove partitionBy("id") and use only Window.orderBy("sum_val").
This is how I modify your code:
val w = Window.orderBy("sum_val")
val resultDF = x.orderBy("sum_val").select( x("id"),x("sum_val"), ntile(4).over(w) )
And this is the output of resultDF.show():
+---+-------+-----+
| id|sum_val|ntile|
+---+-------+-----+
| B| 3| 1|
| E| 4| 1|
| H| 4| 2|
| D| 14| 2|
| A| 14| 3|
| F| 30| 3|
| C| 34| 4|
+---+-------+-----+
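NTILE(n) splits the ordered rows into n buckets of as-equal-as-possible size, with the first len % n buckets receiving one extra row. A plain-Python sketch of that bucketing (not Spark's implementation) over the aggregated sums:

```python
def ntile(rows, n):
    """Assign each ordered row a bucket 1..n; the first len(rows) % n
    buckets each get one extra row, as in SQL's NTILE."""
    size, extra = divmod(len(rows), n)
    out, i = [], 0
    for bucket in range(1, n + 1):
        width = size + (1 if bucket <= extra else 0)
        out.extend((row, bucket) for row in rows[i:i + width])
        i += width
    return out

# (id, sum_val) pairs in sum_val order, as in the Hive result.
sums = [("B", 3), ("H", 4), ("E", 4), ("D", 14), ("A", 14), ("F", 30), ("C", 34)]
```

With 7 rows and 4 tiles, the bucket sizes are 2, 2, 2, 1, which matches the quartiles in the correct Hive output above (ties such as E/H may land in different buckets depending on sort order).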