Given two dataframes which may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to Spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so that when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of the same name and type, yet represent completely different data, so merging the schemas is not an option.
What you need is a full outer join, possibly renaming one of the clashing columns first, something like df1.join(df2.withColumnRenamed("length", "length2"), Seq("timestamp"), "full_outer").
See this example, built from yours (just less typing)
// data shaped as your example
case class t1(ts: Int, width: Int, l: Int)
case class t2(ts: Int, name: String, l: Int)
// create data frames
val df1 = Seq(t1(1,10,20),t1(3,5,3)).toDF
val df2 = Seq(t2(0,"sample",3),t2(2,"test",6)).toDF
df1.join(df2.withColumnRenamed("l","l2"),Seq("ts"),"full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
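If you are working from PySpark instead, the same idea looks roughly like the sketch below. It also shows one way to get to the list of objects from the question: sort by timestamp, collect, and pick the object type depending on which side of the join is non-null. ObjectType1 and ObjectType2 are the classes from the question; everything else follows the column names shown above.
# rough PySpark sketch of the same full outer join, using the original column names
joined = (df1.join(df2.withColumnRenamed("length", "length2"),
                   ["timestamp"], "full_outer")
             .sort("timestamp"))
objs = []
for row in joined.collect():
    if row["name"] is not None:   # this timestamp came from df2
        objs.append(ObjectType1(timestamp=row["timestamp"],
                                name=row["name"], length=row["length2"]))
    else:                         # this timestamp came from df1
        objs.append(ObjectType2(timestamp=row["timestamp"],
                                length=row["length"], width=row["width"]))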
What is the difference between explode and explode_outer? The documentation for both functions is the same, and the examples for both functions are identical:
SELECT explode(array(10, 20));
10
20
and
SELECT explode_outer(array(10, 20));
10
20
The Spark source suggests that there is a difference between the two functions
expression[Explode]("explode"),
expressionGeneratorOuter[Explode]("explode_outer")
but what is the effect of expressionGeneratorOuter compared to expression?
explode creates a row for each element of the array or map column and drops rows where the array or map is null or empty, whereas explode_outer keeps those rows and produces null for them.
For example, given the following DataFrame:
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
2 | Lucy | null
explode gives the following output:
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
Whereas explode_outer gives the following output:
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
SELECT explode(col1) from values (array(10,20)), (null)
returns
+---+
|col|
+---+
| 10|
| 20|
+---+
while
SELECT explode_outer(col1) from values (array(10,20)), (null)
returns
+----+
| col|
+----+
| 10|
| 20|
|null|
+----+
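The same behaviour can be reproduced with the DataFrame API. A minimal PySpark sketch on the Luke/Lucy data from above (only the values shown in the example are assumed):
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, "Luke", ["baseball", "soccer"]), (2, "Lucy", None)],
    ["id", "name", "likes"])
# explode: the Lucy row disappears because likes is null
df.select("id", "name", F.explode("likes").alias("like")).show()
# explode_outer: the Lucy row is kept, with null in the exploded column
df.select("id", "name", F.explode_outer("likes").alias("like")).show()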
I am trying to implement a custom explode in PySpark. I have 4 columns that are arrays of structs with virtually the same schema (one column's structs contain one field less than the other three).
For each row in my DataFrame, I have 4 columns that are arrays of structs. The columns are students, teaching_assistants, teachers, administrators.
The students, teaching_assistants and teachers are arrays of structs with field id, student_level and name.
For example, take a sample row of the DataFrame for School_id 1999.
The students, teaching_assistants and teachers structs all have the same schema ("id", "student_level", "name"), while the administrators struct has only the "id" and "name" fields and is missing student_level.
I want to perform a custom explode such that for every row I get one entry for each student, teaching assistant, teacher and administrator, along with the original column name in case I have to search by "person type".
For that sample row, the output would be these 8 rows:
+-----------+---------------------+----+---------------+----------+
| School_id | type | id | student_level | name |
+-----------+---------------------+----+---------------+----------+
| 1999 | students | 1 | 0 | Brian |
| 1999 | students | 9 | 2 | Max |
| 1999 | teaching_assistants | 19 | 0 | Xander |
| 1999 | teachers | 21 | 0 | Charlene |
| 1999 | teachers | 12 | 2 | Rob |
| 1999 | administrators | 23 | None | Marsha |
| 1999 | administrators | 11 | None | Ryan |
| 1999 | administrators | 14 | None | Bob |
+-----------+---------------------+----+---------------+----------+
For the administrators, the student_level column would just be null. The problem is that if I use the explode function, I end up with all of these items in different columns.
Is it possible to accomplish this in PySpark? One thought I had was to figure out how to combine the 4 array columns into 1 array and then do an explode on that array, although I am not sure whether combining arrays of structs and getting the column names as a field is feasible (I've tried various things), and I also don't know whether it would work with the administrators missing a field.
In the past, I have done this by converting to an RDD and using flatMap with a custom UDF, but it was very inefficient for millions of rows.
The idea is to use stack to transform the columns students, teaching_assistants, teachers and administrators into separate rows, together with a type column recording which column each row came from. After that, the column containing the data can be exploded, and the fields of the individual structs can be turned into separate columns.
Using stack requires that all stacked columns have the same type. This means that all columns must contain arrays of the same struct, and the nullability of all struct fields must match as well. Therefore, the administrators column has to be converted into the correct struct type first.
df.withColumn("administrators", F.expr("transform(administrators, " +
"a -> if(1<2,named_struct('id', a.id, 'name', a.name, 'student_level', "+
"cast(null as long)),null))"))\
.select("School_id", F.expr("stack(4, 'students', students, "+
"'teaching_assistants', teaching_assistants, 'teachers', teachers, "+
"'administrators', administrators) as (type, temp1)")) \
.withColumn("temp2", F.explode("temp1"))\
.select("School_id", "type", "temp2.id", "temp2.name", "temp2.student_level")\
.show()
prints
+---------+-------------------+---+--------+-------------+
|School_id| type| id| name|student_level|
+---------+-------------------+---+--------+-------------+
| 1999| students| 1| Brian| 0|
| 1999| students| 9| Max| 2|
| 1999|teaching_assistants| 19| Xander| 0|
| 1999| teachers| 21|Charlene| 0|
| 1999| teachers| 12| Rob| 2|
| 1999| administrators| 23| Marsha| null|
| 1999| administrators| 11| Ryan| null|
| 1999| administrators| 14| Bob| null|
+---------+-------------------+---+--------+-------------+
The strange looking if(1<2, named_struct(...), null) in the transform expression is necessary to set the correct nullability for the elements of the administrators array.
This solution works for Spark 2.4+. If it was possible to transform the administrators struct in a previous step, this solution would also work for earlier versions.
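A quick way to check that the administrators column really ends up with the same element type as the other three (nullable flags included) is to print the schema after the transform step. A small sketch, where df_fixed is just a name introduced for the intermediate result:
from pyspark.sql import functions as F
df_fixed = df.withColumn("administrators", F.expr(
    "transform(administrators, a -> if(1<2, named_struct("
    "'id', a.id, 'name', a.name, 'student_level', cast(null as long)), null))"))
# both columns should now report identical array<struct<...>> element types;
# otherwise stack(4, ...) fails with a data type mismatch error
df_fixed.select("students", "administrators").printSchema()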
I have a Spark job that is scheduled to run periodically.
When I write the result DataFrame to a data target (S3, HDFS, a DB...), I want what Spark writes to contain no duplicates for a specific column.
EXAMPLE:
Let's say that MY_ID is the unique column.
1st execution:
--------------
|MY_ID|MY_VAL|
--------------
| 1 | 5 |
| 2 | 9 |
| 3 | 6 |
--------------
2nd execution:
--------------
|MY_ID|MY_VAL|
--------------
| 2 | 9 |
| 3 | 2 |
| 4 | 4 |
--------------
What I am expecting to find in the Data Target after the 2 executions is something like this:
--------------
|MY_ID|MY_VAL|
--------------
| 1 | 5 |
| 2 | 9 |
| 3 | 6 |
| 4 | 4 |
--------------
Where the expected output is the result of the first execution with the results of the second execution appended. In case the value for MY_ID already exists, the old one is kept and the result of the new execution is discarded (in this case the 2nd execution wants to write MY_VAL 2 for MY_ID 3, but since a record for MY_ID 3 already exists from the 1st execution, the new record is discarded).
So the distinct() function is not enough to guarantee this condition; the uniqueness of the MY_ID column has to be preserved even in the dumped output.
Is there any solution that can guarantee this property at a reasonable computational cost? (It is basically the same idea as UNIQUE in relational databases.)
You can do a full outer join on the first & second iterations.
val joined = firstIteration.join(secondIteration, Seq("MY_ID"), "fullouter")
scala> joined.show
+-----+------+------+
|MY_ID|MY_VAL|MY_VAL|
+-----+------+------+
| 1| 5| null|
| 3| 6| 2|
| 4| null| 4|
| 2| 9| 9|
+-----+------+------+
From the resulting table, if firstIteration's MY_VAL has a value, use it as it is. If it is null (which indicates that the key only occurs in the second iteration), use the value from secondIteration's MY_VAL.
scala> joined.withColumn("result", when(firstIteration.col("MY_VAL").isNull, secondIteration.col("MY_VAL"))
.otherwise(firstIteration.col("MY_VAL")))
.drop("MY_VAL")
.show
+-----+------+
|MY_ID|result|
+-----+------+
| 1| 5|
| 3| 6|
| 4| 4|
| 2| 9|
+-----+------+
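The same null handling can also be written with coalesce, which takes the first non-null value. A rough PySpark equivalent, assuming the two runs are available as DataFrames named first_run and second_run (names made up for this sketch):
from pyspark.sql import functions as F
# keep the value from the first run when it exists, otherwise take the new one
merged = (first_run.alias("a")
          .join(second_run.alias("b"), "MY_ID", "full_outer")
          .select("MY_ID",
                  F.coalesce(F.col("a.MY_VAL"), F.col("b.MY_VAL")).alias("MY_VAL")))
merged.show()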
Not sure whether you are using Scala or Python, but have a look at the dropDuplicates function, which allows you to specify one or more columns:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
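For reference, a rough PySpark sketch of that approach: union what was already written with the new batch and drop duplicates on MY_ID. Note that dropDuplicates keeps an arbitrary row per key, so by itself it does not guarantee that the value from the earlier execution wins; existing_df, new_df and the output path are placeholders.
# existing_df: what is already in the data target, new_df: this execution's result
deduped = existing_df.unionByName(new_df).dropDuplicates(["MY_ID"])
deduped.write.mode("overwrite").parquet("s3://some-bucket/some-table")  # placeholder path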
Sorry for the vague title, I can't think of a better way to put it. I understand a bit of Python and have some experience with Pandas dataframes, but recently I have been tasked to look at something involving Spark and I'm struggling to get my head around it.
I suppose the best way to explain this is with a small example. Imagine I have dataframe A:
id | Name |
--------------
1 | Random |
2 | Random |
3 | Random |
As well as dataframe B:
id | Fruit |
-------------
1 | Pear |
2 | Pear |
2 | Apple |
2 | Banana |
3 | Pear |
3 | Banana |
Now what I'm trying to do is match dataframe A with B (based on id matching), and iterate through the Fruit column in dataframe B. If a value comes up (say Banana), I want to add it as a column to dataframe A. This could be a simple count (every time Banana comes up, add 1 to the column), or just a flag if it comes up at least once. So for example, an output could look like this:
id | Name | Banana
---------------------
1 | Random | 0
2 | Random | 1
3 | Random | 1
My issue is iterating through Spark dataframes, and how I can connect the two if the match does occur. I was trying to do something to this effect:
def fruit(input):
fruits = {"Banana" : "B"}
return fruits[input]
fruits = df.withColumn("Output", fruit("Fruit"))
But it's not really working. Any ideas? Apologies in advance, my experience with Spark is very limited.
Hope this helps!
#sample data
A = sc.parallelize([(1,"Random"), (2,"Random"), (3,"Random")]).toDF(["id", "Name"])
B = sc.parallelize([(1,"Pear"), (2,"Pear"), (2,"Apple"), (2,"Banana"), (3,"Pear"), (3,"Banana")]).toDF(["id", "Fruit"])
df_temp = A.join(B, A.id==B.id, 'inner').drop(B.id)
df = df_temp.groupby(df_temp.id, df_temp.Name).\
pivot("Fruit").\
count().\
na.fill(0)
df.show()
Output is
+---+------+-----+------+----+
| id| Name|Apple|Banana|Pear|
+---+------+-----+------+----+
| 1|Random| 0| 0| 1|
| 3|Random| 0| 1| 1|
| 2|Random| 1| 1| 1|
+---+------+-----+------+----+
Edit note: in case you are only interested in a few fruits, then:
from pyspark.sql.functions import col
#list of fruits you are interested in
fruit_list = ["Pear", "Banana"]
df = df_temp.\
filter(col('fruit').isin(fruit_list)).\
groupby(df_temp.id, df_temp.Name).\
pivot("Fruit").\
count().\
na.fill(0)
df.show()
+---+------+------+----+
| id| Name|Banana|Pear|
+---+------+------+----+
| 1|Random| 0| 1|
| 3|Random| 1| 1|
| 2|Random| 1| 1|
+---+------+------+----+
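And if you want exactly the shape from the question (just a Banana indicator per id), you can keep only that pivoted column and cap the count at 1; a small sketch on top of the df built above:
from pyspark.sql.functions import col, when
# turn the Banana count into a 0/1 flag and drop the other fruit columns
df.select("id", "Name",
          when(col("Banana") > 0, 1).otherwise(0).alias("Banana")).show()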
I have the following two DataFrames:
l1 = [(['hello','world'],), (['stack','overflow'],), (['hello', 'alice'],), (['sample', 'text'],)]
df1 = spark.createDataFrame(l1)
l2 = [(['big','world'],), (['sample','overflow', 'alice', 'text', 'bob'],), (['hello', 'sample'],)]
df2 = spark.createDataFrame(l2)
df1:
["hello","world"]
["stack","overflow"]
["hello","alice"]
["sample","text"]
df2:
["big","world"]
["sample","overflow","alice","text","bob"]
["hello", "sample"]
For every row in df1, I want to calculate how many of the words in its array also occur in df2, counted per row of df2 and summed over all rows of df2.
For example, the first row in df1 is ["hello","world"]. Now, I want to check df2 for the intersection of ["hello","world"] with every row in df2.
| ARRAY | INTERSECTION | LEN(INTERSECTION)|
|["big","world"] |["world"] | 1 |
|["sample","overflow","alice","text","bob"] |[] | 0 |
|["hello","sample"] |["hello"] | 1 |
Now, I want to return sum(len(intersection)). Ultimately I want the resulting df1 to look like this:
df1 result:
ARRAY INTERSECTION_TOTAL
| ["hello","world"] | 2 |
| ["stack","overflow"] | 1 |
| ["hello","alice"] | 2 |
| ["sample","text"] | 3 |
How do I solve this?
I'd focus on avoiding the Cartesian product first. I'd try to explode and join:
from pyspark.sql.functions import explode, monotonically_increasing_id
df1_ = (df1.toDF("words")
.withColumn("id_1", monotonically_increasing_id())
.select("*", explode("words").alias("word")))
df2_ = (df2.toDF("words")
.withColumn("id_2", monotonically_increasing_id())
.select("id_2", explode("words").alias("word")))
(df1_.join(df2_, "word").groupBy("id_1", "id_2", "words").count()
.groupBy("id_1", "words").sum("count").drop("id_1").show())
+-----------------+----------+
| words|sum(count)|
+-----------------+----------+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+----------+
If intermediate values are not needed it could be simplified to:
df1_.join(df2_, "word").groupBy("words").count().show()
+-----------------+-----+
| words|count|
+-----------------+-----+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+-----+
and you could omit adding ids.
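One thing to keep in mind: the join on "word" is an inner join, so rows of df1 whose words never appear anywhere in df2 drop out of the result. If those rows should appear with a total of 0, the counts can be joined back onto df1; a sketch building on the df1_ and df2_ frames above:
counts = (df1_.join(df2_, "word")
          .groupBy("id_1", "id_2", "words").count()
          .groupBy("id_1", "words").sum("count")
          .withColumnRenamed("sum(count)", "INTERSECTION_TOTAL"))
# left join back so that rows with no matches keep a total of 0
result = (df1_.select("id_1", "words").distinct()
          .join(counts.select("id_1", "INTERSECTION_TOTAL"), "id_1", "left")
          .na.fill(0, ["INTERSECTION_TOTAL"])
          .drop("id_1"))
result.show()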