How to explode columns? - apache-spark

After:
val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")
I have this DataFrame in Apache Spark:
+------+---------+
| Col1 | Col2 |
+------+---------+
| 1 |[2, 3, 4]|
| 1 |[2, 3, 4]|
+------+---------+
How do I convert this into:
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| 1 | 2 | 3 | 4 |
| 1 | 2 | 3 | 4 |
+------+------+------+------+

A solution that doesn't convert to and from an RDD:
df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col4")
Or, arguably nicer:
val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))
The size of a Spark array column is not fixed; you could, for instance, have:
+----+------------+
|Col1| Col2|
+----+------------+
| 1| [2, 3, 4]|
| 1|[2, 3, 4, 5]|
+----+------------+
So there is no general way to determine the number of columns and create them. If you know the size is always the same, you can set nElements like this:
val nElements = df.select("Col2").first.getList(0).size

Just to give the PySpark version of sgvd's answer. If the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:
from pyspark.sql import functions as F
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
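If you also want to keep Col1 and name the new columns Col2, Col3, Col4 as in the Scala answer, you can alias each extracted item. A minimal sketch (the alias scheme is just illustrative):
from pyspark.sql import functions as F

nElements = 3
# keep Col1 and turn the first nElements array items into named columns
df.select(
    ['Col1'] + [F.col('Col2').getItem(i).alias('Col' + str(i + 2)) for i in range(nElements)]
).show()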

Just to add on to sgvd's solution:
If the size is not always the same, you can set nElements like this:
import org.apache.spark.sql.functions.{max, size}

val nElements = df.select(size('Col2).as("Col2_count"))
  .select(max("Col2_count"))
  .first.getInt(0)

You can use a map:
import scala.collection.mutable
import org.apache.spark.sql.Row

df.map {
  case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()

If you are working with SparkR, you can find my answer here, where you don't need to use explode, but you do need SparkR::dapply and stringr::str_split_fixed.

Related

How to create Spark dataframe in Scala from Lists

I have the lists below:
val col1 = List(1,2,3,4,5)
val col2 = List("a", "b", "c", "d", "e")
val col3 = List(6,7,8)
The requirement is to create a dataframe in Scala as shown below:
--------------------
| col1 | col2| col3|
--------------------
| 1 | a | 6 |
| 2 | b | 7 |
| 3 | c | 8 |
| 4 | d | null|
| 5 | e | null|
--------------------
Thank you.
If you're using Scala 2.12 or older, the zipAll function may be helpful. Option is used here for nullability.
val data = col1.map(Option.apply)
  .zipAll(col2.map(Option.apply), None, None)
  .zipAll(col3.map(Option.apply), (None, None), None)
  .map { case ((c1, c2), c3) => (c1, c2, c3) }
val df = data.toDF("col1", "col2", "col3")
Build rows by iterating over the indexes; for example, using lift so that indexes past the end of a shorter list become null:
import sparkSession.implicits._
val maxLen = Seq(col1.size, col2.size, col3.size).max
val rows = (0 until maxLen).map(i => (col1.lift(i), col2.lift(i), col3.lift(i)))
val df = rows.toDF("col1", "col2", "col3")

Pyspark add or remove rows in a dataframe based on another similar dataframe

Consider that I have two dataframes, DF1 and DF2, with the same schema.
What I want to do is:
For each row in DF1,
if DF1.uniqueId exists in DF2 and type is new, then add to DF2 with a repeat count.
if DF1.uniqueId exists in DF2 and type is old, change the DF2 type to the DF1 type (old).
if DF1.uniqueId does not exist in DF2 and type is new, add a new row to DF2.
if DF1.uniqueId does not exist in DF2 and type is old, move that row to a new table, DF3.
i.e., if the tables are as shown below, the resulting (updated) DF2 should look like the resultDF2 table below.
DF1
+----------+--------------------------+
|UniqueID |type_ |
+----------+--------------------------+
|1 |new |
|1 |new |
|1 |new |
|2 |old |
|1 |new |
+----------+--------------------------+
DF2
+----------+--------------------------+
|UniqueID |type_ |
+----------+--------------------------+
| | |
+----------+--------------------------+
resultDF2
+----------+------+------------+
|UniqueID  |type_ |repeatCount |
+----------+------+------------+
|1         |new   |3           |
+----------+------+------------+
resultDF3
+----------+------+------------+
|UniqueID  |type_ |repeatCount |
+----------+------+------------+
|1         |old   |0           |
+----------+------+------------+
** If there is only one entry, repeatCount is zero.
I am trying to achieve this using PySpark.
Can anyone suggest pointers on how to achieve this, considering that I have both tables in memory?
The desired output can be obtained by:
1. Grouping df1 on UniqueID to get repeatCount; during this operation, remove UniqueIDs that have both old and new type_.
2. Applying a full join between the dataframe from step 1 and df2.
3. From the joined result, removing rows where df1.UniqueId is absent from df2 and df1.type_ is old.
4. Finally, selecting UniqueID, type_ and repeatCount.
from pyspark.sql import functions as F
data = [(1, "new",), # Not exists and new
(1, "new",),
(1, "new",),
(2, "old",), # Not exists and old
(1, "new",),
(3, "old",), # cancel out
(3, "new",), # cancel out
(4, "new",), # one entry count zero example
(5, "new",), # Exists and new
(6, "old",), ] # Exists and old
df1 = spark.createDataFrame(data, ("UniqueID", "type_", ))
df2 = spark.createDataFrame([(5, "new", ), (6, "new", ), ], ("UniqueID", "type_", ))
df1_grouped = (df1.groupBy("UniqueID")
               .agg(F.collect_set("type_").alias("types_"),
                    (F.count("type_") - F.lit(1)).alias("repeatCount"))
               .filter(F.size(F.col("types_")) == 1)  # when more than one type of `type_` is present they cancel out
               .withColumn("type_", F.col("types_")[0])
               .drop("types_"))

id_not_exists_old = (df2["UniqueID"].isNull() & (df1_grouped["type_"] == F.lit("old")))

(df1_grouped.join(df2, df1_grouped["UniqueID"] == df2["UniqueID"], "full")
 .filter(~(id_not_exists_old))
 .select(df1_grouped["UniqueID"], df1_grouped["type_"], "repeatCount")
 ).show()
"""
+--------+-----+-----------+
|UniqueID|type_|repeatCount|
+--------+-----+-----------+
| 1| new| 3|
| 4| new| 0|
| 5| new| 0|
| 6| old| 0|
+--------+-----+-----------+
"""

Combine two dataframes with separate keys for one dataframe so I can select two columns based on keys

I want a new column DATE1, equal to the column START in dataframe1 (DF1) on KEY1, combined with dataframe2 (DF2) based on KEY2 in DF2, so that it shows DATE1 only when the key matches in the join. I can show the column START, but it shows all rows.
I want DATE2 equal to the column START in dataframe1 (DF1) on KEY1, but combined with DF2 based on a different key, KEY3, in DF2, so that it shows DATE2 only when that key matches in the join. I can show the column START, but I am not sure how to only show it when combined on the two keys.
Example input for DF1 would be:
+---------+--------+------+------+
|START    |KEY1    |Color |OTHER |
+---------+--------+------+------+
| 10/05/21|       1| White|  3000|
| 10/06/21|       2|  Blue|  4100|
| 10/07/21|       3| Green|  6200|
+---------+--------+------+------+
DF2 input would be:
+---------+--------+------+
|KEY2     |KEY3    |NUMBER|
+---------+--------+------+
|        1|       2|  3000|
|        2|       3|  4100|
|        3|       1|  6200|
+---------+--------+------+
Output would be something like below:
+---------+--------+
|DATE1 | DATE2 |
+---------+--------+
| 10/05/21|10/06/21|
| 10/06/21|10/07/21|
| 10/07/21|10/05/21|
+---------+--------+
Below is my code:
def transform_df_data(df: DataFrame):
    return df \
        .withColumn("DATE1", col("START")) \
        .withColumn("DATE2", col("START")) \
        .withColumn("KEY1", col("KEY1")) \
        .select("KEY1", "DATE1", "DATE2")

def build_final_df(df: DataFrame, otherdf: DataFrame):
    df_transform = transform_df_data(d_period)
    return final_transform.join(df_transform, final_transform.KEY1 == df_transform.KEY2, 'inner') \
        .withColumn("DATE1", col("START")) \
        .select("DATE1", "DATE2")
Not sure I correctly understand the question, but I think you want to join df1 and df2 on KEY1 = KEY2, then join the result again with df1 on KEY1 = KEY3:
import pyspark.sql.functions as F
data1 = [("10/05/21", 1, "White", 3000), ("10/06/21", 2, "Blue", 4100), ("10/07/21", 3, "Green", 6200)]
df1 = spark.createDataFrame(data1, ["START", "KEY1", "Color", "OTHER"])
data2 = [(1, 2, 3000), (2, 3, 4100), (3, 1, 6200)]
df2 = spark.createDataFrame(data2, ["KEY2", "KEY3", "NUMBER"])
df_result = df1.withColumnRenamed("START", "DATE1").join(
    df2,
    F.col("KEY1") == F.col("KEY2")
).select("DATE1", "KEY3").join(
    df1.withColumnRenamed("START", "DATE2"),
    F.col("KEY1") == F.col("KEY3")
).select("DATE1", "DATE2")
df_result.show()
#+--------+--------+
#| DATE1| DATE2|
#+--------+--------+
#|10/07/21|10/05/21|
#|10/05/21|10/06/21|
#|10/06/21|10/07/21|
#+--------+--------+

pyspark counting number of nulls per group

I have a dataframe that has time series data and some categorical data in it:
| cat | TS1 | TS2 | ... |
| A | 1 | null | ... |
| A | 1 | 20 | ... |
| B | null | null | ... |
| A | null | null | ... |
| B | 1 | 100 | ... |
I would like to find out how many null values there are per column per group, so an expected output would look something like:
| cat | TS1 | TS2 |
| A | 1 | 2 |
| B | 1 | 1 |
Currently I can do this for one of the groups with something like this:
df_null_cats = df.where(df.cat == "A") \
    .where(reduce(lambda x, y: x | y, (col(c).isNull() for c in df.columns))) \
    .select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_nulls.columns])
but I am struggling to get one that would work for the whole dataframe.
You can use groupBy and aggregation functions to get the required output:
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local").getOrCreate()
# Sample dataframe
in_values = [("A", 1, None),
("A", 1, 20),
("B", None, None),
("A", None, None),
("B", 1, 100)]
in_df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")
columns = in_df.columns
# Ignoring groupBy column and considering cols which are required in aggregation
columns.remove("cat")
agg_expression = [sum(when(in_df[x].isNull(), 1).otherwise(0)).alias(x) for x in columns]
in_df.groupby("cat").agg(*agg_expression).show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| B| 1| 1|
| A| 1| 2|
+---+---+---+
"Sum" function can be used with condition for null value. On Scala:
import org.apache.spark.sql.functions.{col, sum, when}

val df = Seq(
  (Some("A"), Some(1), None),
  (Some("A"), Some(1), Some(20)),
  (Some("B"), None, None),
  (Some("A"), None, None),
  (Some("B"), Some(1), Some(100))
).toDF("cat", "TS1", "TS2")

val aggregatorColumns = df
  .columns
  .tail
  .map(columnName => sum(when(col(columnName).isNull, 1).otherwise(0)).alias(columnName))

df
  .groupBy("cat")
  .agg(aggregatorColumns.head, aggregatorColumns.tail: _*)
@Mohana's answer is good but it's still not dynamic: you need to code the operation for every single column.
In my answer below, we can use Pandas UDFs and applyInPandas to write a simple function in Pandas which will then be applied to our PySpark dataframe.
import pandas as pd
from pyspark.sql.types import *
in_values = [("A", 1, None),
("A", 1, 20),
("B", None, None),
("A", None, None),
("B", 1, 100)]
df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")
# define output schema: same column names, but we must ensure that the output type is integer
output_schema = StructType(
[StructField('cat', StringType())] + \
[StructField(col, IntegerType(), True) for col in [c for c in df.columns if c.startswith('TS')]]
)
# custom Python function to define aggregations in Pandas
def null_count(pdf):
columns = [c for c in pdf.columns if c.startswith('TS')]
result = pdf\
.groupby('cat')[columns]\
.agg(lambda x: x.isnull().sum())\
.reset_index()
return result
# use applyInPandas
df\
.groupby('cat')\
.applyInPandas(null_count, output_schema)\
.show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| A| 1| 2|
| B| 1| 1|
+---+---+---+

Spark: Join two dataframes on an array type column

I have a simple use case:
I have two dataframes, df1 and df2, and I am looking for an efficient way to join them.
df1: Contains my main dataframe (billions of records)
+--------+-----------+--------------+
|doc_id |doc_name |doc_type_id |
+--------+-----------+--------------+
| 1 |doc_name_1 |[1,4] |
| 2 |doc_name_2 |[3,2,6] |
+--------+-----------+--------------+
df2: Contains the labels of the doc types (40,000 records); as it's a small one, I am broadcasting it.
+------------+----------------+
|doc_type_id |doc_type_name |
+------------+----------------+
| 1 |doc_type_1 |
| 2 |doc_type_2 |
| 3 |doc_type_3 |
| 4 |doc_type_4 |
| 5 |doc_type_5 |
| 6 |doc_type_6 |
+------------+----------------+
I would like to join these two dataframes to result in something like this:
+--------+------------+--------------+----------------------------------------+
|doc_id |doc_name |doc_type_id |doc_type_name |
+--------+------------+--------------+----------------------------------------+
| 1 |doc_name_1 |[1,4] |["doc_type_1","doc_type_4"] |
| 2 |doc_name_2 |[3,2,6] |["doc_type_3","doc_type_2","doc_type_6"]|
+--------+------------+--------------+----------------------------------------+
Thanks
We can use array_contains + groupBy + collect_list functions for this case.
Example:
val df1=Seq(("1","doc_name_1",Seq(1,4)),("2","doc_name_2",Seq(3,2,6))).toDF("doc_id","doc_name","doc_type_id")
val df2=Seq(("1","doc_type_1"),("2","doc_type_2"),("3","doc_type_3"),("4","doc_type_4"),("5","doc_type_5"),("6","doc_type_6")).toDF("doc_type_id","doc_type_name")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df1.createOrReplaceTempView("tbl")
df2.createOrReplaceTempView("tbl2")
spark.sql("select a.doc_id,a.doc_name,a.doc_type_id,collect_list(b.doc_type_name) doc_type_name from tbl a join tbl2 b on array_contains(a.doc_type_id,int(b.doc_type_id)) = TRUE group by a.doc_id,a.doc_name,a.doc_type_id").show(false)
//+------+----------+-----------+------------------------------------+
//|doc_id|doc_name |doc_type_id|doc_type_name |
//+------+----------+-----------+------------------------------------+
//|2 |doc_name_2|[3, 2, 6] |[doc_type_2, doc_type_3, doc_type_6]|
//|1 |doc_name_1|[1, 4] |[doc_type_1, doc_type_4] |
//+------+----------+-----------+------------------------------------+
Other way to achieve is by using explode + join + collect_list:
val df3 = df1.withColumn("arr", explode(col("doc_type_id")))

df3.join(df2, df2.col("doc_type_id") === df3.col("arr"), "inner")
  .groupBy(df3.col("doc_id"), df3.col("doc_type_id"), df3.col("doc_name"))
  .agg(collect_list(df2.col("doc_type_name")).alias("doc_type_name"))
  .show(false)
//+------+-----------+----------+------------------------------------+
//|doc_id|doc_type_id|doc_name |doc_type_name |
//+------+-----------+----------+------------------------------------+
//|1 |[1, 4] |doc_name_1|[doc_type_1, doc_type_4] |
//|2 |[3, 2, 6] |doc_name_2|[doc_type_2, doc_type_3, doc_type_6]|
//+------+-----------+----------+------------------------------------+
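For completeness, here is a rough PySpark sketch of the same explode + join + collect_list approach; the variable names and the broadcast hint are illustrative, not part of the original answer:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("1", "doc_name_1", [1, 4]), ("2", "doc_name_2", [3, 2, 6])],
    ["doc_id", "doc_name", "doc_type_id"])
df2 = spark.createDataFrame(
    [(i, "doc_type_" + str(i)) for i in range(1, 7)],
    ["doc_type_id", "doc_type_name"])

# explode the array column, join each element against the (broadcast) label table,
# then re-group and collect the labels back into an array
exploded = df1.withColumn("one_type_id", F.explode("doc_type_id"))
(exploded.join(F.broadcast(df2), exploded["one_type_id"] == df2["doc_type_id"])
    .groupBy(exploded["doc_id"], exploded["doc_name"], exploded["doc_type_id"])
    .agg(F.collect_list(df2["doc_type_name"]).alias("doc_type_name"))
    .show(truncate=False))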
