Spark: how to process certain column content individually in dataframe?

Spark: how to process certain column content individually in dataframe? - apache-spark

The data structure is like this:
id
name
data
001
aaa
true,false,false
002
bbb
true,true,true
003
ccc
false,true,true
I want to map the results in data to their names by their corresponding orders in the mapping table. In detail, the first step is to get the order number of False in data and then get the name by order number in the mapping table.
For example, the first record has two False and their index numbers are 2 and 3, so the mapping result is code2 and code3. Also, there are all true in the second record so the mapping result is an empty string.
the mapping table: ("code1","code2","code3")
the expected result:
id
name
data
001
aaa
code2,code3
002
bbb
003
ccc
code1
Is it possible to achieve this in the dataframe?

If you are using spark 3+ you can use filter and transform functions as
val df = Seq(
("001", "aaa", "true,false,false"),
("002", "bbb", "true,true,true"),
("003", "ccc", "false,true,true"),
).toDF("id", "name", "data")
val cols = Seq("col1", "col2", "col3")
val dfNew = df.withColumn("data", split($"data", ","))
.withColumn("mapping", arrays_zip($"data", typedLit(cols)))
.withColumn("new1", filter($"mapping", (c: Column) => c.getField("data") === "false"))
.withColumn("data", transform($"new1", (c: Column) => c.getField("1")))
.drop("new1", "mapping")
dfNew.show(false)
Output:
+---+----+------------+
|id |name|data |
+---+----+------------+
|001|aaa |[col2, col3]|
|002|bbb |[] |
|003|ccc |[col1] |
+---+----+------------+

The following should work but be aware that it features a posexplode (explode an array with positional value) which can be a costly operation specially if you have a huge dataset.
val df = Seq(
("001", "aaa", "true,false,false"),
("002", "bbb", "true,true,true"),
("003", "ccc", "false,true,true")
).toDF("id", "name", "data")
val codes = Seq(
(0, "code1"),
(1, "code2"),
(2, "code3")
).toDF("code_id", "codes")
val df1 = df.select($"*", posexplode(split($"data", ",")))
.join(codes, $"pos" === $"code_id")
.withColumn( "codes", when($"col" === "false", $"codes").otherwise(null) )
//+---+----+----------------+---+-----+-------+-----+
//| id|name| data|pos| col|code_id|codes|
//+---+----+----------------+---+-----+-------+-----+
//|001| aaa|true,false,false| 0| true| 0| null|
//|001| aaa|true,false,false| 1|false| 1|code2|
//|001| aaa|true,false,false| 2|false| 2|code3|
//|002| bbb| true,true,true| 0| true| 0| null|
//|002| bbb| true,true,true| 1| true| 1| null|
//|002| bbb| true,true,true| 2| true| 2| null|
//|003| ccc| false,true,true| 0|false| 0|code1|
//|003| ccc| false,true,true| 1| true| 1| null|
//|003| ccc| false,true,true| 2| true| 2| null|
//+---+----+----------------+---+-----+-------+-----+
val finalDf = df1.groupBy($"id", $"name").agg(concat_ws(",", collect_list($"codes")).as("data"))
//+---+----+-----------+
//| id|name| data|
//+---+----+-----------+
//|002| bbb| |
//|001| aaa|code2,code3|
//|003| ccc| code1|
//+---+----+-----------+

Related

Generate n columns in dataframe based on mutiple values

I have created dataframe like this from a table
df = spark.sql("select * from test") # it is having 2 columns id and name
df2 = df.groupby('id').agg(collect_list('name')
df2.show()
|id|name|
|44038:4572|[0032477212299451]|
|44038:5439|[00324772, 0032477, 003247, 00324]|
|44038:4429|[0032477212299308]|
Until here it's correct, for one id I can store multiple names (values).
Now when I try to create dynamic columns into dataframe based on values, it is not working.
df3 = df2.select([df2.id] + [df2.name[i] for i in range (length)])
Output:
|id |name[0]|
|44038:4572|0032477212299451|
|44038:5439|00324772|
|44038:4429|032477212299308|
Expected output in dataframe:
|id|name[0]|name[1]|name[2]|name[3]|
|44038:4572|0032477212299451|null|null|null|
|44038:5439|00324772|0032477|003247|0034|
|44038:4429|032477212299308|null|null|null|
And then have to replace null with 0.

You might be better off doing pivot instead of collect_list:
from pyspark.sql import functions as F, Window
df2 = (df.withColumn('rn', F.row_number().over(Window.partitionBy('id').orderBy(F.desc('name'))))
.groupBy('id')
.pivot('rn')
.agg(F.first('name'))
.fillna("0")
)
df2.show()
+----------+----------------+-------+------+-----+
| id| 1| 2| 3| 4|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+
If you want pretty column names, you can do
df3 = df2.toDF('id', *[f'name{i}' for i in range(len(df2.columns) - 1)])
df3.show()
+----------+----------------+-------+------+-----+
| id| name0| name1| name2|name3|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+

How to compute the numerical difference between columns of different dataframes?

Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).
For instance let us have the following datasets
DataFrame A:
+----+---+
| A | B |
+----+---+
| 1| 0|
| 1| 0|
+----+---+
DataFrame B:
----+---+
| A | B |
+----+---+
| 1| 0 |
| 0| 0 |
+----+---+
How to obtain B-A, i.e
+----+---+
| c1 | c2|
+----+---+
| 0| 0 |
| -1| 0 |
+----+---+
In practice the real dataframes have a consequent number of rows and 50+ columns for which the difference need to be computed. What is the Spark/Scala way of doing it?

I was able to solve this by using the approach below. This code can work with any number of columns. You just have to change the input DFs accordingly.
import org.apache.spark.sql.Row
val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")
val columns = df0.columns
val rdd = df0.rdd.zip(df1.rdd).map {
x =>
val arr = columns.map(column =>
x._2.getAs[Int](column) - x._1.getAs[Int](column))
Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+

If your df A is the same as df B you can try below approach. I don't know if this will work correct for large datasets, it will be better to have id for joining already instead of creating it using monotonically_increasing_id().
import spark.implicits._
import org.apache.spark.sql.functions._
val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")
// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)
// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
.withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
.withColumn("id", monotonically_increasing_id())
dfA.show()
dfB.show()
// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))
// diff between cols
val calcCols = (dfACols zip dfBCols).map(s=>s._2-s._1)
// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols:_*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+

Spark dataframe self-joins are producing empty dataframe as a result

Below is my data in csv which I read into dataframe.
id,pid,pname,ppid
1, 1, 5, -1
2, 1, 7, -1
3, 2, 9, 1
4, 2, 11, 1
5, 3, 5, 1
6, 4, 7, 2
7, 1, 9, 3
I am reading that data into a dataframe data_df. I am tryng to do a self-join on different columns. But the results dataframes are empty. Have tried multiple options.
Below is my code. Only the last joined4 is producing the result.
val joined = data_df.as("first").join(data_df.as("second")).where( col("first.ppid") === col("second.pid"))
joined.show(50, truncate = false)
val joined2 = data_df.as("first").join(data_df.as("second"), col("first.ppid") === col("second.pid"), "inner")
joined2.show(50, truncate = false)
val df1 = data_df.as("df1")
val df2 = data_df.as("df2")
val joined3 = df1.join(df2, $"df1.ppid" === $"df2.id")
joined3.show(50, truncate = false)
val joined4 = data_df.as("df1").join(data_df.as("df2"), Seq("id"))
joined4.show(50, truncate = false)
Below are the output of joined, joined2, joined3, joined4 respectively :
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+-----+----+
|id |pid|pname|ppid|pid|pname|ppid|
+---+---+-----+----+---+-----+----+
| 1 | 1| 5| -1| 1| 5| -1|
| 2 | 1| 7| -1| 1| 7| -1|
| 3 | 2| 9| 1| 2| 9| 1|
| 4 | 2| 11| 1| 2| 11| 1|
| 5 | 3| 5| 1| 3| 5| 1|
| 6 | 4| 7| 2| 4| 7| 2|
| 7 | 1| 9| 3| 1| 9| 3|
+---+---+-----+----+---+-----+----+

Sorry, later on figured out that the spaces in the csv were causing the issue. If I create a correctly structured csv of the initial data, the problem disappears.
Correct csv format as follows.
id,pid,pname,ppid
1,1,5,-1
2,1,7,-1
3,2,9,1
4,2,1,1
5,3,5,1
6,4,7,2
7,1,9,3
Ideally, I can also use the option to ignore leading whitespaces as shown in the following answer :
val data_df = spark.read
.schema(dataSchema)
.option("mode", "FAILFAST")
.option("header", "true")
.option("ignoreLeadingWhiteSpace", "true")
.csv(dataSourceName)
pySpark (v2.4) DataFrameReader adds leading whitespace to column names

update a dataframe column with new values

df1 has fields id and json; df2 has fields idand json
df1.count() => 1200; df2.count() => 20
df1 has all the rows. df2 has an incremental update with just 20 rows.
My goal is to update df1 with the values from df2. All the ids of df2 are in df1. But df2 has updated values(in the json field) for those same ids.
Resulting df should have all the values from df1 and updated values from df2.
What is the best way to do this? - With the least number of joins and filters.
Thanks!

You can achieve this using one left join.
Create Example DataFrames
Using the sample data provided by #Shankar Koirala in his answer.
data1 = [
(1, "a"),
(2, "b"),
(3, "c")
]
df1 = sqlCtx.createDataFrame(data1, ["id", "value"])
data2 = [
(1, "x"),
(2, "y")
]
df2 = sqlCtx.createDataFrame(data2, ["id", "value"])
Do a left join
Join the two DataFrames using a left join on the id column. This will keep all of the rows in the left DataFrame. For the rows in the right DataFrame that don't have a matching id, the value will be null.
import pyspark.sql.functions as f
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
.select(
'id',
f.col('l.value').alias('left_value'),
f.col('r.value').alias('right_value')
)\
.show()
#+---+----------+-----------+
#| id|left_value|right_value|
#+---+----------+-----------+
#| 1| a| x|
#| 3| c| null|
#| 2| b| y|
#+---+----------+-----------+
Select the desired data
We will use the fact that the unmatched ids have a null to select the final columns. Use pyspark.sql.functions.when() to use the right value if it is not null, otherwise keep the left value.
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
.select(
'id',
f.when(
~f.isnull(f.col('r.value')),
f.col('r.value')
).otherwise(f.col('l.value')).alias('value')
)\
.show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+
You can sort this output if you want the ids in order.
Using pyspark-sql
You can do the same thing using a pyspark-sql query:
df1.registerTempTable('df1')
df2.registerTempTable('df2')
query = """SELECT l.id,
CASE WHEN r.value IS NOT NULL THEN r.value ELSE l.value END AS value
FROM df1 l LEFT JOIN df2 r ON l.id = r.id"""
sqlCtx.sql(query.replace("\n", "")).show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+

I would like to provide a slightly more general solution. What happens if the input data has 100 columns instead of 2? We would spend too much time making a coalesce of those 100 columns to keep the values on the right side of the left join.
Another way to solve this problem would be to "delete" the updated rows from the original df and finally make a union with the updated rows.
data_orginal = spark.createDataFrame([
(1, "a"),
(2, "b"),
(3, "c")
], ("id", "value"))
data_updated = spark.createDataFrame([
(1, "x"),
(2, "y")
], ("id", "value"))
data_orginal.show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
data_updated.show()
+---+-----+
| id|value|
+---+-----+
| 1| x|
| 2| y|
+---+-----+
data_orginal.createOrReplaceTempView("data_orginal")
data_updated.createOrReplaceTempView("data_updated")
src_data_except_updated = spark.sql(f"SELECT * FROM data_orginal WHERE id not in (1,2)")
result_data = src_data_except_updated.union(data_updated)
result_data.show()
+---+-----+
| id|value|
+---+-----+
| 3| c|
| 1| x|
| 2| y|
+---+-----+
Notice that the query
SELECT * FROM data_orginal WHERE id not in (1,2)
could be generated automatically:
ids_collect = spark.sql(f"SELECT id FROM data_updated").collect()
ids_list = [f"{x.id}" for x in ids_collect]
ids_str = ",".join(ids_list)
query_get_all_except = f"SELECT * FROM data_original WHERE id not in ({ids_str})"

Explode array data into rows in spark [duplicate]

This question already has answers here:
Dividing complex rows of dataframe to simple rows in Pyspark
(3 answers)
Closed 5 years ago.
I have a dataset in the following way:
FieldA FieldB ArrayField
1 A {1,2,3}
2 B {3,5}
I would like to explode the data on ArrayField so the output will look in the following way:
FieldA FieldB ExplodedField
1 A 1
1 A 2
1 A 3
2 B 3
2 B 5
I mean I want to generate an output line for each item in the array the in ArrayField while keeping the values of the other fields.
How would you implement it in Spark.
Notice that the input dataset is very large.

The explode function should get that done.
pyspark version:
>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])],["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| 1|
| 1| A| 2|
| 1| A| 3|
| 2| B| 3|
| 2| B| 5|
+----+----+----+
Scala version
scala> val df = Seq((1, "A", Seq(1,2,3)), (2, "B", Seq(3,5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]
scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| 1|
| 1| A| 2|
| 1| A| 3|
| 2| B| 3|
| 2| B| 5|
+----+----+----+

You can use explode function
Below is the simple example for your case
import org.apache.spark.sql.functions._
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(1, "A", List(1,2,3)),
(2, "B", List(3, 5))
)).toDF("FieldA", "FieldB", "FieldC")
data.withColumn("ExplodedField", explode($"FieldC")).drop("FieldC")
Hope this helps!

explode does exactly what you want. Docs:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
Also, here is an example from a different question using it:
https://stackoverflow.com/a/44418598/1461187

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Spark: how to process certain column content individually in dataframe? - apache-spark

Related

Generate n columns in dataframe based on mutiple values

How to compute the numerical difference between columns of different dataframes?

Spark dataframe self-joins are producing empty dataframe as a result

update a dataframe column with new values

Explode array data into rows in spark [duplicate]

Categories

Resources