Explode array values into multiple columns using PySpark - apache-spark

I am new to PySpark and I want to explode array values in such a way that each value gets assigned to a new column. I tried using explode but I couldn't get the desired output.
This is the code:
from pyspark.sql import *
from pyspark.sql.functions import explode

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[3]") \
        .appName("DataOps") \
        .getOrCreate()

    dataFrameJSON = spark.read \
        .option("multiLine", True) \
        .option("mode", "PERMISSIVE") \
        .json("data.json")
    dataFrameJSON.printSchema()

    sub_DF = dataFrameJSON.select(explode("values.line").alias("new_values"))
    sub_DF.printSchema()

    sub_DF2 = sub_DF.select("new_values.*")
    sub_DF2.printSchema()
    sub_DF.show(truncate=False)

    new_DF = sub_DF2.select("id", "period.*", "property")
    new_DF.show(truncate=False)
    new_DF.printSchema()
This is the data:
{
  "values": {
    "line": [
      {
        "id": 1,
        "period": {
          "start_ts": "2020-01-01T00:00:00",
          "end_ts": "2020-01-01T00:15:00"
        },
        "property": [
          { "name": "PID", "val": "P120E12345678" },
          { "name": "EngID", "val": "PANELID00000000" },
          { "name": "TownIstat", "val": "12058091" },
          { "name": "ActiveEng", "val": "5678.1" }
        ]
      }
    ]
  }
}

Could you include the data instead of screenshots?
Meanwhile, assuming that df is the dataframe being used, what we need to do is create a new dataframe, extracting the vals from the property array into new columns, and finally drop the property column:
from pyspark.sql.functions import col

output_df = df.withColumn("PID", col("property")[0].val) \
    .withColumn("EngID", col("property")[1].val) \
    .withColumn("TownIstat", col("property")[2].val) \
    .withColumn("ActiveEng", col("property")[3].val) \
    .drop("property")
In case the element was of type ArrayType, use the following:
from pyspark.sql.functions import col

output_df = df.withColumn("PID", col("property")[0][1]) \
    .withColumn("EngID", col("property")[1][1]) \
    .withColumn("TownIstat", col("property")[2][1]) \
    .withColumn("ActiveEng", col("property")[3][1]) \
    .drop("property")
explode will explode the array into new rows, not columns; see this: pyspark explode
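For example, a minimal sketch of that difference, reusing sub_DF2 from the question above:
from pyspark.sql import functions as F

# explode produces one row per {name, val} struct in the array ...
exploded = sub_DF2.select("id", F.explode("property").alias("prop"))
exploded.show(truncate=False)

# ... whereas indexing into the array, as in the answer above, produces one column per value
sub_DF2.select("id", F.col("property")[0].val.alias("PID")).show(truncate=False)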

This is a general solution that works even when the JSONs are messy (different ordering of elements, or some elements missing).
You have to flatten first, use regexp_replace to split the 'property' column, and finally pivot. This also avoids hard-coding the new column names.
Constructing your dataframe:
from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = StructType([
    StructField("id", IntegerType()),
    StructField("start_ts", StringType()),
    StructField("end_ts", StringType()),
    StructField("property", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("val", StringType())
    ])))
])

data = [[1, "2010", "2020", [["PID", "P123"], ["Eng", "PA111"], ["Town", "999"], ["Act", "123.1"]]],
        [2, "2011", "2012", [["PID", "P456"], ["Eng", "PA222"], ["Town", "777"], ["Act", "234.1"]]]]

df = spark.createDataFrame(data, schema=schema)
df.show(truncate=False)
+---+--------+------+------------------------------------------------------+
|id |start_ts|end_ts|property |
+---+--------+------+------------------------------------------------------+
|1 |2010 |2020 |[[PID, P123], [Eng, PA111], [Town, 999], [Act, 123.1]]|
|2 |2011 |2012 |[[PID, P456], [Eng, PA222], [Town, 777], [Act, 234.1]]|
+---+--------+------+------------------------------------------------------+
Flattening and pivoting:
df_flatten = df.rdd.flatMap(lambda x: [(x[0], x[1], x[2], y) for y in x[3]]) \
    .toDF(['id', 'start_ts', 'end_ts', 'property']) \
    .select('id', 'start_ts', 'end_ts', col("property").cast("string"))

df_split = df_flatten.select('id', 'start_ts', 'end_ts',
                             regexp_replace(df_flatten.property, r"[\[\]]", "").alias("replaced_col")) \
    .withColumn("arr", split(col("replaced_col"), ", ")) \
    .select(col("arr")[0].alias("col1"), col("arr")[1].alias("col2"), 'id', 'start_ts', 'end_ts')

final_df = df_split.groupby(df_split.id) \
    .pivot("col1") \
    .agg(first("col2")) \
    .join(df, 'id').drop("property")
Output:
final_df.show()
+---+-----+-----+----+----+--------+------+
| id| Act| Eng| PID|Town|start_ts|end_ts|
+---+-----+-----+----+----+--------+------+
| 1|123.1|PA111|P123| 999| 2010| 2020|
| 2|234.1|PA222|P456| 777| 2011| 2012|
+---+-----+-----+----+----+--------+------+
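If you are on Spark 2.4 or later, here is a sketch of an alternative that avoids the string manipulation altogether, by converting the name/val struct array into a map with map_from_entries, exploding it into key/value rows, and pivoting (df_kv and final_df2 are just illustrative names):
from pyspark.sql import functions as F

# turn the array<struct<name,val>> column into a map<name, val>,
# then explode the map into one (name, val) row per entry
df_kv = df.withColumn("prop_map", F.map_from_entries("property")) \
    .select("id", "start_ts", "end_ts", F.explode("prop_map").alias("name", "val"))

# pivot the names back into columns, one per distinct property name
final_df2 = df_kv.groupBy("id", "start_ts", "end_ts") \
    .pivot("name") \
    .agg(F.first("val"))
final_df2.show()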

Related

PySpark row to struct with specified structure

This is my initial dataframe:
columns = ["CounterpartID","Year","Month","Day","churnprobability", "deadprobability"]
data = [(1234, 2021,5,12, 0.85,0.6),(1224, 2022,6,12, 0.75,0.6),(1345, 2022,5,13, 0.8,0.2),(234, 2021,7,12, 0.9,0.8)]
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

schema = StructType([
    StructField("client_id", IntegerType(), False),
    StructField("year", IntegerType(), False),
    StructField("month", IntegerType(), False),
    StructField("day", IntegerType(), False),
    StructField("churn_probability", DoubleType(), False),
    StructField("dead_probability", DoubleType(), False)
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
Then I do some transformations on the columns (basically, splitting each float column into a whole-number part and a fractional part expressed in nanos) to get the intermediary dataframe.
abc = df.rdd.map(lambda x: (x[0], x[1], x[2], x[3],
                            int(x[4]), int(x[4] % 1 * pow(10, 9)),
                            int(x[5]), int(x[5] % 1 * pow(10, 9)))) \
    .toDF(['client_id', 'year', 'month', 'day',
           'churn_probability_unit', 'churn_probability_nano',
           'dead_probability_unit', 'dead_probability_nano'])
display(abc)
Below is the final desired dataframe (this is just an example of one row, but of course I'll need all the rows from the intermediary dataframe).
sjson = {
    "clientId": {"id": 1234},
    "eventDate": {"year": 2022, "month": 8, "day": 5},
    "churnProbability": {"rate": {"units": "500", "nanos": 780000000}},
    "deadProbability": {"rate": {"units": "500", "nanos": 780000000}}
}
df = spark.read.json(sc.parallelize([sjson])).select("clientId", "eventDate", "churnProbability", "deadProbability")
display(df)
How do I reach this end state from the intermediary state efficiently for all rows?
End goal is to use this final dataframe to write to Kafka where the schema of the topic is a form of the final desired dataframe.
I would probably eliminate the use of the RDD logic (and the extra toDF) by using just one select from your original df:
from pyspark.sql import functions as F

defg = df.select(
    F.struct(F.col('client_id').alias('id')).alias('clientId'),
    F.struct('year', 'month', 'day').alias('eventDate'),
    F.struct(
        F.struct(
            F.floor('churn_probability').alias('unit'),
            (F.col('churn_probability') % 1 * 10**9).cast('long').alias('nanos')
        ).alias('rate')
    ).alias('churnProbability'),
    F.struct(
        F.struct(
            F.floor('dead_probability').alias('unit'),
            (F.col('dead_probability') % 1 * 10**9).cast('long').alias('nanos')
        ).alias('rate')
    ).alias('deadProbability'),
)
defg.show()
# +--------+-------------+----------------+----------------+
# |clientId| eventDate|churnProbability| deadProbability|
# +--------+-------------+----------------+----------------+
# | {1234}|{2021, 5, 12}|{{0, 850000000}}|{{0, 600000000}}|
# | {1224}|{2022, 6, 12}|{{0, 750000000}}|{{0, 600000000}}|
# | {1345}|{2022, 5, 13}|{{0, 800000000}}|{{0, 200000000}}|
# | {234}|{2021, 7, 12}|{{0, 900000000}}|{{0, 800000000}}|
# +--------+-------------+----------------+----------------+
So, I was able to solve this using structs, without using to_json:
import pyspark.sql.functions as f

defg = abc.withColumn(
    "clientId",
    f.struct(f.col("client_id").alias("id"))
).withColumn(
    "eventDate",
    f.struct(
        f.col("year").alias("year"),
        f.col("month").alias("month"),
        f.col("day").alias("day"),
    )
).withColumn(
    "churnProbability",
    f.struct(f.struct(
        f.col("churn_probability_unit").alias("unit"),
        f.col("churn_probability_nano").alias("nanos")
    ).alias("rate"))
).withColumn(
    "deadProbability",
    f.struct(f.struct(
        f.col("dead_probability_unit").alias("unit"),
        f.col("dead_probability_nano").alias("nanos")
    ).alias("rate"))
).select("clientId", "eventDate", "churnProbability", "deadProbability")
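As a follow-up on the Kafka end goal mentioned in the question, a minimal sketch (the broker address and topic name below are hypothetical): a Kafka sink expects a string or binary value column, so the struct columns are serialized to JSON first.
from pyspark.sql import functions as F

(defg
    .select(F.to_json(F.struct("clientId", "eventDate",
                               "churnProbability", "deadProbability")).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
    .option("topic", "client_events")                      # hypothetical topic
    .save())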

Creating Dataframes in PySpark

I'm trying to create a dataframe from a list. Can someone let me know why I'm getting the error:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
with the following code:
from pyspark.sql.types import *
test_list = ['green', 'peter']
df = spark.createDataFrame(test_list,StringType()).toDF("color", "name")
Thanks
test_list should be a list of rows, where each row is a tuple or a list, like
test_list = [('green', 'peter')]
or
test_list = [['green', 'peter']]
For more than one row, it would be
test_list = [('green', 'peter'), ('red', 'brialle')]
df = spark.createDataFrame(test_list, schema=["color", "name"])
df.show()
Results in
+-----+-------+
|color|   name|
+-----+-------+
|green|  peter|
|  red|brialle|
+-----+-------+
Reference: CreateDataFrame
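For reference, a sketch of what the original call actually builds before the failing toDF: with a plain StringType() schema, each list element becomes one row of a single column named value, so renaming to two columns fails.
from pyspark.sql.types import StringType

single_col_df = spark.createDataFrame(['green', 'peter'], StringType())
single_col_df.show()
# +-----+
# |value|
# +-----+
# |green|
# |peter|
# +-----+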
A related pattern, if you want to rename several columns at once, is to select the required columns and pass the new names to toDF:
required_columns = ['id', 'first_name', 'last_name', 'email', 'phone_numbers', 'courses']
target_column_names = ['user_id', 'user_first_name', 'user_last_name', 'user_email', 'user_phone_numbers', 'enrolled_courses']

users_df. \
    select(required_columns). \
    toDF(*target_column_names). \
    show()

What is the most efficient way to do two different joins on the same two dataframes in Pyspark

I am trying to compare two dataframes to look for new records and updated records, which in turn will be used to create a third dataframe. I am using Pyspark 2.4.3
As I come from a SQL background (ASE), my initial thought would be to do a left join to find new records and a != on a hash of all the columns to find updates:
SELECT a.*
FROM Todays_Data a
Left Join Yesterdays_PK_And_Hash b on a.pk = b.pk
WHERE (b.pk IS NULL) --finds new records
OR (b.hashOfColumns != HASHBYTES('md5',<converted and concatenated columns>)) --updated records
I have been playing around with Pyspark and have come up with a script that achieves the results I am after:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit

sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)
sp = SparkSession \
    .builder \
    .appName("test App") \
    .getOrCreate()

df = sp.createDataFrame(
    [("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"),  # hashkey here is created based on a YOB of 1973, to test for an update
     ("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
     ("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
    ["First_name", "Last_name", "hashkey"]
)
df_a = sp.createDataFrame(
    [("Fred", "Smith", "Adelaide", "Doctor", 1971),
     ("Fred", "Davis", "Melbourne", "Baker", 1970),
     ("Barry", "Clarke", "Sydney", "Scientist", 1975),
     ("Jane", "Hall", "Sydney", "Dentist", 1980)],
    ["First_name", "Last_name", "City", "Occupation", "YOB"]
)
df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))

df_ins = df_a.alias('a').join(df.alias('b'),
                              (col('a.First_name') == col('b.First_name')) &
                              (col('a.Last_name') == col('b.Last_name')), 'left_anti') \
    .select(lit("Insert").alias("_action"), 'a.*') \
    .dropDuplicates()
df_up = df_a.alias('a').join(df.alias('b'),
                             (col('a.First_name') == col('b.First_name')) &
                             (col('a.Last_name') == col('b.Last_name')) &
                             (col('a.hashkey') != col('b.hashkey')), 'inner') \
    .select(lit("Update").alias("_action"), 'a.*') \
    .dropDuplicates()

df_delta = df_ins.union(df_up).sort("YOB")
df_delta = df_delta.drop("hashkey")
df_delta.show(truncate=False)
What this produces is my final delta as such:
+-------+----------+---------+--------+----------+----+
|_action|First_name|Last_name|City |Occupation|YOB |
+-------+----------+---------+--------+----------+----+
|Update |Fred |Smith |Adelaide|Doctor |1971|
|Insert |Jane |Hall |Sydney |Dentist |1980|
+-------+----------+---------+--------+----------+----+
While I am getting the results I am after, I am unsure how efficient the above code is.
Ultimately, I would like to run similar patterns against datasets with hundreds of millions of records.
Is there any way to make this more efficient?
Thanks
Have you explored broadcast joins? Your join statements could be problematic if you have 100M+ records. If the dataset aliased as b is the smaller one, this would be the tiny modification I would try:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit, broadcast

sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)
sp = SparkSession \
    .builder \
    .appName("test App") \
    .getOrCreate()

df = sp.createDataFrame(
    [("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"),  # hashkey here is created based on a YOB of 1973, to test for an update
     ("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
     ("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
    ["First_name", "Last_name", "hashkey"]
)
df_a = sp.createDataFrame(
    [("Fred", "Smith", "Adelaide", "Doctor", 1971),
     ("Fred", "Davis", "Melbourne", "Baker", 1970),
     ("Barry", "Clarke", "Sydney", "Scientist", 1975),
     ("Jane", "Hall", "Sydney", "Dentist", 1980)],
    ["First_name", "Last_name", "City", "Occupation", "YOB"]
)
df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))

df_ins = df_a.alias('a').join(broadcast(df.alias('b')),
                              (col('a.First_name') == col('b.First_name')) &
                              (col('a.Last_name') == col('b.Last_name')), 'left_anti') \
    .select(lit("Insert").alias("_action"), 'a.*') \
    .dropDuplicates()
df_up = df_a.alias('a').join(broadcast(df.alias('b')),
                             (col('a.First_name') == col('b.First_name')) &
                             (col('a.Last_name') == col('b.Last_name')) &
                             (col('a.hashkey') != col('b.hashkey')), 'inner') \
    .select(lit("Update").alias("_action"), 'a.*') \
    .dropDuplicates()

df_delta = df_ins.union(df_up).sort("YOB")
Maybe rewriting the code cleanly would be easier to follow too.
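You can check that the hint is actually picked up by inspecting the physical plan, which should show a BroadcastHashJoin rather than a SortMergeJoin (the plan text varies by Spark version):
df_ins.explain()
df_up.explain()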
#Ash, from a readability standpoint, you could do a couple of things:
Use variables.
Use functions.
Follow the PEP 8 style guide as much as possible (e.g. no more than 80 characters per line).
joinExpr = (col('a.First_name') == col('b.First_name')) & \
           (col('a.Last_name') == col('b.Last_name'))
joinType = 'inner'

df_up = df_a.alias('a').join(broadcast(df.alias('b')),
                             joinExpr & (col('a.hashkey') != col('b.hashkey')),
                             joinType) \
    .select(lit("Update").alias("_action"), 'a.*') \
    .dropDuplicates()
This is still long, but you get the idea.
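Expanding on that, a sketch of how the variables-plus-functions idea could cover both joins at once (tag_rows and name_match are illustrative names, not from the original post):
from pyspark.sql.functions import broadcast, col, lit

def tag_rows(new_df, old_df, join_expr, join_type, action):
    """Join new_df against old_df and tag the surviving rows with an _action label."""
    return (new_df.alias('a')
            .join(broadcast(old_df.alias('b')), join_expr, join_type)
            .select(lit(action).alias("_action"), 'a.*')
            .dropDuplicates())

name_match = ((col('a.First_name') == col('b.First_name')) &
              (col('a.Last_name') == col('b.Last_name')))

df_ins = tag_rows(df_a, df, name_match, 'left_anti', "Insert")
df_up = tag_rows(df_a, df, name_match & (col('a.hashkey') != col('b.hashkey')),
                 'inner', "Update")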

Passing an entire row as an argument to spark udf through spark dataframe - throws AnalysisException

I am trying to pass an entire row to a Spark UDF along with a few other arguments. I am not using Spark SQL; rather, I am using the dataframe withColumn API, but I am getting the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) col3#9 missing from col1#7,col2#8,col3#13 in operator !Project [col1#7, col2#8, col3#13, UDF(col3#9, col2, named_struct(col1, col1#7, col2, col2#8, col3, col3#9)) AS contcatenated#17]. Attribute(s) with the same name appear in the operation: col3. Please check if the right attribute(s) are used.;;
The above exception can be replicated using the below code:
addRowUDF() // call invokes

def addRowUDF() {
  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().config(new SparkConf().set("master", "local[*]")).appName(this.getClass.getSimpleName).getOrCreate()
  import spark.implicits._

  val df = Seq(
    ("a", "b", "c"),
    ("a1", "b1", "c1")).toDF("col1", "col2", "col3")
  execute(df)
}

def execute(df: org.apache.spark.sql.DataFrame) {
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.functions.{ udf, struct, lit }

  def concatFunc(x: Any, y: String, row: Row) = x.toString + ":" + y + ":" + row.mkString(", ")

  val combineUdf = udf((x: Any, y: String, row: Row) => concatFunc(x, y, row))
  def udf_execute(udf: String, args: org.apache.spark.sql.Column*) = (combineUdf)(args: _*)

  val columns = df.columns.map(df(_))
  val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
  val df3 = df2.withColumn("contcatenated", udf_execute("uudf", df2.col("col3"), lit("col2"), struct(columns: _*)))
  df3.show(false)
}
output should be:
+----+----+-----------+----------------------------+
|col1|col2|col3 |contcatenated |
+----+----+-----------+----------------------------+
|a |b |xxxxxxxxxxx|xxxxxxxxxxx:col2:a, b, c |
|a1 |b1 |xxxxxxxxxxx|xxxxxxxxxxx:col2:a1, b1, c1 |
+----+----+-----------+----------------------------+
That happens because you refer to a column that is no longer in scope. When you call:
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
it shadows the original col3 column, effectively making the preceding column with the same name inaccessible. Even if that weren't the case, let's say after:
val df2 = df.select($"*", lit("xxxxxxxxxxx") as "col3")
the new col3 would be ambiguous, and indistinguishable by name from the one brought in by *.
So to achieve the required output you'll have to use another name:
val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
and then adjust the rest of your code accordingly:
df2.withColumn(
  "contcatenated",
  udf_execute("uudf", df2.col("col3_") as "col3",
    lit("col2"), struct(columns: _*))
).drop("_3")
If the logic is as simple as the one in the example, you can of course just inline things:
df.withColumn(
  "contcatenated",
  udf_execute("uudf", lit("xxxxxxxxxxx") as "col3",
    lit("col2"), struct(columns: _*))
).drop("_3")

How do find out the total amount for each month using spark in python

I'm looking for a way to aggregate my data by month. I first want to keep only the month part of my visitdate. My DataFrame looks like this:
Row(visitdate = 1/1/2013,
    patientid = P1_Pt1959,
    amount = 200,
    note = jnut)
My objective subsequently is to group by visitdate and calculate the sum of amount. I tried this:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

file_path = "G:/Visit Data.csv"
patients = spark.read.csv(file_path, header=True)
patients.createOrReplaceTempView("visitdate")

sqlDF = spark.sql("SELECT visitdate, SUM(amount) AS totalamount FROM visitdate GROUP BY visitdate")
sqlDF.show()
This is the result:
+----------+-----------+
| visitdate|totalamount|
+----------+-----------+
|  9/1/2013|    10800.0|
|25/04/2013|    12440.0|
|27/03/2014|    16930.0|
|26/03/2015|    18560.0|
|14/05/2013|    13770.0|
|30/06/2013|    13880.0|
+----------+-----------+
My objective is to get something like this:
+---------+-----------+
|visitdate|totalamount|
+---------+-----------+
| 1/1/2013|    10800.0|
| 1/2/2013|    12440.0|
| 1/3/2013|    16930.0|
| 1/4/2014|    18560.0|
| 1/5/2015|    13770.0|
| 1/6/2015|    13880.0|
+---------+-----------+
You need to truncate your dates down to months so they group properly, then do a groupBy/sum. There is a Spark function to do this for you, called date_trunc. For example:
from datetime import date
from pyspark.sql.functions import date_trunc, sum
data = [
(date(2000, 1, 2), 1000),
(date(2000, 1, 2), 2000),
(date(2000, 2, 3), 3000),
(date(2000, 2, 4), 4000),
]
df = spark.createDataFrame(sc.parallelize(data), ["date", "amount"])
df.groupBy(date_trunc("month", df.date)).agg(sum("amount")).show()
+-----------------------+-----------+
|date_trunc(month, date)|sum(amount)|
+-----------------------+-----------+
| 2000-01-01 00:00:00| 3000|
| 2000-02-01 00:00:00| 7000|
+-----------------------+-----------+
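A note for the asker's own data: since visitdate is read from the CSV as a string such as 25/04/2013, it has to be parsed with to_date before truncating. A sketch, assuming the d/M/yyyy pattern implied by the sample rows above:
from pyspark.sql.functions import col, to_date, date_trunc, sum as sum_

monthly = patients \
    .withColumn("visitdate", to_date(col("visitdate"), "d/M/yyyy")) \
    .groupBy(date_trunc("month", col("visitdate")).alias("visit_month")) \
    .agg(sum_(col("amount").cast("double")).alias("totalamount"))  # cast since CSV columns come in as strings
monthly.show()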
