Extract values from a complex column in PySpark - apache-spark

I have a PySpark dataframe with a complex column; see the example below:
ID  value
1   [{"label":"animal","value":"cat"},{"label":null,"value":"George"}]
I want to add a new column to the dataframe that converts it into a list of strings. If label is null, the string should contain just the value; if label is not null, the string should be "label:value". So for the example dataframe above, the output should look like this:
ID  new_column
1   ["animal:cat", "George"]

You can use transform to map each array element to a string built with concat_ws (which skips null inputs, so a null label leaves just the value):
df2 = df.selectExpr(
    'id',
    "transform(value, x -> concat_ws(':', x['label'], x['value'])) as new_column"
)
df2.show()
+---+--------------------+
| id| new_column|
+---+--------------------+
| 1|[animal:cat, George]|
+---+--------------------+
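If you prefer the DataFrame API over selectExpr, the same idea can be written with F.transform (a sketch, assuming Spark 3.1+ where F.transform is available and that value is already an array of structs rather than a JSON string):
import pyspark.sql.functions as F

# concat_ws drops null inputs, so a null label collapses to just the value
df2 = df.withColumn(
    "new_column",
    F.transform("value", lambda x: F.concat_ws(":", x["label"], x["value"]))
)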

Related

Join dataframes with list field

I'm using Structured Streaming with PySpark, and I'm reading from a Kafka topic a key, which is an integer, and a value, which is a comma-separated list of integers.
I'm trying to join this dataframe with another dataframe that I get from MongoDB. I could also filter based on the values of the Kafka dataframe that are present in the column "id" of the MongoDB dataframe (though I don't know if that approach is correct either).
kafka dataframe:
key  value
1    2,9,7
MongoDB dataframe:
name    id
camp_1  1
camp_2  9
camp_3  5
camp_4  7
camp_5  2
So, the result should be:
name    id
camp_5  2
camp_2  9
camp_4  7
I'm thinking of a join since I've been unable to iterate over the values of the list in the "value" field of the Kafka dataframe.
You can try to use explode and then join:
from pyspark.sql.functions import explode
kafkaData = [("1", [2, 7, 9])]
kafkaDf = spark.createDataFrame(kafkaData, schema=["key", "Value"])
mongoData = [("Camp_1", 1), ("Camp_2", 9), ("Camp_3", 5), ("Camp_4", 7), ("Camp_5", 2)]
mongoDf = spark.createDataFrame(mongoData, schema=["name", "id"])
kafkaDfExploded = kafkaDf.select(explode("Value").alias("id"))
kafkaDfExploded.join(mongoDf, "id", "left").select("name", "id").show()
output:
+------+---+
| name| id|
+------+---+
|Camp_4| 7|
|Camp_2| 9|
|Camp_5| 2|
+------+---+
You may add an ORDER BY if ordering is relevant for you, and you may also broadcast one of the DataFrames if you know it is going to be small.
If your list 2, 7, 9 is not an array but a string, you can split it first; the rest will be similar.
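For example, a minimal sketch assuming the Kafka value arrives as the string "2,9,7":
from pyspark.sql.functions import split, explode, col

# split the comma-separated string, cast to an int array, then explode to one row per id
kafkaDfExploded = kafkaDf.select(
    explode(split(col("Value"), ",").cast("array<int>")).alias("id")
)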

Spark Dataframe | Merge multiple rows with missing values

I have a dataframe with a column that is a list of strings and another column that contains the year.
A few rows have missing values in the year column:
Year  fields
2020  IFDSDEP.7
      IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60
2020  IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54
I would like to merge rows with or without a year value into a single row; is there a way to do it?
In production, we can have multiple years and there could be a million rows.
My output should look like this:
Year  fields
2020  IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60,IFDSIMP.14
Thanks for the help.
If the fields column is a string, you should probably first split it into an array of strings; that way you can combine everything into a unique list and then join it all back.
Regarding the nulls in the year column, you will have to fill the missing values. You will need to find a way to know which year to fill if there are multiple years.
Once you have done that, groupBy should do the trick.
# Your example DataFrame
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df: DataFrame = spark.createDataFrame(data=[
    [2020, "IFDSDEP.7"],
    [None, "IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60"],
    [2020, "IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54"],
], schema=StructType([
    StructField("year", IntegerType()),
    StructField("fields", StringType()),
])).cache()
df.show(truncate=False)
+----+-----------------------------------------------------+
|year|fields |
+----+-----------------------------------------------------+
|2020|IFDSDEP.7 |
|null|IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60|
|2020|IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54|
+----+-----------------------------------------------------+
# Replace null with 2020 for `year` column
df = df.fillna({"year": 2020})
df.show(truncate=False)
+----+-----------------------------------------------------+
|year|fields |
+----+-----------------------------------------------------+
|2020|IFDSDEP.7 |
|2020|IFDSDEP.7,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54,IFDSIMP.60|
|2020|IFDSIMP.7,IFDSIMP.14,IFDSIMP.51,IFDSIMP.52,IFDSIMP.54|
+----+-----------------------------------------------------+
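If a single constant like 2020 is not appropriate because the data holds several years, one option (a sketch only, assuming there is some grouping column, here called "source_key", that ties a year-less row to its year) is to borrow the year from related rows with a window:
from pyspark.sql import Window

# "source_key" is a hypothetical column identifying which rows belong together
w = Window.partitionBy("source_key")
df = df.withColumn("year", F.coalesce(F.col("year"), F.max("year").over(w)))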
# Transforming the fields column
df = df.withColumn("fields", F.split(F.col("fields"), ","))
df.show(truncate=False)
+----+-----------------------------------------------------------+
|year|fields |
+----+-----------------------------------------------------------+
|2020|[IFDSDEP.7] |
|2020|[IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60]|
|2020|[IFDSIMP.7, IFDSIMP.14, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54]|
+----+-----------------------------------------------------------+
# Aggregate on year and collect all arrays of fields then combine them all and make them distinct
df_agg = df.groupby("year").agg(F.array_distinct(F.flatten(F.collect_list("fields"))))
df_agg.show(truncate=False)
+----+----------------------------------------------------------------------------------+
|year|array_distinct(flatten(collect_list(fields))) |
+----+----------------------------------------------------------------------------------+
|2020|[IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60, IFDSIMP.7, IFDSIMP.14]|
+----+----------------------------------------------------------------------------------+
Breaking down the last part of the code:
F.collect_list("fields") - collects all the fields arrays for the group-by key (year). By now you should have an array of arrays:
[[IFDSDEP.7], [IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60], [IFDSIMP.7, IFDSIMP.14, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54]]
F.flatten() - flattens the sub-arrays into one large array:
[IFDSDEP.7, IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60, IFDSIMP.7, IFDSIMP.14, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54]
F.array_distinct() - deduplicates the values in the array, which gives the result you expect:
[IFDSDEP.7, IFDSIMP.51, IFDSIMP.52, IFDSIMP.54, IFDSIMP.60, IFDSIMP.7, IFDSIMP.14]
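Since the question's expected output shows a comma-separated string rather than an array, you could optionally join the distinct values back together with array_join (a small sketch on top of the aggregation above):
df_out = df.groupby("year").agg(
    F.array_join(F.array_distinct(F.flatten(F.collect_list("fields"))), ",").alias("fields")
)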

How to convert all date formats to a timestamp for a date column?

I am using PySpark version 3.0.1. I am reading a csv file as a PySpark dataframe that has 2 date columns, but when I print the schema both columns show up as string type.
The screenshot attached above shows the dataframe and the schema of the dataframe.
How do I convert the row values in both date columns to timestamp format using PySpark?
I have tried many things, but all of them require knowing the current format; how do I convert to a proper timestamp if I am not aware of what format is coming in the csv file?
I have tried the code below as well, but it creates a new column with null values:
from pyspark.sql.functions import col

df1 = df.withColumn('datetime', col('joining_date').cast('timestamp'))
df1.show()
df1.printSchema()
Since there are two different date formats, you need to parse with two different format patterns and coalesce the results.
import pyspark.sql.functions as F
result = df.withColumn(
    'datetime',
    F.coalesce(
        F.to_timestamp('joining_date', 'MM-dd-yy'),
        F.to_timestamp('joining_date', 'MM/dd/yy')
    )
)
result.show()
+------------+-------------------+
|joining_date| datetime|
+------------+-------------------+
| 01-20-20|2020-01-20 00:00:00|
| 01/19/20|2020-01-19 00:00:00|
+------------+-------------------+
If you want to convert all to a single format:
import pyspark.sql.functions as F
result = df.withColumn(
    'datetime',
    F.date_format(
        F.coalesce(
            F.to_timestamp('joining_date', 'MM-dd-yy'),
            F.to_timestamp('joining_date', 'MM/dd/yy')
        ),
        'MM-dd-yy'
    )
)
result.show()
+------------+--------+
|joining_date|datetime|
+------------+--------+
| 01-20-20|01-20-20|
| 01/19/20|01-19-20|
+------------+--------+
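If more than two formats can show up, one way (a sketch; the exact list of patterns is an assumption you would adjust to your data) is to coalesce over a whole list of candidate formats, letting the first pattern that parses win:
import pyspark.sql.functions as F

# hypothetical list of formats you expect to see in the file
candidate_formats = ["MM-dd-yy", "MM/dd/yy", "yyyy-MM-dd"]

result = df.withColumn(
    "datetime",
    F.coalesce(*[F.to_timestamp("joining_date", fmt) for fmt in candidate_formats])
)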

Transform data into rdd and analyze

I am new to Spark and have the below data in csv format, which I want to convert into a proper format.
Csv file with no header:
Student_name=abc, student_grades=A, Student_gender=female
Student_name=Xyz, student_grades=B, Student_gender=male
Now I want to put it into an RDD/dataframe with a header created:
Student_Name  student_grades  student_gender
abc           A               female
Xyz           B               male
Also, I want to get the list of students with grades A, B and C.
What you could do is infer the schema from the first line of the file and then transform the dataframe accordingly, that is:
1. Remove the column names from the row values.
2. Rename the columns.
Here is how you could do it. First, let's read your data from a file and display it.
// the options are here to get rid of potential spaces around the ",".
val df = spark.read
  .option("ignoreTrailingWhiteSpace", true)
  .option("ignoreLeadingWhiteSpace", true)
  .csv("path/your_file.csv")
df.show(false)
+----------------+----------------+---------------------+
|_c0 |_c1 |_c2 |
+----------------+----------------+---------------------+
|Student_name=abc|student_grades=A|Student_gender=female|
|Student_name=Xyz|student_grades=B|Student_gender=male |
+----------------+----------------+---------------------+
Then, we extract a mapping between the default names and the new names using the first row of the dataframe.
val row0 = df.head
val cols = df
  .columns
  .map(c => c -> row0.getAs[String](c).split("=").head)
Finally we get rid of the name of the columns with a split on "=" and rename the columns using our mapping:
val new_df = df
  .select(cols.map { case (old_name, new_name) =>
    split(col(old_name), "=")(1) as new_name
  }: _*)
new_df.show(false)
+------------+--------------+--------------+
|Student_name|student_grades|Student_gender|
+------------+--------------+--------------+
|abc |A |female |
|Xyz |B |male |
+------------+--------------+--------------+
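For completeness, a rough PySpark equivalent of the Scala approach above (a sketch, untested against your exact file, and assuming the same "path/your_file.csv" layout):
from pyspark.sql import functions as F

df = (spark.read
      .option("ignoreTrailingWhiteSpace", True)
      .option("ignoreLeadingWhiteSpace", True)
      .csv("path/your_file.csv"))

# map default column names (_c0, _c1, ...) to the names found before "=" in the first row
row0 = df.head()
cols = [(c, row0[c].split("=")[0]) for c in df.columns]

# keep only the part after "=" and rename each column
new_df = df.select([F.split(F.col(old), "=")[1].alias(new) for old, new in cols])
new_df.show(truncate=False)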

spark groupby on several columns at same time

I'm using Spark 2.0.
I have a dataframe with several columns, like id, latitude, longitude, time.
I want to do a groupBy and keep ["latitude", "longitude"] always together.
Could I do the following?
df.groupBy('id', ["latitude", "longitude"], 'time')
I want to calculate the number of records for each user, at each different time, with each different location ["latitude", "longitude"].
You can combine the "latitude" and "longitude" columns and then use groupBy. The sample below uses Scala.
val df = Seq(("1","33.33","35.35","8:00"),("2","31.33","39.35","9:00"),("1","33.33","35.35","8:00")).toDF("id","latitude","longitude","time")
df.show()
val df1 = df.withColumn("lat-long",array($"latitude",$"longitude"))
df1.show()
val df2 = df1.groupBy("id","lat-long","time").count()
df2.show()
The output will look like this:
+---+--------------+----+-----+
| id| lat-long|time|count|
+---+--------------+----+-----+
| 2|[31.33, 39.35]|9:00| 1|
| 1|[33.33, 35.35]|8:00| 2|
+---+--------------+----+-----+
You can just use:
df.groupBy('id', 'latitude', 'longitude', 'time').agg(...)
This will work as expected without any additional steps.
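If you want latitude and longitude kept as a single nested column in PySpark as well, a small sketch (count() matches the stated goal of counting records per user, time, and location):
import pyspark.sql.functions as F

counts = df.groupBy(
    "id",
    F.struct("latitude", "longitude").alias("lat_long"),
    "time"
).count()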
