PySpark: Operations with columns given different levels of aggregation and conditions - apache-spark

I want to get a ratio of sentiment, and for that I need to calculate how many positives and how many negatives there are per topic, and then divide it by the total of records of each topic.
Let's say I have this dataset:
+-----+---------+
|topic|sentiment|
+-----+---------+
|Chair| positive|
|Table| negative|
|Chair| negative|
|Chair| negative|
|Table| positive|
|Table| positive|
|Table| positive|
+-----+---------+
In this case, I could assign a value of -1 to 'negative' and 1 to 'positive'. The ratio would then be 0.5 for Table ((negative + positive + positive + positive) / total_count) and -0.33 for Chair ((positive + negative + negative) / total_count).
I have come up with this solution, but it seems way too complicated:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col, when
spark = SparkSession.builder.appName('SparkExample').getOrCreate()
data_e = [("Chair","positive"),
("Table","negative"),
("Chair","negative"),
("Chair","negative"),
("Table","positive"),
("Table","positive")
]
schema_e = StructType([ \
StructField("topic",StringType(),True), \
StructField("sentiment",StringType(),True), \
])
df_e = spark.createDataFrame(data=data_e,schema=schema_e)
df_e_int = df_e.withColumn('sentiment_int',
                           when(col('sentiment') == 'positive', 1)
                           .otherwise(-1)) \
               .select('topic', 'sentiment_int')
agg_e = df_e_int.groupBy('topic') \
                .count() \
                .select('topic',
                        col('count').alias('counts'))
agg_sum_e = df_e_int.groupBy('topic') \
                    .sum('sentiment_int') \
                    .select('topic',
                            col('sum(sentiment_int)').alias('sum_value'))
agg_joined_e = agg_e.join(agg_sum_e,
                          agg_e.topic == agg_sum_e.topic,
                          'inner') \
                    .select(agg_e.topic, 'counts', 'sum_value')
final_agg_e = agg_joined_e.withColumn('sentiment_ratio',
                                      col('sum_value') / col('counts')) \
                          .select('topic', 'sentiment_ratio')
The final output would look like this:
+-----+-------------------+
|topic|    sentiment_ratio|
+-----+-------------------+
|Chair|-0.3333333333333333|
|Table|                0.5|
+-----+-------------------+
What's the most efficient way of doing this?

You can condense your logic into two lines by using avg:
from pyspark.sql import functions as F
df_e.groupBy("topic") \
.agg(F.avg(F.when(F.col("sentiment").eqNullSafe("positive"), 1).otherwise(-1))) \
.show()
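For readability you can also alias the aggregated column; with the full seven-row dataset from the question this should give something like (row order may vary):
df_e.groupBy("topic") \
    .agg(F.avg(F.when(F.col("sentiment").eqNullSafe("positive"), 1).otherwise(-1)).alias("sentiment_ratio")) \
    .show()
# +-----+-------------------+
# |topic|    sentiment_ratio|
# +-----+-------------------+
# |Chair|-0.3333333333333333|
# |Table|                0.5|
# +-----+-------------------+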

Related

Pyspark: add one row dynamically into the final dataframe

I've a final dataframe with this format:
Product_ID: string
Product_COD: string
Product_NAM: string
Product_VER: integer
ProductLine_NAM: string
Language_COD: string
ProductType_NAM: string
Load_DAT: integer
LoadEnd_DAT:integer
edmChange_DTT: timestamp
and I want to add a new row to that dataframe where the ID (Product_ID) is -1, every string column is set to 'Unknown', and the columns of the remaining datatypes are set to null.
I created this code:
id_column = "Product_ID"
df_lessOne = spark.createDataFrame(["-1"], "string").toDF(id_column) #create a new id_column row with -1
appended_df = finalDf.unionByName(df_lessOne, allowMissingColumns=True) #add the rest columns of dataframe with nulls
appended_df_filter = appended_df.filter(""+ id_column + " = '-1'")
columns = [item[0] for item in appended_df_filter.dtypes if item[1].startswith('string')] #select only string columns
# replace string columns with "Unknown"
for c_na in columns:
    appended_df_filter = (appended_df_filter
                          .filter(""+ id_column + " = '-1'")
                          .withColumn(c_na, lit('Unknown'))
                          )
appended_df = appended_df.filter(""+ id_column + " <> '-1'")
dfs = [appended_df, appended_df_filter]
#add final -1 row to the final dataframe
finalDf = reduce(DataFrame.unionAll, dfs)
display(finalDf)
but unfortunately, it's not working well.
I'm trying to build this dynamically because I want to reuse it with other dataframes later; I would only need to change id_column.
Can anyone please help me achieve this?
Thank you!
from pyspark.sql.types import *
from datetime import datetime
import pyspark.sql.functions as F
data2 = [
("xp3980","2103","Product_1",1,"PdLine_23","XX1","PNT_1",2,36636,datetime.strptime('2020-08-20 10:00:00', '%Y-%m-%d %H:%M:%S')),
("gi9387","2411","Product_2",1,"PdLine_44","YT89","PNT_6",2,35847,datetime.strptime('2021-07-21 7:00:00', '%Y-%m-%d %H:%M:%S'))
]
schema = StructType([ \
StructField("Product_ID",StringType(),True), \
StructField("Product_COD",StringType(),True), \
StructField("Product_NAM",StringType(),True), \
StructField("Product_VER", IntegerType(),True), \
StructField("ProductLine_NAM", StringType(), True), \
StructField("Language_COD", StringType(), True), \
StructField("ProductType_NAM", StringType(), True), \
StructField("Load_DAT", IntegerType(), True), \
StructField("LoadEnd_DAT", IntegerType(), True), \
StructField("edmChange_DTT", TimestampType(), True) \
])
my_df = spark.createDataFrame(data=data2,schema=schema)
df_res = spark.createDataFrame([(-1,)]).toDF("Product_ID")
for c in my_df.schema:
    if str(c.name) == 'Product_ID':
        continue
    if str(c.dataType) == 'StringType':
        df_res = df_res.withColumn(c.name, F.lit('Unknown'))
    else:
        df_res = df_res.withColumn(c.name, F.lit(None))
my_df.union(df_res).show()
# +----------+-----------+-----------+-----------+---------------+------------+---------------+--------+-----------+-------------------+
# |Product_ID|Product_COD|Product_NAM|Product_VER|ProductLine_NAM|Language_COD|ProductType_NAM|Load_DAT|LoadEnd_DAT| edmChange_DTT|
# +----------+-----------+-----------+-----------+---------------+------------+---------------+--------+-----------+-------------------+
# | xp3980| 2103| Product_1| 1| PdLine_23| XX1| PNT_1| 2| 36636|2020-08-20 10:00:00|
# | gi9387| 2411| Product_2| 1| PdLine_44| YT89| PNT_6| 2| 35847|2021-07-21 07:00:00|
# | -1| Unknown| Unknown| null| Unknown| Unknown| Unknown| null| null| null|
# +----------+-----------+-----------+-----------+---------------+------------+---------------+--------+-----------+-------------------+
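If you need this to be reusable across dataframes (as the question asks), one option is to wrap the same loop in a small helper that takes the id column name. This is only a sketch building on the answer above, not part of it, and it assumes the id column is a string, as in this example:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

def add_unknown_row(df, id_column):
    # one-row frame holding only the id value "-1" (assumes a string id column)
    row = spark.createDataFrame(["-1"], "string").toDF(id_column)
    for field in df.schema:
        if field.name == id_column:
            continue
        if isinstance(field.dataType, StringType):
            row = row.withColumn(field.name, F.lit('Unknown'))
        else:
            # cast the null so the types line up for the union
            row = row.withColumn(field.name, F.lit(None).cast(field.dataType))
    return df.unionByName(row)

add_unknown_row(my_df, "Product_ID").show()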

Performance improvement for UDFs - get column name of least value per row in pyspark

I use this udf:
mincol = F.udf(lambda row: cols[row.index(min(row))], StringType())
df = df.withColumn("mycol", mincol(F.struct([df[x] for x in cols])))
to get the name of the column with the least value per row as the value of another column called 'mycol'.
But this code is very slow.
Any suggestions to improve performance?
I am using PySpark 2.3.
Here is another solution for Spark 2.3 which uses only built-in functions:
from sys import float_info
from pyspark.sql.functions import array, least, col, lit, concat_ws, expr
cols = df.columns
col_names = array(list(map(lit, cols)))
set_cols = list(map(col, cols))
# replace null with largest python float
df.na.fill(float_info.max) \
.withColumn("min", least(*cols)) \
.withColumn("cnames", col_names) \
.withColumn("set", concat_ws(",", *set_cols)) \
.withColumn("min_col", expr("cnames[find_in_set(min, set) - 1]")) \
.select(*[cols + ["min_col"]]) \
.show()
Steps:
Fill all nulls with the largest possible float number. This is a good candidate for null replacement since it is hard to find a larger value.
Find min column using least.
Create the column cnames for storing the column names.
Create the column set, which contains all the values as a comma-separated string.
Create the column min_col using find_in_set. The function looks the item up in the comma-separated string and returns its 1-based position, so we use cnames[index - 1] to retrieve the column name (see the small illustration below).
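As a small illustration (with hypothetical literals, not taken from the data above), find_in_set returns the 1-based position of an item inside a comma-separated string, which is why 1 is subtracted before indexing into cnames:
spark.sql("SELECT find_in_set('2.0', '3.0,2.0,1.0') AS pos").show()
# +---+
# |pos|
# +---+
# |  2|
# +---+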
Here is an approach without a udf. The idea is to create an array of structs containing the value and name of each column and then sort this array; sorting an array of structs compares fields in order, so the struct with the smallest vals comes first and its cols field is the wanted column name.
df1 = spark.createDataFrame([
(1., 2., 3.),(3.,2.,1.), (9.,8.,-1.), (1.2, 1.2, 9.1), (3., None, 1.0)], \
["col1", "col2", "col3"])
cols = df1.columns
col_string = ', '.join("'{0}'".format(c) for c in cols)
df1 = df1.withColumn("vals", F.array(cols)) \
.withColumn("cols", F.expr("Array(" + col_string + ")")) \
.withColumn("zipped", F.arrays_zip("vals", "cols")) \
.withColumn("without_nulls", F.expr("filter(zipped, x -> not x.vals is null)")) \
.withColumn("sorted", F.expr("array_sort(without_nulls)")) \
.withColumn("min", F.col("sorted")[0].cols) \
.drop("vals", "cols", "zipped", "without_nulls", "sorted")
df1.show(truncate=False)
prints
+----+----+----+----+
|col1|col2|col3|min |
+----+----+----+----+
|1.0 |2.0 |3.0 |col1|
|3.0 |2.0 |1.0 |col3|
|9.0 |8.0 |-1.0|col3|
|1.2 |1.2 |9.1 |col1|
|3.0 |null|1.0 |col3|
+----+----+----+----+

How to extract time from timestamp in pyspark?

I have a requirement to extract the time from a timestamp column in a dataframe using pyspark.
Let's say the timestamp is 2019-01-03T18:21:39; I want to extract only the time "18:21:39", so that it always appears zero-padded in the form "01:01:01".
df = spark.createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"],StringType()).toDF('datetime')
df=df.select(df['datetime'].cast(TimestampType()))
I tried the following, but did not get the expected result:
df1=df.withColumn('time',concat(hour(df['datetime']),lit(":"),minute(df['datetime']),lit(":"),second(df['datetime'])))
display(df1)
+-------------------+-------+
| datetime| time|
+-------------------+-------+
|2020-06-17 00:44:30|0:44:30|
|2020-06-17 06:06:56| 6:6:56|
|2020-06-17 15:04:34|15:4:34|
+-------------------+-------+
My results look like 6:6:56, but I want them to be 06:06:56.
Use the date_format function.
from pyspark.sql.types import StringType
df = spark \
.createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"], StringType()) \
.toDF('datetime')
from pyspark.sql.functions import date_format
q = df.withColumn('time', date_format('datetime', 'HH:mm:ss'))
>>> q.show()
+-------------------+--------+
| datetime| time|
+-------------------+--------+
|2020-06-17T00:44:30|00:44:30|
|2020-06-17T06:06:56|06:06:56|
|2020-06-17T15:04:34|15:04:34|
+-------------------+--------+
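date_format also works directly on a timestamp column, so the cast from the question does not get in the way; a quick sketch reusing the names above:
from pyspark.sql.types import TimestampType
q2 = df.select(df['datetime'].cast(TimestampType()).alias('datetime')) \
       .withColumn('time', date_format('datetime', 'HH:mm:ss'))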

What is the most efficient way to do two different joins on the same two dataframes in Pyspark

I am trying to compare two dataframes to look for new records and updated records, which in turn will be used to create a third dataframe. I am using Pyspark 2.4.3
As I come from a SQL background (ASE), my initial thought would be to do a left join to find new records and a != on a hash of all the columns to find updates:
SELECT a.*
FROM Todays_Data a
Left Join Yesterdays_PK_And_Hash b on a.pk = b.pk
WHERE (b.pk IS NULL) --finds new records
OR (b.hashOfColumns != HASHBYTES('md5',<converted and concatenated columns>)) --updated records
I have been playing around with Pyspark and have come up with a script that achieves the results I am after:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit
sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)
sp = SparkSession \
.builder \
.appName("test App") \
.getOrCreate()
df = sp.createDataFrame(
[("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"), # hashkey here is created based on YOB of 1973. To test for an update
("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
["First_name", "Last_name", "hashkey"]
)
df_a = sp.createDataFrame(
[("Fred", "Smith", "Adelaide", "Doctor", 1971),
("Fred", "Davis", "Melbourne", "Baker", 1970),
("Barry", "Clarke", "Sydney", "Scientist", 1975),
("Jane", "Hall", "Sydney", "Dentist", 1980)],
["First_name", "Last_name", "City", "Occupation", "YOB"]
)
df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))
df_ins = df_a.alias('a').join(df.alias('b'), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')), 'left_anti') \
.select(lit("Insert").alias("_action"), 'a.*') \
.dropDuplicates()
df_up = df_a.alias('a').join(df.alias('b'), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')) &
(col('a.hashkey') != col('b.hashkey')), 'inner') \
.select(lit("Update").alias("_action"), 'a.*') \
.dropDuplicates()
df_delta = df_ins.union(df_up).sort("YOB")
df_delta = df_delta.drop("hashkey")
df_delta.show(truncate=False)
What this produces is my final delta as such:
+-------+----------+---------+--------+----------+----+
|_action|First_name|Last_name|City |Occupation|YOB |
+-------+----------+---------+--------+----------+----+
|Update |Fred |Smith |Adelaide|Doctor |1971|
|Insert |Jane |Hall |Sydney |Dentist |1980|
+-------+----------+---------+--------+----------+----+
While I am getting the results I am after, I am unsure how efficient the above code is.
Ultimately, I would like to run similar patterns against datasets of hundreds of millions of records.
Is there any way to make this more efficient?
Thanks
Have you explored broadcast joins? Your join statements could be problematic if you have 100M+ records. If the dataset b is the smaller one, this would be the tiny modification I would try:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import md5, concat_ws, col, lit, broadcast
sc = SparkContext("local", "test App")
sqlContext = SQLContext(sc)
sp = SparkSession \
.builder \
.appName("test App") \
.getOrCreate()
df = sp.createDataFrame(
[("Fred", "Smith", "16ba5519cdb13f99e087473e4faf3825"), # hashkey here is created based on YOB of 1973. To test for an update
("Fred", "Davis", "253ab75676cdbd73b874c97a62d27608"),
("Barry", "Clarke", "cc3baaa05a1146f2f8cf0a743c9ab8c4")],
["First_name", "Last_name", "hashkey"]
)
df_a = sp.createDataFrame(
[("Fred", "Smith", "Adelaide", "Doctor", 1971),
("Fred", "Davis", "Melbourne", "Baker", 1970),
("Barry", "Clarke", "Sydney", "Scientist", 1975),
("Jane", "Hall", "Sydney", "Dentist", 1980)],
["First_name", "Last_name", "City", "Occupation", "YOB"]
)
df_a = df_a.withColumn("hashkey", md5(concat_ws("", *df_a.columns)))
df_ins = df_a.alias('a').join(broadcast(df.alias('b')), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')), 'left_anti') \
.select(lit("Insert").alias("_action"), 'a.*') \
.dropDuplicates()
df_up = df_a.alias('a').join(broadcast(df.alias('b')), (col('a.First_name') == col('b.First_name')) &
(col('a.Last_name') == col('b.Last_name')) &
(col('a.hashkey') != col('b.hashkey')), 'inner') \
.select(lit("Update").alias("_action"), 'a.*') \
.dropDuplicates()
df_delta = df_ins.union(df_up).sort("YOB")
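As an aside, Spark broadcasts the smaller side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit broadcast() hint simply forces that choice when you know the lookup dataframe is small.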
Rewriting the code more cleanly might also make it easier to follow.
@Ash, from a readability standpoint, you could do a couple of things:
Use variables.
Use functions.
Follow the PEP 8 style guide as much as possible (e.g. no more than 79 characters per line).
joinExpr = (col('a.First_name') == col('b.First_name')) & \
           (col('a.Last_name') == col('b.Last_name'))
joinType = 'inner'
df_up = df_a.alias('a').join(broadcast(df.alias('b')),
                             joinExpr & (col('a.hashkey') != col('b.hashkey')),
                             joinType) \
    .select(lit("Update").alias("_action"), 'a.*') \
    .dropDuplicates()
This is still long, but you get the idea.

How do find out the total amount for each month using spark in python

I'm looking for a way to aggregate my data by month. I first want to keep only the month in my visitdate. My DataFrame looks like this:
Row(visitdate = 1/1/2013,
patientid = P1_Pt1959,
amount = 200,
note = jnut,
)
My objective is then to group by visitdate and calculate the sum of amount. I tried this:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
file_path = "G:/Visit Data.csv"
patients = spark.read.csv(file_path,header = True)
patients.createOrReplaceTempView("visitdate")
sqlDF = spark.sql("SELECT visitdate,SUM(amount) as totalamount from visitdate GROUP BY visitdate")
sqlDF.show()
This is the result:
+----------+-----------+
| visitdate|totalamount|
+----------+-----------+
|  9/1/2013|    10800.0|
|25/04/2013|    12440.0|
|27/03/2014|    16930.0|
|26/03/2015|    18560.0|
|14/05/2013|    13770.0|
|30/06/2013|    13880.0|
+----------+-----------+
My objective is to get something like this:
+----------+-----------+
| visitdate|totalamount|
+----------+-----------+
|  1/1/2013|    10800.0|
|  1/2/2013|    12440.0|
|  1/3/2013|    16930.0|
|  1/4/2014|    18560.0|
|  1/5/2015|    13770.0|
|  1/6/2015|    13880.0|
+----------+-----------+
You need to truncate your dates down to months so they group properly, then do a groupBy/sum. There is a Spark function to do this for you called date_trunc. For example:
from datetime import date
from pyspark.sql.functions import date_trunc, sum
data = [
(date(2000, 1, 2), 1000),
(date(2000, 1, 2), 2000),
(date(2000, 2, 3), 3000),
(date(2000, 2, 4), 4000),
]
df = spark.createDataFrame(sc.parallelize(data), ["date", "amount"])
df.groupBy(date_trunc("month", df.date)).agg(sum("amount"))
+-----------------------+-----------+
|date_trunc(month, date)|sum(amount)|
+-----------------------+-----------+
| 2000-01-01 00:00:00| 3000|
| 2000-02-01 00:00:00| 7000|
+-----------------------+-----------+
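Since visitdate in the question is a string such as 25/04/2013 rather than a date, it needs to be parsed first. Here is a sketch under that assumption (the d/M/yyyy pattern is inferred from the sample values), reusing the patients dataframe and column names from the question:
from pyspark.sql.functions import col, to_date, date_trunc, date_format, sum

monthly = patients \
    .withColumn("visitdate", to_date("visitdate", "d/M/yyyy")) \
    .withColumn("amount", col("amount").cast("double")) \
    .groupBy(date_trunc("month", "visitdate").alias("month")) \
    .agg(sum("amount").alias("totalamount")) \
    .withColumn("month", date_format("month", "d/M/yyyy"))
monthly.show()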
