Pyspark: applying ODM mapping on column level

I have the two data frames below, and I would like to apply the following condition and return the values in a PySpark data frame.
df1.show()
+---+-------+--------+
|id |tr_type|nominal |
+---+-------+--------+
|1 |K |2.0 |
|2 |ZW |7.0 |
|3 |V |12.5 |
|4 |VW |9.0 |
|5 |CI |5.0 |
+---+-------+--------+
One-dimensional mapping (abcefgh):
+-------+------------+------------+-----------+
|odm_id |return_value|odm_relation|input_value|
+-------+------------+------------+-----------+
|abcefgh|B |EQ |K |
|abcefgh|B |EQ |ZW |
|abcefgh|S |EQ |V |
|abcefgh|S |EQ |VW |
|abcefgh|I |EQ |CI |
+-------+------------+------------+-----------+
I need to apply the condition below: the nominal volume is negated when there is a sell transaction.
IF (tr_type, $abcefgh.) == 'S' THEN ;
nominal = -nominal ;
The expected output:
+---+-------+-------+-----------+
|id |tr_type|nominal|nominal_new|
+---+-------+-------+-----------+
|1 |K |2.0 |2.0 |
|2 |ZW |7.0 |7.0 |
|3 |V |12.5 |-12.5 |
|4 |VW |9.0 |-9.0 |
|5 |CI |5.0 |5.0 |
+---+-------+-------+-----------+

You could join the two dataframes on tr_type == input_value and use a when().otherwise() to create the new column.
See the example below using your samples.
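For reference, here is a minimal sketch of how the two sample dataframes used below could be created (the names data_sdf and odm_sdf are assumptions matching the code that follows):

# transactions dataframe from the question
data_sdf = spark.createDataFrame(
    [(1, 'K', 2.0), (2, 'ZW', 7.0), (3, 'V', 12.5), (4, 'VW', 9.0), (5, 'CI', 5.0)],
    schema='id int, tr_type string, nominal double'
)

# one-dimensional mapping dataframe from the question
odm_sdf = spark.createDataFrame(
    [('abcefgh', 'B', 'EQ', 'K'), ('abcefgh', 'B', 'EQ', 'ZW'),
     ('abcefgh', 'S', 'EQ', 'V'), ('abcefgh', 'S', 'EQ', 'VW'),
     ('abcefgh', 'I', 'EQ', 'CI')],
    schema='odm_id string, return_value string, odm_relation string, input_value string'
)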
import pyspark.sql.functions as func

data_sdf. \
    join(odm_sdf.selectExpr('return_value', 'input_value as tr_type').dropDuplicates(),
         ['tr_type'],
         'left'
         ). \
    withColumn('nominal_new',
               func.when(func.col('return_value') == 'S', func.col('nominal') * -1).
               otherwise(func.col('nominal'))
               ). \
    drop('return_value'). \
    show()
# +-------+---+-------+-----------+
# |tr_type| id|nominal|nominal_new|
# +-------+---+-------+-----------+
# | K| 1| 2.0| 2.0|
# | CI| 5| 5.0| 5.0|
# | V| 3| 12.5| -12.5|
# | VW| 4| 9.0| -9.0|
# | ZW| 2| 7.0| 7.0|
# +-------+---+-------+-----------+

Related

Transpose each record into multiple columns in pyspark dataframe

I am looking to transpose each record into multiple columns in a PySpark dataframe.
This is my dataframe:
+--------+-------------+--------------+------------+------+
|level_1 |level_2 |level_3 |level_4 |UNQ_ID|
+--------+-------------+--------------+------------+------+
|D Group|Investments |ORB |ECM |1 |
|E Group|Investment |Origination |Execution |2 |
+--------+-------------+--------------+------------+------+
Required dataframe is:
+--------+---------------+------+
|level |name |UNQ_ID|
+--------+---------------+------+
|level_1 |D Group |1 |
|level_1 |E Group |2 |
|level_2 |Investments |1 |
|level_2 |Investment |2 |
|level_3 |ORB |1 |
|level_3 |Origination |2 |
|level_4 |ECM |1 |
|level_4 |Execution |2 |
+--------+---------------+------+
The easier way is to use the stack function:
import pyspark.sql.functions as f
output_df = df.selectExpr('stack(4, "level_1", level_1, "level_2", level_2, "level_3", level_3, "level_4", level_4) as (level, name)', 'UNQ_ID')
output_df.show()
# +-------+-----------+------+
# | level| name|UNQ_ID|
# +-------+-----------+------+
# |level_1| D Group| 1|
# |level_2|Investments| 1|
# |level_3| ORB| 1|
# |level_4| ECM| 1|
# |level_1| E Group| 2|
# |level_2| Investment| 2|
# |level_3|Origination| 2|
# |level_4| Execution| 2|
# +-------+-----------+------+
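If the number of level columns can vary, the stack expression does not have to be hardcoded. A sketch that builds it from the column names, assuming every column to unpivot starts with 'level_':

level_cols = [c for c in df.columns if c.startswith('level_')]
stack_expr = 'stack({}, {}) as (level, name)'.format(
    len(level_cols),
    ', '.join('"{0}", {0}'.format(c) for c in level_cols)
)
output_df = df.selectExpr(stack_expr, 'UNQ_ID')
output_df.show()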

Creating Total and percentage of total columns in Pyspark

Here is my test data
test = spark.createDataFrame([
    ("2018-06-03", 2, 4, 4),
    ("2018-06-04", 4, 3, 3),
    ("2018-06-03", 8, 1, 1),
    ("2018-06-01", 3, 1, 1),
    ("2018-06-05", 3, 2, 0),
])\
    .toDF("transactiondate", "SalesA", "SalesB", "SalesC")
test.show()
I would like to add a row-wise total column and a percentage-of-total column for each sales category (A, B and C).
Desired Output:
+---------------+------+------+------+----------+------+------+------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_A|Perc_B|Perc_C|
+---------------+------+------+------+----------+------+------+------+
| 2018-06-03| 2| 4| 4| 10| 0.2| 0.4| 0.4|
| 2018-06-04| 4| 3| 3| 10| 0.4| 0.3| 0.3|
| 2018-06-03| 8| 1| 1| 10| 0.8| 0.1| 0.1|
| 2018-06-01| 3| 1| 1| 5| 0.6| 0.2| 0.2|
| 2018-06-05| 3| 2| 0| 5| 0.6| 0.4| 0.0|
+---------------+------+------+------+----------+------+------+------+
How can I do it in pyspark?
Edit: I want the code to be adaptable even if I add more items, i.e. if I add one more column SalesD, the code should still create the total and percentage columns (i.e. the columns shouldn't be hardcoded).
You can use selectExpr and do simple arithmetic SQL operations for each added column:
test = test.selectExpr("*",
                       "SalesA+SalesB+SalesC as TotalSales",
                       "SalesA/(SalesA+SalesB+SalesC) as Perc_A",
                       "SalesB/(SalesA+SalesB+SalesC) as Perc_B",
                       "SalesC/(SalesA+SalesB+SalesC) as Perc_C"
                       )
or use a more flexible solution:
from pyspark.sql.functions import col, expr

# columns to be included in the TotalSales calculation
cols = ['SalesA', 'SalesB', 'SalesC']

# note: the generated aliases are Perc_SalesA, Perc_SalesB, Perc_SalesC
test = (test
        .withColumn('TotalSales', expr('+'.join(cols)))
        .select(col('*'),
                *[expr('{0}/TotalSales {1}'.format(c, 'Perc_' + c)) for c in cols]))
One option is to use several withColumn statements:
import pyspark.sql.functions as F

test\
    .withColumn('TotalSales', F.col('SalesA') + F.col('SalesB') + F.col('SalesC'))\
    .withColumn('Perc_A', F.col('SalesA') / F.col('TotalSales'))\
    .withColumn('Perc_B', F.col('SalesB') / F.col('TotalSales'))\
    .withColumn('Perc_C', F.col('SalesC') / F.col('TotalSales'))
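To address the edit about adaptability, the withColumn approach can also be written without hardcoding the column names. A sketch, assuming every sales column starts with 'Sales':

from functools import reduce
import pyspark.sql.functions as F

sales_cols = [c for c in test.columns if c.startswith('Sales')]

# sum all sales columns into TotalSales, then add one percentage column per sales column
result = test.withColumn('TotalSales', reduce(lambda a, b: a + b, [F.col(c) for c in sales_cols]))
for c in sales_cols:
    result = result.withColumn('Perc_' + c.replace('Sales', ''), F.col(c) / F.col('TotalSales'))
result.show()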
Try this spark-sql solution:
test.createOrReplaceTempView("sales_table")

sales = [x for x in test.columns if x.upper().startswith("SALES")]
sales2 = "+".join(sales)
print(str(sales))  # ['SalesA', 'SalesB', 'SalesC']

per_sales = [x + "/TotalSales as " + "Perc_" + x for x in sales]
per_sales2 = ",".join(per_sales)
print(str(per_sales))  # ['SalesA/TotalSales as Perc_SalesA', 'SalesB/TotalSales as Perc_SalesB', 'SalesC/TotalSales as Perc_SalesC']

spark.sql(f"""
    with t1 ( select *, {sales2} TotalSales from sales_table )
    select *, {per_sales2} from t1
""").show(truncate=False)
+---------------+------+------+------+----------+-----------+-----------+-----------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_SalesA|Perc_SalesB|Perc_SalesC|
+---------------+------+------+------+----------+-----------+-----------+-----------+
|2018-06-03 |2 |4 |4 |10 |0.2 |0.4 |0.4 |
|2018-06-04 |4 |3 |3 |10 |0.4 |0.3 |0.3 |
|2018-06-03 |8 |1 |1 |10 |0.8 |0.1 |0.1 |
|2018-06-01 |3 |1 |1 |5 |0.6 |0.2 |0.2 |
|2018-06-05 |3 |2 |0 |5 |0.6 |0.4 |0.0 |
+---------------+------+------+------+----------+-----------+-----------+-----------+
You can also use the aggregate() higher-order function to sum the Sales* columns. But for this the columns must be of integer/double type, not long, because the zero literal 0 is an int and aggregate() expects the accumulator type to match it.
test2 = test.withColumn("SalesA", expr("cast(salesa as int)"))\
    .withColumn("SalesB", expr("cast(salesb as int)"))\
    .withColumn("SalesC", expr("cast(salesc as int)"))
test2.createOrReplaceTempView("sales_table2")

sales3 = ",".join(sales)  # just join the sales columns with a comma

spark.sql(f"""
    with t1 ( select *, aggregate(array({sales3}),0,(acc,x) -> acc+x) TotalSales from sales_table2 )
    select *, {per_sales2} from t1
""").show(truncate=False)
+---------------+------+------+------+----------+-----------+-----------+-----------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_SalesA|Perc_SalesB|Perc_SalesC|
+---------------+------+------+------+----------+-----------+-----------+-----------+
|2018-06-03 |2 |4 |4 |10 |0.2 |0.4 |0.4 |
|2018-06-04 |4 |3 |3 |10 |0.4 |0.3 |0.3 |
|2018-06-03 |8 |1 |1 |10 |0.8 |0.1 |0.1 |
|2018-06-01 |3 |1 |1 |5 |0.6 |0.2 |0.2 |
|2018-06-05 |3 |2 |0 |5 |0.6 |0.4 |0.0 |
+---------------+------+------+------+----------+-----------+-----------+-----------+
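As a side note, the cast can probably be avoided by making the initial value of aggregate() a bigint literal (0L) so the accumulator type matches the long columns. A sketch, reusing sales3 and per_sales2 from above and not verified on every Spark version:

spark.sql(f"""
    with t1 ( select *, aggregate(array({sales3}), 0L, (acc, x) -> acc + x) TotalSales from sales_table )
    select *, {per_sales2} from t1
""").show(truncate=False)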

How to convert pyspark.rdd.PipelinedRDD to a data frame without using the collect() method in Pyspark?

I have a pyspark.rdd.PipelinedRDD (Rdd1).
When I do Rdd1.collect(), it gives a result like below:
[(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
(1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
(2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
(3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]
Now I want to convert the pyspark.rdd.PipelinedRDD to a data frame without using the collect() method.
My final data frame should be like below; df.show() should give:
+----------+-------+-------------------+
|CId |IID |Score |
+----------+-------+-------------------+
|10 |4 |2.9996439803387602 |
|10 |5 |1.6767412921625855 |
|10 |3 |3.616726727464709 |
|1 |4 |-1.5271512313750577|
|1 |5 |1.9665475696370045 |
|1 |3 |2.016527311459324 |
|2 |4 |4.033642544526678 |
|2 |5 |3.1517805604906313 |
|2 |3 |6.230272144805092 |
|3 |4 |2.9757316477407443 |
|3 |5 |-1.5689126834176417|
|3 |3 |-0.3924680103722977|
+----------+-------+-------------------+
I can achieve this by applying collect(), iterating over the result, and finally building a data frame,
but now I want to convert the pyspark.rdd.PipelinedRDD to a DataFrame without using any collect() method.
Please let me know how to achieve this.
You want to do two things here:
1. flatten your data
2. put it into a dataframe
One way to do it is as follows:
First, let us flatten the dictionary:
rdd2 = Rdd1.flatMapValues(lambda x : [ (k, x[k]) for k in x.keys()])
When collecting the data, you get something like this:
[(10, (3, 3.616726727464709)), (10, (4, 2.9996439803387602)), ...
Then we can format the data and turn it into a dataframe:
rdd2.map(lambda x: (x[0], x[1][0], x[1][1]))\
    .toDF(("CId", "IID", "Score"))\
    .show()
which gives you this:
+---+---+-------------------+
|CId|IID| Score|
+---+---+-------------------+
| 10| 3| 3.616726727464709|
| 10| 4| 2.9996439803387602|
| 10| 5| 1.6767412921625855|
| 1| 3| 2.016527311459324|
| 1| 4|-1.5271512313750577|
| 1| 5| 1.9665475696370045|
| 2| 3| 6.230272144805092|
| 2| 4| 4.033642544526678|
| 2| 5| 3.1517805604906313|
| 3| 3|-0.3924680103722977|
| 3| 4| 2.9757316477407443|
| 3| 5|-1.5689126834176417|
+---+---+-------------------+
There is an even easier and more elegant solution that avoids Python lambda expressions, as in @oli's answer. It relies on Spark DataFrames' explode, which perfectly fits your requirement. It should be faster too, because there is no need to use Python lambdas twice. See below:
from pyspark.sql.functions import explode

# dummy data
data = [(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
        (1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
        (2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
        (3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]

# create your rdd
rdd = sc.parallelize(data)

# convert to spark data frame
df = rdd.toDF(["CId", "Values"])

# use explode
df.select("CId", explode("Values").alias("IID", "Score")).show()
+---+---+-------------------+
|CId|IID| Score|
+---+---+-------------------+
| 10| 3| 3.616726727464709|
| 10| 4| 2.9996439803387602|
| 10| 5| 1.6767412921625855|
| 1| 3| 2.016527311459324|
| 1| 4|-1.5271512313750577|
| 1| 5| 1.9665475696370045|
| 2| 3| 6.230272144805092|
| 2| 4| 4.033642544526678|
| 2| 5| 3.1517805604906313|
| 3| 3|-0.3924680103722977|
| 3| 4| 2.9757316477407443|
| 3| 5|-1.5689126834176417|
+---+---+-------------------+
This is how you can do it with Scala:
val Rdd1 = spark.sparkContext.parallelize(Seq(
  (10, Map(3 -> 3.616726727464709, 4 -> 2.9996439803387602, 5 -> 1.6767412921625855)),
  (1, Map(3 -> 2.016527311459324, 4 -> -1.5271512313750577, 5 -> 1.9665475696370045)),
  (2, Map(3 -> 6.230272144805092, 4 -> 4.033642544526678, 5 -> 3.1517805604906313)),
  (3, Map(3 -> -0.3924680103722977, 4 -> 2.9757316477407443, 5 -> -1.5689126834176417))
))

val x = Rdd1.flatMap(x => x._2.map(y => (x._1, y._1, y._2)))
  .toDF("CId", "IId", "score")
Output:
+---+---+-------------------+
|CId|IId|score |
+---+---+-------------------+
|10 |3 |3.616726727464709 |
|10 |4 |2.9996439803387602 |
|10 |5 |1.6767412921625855 |
|1 |3 |2.016527311459324 |
|1 |4 |-1.5271512313750577|
|1 |5 |1.9665475696370045 |
|2 |3 |6.230272144805092 |
|2 |4 |4.033642544526678 |
|2 |5 |3.1517805604906313 |
|3 |3 |-0.3924680103722977|
|3 |4 |2.9757316477407443 |
|3 |5 |-1.5689126834176417|
+---+---+-------------------+
Hopefully you can convert this to PySpark.
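For reference, a rough PySpark equivalent of that flatMap might look like this (a sketch, assuming Rdd1 holds the (id, dict) pairs from the question and a SparkSession already exists):

# flatten each (id, {iid: score, ...}) pair into (id, iid, score) tuples
df = Rdd1.flatMap(lambda x: [(x[0], k, v) for k, v in x[1].items()]) \
    .toDF(["CId", "IID", "Score"])
df.show(truncate=False)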
Ensure a Spark session is created first:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
I found this answer when I was trying to solve this exact issue: 'PipelinedRDD' object has no attribute 'toDF' in PySpark.

Pyspark : Cumulative Sum with reset condition

We have a dataframe like below:
+------+--------------------+
| Flag | value|
+------+--------------------+
|1 |5 |
|1 |4 |
|1 |3 |
|1 |5 |
|1 |6 |
|1 |4 |
|1 |7 |
|1 |5 |
|1 |2 |
|1 |3 |
|1 |2 |
|1 |6 |
|1 |9 |
+------+--------------------+
After a normal cumsum we get this:
+------+--------------------+----------+
| Flag | value|cumsum |
+------+--------------------+----------+
|1 |5 |5 |
|1 |4 |9 |
|1 |3 |12 |
|1 |5 |17 |
|1 |6 |23 |
|1 |4 |27 |
|1 |7 |34 |
|1 |5 |39 |
|1 |2 |41 |
|1 |3 |44 |
|1 |2 |46 |
|1 |6 |52 |
|1 |9 |61 |
+------+--------------------+----------+
Now what we want is for the cumsum to reset when a specific condition is met, e.g. when it crosses 20.
Below is expected output:
+------+--------------------+----------+---------+
| Flag | value|cumsum |expected |
+------+--------------------+----------+---------+
|1 |5 |5 |5 |
|1 |4 |9 |9 |
|1 |3 |12 |12 |
|1 |5 |17 |17 |
|1 |6 |23 |23 |
|1 |4 |27 |4 | <-----reset
|1 |7 |34 |11 |
|1 |5 |39 |16 |
|1 |2 |41 |18 |
|1 |3 |44 |21 |
|1 |2 |46 |2 | <-----reset
|1 |6 |52 |8 |
|1 |9 |61 |17 |
+------+--------------------+----------+---------+
This is how we are calculating the cumulative sum.
win_counter = Window.partitionBy("flag")
df_partitioned = df_partitioned.withColumn('cumsum', F.sum(F.col('value')).over(win_counter))
There are two ways I've found to solve it without udf:
Dataframe
from pyspark.sql.window import Window
import pyspark.sql.functions as f

df = spark.createDataFrame([
    (1, 5), (1, 4), (1, 3), (1, 5), (1, 6), (1, 4),
    (1, 7), (1, 5), (1, 2), (1, 3), (1, 2), (1, 6), (1, 9)
], schema='Flag int, value int')

w = (Window
     .partitionBy('flag')
     .orderBy(f.monotonically_increasing_id())
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = df.withColumn('values', f.collect_list('value').over(w))

expr = "AGGREGATE(values, 0, (acc, el) -> IF(acc < 20, acc + el, el))"
df = df.select('Flag', 'value', f.expr(expr).alias('cumsum'))
df.show(truncate=False)
RDD
df = spark.createDataFrame([
    (1, 5), (1, 4), (1, 3), (1, 5), (1, 6), (1, 4),
    (1, 7), (1, 5), (1, 2), (1, 3), (1, 2), (1, 6), (1, 9)
], schema='Flag int, value int')

def cumsum_by_flag(rows):
    cumsum, reset = 0, False
    for row in rows:
        if reset:
            cumsum = row.value
            reset = False
        else:
            cumsum += row.value
        reset = cumsum > 20
        yield row.value, cumsum

def unpack(value):
    flag = value[0]
    value, cumsum = value[1]
    return flag, value, cumsum

rdd = df.rdd.keyBy(lambda row: row.Flag)
rdd = (rdd
       .groupByKey()
       .flatMapValues(cumsum_by_flag)
       .map(unpack))

df = rdd.toDF('Flag int, value int, cumsum int')
df.show(truncate=False)
Output:
+----+-----+------+
|Flag|value|cumsum|
+----+-----+------+
|1 |5 |5 |
|1 |4 |9 |
|1 |3 |12 |
|1 |5 |17 |
|1 |6 |23 |
|1 |4 |4 |
|1 |7 |11 |
|1 |5 |16 |
|1 |2 |18 |
|1 |3 |21 |
|1 |2 |2 |
|1 |6 |8 |
|1 |9 |17 |
+----+-----+------+
It's probably best to do this with a pandas_udf here.
import math
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

pdf = pd.DataFrame({'flag': [1]*13, 'id': range(13), 'value': [5, 4, 3, 5, 6, 4, 7, 5, 2, 3, 2, 6, 9]})
df = spark.createDataFrame(pdf)
df = df.withColumn('cumsum', F.lit(math.inf))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
    pdf.sort_values(by=['id'], inplace=True, ascending=True)
    cumsums = []
    prev = None
    reset = False
    for v in pdf['value'].values:
        if prev is None:
            cumsums.append(v)
            prev = v
        else:
            prev = prev + v if not reset else v
            cumsums.append(prev)
        reset = True if prev >= 20 else False
    pdf['cumsum'] = cumsums
    return pdf

df = df.groupby('flag').apply(_calc_cumsum)
df.show()
the results:
+----+---+-----+------+
|flag| id|value|cumsum|
+----+---+-----+------+
| 1| 0| 5| 5.0|
| 1| 1| 4| 9.0|
| 1| 2| 3| 12.0|
| 1| 3| 5| 17.0|
| 1| 4| 6| 23.0|
| 1| 5| 4| 4.0|
| 1| 6| 7| 11.0|
| 1| 7| 5| 16.0|
| 1| 8| 2| 18.0|
| 1| 9| 3| 21.0|
| 1| 10| 2| 2.0|
| 1| 11| 6| 8.0|
| 1| 12| 9| 17.0|
+----+---+-----+------+
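On Spark 3.x, where the GROUPED_MAP pandas_udf style is deprecated, the same grouped logic can be run with applyInPandas. A sketch, assuming df is the input dataframe built above (before the apply call) and _calc_cumsum is defined as a plain function without the decorator:

# pass the function and the output schema explicitly
result = df.groupby('flag').applyInPandas(
    _calc_cumsum,
    schema='flag long, id long, value long, cumsum double'
)
result.show()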

Pyspark autoincrement for alternating group of values

I'm trying to create a new column in a Spark DataFrame using Pyspark, which represents an autoincrement (or ID) based on groups of alternating boolean values. Let's say I have the following DataFrame:
df.show()
+-----+------------+-------------+
|id |par_id |is_on |
+-----+------------+-------------+
|40002|1 |true |
|40003|2 |true |
|40004|null |false |
|40005|17 |true |
|40006|2 |true |
|40007|17 |true |
|40008|240 |true |
|40009|1861 |true |
|40010|1862 |true |
|40011|2 |true |
|40012|null |false |
|40013|1863 |true |
|40014|626 |true |
|40016|208 |true |
|40017|2 |true |
|40018|null |false |
|40019|2 |true |
|40020|1863 |true |
|40021|2 |true |
|40022|2 |true |
+-----+------------+-------------+
I want to extend this DataFrame with an incremental id called id2 using the is_on attribute. That is, each group of boolean values should get an increasing id. The resulting DataFrame should look like this:
df.show()
+-----+------------+-------------+-----+
|id |par_id |is_on |id2 |
+-----+------------+-------------+-----+
|40002|1 |true |1 |
|40003|2 |true |1 |
|40004|null |false |2 |
|40005|17 |true |3 |
|40006|2 |true |3 |
|40007|17 |true |3 |
|40008|240 |true |3 |
|40009|1861 |true |3 |
|40010|1862 |true |3 |
|40011|2 |true |3 |
|40012|null |false |4 |
|40013|1863 |true |5 |
|40014|626 |true |5 |
|40016|208 |true |5 |
|40017|2 |true |5 |
|40018|null |false |6 |
|40019|2 |true |7 |
|40020|1863 |true |7 |
|40021|2 |true |7 |
|40022|2 |true |7 |
+-----+------------+-------------+-----+
Do you have any suggestions on how to do that? How can I write a user-defined function for this?
# this is a python spark testing file
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, col, udf, struct
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.master("local").appName("durga prasad").config("spark.sql.warehouse.dir", "/home/hadoop/spark-2.0.1-bin-hadoop2.7/bin/test_warehouse").getOrCreate()
df = spark.read.csv("/home/hadoop/stack_test.txt", sep=",", header=True)

# This is the udf
count = 1   # this variable is changed on each function call
prStr = ''  # this variable is changed on each function call
def test_fun(str):
    global count
    global prStr
    if str == "false":
        count = count + 1
        prStr = str
        return count
    if str == "true" and prStr == 'false':
        count = count + 1
        prStr = str
        return count
    elif str == 'true':
        count = count
        prStr = str
        return count
# udf function end

testUDF = udf(test_fun, StringType())  # register the udf
df.select("id", "par_id", "is_on", testUDF('is_on').alias("id2")).show()
Output:
+-----+------+-----+---+
| id|par_id|is_on|id2|
+-----+------+-----+---+
|40002| 1| true| 1|
|40003| 2| true| 1|
|40004| null|false| 2|
|40005| 17| true| 3|
|40006| 2| true| 3|
|40007| 17| true| 3|
|40008| 240| true| 3|
|40009| 1861| true| 3|
|40010| 1862| true| 3|
|40011| 2| true| 3|
|40012| null|false| 4|
|40013| 1863| true| 5|
|40014| 626| true| 5|
|40016| 208| true| 5|
|40017| 2| true| 5|
|40018| null|false| 6|
|40019| 2| true| 7|
|40020| 1863| true| 7|
|40021| 2| true| 7|
|40022| 2| true| 7|
+-----+------+-----+---+
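A note on the UDF above: it relies on module-level globals and on rows arriving in order, which is only safe when everything runs in a single partition. A deterministic alternative (a sketch, not from the original answer, using the questioner's df) is to count the is_on transitions with a window ordered by the existing id column:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.orderBy('id')  # assumes `id` defines the row order

df2 = (df
       .withColumn('prev_is_on', F.lag('is_on').over(w))
       .withColumn('changed', (F.col('is_on') != F.col('prev_is_on')).cast('int'))
       .withColumn('id2', F.coalesce(F.sum('changed').over(w), F.lit(0)) + 1)
       .drop('prev_is_on', 'changed'))
df2.show()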
