PySpark: Split and conditional statements

I am trying to create a column called "w" by splitting the values of column "x" and applying conditions: if a value contains the "<" symbol, 0.1 should be subtracted from it; if a value starts with "+", the "+" should simply be removed.
I tried the split below, but I still need to write the conditions.
Thank you for your help :)
from pyspark.sql.functions import col, split
# Attempt so far: keep only the part before the "-"
dataframe = dataframe.withColumn("x", split(col("x"), "-").getItem(0))
# Ranges must be quoted strings; unquoted 35-40 would evaluate to the integer -5
data = [["1", "Amit", "DU", "I", "<25"],
["2", "Mohit", "DU", "I", "<25"],
["3", "rohith", "BHU", "I", "35 - 40"],
["4", "sridevi", "LPU", "I", "30 - 35"],
["1", "sravan", "KLMP", "M", "25 - 30"],
["5", "gnanesh", "IIT", "M", "40 - 45"],
["5", "gnadesh", "KLM", "c", "+45"]]
columns = ['ID', 'NAME', 'college', 'metric', 'x']
dataframe = spark.createDataFrame(data, columns)
My current DataFrame looks like this:
+---+-------+-------+------+--------+
| ID| NAME|college|metric| x|
+---+-------+-------+------+--------+
| 1| Amit| DU| I| <25|
| 2| Mohit| DU| I| <25|
| 3| rohith| BHU| I| 35 - 40|
| 4|sridevi| LPU| I| 30 - 35|
| 1| sravan| KLMP| M| 25 - 30|
| 5|gnanesh| IIT| M| 40 - 45|
| 5|gnadesh| KLM| c| +45|
+---+-------+-------+------+--------+
My output should look like this:
+---+-------+-------+------+--------+----+
| ID| NAME|college|metric| x| w|
+---+-------+-------+------+--------+----+
| 1| Amit| DU| I| <25|24.9|
| 2| Mohit| DU| I| <25|24.9|
| 3| rohith| BHU| I| 35 - 40| 35|
| 4|sridevi| LPU| I| 30 - 35| 30|
| 1| sravan| KLMP| M| 25 - 30| 25|
| 5|gnanesh| IIT| M| 40 - 45| 40|
| 5|gnadesh| KLM| c| +45| 45|
+---+-------+-------+------+--------+----+

From what I understand, you have three conditions for the values in column x (let me know if this is not the case):
If the value is <X then the new column value will be X-0.1
If the value is X-Y then the new column value will be X
If the value is +X then the new column value will be X
So this should work:
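from pyspark.sql import functions as F
# Conditions are checked in order; the first one that matches wins.
# trim removes the space left over when "35 - 40" is split on "-".
(df.withColumn("NewColumn",
    F.when(F.col("x").contains("<"), F.split("x", "<").getItem(1) - 0.1)
    .when(F.col("x").contains("-"), F.trim(F.split("x", "-").getItem(0)))
    .when(F.col("x").contains("+"), F.split("x", "\\+").getItem(1)))
    .show())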
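To sanity-check the three rules without a Spark session, here is a small plain-Python sketch. The function name lower_bound is made up for illustration, and the rules are checked by prefix rather than with contains, so treat it as a way to verify expected values on a few samples, not as a replacement for the column expression above.

```python
def lower_bound(x: str) -> float:
    """Apply the three rules from the answer to a single value of column x."""
    s = x.strip()
    if s.startswith("<"):
        # "<25" -> 25 - 0.1
        return float(s[1:]) - 0.1
    if s.startswith("+"):
        # "+45" -> 45
        return float(s[1:])
    if "-" in s:
        # "35 - 40" -> 35 (left side of the range)
        return float(s.split("-")[0])
    raise ValueError(f"unrecognised value: {x!r}")


samples = ["<25", "35 - 40", "+45"]
# Round to one decimal so floating-point noise does not show up in the result
results = {s: round(lower_bound(s), 1) for s in samples}
print(results)
```

Values that fit none of the three patterns raise an error here; in the Spark version they would come out as null because there is no otherwise branch.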

Related

How do I calculate the start/end of an interval (set of rows) containing identical values?

Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each group of consecutive equal values, then group by that identifier to compute the min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window
df = df.withColumn(
"fg",
F.when(
F.lag('value').over(Window.orderBy("time"))==F.col("value"),
0
).otherwise(1)
)
df = df.withColumn(
"rn",
F.sum("fg").over(
Window
.orderBy("time")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
'value',
"rn"
).agg(
F.min('time').alias("start"),
F.max('time').alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
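For a quick cross-check of the expected intervals on a small sample, the same "group consecutive equal values" idea can be sketched in plain Python with itertools.groupby. The function name intervals and the tuple layout (time, value) are assumptions made for this illustration; it is not meant for data that does not fit in memory.

```python
from itertools import groupby

# (time, value) pairs from the question
rows = [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "B"), (6, "A")]


def intervals(rows):
    """Return (value, start, end) for each run of consecutive equal values."""
    ordered = sorted(rows)  # order by time, like Window.orderBy("time")
    out = []
    # groupby starts a new group each time the key changes,
    # which plays the same role as the fg flag and running sum above
    for value, run in groupby(ordered, key=lambda r: r[1]):
        times = [t for t, _ in run]
        out.append((value, times[0], times[-1]))
    return out


print(intervals(rows))
```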

Why sum is not displaying after aggregation & pivot?

I have student marks as shown below. I want to transpose the SubjectName column and also get the total marks after the pivot.
Source table like:
+---------+-----------+-----+
|StudentId|SubjectName|Marks|
+---------+-----------+-----+
| 1| A| 10|
| 1| B| 20|
| 1| C| 30|
| 2| A| 20|
| 2| B| 25|
| 2| C| 30|
| 3| A| 10|
| 3| B| 20|
| 3| C| 20|
+---------+-----------+-----+
Destination:
+---------+---+---+---+-----+
|StudentId| A| B| C|Total|
+---------+---+---+---+-----+
| 1| 10| 20| 30| 60|
| 3| 10| 20| 20| 50|
| 2| 20| 25| 30| 75|
+---------+---+---+---+-----+
Please find the below source code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}
val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val list = List((1, "A", 10), (1, "B", 20), (1, "C", 30), (2, "A", 20), (2, "B", 25), (2, "C", 30), (3, "A", 10),
(3, "B", 20), (3, "C", 20))
val df = list.toDF("StudentId", "SubjectName", "Marks")
df.show() // source table as per above
val df1 = df.groupBy("StudentId").pivot("SubjectName", Seq("A", "B", "C")).agg(sum("Marks"))
df1.show()
val df2 = df1.withColumn("Total", col("A") + col("B") + col("C"))
df2.show() // required destination
val df3 = df.groupBy("StudentId").agg(sum("Marks").as("Total"))
df3.show()
df1 does not display the sum/total column. It displays the following:
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 10| 20| 30|
| 3| 10| 20| 20|
| 2| 20| 25| 30|
+---------+---+---+---+
df3 is able to create a new Total column, so why is df1 not able to create one?
Can anybody help me see what I am missing, or what is wrong with my understanding of pivot?
This is expected behaviour for Spark's pivot function: .agg is applied to each pivoted column, which is why the sum of marks does not appear as a new column.
Refer to the official documentation for pivot.
Example:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(sum("Marks") + 2).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 12| 22| 32|
| 3| 12| 22| 22|
| 2| 22| 27| 32|
+---------+---+---+---+
In the above example we have added 2 to all the pivoted columns.
Example 2:
To get a count using pivot and agg:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(count("*")).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 1| 1| 1|
| 3| 1| 1| 1|
| 2| 1| 1| 1|
+---------+---+---+---+
The .agg that follows pivot applies only to the pivoted data. To get the total, add a new column that sums the pivoted columns, as below.
val cols = Seq("A", "B", "C")
val result = df.groupBy("StudentId")
.pivot("SubjectName")
.agg(sum("Marks"))
.withColumn("Total", cols.map(col _).reduce(_ + _))
result.show(false)
Output:
+---------+---+---+---+-----+
|StudentId|A |B |C |Total|
+---------+---+---+---+-----+
|1 |10 |20 |30 |60 |
|3 |10 |20 |20 |50 |
|2 |20 |25 |30 |75 |
+---------+---+---+---+-----+
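The two steps (aggregate per pivoted cell, then add the row total separately) can also be written out in plain Python to make the logic explicit. This is only an illustrative sketch; pivot_with_total is a hypothetical helper name. One difference to keep in mind: it skips missing subjects when summing, whereas col("A") + col("B") + col("C") in Spark yields null if any of the columns is null.

```python
from collections import defaultdict

# (StudentId, SubjectName, Marks) rows from the question
marks = [(1, "A", 10), (1, "B", 20), (1, "C", 30),
         (2, "A", 20), (2, "B", 25), (2, "C", 30),
         (3, "A", 10), (3, "B", 20), (3, "C", 20)]


def pivot_with_total(rows, subjects):
    """Pivot subjects into columns, then add a Total column per student."""
    # Step 1: sum(Marks) per (student, subject), i.e. the pivoted cells
    sums = defaultdict(lambda: defaultdict(int))
    for student_id, subject, mark in rows:
        sums[student_id][subject] += mark
    # Step 2: Total is computed from the pivoted cells, not by the aggregation
    table = {}
    for student_id, by_subject in sums.items():
        row = {s: by_subject.get(s) for s in subjects}
        row["Total"] = sum(v for v in row.values() if v is not None)
        table[student_id] = row
    return table


table = pivot_with_total(marks, ["A", "B", "C"])
for student_id, row in sorted(table.items()):
    print(student_id, row)
```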

How to define spark dataframe join match priority

I have two dataframes.
dataDF
+---+
| tt|
+---+
| a|
| b|
| c|
| ab|
+---+
alter
+----+-----+------+
|name|alter|profit|
+----+-----+------+
| a| aa| 1|
| b| a| 5|
| c| ab| 8|
+----+-----+------+
The task is to look up col("tt") in col("name") of the alter dataframe and join on it if found; if not found, look up col("tt") in col("alter") instead. col("name") has higher priority than col("alter"): if a row of col("tt") matches col("name"), I do not want it to also match another row that only matches col("alter"). How can I achieve this?
I tried to write a join, but it does not work as expected.
dataDF = dataDF.select("*")
.join(broadcast(alterDF),
col("tt") === col("Name") || col("tt") === col("alter"),
"left")
The result is:
+---+----+-----+------+
| tt|name|alter|profit|
+---+----+-----+------+
| a| a| aa| 1|
| a| b| a| 5| // this row is not expected.
| b| b| a| 5|
| c| c| ab| 8|
| ab| c| ab| 8|
+---+----+-----+------+
You can try joining twice. First join on the name column; then take the tt values that found no match and join those on the alter column. Finally, union both results. The code is below; I hope it is helpful.
//Creating Test Data
val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alter = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
.toDF("name", "alter", "profit")
val join1 = dataDF.join(alter, col("tt") === col("name"), "left")
val join2 = join1.filter( col("name").isNull).select("tt")
.join(alter, col("tt") === col("alter"), "left")
val joinDF = join1.filter( col("name").isNotNull).union(join2)
joinDF.show(false)
+---+----+-----+------+
|tt |name|alter|profit|
+---+----+-----+------+
|a |a |aa |1 |
|b |b |a |5 |
|c |c |ab |8 |
|ab |c |ab |8 |
+---+----+-----+------+
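The priority rule itself (try name first, fall back to alter only when name gave nothing) can be checked on the sample data with a few lines of plain Python. This is a sketch for illustration only; priority_join is a made-up name and the nested loops would not scale the way the two Spark joins do.

```python
# (name, alter, profit) rows and the tt values from the question
alter_rows = [("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8)]
tt_values = ["a", "b", "c", "ab"]


def priority_join(tt_values, alter_rows):
    """Left join tt against name first; use alter only if name did not match."""
    out = []
    for tt in tt_values:
        # First pass: match on the high-priority column
        matches = [r for r in alter_rows if r[0] == tt]
        if not matches:
            # Second pass: only reached when name produced no match
            matches = [r for r in alter_rows if r[1] == tt]
        if not matches:
            # Left-join semantics: keep the unmatched tt with nulls
            out.append((tt, None, None, None))
        for r in matches:
            out.append((tt,) + r)
    return out


for row in priority_join(tt_values, alter_rows):
    print(row)
```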

PySpark : change column names of a df based on relations defined in another df

I have two Spark data-frames loaded from csv of the form :
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into :
A B C D
1 2 3
12 21 4
as dd didn't have any mapping in the original table, D column should have all null values.
How can I do this without converting the mapping_df into a dictionary and checking individually for mapped names? (this would mean I have to collect the mapping_fields and check, which kind of contradicts my use-case of distributedly handling all the datasets)
Thanks!
With a melt function (linked in the original answer, not shown here) you could:
from pyspark.sql import functions as f
mapping_fields = spark.createDataFrame(
[("A", "aa"), ("B", "bb"), ("C", "cc")],
("new_name", "old_name"))
df = spark.createDataFrame(
[(1, 2, 3, 43), (12, 21, 4, 37)],
("aa", "bb", "cc", "dd"))
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"], "left_outer")
.withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
.withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
but in your description nothing justifies this. Because number of columns is fairly limited, I'd rather:
mapping = dict(
mapping_fields
.filter(f.col("old_name").isin(df.columns))
.select("old_name", "new_name").collect())
df.select([
(f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
for (c, t) in df.dtypes])
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
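To see what the dictionary-based version computes, here is the same renaming logic applied to plain Python lists, with no Spark involved. The helper name rename_with_mapping is hypothetical, and unmapped columns are upper-cased and nulled to mirror the output shown above.

```python
# (new_name, old_name) pairs, plus the data frame as column names and rows
mapping_rows = [("A", "aa"), ("B", "bb"), ("C", "cc")]
columns = ["aa", "bb", "cc", "dd"]
data = [(1, 2, 3, 43), (12, 21, 4, 37)]


def rename_with_mapping(columns, data, mapping_rows):
    """Rename mapped columns; upper-case and null out the unmapped ones."""
    mapping = {old: new for new, old in mapping_rows}
    # Same naming rule as mapping.get(c, c.upper()) in the select above
    new_columns = [mapping.get(c, c.upper()) for c in columns]
    # Unmapped columns keep their position but lose their values
    new_data = [
        tuple(v if c in mapping else None for c, v in zip(columns, row))
        for row in data
    ]
    return new_columns, new_data


new_columns, new_data = rename_with_mapping(columns, data, mapping_rows)
print(new_columns)
print(new_data)
```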
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"])
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
or
df.select([
f.col(c).alias(mapping.get(c))
for (c, t) in df.dtypes if c in mapping])
I tried a simple for loop; I hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
When you need the missing column with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part:
...     else:
...         df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
Note that this replaces the original df2 dataframe.

How to join two data frames in Apache Spark and merge keys into one column?

I have the following two Spark data frames:
sale_df:
+-------+----------+
|user_id|total_sale|
+-------+----------+
| a| 1100|
| b| 2100|
| c| 3300|
| d| 4400|
+-------+----------+
and target_df:
+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
| b| 1000|
| c| 2000|
| d| 3000|
| e| 4000|
+-------+-------------------+
How can I join them in a way that output is:
user_id total_sale personalized_target
a 1100 NA
b 2100 1000
c 3300 2000
d 4400 3000
e NA 4000
I have tried almost all the join types, but it seems that a single join cannot produce the desired output.
Any PySpark, SQL, or HiveContext solution would help.
You can use the equi-join syntax in Scala:
val output = sales_df.join(target_df,Seq("user_id"),joinType="outer")
The equivalent in Python should be:
output = sales_df.join(target_df,['user_id'],"outer")
You need to perform an outer equi-join:
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1,['user_id','total_sale'])
data2 = [['b', 1000],['c',2000],['d',3000],['e',4000]]
target = sqlContext.createDataFrame(data2,['user_id','personalized_target'])
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# | e| null| 4000|
# | d| 4400| 3000|
# | c| 3300| 2000|
# | b| 2100| 1000|
# | a| 1100| null|
# +-------+----------+-------------------+
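The semantics of the outer equi-join (every user_id from either side appears once, with null where one side has no row) can be spelled out in plain Python as a check on the expected output. This sketch assumes user_id is unique on each side, which holds for the sample data; full_outer_join is a made-up helper name.

```python
# user_id -> value, one dict per data frame
sales = {"a": 1100, "b": 2100, "c": 3300, "d": 4400}
target = {"b": 1000, "c": 2000, "d": 3000, "e": 4000}


def full_outer_join(left, right):
    """Return (key, left_value, right_value) for every key on either side."""
    keys = sorted(set(left) | set(right))
    # dict.get returns None for a missing key, like null in the Spark output
    return [(k, left.get(k), right.get(k)) for k in keys]


for row in full_outer_join(sales, target):
    print(row)
```

Because the join key is merged into a single user_id column, there is no need to coalesce two key columns afterwards, which is the same effect as passing the column name rather than a join expression in Spark.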
