I have product, brand and percentage columns. For each row, I want to calculate the sum of the percentage column over the rows above it, split into rows with a different brand than the current row and rows with the same brand. How can I do it in PySpark or using spark.sql?
sample data:
import pandas as pd
df = pd.DataFrame({'a': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
                   'brand': ['b1', 'b2', 'b1', 'b3', 'b2', 'b1'],
                   'pct': [40, 30, 10, 8, 7, 5]})
df = spark.createDataFrame(df)
What I am looking for:
product brand pct pct_same_brand pct_different_brand
a1 b1 40 null null
a2 b2 30 null 40
a3 b1 10 40 30
a4 b3 8 null 80
a5 b2 7 30 58
a6 b1 5 50 45
This is what I have tried:
df.createOrReplaceTempView('tmp')
spark.sql("""
select *, sum(pct * (select case when n1.brand = n2.brand then 1 else 0 end
from tmp n1)) over(order by pct desc rows between UNBOUNDED PRECEDING and 1
preceding)
from tmp n2
""").show()
You can get the pct_different_brand column by subtracting the partitioned rolling sum (i.e. the pct_same_brand column) from the total rolling sum:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'pct_same_brand',
    F.sum('pct').over(
        Window.partitionBy('brand')
        .orderBy(F.desc('pct'))
        .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
        .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand'), F.lit(0))
)
df2.show()
+---+-----+---+--------------+-------------------+
| a|brand|pct|pct_same_brand|pct_different_brand|
+---+-----+---+--------------+-------------------+
| a1| b1| 40| null| null|
| a2| b2| 30| null| 40|
| a3| b1| 10| 40| 30|
| a4| b3| 8| null| 80|
| a5| b2| 7| 30| 58|
| a6| b1| 5| 50| 45|
+---+-----+---+--------------+-------------------+
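Since the question also asks about spark.sql, here is a sketch of the equivalent query, expressed with the same two window sums over the tmp view registered above:
spark.sql("""
    SELECT *,
           SUM(pct) OVER (PARTITION BY brand ORDER BY pct DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS pct_same_brand,
           SUM(pct) OVER (ORDER BY pct DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
             - COALESCE(SUM(pct) OVER (PARTITION BY brand ORDER BY pct DESC
                                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0)
             AS pct_different_brand
    FROM tmp
    ORDER BY pct DESC
""").show()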
Hi, I have a sparse dataframe that was loaded with the mergeSchema option.
DF
name A1 A2 B1 B2 ..... partitioned_name
A 1 1 null null partition_a
B 2 2 null null partition_a
A null null 3 4 partition_b
B null null 3 4 partition_b
which I want to transform into:
DF
name A1 A2 B1 B2 .....
A 1 1 3 4
B 2 2 3 4
Any ideas on how to do this efficiently without a join (or RDDs, since the data is huge)? I was thinking of something like pandas concat(axis=1), since all the tables are sorted.
If that pattern repeats and you don't mind hardcoding the column names:
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [
        ('A','1','1','null','null','partition_a'),
        ('B','2','2','null','null','partition_a'),
        ('A','null','null','3','4','partition_b'),
        ('B','null','null','3','4','partition_b')
    ],
    ['name','A1','A2','B1','B2','partitioned_name']
)\
    .withColumn('A1', F.col('A1').cast('integer'))\
    .withColumn('A2', F.col('A2').cast('integer'))\
    .withColumn('B1', F.col('B1').cast('integer'))\
    .withColumn('B2', F.col('B2').cast('integer'))
df.show()
cols_to_agg = [col for col in df.columns if col not in ["name", "partitioned_name"]]
df\
.groupby('name')\
.agg(F.sum('A1').alias('A1'),
F.sum('A2').alias('A2'),
F.sum('B1').alias('B1'),
F.sum('B2').alias('B2'))\
.show()
# +----+----+----+----+----+----------------+
# |name| A1| A2| B1| B2|partitioned_name|
# +----+----+----+----+----+----------------+
# | A| 1| 1|null|null| partition_a|
# | B| 2| 2|null|null| partition_a|
# | A|null|null| 3| 4| partition_b|
# | B|null|null| 3| 4| partition_b|
# +----+----+----+----+----+----------------+
# +----+---+---+---+---+
# |name| A1| A2| B1| B2|
# +----+---+---+---+---+
# | A| 1| 1| 3| 4|
# | B| 2| 2| 3| 4|
# +----+---+---+---+---+
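If there are many such column groups, the cols_to_agg list defined above can drive the aggregation instead of hardcoding it. A minimal sketch (using F.first with ignorenulls=True, which picks the single non-null value per group and, unlike sum, also works for non-numeric columns):
df\
    .groupby('name')\
    .agg(*[F.first(c, ignorenulls=True).alias(c) for c in cols_to_agg])\
    .show()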
I am trying to manipulate two dataframes using PySpark as part of an AWS Glue job.
df1:
item tag
1 AB
2 CD
3 EF
4 QQ
df2:
key1 key2 tags
A1 B1 [AB]
A1 B2 [AB, CD, EF]
A2 B1 [CD, EF]
A2 B3 [AB, EF, ZZ]
I would like to match up the array in df2 with the tag in df1, in the following way:
item key1 key2 tag
1 A1 B1 AB
1 A1 B2 AB
2 A1 B2 CD
2 A2 B1 CD
3 A1 B2 EF
3 A2 B1 EF
3 A2 B3 EF
So, the tag in df1 is used to expand the row based on the tag entries in df2. For example, item 1's tag "AB" occurs in the tags array in df2 for the first two rows.
Also note how 4 gets ignored as the tag QQ does not exist in any array in df2.
I know this is going to be an inner join, but I am not sure how to match up df1.tag with df2.tags to pull in key1 and key2.
Any assistance would be greatly appreciated.
You can do a join using an array_contains condition:
import pyspark.sql.functions as F
result = (df1.join(df2, F.array_contains(df2.tags, df1.tag))
.select('item', 'key1', 'key2', 'tag')
.orderBy('item', 'key1', 'key2')
)
result.show()
+----+----+----+---+
|item|key1|key2|tag|
+----+----+----+---+
| 1| A1| B1| AB|
| 1| A1| B2| AB|
| 1| A2| B3| AB|
| 2| A1| B2| CD|
| 2| A2| B1| CD|
| 3| A1| B2| EF|
| 3| A2| B1| EF|
| 3| A2| B3| EF|
+----+----+----+---+
Alternatively, you can explode the tags array in df2 and then do an inner join on the tag column:
import pyspark.sql.functions as F
df = df1.join(
df2.select('key1', 'key2', F.explode('tags').alias('tag')),
'tag',
'inner'
)
df.show()
# +---+----+----+----+
# |tag|item|key1|key2|
# +---+----+----+----+
# | EF| 3| A1| B2|
# | EF| 3| A2| B1|
# | EF| 3| A2| B3|
# | AB| 1| A1| B1|
# | AB| 1| A1| B2|
# | AB| 1| A2| B3|
# | CD| 2| A1| B2|
# | CD| 2| A2| B1|
# +---+----+----+----+
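If you want the column order and row order from the expected output, reselect and sort the joined result:
df.select('item', 'key1', 'key2', 'tag').orderBy('item', 'key1', 'key2').show()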
I have a dataset which consists of two columns, C1 and C2. The columns have a many-to-many relationship.
What I would like to do is find for each C2 the value C1 which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...").\
map(lambda line: (line.split(",")[0],list(line.split(",")[1]) ) ).\
reduceByKey(lambda x , y : x+y )
What this does is gather, for each C1 value, all the C2 matches; the count of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(key, value_list[value1, value2, ...]) --> (value1, key), (value2, key), ...
How could this be done using spark? Any advice would be really helpful.
Thanks in advance!
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
         .count()
         .join(df, 'C1')
         .groupBy(F.col('C2').alias('Out1'))
         .agg(
             F.max(
                 F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
             ).alias('c')
         )
         .select('Out1', 'c.Out2', 'c.matches')
         .orderBy('Out1')
      )
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
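If you are on Spark 3.0 or later, where the max_by aggregate is available in Spark SQL, the "pick the C1 with the highest count" step can be written a bit more directly. A sketch under that assumption:
import pyspark.sql.functions as F
counts = df.groupBy('C1').agg(F.count('C2').alias('matches'))
df2 = (df.join(counts, 'C1')
         .groupBy(F.col('C2').alias('Out1'))
         .agg(F.expr('max_by(C1, matches)').alias('Out2'),
              F.max('matches').alias('matches'))
         .orderBy('Out1'))
df2.show()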
We can get the desired result easily using the DataFrame API.
from pyspark.sql import *
import pyspark.sql.functions as fun
from pyspark.sql.window import Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)
output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
.groupby(fun.col('C2').alias('out1')) \
.agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))
output.show()
# output
+----+----+-------+
|out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+
I have a table like below
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201015 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
Here, we can see that the following weeks are missing:
First, 201014 is missing.
Second, 201016 is missing.
Third, weeks 201020, 201021 and 201022 are missing.
My requirement is that whenever a week is missing, we need to show the count of the previous week.
In this case output should be :
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201014 36
A100 201015 43
A100 201016 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201020 63
A100 201021 63
A100 201022 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
How can I achieve this using Hive/PySpark?
Although this answer is in Scala, the Python version will look almost the same and can easily be converted.
Step 1:
Find the rows that are followed by missing week(s).
Sample Input:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//sample input
val input = sc.parallelize(List(("A100",201008,2), ("A100",201009,9),("A100",201014,4), ("A100",201016,45))).toDF("id","week","count")
scala> input.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201014| 4| //missing 4 rows
|A100|201016| 45| //missing 1 row
+----+------+-----+
To find them, we can use the lead() function on week and compute the difference between leadWeek and week. If that difference is greater than 1 (i.e. the diff column below is greater than 0), there are missing rows after that row.
val diffDF = input
.withColumn("leadWeek", lead($"week", 1).over(Window.partitionBy($"id").orderBy($"week"))) // partitioning by id & computing lead()
.withColumn("diff", ($"leadWeek" - $"week") -1) // finding difference between leadWeek & week
scala> diffDF.show
+----+------+-----+--------+----+
| id| week|count|leadWeek|diff|
+----+------+-----+--------+----+
|A100|201008| 2| 201009| 0| // diff -> 0 represents that no rows needs to be added
|A100|201009| 9| 201014| 4| // diff -> 4 represents 4 rows are to be added after this row.
|A100|201014| 4| 201016| 1| // diff -> 1 represents 1 row to be added after this row.
|A100|201016| 45| null|null|
+----+------+-----+--------+----+
Step 2:
If diff is >= 1: create and add as many rows (InputWithDiff, see the case class below) as specified by diff, incrementing the week value accordingly, and return the newly created rows along with the original row.
If diff is 0: no additional computation is required; return the original row as it is.
Convert diffDF to Dataset for ease of computation.
case class InputWithDiff(id: Option[String], week: Option[Int], count: Option[Int], leadWeek: Option[Int], diff: Option[Int])
val diffDS = diffDF.as[InputWithDiff]
val output = diffDS.flatMap(x => {
  val diff = x.diff.getOrElse(0)
  diff match {
    case n if n >= 1 => x :: (1 to diff).map(y => InputWithDiff(x.id, Some(x.week.get + y), x.count, x.leadWeek, x.diff)).toList // create and append new rows
    case _ => List(x) // return as it is
  }
}).drop("leadWeek", "diff").toDF // drop unnecessary columns & convert to DF
final output:
scala> output.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201010| 9|
|A100|201011| 9|
|A100|201012| 9|
|A100|201013| 9|
|A100|201014| 4|
|A100|201015| 4|
|A100|201016| 45|
+----+------+-----+
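For reference, the same lead-and-expand idea can be sketched in PySpark with sequence and explode (assuming Spark 2.4+ for sequence, that the gaps never cross a year boundary so consecutive weeks are consecutive integers, and that input_df is the sample input built above):
from pyspark.sql import functions as F, Window
w = Window.partitionBy('id').orderBy('week')
filled = (input_df
          .withColumn('leadWeek', F.lead('week', 1).over(w))
          # generate the current week plus every missing week before the next row
          .withColumn('week', F.explode(F.sequence(
              F.col('week'),
              F.coalesce(F.col('leadWeek') - 1, F.col('week')))))
          .drop('leadWeek'))
filled.show()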
PySpark solution
Sample Data
df = spark.createDataFrame([(1,201901,10),
(1,201903,9),
(1,201904,21),
(1,201906,42),
(1,201909,3),
(1,201912,56)
],['id','weeknum','val'])
df.show()
+---+-------+---+
| id|weeknum|val|
+---+-------+---+
| 1| 201901| 10|
| 1| 201903| 9|
| 1| 201904| 21|
| 1| 201906| 42|
| 1| 201909| 3|
| 1| 201912| 56|
+---+-------+---+
1) The basic idea is to create a combination of all ids and weeks (from the minimum to the maximum observed week) with a cross join.
from pyspark.sql.functions import min,max,sum,when
from pyspark.sql import Window
min_max_week = df.agg(min(df.weeknum),max(df.weeknum)).collect()
#Generate all weeks using range
all_weeks = spark.range(min_max_week[0][0],min_max_week[0][1]+1)
all_weeks = all_weeks.withColumnRenamed('id','weekno')
#all_weeks.show()
id_all_weeks = df.select(df.id).distinct().crossJoin(all_weeks).withColumnRenamed('id','aid')
#id_all_weeks.show()
2) Thereafter, left joining the original dataframe onto these combinations helps identify the missing values.
res = id_all_weeks.join(df,(df.id == id_all_weeks.aid) & (df.weeknum == id_all_weeks.weekno),'left')
res.show()
+---+------+----+-------+----+
|aid|weekno| id|weeknum| val|
+---+------+----+-------+----+
| 1|201911|null| null|null|
| 1|201905|null| null|null|
| 1|201903| 1| 201903| 9|
| 1|201904| 1| 201904| 21|
| 1|201901| 1| 201901| 10|
| 1|201906| 1| 201906| 42|
| 1|201908|null| null|null|
| 1|201910|null| null|null|
| 1|201912| 1| 201912| 56|
| 1|201907|null| null|null|
| 1|201902|null| null|null|
| 1|201909| 1| 201909| 3|
+---+------+----+-------+----+
3) Then, use a combination of window functions: sum to assign groups, and max to fill in the missing values once the groups are assigned.
w1 = Window.partitionBy(res.aid).orderBy(res.weekno)
groups = res.withColumn("grp",sum(when(res.id.isNull(),0).otherwise(1)).over(w1))
w2 = Window.partitionBy(groups.aid,groups.grp)
missing_values_filled = groups.withColumn('filled',max(groups.val).over(w2)) #select required columns as needed
missing_values_filled.show()
+---+------+----+-------+----+---+------+
|aid|weekno| id|weeknum| val|grp|filled|
+---+------+----+-------+----+---+------+
| 1|201901| 1| 201901| 10| 1| 10|
| 1|201902|null| null|null| 1| 10|
| 1|201903| 1| 201903| 9| 2| 9|
| 1|201904| 1| 201904| 21| 3| 21|
| 1|201905|null| null|null| 3| 21|
| 1|201906| 1| 201906| 42| 4| 42|
| 1|201907|null| null|null| 4| 42|
| 1|201908|null| null|null| 4| 42|
| 1|201909| 1| 201909| 3| 5| 3|
| 1|201910|null| null|null| 5| 3|
| 1|201911|null| null|null| 5| 3|
| 1|201912| 1| 201912| 56| 6| 56|
+---+------+----+-------+----+---+------+
Hive Query with the same logic as described above (assuming a table with all weeks can be created)
select id,weeknum,max(val) over(partition by id,grp) as val
from (select i.id
,w.weeknum
,t.val
,sum(case when t.id is null then 0 else 1 end) over(partition by i.id order by w.weeknum) as grp
from (select distinct id from tbl) i
cross join weeks_table w
left join tbl t on t.id = i.id and w.weeknum = t.weeknum
) t
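As an aside, the group-and-max trick in step 3 can also be replaced by a forward fill with last(..., ignorenulls=True) over a running window. A sketch reusing the res dataframe from step 2:
from pyspark.sql import functions as F, Window
w3 = (Window.partitionBy('aid').orderBy('weekno')
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))
# carry the last non-null val forward across the missing weeks
missing_values_filled = res.withColumn('filled', F.last('val', ignorenulls=True).over(w3))
missing_values_filled.show()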
I have a dataframe where I need to compare a few values and deduce a few things out of it.
For instance,
my DF
CITY DAY MONTH TAG RANGE VALUE RANK
A 1 01 A [50, 90] 55 1
A 2 02 B [30, 40] 34 3
A 1 03 A [05, 10] 15 20
A 1 04 B [50, 60] 11 10
A 1 05 B [50, 60] 54 4
I have to check, for every row, whether "VALUE" lies within "RANGE". Here, arr[0] is the lower limit and arr[1] is the upper limit.
I need to create a new DF such that,
NEW-DF
TAG Positive Negative
A 1 1
B 2 1
If the "value" lies between the given range and the rank < 5 then I would add it to "positive"
If the value doesnt lie in the given range , then it is a negative
If the value lies in the given range, but the rank > 5, then I would count it as negative
"Positive" and "Negative" is nothing but the count of the values fulfilling either conditions.
We can use element_at to get the lower and upper bounds of the range, compare them with the value in each row (along with the rank condition), and then perform a groupby on the tag with conditional sums:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
range_df = df.withColumn('in_range',
                         (F.element_at('range', 1).cast(IntegerType()) < F.col('value')) &
                         (F.col('value') < F.element_at('range', 2).cast(IntegerType())) &
                         (F.col('rank') < 5))
range_df.show()
grouped_df = range_df.groupby('tag').agg(
    F.sum(F.col('in_range').cast(IntegerType())).alias('total_positive'),
    F.sum((~F.col('in_range')).cast(IntegerType())).alias('total_negative'))
grouped_df.show()
Output:
+---+--------+-----+----+--------+
|tag| range|value|rank|in_range|
+---+--------+-----+----+--------+
| A|[50, 90]| 55| 1| true|
| B|[30, 40]| 34| 3| true|
| A|[05, 10]| 15| 20| false|
| B|[50, 60]| 11| 10| false|
| B|[50, 60]| 54| 4| true|
+---+--------+-----+----+--------+
+---+--------------+--------------+
|tag|total_positive|total_negative|
+---+--------------+--------------+
| B| 2| 1|
| A| 1| 1|
+---+--------------+--------------+
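Alternatively, the two conditional sums can be replaced by a pivot on the boolean flag, which yields the tag/Positive/Negative layout from the question directly. A sketch reusing range_df from above (the pivot values are listed explicitly so both columns always appear):
pivoted = (range_df
           .withColumn('outcome', F.when(F.col('in_range'), 'Positive').otherwise('Negative'))
           .groupby('tag')
           .pivot('outcome', ['Positive', 'Negative'])
           .count()
           .na.fill(0))
pivoted.show()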
You can use a UDF to parse the range first:
val df = Seq(("A","1","01","A","[50,90]","55","1")).toDF("city","day","month","tag","range","value","rank")
+----+---+-----+---+-------+-----+----+
|city|day|month|tag| range|value|rank|
+----+---+-----+---+-------+-----+----+
| A| 1| 01| A|[50,90]| 55| 1|
+----+---+-----+---+-------+-----+----+
def checkRange(range: String, rank: String, value: String): String = {
  val rangeProcess = range.dropRight(1).drop(1).split(",")
  if (rank.toInt > 5) {
    "negative"
  } else {
    // compare as integers, not strings, to avoid lexicographic surprises
    if (value.toInt > rangeProcess(0).toInt && value.toInt < rangeProcess(1).toInt) {
      "positive"
    } else {
      "negative"
    }
  }
}
val checkRangeUdf = udf(checkRange _)
df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).show()
+----+---+-----+---+-------+-----+----+--------+
|city|day|month|tag| range|value|rank| Result|
+----+---+-----+---+-------+-----+----+--------+
| A| 1| 01| A|[50,90]| 55| 1|positive|
+----+---+-----+---+-------+-----+----+--------+
val result = df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).groupBy("city","Result").count.show
+----+--------+-----+
|city| Result|count|
+----+--------+-----+
| A|positive| 1|
+----+--------+-----+