Using pyspark, I'd like to be able to group a spark dataframe, sort the group, and then provide a row number. So
Group Date
A 2000
A 2002
A 2007
B 1999
B 2015
Would become
Group Date row_num
A 2000 0
A 2002 1
A 2007 2
B 1999 0
B 2015 1
Use window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
The accepted solution almost has it right. Here is the solution based on the output requested in the question:
df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])
| A|2000|
| A|2002|
| A|2007|
| B|1999|
| B|2015|
# accepted solution above
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
# accepted solution above output
| B|1999| 1|
| B|2015| 2|
| A|2000| 1|
| A|2002| 2|
| A|2007| 3|
As you can see, row_number starts from 1, not 0, while the question asked for row_num starting from 0. A simple change, subtracting 1, fixes this:
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show()
Output :
| B|1999| 0|
| B|2015| 1|
| A|2000| 0|
| A|2002| 1|
| A|2007| 2|
You can then sort by the "Group" column in whatever order you want. The accepted solution is nearly right; just remember that row_number begins at 1, not 0.
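For readers without a Spark session at hand, the partition-then-number logic can be sketched in plain Python. This only illustrates what the window expression computes; it is not Spark code, and the helper name `zero_based_row_numbers` is made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def zero_based_row_numbers(rows):
    """Return (group, date, row_num) tuples, numbering from 0 inside each group."""
    # Sorting by (group, date) plays the role of partitionBy("Group").orderBy("Date").
    ordered = sorted(rows, key=itemgetter(0, 1))
    numbered = []
    for group, members in groupby(ordered, key=itemgetter(0)):
        # enumerate starts at 0, which corresponds to row_number() - 1.
        for row_num, (_, date) in enumerate(members):
            numbered.append((group, date, row_num))
    return numbered


rows = [("A", 2002), ("B", 2015), ("A", 2000), ("A", 2007), ("B", 1999)]
result = zero_based_row_numbers(rows)
```

The input is deliberately shuffled to show that the numbering depends on the sort inside each group, not on the original row order.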
I have a dataset which consists of two columns, C1 and C2. The columns are in a many-to-many relation.
What I would like to do is find for each C2 the value C1 which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...").\
map(lambda line: (line.split(",")[0], [line.split(",")[1]])).\
reduceByKey(lambda x , y : x+y )
For each C1 value this gathers all the C2 matches; the length of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(Key ,Value_list[value1,value2,...]) -->(value1 , key ),(value2 , key)...
How could this be done using spark? Any advice would be really helpful.
Thanks in advance!
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
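import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
         .count()
         .join(df, 'C1')
         .groupBy(F.col('C2').alias('Out1'))
         # max over a struct compares 'matches' first, so the C1 with the most matches wins
         .agg(F.max(
             F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
         ).alias('c'))
         .select('Out1', 'c.Out2', 'c.matches')
         .orderBy('Out1'))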
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
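The same count-then-pick-the-maximum logic can be sketched in plain Python, which may make the role of the struct comparison easier to see. This is an illustration under the sample data from the question, not Spark code; the function name `best_c1_per_c2` is hypothetical.

```python
from collections import Counter


def best_c1_per_c2(pairs):
    """Map each C2 value to (C1, matches), where C1 has the most C2 associations overall."""
    # Count how many C2 values each C1 is associated with (like groupBy('C1').count()).
    matches = Counter(c1 for c1, _ in pairs)
    best = {}
    for c1, c2 in pairs:
        candidate = (matches[c1], c1)
        # Tuples compare element by element, like the struct in the answer:
        # matches first, then C1 as the tie-breaker.
        if c2 not in best or candidate > best[c2]:
            best[c2] = candidate
    return {c2: (c1, count) for c2, (count, c1) in best.items()}


pairs = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
result = best_c1_per_c2(pairs)
```

Putting the count first in the tuple is the design choice that matters: it makes "largest tuple" mean "most matches", with the C1 value only breaking ties.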
We can get the desired result easily using dataframe API.
from pyspark.sql import SparkSession
import pyspark.sql.functions as fun
from pyspark.sql.window import Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)
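output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
.groupby(fun.col("c2").alias("out1")) \
.agg(fun.max(fun.struct("matches", fun.col("c1").alias("out2"))).alias("best")) \
.select("out1", "best.out2", "best.matches")  # max over the struct keeps c1 tied to its own match count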
# output
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
I tried to use a window function to calculate the current value based on the previous value in a dynamic way.
rowID | value
1 | 5
2 | 7
3 | 6
If value > pre_value then value = pre_value
So in row 2, since 7 > 5, the value becomes 5.
The final result should be
rowID | value
1 | 5
2 | 5
3 | 5
However using lag().over(w) gave the result as
rowID | value
1 | 5
2 | 5
3 | 6
It compares the third row's value 6 against the original "7", not the new value "5".
Any suggestion how to achieve this?
| 1| 5|
| 2| 7|
| 3| 6|
| 4| 9|
| 5| 4|
| 6| 3|
Your required logic is too dynamic for window functions, so we have to go row by row updating the values. One solution is to apply a normal Python UDF to the collected list and then explode once the UDF has been applied. If you have relatively small data, this should be fine (Spark 2.4+ only, because of arrays_zip).
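from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType
def add_one(a):
    # walk the list in order; each value is capped by the already-updated previous value
    for i in range(1, len(a)):
        if a[i] > a[i-1]:
            a[i] = a[i-1]
    return a
udf1 = F.udf(add_one, ArrayType(IntegerType()))
# collect both columns into single-row arrays, fix up the values, then zip and explode back into rows
df.agg(F.collect_list("rowID").alias("rowID"), F.collect_list("value").alias("value"))\
.withColumn("value", udf1("value"))\
.withColumn("zipped", F.explode(F.arrays_zip("rowID","value"))).select("zipped.*").show()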
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
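The row-by-row rule can be checked without Spark. The sketch below uses the six sample values shown above and a hypothetical helper, `cap_by_previous`. It also suggests that, for this particular rule, the result coincides with a running minimum. That is an observation about this rule only, and it has not been tested against Spark here.

```python
from itertools import accumulate


def cap_by_previous(values):
    """Replace any value larger than the (already updated) previous value with that value."""
    out = list(values)
    for i in range(1, len(out)):
        if out[i] > out[i - 1]:
            out[i] = out[i - 1]
    return out


values = [5, 7, 6, 9, 4, 3]
sequential = cap_by_previous(values)
# For comparison: a running minimum over the same values.
running_min = list(accumulate(values, min))
```

If the equivalence holds for your real rule as well, a cumulative minimum over an ordered window may be worth trying before resorting to a UDF.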
Better yet, as you have groups of 5000, using a Pandas vectorized UDF (grouped map) should help a lot with processing. You then do not have to collect_list 5000 integers and explode, or use pivot. I think this should be the optimal solution. Grouped map Pandas UDFs are available in Spark 2.3+.
The groupby below is empty, but you can add your grouping column to it.
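from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def grouped_map(df1):
    # df1 is a pandas DataFrame holding one group
    for i in range(1, len(df1)):
        if df1.loc[i, 'value'] > df1.loc[i-1, 'value']:
            df1.loc[i, 'value'] = df1.loc[i-1, 'value']
    return df1
df.groupby().apply(grouped_map).show()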
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
I have the following sample dataframe
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
and I want to explode the values in each row and associate alternating 1-0 values in the generated rows. This way I can identify the start/end entries in each row.
I am able to achieve the desired result this way
import pyspark.sql.functions as fn
from pyspark.sql.window import Window
# ordering by a constant literal: all rows share one sort key in a single partition
w = Window.orderBy(fn.lit('A'))
df = (df.withColumn('start_end', fn.array('start', 'end'))
.withColumn('date', fn.explode('start_end'))
.withColumn('row_num', fn.row_number().over(w)))
df = (df.withColumn('is_start', fn.when(fn.col('row_num')%2 == 0, 0).otherwise(1))
.select('date', 'is_start'))
which gives
| date | is_start |
| start | 1 |
| end | 0 |
| start1 | 1 |
| end1 | 0 |
but it seems overly complicated for such a simple task.
Is there any better/cleaner way without using UDFs?
You can use pyspark.sql.functions.posexplode along with pyspark.sql.functions.array.
First create an array out of your start and end columns, then explode this with the position:
from pyspark.sql.functions import array, posexplode
df.select(posexplode(array("end", "start")).alias("is_start", "date")).show()
#|is_start| date|
#| 0| end|
#| 1| start|
#| 0| end1|
#| 1|start1|
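The reason "end" is listed before "start" in the array is that the position doubles as the flag: position 0 goes to the end value and position 1 to the start value. A plain-Python sketch of that idea follows. It is not Spark code, and `posexplode_rows` is a made-up name.

```python
def posexplode_rows(rows):
    """Mimic posexplode(array(end, start)): emit (position, value) per array element."""
    out = []
    for start, end in rows:
        # "end" is placed first so it gets position 0 and "start" gets position 1.
        for pos, value in enumerate([end, start]):
            out.append((pos, value))
    return out


rows = [("start", "end"), ("start1", "end1")]
result = posexplode_rows(rows)
```

If the start row should come first in the output, the array order can be swapped and the flag computed as 1 minus the position instead.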
You can try union:
import pyspark.sql.functions as F
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
df = df.withColumn('startv', F.lit(1))
df = df.withColumn('endv', F.lit(0))
df = df.select(['start', 'startv']).union(df.select(['end', 'endv']))
| start|startv|
| start| 1|
|start1| 1|
| end| 0|
| end1| 0|
You can rename the columns and re-order the rows starting here.
I had a similar situation in my use case. I had a huge dataset (~50 GB), and any self join or heavy transformation resulted in high memory use and unstable execution.
I went one level down and used flatMap on the underlying RDD. This is a map-side transformation, so it is cost-effective in terms of shuffle, CPU and memory.
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
| start| end|
| start| end|
final_df = df.rdd.flatMap(lambda row: [(row.start, 1), (row.end, 0)]).toDF(['date', 'is_start'])
| date|is_start|
| start| 1|
| end| 0|
|start1| 1|
| end1| 0|
First things first, hope I am formatting my question correctly.
I have this dataframe:
df = sc.parallelize([
('1112', 1, 0, 1, '2018-05-01'),
('1111', 1, 1, 1, '2018-05-01'),
('1111', 1, 3, 2, '2018-05-04'),
('1111', 1, 1, 2, '2018-05-05'),
('1111', 1, 1, 2, '2018-05-06'),
]).toDF(["customer_id", "buy_count", "date_difference", "expected_answer", "date"]).cache()
|customer_id|buy_count|date_difference|expected_answer| date|
| 1111| 1| 1| 1|2018-05-01|
| 1111| 1| 3| 2|2018-05-04|
| 1111| 1| 1| 2|2018-05-05|
| 1111| 1| 1| 2|2018-05-06|
| 1112| 1| 0| 1|2018-05-01|
I want to create the "expected_answer" column:
If a customer hasn't bought for 3 days or more (date_difference >= 3), I want to increase his buy_count by 1. Every purchase after that needs to carry the new buy_count, unless he again doesn't buy for 3 days or more, in which case buy_count increases again.
Here is my code and how far I have gotten with it. The problem seems to be that Spark does not update the value in place but creates a new column, so lag() still reads the original buy_count. Is there a way to get past this? I also tried with Hive, with exactly the same results.
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import when
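windowSpec = func.lag(df['buy_count']).\
over(Window.partitionBy(df['customer_id']).orderBy(df['date'].asc()))
df.withColumn('buy_count', \
when(df['date_difference'] >=3, windowSpec +1).when(windowSpec.isNull(), 1)\
.otherwise(windowSpec)).show()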
|customer_id|buy_count|date_difference|expected_answer| date|
| 1112| 1| 0| 1|2018-05-01|
| 1111| 1| 1| 1|2018-05-01|
| 1111| 2| 3| 2|2018-05-04|
| 1111| 1| 1| 2|2018-05-05|
| 1111| 1| 1| 2|2018-05-06|
How can I get the expected result? Thanks in advance.
Figured it out at last. Thanks everyone for pointing out similar cases.
I was under the impression that SUM() OVER (PARTITION BY ...) would sum over the whole partition. With an ORDER BY in the window it instead sums everything up to and including the current row, which is exactly the running total I needed. I was able to solve my problem with a very simple SQL:
SELECT SUM(CASE WHEN(date_difference>=3) THEN 1 ELSE 0 END) OVER (PARTITION BY customer_id ORDER BY date)
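To see what that running sum produces on the sample data, here is a plain-Python sketch of the same per-customer cumulative count. It is not Spark or SQL, and `running_gap_count` is a hypothetical helper. The sum itself starts at 0; adding the base buy_count of 1 to reach the question's expected_answer column is an assumption made here.

```python
from collections import defaultdict


def running_gap_count(rows):
    """rows: (customer_id, date, date_difference). Returns {(customer_id, date): count}."""
    totals = defaultdict(int)
    result = {}
    # ISO dates sort correctly as strings, so sorting the tuples mirrors
    # PARTITION BY customer_id ORDER BY date.
    for customer_id, date, diff in sorted(rows):
        totals[customer_id] += 1 if diff >= 3 else 0
        result[(customer_id, date)] = totals[customer_id]
    return result


rows = [
    ("1112", "2018-05-01", 0),
    ("1111", "2018-05-01", 1),
    ("1111", "2018-05-04", 3),
    ("1111", "2018-05-05", 1),
    ("1111", "2018-05-06", 1),
]
counts = running_gap_count(rows)
# Assumption: add the base buy_count of 1 to line up with expected_answer.
expected = {key: 1 + value for key, value in counts.items()}
```

Because the total is carried forward row by row, every purchase after a gap keeps the increased count until the next gap, which is the behaviour lag() alone could not provide.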