I have a Spark DataFrame with the columns hour (hour of day), locationID and frequency.
Frequency is how many times a locationID appears in that hour of day.
+----+----------+---------+
|hour|locationID|frequency|
+----+----------+---------+
|   0|         1|       20|
|   0|         2|       11|
|   0|         9|        6|
|   1|         3|       32|
|   1|         1|       22|
|   1|         5|        4|
+----+----------+---------+
I want to take the 2 most frequent locationIDs per hour.
This can be done with a row_number window function. The window groups by hour and orders the frequency in descending order. Thereafter, filter for the top 2 rows.
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Number the rows within each hour, highest frequency first.
w = Window.partitionBy(df.hour).orderBy(df.frequency.desc())
rnum_df = df.withColumn('rnum', row_number().over(w))
# Keep only the two top-ranked rows per hour.
rnum_df.filter(rnum_df.rnum <= 2).show()
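To make the window logic concrete, here is a minimal plain-Python sketch of what row_number over that window computes, using the sample rows from the question. It is only an illustration of the semantics, not Spark code; the function name top_n_per_hour is made up for this example.

```python
from itertools import groupby
from operator import itemgetter

# (hour, locationID, frequency) rows copied from the question's sample data
rows = [(0, 1, 20), (0, 2, 11), (0, 9, 6), (1, 3, 32), (1, 1, 22), (1, 5, 4)]

def top_n_per_hour(rows, n=2):
    """Return the n highest-frequency rows per hour, with their row number appended."""
    # Equivalent of partitionBy(hour).orderBy(frequency.desc())
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    for _hour, group in groupby(ordered, key=itemgetter(0)):
        # enumerate plays the role of row_number(): 1, 2, 3, ... within each hour
        for rnum, row in enumerate(group, start=1):
            if rnum <= n:
                result.append(row + (rnum,))
    return result

for row in top_n_per_hour(rows):
    print(row)
```

Like row_number, this picks an arbitrary winner when two locations tie on frequency; if ties should all be kept, a rank-style numbering would be needed instead.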
I need to change how the available value is used in the billable and non-billable utilization. Earlier it was a default value; now the value is dynamic.
I have a Billable column with the values 'Yes' and 'No'.
If the value is 'Yes', the row is summed and the result goes into a new column, 'Billable Utilization':
Billing_utilization = df[Billing_utilization] * sum / available * 100
If the value is 'No', the row is summed and the result goes into a new column, 'Non-Billable Utilization':
Non-Billing_utilization = df[Non-Billing_utilization] * sum / available1 * 100
Data:
| Employee Name | Java | Python | .Net | React | Billable |
| Priya | 10 | | 5 | | Yes |
| Priya | | 10 | | 5 | No |
| Krithi | | 10 | 20 | | No |
Output
Priya is both billable and non-billable, so her name appears in two rows. I need to merge these into a single row per Employee Name, so the expected output is:
| Employee Name | Java | Python | .Net | React | Total | Billing | Non-Billing |
| Priya | 10 | 10 | 5 | 5 | 30 | 8.928571429 | 8.928571429 |
| Krithi | 10 | 20 | | | 30 | | 17.85714286 |
df['Billable Status'] = np.where(df['Billable Status'] == 'Billable', 'Billable Utilization', 'Non Billable Utilization')
df2 = (df.groupby(['Employee Name', 'Billable Status'])[list_column].sum().sum(axis=1).unstack().div(available2).mul(100)).round(2)
df = df1.join(df2).reset_index()
# Round the column value
df['Total'] = df['Total'].round(2)
# df = df.round(2)
Try:
cols = df.select_dtypes('number').columns.tolist()
# Total hours per employee, repeated on each of that employee's rows
df['Total'] = df.groupby('Employee Name')[cols].transform('sum').sum(axis=1)
# Share of the total that sits on billable / non-billable rows
df['Billing'] = df.mask(df['Billable'] == 'No')[cols].sum(axis=1) / df['Total']
df['Non-Billing'] = df.mask(df['Billable'] == 'Yes')[cols].sum(axis=1) / df['Total']
aggfuncs = dict(zip(cols, ['sum']*len(cols)))
aggfuncs.update({'Total': 'first', 'Billing': 'sum', 'Non-Billing': 'sum'})
# Collapse to one row per employee
out = df.pivot_table(list(aggfuncs), 'Employee Name', aggfunc=aggfuncs,
                     sort=False, fill_value=0)[list(aggfuncs)].reset_index()
Output:
>>> out
Employee Name Java Python .Net React Total Billing Non-Billing
0 Priya 10 10 5 5 30 0.5 0.5
1 Krithi 0 10 20 0 30 0.0 1.0
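Note that the output above gives each category's share of the employee's total, while the question asks for sum / available * 100. Below is a rough pandas sketch of that variant. It assumes a single available value of 168, which is only inferred from the expected numbers in the question (15 / 168 * 100 is about 8.93); AVAILABLE_HOURS and the helper column hours are names made up for this example, and a per-employee available value would need a different division step.

```python
import pandas as pd

# Assumption: inferred from the question's expected output, not stated by the asker
AVAILABLE_HOURS = 168

df = pd.DataFrame({
    "Employee Name": ["Priya", "Priya", "Krithi"],
    "Java": [10, None, None],
    "Python": [None, 10, 10],
    ".Net": [5, None, 20],
    "React": [None, 5, None],
    "Billable": ["Yes", "No", "No"],
})
skills = ["Java", "Python", ".Net", "React"]

# Row-wise sum of hours (empty cells count as 0)
df["hours"] = df[skills].sum(axis=1)

# Hours per employee split by Billable, then converted to a percentage
util = (
    df.groupby(["Employee Name", "Billable"])["hours"].sum()
      .unstack(fill_value=0)
      .reindex(columns=["Yes", "No"], fill_value=0)
      .rename(columns={"Yes": "Billing", "No": "Non-Billing"})
      .div(AVAILABLE_HOURS).mul(100).round(2)
)

# One row per employee with summed skill columns and the overall total
out = df.groupby("Employee Name", sort=False)[skills].sum()
out["Total"] = out.sum(axis=1)
out = out.join(util).reset_index()
print(out)
```

With the sample data this gives Priya 8.93 for both columns and Krithi 0 and 17.86, which lines up with the percentages in the expected output.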
Let's say I have the following table with money spent per day (timestamp)
timestamp| spent
0 | 0
1 | 0
2 | 1
3 | 4
4 | 0
5 | 0
6 | 1
7 | 3
The result I'm looking for is a table with added columns for the cumulative money spent in the last "n" days, for example the last 2 days and the last 5 days, resulting in something like this:
timestamp | spent | spent-2d |spent-5d | ....
0 | 0 | null | null | ...
1 | 0 | 0 | null | ...
2 | 1 | 1 | null | ...
3 | 4 | 5 | null | ...
4 | 0 | 4 | 5 | ...
5 | 0 | 0 | 5 | ...
6 | 1 | 1 | 6 | ...
7 | 3 | 4 | 8 | ....
One possible solution is to add lagged columns and then sum them, but for, say, 180 days I would need to add 180 columns, and I want to do this for not just one but several columns in the dataframe. For example, for 100-500 columns I want the lagged sum over 1, 2, 5, 7, 15, 30, 90 and 180 days. So adding 180*500 columns seems to be a bad idea.
Any other ideas for doing this efficiently?
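The Window rangeBetween method can be used. Example for the 5-day column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}
import spark.implicits._  // for the $"..." column syntax, assuming a SparkSession named spark

// Frame covering the current timestamp and the 4 before it
val lastFiveDaysWindow = Window
  .orderBy("timestamp")
  .rangeBetween(Window.currentRow - 4, Window.currentRow)
df
  .withColumn("spent-5d",
    when(
      $"timestamp" >= 4,  // stays null until 5 days of history exist
      sum("spent").over(lastFiveDaysWindow)
    )
  )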
Note: this is only suitable for small DataFrames, because Spark emits the following warning:
No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
For bigger DataFrames, an inner join can be used, as in the answer to this question:
Count of all element less than the value in a row
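For reference, here is a small plain-Python sketch of what these trailing sums compute, using the numbers from the question. It is not Spark code; it assumes one row per day with consecutive timestamps, and trailing_sums is a name invented for this example. It uses prefix sums, so every additional window size only costs one subtraction per row instead of another set of lagged columns.

```python
from itertools import accumulate

def trailing_sums(values, window):
    """Sum of the current value and the (window - 1) values before it.

    Returns None where fewer than `window` values are available, matching
    the nulls in the question's expected table.
    """
    # prefix[k] is the sum of the first k values
    prefix = [0] + list(accumulate(values))
    return [
        prefix[i + 1] - prefix[i + 1 - window] if i + 1 >= window else None
        for i in range(len(values))
    ]

spent = [0, 0, 1, 4, 0, 0, 1, 3]
spent_2d = trailing_sums(spent, 2)
spent_5d = trailing_sums(spent, 5)
print(spent_2d)
print(spent_5d)
```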
I have a dataframe which contains some products, a date and a value. The dates have gaps of different lengths between recorded values, and I want to fill these gaps so that there is a value for every hour from the first time the product was seen to the last. If there is no record for an hour, I want to use the latest preceding value.
So, I have a dataframe like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
I want to create a new dataframe that looks like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 1 | 2020-03-12T02:00:00.000+0000 | 2 |
| 1 | 2020-03-12T03:00:00.000+0000 | 2 |
| 1 | 2020-03-12T04:00:00.000+0000 | 2 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T02:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
My code so far:
def generate_date_series(start, stop):
    start = datetime.strptime(start, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    stop = datetime.strptime(stop, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    return [start + datetime.timedelta(hours=x) for x in range(0, (stop-start).hours + 1)]
spark.udf.register("generate_date_series", generate_date_series, ArrayType(TimestampType()))
df = df.withColumn("max", max(col("Date")).over(Window.partitionBy("ProductId"))) \
    .withColumn("min", min(col("Date")).over(Window.partitionBy("ProductId"))) \
    .withColumn("Dato", explode(generate_date_series(col("min"), col("max"))) \
        .over(Window.partitionBy("ProductId").orderBy(col("Dato").desc())))
window_over_ids = (Window.partitionBy("ProductId").rangeBetween(Window.unboundedPreceding, -1).orderBy("Date"))
df = df.withColumn("Value", last("Value", ignorenulls=True).over(window_over_ids))
Error:
TypeError: strptime() argument 1 must be str, not Column
So the first question is obviously how do I create and call the udf correctly so I don't run into the above error.
The second question is how do I complete the task, such that I get my desired dataframe?
After some searching and experimenting I found a solution. I defined a UDF that returns the hourly timestamps between two dates, and then I do a forward fill.
I fixed the issue with the following code:
from datetime import timedelta
from pyspark.sql import Window
from pyspark.sql.functions import col, explode, lag, last, lit, udf
from pyspark.sql.types import ArrayType, TimestampType

def missing_hours(t1, t2):
    # Hourly timestamps strictly between t1 and t2; both endpoints already have rows
    return [t1 + timedelta(hours=x) for x in range(1, int((t2 - t1).total_seconds() / 3600))]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))
window = Window.partitionBy("ProductId").orderBy("Date")
# One row with a null Value for every hour missing between consecutive records
df_missing = df.withColumn("prev_timestamp", lag(col("Date"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("Date", explode(missing_hours_udf(col("prev_timestamp"), col("Date")))) \
    .withColumn("Value", lit(None)) \
    .drop("prev_timestamp")
df = df.union(df_missing)
window = Window.partitionBy("ProductId").orderBy("Date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
# define the forward-filled column
filled_values_column = last(df['Value'], ignorenulls=True).over(window)
# do the fill
df = df.withColumn('Value', filled_values_column)
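The same two steps (generate the missing hours, then carry the last value forward) can be checked on a single product without Spark. This is a plain-Python sketch under the assumption that the records for one product are already sorted by timestamp; fill_hourly is a name made up for this example.

```python
from datetime import datetime, timedelta

def missing_hours(t1, t2):
    """Hourly timestamps strictly between t1 and t2."""
    gap = int((t2 - t1).total_seconds() // 3600)
    return [t1 + timedelta(hours=h) for h in range(1, gap)]

def fill_hourly(records):
    """records: list of (timestamp, value) for ONE product, sorted by timestamp.

    Returns the records with every missing hour inserted, each carrying the
    most recent earlier value (a forward fill).
    """
    filled = []
    for ts, value in records:
        if filled:
            prev_ts, prev_value = filled[-1]
            filled.extend([(m, prev_value) for m in missing_hours(prev_ts, ts)])
        filled.append((ts, value))
    return filled

# Product 1 from the question: records at 00:00, 01:00 and 05:00
product_1 = [
    (datetime(2020, 3, 12, 0), 4),
    (datetime(2020, 3, 12, 1), 2),
    (datetime(2020, 3, 12, 5), 4),
]
for ts, value in fill_hourly(product_1):
    print(ts.isoformat(), value)
```

The hours 02:00 to 04:00 come out with the value 2, as in the desired table for product 1.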
I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| | 10 | 1 | 2 |
| 2 | 20 | 2 | 1 |
| | 50 | 1 | 1 |
the result should look like this:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| 2 | 20 | 2 | 1 |
I got this far:
df.groupby(level=[0,1]).sum().sort_values(['Index1','Sort_In_descending_order'],ascending=False).groupby('Index1').head(2)
However, .head(2) picks 2 elements from each group regardless of the number in the column "How_manyRows_toChoose".
A piece of code would be great!
Thank you!
Use a lambda function in GroupBy.apply with head, and add the parameter group_keys=False to avoid duplicated index values:
#original code
df = (df.groupby(level=[0,1])
        .sum()
        .sort_values(['Index1','Sort_In_descending_order'],ascending=False))
# take as many rows per group as the group's How_manyRows_toChoose value says
df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print (df)
               Sort_In_descending_order  How_manyRows_toChoose
Index1 Index2
1      20                             3                      2
       40                             2                      2
2      20                             2                      1
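As a possible alternative that avoids apply, the position of each row within its group can be compared with How_manyRows_toChoose directly. This is a sketch on data rebuilt from the question's example, assuming the count is the same on every row of a group; the variable names ordered and rank_in_group are made up here.

```python
import pandas as pd

# Sample data rebuilt from the question, deliberately unsorted
df = pd.DataFrame({
    "Index1": [1, 1, 1, 2, 2],
    "Index2": [10, 40, 20, 50, 20],
    "Sort_In_descending_order": [1, 2, 3, 1, 2],
    "How_manyRows_toChoose": [2, 2, 2, 1, 1],
}).set_index(["Index1", "Index2"])

# Sort each Index1 group by the sort column, largest first
ordered = (
    df.reset_index()
      .sort_values(["Index1", "Sort_In_descending_order"], ascending=[True, False])
      .set_index(["Index1", "Index2"])
)

# 0-based position of each row inside its Index1 group
rank_in_group = ordered.groupby(level="Index1").cumcount()

# Keep a row only while its position is below the group's requested count
result = ordered[rank_in_group < ordered["How_manyRows_toChoose"]]
print(result)
```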
I know this is a very specific problem and it is not usual to post this kind of question on stackoverflow, but I am in the strange situation of having an idea of a naive algorithm that would solve my issue, but not being able to implement it. Hence my question.
I have a data frame
|user_id| action | day | week |
------------------------------
| d25as | AB | 2 | 1 |
| d25as | AB | 3 | 2 |
| d25as | AB | 5 | 1 |
| m3562 | AB | 1 | 3 |
| m3562 | AB | 7 | 1 |
| m3562 | AB | 9 | 1 |
| ha42a | AB | 3 | 2 |
| ha42a | AB | 4 | 3 |
| ha42a | AB | 5 | 1 |
I want to create a dataframe with users that are seen at least 3 days a week for at least 3 weeks a month. The "day" column goes from 1 to 31 and the "week" column goes from 1 to 4.
The way I thought about doing it is:
1. Split the dataframe into 4 dataframes, one for each week.
2. For every week dataframe, count the days seen per user.
3. Count for every user how many weeks with >= 3 days they were seen.
4. Only add to the new df the users seen for >= 3 such weeks.
Now I need to do this in Spark and in a way that scales, and I have no idea how to implement it. Also, if you have a better idea for an algorithm than my naive approach, that would be really helpful.
I suggest using groupBy twice, filtering with where after each aggregation:
# days seen per user and week, keeping only weeks with at least 3 days
df.groupBy('user_id', 'week')\
  .agg(countDistinct('day').alias('days_per_week'))\
  .where('days_per_week >= 3')\
  .groupBy('user_id')\
  .agg(count('week').alias('weeks_per_user'))\
  .where('weeks_per_user >= 3')
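The two aggregation stages can be illustrated on a small in-memory list. This is a plain-Python sketch of the logic only, not Spark code; regular_users and the sample rows are made up for this example.

```python
from collections import defaultdict

def regular_users(rows, min_days=3, min_weeks=3):
    """rows: iterable of (user_id, day, week).

    Returns the users seen on at least `min_days` distinct days in at least
    `min_weeks` weeks.
    """
    # Stage 1: distinct days per (user, week), like countDistinct('day')
    days_seen = defaultdict(set)
    for user_id, day, week in rows:
        days_seen[(user_id, week)].add(day)

    # Stage 2: number of qualifying weeks per user
    weeks_ok = defaultdict(int)
    for (user_id, _week), days in days_seen.items():
        if len(days) >= min_days:
            weeks_ok[user_id] += 1

    return sorted(u for u, n in weeks_ok.items() if n >= min_weeks)

# Hypothetical data: "a" qualifies, "b" has only two good weeks,
# "c" has three weeks but one of them repeats the same day three times
rows = (
    [("a", d, 1) for d in (1, 2, 3)] + [("a", d, 2) for d in (8, 9, 10)] + [("a", d, 3) for d in (15, 16, 17)]
    + [("b", d, 1) for d in (1, 2, 3)] + [("b", d, 2) for d in (8, 9, 10)]
    + [("c", d, 1) for d in (1, 2, 3)] + [("c", d, 2) for d in (8, 9, 10)] + [("c", 15, 3)] * 3
)
print(regular_users(rows))
```

Counting distinct days matters if the same user can have several rows for the same day, as user "c" shows.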
@eakotelnikov is correct.
But if anyone is facing the error
NameError: name 'countDistinct' is not defined
then please run the statement below before executing eakotelnikov's solution:
from pyspark.sql.functions import count, countDistinct
Adding another solution for this problem, using Spark SQL:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
tdf.createOrReplaceTempView("tbl")
outdf = spark.sql("""
    select user_id, count(*) as weeks_per_user
    from (
        select user_id, week, count(*) as days_per_week
        from tbl
        group by user_id, week
        having count(*) >= 3
    ) x
    group by user_id
    having count(*) >= 3
""")
outdf.show()