Suppose I have the following pyspark dataframe df:
id  date  var1  var2
1   1     NULL  2
1   2     b     3
2   1     a     NULL
2   2     a     1
I want the first non-missing observation for each var* column and, additionally, the value of date it comes from, i.e. the final result should look like:
id  var1  dt_var1  var2  dt_var2
1   b     2        2     1
2   a     1        1     2
Getting the values is straightforward using
df.orderBy(['id', 'date']).groupby('id').agg(
    *[F.first(x, ignorenulls=True).alias(x) for x in ['var1', 'var2']]
)
But I fail to see how I could get the respective dates. I could loop over the variables one by one, drop the missing values, and keep the first row, but that sounds like a poor solution that will not scale well, as it would require a separate dataframe for each variable.
I would prefer a solution that scales to many columns (var3, var4, ...).
You should not use groupBy if you want the first non-null value according to the date ordering: row order is not guaranteed after a groupBy, even if you called orderBy just before it.
You need to use window functions instead. To get the date associated with each var value you can use this trick with structs:
from pyspark.sql import Window, functions as F

w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df1 = df.select(
    "id",
    *[F.first(
          F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
          ignorenulls=True).over(w).alias(x)
      for x in ["var1", "var2"]]
).distinct().select("id", "var1.*", "var2.*")
df1.show()
#+---+----+-------+----+-------+
#| id|var1|dt_var1|var2|dt_var2|
#+---+----+-------+----+-------+
#| 1| b| 2| 2| 1|
#| 2| a| 1| 1| 2|
#+---+----+-------+----+-------+
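Since you want this to scale to many columns, the same select can be built programmatically; a minimal sketch, assuming all the value columns follow the var* naming pattern:
var_cols = [c for c in df.columns if c.startswith("var")]

df1 = df.select(
    "id",
    *[F.first(
          F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
          ignorenulls=True).over(w).alias(x)
      for x in var_cols]
).distinct().select("id", *[f"{x}.*" for x in var_cols])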
I have a dataframe like this one:
name  field1  field2  field3
a     4       10      8
b     5       0       11
c     10      7       4
d     0       1       5
I need to find top 3 names for each field.
Expected output:
top3-field1  top3-field2  top3-field3
c            a            b
b            c            a
a            d            d
So, I tried to sort the field(n) column values, limit to the top 3 results, and generate new columns using the withColumn method, like this:
df1 = df.orderBy(f.col("field1").desc(), "name") \
        .limit(3) \
        .withColumn("top3-field1", df["name"]) \
        .select("top3-field1", "field1")
With this approach I have to create a different dataframe for each field(n), and then join them to get the result described above. I feel there must be a better solution to this problem. I hope someone can give me suggestions.
You can first stack the df, then compute a descending rank, keep only ranks less than or equal to 3, and finally pivot the names.
Note that I use a helper function in my code to make the stack expression a little easier to type (its definition is in the full code below):
from pyspark.sql import functions as F, Window as W  # imports

w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name", stack_multiple_col(df, df.columns[1:]))
         .withColumn("Rnk", F.dense_rank().over(w))
         .where("Rnk<=3").groupBy("Rnk").pivot("col").agg(F.first("name")))
out.show()
+---+------+------+------+
|Rnk|field1|field2|field3|
+---+------+------+------+
| 1| c| a| b|
| 2| b| c| a|
| 3| a| d| d|
+---+------+------+------+
If you prefer not to use the helper function, you can write the same thing as:
w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name",
'stack(3,"field1",field1,"field2",field2,"field3",field3) as (col,values)')
.withColumn("Rnk",F.dense_rank().over(w))
.where("Rnk<=3").groupBy("Rnk").pivot("col").agg(F.first("name")))
Full code:
def stack_multiple_col(df, cols=None, output_columns=["col", "values"]):
    """Stacks multiple columns in a dataframe;
    uses all columns by default unless passed a list of column names."""
    cols = df.columns if cols is None else cols
    return (f"""stack({len(cols)},{','.join(map(','.join,
            zip([f'"{i}"' for i in cols], cols)))}) as ({','.join(output_columns)})""")
w = W.partitionBy("col").orderBy(F.desc("values"))
out = (df.selectExpr("name",stack_multiple_col(df,df.columns[1:]))
.withColumn("Rnk",F.dense_rank().over(w))
.where("Rnk<=3").groupBy("Rnk").pivot("col").agg(F.first("name")))
out.show()
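For reference, the expression the helper builds for this dataframe is exactly the one written out by hand in the non-helper version above:
print(stack_multiple_col(df, df.columns[1:]))
# stack(3,"field1",field1,"field2",field2,"field3",field3) as (col,values)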
I have the following spark dataframe -
+----+----+
|col1|col2|
+----+----+
| a| 1|
| b|null|
| c| 3|
+----+----+
Is there a way in spark API to detect if col2 contains, say, 3? Please note that the answer should be just one indicator value - yes/no - and not the set of records that have 3 in col2.
One way of finding whether a DataFrame column contains a particular value is the pyspark.sql.Column.contains API. You can wrap the filtered result in bool() to get a single True/False value.
For your example:
bool(df.filter(df.col2.contains(3)).collect())
# True

bool(df.filter(df.col2.contains(100)).collect())
# False
Source : https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Column.contains.html
By counting the number of values in col2 that are equal to 3:
import pyspark.sql.functions as f
df.agg(f.expr('sum(case when col2 = 3 then 1 else 0 end)')).first()[0] > 0
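Roughly the same check written with column functions instead of a SQL expression string (a sketch; it likewise reduces to a single Python bool for the whole dataframe):
import pyspark.sql.functions as f

has_three = df.agg(
    f.sum(f.when(f.col('col2') == 3, 1).otherwise(0)).alias('n')
).first()['n'] > 0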
You can use when as a conditional expression:
from pyspark.sql.functions import when, col

df.select(
    (when(col("col2") == '3', 'yes')
     .otherwise('no')
    ).alias('col3')
)
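Note this produces a yes/no value per row rather than the single indicator asked for; a sketch of one way to collapse it into one value is to aggregate the flag:
from pyspark.sql.functions import when, col, max as max_

single_flag = df.select(
    when(col("col2") == 3, True).otherwise(False).alias("hit")
).agg(max_("hit")).first()[0]
# True if any row has col2 == 3, otherwise False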
I have a dataframe with an id column to group by. For each id I want to pair its elements in the following way:
title id
sal 1
summer 1
fada 1
row 2
winter 2
gole 2
jack 3
noway 3
Output:
title id pair
sal 1 None
summer 1 summer,sal
fada 1 fada,summer
row 2 None
winter 2 winter, row
gole 2 gole,winter
jack 3 None
noway 3 noway,jack
As you can see in the output, within each group id we pair each element with the element above it. Since the first element of a group does not have a pair, I put None. I should also mention that this can be done in pandas with the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))
I can't emphasise enough that a Spark dataframe is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure that's what you want.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())

df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
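If your data does have a real ordering column (for example a load timestamp; the column name below is hypothetical), use it in the window instead of the fake ordering:
w = Window.partitionBy('id').orderBy('load_ts')  # 'load_ts' is a hypothetical ordering column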
I have data that looks like this:
userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06-04 03:04:00,18685891
3136afcb,2017-06-04 03:03:00,18382821
661212dd,2017-06-04 03:06:00,80831484
40e8a7c3,2017-06-04 03:12:00,18825769
I would like to add a new boolean column that is true if there are 2 or more userid values within a 5-minute window at the same location_point. My idea was to use the lag function to look up over a window partitioned by userid, with a range between the current timestamp and the next 5 minutes:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
days = lambda i: i * 60*5
windowSpec = W.partitionBy(col("userid")).orderBy(col("eventtime").cast("timestamp").cast("long")).rangeBetween(0, days(5))
last_location_point = F.lag(col("location_point"), 1).over(windowSpec)
visitCheck = (last_location_point == output.location_point)
output.withColumn("visit_check", visitCheck).select("userid", "eventtime", "location_point", "visit_check")
This code is giving me an analysis exception when I use the RangeBetween function:
AnalysisException: u'Window Frame RANGE BETWEEN CURRENT ROW AND 1500
FOLLOWING must match the required frame ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING;
Do you know any way to tackle this problem?
Given your data:
Let's add a column with a timestamp in seconds:
df = df.withColumn('timestamp', df.eventtime.astype('Timestamp').cast("long"))
df.show()
+--------+-------------------+--------------+----------+
| userid| eventtime|location_point| timestamp|
+--------+-------------------+--------------+----------+
|4e191908|2017-06-04 03:00:00| 18685891|1496545200|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560|
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890|
+--------+-------------------+--------------+----------+
Now, let's define a window function, with a partition by location_point, an order by timestamp, and a range between -300 seconds and the current row. We can count the number of elements in this window and put that count in a column named 'occurrences_in_5_min':
from pyspark.sql import Window, functions as F

w = Window.partitionBy('location_point').orderBy('timestamp').rangeBetween(-60*5, 0)
df = df.withColumn('occurrences_in_5_min', F.count('timestamp').over(w))
df.show()
+--------+-------------------+--------------+----------+--------------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|
+--------+-------------------+--------------+----------+--------------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1|
+--------+-------------------+--------------+----------+--------------------+
Now you can add the desired column, which is True if the number of occurrences is strictly greater than 1 in the last 5 minutes at a particular location:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

add_bool = udf(lambda col: True if col > 1 else False, BooleanType())
df = df.withColumn('already_occured', add_bool('occurrences_in_5_min'))
df.show()
+--------+-------------------+--------------+----------+--------------------+---------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|already_occured|
+--------+-------------------+--------------+----------+--------------------+---------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1| false|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1| false|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1| false|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1| false|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2| true|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1| false|
+--------+-------------------+--------------+----------+--------------------+---------------+
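As a side note, the same flag can be computed without a Python UDF by comparing the column directly, which Spark evaluates natively:
from pyspark.sql import functions as F

df = df.withColumn('already_occured', F.col('occurrences_in_5_min') > 1)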
rangeBetween just doesn't make sense for a non-aggregate function like lag. lag always takes a specific row, denoted by the offset argument, so specifying a frame is pointless.
To get a window over time series you can use window grouping with standard aggregates:
from pyspark.sql.functions import window, countDistinct
(df
.groupBy("location_point", window("eventtime", "5 minutes"))
.agg( countDistinct("userid")))
You can add more arguments to modify slide duration.
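For example, passing a third argument (the slide duration) to window gives sliding windows:
from pyspark.sql.functions import window, countDistinct

# 5-minute windows that start every minute
(df
    .groupBy("location_point", window("eventtime", "5 minutes", "1 minute"))
    .agg(countDistinct("userid")))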
You can try something similar with window functions if you partition by location_point:
from pyspark.sql.functions import approx_count_distinct

windowSpec = (W.partitionBy(col("location_point"))
              .orderBy(col("eventtime").cast("timestamp").cast("long"))
              .rangeBetween(0, days(5)))

# countDistinct is not supported over a window frame, so use approx_count_distinct
df.withColumn("id_count", approx_count_distinct("userid").over(windowSpec))
I am trying to get the max alphabet value from a dataframe as a whole. I am not interested in which row or column it came from; I am just interested in a single max value within the dataframe.
This is what it looks like:
id conditionName
1 C
2 b
3 A
4 A
5 A
expected result is:
+---+-------------+
| id|conditionName|
+---+-------------+
|  3|            A|
|  4|            A|
|  5|            A|
+---+-------------+
because 'A' is the first letter of the alphabet
df= df.withColumn("conditionName", col("conditionName").cast("String"))
.groupBy("id,conditionName").max("conditionName");
df.show(false);
Exception: "conditionName" is not a numeric column. Aggregation function can only be applied on a numeric column.;
I need the max alphabet character from the entire dataframe.
What should I use to get the desired result?
Thanks in advance!
You can sort your DataFrame by your string column, grab the first value and use it to filter your original data:
from pyspark.sql.functions import lower, desc, first
# we need lower() because ordering strings is case sensitive
first_letter = df.orderBy((lower(df["condition"]))) \
.groupBy() \
.agg(first("condition").alias("condition")) \
.collect()[0][0]
df.filter(df["condition"] == first_letter).show()
#+---+---------+
#| id|condition|
#+---+---------+
#| 3| A|
#| 4| A|
#| 5| A|
#+---+---------+
Or more elegantly using Spark SQL:
df.registerTempTable("table")
sqlContext.sql("SELECT *
FROM table
WHERE lower(condition) = (SELECT min(lower(condition))
FROM table)
")