I have a dataframe like this after some operations:
df_new_1 = df_old.filter(df_old["col1"] >= df_old["col2"])
df_new_2 = df_old.filter(df_old["col1"] < df_old["col2"])
print(df_new_1.count(), df_new_2.count())
>> 10, 15
I can find the number of rows individually like above by calling count(). But how can I do this with a single PySpark SQL / aggregation operation, i.e. aggregating across all rows at once? I want to see the result like this:
Row(check1=10, check2=15)
Since you tagged pyspark-sql, you can do the following:
df_old.createOrReplaceTempView("df_table")
spark.sql("""
SELECT sum(int(col1 >= col2)) as check1
, sum(int(col1 < col2)) as check2
FROM df_table
""").collect()
Or use the API functions:
from pyspark.sql.functions import expr
df_old.agg(
expr("sum(int(col1 >= col2)) as check1"),
expr("sum(int(col1 < col2)) as check2")
).collect()
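If you prefer column expressions over SQL strings, an equivalent sketch using when/sum from the standard pyspark.sql.functions API (same logic as the expr version above):
from pyspark.sql import functions as F
df_old.agg(
    F.sum(F.when(F.col("col1") >= F.col("col2"), 1).otherwise(0)).alias("check1"),
    F.sum(F.when(F.col("col1") < F.col("col2"), 1).otherwise(0)).alias("check2")
).collect()
# For the counts above this would give [Row(check1=10, check2=15)]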
I'm trying to drop some rows in my dask dataframe with:
df.drop(df[(df.A <= 3) | (df.A > 1000)].index)
But this doesn't work and returns NotImplementedError: Drop currently only works for axis=1.
I really need help.
You can remove rows from a Pandas/Dask dataframe as follows:
df = df[condition]
In your case you might do something like the following:
df = df[(df.A > 3) & (df.A <= 1000)]
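For concreteness, a minimal runnable sketch with a toy dask dataframe, assuming the column is named A as in the question:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": [1, 2, 5, 500, 2000]})
ddf = dd.from_pandas(pdf, npartitions=2)

# keep only rows with 3 < A <= 1000 (the complement of the rows you wanted to drop)
ddf = ddf[(ddf.A > 3) & (ddf.A <= 1000)]
print(ddf.compute())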
I need to detect threshold values on timeseries with Pyspark.
On the example graph below, I want to detect (by storing the associated timestamp) each occurrence of the parameter ALT_STD rising above 5000 and then dropping back below 5000.
For this simple case I can run simple queries such as
t_start = df.select('timestamp')\
.filter(df.ALT_STD > 5000)\
.sort('timestamp')\
.first()
t_stop = df.select('timestamp')\
.filter((df.ALT_STD < 5000)\
& (df.timestamp > t_start.timestamp))\
.sort('timestamp')\
.first()
However, in some cases the event can be cyclic and I may have several curves (i.e. ALT_STD will rise above and fall below 5000 several times). Of course, with the queries above I will only be able to detect the first occurrence.
I guess I should use a window function with a UDF, but I can't find a working solution.
My guess is that the algorithm should be something like:
windowSpec = Window.partitionBy('flight_hash')\
.orderBy('timestamp')\
.rowsBetween(Window.currentRow, 1)
def detect_thresholds(x):
    if (x['ALT_STD'][current_row] < 5000) and (x['ALT_STD'][next_row] > 5000):
        return x['timestamp']  # Or maybe simply 1
    if (x['ALT_STD'][current_row] > 5000) and (x['ALT_STD'][next_row] < 5000):
        return x['timestamp']  # Or maybe simply 2
    else:
        return 0
import pyspark.sql.functions as F
detect_udf = F.udf(detect_thresholds, IntegerType())
df.withColumn('Result', detect_udf(F.struct('ALT_STD')).over(windowSpec)).show()
Is such an algorithm feasible in PySpark? How?
Post-scriptum:
As a side note, I have understood how to use a UDF on its own, and how to use built-in SQL window functions, but not how to combine a UDF AND a window.
e.g.:
# This will compute the mean (built-in function)
df.withColumn("Result", F.mean(df['ALT_STD']).over(windowSpec)).show()
# This will also work
divide_udf = F.udf(lambda x: x[0]/1000., DoubleType())
df.withColumn('result', divide_udf(F.struct('timestamp')))
No need for a udf here (and Python udfs cannot be used as window functions). Just use lead / lag with when:
from pyspark.sql.functions import col, lag, lead, when
result = (when((col('ALT_STD') < 5000) & (lead(col('ALT_STD'), 1) > 5000), 1)
.when((col('ALT_STD') > 5000) & (lead(col('ALT_STD'), 1) < 5000), 1)
.otherwise(0))
df.withColumn("result", result)
Thanks to user9569772's answer I found it out. His solution did not work as written because .lag() and .lead() are window functions, so they need to be applied with .over(windowSpec):
from pyspark.sql.functions import when
from pyspark.sql import functions as F
# Define conditions
det_start = (F.lag(F.col('ALT_STD')).over(windowSpec) < 100)\
& (F.lead(F.col('ALT_STD'), 0).over(windowSpec) >= 100)
det_end = (F.lag(F.col('ALT_STD'), 0).over(windowSpec) > 100)\
& (F.lead(F.col('ALT_STD')).over(windowSpec) < 100)
# Combine conditions with .when() and .otherwise()
result = (when(det_start, 1)\
.when(det_end, 2)\
.otherwise(0))
df.withColumn("phases", result).show()
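To actually store the associated timestamps (the original goal), one possible follow-up is to filter on the new column; a sketch, reusing the names above:
from pyspark.sql import functions as F

df_phases = df.withColumn("phases", result)
t_starts = [r.timestamp for r in df_phases.filter(F.col("phases") == 1).select("timestamp").collect()]
t_stops = [r.timestamp for r in df_phases.filter(F.col("phases") == 2).select("timestamp").collect()]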
I am running Spark 2.1 on Windows 10. I have fetched data from MySQL into Spark using JDBC, and the table looks like this:
x      y      z
--------------------
1      a      d1
Null   v      ed
5      Null   Null
7      s      Null
Null   bd     Null
I want to create a new Spark dataset with only the x and y columns from the above table, and I want to keep only those rows which do not have null in either of those 2 columns. My resultant table should look like this:
x    y
--------
1    a
7    s
The following is the code:
val load_DF = spark.read.format("jdbc").option("url", "jdbc:mysql://100.150.200.250:3306").option("dbtable", "schema.table_name").option("user", "uname1").option("password", "Pass1").load()
val filter_DF = load_DF.select($"x".isNotNull,$"y".isNotNull).rdd
// let's print the first 5 values of filter_DF
filter_DF.take(5)
res0: Array[org.apache.spark.sql.Row] = Array([true,true], [false,true], [true,false], [true,true], [false,true])
As shown, the above result doesn't give me the actual values; it returns Boolean values instead (true when the value is not null, false when it is null).
Try this:
val load_DF = spark.read.format("jdbc").option("url", "jdbc:mysql://100.150.200.250:3306").option("dbtable", "schema.table_name").option("user", "uname1").option("password", "Pass1").load()
Now;
load_DF.select($"x", $"y").filter("x is not null").filter("y is not null")
Spark provides DataFrameNaFunctions for exactly this purpose: dropping (or filling) null values.
In your example above you just need to call the following on the DataSet that you load:
val noNullValues = load_DF.na.drop("any", Seq("x", "y"))
This will drop records where a null occurs in either field x or y, but a null in z alone will not cause the row to be dropped. You can read up on DataFrameNaFunctions for further options to fill in data, or translate values if required.
Apply "any" in na.drop:
df = df.select("x", "y")
.na.drop("any", Seq("x", "y"))
You are simply applying a function (in this case isNotNull) to the values when you do a select; instead, you need to replace the select with a filter.
val filter_DF = load_DF.filter($"x".isNotNull && $"y".isNotNull)
or if you prefer:
val filter_DF = load_DF.filter($"x".isNotNull).filter($"y".isNotNull)
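For completeness, roughly the same filter on the Python API (a sketch; load_DF assumed to be loaded the same way via JDBC):
from pyspark.sql.functions import col

filter_DF = load_DF.select("x", "y").filter(col("x").isNotNull() & col("y").isNotNull())
filter_DF.show()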
I have a scenario where I need to join multiple tables and identify whether a date column plus another integer column is greater than another date column.
Select case when (manufacturedate + LeadTime < DueDate) then numericvalue ((DueDate - manufacturedate) + 1) else PartSource.EffLeadTime end
Is there a way to handle it in spark sql?
Thanks,
Ash
I tried with sqlContext; there is a date_add('date', integer). date_add() is Hive functionality and it works for the Cassandra context too.
cc.sql("select date_add(current_date(),1) from table").show
Thanks
Aravinth
Assuming you have a DataFrame with your data, you are using Scala and the "another integer" represents a number of days, one way to do it is the following:
import org.apache.spark.sql.functions._
val numericvalue = 1
val column = when(
datediff(col("DueDate"), col("manufacturedate")) > col("LeadTime"), lit(numericvalue)
).otherwise(col("PartSource.EffLeadTime"))
val result = df.withColumn("newVal", column)
The desired value will be in a new column called "newVal".
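If you are on the Python API rather than Scala, an equivalent sketch (column names as above; EffLeadTime is assumed to be the name of the joined PartSource column after the join):
from pyspark.sql import functions as F

numericvalue = 1
column = F.when(
    F.datediff(F.col("DueDate"), F.col("manufacturedate")) > F.col("LeadTime"), F.lit(numericvalue)
).otherwise(F.col("EffLeadTime"))
result = df.withColumn("newVal", column)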
I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is, for each combination of UID and UID2, check whether there is both a row with EventType = A and a row with EventType = B, then calculate the time difference between them and add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation, where I group the records by UID and UID2, then have only a small subset of rows to search to identify whether both EventTypes exist. I can't figure out a faster one, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a (UID, UID2) combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, I need to make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that I create a helper dataframe:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
df1
Finally, I take the difference and merge it back to df:
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
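Putting the three steps together on the sample data from the question, a runnable sketch (TimeDiff comes out as a timedelta; convert it to minutes at the end if you prefer the integer form shown in the question):
import pandas as pd
from io import StringIO

csv = """UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A"""
df = pd.read_csv(StringIO(csv))

# make Time a timedelta measured from midnight
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)

# pivot EventType into columns: one row per (UID, UID2) with the A and B times
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time

# difference B - A, merged back onto the original frame
out = df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
out['TimeDiff'] = out['TimeDiff'].dt.total_seconds() / 60   # minutes, NaN where B is missing
print(out)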