I ran the script below and I don't understand why I got the xyz1 column shown in the image. For example, why is the first row of xyz1 equal to 0? According to the window function, its corresponding group should be the first two rows, so why does F.count(F.col("xyz")).over(w) return 0 here?
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
list=([1,5,4],
[1,5,None],
[1,5,1],
[1,5,4],
[2,5,1],
[2,5,2],
[2,5,None],
[2,5,None],
[2,5,4])
df=spark.createDataFrame(list,['I_id','p_id','xyz'])
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz"))
df.withColumn("xyz1",F.count(F.col("xyz")).over(w)).show()
Note that count only counts non-null items, and that the partition itself is defined only by the partitionBy clause, not by the orderBy clause; the orderBy clause does, however, change the window frame.
When you specify an ordering column, the default window range is (according to the docs)
(rangeFrame, unboundedPreceding, currentRow)
So your window definition is actually
w = (Window().partitionBy("I_id", "p_id")
             .orderBy(F.col("xyz"))
             .rangeBetween(Window.unboundedPreceding, Window.currentRow))
And so the window only includes the rows from the lowest xyz value up to the xyz value of the current row. Because nulls sort first under the ascending ordering, the frame for the first row contains only the null rows, i.e. the first two rows of the dataframe, and since count ignores nulls the result is zero.
For the row where xyz = 2, the frame includes everything from the start of the partition up to xyz = 2, i.e. the first four rows. The only non-null values there are 1 and 2, which is why you got a count of 2.
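If the goal was instead to count the non-null xyz values over the whole partition, one option (a minimal sketch reusing the question's df and imports) is to drop the orderBy, or to keep it and explicitly widen the frame to the entire partition:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Option 1: omit orderBy, so the frame defaults to the whole partition
w_full = Window.partitionBy("I_id", "p_id")
df.withColumn("xyz1", F.count(F.col("xyz")).over(w_full)).show()

# Option 2: keep the ordering but explicitly widen the frame
w_ordered = (Window.partitionBy("I_id", "p_id")
                   .orderBy(F.col("xyz"))
                   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("xyz1", F.count(F.col("xyz")).over(w_ordered)).show()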
I want to filter by date on my pyspark dataframe.
I have a dataframe like this:
+------+---------+------+-------------------+-------------------+----------+
|amount|cost_type|place2| min_ts| max_ts| ds|
+------+---------+------+-------------------+-------------------+----------+
|100000| reorder| 1.0|2020-10-16 10:16:31|2020-11-21 18:50:27|2021-05-29|
|250000|overusage| 1.0|2020-11-21 18:48:02|2021-02-09 20:07:28|2021-05-29|
|100000|overusage| 1.0|2019-05-12 16:00:40|2020-11-21 18:44:04|2021-05-29|
|200000| reorder| 1.0|2020-11-21 19:00:09|2021-05-29 23:56:25|2021-05-29|
+------+---------+------+-------------------+-------------------+----------+
And I want to keep just one row for every possible cost_type, the one whose time range is nearest to ds.
For example, for ds = '2021-05-29' my filter should select the second and fourth rows, but for ds = '2020-05-01' it should select the first and third rows of my dataframe. If my ds falls within the range of min_ts and max_ts, my filter should select that row for every cost type.
A possible way is to assign row numbers based on some conditions:
Whether ds is between min_ts and max_ts.
If not, the smaller of the absolute date difference between ds and min_ts, or between ds and max_ts.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('cost_type').orderBy(
    # rows where ds falls inside [min_ts, max_ts] come first
    F.col('ds').cast('timestamp').between(F.col('min_ts'), F.col('max_ts')).desc(),
    # then the smaller absolute date difference between ds and either bound
    F.least(F.abs(F.datediff('ds', 'max_ts')), F.abs(F.datediff('ds', 'min_ts')))
)

df2 = df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')
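To sanity-check the second scenario from the question, you can substitute a different reference date and re-apply the same window; the literal date below is just the example value from the question:
# Replace ds with '2020-05-01' and re-run the same selection;
# per the question, this should keep the first and third rows
df_check = df.withColumn('ds', F.lit('2020-05-01'))
df_check.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn').show()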
I'm trying to drop duplicates based on column1 and select the row with the max value in column2. column2 holds years (2019, 2020, etc.) as values and is of type String. The solution I have is to convert column2 to an integer and select the max value.
Dataset<Row> ds ; //The dataset with column1,column2(year), column3 etc.
Dataset<Row> newDs = ds.withColumn("column2Int", col("column2").cast(DataTypes.IntegerType));
newDs = newDs.groupBy("column1").max("column2Int"); // drops all other columns
This approach drops all other columns in the original dataset 'ds' when I do a "group by", so I have to join 'ds' and 'newDs' to get back all the original columns. Also, casting the String column to Integer looks like an inefficient workaround.
Is it possible to drop the duplicates and get the row with the larger string value from the original dataset itself?
This is a classic de-duplication problem, and you'll need the Window + rank + filter combo for it.
I'm not very familiar with the Java syntax, but the sample code should look something like this:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
Dataset<Row> df = ???;
WindowSpec windowSpec = Window.partitionBy("column1").orderBy(functions.desc("column2Int"));
Dataset<Row> result =
    df.withColumn("column2Int", functions.col("column2").cast(DataTypes.IntegerType))
      .withColumn("rank", functions.rank().over(windowSpec))
      .where("rank == 1")
      .drop("rank");
result.show(false);
Overview of what happens:
The cast integer column is added to the df for later sorting.
Subsections/windows (partitions) are formed in your dataset based on the value of column1.
Within each of these subsections/windows/partitions the rows are sorted on the column cast to int, in descending order since you want the max.
Ranks (similar to row numbers) are assigned to the rows in each partition/window created.
Rows where the rank is 1 are kept (the max value, since the ordering was descending).
Given a Spark dataframe that I have
val df = Seq(
("2019-01-01",100),
("2019-01-02",101),
("2019-01-03",102),
("2019-01-04",103),
("2019-01-05",102),
("2019-01-06",99),
("2019-01-07",98),
("2019-01-08",100),
("2019-01-09",47)
).toDF("day","records")
I want to add a new column to this so that I get the average value of the last N records on a given day. For example, if N=3, then on a given day, that value should be the average of the last 3 values EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window frame should be ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
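A minimal sketch of that pre-aggregation step, assuming duplicate rows per day should be combined by averaging (that choice, and the column names, follow the question's dataframe):
from pyspark.sql import Window
from pyspark.sql.functions import avg

# Collapse to one row per day first (assumption: average duplicate records per day)
df_daily = df.groupBy("day").agg(avg("records").alias("records"))

# Then apply the same rolling-average window over the de-duplicated dates
w = Window.orderBy(df_daily.day).rowsBetween(-3, -1)
df_daily.withColumn("rsum_prev_3_days", avg(df_daily.records).over(w)).show()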
I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
from pyspark.sql import Row

vals = [
Row(ID=1,VAL=None),
Row(ID=2,VAL=None),
Row(ID=3,VAL=None),
Row(ID=4,VAL=None),
Row(ID=5,VAL=None),
Row(ID=6,VAL=None),
Row(ID=7,VAL=None),
Row(ID=8,VAL=None),
Row(ID=9,VAL=None),
Row(ID=10,VAL=None)
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket", and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain an idea with some pseudo-code that you can map to your solution.
Using a window function over a single partition, we can generate a sequential row_number() for each row in the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another small dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on the row number, adding the new column:
(df1.join(df2,
          on=col('df1.row_num').between(col('min_row_num'), col('max_row_num')))
    .select('df1.*', 'df2.label'))
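A runnable version of that idea might look like the sketch below; the bucket boundaries and labels come from the question, while the helper names (row_num, rules, min_row_num, max_row_num, label) are illustrative assumptions:
from pyspark.sql import functions as F, Window

# 1. Number the rows 1..10 over a single partition (fine for a dataframe this small)
w = Window.orderBy('ID')
df_num = df.withColumn('row_num', F.row_number().over(w))

# 2. Encode the bucket "rules" as a small dataframe
rules = spark.createDataFrame(
    [(1, 3, 'lets'), (4, 6, 'bucket'), (7, 10, 'this')],
    ['min_row_num', 'max_row_num', 'label'])

# 3. Range-join on the row number and use the label as the new VAL
result = (df_num.join(rules,
                      F.col('row_num').between(F.col('min_row_num'), F.col('max_row_num')))
                .select('ID', F.col('label').alias('VAL')))
result.show()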
I have quite a large pandas dataframe with many columns. The dataframe contains two groups. It is basically set up as follows:
import pandas as pd
csv = [{"air" : 0.47,"co2" : 0.43 , "Group" : 1}, {"air" : 0.77,"co2" : 0.13 , "Group" : 1}, {"air" : 0.17,"co2" : 0.93 , "Group" : 2} ]
df = pd.DataFrame(csv)
I want to perform a paired t-test on air and co2 to compare the two groups Group = 1 and Group = 2.
I have many more columns than just air and co2; hence, I would like to find a procedure that works for all columns in the dataframe. I believe I could use scipy.stats.ttest_rel together with pd.groupby or apply. How would that work? Thanks in advance /R
I would use the pandas DataFrame.where method.
group1_air = df.where(df.Group== 1).dropna()['air']
group2_air = df.where(df.Group== 2).dropna()['air']
This bit of code puts into group1_air all the values of the air column where the Group column is 1, and into group2_air all the values of air where Group is 2.
The dropna() is required because the .where method returns NaN for every row in which the specified condition is not met. So all rows where Group is 2 come back as NaN values when you use df.where(df.Group== 1).
Whether you need to use scipy.stats.ttest_rel or scipy.stats.ttest_ind depends on your groups. If your samples are from independent groups you should use ttest_ind; if your samples are from related groups you should use ttest_rel.
So if your samples are independent from one another, your final piece of required code is:
scipy.stats.ttest_ind(group1_air,group2_air)
else you need to use
scipy.stats.ttest_rel(group1_air,group2_air)
When you also want to test co2, you simply need to swap air for co2 in the given example.
Edit:
This is a rough sketch of the code you should run to execute t-tests over every column in your dataframe except for the Group column. You may need to tweak column_list a bit to fit your needs (you may not want to loop over every column, for example).
import scipy.stats

# get a list of all columns in the dataframe without the Group column
column_list = [x for x in df.columns if x != 'Group']
# create an empty dictionary
t_test_results = {}
# loop over column_list and execute the code explained above
for column in column_list:
    group1 = df.where(df.Group == 1).dropna()[column]
    group2 = df.where(df.Group == 2).dropna()[column]
    # add the output to the dictionary
    t_test_results[column] = scipy.stats.ttest_ind(group1, group2)

# collect the results into a dataframe
results_df = pd.DataFrame.from_dict(t_test_results, orient='index')
results_df.columns = ['statistic', 'pvalue']
At the end of this code you have a dataframe with the t-test output for every column you looped over.
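If it helps, a quick follow-up to pull out only the columns below a significance threshold (the 0.05 cut-off is just an illustrative assumption):
# keep only the columns whose p-value is below an (assumed) 0.05 threshold
significant = results_df[results_df['pvalue'] < 0.05]
print(significant)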