I want to create a new column that contains the number of dataframe rows matching a filter.
Here is an example:
+---------------------------------------+
|conditions                             |
+---------------------------------------+
|*                                      |
|*                                      |
|p1==1 AND p2==1                        |
+---------------------------------------+
I tried:
df = df.withColumn('cardinal',df.filter(conditions).count())
it didn't work. The error message is:
"filter expression 'conditions' of type string is not a boolean.;;\nFilter conditions#2043: string\n+-
You have to wrap the result of df.filter(...).count() in lit() (from pyspark.sql.functions) so it can be used as a column value.
Try the syntax below:
>>> df1 = df.withColumn('cardinal',lit(df.filter(conditions).count()))
Now the df1 dataframe will have the cardinal column added to it.
Update:
I tried it with a simple example:
import pyspark.sql.functions as F
df=sc.parallelize([(1,1),(2,1),(3,2)]).toDF(["p1","p2"]) #createDataFrame
conditions=((F.col('p1')==1) & (F.col('p2')==1)) #define conditions variable
df1=df.withColumn("cardinal",F.lit(df.filter(conditions).count())) #add column
df1.show(10,False)
+---+---+--------+
|p1 |p2 |cardinal|
+---+---+--------+
|1 |1 |1 |
|2 |1 |1 |
|3 |2 |1 |
+---+---+--------+
(or)
Without using conditions variable
df1=df.withColumn("cardinal",F.lit(df.filter((F.col('p1')==1) & (F.col('p2')==1)).count()))
df1.show(10,False)
+---+---+--------+
|p1 |p2 |cardinal|
+---+---+--------+
|1 |1 |1 |
|2 |1 |1 |
|3 |2 |1 |
+---+---+--------+
(or)
using .where clause
df1=df.withColumn("cardinal",F.lit(df.where((F.col("p1")==1) & (F.col("p2")==1)).count()))
df1.show(10,False)
+---+---+--------+
|p1 |p2 |cardinal|
+---+---+--------+
|1 |1 |1 |
|2 |1 |1 |
|3 |2 |1 |
+---+---+--------+
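As another option, here is a sketch that is not from the answer above (it reuses the same df and conditions variables defined earlier): the matching-row count can be computed with a window aggregate instead of a separate count() action.
from pyspark.sql import Window

df1 = df.withColumn(
    "cardinal",
    # sum a 1/0 flag over a global window; note that an empty partitionBy()
    # pulls all rows into a single partition, so this suits modest data sizes
    F.sum(F.when(conditions, 1).otherwise(0)).over(Window.partitionBy())
)
df1.show(10, False)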
Let's say I have the following DataFrame -
+---+----------+-----------------------+
|id |date |timestamp |
+---+----------+-----------------------+
|2 |2022-02-02|2022-02-02 10:05:15.336|
|1 |2022-02-01|2022-02-01 10:05:20.536|
|4 |2022-02-02|2022-02-03 11:35:55.336|
|1 |2022-02-01|2022-02-01 10:00:00.336|
|2 |2022-02-02|2022-02-01 10:03:00.336|
|3 |2022-02-01|2022-02-03 11:35:55.336|
|1 |2022-02-01|2022-02-01 10:05:15.336|
|2 |2022-02-02|2022-02-02 11:00:00.000|
|4 |2022-02-02|2022-02-03 11:35:55.336|
|1 |2022-02-01|null |
|1 |2022-02-01|2022-02-01 10:03:00.336|
|3 |2022-02-01|2022-02-01 10:00:00.336|
+---+----------+-----------------------+
First, I want to sort it by date, then partition it by id and sort by timestamp within each id, while keeping the overall sort by date -
+---+----------+-----------------------+
|id |date |timestamp |
+---+----------+-----------------------+
|1 |2022-02-01|2022-02-01 10:00:00.336|
|1 |2022-02-01|2022-02-01 10:05:20.536|
|1 |2022-02-01|2022-02-01 10:03:00.336|
|1 |2022-02-01|2022-02-01 10:05:15.336|
|1 |2022-02-01|null |
|3 |2022-02-01|2022-02-01 10:00:00.336|
|3 |2022-02-01|2022-02-03 11:35:55.336|
|4 |2022-02-02|2022-02-03 11:35:55.336|
|4 |2022-02-02|2022-02-03 11:35:55.336|
|2 |2022-02-02|2022-02-01 10:03:00.336|
|2 |2022-02-02|2022-02-02 10:05:15.336|
|2 |2022-02-02|2022-02-02 11:00:00.000|
+---+----------+-----------------------+
The order of the ids doesn't matter, so I don't want to sort by id as well, because that would make the process too heavy.
Then, I want to give each id an index in ascending order, according to the order obtained after the sorting process -
+---+----------+-----------------------+-------+
|id |date |timestamp | index |
+---+----------+-----------------------+-------+
|1 |2022-02-01|2022-02-01 10:00:00.336| 1 |
|1 |2022-02-01|2022-02-01 10:05:20.536| 1 |
|1 |2022-02-01|2022-02-01 10:03:00.336| 1 |
|1 |2022-02-01|2022-02-01 10:05:15.336| 1 |
|1 |2022-02-01|null | 1 |
|3 |2022-02-01|2022-02-01 10:00:00.336| 2 |
|3 |2022-02-01|2022-02-03 11:35:55.336| 2 |
|4 |2022-02-02|2022-02-03 11:35:55.336| 3 |
|4 |2022-02-02|2022-02-03 11:35:55.336| 3 |
|2 |2022-02-02|2022-02-01 10:03:00.336| 4 |
|2 |2022-02-02|2022-02-02 10:05:15.336| 4 |
|2 |2022-02-02|2022-02-02 11:00:00.000| 4 |
+---+----------+-----------------------+-------+
How can I do that?
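For what it's worth, here is a minimal sketch of one way this could be approached. It is an assumption on my part rather than an answer from the thread: it assumes the DataFrame is bound to df and uses dense_rank over (date, id), which does order by id within a date (something the question hoped to avoid), so the indexes may be assigned to the ids in a different order than in the example.
from pyspark.sql import functions as F, Window

# One global window: every row of an id gets the same index, increasing with date.
# (A window without partitionBy pulls the data into a single partition.)
w = Window.orderBy("date", "id")

result = (
    df.withColumn("index", F.dense_rank().over(w))
      .orderBy(F.col("date"), F.col("id"), F.col("timestamp").asc_nulls_last())
)
result.show(truncate=False)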
This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only two rows with minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize since this is the place to learn! One of the solutions is to use a Window with rank to find the min/max rows:
df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
from pyspark.sql import Window
import pyspark.sql.functions as func

df\
    .withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
    .withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
    .filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
    .show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if several rows tie for the min or max count, they will all be kept (as with col2 = 1 in the output above).
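Applied to the DataFrame from the question, a sketch (it assumes that DataFrame is bound to df, and renames count to match the expected output):
from pyspark.sql import Window
import pyspark.sql.functions as func

result = (
    df
    .withColumn('min_row', func.rank().over(Window.orderBy(func.asc('count'))))
    .withColumn('max_row', func.rank().over(Window.orderBy(func.desc('count'))))
    .filter((func.col('min_row') == 1) | (func.col('max_row') == 1))
    .drop('min_row', 'max_row')
    .withColumnRenamed('count', 'number_of_occurences')
)
result.show(truncate=False)
With this data, all three openings tied at count = 1 are kept, per the note above.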
You can use row_number function twice to order records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected, or use rank instead to select all tied records.
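If you want to run that SQL from PySpark, a possible sketch (it assumes the DataFrame is registered as a temporary view named df):
# Register the DataFrame so the SQL above can refer to it by name.
df.createOrReplaceTempView("df")

spark.sql("""
    SELECT eco, eco_name, count
    FROM (SELECT *,
                 row_number() over (order by count asc) as rna,
                 row_number() over (order by count desc) as rnd
          FROM df)
    WHERE rna = 1 or rnd = 1
""").show(truncate=False)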
Suppose we want to track the hops made by a package from the warehouse to the customer.
We have a table which stores the data, but the data is in a single column, say Route.
The package starts at a warehouse – YYY, TTT, or MMM.
The hops end when the package is delivered to the CUSTOMER.
The values in the Route column are separated by spaces.
ID Route
1 TTT A B X Y Z CUSTOMER
2 YYY E Y F G I P B X Q CUSTOMER
3 MMM R T K L CUSTOMER
Expected Output
ID START END
1 TTT A
1 A B
1 B X
.
.
.
1 Z CUSTOMER
2 YYY E
2 E Y
2 Y F
.
.
2 Q CUSTOMER
3 MMM R
.
.
3 L CUSTOMER
Is there any way to achieve this in PySpark?
Add an index to the split route using posexplode, and get the location at the next index for each starting location using lead. If you want to remove the index simply add .drop('index') at the end.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df2 = df.select(
    'ID',
    F.posexplode(F.split('Route', ' ')).alias('index', 'start')
).withColumn(
    'end',
    F.lead('start').over(Window.partitionBy('ID').orderBy('index'))
).orderBy('ID', 'index').dropna()

df2.show(99, 0)
+---+-----+-----+--------+
|ID |index|start|end |
+---+-----+-----+--------+
|1 |0 |TTT |A |
|1 |1 |A |B |
|1 |2 |B |X |
|1 |3 |X |Y |
|1 |4 |Y |Z |
|1 |5 |Z |CUSTOMER|
|2 |0 |YYY |E |
|2 |1 |E |Y |
|2 |2 |Y |F |
|2 |3 |F |G |
|2 |4 |G |I |
|2 |5 |I |P |
|2 |6 |P |B |
|2 |7 |B |X |
|2 |8 |X |Q |
|2 |9 |Q |CUSTOMER|
|3 |0 |MMM |R |
|3 |1 |R |T |
|3 |2 |T |K |
|3 |3 |K |L |
|3 |4 |L |CUSTOMER|
+---+-----+-----+--------+
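And, as noted above, if the positional index is not needed downstream:
# Drop the helper column once the start/end pairs have been built.
df2 = df2.drop('index')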
I am creating a DataFrame as per the given schema; after that, I want to create a new DataFrame by re-ordering the columns of the existing one.
Is it possible to re-order the columns of a Spark DataFrame?
object Demo extends Context {
  def main(args: Array[String]): Unit = {
    val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
      (2,"Rose",1,"2010","20","M",4000),
      (3,"Williams",1,"2010","10","M",1000),
      (4,"Jones",2,"2005","10","F",2000),
      (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
    )
    val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
      "emp_dept_id","gender","salary")

    import sparkSession.sqlContext.implicits._
    val empDF = emp.toDF(empColumns: _*)
    empDF.show(false)
  }
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output to be the following DataFrame, where the gender and salary columns are re-ordered:
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id','name','gender','salary','superior_emp_id','year_joined','emp_dept_id')
The columns will be shown in the order you pass them to select().
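If you'd rather not hard-code every column name, here is a small sketch (my addition, reusing the df from above) that builds the same ordering from df.columns:
# Put the columns you care about first, keep the rest in their existing order.
front = ['emp_id', 'name', 'gender', 'salary']
rest = [c for c in df.columns if c not in front]
df = df.select(*(front + rest))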
Scala way of doing it
import org.apache.spark.sql.functions.col

//Order the column names as you want
val columns = Array("emp_id","name","gender","salary","superior_emp_id","year_joined","emp_dept_id")
  .map(col)

//Pass it to select
df.select(columns: _*)
Suppose I have a pyspark dataframe with an id column and a time column (t) in seconds. For each id I'd like to group the rows so that each group has all entries that are within 5 seconds after the start time for that group. So for instance, if the table is:
+---+--+
|id |t |
+---+--+
|1 |0 |
|1 |1 |
|1 |3 |
|1 |8 |
|1 |14|
|1 |18|
|2 |0 |
|2 |20|
|2 |21|
|2 |50|
+---+--+
Then the result should be:
+---+--+---------+-------------+-------+
|id |t |subgroup |window_start |offset |
+---+--+---------+-------------+-------+
|1 |0 |1 |0 |0 |
|1 |1 |1 |0 |1 |
|1 |3 |1 |0 |3 |
|1 |8 |2 |8 |0 |
|1 |14|3 |14 |0 |
|1 |18|3 |14 |4 |
|2 |0 |1 |0 |0 |
|2 |20|2 |20 |0 |
|2 |21|2 |20 |1 |
|2 |50|3 |50 |0 |
+---+--+---------+-------------+-------+
I don't need the subgroup numbers to be consecutive. I'm ok with solutions using custom UDAF in Scala as long as it is efficient.
Computing (cumsum(t)-(cumsum(t)%5))/5 within each group can be used to identify the first window, but not the ones beyond that. Essentially the problem is that after the first window is found, the cumulative sum needs to reset to 0. I could operate recursively using this cumulative sum approach, but that is too inefficient on a large dataset.
The following works and is more efficient than recursively calling cumsum, but it is still so slow as to be useless on large dataframes.
import numpy
import pyspark.sql.functions
import pyspark.sql.types

d = [[int(x[0]),float(x[1])] for x in [[1,0],[1,1],[1,4],[1,7],[1,14],[1,18],[2,5],[2,20],[2,21],[3,0],[3,1],[3,1.5],[3,2],[3,3.5],[3,4],[3,6],[3,6.5],[3,7],[3,11],[3,14],[3,18],[3,20],[3,24],[4,0],[4,1],[4,2],[4,6],[4,7]]]

schema = pyspark.sql.types.StructType(
    [
        pyspark.sql.types.StructField('id', pyspark.sql.types.LongType(), False),
        pyspark.sql.types.StructField('t', pyspark.sql.types.DoubleType(), False)
    ]
)
df = spark.createDataFrame(
    [pyspark.sql.Row(*x) for x in d],
    schema
)
def getSubgroup(ts):
    # Walk through the sorted times; the running total is the gap from the
    # current subgroup's start, so a new subgroup begins once it reaches 5.
    result = []
    total = 0
    ts = sorted(ts)
    tdiffs = numpy.array(ts)
    tdiffs = tdiffs[1:] - tdiffs[:-1]
    tdiffs = numpy.concatenate([[0], tdiffs])
    subgroup = 0
    for k in range(len(tdiffs)):
        t = ts[k]
        tdiff = tdiffs[k]
        total = total + tdiff
        if total >= 5:
            total = 0
            subgroup += 1
        result.append([t, float(subgroup)])
    return result
getSubgroupUDF = pyspark.sql.functions.udf(getSubgroup,pyspark.sql.types.ArrayType(pyspark.sql.types.ArrayType(pyspark.sql.types.DoubleType())))
subgroups = df.select('id', 't').distinct().groupBy(
    'id'
).agg(
    pyspark.sql.functions.collect_list('t').alias('ts')
).withColumn(
    't_and_subgroup',
    pyspark.sql.functions.explode(getSubgroupUDF('ts'))
).withColumn(
    't',
    pyspark.sql.functions.col('t_and_subgroup').getItem(0)
).withColumn(
    'subgroup',
    pyspark.sql.functions.col('t_and_subgroup').getItem(1).cast(pyspark.sql.types.IntegerType())
).drop(
    't_and_subgroup', 'ts'
)

df = df.join(
    subgroups,
    on=['id', 't'],
    how='inner'
)

df.orderBy(
    pyspark.sql.functions.asc('id'), pyspark.sql.functions.asc('t')
).show()
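For reference, a quick local check (my addition) of getSubgroup on the id = 1 times from the first example table, run as plain Python:
print(getSubgroup([0, 1, 3, 8, 14, 18]))
# [[0, 0.0], [1, 0.0], [3, 0.0], [8, 1.0], [14, 2.0], [18, 2.0]]
# (subgroup numbers here are 0-based; the example table above is 1-based)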
The subgroup column is equivalent to partitioning by (id, window_start), so maybe you don't need to create it.
To create window_start, I think this does the job:
.withColumn("window_start", min("t").over(Window.partitionBy("id").orderBy(asc("t")).rangeBetween(0, 5)))
I'm not 100% sure about the behavior of rangeBetween.
To create offset, it's just .withColumn("offset", col("t") - col("window_start"))
Let me know how it goes