I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'date_col':['2010-02-27','2010-01-20','2010-01-20','2010-01-21','2010-01-21','2010-02-21','2010-02-22','2010-02-23','2010-02-24','2010-02-25','2010-02-26','2010-01-20','2010-01-21','2010-02-20'], 'group':['a','a','a','a','a','a','a','a','a','a','a','b','b','b']})
I would like to create a week column: an index that increases after every 7 ordered unique values of date_col, per group.
The resulting dataframe should look like this:
foo = pd.DataFrame({'date_col':['2010-02-27','2010-01-20','2010-01-20','2010-01-21','2010-01-21','2010-02-21','2010-02-22','2010-02-23','2010-02-24','2010-02-25','2010-02-26','2010-01-20','2010-01-21','2010-02-20'], 'group':['a','a','a','a','a','a','a','a','a','a','a','b','b','b'],
'week':[2,2,1,1,1,1,1,1,1,1,1,1,1,1]})
Any ideas how I could do that in pyspark?
UPDATE
Some more explanation on the logic.
Basically the operation could be split into the following steps:
1. Order foo on date_col, grouped by group
2. Create a temp_index column that ranks date_col within each group
3. Create a week column as the integer division of temp_index by 7
You can use dense_rank and divide the rank by 7. You need to subtract 1 before dividing because in SQL, ranks start from 1 rather than 0.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'week',
    (
        (F.dense_rank().over(Window.partitionBy('group').orderBy('date_col')) - 1) / 7
    ).cast('int')
)
df2.show()
+----------+-----+----+
| date_col|group|week|
+----------+-----+----+
|2010-01-20| b| 0|
|2010-01-21| b| 0|
|2010-02-20| b| 0|
|2010-01-20| a| 0|
|2010-01-20| a| 0|
|2010-01-21| a| 0|
|2010-01-21| a| 0|
|2010-02-21| a| 0|
|2010-02-22| a| 0|
|2010-02-23| a| 0|
|2010-02-24| a| 0|
|2010-02-25| a| 0|
|2010-02-26| a| 1|
|2010-02-27| a| 1|
+----------+-----+----+
I have an incremental load in csv files. I read the csv into a dataframe. The dataframe has one column containing some strings. I have to find the distinct strings in this column and assign an ID (integer) to each value, starting from 0, after joining one other dataframe.
In the next run, I have to assign the ID after finding out the max value in ID column and incrementing it for different strings. Wherever there is a null in ID column, I have to increment it (+1) from the value of the previous run.
FIRST RUN
| string | ID |
|--------|----|
| zero   | 0  |
| first  | 1  |
| second | 2  |
| third  | 3  |
| fourth | 4  |
SECOND RUN
MAX(ID) = 4
| string  | ID |
|---------|----|
| zero    | 0  |
| first   | 1  |
| second  | 2  |
| third   | 3  |
| fourth  | 4  |
| fifth   | 5  |
| sixth   | 6  |
| seventh | 7  |
| eighth  | 8  |
I have tried this but couldn't make it work:
max = df.agg({"ID": "max"}).collect()[0][0]
df_incremented = df.withcolumn("ID", when(col("ID").isNull(),expr("max += 1")))
Let me know if there is an easy way to achieve this.
As you keep only distinct values, you can use the row_number function over a window:
from pyspark.sql import Window
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("a",), ("a",), ("b",), ("c",), ("d",), ("e",), ("e",)],
    ("string",)
)
w = Window.orderBy("string")
df1 = df.distinct().withColumn("ID", F.row_number().over(w) - 1)
df1.show()
#+------+---+
#|string| ID|
#+------+---+
#| a| 0|
#| b| 1|
#| c| 2|
#| d| 3|
#| e| 4|
#+------+---+
Now let's add some rows to this dataframe and use row_number along with coalesce to assign an ID only to rows where it is null (no need to get the max):
df2 = df1.union(spark.sql("select * from values ('f', null), ('h', null), ('i', null)"))
df3 = df2.withColumn("ID", F.coalesce("ID", F.row_number().over(w) - 1))
df3.show()
#+------+---+
#|string| ID|
#+------+---+
#| a| 0|
#| b| 1|
#| c| 2|
#| d| 3|
#| e| 4|
#| f| 5|
#| h| 6|
#| i| 7|
#+------+---+
If you wanted to keep duplicated values too and assign them the same ID, then use dense_rank instead of row_number.
I have created dataframe like this from a table
from pyspark.sql.functions import collect_list

df = spark.sql("select * from test")  # it has two columns: id and name
df2 = df.groupby('id').agg(collect_list('name'))
df2.show()
|id|name|
|---|---|
|44038:4572|[0032477212299451]|
|44038:5439|[00324772, 0032477, 003247, 00324]|
|44038:4429|[0032477212299308]|
Up to here it's correct: for one id I can store multiple names (values). But when I try to create dynamic columns in the dataframe based on those values, it is not working.
df3 = df2.select([df2.id] + [df2.name[i] for i in range (length)])
Output:
|id|name[0]|
|---|---|
|44038:4572|0032477212299451|
|44038:5439|00324772|
|44038:4429|032477212299308|
Expected output in dataframe:
|id|name[0]|name[1]|name[2]|name[3]|
|---|---|---|---|---|
|44038:4572|0032477212299451|null|null|null|
|44038:5439|00324772|0032477|003247|0034|
|44038:4429|032477212299308|null|null|null|
And then have to replace null with 0.
You might be better off doing a pivot instead of collect_list:
from pyspark.sql import functions as F, Window
df2 = (df.withColumn('rn', F.row_number().over(Window.partitionBy('id').orderBy(F.desc('name'))))
.groupBy('id')
.pivot('rn')
.agg(F.first('name'))
.fillna("0")
)
df2.show()
+----------+----------------+-------+------+-----+
| id| 1| 2| 3| 4|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+
If you want pretty column names, you can do
df3 = df2.toDF('id', *[f'name{i}' for i in range(len(df2.columns) - 1)])
df3.show()
+----------+----------------+-------+------+-----+
| id| name0| name1| name2|name3|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+
I have a spark dataframe that looks like this
import pandas as pd
dfs = pd.DataFrame({'country':['a','a','a','a','b','b'], 'value':[1,2,3,4,5,6], 'id':[3,5,4,6, 8,7]})
I would like to add 3 new columns to this dataframe:

1. An index that starts from 1 and increases for each row, by country
2. A 2-window difference of the value column by country, ordered by id
3. A 2-window moving average of the value column by country, ordered by id

Any ideas how I can do that in one go?
EDIT
The difference column should be [1, 2, -1, 2, 6, -1] and it is calculated as follows:
The rows are ordered by id. The first row for each country remains unchanged. Then, for the second row of country a it is 3-1=2, for the third row 2-3=-1, and so on.
You can use the rowsBetween window spec with window functions.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Test data
dfs = sqlContext.createDataFrame([('a',1,3),('a',2,5),('a',3,4),('a',4,6),('b',5,8),('b',6,7)],schema=['country','value','id'])
# First window to calculate the id and difference in values
w=Window.partitionBy('country').orderBy('id')
# use row_number() and lag() functions to get the values
df_id = (dfs.withColumn("id",F.row_number().over(w))).withColumn("delta",F.col('value')-F.lag('value',default=0).over(w))
# Second window to calculate the moving average and sum
w1 = Window.partitionBy('country').orderBy('id').rowsBetween(-1, 0)
# do the calculations with a window spec of 2, defined by (-1,0) in w1
df = (df_id.withColumn("movingaverage",F.mean('value').over(w1))).withColumn("moving_sum",F.sum('value').over(w1))
# Additional calculation, not requested by the author
df_res = df.withColumn("moving_difference", F.col('value')-F.col("moving_sum"))
The results
df_res.show()
+-------+-----+---+-----+-------------+----------+-----------------+
|country|value| id|delta|movingaverage|moving_sum|moving_difference|
+-------+-----+---+-----+-------------+----------+-----------------+
| a| 1| 1| 1| 1.0| 1| 0|
| a| 3| 2| 2| 2.0| 4| -1|
| a| 2| 3| -1| 2.5| 5| -3|
| a| 4| 4| 2| 3.0| 6| -2|
| b| 6| 1| 6| 6.0| 6| 0|
| b| 5| 2| -1| 5.5| 11| -6|
+-------+-----+---+-----+-------------+----------+-----------------+
I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation, based on the value of type.
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
  .withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
  .groupBy("ID") \
  .agg(F.sum("amount").alias("sales"),
       F.sum("sales_a").alias("sales_a"),
       F.sum("sales_b").alias("sales_b"))
from pyspark.sql import functions as F
dfSales = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = dfSales.join(dfPivot, on="ID", how="left")
Then replace null with 0.
This is a generic solution that works irrespective of the values in the type column: if a type c appears in the dataframe, the pivot will create a column for it.
Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks
You can apply the countDistinct() aggregation function on each column to get the count of distinct values per column. A column with count=1 has the same value in all rows.
from pyspark.sql.functions import countDistinct, col

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# select the columns with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
# drop the selected columns
df.drop(*cols_to_drop).show()
You can use the approx_count_distinct function (link) to count the number of distinct elements in a column. If there is just one distinct value, remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting the result into a dictionary:
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
# store the keys that have just 1 distinct value
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns having a single distinct value:
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+