Pyspark: Dropping columns with no distinct values only using transformations [duplicate] - python-3.x

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks

You can apply the countDistinct() aggregation function on each column to get count of distinct values per column. Column with count=1 means it has only 1 value in all rows.
# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# select the cols with count=1 in an array
cols_to_drop = [col for col in df.columns if col_counts[col] == 1 ]
# drop the selected column
df.drop(*cols_to_drop).show()

You can use approx_count_distinct function (link) to count the number of distinct elements in a column. In case there is just one distinct, the remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Couting number of distinct elements and converting it into dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
#Storing those keys in the list which have just 1 distinct key.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns having distinct values
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+

Related

How to create an index column, every 7 occurrences in pyspark

I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'date_col':['2010-02-27','2010-01-20','2010-01-20','2010-01-21','2010-01-21','2010-02-21','2010-02-22','2010-02-23','2010-02-24','2010-02-25','2010-02-26','2010-01-20','2010-01-21','2010-02-20'], 'group':['a','a','a','a','a','a','a','a','a','a','a','b','b','b']})
I would like to create a week column, which would be an index, which increases every 7 ordered unique values of the date_col by group.
The resulting dataframe should look like this:
foo = pd.DataFrame({'date_col':['2010-02-27','2010-01-20','2010-01-20','2010-01-21','2010-01-21','2010-02-21','2010-02-22','2010-02-23','2010-02-24','2010-02-25','2010-02-26','2010-01-20','2010-01-21','2010-02-20'], 'group':['a','a','a','a','a','a','a','a','a','a','a','b','b','b'],
'week':[2,2,1,1,1,1,1,1,1,1,1,1,1,1]})
Any ideas could I do that in pyspark?
UPDATE
Some more explanation on the logic.
Basically the operation could be split into the following steps:
Order the foo on date_col grouped by group
Create a temp_index column, which would rank the date_col by group
Create a week column which would be the div of temp_index with 7
You can use dense_rank and divide the rank by 7. You need to subtract 1 before dividing because in SQL, ranks start from 1 rather than 0.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'week',
(
(F.dense_rank().over(Window.partitionBy('group').orderBy('date_col')) - 1) / 7
).cast('int')
)
df2.show()
+----------+-----+----+
| date_col|group|week|
+----------+-----+----+
|2010-01-20| b| 0|
|2010-01-21| b| 0|
|2010-02-20| b| 0|
|2010-01-20| a| 0|
|2010-01-20| a| 0|
|2010-01-21| a| 0|
|2010-01-21| a| 0|
|2010-02-21| a| 0|
|2010-02-22| a| 0|
|2010-02-23| a| 0|
|2010-02-24| a| 0|
|2010-02-25| a| 0|
|2010-02-26| a| 1|
|2010-02-27| a| 1|
+----------+-----+----+

Generate n columns in dataframe based on mutiple values

I have created dataframe like this from a table
df = spark.sql("select * from test") # it is having 2 columns id and name
df2 = df.groupby('id').agg(collect_list('name')
df2.show()
|id|name|
|44038:4572|[0032477212299451]|
|44038:5439|[00324772, 0032477, 003247, 00324]|
|44038:4429|[0032477212299308]|
Until here it's correct, for one id I can store multiple names (values).
Now when I try to create dynamic columns into dataframe based on values, it is not working.
df3 = df2.select([df2.id] + [df2.name[i] for i in range (length)])
Output:
|id |name[0]|
|44038:4572|0032477212299451|
|44038:5439|00324772|
|44038:4429|032477212299308|
Expected output in dataframe:
|id|name[0]|name[1]|name[2]|name[3]|
|44038:4572|0032477212299451|null|null|null|
|44038:5439|00324772|0032477|003247|0034|
|44038:4429|032477212299308|null|null|null|
And then have to replace null with 0.
You might be better off doing pivot instead of collect_list:
from pyspark.sql import functions as F, Window
df2 = (df.withColumn('rn', F.row_number().over(Window.partitionBy('id').orderBy(F.desc('name'))))
.groupBy('id')
.pivot('rn')
.agg(F.first('name'))
.fillna("0")
)
df2.show()
+----------+----------------+-------+------+-----+
| id| 1| 2| 3| 4|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+
If you want pretty column names, you can do
df3 = df2.toDF('id', *[f'name{i}' for i in range(len(df2.columns) - 1)])
df3.show()
+----------+----------------+-------+------+-----+
| id| name0| name1| name2|name3|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451| 0| 0| 0|
|44038:5439| 00324772|0032477|003247|00324|
|44038:4429|0032477212299308| 0| 0| 0|
+----------+----------------+-------+------+-----+

How to create an index column, a 2 window moving average column and a 2 window difference column with groupby in pyspark

I have a spark dataframe that looks like this
import pandas as pd
dfs = pd.DataFrame({'country':['a','a','a','a','b','b'], 'value':[1,2,3,4,5,6], 'id':[3,5,4,6, 8,7]})
I would like to add 3 new columns in this dataframe.
An index that starts from 1 and increases for each row, by country
A 2 window difference of the value column by country, ordered by id
A 2 window moving average of the value column by country, ordered by id
Any ideas how I can do that in one go ?
EDIT
The difference column should be [1,2,-1,2,6,-1] and it is calculated as follows:
The rows are ordered by id. Then, the first rows for each country remain unchanged. Then for the second row for country a it is 3-1=2, for the 3rd row it is 2-3=-1 etc
you can use the rowsBetween window spec with windows function
##%%
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.sql.window import Window
# Test data
dfs = sqlContext.createDataFrame([('a',1,3),('a',2,5),('a',3,4),('a',4,6),('b',5,8),('b',6,7)],schema=['country','value','id'])
# First window to calculate the id and difference in values
w=Window.partitionBy('country').orderBy('id')
# use row_number() and lag() functions to get the values
df_id = (dfs.withColumn("id",F.row_number().over(w))).withColumn("delta",F.col('value')-F.lag('value',default=0).over(w))
#% Second window to calculate the moving average, sum and difference
w1 = w=Window.partitionBy('country').orderBy('id').rowsBetween(-1,0)
# do the calculations with a window spec of 2, defined by (-1,0) in w1
df = (df_id.withColumn("movingaverage",F.mean('value').over(w1))).withColumn("moving_sum",F.sum('value').over(w1))
# Additional calculation, not requested by the author
df_res = df.withColumn("moving_difference", F.col('value')-F.col("moving_sum"))
The results
df_res.show()
+-------+-----+---+-----+-------------+----------+-----------------+
|country|value| id|delta|movingaverage|moving_sum|moving_difference|
+-------+-----+---+-----+-------------+----------+-----------------+
| a| 1| 1| 1| 1.0| 1| 0|
| a| 3| 2| 2| 2.0| 4| -1|
| a| 2| 3| -1| 2.5| 5| -3|
| a| 4| 4| 2| 3.0| 6| -2|
| b| 6| 1| 6| 6.0| 6| 0|
| b| 5| 2| -1| 5.5| 11| -6|
+-------+-----+---+-----+-------------+----------+-----------------+

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation based off of the value of type.
df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
.withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
.groupBy("ID") \
.agg(F.sum("amount").alias("sales"),
F.sum("sales_a").alias("sales_a"),
F.sum("sales_b").alias("sales_b"))
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = df.join(dfPivot, df.id== dfPivot.id,how='left')
Then replace null with 0.
This is generic solution will work irrespective of values in type column.. so if type c is added in dataframe then it will create column _c

Saving iteratively to a new DataFrame in Pyspark

I'm performing computations based on 3 different PySpark DataFrames.
This script works in the sense that it performs the computation as it should, however, I struggle with working properly with the results of said computation.
import sys
import numpy as np
from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext("local")
sqlContext = SQLContext(sc)
# Dummy Data
df = sqlContext.createDataFrame([[0,1,0,0,0],[1,1,0,0,1],[0,0,1,0,1],[1,0,1,1,0],[1,1,0,0,0]], ['p1', 'p2', 'p3', 'p4', 'p5'])
df.show()
+---+---+---+---+---+
| p1| p2| p3| p4| p5|
+---+---+---+---+---+
| 0| 1| 0| 0| 0|
| 1| 1| 0| 0| 1|
| 0| 0| 1| 0| 1|
| 1| 0| 1| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
# Values
values = sqlContext.createDataFrame([(0,1,'p1'),(None,1,'p2'),(0,0,'p3'),(None,0, 'p4'),(1,None,'p5')], ('f1', 'f2','index'))
values.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 0| 1| p1|
|null| 1| p2|
| 0| 0| p3|
|null| 0| p4|
| 1|null| p5|
+----+----+-----+
# Weights
weights = sqlContext.createDataFrame([(4,3,'p1'),(None,1,'p2'),(2,2,'p3'),(None, 3, 'p4'),(3,None,'p5')], ('f1', 'f2','index'))
weights.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 4| 3| p1|
|null| 1| p2|
| 2| 2| p3|
|null| 3| p4|
| 3|null| p5|
+----+----+-----+
# Function: it sums the vector W for the values of Row equal to the value of V and then divide by the length of V.
# If there a no similarities between Row and V outputs 0
def W_sum(row,v,w):
if len(w[row==v])>0:
return float(np.sum(w[row==v])/len(w))
else:
return 0.0
For each of the columns and for each row in Data, the above function is applied.
# We iterate over the columns of Values (except the last one called index)
for val in values.columns[:-1]:
# we filter the data to work only with the columns that are defined for the selected Value
defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
# we select only the useful columns
df_select= df.select(defined_col)
# we retrieve the reference value and weights
V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
df_select.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in df_select.columns))))
This gives :
+---+---+---+---+---+---+
| p1| p2| p3| p4| p5| f1|
+---+---+---+---+---+---+
| 0| 1| 0| 0| 0|2.0|
| 1| 1| 0| 0| 1|1.0|
| 0| 0| 1| 0| 1|2.0|
| 1| 0| 1| 1| 0|0.0|
| 1| 1| 0| 0| 0|0.0|
+---+---+---+---+---+---+
It added the column to the sliced DataFrame as I asked it to. The problem is that I would rather collect the data into a new one that I could access at the end to consult the results.
It it possible to grow (somewhat efficiently) a DataFrame in PySpark as I would with pandas?
Edit to make my goal clearer:
Ideally I would get a DataFrame with the just the computed columns, like this:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
There are some issues with your question...
First, your for loop will produce an error, since df_select in the last line is nowhere defined; there is also no assignment at the end (what does it produce?).
Assuming that df_select is actually your subsubsample dataframe, defined some lines before, and that your last line is something like
new_df = subsubsample.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in subsubsample.columns))))
then your problem starts getting more clear. Since
values.columns[:-1]
# ['f1', 'f2']
the result of the whole loop would be just
+---+---+---+---+---+
| p1| p2| p3| p4| f2|
+---+---+---+---+---+
| 0| 1| 0| 0|1.0|
| 1| 1| 0| 0|2.0|
| 0| 0| 1| 0|0.0|
| 1| 0| 1| 1|0.0|
| 1| 1| 0| 0|2.0|
+---+---+---+---+---+
i.e. with only the column f2 included (natural, since the results with f1 are simply overwritten).
Now, as I said, assuming that the situation is like this, and that your problem is actually how to have both columns f1 & f2 together rather in different dataframes, you can just forget subsubsample and append columns to your initial df, possibly dropping afterwards the unwanted ones:
init_cols = df.columns
init_cols
# ['p1', 'p2', 'p3', 'p4', 'p5']
new_df = df
for val in values.columns[:-1]:
# we filter the data to work only with the columns that are defined for the selected Value
defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
# we retrieve the reference value and weights
V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
new_df = new_df.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in defined_col)))) # change here
# drop initial columns:
for i in init_cols:
new_df = new_df.drop(i)
The resulting new_df will be:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
UPDATE (after comment): To force the division in your W_sum function to be a float, use:
from __future__ import division
new_df now will be:
+---------+----+
| f1| f2|
+---------+----+
| 2.0| 1.5|
|1.6666666|2.25|
|2.3333333|0.75|
| 0.0|0.75|
|0.6666667|2.25|
+---------+----+
with f2 exactly as it should be according to your comment.

Resources