How can I efficiently drop all columns that have only a single value from a DataFrame?
I found two ways:
This method ignores nulls and only considers other values; I need to consider nulls in my case:
from pyspark.sql.functions import col, countDistinct

# apply countDistinct on each column
col_counts = partsDF.agg(*(countDistinct(col(c)).alias(c) for c in partsDF.columns)).collect()[0].asDict()
This method takes too long:
# one Spark job per column, so this is slow on wide tables
col_counts = {c: partsDF.select(c).distinct().count() for c in partsDF.columns}
# select the columns with count == 1 into a list
cols_to_drop = [c for c in partsDF.columns if col_counts[c] == 1]
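Since countDistinct skips nulls, one way to keep the single aggregation pass and still treat null as a value is to add a per-column null indicator. A rough sketch, assuming the same partsDF as above:

from pyspark.sql import functions as F

# One aggregation pass: distinct non-null values per column,
# plus 1 if the column contains at least one null
agg_exprs = []
for c in partsDF.columns:
    has_null = (F.count(F.when(F.col(c).isNull(), 1)) > 0).cast("int")
    agg_exprs.append((F.countDistinct(F.col(c)) + has_null).alias(c))

col_counts = partsDF.agg(*agg_exprs).collect()[0].asDict()
cols_to_drop = [c for c in partsDF.columns if col_counts[c] == 1]
partsDF = partsDF.drop(*cols_to_drop)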
I have 2 dataframes as below:
Based on the values of df2, if a df1 row matches ALL the conditions of a df2 row, then remove it from df1.
Expected output:
If a column value in df2 is NULL, it should match ALL values of that column; otherwise it is a regular match.
i.e. the 1st row (from df2) only has a product value (the other columns are null), so the filter should match all values of book and business and ONLY product = Forex; the "tr4" row is therefore matched and hence removed.
The 2nd row (from df2) has to match book = b2, all business values (since NULL) and product = Swap; no rows match all of these (AND) conditions, so nothing is removed.
The result can be in place or a new df. How can this be done?
rows_to_drop = set()
for i in range(len(df2)):
    for j in range(len(df1)):
        # a "[NULL]" entry in df2 matches any value; otherwise require equality with the df1 row
        if ((df2['book'][i] == "[NULL]" or df2['book'][i] == df1['book'][j])
                and (df2['business'][i] == "[NULL]" or df2['business'][i] == df1['business'][j])
                and (df2['product'][i] == "[NULL]" or df2['product'][i] == df1['product'][j])):
            rows_to_drop.add(j)
df1 = df1.drop(index=list(rows_to_drop))
df1.reset_index(drop=True, inplace=True)
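A way to avoid the row-by-row loop over df1 is to build one boolean mask per df2 rule and drop all matching rows at once; a rough sketch, keeping the "[NULL]" sentinel from the code above as the wildcard marker:

import pandas as pd

# Accumulate a mask of df1 rows that match at least one df2 rule
mask_to_drop = pd.Series(False, index=df1.index)
for _, rule in df2.iterrows():
    rule_mask = pd.Series(True, index=df1.index)
    for c in ['book', 'business', 'product']:
        if rule[c] != "[NULL]":          # "[NULL]" acts as a wildcard
            rule_mask &= (df1[c] == rule[c])
    mask_to_drop |= rule_mask

df1 = df1[~mask_to_drop].reset_index(drop=True)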
Input:
Expected Output:
These columns are dynamic; there can be 'n' columns. I need to filter records by all columns not equal to 0. The column datatype is decimal.
for column in df.columns:
    count = df.filter(F.col(column) != 0).count()
    if count > 0:
        # do some function
        ...
    else:
        # do some other function
        ...
Main problems I face:
Is there any way to filter all columns at the same time without a for loop?
NULL records are not being filtered out.
How can this be done efficiently in PySpark?
You can create a filter expression dynamically for all the columns and then perform your filter.
filter_condition = ''
for column_name in df.schema.names:
    if filter_condition != '':
        filter_condition += ' AND '
    filter_condition += f"{column_name} <> 0"
display(df.filter(filter_condition))
This one uses simple SQL syntax.
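If rows with NULLs should also be dropped (treating NULL like 0, as the question's second problem suggests), a variant that builds Column expressions instead of a SQL string could look like this sketch:

from functools import reduce
from pyspark.sql import functions as F

# Combine one condition per column; coalesce treats NULL as 0, so those rows are filtered out too
condition = reduce(
    lambda acc, c: acc & c,
    [F.coalesce(F.col(name), F.lit(0)) != 0 for name in df.columns]
)
filtered_df = df.filter(condition)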
def WA(A, B):
    c = A * B
    return c.sum() / A.sum()

sample = pd.pivot_table(intime, index=['year'], columns='DV', values=['A', 'B'],
                        aggfunc={'A': [len, np.sum], 'B': [WA]}, fill_value=0)
I'm grouping the dataframe by year and want to find the weighted average of column B.
I'm supposed to multiply column A by B, then sum up the result and divide it by the sum of A [the function WA() does that].
I really have no idea how to call the function by passing both columns.
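Because a pivot_table aggfunc only ever receives a single column, one workaround is to compute the weighted average with groupby.apply, which sees the whole group at once. A sketch reusing the names from the question (intime, year, DV, A, B):

# Weighted average of B per (year, DV) group, weighted by A
def weighted_avg(g):
    return (g['A'] * g['B']).sum() / g['A'].sum()

wa = (
    intime.groupby(['year', 'DV'])
          .apply(weighted_avg)          # one scalar per (year, DV) group
          .unstack('DV', fill_value=0)  # pivot DV into columns
)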
How can I iterate through a Pandas DataFrame column and fill null values with input from another column within the same DataFrame?
My objective is to fill the NA values in column y with the corresponding values in column z.
It's best to avoid iterating through DataFrames when the task can be accomplished with vectorized expressions. Something like this should work, although it may need to be massaged a little for your specific case.
# Set your dataframe
df = ...
# Boolean mask for the positions where column b is NA
nulls_in_b = df["b"].isna()
# Assign the values from column c at those positions (label-based indexing avoids chained assignment)
df.loc[nulls_in_b, "b"] = df.loc[nulls_in_b, "c"]
I have parquet files and want to read them based on dynamic columns. To take an example, I have 2 dataframes and want to select data from df1 based on df2.
I am using the code below but want to make it dynamic in terms of the join columns; today I have 2 columns, tomorrow I could have 4.
a = dict[keys]
col1 = a[0]
col2 = a[1]
v = df1.join(df2, [df1[col1] == df2[col1],
                   df1[col2] == df2[col2]],
             how='inner')
So how can I make these columns dynamic, so the join condition does not need to be hard coded and columns can be added to or removed from it?
I would just generate a join condition object first based on your dict and then use it in the join.
from functools import reduce

# AND together one equality condition per join column
join_condition = reduce(
    lambda a, b: a & b,
    [df1[col] == df2[col] for col in dict[keys]]
)
v = df1.join(
df2,
join_condition,
)
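If the join keys carry the same names in both frames, the join also accepts a plain list of column names, which keeps only one copy of each key column in the result; a sketch, assuming dict[keys] yields those names as in the question:

# Join on the shared column names directly
join_cols = list(dict[keys])
v = df1.join(df2, on=join_cols, how='inner')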