How to Sum Many Columns in PySpark Dataframe [duplicate] - apache-spark

This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 4 years ago.
I have a data frame with int values and I'd like to sum every column individually and then test if that column's sum is above 5. If the column's sum is above 5 then I'd like to add it to feature_cols. The answers I've found online only work for pandas and not PySpark. (I'm using Databricks)
Here is what I have so far:
working_cols = df.columns
for x in range(0, len(working_cols)):
    if df.agg(sum(working_cols[x])) > 5:
        feature_cols.append(working_cols[x])
The current output for this is that feature_cols has every column, even though some have a sum less than 5.
Out[166]:
['Column_1',
'Column_2',
'Column_3',
'Column_4',
'Column_5',
'Column_6',
'Column_7',
'Column_8',
'Column_9',
'Column_10']

I am not an expert in Python, but in your loop you are comparing a DataFrame[sum(a): bigint] with 5, and for some reason that comparison is True.
df.agg(sum(working_cols[x])).collect()[0][0] should give you what you want. It collects the result to the driver, selects the first (and only) row, and then the first (and only) column.
Note that your approach is not optimal in terms of performance. You could compute all the sums in a single pass over the dataframe, like this:
from pyspark.sql import functions as F
sums = [F.sum(x).alias(str(x)) for x in df.columns]
d = df.select(sums).collect()[0].asDict()
With this code you get a dictionary that associates each column name with its sum, and you can apply whatever logic you need to it.
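For example, a minimal sketch of the original goal (assuming d was built as above and every column is numeric) just filters that dictionary:
# Keep only the columns whose total exceeds the threshold; the sum of an
# all-null column comes back as None, so guard against that
feature_cols = [name for name, total in d.items() if total is not None and total > 5]
print(feature_cols)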

Related

split a column based on a delimiter and then unpivot the result with preserving other columns

I need to split a column into multiple rows and then unpivot them while preserving one or more other columns. How can I achieve this in Python 3?
See the example below:
import numpy as np
import pandas as pd

data = np.array(['a0', 'a1,a2', 'a2,a3'])
pk = np.array([1, 2, 3])
df = pd.DataFrame({'data': data, 'PK': pk})
df
df['data'].apply(lambda x: pd.Series(str(x).split(","))).stack()
What I need is:
data pk
a0 1
a1 2
a2 2
a2 3
a3 3
Is there any way to achieve this without merge and resetting indexes as mentioned here?
Convert column data into list and explode the data frame
Data
data=np.array(['a0','a1,a2','a2,a3'])
pk=np.array([1,2,3])
df=pd.DataFrame({'data':data,'PK':pk})
df=spark.createDataFrame(df)
Solution
from pyspark.sql import functions as F
df.withColumn('data', F.explode(F.split(F.col('data'), ','))).show()
Explode is the keyword to search for (thanks to wwnde for pointing it out), and this can be done easily in Python using existing libraries.
The first step is converting the delimited column into a list:
df=df.assign(Data=df.data.str.split(","))
and then explode
df.explode('Data')
If you are reading from Excel, pandas may detect a list of numbers as int, and you may need to do the explode more than once; the same pattern applies in that case.
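Put together, a self-contained sketch of this split-then-explode approach in pandas (explode needs pandas >= 0.25), using the example data from the question; reset_index is only there to produce a clean 0..n index:
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({'data': np.array(['a0', 'a1,a2', 'a2,a3']),
                   'PK': np.array([1, 2, 3])})

# Split the delimited strings into lists, then emit one row per list element
out = (df.assign(data=df['data'].str.split(','))
         .explode('data')
         .reset_index(drop=True))
print(out)
#   data  PK
# 0   a0   1
# 1   a1   2
# 2   a2   2
# 3   a2   3
# 4   a3   3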

Can I use pandas.DataFrame.apply() to make more than one column at once?

I have been given a function that takes the values in a row in a dataframe and returns a bunch of additional values.
It looks a bit like this:
my_func(row) -> (value1, value2, value3... valueN)
I'd like each of these values to be assigned to new columns in my dataframe. Can I use DataFrame.apply() to add multiple columns in one go, or do I have to add columns one at a time?
It's obvious how I can use apply to generate one column at a time:
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"] = df.apply(axis=1, func=lambda row:(row.A + row.B))
df["Y"] = df.apply(axis=1, func=lambda row:(row.A - row.B))
But what if the two columns I am adding are something that are more easily calculated together? In this case, I already have a function that gives me everything I need in one shot. I'd rather not have to call it multiple times or add a load of caching.
Is there a syntax I can use that would allow me to use apply to generate 2 columns at the same time? Something like this, but less broken:
# Broken Pseudo-code
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"], df["Y"] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A))
What is the correct way to do something like this?
You can assign a list of column names; with result_type='expand', apply turns each returned tuple into separate columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(2, 2)), columns=list('AB'))
df[["X", "Y"]] = df.apply(axis=1, func=lambda row: (row.A + row.B, row.B - row.A), result_type='expand')
print(df)
   A  B   X  Y
0  2  8  10  7
1  4  3   6 -1
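If the values really come from one function, a short sketch like this (my_func here is just a hypothetical stand-in for the asker's function) expands its tuple output into several named columns in a single pass:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's my_func: returns several derived values per row
def my_func(row):
    return row.A + row.B, row.A - row.B, row.A * row.B

df = pd.DataFrame(np.random.randint(10, size=(4, 2)), columns=list('AB'))

# result_type='expand' turns each returned tuple into separate columns,
# so all derived columns are computed in one pass over the rows
df[['X', 'Y', 'Z']] = df.apply(my_func, axis=1, result_type='expand')
print(df)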

Replace values in dataframe given count of group by

I have a dataframe of categorical variables. I want to replace all the fields in one column with an arbitrary unique string if the count of that category within the column is less than 100.
So, for example, in the column color, if any color appears fewer than 100 times I want it to be replaced by the string "base".
I tried the code below, along with different things I found on Stack Overflow.
df['color'] = numpy.where(df.groupby("color").filter(lambda x: len(x) < 100), 'dummy', df['color'])
Operands could not be broadcast together with shapes (45638872,878) () (8765878782788,)
IIUC, you need this,
df.loc[df.groupby('color')['color'].transform('count')<100, 'color']= 'dummy'
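For illustration, here is a small self-contained sketch of that one-liner with made-up colors and a threshold of 2 instead of 100, just to keep the toy data short:
import pandas as pd

df = pd.DataFrame({'color': ['red', 'red', 'red', 'blue', 'blue', 'green']})

# transform('count') broadcasts each group's size back onto every row,
# so the boolean mask lines up with the original index
mask = df.groupby('color')['color'].transform('count') < 2
df.loc[mask, 'color'] = 'dummy'
print(df['color'].tolist())  # ['red', 'red', 'red', 'blue', 'blue', 'dummy']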

Pandas get second minimum value from datetime column [duplicate]

This question already has answers here:
How to extract the n-th maximum/minimum value in a column of a DataFrame in pandas?
(3 answers)
Closed 3 years ago.
I have a data frame with a DateTime column. I can get the minimum value by using
df['Date'].min()
How can I get the second, third, ... smallest values?
Use nlargest or nsmallest
For second largest,
series.nlargest(2).iloc[-1]
Make sure your dates are in datetime first:
df['Sampled_Date'] = pd.to_datetime(df['Sampled_Date'])
Then drop the duplicates, take the nsmallest(2), and take the last value of that:
df['Sampled_Date'].drop_duplicates().nsmallest(2).iloc[-1]
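As a quick illustration with made-up dates and a duplicated minimum (exactly the case where drop_duplicates matters):
import pandas as pd

df = pd.DataFrame({'Sampled_Date': ['2021-01-01', '2021-01-01', '2021-02-15', '2021-03-30']})
df['Sampled_Date'] = pd.to_datetime(df['Sampled_Date'])

# Second-smallest distinct date: drop duplicates first, then take the last of the two smallest
second_min = df['Sampled_Date'].drop_duplicates().nsmallest(2).iloc[-1]
print(second_min)  # 2021-02-15 00:00:00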

python select subset where df column value contains one of the values in an array [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I have created a simple dataframe in Python with these columns:
Columns: [index, bulletintype, category, companyname, date, url]
I have a simple array with company names:
companies= [x,y,x]
I would like to create a subset of the dataframe where the column 'companyname' matches one or more of the names in the companies array.
subset = df[df['companyname'].isin(companies)]
This works pretty great, but .isin makes an exact match and my sources don't use the same names. So I'm looking for an alternative angle and would like to compare on parts of the name. I'm familiar with .str.contains('part of the name'), but I can't use this function in conjunction with an array. Can somebody help me achieve something like this (but with working code :-)
subset = df[df['companyname'].contains(companies)]
Try creating a regex pattern by joining your companies list with the regex OR character |, then use Series.str.contains as a boolean mask:
companies = ['x', 'y', 'z']
pat = '|'.join(companies)
df[df['companyname'].str.contains(pat)]
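A self-contained sketch of that idea, with made-up company names and re.escape added in case a name contains regex metacharacters:
import re
import pandas as pd

# Toy frame standing in for the question's data; only companyname matters here
df = pd.DataFrame({'companyname': ['Acme Ltd', 'Beta Corp', 'Gamma Inc'],
                   'category': ['news', 'filing', 'news']})

# Partial names to match, joined into one alternation pattern
companies = ['Acme', 'Gamma']
pat = '|'.join(re.escape(c) for c in companies)

# case=False for case-insensitive matching, na=False so missing names don't match
subset = df[df['companyname'].str.contains(pat, case=False, na=False)]
print(subset['companyname'].tolist())  # ['Acme Ltd', 'Gamma Inc']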
