How to pass a list to the PySpark function withColumn - python-3.x

I am applying ltrim and rtrim to multiple columns of a dataframe, but at the moment I can only do it column by column, like:
selected_colums = selected_colums.withColumn("last_name", ltrim(selected_colums.last_name))
selected_colums = selected_colums.withColumn("last_name", rtrim(selected_colums.last_name))
selected_colums = selected_colums.withColumn("email", ltrim(selected_colums.email))
selected_colums = selected_colums.withColumn("email", rtrim(selected_colums.email))
selected_colums = selected_colums.withColumn("phone_number", ltrim(selected_colums.phone_number))
selected_colums = selected_colums.withColumn("phone_number", rtrim(selected_colums.phone_number))
But I want to do it in a loop, like below:
sdk = ['first_name','last_name','email','phone_number','email_alt','phone_number_alt']
for x in sdk:
    selected_colums = selected_colums.withColumn(x, ltrim(selected_colums.last_name))
It's giving me a syntax error.
Please help me optimize this code so that I can apply ltrim or rtrim to any number of columns just by passing a list.

Check the code below.
Import the required functions (ltrim and rtrim are needed as well as col):
>>> from pyspark.sql.functions import col, ltrim, rtrim
Apply ltrim and rtrim to all columns:
>>> columnExprs = map(lambda c: rtrim(ltrim(col(c))).alias(c), df.columns)
Apply columnExprs in select:
>>> df.select(*columnExprs).show()
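If you only want to trim the columns in your sdk list rather than every column, here is a minimal sketch of the same idea (assuming selected_colums is the dataframe from the question):
from pyspark.sql.functions import col, ltrim, rtrim

sdk = ['first_name', 'last_name', 'email', 'phone_number', 'email_alt', 'phone_number_alt']

# Trim only the listed columns; pass every other column through unchanged.
exprs = [rtrim(ltrim(col(c))).alias(c) if c in sdk else col(c)
         for c in selected_colums.columns]
selected_colums = selected_colums.select(*exprs)
The loop from the question also works once each iteration references the loop variable instead of last_name, e.g. selected_colums = selected_colums.withColumn(x, rtrim(ltrim(col(x)))).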

Related

Concatenate the columns in a Pandas DataFrame with a separator

I have a dataframe, say df_dt_proc, with 35 columns.
I want to add a column df_dt_proc['procedures'] that holds all the columns concatenated, except the column at index 0, separated by ', '.
I am able to achieve the result with the following script:
df_dt_proc['procedures'] = np.nan
_len = len(df_dt_proc.columns[1:-1])
for i in range(len(df_dt_proc)):
    res = ''
    for j in range(_len):
        try:
            res += df_dt_proc[j][i] + ', '
        except:
            break
    df_dt_proc['procedures'][i] = res
However, there must be a more pythonic way to achieve this.
Use a custom lambda function that removes NaN and None values and converts the rest to strings; to select all columns except the first and last, use DataFrame.iloc:
f = lambda x: ', '.join(x.dropna().astype(str))
df_dt_proc['procedures'] = df_dt_proc.iloc[:, 1:-1].agg(f, axis=1)
Try this with agg:
df_dt_proc['procedures'] = df_dt_proc[df_dt_proc.columns[1:-1]].astype(str).agg(', '.join, axis=1)
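For illustration, a minimal sketch with a made-up frame (the column names here are hypothetical) showing what the iloc-based approach produces:
import pandas as pd
import numpy as np

# Hypothetical data: an id column first, a trailing column last,
# and the columns to concatenate in between.
df_dt_proc = pd.DataFrame({
    'id': [1, 2],
    'proc_a': ['scan', 'x-ray'],
    'proc_b': ['biopsy', np.nan],
    'last_col': ['x', 'y'],
})

f = lambda x: ', '.join(x.dropna().astype(str))
df_dt_proc['procedures'] = df_dt_proc.iloc[:, 1:-1].agg(f, axis=1)
print(df_dt_proc['procedures'].tolist())  # ['scan, biopsy', 'x-ray']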

Can we copy one column from Excel and convert it to a list in Python?

I use
df = pd.read_clipboard()
list_name = df['column_name'].to_list()
but this is a bit of a long method for me. I want to copy a column, convert it in Python, and then apply some function so that the copied text is converted to a list.
This will read an Excel column as a list:
import xlrd

# Note: xlrd 2.x dropped .xlsx support, so this needs xlrd 1.2 (or an .xls file).
book = xlrd.open_workbook('Myfile.xlsx')  # path to your file
sheet = book.sheet_by_name("Sheet1")      # sheet name

def Readlist(Element, Column):
    # Append every cell of the given column (skipping the header row) as a string.
    for _ in range(1, sheet.nrows):
        Element.append(str(sheet.row_values(_)[Column]))

column1 = []            # list name
Readlist(column1, 1)    # column number is 1 here
print(column1)
Read any specified column as a list with the Readlist function; initialize the [] variable before calling it.
Using Pandas:
import pandas as pd
df = pd.read_excel("path.xlsx", index_col=None, na_values=['NA'], usecols="A")
mylist = df.iloc[:, 0].tolist()  # df[0] would fail here, since the column keeps its header name
print(mylist)
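If you specifically want the copy-from-Excel workflow from the question, a shorter sketch using read_clipboard (assuming the copied column has no header row) is:
import pandas as pd

# Copy a column in Excel first, then run this. header=None treats
# every copied cell as data, and 0 is the only resulting column.
mylist = pd.read_clipboard(header=None)[0].tolist()
print(mylist)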

How do I add a dynamic list of variables to the command pd.concat

I am using Python 3 and pandas to create a script that will:
Be dynamic across different dataset lengths (rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is that I am unable to formulate a way to create a list of the variables in use and pass them as args to the command pd.concat.
The sample dataset may have more or fewer unique BrandFlavors, which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')

excel_file = 'testfile.xlsx'
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)

colarr = df.columns.values
arr = df[colarr[0]].unique()

for i in range(len(arr)):
    globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]

for i in range(len(arr)):
    if globals()['var%s' % i].empty:
        ''
    else:
        globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor': 'Total',
            'This': globals()['var%s' % i]['This'].sum(),
            'Last': globals()['var%s' % i]['Last'].sum(),
            'Diff': globals()['var%s' % i]['Diff'].sum(),
            '% Chg': globals()['var%s' % i]['Diff'].sum() / globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
        globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
        globals()['var%s' % i].fillna(' ', inplace=True)
I have tried the below, however the list ends up being a list of strings:
vararr = []
count = 0
for x in range(len(arr)):
    vararr.append('var' + str(count))
    count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tried to build a class with an arg defined, but had the same issue.
The desired outcome would be a code snippet that generates a list of variables matching the ones created by lines 9/10, which could then be referenced with pd.concat([list, of, vars, here]). It must be dynamic. Thank you.
Just to fix the issue at hand: you shouldn't use globals to create variables; that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')

excel_file = 'testfile.xlsx'
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)

def good_dfs(dataframe):
    if dataframe.empty:
        return dataframe
    this = dataframe.This.sum()
    last = dataframe.Last.sum()
    diff = dataframe.Diff.sum()
    data = {
        'BrandFlavor': 'Total',
        'This': this,
        'Last': last,
        'Diff': diff,
        'Pct Change': diff / last * 100
    }
    # append returns a new frame, so the result must be assigned back
    dataframe = dataframe.append(data, ignore_index=True)
    dataframe['Pct Change'].fillna(0.0, inplace=True)
    dataframe.fillna(' ', inplace=True)
    return dataframe

colarr = df.columns.values
arr = df[colarr[0]].unique()

dfs = []
for i in range(len(arr)):
    temp = df.loc[df[colarr[0]] == arr[i]]
    dfs.append(temp)

final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
Although I will say, there are far easier ways to accomplish what you want without doing all of this; that could be a separate question, but see the sketch below.
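For what it's worth, a minimal sketch of one such simpler route (assuming the same BrandFlavor, This, Last and Diff columns) could look like this:
import pandas as pd

df = pd.read_excel('testfile.xlsx')

parts = []
# One pass per flavor: keep the group's rows, then append a Total row.
for flavor, group in df.groupby('BrandFlavor', sort=False):
    total = {
        'BrandFlavor': 'Total',
        'This': group['This'].sum(),
        'Last': group['Last'].sum(),
        'Diff': group['Diff'].sum(),
    }
    total['% Chg'] = total['Diff'] / total['Last'] * 100 if total['Last'] else 0
    parts.append(pd.concat([group, pd.DataFrame([total])], ignore_index=True))

final_df = pd.concat(parts, ignore_index=True)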

Python3 - using pandas to group rows where two columns contain values in forward or reverse order: v1,v2 or v2,v1

I'm fairly new to Python and pandas, but I've written code that reads an Excel workbook and groups rows based on the values contained in two columns.
So where Col_1=A and Col_2=B, or Col_1=B and Col_2=A, both rows would be assigned GroupID=1.
(Image: sample spreadsheet data, with rows color-coded for ease of visibility.)
I've managed to get this working, but I wanted to know if there's a simpler/more efficient/cleaner/less clunky way to do it.
import pandas as pd
df = pd.read_excel('test.xlsx')

# get column values into a list
col_group = df.groupby(['Header_2', 'Header_3'])
original_list = list(col_group.groups)

# parse the list to remove 'reverse-duplicates'
new_list = []
for a, b in original_list:
    if (b, a) not in new_list:
        new_list.append((a, b))

# iterate through each row in the DataFrame and check whether
# the values in new_list exist, in forward or reverse order
for index, row in df.iterrows():
    for a, b in new_list:
        # if the values exist in the forward direction
        if (a in df.loc[index, "Header_2"]) and (b in df.loc[index, "Header_3"]):
            # the GroupID value is the pair's index in new_list
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1
        # else check whether the values exist in the reverse direction
        if (b in df.loc[index, "Header_2"]) and (a in df.loc[index, "Header_3"]):
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1

# finally write the DataFrame to a new spreadsheet
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
I know of the pandas.groupby([columnA, columnB]) option, but I couldn't figure out a way to create groups that contained both (v1, v2) and (v2, v1).
A boolean mask should do the trick:
import pandas as pd
df = pd.read_excel('test.xlsx')
mask = ((df['Header_2'] == 'A') & (df['Header_3'] == 'B') |
        (df['Header_2'] == 'B') & (df['Header_3'] == 'A'))
# Label each row in the original DataFrame with
# 1 if it matches the specified criteria, and
# 0 if it does not.
# This column can now be used in groupby operations.
df.loc[:, 'match_flag'] = mask.astype(int)
# Get rows that match the criteria
df[mask]
# Get rows that do not match the criteria
df[~mask]
EDIT: updated answer to address the groupby requirement.
I would do something like this.
import pandas as pd
df = pd.read_excel('test.xlsx')
# make the ordering consistent
df["group1"] = df[["Header_2", "Header_3"]].max(axis=1)
df["group2"] = df[["Header_2", "Header_3"]].min(axis=1)
# group them together
df = df.sort_values(by=["group1", "group2"])
If you need to deal with more than two columns, I can write up a more general way to do this.
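If you also need the GroupID column from the question, a small sketch building on the normalized group1/group2 columns above:
# Each unique unordered pair gets one id (ngroup numbers the groups);
# +1 makes the ids start at 1 as in the question.
df['GroupID'] = df.groupby(['group1', 'group2'], sort=False).ngroup() + 1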

Python3, with pandas.dataframe, how to select certain data by some rules to show

I have a pandas DataFrame, and I want to select certain data from it by some rules.
The following code generates the dataframe:
import datetime
import pandas as pd
import numpy as np

today = datetime.date.today()
dates = list()
for k in range(10):
    a_day = today - datetime.timedelta(days=k)
    dates.append(np.datetime64(a_day))

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10, 3)),
                  columns=('other1', 'actual', 'other2'),
                  index=['{}'.format(i) for i in range(10)])
df.insert(0, 'dates', dates)
df['err_m'] = np.random.rand(10, 1) * 0.1
df['std'] = np.random.rand(10, 1) * 0.05
df['gain'] = np.random.rand(10, 1)
Now, I want to select by the following rules:
1. compute the sum of 'err_m' and 'std', then sort df so that the sum is descending
2. from the result of step 1, select the rows where 'actual' is > 50
Thanks
Create a new column, then sort by it:
df['errsum'] = df['err_m'] + df['std']
# Return a sorted dataframe
df_sorted = df.sort_values('errsum', ascending=False)
Select the lines you want:
# Create a boolean Series that is True where the condition is met
selector = df_sorted['actual'] > 50
# Return a view of the sorted dataframe with only the lines you want
df_sorted[selector]
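The two steps can also be chained in a single expression; a minimal sketch, assuming the df built above:
# assign adds the helper column, sort_values orders by it descending,
# and query keeps only the rows with actual > 50.
result = (df.assign(errsum=df['err_m'] + df['std'])
            .sort_values('errsum', ascending=False)
            .query('actual > 50'))
print(result)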
