Concat the columns in Pandas Dataframe with separator - python-3.x

I have a dataframe say df_dt_proc with 35 columns.
I want to add a column df_dt_proc['procedures'] that contains all of the columns concatenated, except the column at index 0, separated by ', '.
I am able to achieve the result by the following script:
df_dt_proc['procedures'] = np.nan
_len = len(df_dt_proc.columns[1:-1])
for i in range(len(df_dt_proc)):
    res = ''
    for j in range(_len):
        try:
            res += df_dt_proc[j][i] + ', '
        except:
            break
    df_dt_proc['procedures'][i] = res
However, there must be a more pythonic way to achieve this.

Use a custom lambda function that removes NaN and None values and converts the rest to strings; to select all columns except the first and last, use DataFrame.iloc:
f = lambda x: ', '.join(x.dropna().astype(str))
df_dt_proc['procedures'] = df_dt_proc.iloc[:, 1:-1].agg(f, axis=1)

Try this with agg:
df_dt_proc['procedures'] = df_dt_proc[df_dt_proc.columns[1:-1]].astype(str).agg(', '.join, axis=1)
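The practical difference between the two answers is NaN handling: a plain astype(str) turns NaN into the literal string 'nan', while dropping NaN first keeps the join clean. A minimal sketch with made-up column names standing in for df_dt_proc:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for df_dt_proc (names are invented)
df = pd.DataFrame({
    'id': [1, 2],
    'proc_a': ['a1', 'b1'],
    'proc_b': ['a2', np.nan],
    'last': ['x', 'y'],
})

# dropna() skips missing values, so row 2 gets no stray 'nan' or trailing ', '
df['procedures'] = df.iloc[:, 1:-1].agg(
    lambda x: ', '.join(x.dropna().astype(str)), axis=1)
print(df['procedures'].tolist())  # ['a1, a2', 'b1']
```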

Related

Count element in list if it is present in each row of a column. Add to a new column (pandas)

I have a pandas df like this:
MEMBERSHIP
[2022_K_, EWREW_NK]
[333_NFK_,2022_K_, EWREW_NK, 000]
And I have a list of keys:
list_k = ["_K_","_NK_","_NKF_","_KF_"]
I want to add and create a column that count if any of that element is in the column. The desired output is:
MEMBERSHIP | COUNT
[2022_K_, EWREW_NK] | 2
[333_NFK_,2022_K_, EWREW_NK, 000] | 3
Can you help me?
IIUC, you can use the pandas .str accessor methods with a regex:
import pandas as pd
df = pd.DataFrame({'MEMBERSHIP': [['2022_K_', 'EWREW_NK'],
                                  ['333_NFK_', '2022_K_', 'EWREW_NK', '000']]})
list_k = ["_K_", "_NK", "_NFK_", "_KF_"]  # I changed this list a little
reg = '|'.join(list_k)
df['count'] = df['MEMBERSHIP'].explode().str.contains(reg).groupby(level=0).sum()
print(df)
Output:
MEMBERSHIP count
0 [2022_K_, EWREW_NK] 2
1 [333_NFK_, 2022_K_, EWREW_NK, 000] 3
You can use apply with a helper function:
def check(x):
    total = 0
    for i in x:
        if type(i) != str:  # if the value is not a string, skip it
            pass
        else:
            for j in list_k:
                if j in i:
                    total += 1
    return total
df['count'] = df['MEMBERSHIP'].apply(check)
I came up with this clunky code:
count_row = 0
df['Count'] = None
for i in df['MEMBERSHIP_SPLIT']:
    count_element = 0
    for sub in i:
        for e in list_k:
            if e in sub:
                count_element += 1
    df['Count'][count_row] = count_element
    count_row += 1
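A plain-Python alternative with the same semantics as the regex answer (count the list elements that contain at least one key), using the same adjusted key list from that answer:

```python
import pandas as pd

df = pd.DataFrame({'MEMBERSHIP': [['2022_K_', 'EWREW_NK'],
                                  ['333_NFK_', '2022_K_', 'EWREW_NK', '000']]})
list_k = ["_K_", "_NK", "_NFK_", "_KF_"]  # same adjusted keys as above

# any() makes each element count at most once, even if it matches several keys
df['COUNT'] = df['MEMBERSHIP'].apply(
    lambda items: sum(any(k in s for k in list_k) for s in items))
print(df['COUNT'].tolist())  # [2, 3]
```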

remove and replace spaces in columns for multiple dataframes

I was just wondering if there's a way of replacing blanks with underscores in column names for multiple data frames. I tried this, but it didn't work:
df_columns = [df_1, df_2, df_3]
for i in df_columns:
    df_columns.replace(' ', '_')
I've also tried
df_columns = {df_1: ['iQ Name', 'Cx Name'], df_2: 'Cn class'}
for key in df_columns:
    key.columns.replace(' ', '_')
and then I get this error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
Thanks in advance :)
Does
import pandas as pd
df_1 = pd.DataFrame(columns=['iQ Name', 'Cx Name'])
df_2 = pd.DataFrame(columns=['Cn Class'])
df_columns = df_1.columns.tolist() + df_2.columns.tolist()
df_columns = [item.replace(' ', '_') for item in df_columns]
df_columns
give you the output you are looking for? It would concatenate the column names into one list, remove the spaces and return them as a list.
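If the goal is to rename the columns on each dataframe itself rather than collect the names into one list, one approach is to assign the result back to .columns inside the loop (the loop in the question computed a replacement but discarded it). A minimal sketch:

```python
import pandas as pd

df_1 = pd.DataFrame(columns=['iQ Name', 'Cx Name'])
df_2 = pd.DataFrame(columns=['Cn Class'])

# .str.replace on the column Index returns a new Index; assign it back
for frame in (df_1, df_2):
    frame.columns = frame.columns.str.replace(' ', '_')

print(df_1.columns.tolist())  # ['iQ_Name', 'Cx_Name']
print(df_2.columns.tolist())  # ['Cn_Class']
```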

How do I pass a dynamic list of variables to the command pd.concat

I am using python3 and pandas to create a script that will:
Be dynamic across different dataset lengths(rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is that I am unable to find a way to build a list of the variables in use and pass them as arguments to pd.concat.
The dataset may have more or fewer unique BrandFlavors, which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = 'testfile.xlsx'
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
colarr = df.columns.values
arr = df[colarr[0]].unique()
for i in range(len(arr)):
    globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]
for i in range(len(arr)):
    if globals()['var%s' % i].empty:
        ''
    else:
        globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor': 'Total',
            'This': globals()['var%s' % i]['This'].sum(),
            'Last': globals()['var%s' % i]['Last'].sum(),
            'Diff': globals()['var%s' % i]['Diff'].sum(),
            '% Chg': globals()['var%s' % i]['Diff'].sum() / globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
        globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
        globals()['var%s' % i].fillna(' ', inplace=True)
I have tried the following; however, it only produces a list of strings:
vararr = []
count = 0
for x in range(len(arr)):
    vararr.append('var' + str(count))
    count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tried to build a class with an arg defined, but had the same issue.
The desired outcome would be a code snippet that generates a list of variables matching the ones created by the two loops above, which could then be referenced as pd.concat([list, of, vars, here]). It must be dynamic. Thank you
Just fixing the issue at hand, you shouldn't use globals to make variables, that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = 'testfile.xlsx'
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
def good_dfs(dataframe):
    if not dataframe.empty:
        this = dataframe.This.sum()
        last = dataframe.Last.sum()
        diff = dataframe.Diff.sum()
        data = {
            'BrandFlavor': 'Total',
            'This': this,
            'Last': last,
            'Diff': diff,
            'Pct Change': diff / last * 100
        }
        # append returns a new frame, so assign the result back
        dataframe = dataframe.append(data, ignore_index=True)
        dataframe['Pct Change'].fillna(0.0, inplace=True)
        dataframe.fillna(' ', inplace=True)
    return dataframe
colarr = df.columns.values
arr = df[colarr[0]].unique()
dfs = []
for i in range(len(arr)):
    temp = df.loc[df[colarr[0]] == arr[i]]
    dfs.append(temp)
final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
Although I will say, there are far easier ways to accomplish what you want without doing all of this, however that can be a separate question.
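One of those easier ways can be sketched with groupby: compute the per-group total rows directly and concatenate them back, assuming the same BrandFlavor/This/Last/Diff columns as above (the data here is invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet used in the question
df = pd.DataFrame({
    'BrandFlavor': ['A', 'A', 'B'],
    'This': [10, 20, 5],
    'Last': [8, 16, 10],
    'Diff': [2, 4, -5],
})

pieces = []
for flavor, group in df.groupby('BrandFlavor', sort=False):
    # Sum the numeric columns, then label the row and add the pct change
    total = group[['This', 'Last', 'Diff']].sum()
    total['BrandFlavor'] = 'Total'
    total['% Chg'] = total['Diff'] / total['Last'] * 100
    pieces.append(pd.concat([group, total.to_frame().T], ignore_index=True))

final_df = pd.concat(pieces, ignore_index=True)
print(final_df)
```

No globals() are needed: each group is handled as an ordinary local dataframe, and the list of pieces scales automatically with the number of unique BrandFlavors.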

Python function to loop through columns to replace strings

I'm new to python, and I've found this community to be quite helpful so far. I've found a lot of answers to my other questions, but I can't seem to figure this one out.
I'm trying to write a function to loop through columns and replace '%', '$', and ','. When I import the .csv in through pandas I have about 80/108 columns that are dtype == object that I need to convert to float.
I've found I can write:
df['column_name'] = df['column_name'].str.replace('%', '')
and it successfully executes and strips the %.
Unfortunately I have a lot of columns(108) and want to write a function to take care of the problem. I have come up with the below code that will only execute on some of the columns and puts out an odd error:
# get column names
col_names = list(df.columns.values)
# start cleaning data
def clean_data(x):
    for i in range(11, 109, 1):
        if x[col_names[i]].dtype == object:
            x[col_names[i]] = x[col_names[i]].str.replace('%', '')
            x[col_names[i]] = x[col_names[i]].str.replace('$', '')
            x[col_names[i]] = x[col_names[i]].str.replace(',', '')
AttributeError: 'DataFrame' object has no attribute 'dtype'
Even though the error stops the process, some of the columns are cleaned up. I can't seem to figure out why it's not cleaning up all columns and then returns the 'dtype' error.
I'm running python 3.6.
Welcome to stackoverflow.
If you want to do this for every column, use the apply function of the dataframe; no need to loop:
df = pd.DataFrame([['1$', '2%'],] * 3, columns=['A', 'B'])
def myreplace(s):
    for ch in ['%', '$', ',']:
        s = s.map(lambda x: x.replace(ch, ''))
    return s
df = df.apply(myreplace)
print(df)
If you want to do it only for some columns, use the map function of the series; no need to loop:
df = pd.DataFrame([['1$', '2%'],] * 3, columns=['A', 'B'])
def myreplace(s):
    for ch in ['%', '$', ',']:
        s = s.replace(ch, '')
    return s
df['A'] = df['A'].map(myreplace)
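If the real goal is to strip those characters from every object column at once and convert to float, select_dtypes plus a single vectorized replace avoids looping by column name entirely (and sidesteps the dtype error above, which typically means x[col] returned a DataFrame rather than a Series, e.g. because of a duplicated column name). A sketch with made-up data; note that $ must be escaped, since replace(..., regex=True) treats the keys as regular expressions:

```python
import pandas as pd

# Invented sample standing in for the 108-column CSV
df = pd.DataFrame({'pct': ['5%', '10%'], 'amt': ['$1,200', '$300'], 'n': [1, 2]})

# Strip %, $ and , from every object column in one pass, then convert to float
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = (df[obj_cols]
                .replace({'%': '', r'\$': '', ',': ''}, regex=True)
                .astype(float))
print(df)
```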

pandas dataframe output need to be a string instead of a list

I have a requirement that the result value should be a string. But when I calculate the maximum value of the dataframe, I get the result as a list.
import pandas as pd
def answer_one():
    df_copy = [df['# Summer'].idxmax()]
    return (df_copy)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
names_ids = df.index.str.split(r'\s\(')
df.index = names_ids.str[0]  # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3]  # the [1] element is the abbreviation or ID (take the first 3 characters)
df = df.drop('Totals')
df.head()
answer_one()
But here answer_one() gives me a list as an output, not a string. Can someone help me see how this can be converted to a string, or how I can get the answer directly from the dataframe as a string? I don't want to convert the list to a string using str(df_copy).
Your first solution would be, as @juanpa.arrivillaga put it, to not wrap the value in a list. Your function becomes:
def answer_one():
    df_copy = df['# Summer'].idxmax()
    return (df_copy)
>>> 1
Another thing you might not be expecting: idxmax() returns the index of the max. Perhaps you want to do:
def answer_one():
    df_copy = df['# Summer'].max()
    return (df_copy)
>>> 30
Since you don't want to do str(df_copy) you can do df_copy.astype(str) instead.
Here is how I would write your function:
def get_max_as_string(data, column_name):
    """Return the max value from a column as a string."""
    return data[column_name].max().astype(str)
get_max_as_string(df, '# Summer')
>>> '30'