remove and replace spaces in columns for multiple dataframes - python-3.x

I was just wondering if there's a way of replacing blanks with underscore in column names, for multiple data frames, l tried this but didn't work:
df_columns = [df_1, df_2, df_3]
for i in df_columns:
df_columns.replace(' ', '_')
I've also tried
df_columns = {df_1:['iQ Name', 'Cx Name'], df_2:'Cn class'}
for key in df_columns:
key.columns.replace(' ', '_')
and then I get this error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
Thanks in advance :)

Does
import pandas as pd
df_1 = pd.DataFrame(columns = ['iQ Name','Cx Name'] )
df_2 = pd.DataFrame(columns = ['Cn Class'])
df_columns = df_1.columns.tolist() + df_2.columns.tolist()
df_columns = [item.replace(' ', '_') for item in df_columns]
df_columns
give you the output you are looking for? It would concatenate the column names into one list, remove the spaces and return them as a list.

Related

Concat the columns in Pandas Dataframe with separator

I have a dataframe say df_dt_proc with 35 columns.
I want to add a column to the dataframe df_dt_proc['procedures'] which should have all the columns concatenated except column at index 0 separated by , .
I am able to achieve the result by the following script:
df_dt_proc['procedures'] = np.nan
_len = len(df_dt_proc.columns[1:-1])
for i in range(len(df_dt_proc)):
res = ''
for j in range(_len):
try:
res += df_dt_proc[j][i] + ', '
except:
break
df_dt_proc['procedures'][i] = res
However, there must be a more pythonic way to achieve this.
Use custom lambda function with remove NaN and Nones and converting to strings, for select all columns without first and last use DataFrame.iloc:
f = lambda x: ', '.join(x.dropna().astype(str))
df_dt_proc['procedures'] = df_dt_proc.iloc[:, 1:-1].agg(f, axis=1)
Try this with agg:
df_dt_proc['procedures'] = df_dt_proc[df_dt_proc.columns[1:-1]].astype(str).agg(', '.join, axis=1)

Summarize non-zero values or any values from pandas dataframe with timestamps- From_Time & To_Time

I have a dataframe given below
I want to extract all the non-zero values from each column to put it in a summarize way like this
If any value repeated for period of time then starting time of value should go in 'FROM' column and end time of value should go in 'TO' column with column name in 'BLK-ASB-INV' column and value should go in 'Scount' column. For this I have started to write the code like this
import pandas as pd
df = pd.read_excel("StringFault_Bagewadi_16-01-2020.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
for col in df.columns:
ss=df[col].iloc[df[col].to_numpy().nonzero()[0]]
.......
After that I am unable to think how should I approach to get the desired output. Is there any way to do this in python? Thanks in advance for any help.
Finally I have solved my problem, I have written the code given below works perfectly for me.
import pandas as pd
df = pd.read_excel("StringFault.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
for col in df.columns:
device = []
for i in range(len(df[col])):
if df[col][i] == 0:
None
else:
if i < len(df[col])-1 and df[col][i]==df[col][i+1]:
try:
if df[col].index[i] > device[2]:
continue
except IndexError:
device.append(df[col].name)
device.append(df[col][i])
device.append(df[col].index[i])
continue
else:
if len(device)==3:
device.append(df[col].index[i])
res = res.append({'BLK-ASB-INV':device[0], 'Scount':device[1], 'FROM':device[2], 'TO': device[3]}, ignore_index=True)
device=[]
else:
device.append(df[col].name)
device.append(df[col][i])
if i == 0:
device.append(df[col].index[i])
else:
device.append(df[col].index[i-1])
device.append(df[col].index[i])
res = res.append({'BLK-ASB-INV':device[0], 'Scount':device[1], 'FROM':device[2], 'TO': device[3]}, ignore_index=True)
device=[]
For reference, here is the output datafarme

Python function to loop through columns to replace strings

I'm new to python, and I've found this community to be quite helpful so far. I've found a lot of answers to my other questions, but I can't seem to figure this one out.
I'm trying to write a function to loop through columns and replace '%', '$', and ','. When I import the .csv in through pandas I have about 80/108 columns that are dtype == object that I need to convert to float.
I've found I can write:
df['column_name'] = df['column_name].str.replace('%', '')
and it successfully executes and strips the %.
Unfortunately I have a lot of columns(108) and want to write a function to take care of the problem. I have come up with the below code that will only execute on some of the columns and puts out an odd error:
# get column names
col_names = list(df.columns.values)
# start cleaning data
def clean_data(x):
for i in range(11, 109, 1):
if x[col_names[i]].dtype == object:
x[col_names[i]] = x[col_names[i]].str.replace('%', '')
x[col_names[i]] = x[col_names[i]].str.replace('$', '')
x[col_names[i]] = x[col_names[i]].str.replace(',', '')
AttributeError: 'DataFrame' object has no attribute 'dtype'
Even though the error stops the process, some of the columns are cleaned up. I can't seem to figure out why it's not cleaning up all columns and then returns the 'dtype' error.
I'm running python 3.6.
Welcome to stackoverflow.
If you want to do this for each columns, use the apply function of the dataframe, no need to loop:
df = pd.DataFrame([['1$', '2%'],] * 3, columns=['A', 'B'])
def myreplace(s):
for ch in ['%','$',',']:
s = s.map(lambda x: x.replace(ch, ''))
return s
df = df.apply(myreplace)
print(df)
If you want to do it for some columns, use the map function of the dataserie, no need to loop:
df = pd.DataFrame([['1$', '2%'],] * 3, columns=['A', 'B'])
def myreplace(s):
for ch in ['%','$',',']:
s = s.replace(ch, '')
return s
df['A'] = df['A'].map(myreplace)

Erratic behaviour when removing whitespace from the end of a string while importing excel table

I am importing an excel file with whitespaces at the end of most cell content which need removing. The following script works with sample data:
import pandas as pd
def strip(text):
try:
return text.strip()
except AttributeError:
return text
def num_strip(text):
try:
return text.split(" ",1)[0]
except AttributeError:
return text
def parse_excel_sheet(input_file, sheet):
df = pd.read_excel(
input_file,
sheetname= sheet,
parse_cols = 'A,B,C',
names=['ID', 'name_ITA', 'name_ENG'],
converters = {
'ID' : num_strip,
'name1' : strip,
'name2' : strip,
}
)
return df
file = 'http://www.camminiepercorsi.com/wp-content/uploads/excel_test/excel_test.xlsx'
df = parse_excel_sheet(file,'1')
print(df)
however when trying the script on a larger file, parsing the first column 'ID' does not remove whitespaces.
file = 'http://www.camminiepercorsi.com/wp-content/uploads/excel_test/DRS_IL_startingpoint.xlsx'
df = parse_excel_sheet(file,'test')
print(df)
I just run your code and found that whitespaces were correctly removed from column 'ID' in larger file:
for i, el in enumerate(df['ID'].values):
# print(i)
if " " in el:
print(el)
returns no element from 'ID' column: there's no whitespace in these 28 elements.
How did you checked that this was not the case?

pandas dataframe output need to be a string instead of a list

I have a requirement that the result value should be a string. But when I calculate the maximum value of dataframe it gives the result as a list.
import pandas as pd
def answer_one():
df_copy = [df['# Summer'].idxmax()]
return (df_copy)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
if col[:2]=='01':
df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
if col[:2]=='02':
df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
if col[:2]=='03':
df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
if col[:1]=='№':
df.rename(columns={col:'#'+col[1:]}, inplace=True)
names_ids = df.index.str.split('\s\(')
df.index = names_ids.str[0] # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)
df = df.drop('Totals')
df.head()
answer_one()
But here the answer_one() will give me a List as an output and not a string. Can someone help me know how this came be converted to a string or how can I get the answer directly from dataframe as a string. I don't want to convert the list to a string using str(df_copy).
Your first solution would be as #juanpa.arrivillaga put it: To not wrap it. Your function becomes:
def answer_one():
df_copy = df['# Summer'].idxmax()
return (df_copy)
>>> 1
Another thing that you might not be expecting but idxmax() will return the index of the max, perhaps you want to do:
def answer_one():
df_copy = df['# Summer'].max()
return (df_copy)
>>> 30
Since you don't want to do str(df_copy) you can do df_copy.astype(str) instead.
Here is how I would write your function:
def get_max_as_string(data, column_name):
""" Return Max Value from a column as a string."""
return data[column_name].max().astype(str)
get_max_as_string(df, '# Summer')
>>> '30'

Resources