Python function to loop through columns to replace strings - python-3.x

I'm new to python, and I've found this community to be quite helpful so far. I've found a lot of answers to my other questions, but I can't seem to figure this one out.
I'm trying to write a function to loop through columns and replace '%', '$', and ','. When I import the .csv in through pandas I have about 80/108 columns that are dtype == object that I need to convert to float.
I've found I can write:
df['column_name'] = df['column_name].str.replace('%', '')
and it successfully executes and strips the %.
Unfortunately I have a lot of columns(108) and want to write a function to take care of the problem. I have come up with the below code that will only execute on some of the columns and puts out an odd error:
# get column names
col_names = list(df.columns.values)
# start cleaning data
def clean_data(x):
for i in range(11, 109, 1):
if x[col_names[i]].dtype == object:
x[col_names[i]] = x[col_names[i]].str.replace('%', '')
x[col_names[i]] = x[col_names[i]].str.replace('$', '')
x[col_names[i]] = x[col_names[i]].str.replace(',', '')
AttributeError: 'DataFrame' object has no attribute 'dtype'
Even though the error stops the process, some of the columns are cleaned up. I can't seem to figure out why it's not cleaning up all columns and then returns the 'dtype' error.
I'm running python 3.6.

Welcome to stackoverflow.
If you want to do this for each columns, use the apply function of the dataframe, no need to loop:
df = pd.DataFrame([['1$', '2%'],] * 3, columns=['A', 'B'])
def myreplace(s):
for ch in ['%','$',',']:
s = s.map(lambda x: x.replace(ch, ''))
return s
df = df.apply(myreplace)
print(df)
If you want to do it for some columns, use the map function of the dataserie, no need to loop:
df = pd.DataFrame([['1$', '2%'],] * 3, columns=['A', 'B'])
def myreplace(s):
for ch in ['%','$',',']:
s = s.replace(ch, '')
return s
df['A'] = df['A'].map(myreplace)

Related

remove and replace spaces in columns for multiple dataframes

I was just wondering if there's a way of replacing blanks with underscore in column names, for multiple data frames, l tried this but didn't work:
df_columns = [df_1, df_2, df_3]
for i in df_columns:
df_columns.replace(' ', '_')
I've also tried
df_columns = {df_1:['iQ Name', 'Cx Name'], df_2:'Cn class'}
for key in df_columns:
key.columns.replace(' ', '_')
and then I get this error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
Thanks in advance :)
Does
import pandas as pd
df_1 = pd.DataFrame(columns = ['iQ Name','Cx Name'] )
df_2 = pd.DataFrame(columns = ['Cn Class'])
df_columns = df_1.columns.tolist() + df_2.columns.tolist()
df_columns = [item.replace(' ', '_') for item in df_columns]
df_columns
give you the output you are looking for? It would concatenate the column names into one list, remove the spaces and return them as a list.

Summarize non-zero values or any values from pandas dataframe with timestamps- From_Time & To_Time

I have a dataframe given below
I want to extract all the non-zero values from each column to put it in a summarize way like this
If any value repeated for period of time then starting time of value should go in 'FROM' column and end time of value should go in 'TO' column with column name in 'BLK-ASB-INV' column and value should go in 'Scount' column. For this I have started to write the code like this
import pandas as pd
df = pd.read_excel("StringFault_Bagewadi_16-01-2020.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
for col in df.columns:
ss=df[col].iloc[df[col].to_numpy().nonzero()[0]]
.......
After that I am unable to think how should I approach to get the desired output. Is there any way to do this in python? Thanks in advance for any help.
Finally I have solved my problem, I have written the code given below works perfectly for me.
import pandas as pd
df = pd.read_excel("StringFault.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
for col in df.columns:
device = []
for i in range(len(df[col])):
if df[col][i] == 0:
None
else:
if i < len(df[col])-1 and df[col][i]==df[col][i+1]:
try:
if df[col].index[i] > device[2]:
continue
except IndexError:
device.append(df[col].name)
device.append(df[col][i])
device.append(df[col].index[i])
continue
else:
if len(device)==3:
device.append(df[col].index[i])
res = res.append({'BLK-ASB-INV':device[0], 'Scount':device[1], 'FROM':device[2], 'TO': device[3]}, ignore_index=True)
device=[]
else:
device.append(df[col].name)
device.append(df[col][i])
if i == 0:
device.append(df[col].index[i])
else:
device.append(df[col].index[i-1])
device.append(df[col].index[i])
res = res.append({'BLK-ASB-INV':device[0], 'Scount':device[1], 'FROM':device[2], 'TO': device[3]}, ignore_index=True)
device=[]
For reference, here is the output datafarme

How do I add a dynamic list of variable to the command pd.concat

I am using python3 and pandas to create a script that will:
Be dynamic across different dataset lengths(rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is I am unable to formulate a way to create a list of the variables in use and apply them as arg in to the command pd.concat.
The sample dataset. The dataset may have more unique BrandFlavors or less which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
colarr = df.columns.values
arr = df[colarr[0]].unique()
for i in range(len(arr)):
globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]
for i in range(len(arr)):
if globals()['var%s' % i].empty:
''
else:
globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor':'Total',
'This':globals()['var%s' % i]['This'].sum(),
'Last':globals()['var%s' % i]['Last'].sum(),
'Diff':globals()['var%s' % i]['Diff'].sum(),
'% Chg':globals()['var%s' % i]['Diff'].sum()/globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
globals()['var%s' % i].fillna(' ', inplace=True)
I have tried this below, however the list is a series of strings
vararr = []
count = 0
for x in range(len(arr)):
vararr.append('var' + str(count))
count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tired to build a class with an arg defined but had the same issue.
The desired outcome would be a code snippet that generated a list of variables that matched the ones created by lines 9/10 and could be referenced by pd.concat([list, of, vars, here]). It must be dynamic. Thank you
Just fixing the issue at hand, you shouldn't use globals to make variables, that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
def good_dfs(dataframe):
if dataframe.empty:
pass
else:
this = dataframe.This.sum()
last = dataframe.Last.sum()
diff = dataframe.Diff.sum()
data = {
'BrandFlavor': 'Total',
'This': this,
'Last': last,
'Diff': diff,
'Pct Change': diff / last * 100
}
dataframe.append(data, ignore_index=True)
dataframe['Pct Change'].fillna(0.0, inplace=True)
dataframe.fillna(' ', inplace=True)
return dataframe
colarr = df.columns.values
arr = df[colarr[0]].unique()
dfs = []
for i in range(len(arr)):
temp = df.loc[df[colarr[0]] == arr[i]]
dfs.append(temp)
final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
Although I will say, there are far easier ways to accomplish what you want without doing all of this, however that can be a separate question.

Get python to add serial nos to each entry as it is run

I am new to programming and probably there is an answer to my question somewhere like here, the closest i found after searching for days. Most of the info deals with existing csvs or hardcoding data. I am trying to make the program create data every time it runs and work on that so a little stumped here.
The Problem:
I can't seem to get python to attach serial nos to each entry when i run the program am making to log my study blocks. It has various fields following are two of them:
Date Time
12-03-2018 11:30
Following is the code snippet:
d= ''
while d == '':
d = input('Date:')
try:
valid_date = dt.strptime(d, '%Y-%m-%d')
except ValueError:
d = ''
print('Please input date in YYYY-MM-DD format.')
t= ''
while t == '':
t = input('Time:')
try:
valid_time = dt.strptime(t, '%H:%M')
except ValueError:
d = ''
print('Please input time in HH:MM format.')
header = csv.DictWriter(outfile, fieldnames= ['UID', 'Date', 'Time', 'Topic', 'Objective', 'Why', 'Summary'], delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL )
header.writeheader()
log_input = csv.writer(outfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
log_input.writerow([d, t, topic, objective, why, summary])
outfile.close()
df = pd.read_csv('E:\Coursera\HSU\python\pom_blocks_log.csv')
df = pd.read_csv('E:\pom_blocks_log.csv')
df = df.reset_index()
df.columns[0] = 'UID'
df['UID'] = df.index
print (df)
I get the following error when i run the program with the df block:
TypeError: Index does not support mutable operations
I new to python and don't really know how to work with data structures, so i am building small programs to learn. Any help is highly appreciated and apologies if this is a duplicate, please point me to the right direction.
So, i figured it out. Following is the process i followed:
I save the CSV file using the csv module.
I load the CSV file in pandas as dataframe.
What this does is, it allows me to append user entries to the CSV every time the program is run and then i can load it as a dataframe and use pandas to manipulate the data accordingly. Then i added a generator to clean the lines off the delimiter character ',' so that it could be loaded as a dataframe for string columns where ',' is accepted as a valid input. Maybe this is a round about approach but, it works.
Following is the code:
import csv
from csv import reader
from datetime import datetime
import pandas as pd
import numpy as np
with open(r'E:\Coursera\HSU\08_programming\trLog_df.csv','a', encoding='utf-8') as csvfile:
# Date
d = ''#input("Date:")
while d == '':
d = input('Date: ')
try:
valid_date = datetime.strptime(d, '%Y-%m-%d')
except ValueError:
d = ''
print("Incorrect data format, should be YYYY-MM-DD")
# Time
t = ''#input("Date:")
while t == '':
t = input('Time: ')
try:
valid_date = datetime.strptime(t, '%H:%M')
except ValueError:
t = ''
print("Incorrect data format, should be HH:MM")
log_input = csv.writer(csvfile, delimiter= ',',
quotechar='|', quoting=csv.QUOTE_MINIMAL)
log_input.writerow([d, t])
# Function to clean lines off the delimter ','
def merge_last(file_name, merge_after_col=7, skip_lines=0):
with open(file_name, 'r') as fp:
for i, line in enumerate(fp):
if i < 2:
continue
spl = line.strip().split(',')
yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:2]))
# Generator to clean the lines
gen = merge_last(r'E:\Coursera\HSU\08_programming\trLog_df.csv', 1)
# get the column names
header = next(gen)
# create the data frame
df = pd.DataFrame(gen, columns=header)
df.head()
print(df)
If anybody has a better solution, it would be enlightening to know how to do it with efficiency and elegance.
Thank you for reading.

pandas dataframe output need to be a string instead of a list

I have a requirement that the result value should be a string. But when I calculate the maximum value of dataframe it gives the result as a list.
import pandas as pd
def answer_one():
df_copy = [df['# Summer'].idxmax()]
return (df_copy)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
if col[:2]=='01':
df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
if col[:2]=='02':
df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
if col[:2]=='03':
df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
if col[:1]=='№':
df.rename(columns={col:'#'+col[1:]}, inplace=True)
names_ids = df.index.str.split('\s\(')
df.index = names_ids.str[0] # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)
df = df.drop('Totals')
df.head()
answer_one()
But here the answer_one() will give me a List as an output and not a string. Can someone help me know how this came be converted to a string or how can I get the answer directly from dataframe as a string. I don't want to convert the list to a string using str(df_copy).
Your first solution would be as #juanpa.arrivillaga put it: To not wrap it. Your function becomes:
def answer_one():
df_copy = df['# Summer'].idxmax()
return (df_copy)
>>> 1
Another thing that you might not be expecting but idxmax() will return the index of the max, perhaps you want to do:
def answer_one():
df_copy = df['# Summer'].max()
return (df_copy)
>>> 30
Since you don't want to do str(df_copy) you can do df_copy.astype(str) instead.
Here is how I would write your function:
def get_max_as_string(data, column_name):
""" Return Max Value from a column as a string."""
return data[column_name].max().astype(str)
get_max_as_string(df, '# Summer')
>>> '30'

Resources