Pandas: Remove characters from cell in a FOR loop - python-3.x

I have a dataframe with a column labeled Amount which is a dollar amount. For some reason, some of the cells in this column are enclosed in quotation marks (ex: "$47.25").
I'm running this for loop and was wondering what is the best approach to remove the quotes.
for f in files:
    print(f)
    df = pd.read_csv(f, header = None, nrows=1)
    print(df)
    je = df.iloc[0,1]
    df2 = pd.read_csv(f,header = 6, dtype = {'Amount':float})
    df2.to_excel(w, sheet_name = je, index = False)
I have attempted to strip the " from the value using a for loop:
for cell in df2['Amount']:
    cell = cell.strip('"')
df2['Amount']=pd.to_numeric(df2['Amount'])
But I am getting:
ValueError: Unable to parse string "$-167.97" at position 0
Thank you in advance!

Given this toy dataframe:
import pandas as pd
df = pd.DataFrame(
{"Transaction": ["t1", "t2"], "Amount": ["$47.25", "'$-167.97'"]}
)
print(df)
# Outputs
  Transaction      Amount
0          t1      $47.25
1          t2  '$-167.97'
Instead of using a for loop, which should generally be avoided with dataframes, you could simply remove the quotation marks from the Amount column like this:
df["Amount"] = df["Amount"].str.replace("\'", "")
print(df)
# Outputs
  Transaction    Amount
0          t1    $47.25
1          t2  $-167.97
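To go one step further and get numbers back, the quotes and the dollar sign can be removed in one pass before converting. The error in the question mentions $-167.97, so the dollar sign likely blocks pd.to_numeric just as much as the quotes do. A minimal sketch, assuming the column holds strings; the third row and the comma handling are assumptions added for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Transaction": ["t1", "t2", "t3"],
        # t3 is a made-up row to show a thousands separator
        "Amount": ["$47.25", '"$-167.97"', "'$1,200.00'"],
    }
)

# Drop double quotes, single quotes, dollar signs and commas in one pass
cleaned = df["Amount"].str.replace("[\"'$,]", "", regex=True)

# Convert the remaining text to floats
df["Amount"] = pd.to_numeric(cleaned)
print(df)
```

Note that the loop in the question only rebinds the local name cell on each iteration, so the column itself is never modified; the vectorized .str methods return a new Series that can be assigned back.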

Related

How to remove rows in pandas dataframe column that contain the hyphen character?

I have a DataFrame given as follows:
new_dict = {'Area_sqfeet': '[1002, 322, 420-500,300,1.25acres,100-250,3.45 acres]'}
df = pd.DataFrame([new_dict])
df.head()
I want to remove hyphen values and change acres to sqfeet in this dataframe.
How may I do it efficiently?
Use list comprehension:
mylist = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
# ['1002', '322', '420-500', '300', '1.25acres', '100-250', '3.45 acres']
Step 1: Remove hyphens
filtered_list = [i for i in mylist if "-" not in i] # remove hyphens
Step 2: Convert acres to sqfeet:
final_list = [i if 'acres' not in i else float(i.split('acres')[0])*43560 for i in filtered_list] # convert acres to square feet; float() parses the number without eval()
#['1002', '322', '300', 54450.0, 150282.0]
Also, if you want to keep "sqfeet" next to the converted values, use this:
final_list = [i if 'acres' not in i else "{} sqfeet".format(float(i.split('acres')[0])*43560) for i in filtered_list]
# ['1002', '322', '300', '54450.0 sqfeet', '150282.0 sqfeet']
It's not clear if this is homework, and you haven't shown us what you have already tried per https://stackoverflow.com/help/how-to-ask
Here's something that might get you going in the right direction:
import pandas as pd
col_name = 'Area_sqfeet'
# per comment on your question, you need to make a dataframe with more
# than one row, your original question only had one row
new_list = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
df = pd.DataFrame(new_list)
df.columns = ["Area_sqfeet"]
# once you have the df as strings, here's how to remove the ones with hyphens
df = df[~df["Area_sqfeet"].str.contains("-")]
print(df.head())
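Both steps (dropping the hyphenated ranges and converting acres) can also stay inside pandas. A minimal sketch, assuming every remaining value is either a plain number or a number followed by "acres", and using 1 acre = 43,560 square feet as in the answer above:

```python
import pandas as pd

df = pd.DataFrame(
    {"Area_sqfeet": ["1002", "322", "420-500", "300", "1.25acres", "100-250", "3.45 acres"]}
)

# Keep only the rows that do not contain a hyphen
df = df[~df["Area_sqfeet"].str.contains("-", regex=False)].copy()

# Remember which rows are expressed in acres
is_acres = df["Area_sqfeet"].str.contains("acres", regex=False)

# Strip the unit and surrounding spaces, then parse the numbers
numbers = pd.to_numeric(
    df["Area_sqfeet"].str.replace("acres", "", regex=False).str.strip()
)

# Multiply only the acre rows by 43560; leave the others unchanged
df["Area_sqfeet"] = numbers.where(~is_acres, numbers * 43560)
print(df)
```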

Filter dataframe based on groupby sum()

I want to filter my dataframe based on a groupby sum(). I am looking for rows where the amounts for a specific date sum to zero.
I have solved this with a for loop, but I suspect performance will suffer if the dataframe is large.
It also seems clunky.
newdf = pd.DataFrame()
newdf['name'] = ('leon','eurika','monica','wian')
newdf['surname'] = ('swart','swart','swart','swart')
newdf['birthdate'] = ('14051981','198001','20081012','20100621')
newdf['tdate'] = ('13/05/2015','14/05/2015','15/05/2015', '13/05/2015')
newdf['tamount'] = (100.10, 111.11, 123.45, -100.10)
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
df2 = df.loc[df["tamount"] == 0, "tdate"]
df3 = pd.DataFrame()
for i in df2:
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df3 = pd.concat([df3, newdf.loc[newdf["tdate"] == i]])
print(df3)
The code above outputs the two rows whose tamount values sum to zero:
   name surname   birthdate       tdate  tamount
0  leon   swart  1981-05-14  13/05/2015    100.1
3  wian   swart  2010-06-21  13/05/2015   -100.1
Just use basic numpy :)
import numpy as np
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
dates = df['tdate'][np.where(df['tamount'] == 0)[0]]
newdf[np.isin(newdf['tdate'], dates) == True]
Hope this helps; let me know if you have any questions.
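The same filter can also be written without the intermediate dataframe of dates. A minimal sketch using groupby(...).transform('sum'), which broadcasts each date's total back onto its rows; using np.isclose instead of == 0 is an added assumption to tolerate floating-point rounding:

```python
import numpy as np
import pandas as pd

newdf = pd.DataFrame(
    {
        "name": ["leon", "eurika", "monica", "wian"],
        "surname": ["swart", "swart", "swart", "swart"],
        "tdate": ["13/05/2015", "14/05/2015", "15/05/2015", "13/05/2015"],
        "tamount": [100.10, 111.11, 123.45, -100.10],
    }
)

# Per-row total of tamount for that row's tdate
totals = newdf.groupby("tdate")["tamount"].transform("sum")

# Keep the rows whose date total is (numerically) zero
result = newdf[np.isclose(totals, 0)]
print(result)
```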

Pandas checks with prefix and more checksum if searched prefix exists or no data

I have below code snippet which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
sj000001
sj000002
sj000003
sj000004
sj124000
sj125000
sj126000
sj127000
sj128000
sj129000
sj130000
sj131000
sj132000
cr000011
cr000012
cr000013
cr000014
crn00001
crn00002
crn00003
crn00004
euk000011
eu0000012
eu0000013
eu0000014
eu5000011
eu5000013
eu5000014
eu5000015
Current output:
    sj00  sj12      cr00      cr08       eu00       eu50
sj000001        cr000011  crn00001  euk000011  eu5000011
sj000002        cr000012  crn00002  eu0000012  eu5000013
sj000003        cr000013  crn00003  eu0000013  eu5000014
sj000004        cr000014  crn00004  eu0000014  eu5000015
What's expected:
1) The code works, but as the current output shows, the second column (sj12) has no values and still appears. How can I add a check so that a column with no values is dropped from the output?
2) Can we check whether each prefix exists in the dataframe before processing, to avoid an error?
Any help is appreciated.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve your first part. For the second part, reindex may help:
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note the axis=1 (not axis=0) in dropna here; it is the same call I proposed for question 1.
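To see why reindex avoids the missing-prefix error, here is a small self-contained sketch. The host list, the column name host, and the variable wide are made up for illustration so that no file is needed; sj12 and dt00 have no matching hosts:

```python
import pandas as pd

# Hypothetical sample data standing in for the new_hosts file
hosts = ["sj000001", "sj000002", "cr000011", "cr000012", "eu5000011"]
prefixes = ["sj00", "sj12", "cr00", "eu50", "dt00"]

df = pd.DataFrame({"host": hosts})
df["prefix"] = df["host"].str[:4]
df["grp"] = df.groupby("prefix").cumcount()
wide = df.pivot(index="grp", columns="prefix", values="host")

# reindex creates all-NaN columns for prefixes that are absent,
# where wide[prefixes] would raise a KeyError
wide = wide.reindex(prefixes, axis=1)

# Drop the columns that ended up completely empty, then blank out the gaps
wide = wide.dropna(axis=1, how="all").fillna("")
print(wide)
```

Only the prefixes that actually occur survive, in the order given by the prefixes list.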
Many thanks to Quang Hoang for the hints on this post. As a workaround, I got it working as follows until I get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12`, extract only the values that start with `sj12` immediately followed by a letter, like `sj12[a-z]`
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# again drop rows where all values are NaN, then replace NaN with empty strings in the remaining columns
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# drop the index field
df = df.rename_axis(None)
print(df)

compare index and column in data frame with dictionary

I have a dictionary:
d = {'A-A': 1, 'A-B':2, 'A-C':3, 'B-A':5, 'B-B':1, 'B-C':5, 'C-A':3,
'C-B':4, 'C-C': 9}
and a list:
L = ['A', 'B', 'C']
I have a DataFrame:
df =pd.DataFrame(columns = L, index=L)
I would like to fill each row in df with the values from the dictionary, based on the dictionary keys. For example:
   A  B  C
A  1  2  3
B  5  1  5
C  3  4  9
I tried doing that by:
df.loc[L[0]]=[1,2,3]
df.loc[L[1]]=[5,1,5]
df.loc[L[2]] =[3,4,9]
Is there another way to do this, especially when there is a large amount of data?
Thank you for your help.
Here is another way that I can think of:
import numpy as np
import pandas as pd
# given
d = {'A-A': 1, 'A-B':2, 'A-C':3, 'B-A':5, 'B-B':1, 'B-C':5, 'C-A':3,
'C-B':4, 'C-C': 9}
L = ['A', 'B', 'C']
# copy the dictionary values into a numpy array (in insertion order)
z = np.asarray(list(d.values()))
# reshape the array according to your DataFrame
z_new = np.reshape(z, (3, 3))
# copy it into your DataFrame
df = pd.DataFrame(z_new, columns = L, index=L)
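The reshape above relies on the dictionary's insertion order matching the row-by-row layout of the table. A minimal sketch of an alternative that places each value by its key instead, assuming every key has the form "row-column":

```python
import pandas as pd

d = {'A-A': 1, 'A-B': 2, 'A-C': 3, 'B-A': 5, 'B-B': 1, 'B-C': 5, 'C-A': 3,
     'C-B': 4, 'C-C': 9}
L = ['A', 'B', 'C']

# Turn the dictionary into a Series indexed by its keys
s = pd.Series(d)

# Split each "row-column" key into a (row, column) pair
s.index = pd.MultiIndex.from_tuples([tuple(key.split('-')) for key in s.index])

# Move the column part of the index into the columns, then order by L
df = s.unstack().reindex(index=L, columns=L)
print(df)
```

With this approach the order of the dictionary entries no longer matters, and any pair missing from the dictionary would show up as NaN rather than shifting the other values.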
This should do the trick, though it's probably not the best way:
for index in L:
    prefix = index + "-"
    df.loc[index] = [d.get(prefix + column, 0) for column in L]
Calculating the prefix separately beforehand is probably slower for a small list and probably faster for a large list.
Explanation
for index in L:
This iterates through all of the row names.
prefix = index + "-"
All of the keys for each row start with index + "-", e.g. "A-", "B-", etc.
df.loc[index] =
Set the contents of the entire row.
[ for column in L]
The same as your comma thing ([1, 2, 3]) just for an arbitrary number of items. This is called a "list comprehension".
d.get( , 0)
This is the same as d[ ] but returns 0 if it can't find anything.
prefix + column
Sticks the column on the end, e.g. "A-" gives "A-A", "A-B"…

Return pieces of strings from separate pandas dataframes based on multi-conditional logic

I'm new to Python and trying to do some work with dataframes in pandas.
On the left side is a piece of the primary dataframe (df1), and on the right is a second one (df2). The goal is to fill in the df1['vd_type'] column with strings based on several pieces of conditional logic. I can make this work with nested np.where() functions, but as this gets deeper into the hierarchy, it gets too long to run at all, so I'm looking for a more elegant solution.
The English version of the logic is this:
For df1['vd_type']: If df1['shape'] == the first two characters in df2['vd_combo'] AND df1['vd_pct'] <= df2['combo_value'], then return the last 3 characters in df2['vd_combo'] on the line where both of these conditions are true. If it can't find a line in df2 where both conditions are true, then return "vd4".
Thanks in advance!
EDIT #2: So I want to implement a 3rd condition based on another variable, with everything else the same, except in df1 there is another column 'log_vsc' with existing values, and the goal is to fill in an empty df1 column 'vsc_type' with one of 4 strings in the same scheme. The extra condition would be just that the 'vd_type' that we just defined would match the 'vd' column arising from the split 'vsc_combo'.
df3 = pd.DataFrame()
df3['vsc_combo'] = ['A1_vd1_vsc1','A1_vd1_vsc2','A1_vd1_vsc3','A1_vd2_vsc1','A1_vd2_vsc2' etc etc etc
df3['combo_value'] = [(number), (number), (number), (number), (number), etc etc
df3[['shape','vd','vsc']] = df3['vsc_combo'].str.split('_', expand = True)
def vsc_condition( row, df3):
    df_select = df3[(df3['shape'] == row['shape']) & (df3['vd'] == row['vd_type']) & (row['log_vsc'] <= df3['combo_value'])]
    if df_select.empty:
        return 'vsc4'
    else:
        return df_select['vsc'].iloc[0]
## apply vsc_type
df1['vsc_type'] = df1.apply( vsc_condition, args = ([df3]), axis = 1)
And this works!! Thanks again!
So your inputs look like this:
import pandas as pd
df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]} )
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2', 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
'combo_value':[0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]} )
If you are not against creating columns in df2 (you can delete them at the end if it's a problem) you generate two columns shape and vd by splitting the column vd_combo:
df2[['shape','vd']] = df2['vd_combo'].str.split('_',expand=True)
Then you can create a function condition that you will use in apply such as:
def condition( row, df2):
    # row will be a row of df1 in apply
    # select only the rows of df2 that meet your conditions on shape and value
    df_select = df2[(df2['shape'] == row['shape']) & (row['vd_pct'] <= df2['combo_value'])]
    # if empty (your condition is not met), return vd4
    if df_select.empty:
        return 'vd4'
    # if your condition is met, return 'vd' from the first matching row,
    # which has the smallest combo_value as long as df2 is sorted in ascending order
    else:
        return df_select['vd'].iloc[0]
Now you can create your column vd_type in df1 with:
df1['vd_type'] = df1.apply( condition, args =([df2]), axis=1)
df1 then looks like:
  shape  vd_pct vd_type
0    A2    0.78     vd4
1    A1    0.33     vd1
2    B1    0.48     vd2
3    B1    0.38     vd1
4    A2    0.59     vd3
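Since apply runs the Python function once per row of df1, it can become slow on large frames. One possible alternative is pd.merge_asof, which for each df1 row finds the first df2 row with the same shape whose combo_value is greater than or equal to vd_pct. This is a sketch of a swapped-in technique, not the method from the answer above; it assumes "first match" means the smallest qualifying combo_value, and the names left, right and merged are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
                    'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]})
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2',
                                 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
                    'combo_value': [0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]})
df2[['shape', 'vd']] = df2['vd_combo'].str.split('_', expand=True)

# merge_asof needs both frames sorted by their numeric key;
# reset_index keeps the original row labels of df1 in an 'index' column
left = df1.reset_index().sort_values('vd_pct')
right = df2.sort_values('combo_value')[['shape', 'combo_value', 'vd']]

# direction='forward' picks the nearest combo_value >= vd_pct within each shape
merged = pd.merge_asof(left, right, left_on='vd_pct', right_on='combo_value',
                       by='shape', direction='forward')

# Restore the original row order, then fall back to 'vd4' where nothing matched
merged = merged.set_index('index').sort_index()
df1['vd_type'] = merged['vd'].fillna('vd4')
print(df1)
```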
