I have a dictionary with JSON values keyed to the value of a column (name) in my data frame, and I want to add some columns to the data frame drawn from the dictionary.
I've tried to do this with something like:
df['district_name'] = data[df['name']]['district_name']
but that doesn't work at all (it gives a "Series aren't valid keys" error, which makes perfect sense; I've never quite understood the black magic that allows df['col3'] = df['col1'] + df['col2'] to work). Other answers here have led me to try something like:
df['district_name'] = df.apply(lambda row:data[row['name']]['district_name'])
This gives me KeyError: ('name', 'occurred at index Name').
How can I best accomplish this?
You are quite close. Your apply attempt failed because DataFrame.apply defaults to axis=0, which passes each column to the function; you would need axis=1 to get rows. But a map over the name column is simpler:
df['district_name'] = df['name'].map(lambda n: data[n]['district_name'])
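As a minimal, self-contained sketch of the whole pattern (the contents of data and df below are made-up placeholders matching your description):

import pandas as pd

# Dictionary of JSON-like records keyed by the values in df['name'].
data = {'alice': {'district_name': 'North'},
        'bob': {'district_name': 'South'}}
df = pd.DataFrame({'name': ['alice', 'bob', 'alice']})

# Look each name up in the dict and pull out the wanted field.
df['district_name'] = df['name'].map(lambda n: data[n]['district_name'])
print(df)
#     name district_name
# 0  alice         North
# 1    bob         South
# 2  alice         North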
What I actually want to do is group all similar strings in one column and sum their corresponding counts where they are similar; otherwise, leave them alone.
This is a little similar to this post, but unfortunately I have not been able to apply it to my case:
How to group Pandas data frame by column with regex match
Unfortunately, I ended up with the following steps:
I wrote a function to print the fuzz.WRatio for each row's string, where each row does a linear search from the top to check whether there are other similar strings among the remaining rows. If the WRatio > 90, I would like to sum those rows' corresponding counts; otherwise, leave them as they are.
I created test data that looks like this:
test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})
So what I want is to produce a result dataframe like:
result = pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
My function so far only gives me the fuzz ratio for each row; as I understand it, each row is compared three times (there are four rows here). So my function's output looks like:
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'Apple.Inc.', 'Apple.Inc.', 'apple.inc',
              'apple.inc', 'apple.inc'],
    'Ncount': [4, 4, 4, 3, 3, 3],
    'FRatio': [100, 100, 100, 100, 100, 100]
})
This is just one portion of the whole output from the function on this test data.
The last row, "OMEGA", gives a fuzz ratio of about 18 against the others.
My function is like this:

from fuzzywuzzy import fuzz

def checkDupTitle2(data):
    Nname = []
    Ncount = []
    f_ratio = []
    for i in range(0, len(data)):
        current = 0
        for space in range(0, len(data) - 1 - current):
            ratio = fuzz.WRatio(str(data.loc[i]['name']).strip(),
                                str(data.loc[current + space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
    df = pd.DataFrame({
        'Nname': Nname,
        'Ncount': Ncount,
        'FRatio': f_ratio
    })
    return df
So after running this function and getting the output, I tried to produce what I eventually want.
Here I tried a groupby on the df created above:
output.groupby(output.FRatio > 90).sum()
But this way I still need a "name" in my dataframe: how do I decide which name to use for the total count (say, 9 here)? "Apple.Inc." or "apple.inc" or "APPLE.INC"?
Or did I make it too complex?
If there is a way to group by "name" from the very start and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" all the same, then my problem is solved. I have been stumped for quite a while. Any help would be highly appreciated! Thanks!
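If the variants only ever differ in case and trailing punctuation, a plain normalized groupby already covers that narrow case (a sketch under exactly that assumption):

import pandas as pd

test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})

# Build a normalized grouping key: lowercase, trailing dots stripped.
key = test_data['name'].str.lower().str.rstrip('.')
result = (test_data.groupby(key, sort=False)
          .agg(Nname=('name', 'first'), Ncount=('count', 'sum'))
          .reset_index(drop=True))
print(result)
#         Nname  Ncount
# 0  Apple.Inc.       9
# 1       OMEGA       6

For strings that differ more freely, you need actual fuzzy matching: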
The following code uses my library RapidFuzz instead of FuzzyWuzzy, since it is faster and it has a process method, extractIndices, which helps here. This solution is quite a bit faster, but since I do not work with pandas regularly I am sure there are still some things that could be improved :)
import pandas as pd
from rapidfuzz import process, utils

def checkDupTitle(data):
    values = data.values.tolist()
    companies = [company for company, _ in values]
    # Pre-process each name once (lowercase, strip non-alphanumerics)
    # so it is not redone on every comparison.
    pcompanies = [utils.default_process(company) for company in companies]
    counts = [count for _, count in values]
    results = []
    while companies:
        company = companies.pop(0)
        pcompany = pcompanies.pop(0)
        count = counts.pop(0)
        # Indices of all remaining entries scoring >= 90 against this one.
        duplicates = process.extractIndices(
            pcompany, pcompanies,
            processor=None, score_cutoff=90, limit=None)
        # Remove duplicates back-to-front so the indices stay valid.
        for (i, _) in sorted(duplicates, reverse=True):
            count += counts.pop(i)
            del pcompanies[i]
            del companies[i]
        results.append([company, count])
    return pd.DataFrame(results, columns=['Nname', 'Ncount'])

test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})
checkDupTitle(test_data)
The result is
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
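One hedged caveat: newer RapidFuzz releases removed process.extractIndices. There, process.extract returns (match, score, index) triples when the choices are a list, so an equivalent lookup would be roughly this sketch (untested against any specific version):

from rapidfuzz import process, fuzz

choices = ["apple inc", "apple.inc", "omega"]
# With a list of choices, each result is a (match, score, index) triple,
# so the matching indices can be recovered directly.
matches = process.extract("apple inc.", choices,
                          scorer=fuzz.WRatio, score_cutoff=90, limit=None)
indices = [idx for _, _, idx in matches]
print(indices)  # expected: [0, 1]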
So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below but it doesn't work:
for x in df.columns:
    df[x + '_norm'] = df[x].apply(lambda x: (x - df[x].mean()) / df[x].std())
The KeyError happens because the lambda's parameter x shadows your loop variable x, so df[x] inside the lambda indexes the dataframe with a cell value instead of a column name. Beyond that, it's not a good idea to call mean and std inside apply: they are recomputed for every single row. Compute them once per column before the apply and use them in the lambda, like below:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x + '_norm'] = df[x].apply(lambda y: (y - mean) / std)
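That fixes the error, but you don't need apply at all here: pandas arithmetic is vectorized, so the same z-score normalization can be written as this sketch:

for x in df.columns:
    # Whole-column arithmetic; one expression per column, no per-row calls.
    df[x + '_norm'] = (df[x] - df[x].mean()) / df[x].std()

This is both shorter and much faster on large frames.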
I've written a python program to look for and fix syntax errors in an excel spreadsheet. This part works.
If a cell has a correctable syntax error, the value in the cell should be fixed and the cell highlighted in yellow.
If the syntax error is not correctable (there is not enough information available to correct it), the value of the cell should be left unchanged, but the cell should be highlighted in red. This is the part I can't seem to get to work.
I can update the value of the cell or highlight it, but all my attempts to do both fail.
I've used commands like
df.at[row,col] = value
to update an individual cell, and this works. And
def colorCell(df):
    color = 'background-color: red; font-weight: bold'
    dfTemp = pd.DataFrame("", index=df.index, columns=df.columns)
    for i in df[df["UserId"].str.match(userIdPat) != True]["UserId"].index:
        dfTemp.at[i, "UserId"] = color
    for i in df[df["Phone Number"].str.match(phonePat) != True]["Phone Number"].index:
        dfTemp.at[i, "Phone Number"] = color
    for i in df[df["MAC Address"].str.match(macPat) != True]["MAC Address"].index:
        dfTemp.at[i, "MAC Address"] = color
    return dfTemp

df2 = df.style.apply(colorCell, axis=None)
to color the cells, but I can't seem to get both setting a value and setting the style to work at the same time.
Part of the problem is that I'm modifying the value in place (df), while I'm creating a new dataframe to set the color (df2). If I write df2 back into df, I get "AttributeError: 'Styler' object has no attribute 'at'" the next time I try to set another cell value using
df.at[row,col] = value
I'm sure there is a simple fix for this, but I am just not seeing it.
Thanks!
#user545424 Thanks for suggesting I post a short example. In doing so I solved my own issue.
My original issue was that I had two separate dataframes: one for the data, and a separate one for the style. In coming up with a short example I found a way to do everything with one dataframe (I'm very new to both python and pandas).
The fix I came up with is to create the original dataframe including ".style".
So in place of this command (which I was using):
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))
I now create df using this command:
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD')).style
Now, when I want to access or change the data part of df, I reference it as "df.data", and when I want to apply styling, I call "df.apply". This way I can access both the data and the style through one object. Here is the example code I was writing when I discovered my answer:
import numpy as np
import pandas as pd

def colorCell(df, row=None, column=None, color="red"):
    color = f"background-color: {color}; font-weight: bold"
    dfTemp = pd.DataFrame("", index=df.index, columns=df.columns)
    dfTemp.at[row, column] = color
    return dfTemp

# Create the dataframe including styles.
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD')).style

# Update location [1,"A"] and set it to "red".
df.data.loc[1, "A"] = 999
df.apply(colorCell, row=1, column="A", color="red", axis=None)

# Update location [2,"B"] and set it to "yellow".
df.data.loc[2, "B"] = 999
df.apply(colorCell, row=2, column="B", color="yellow", axis=None)
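If you also want the highlights to survive into an actual spreadsheet, the Styler can write them out directly. A sketch, assuming openpyxl is installed (df here is the Styler from above):

# Styler.to_excel writes both the underlying values and the supported
# CSS formatting (background color, bold font) into the workbook.
df.to_excel("checked.xlsx", engine="openpyxl")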
As I said above, I'm very new to python, and even newer to pandas. So just because the above works, is there a better way to do this?
Thanks!
I have some data, 33k rows x 57 columns.
Some columns contain data that I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
I have a problem with saving the tuple output from the for loop.
I am using tuples to build a good translation; .join and .append are not working in my case. I have tried many approaches without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the values from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're writing to is only a copy produced by iterrows, not the dataframe itself, so the assignment never reaches data. Write through the dataframe with .loc instead; change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
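As a side note, a vectorized version that skips iterrows entirely would be a sketch like this (assuming slownik maps single characters):

# Translate every cell of the column in one pass.
data['translated'] = data['1st_service'].apply(
    lambda s: tuple(slownik.get(znak) for znak in s))

Each cell then holds a tuple of translated characters, with None wherever a character is missing from slownik.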
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is a faster way to do it?
I am doing it for 12 columns in one for loop, as shown above.
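For the 12 columns, one hedged alternative sketch (reusing the per-character lookup above; the column list is a placeholder to fill in with your real names):

cols = ["1st_service"]  # extend with the remaining 11 column names
for col in cols:
    data[col] = data[col].apply(
        lambda s: tuple(slownik.get(znak) for znak in s))
data.to_csv("out.csv", index=False)

This avoids both iterrows and the string round-trip through trans.split(')('), and it keeps each row's translation aligned with its own row.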
I am facing a weird problem with pandas and I do not know where I am going wrong: converting columns with pd.to_numeric through .iloc does not seem to stick, but when I create a new df from the same expression, there is no problem. Any idea why?
Edit:
sat = pd.read_csv("2012_SAT_Results.csv")
sat.head()

# Converted columns to numeric types -- but sat.dtypes still shows object.
sat.iloc[:, 2:] = sat.iloc[:, 2:].apply(pd.to_numeric, errors="coerce")
sat.dtypes

# Creating a new df from the same expression converts fine.
sat_1 = sat.iloc[:, 2:].apply(pd.to_numeric, errors="coerce")
sat_1.head()
The fact that you can't apply to_numeric directly using .iloc appears to be a bug (assigning back through .iloc sets the values in place and keeps the original object dtype), but to get the result you're looking for (applying to_numeric to multiple columns at the same time), you could instead use:

df = pd.DataFrame({'a': ['1', '2'], 'b': ['3', '4']})

# If you're applying to entire columns
df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors='coerce')

# If you want to apply to specific rows within columns
df.loc[df.index[1:], df.columns[1:]] = df.loc[df.index[1:], df.columns[1:]].apply(pd.to_numeric, errors='coerce')
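Applied to the sat frame from the question, the column version would be this sketch:

num_cols = sat.columns[2:]
sat[num_cols] = sat[num_cols].apply(pd.to_numeric, errors="coerce")
sat.dtypes  # the converted columns should now show numeric dtypes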