I am new to this so any help is much appreciated.
I would like to add a new column to a data frame that is a function of both values in the data frame and a python object.
The format is as follows:
df['col_3'] = list(map(function, df['col_1'], df['col_2'], instance_of_class))
def function(a, b, instance):
    return a + b + instance.attribute
where one of the parameters needs to be an instance of a class.
When I do this, Python throws an error that the object is not iterable. I assume this is because map only accepts iterables as arguments. I'm not sure how to get around this without substantially slowing things down. Thanks!
This could be done with map or apply, but since you are just starting, why don't you keep it simple?
col_3 = []
for c1, c2 in df[['col_1', 'col_2']].values:
    col_3.append(function(c1, c2, instance))
df['col_3'] = col_3
Low-level Python skills here (I learned programming with SAS).
I am trying to apply a series of fuzzy string matching formulas (from the fuzzywuzzy lib) to pairs of strings stored in a base dataframe, and I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then append all these sub-dataframes into a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframes, the resulting value gets overwritten at each turn of the loop.
Or should I build one dataframe in a single loop, taking my formula names and functions from a dict? That runs into the same overwriting problem.
Here is my formulas dict:
# ratios dict: all ratios names and functions
ratios = {"ratio": fuzz.ratio,
"partial ratio": fuzz.partial_ratio,
"token sort ratio": fuzz.token_sort_ratio,
"partial token sort ratio": fuzz.partial_token_sort_ratio,
"token set ratio": fuzz.token_set_ratio,
"partial token set ratio": fuzz.partial_token_set_ratio
}
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])

    # new base df containing ratio data and calculations for current loop turn
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
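Concatenating inside the loop copies the accumulated frame on every iteration; collecting the pieces in a list and concatenating once is usually cheaper. A minimal sketch, assuming df_out, base_column and target_column are defined as in the question:
frames = []
for r, rn in ratios.items():
    tmp = df_out[[base_column, target_column]].copy()
    tmp['mesure'] = r  # name of the current ratio
    # fn=rn pins the current function to this iteration's lambda
    tmp['valeur'] = df_out.apply(lambda row, fn=rn: fn(row[base_column], row[target_column]), axis=1)
    frames.append(tmp)
final_df = pd.concat(frames, ignore_index=True)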
So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below, but it doesn't work:
for x in df.columns:
    df[x+'_norm'] = df[x].apply(lambda x: (x-df[x].mean())/df[x].std())
The KeyError happens because the lambda's parameter x shadows your loop variable, so df[x] inside the lambda indexes the dataframe with a cell value instead of a column name. Beyond that, it's not a good idea to call mean and std inside the apply: you recalculate them for every single row. Instead, compute them once at the start of each loop iteration and use them in the apply function, like below:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x+'_norm'] = df[x].apply(lambda y: (y-mean)/std)
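You can also drop apply altogether, since pandas arithmetic is vectorized; a minimal sketch, assuming all columns are numeric:
for x in df.columns:
    # whole-column operations, no per-row Python calls
    df[x + '_norm'] = (df[x] - df[x].mean()) / df[x].std()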
I'm trying to hash each value of a python 3.6 pandas dataframe column with the following algorithm on the dataframe-column ORIG:
HK_ORIG = base64.b64encode(hashlib.sha1(str(df.ORIG).encode("UTF-8")).digest())
However, the above code does not hash each value of the column: str(df.ORIG) is the string representation of the whole Series, so everything gets hashed at once. In order to hash each value of the df column ORIG, I need to use the apply function. Unfortunately, I can't seem to get it done.
I imagine it to look like the following code:
df["HK_ORIG"] = str(df['ORIG']).encode("UTF-8")).apply(hashlib.sha1)
I'm looking very much forward to your answers!
Many thanks in advance!
You can either create a named function and apply it, or apply a lambda function. In either case, do as much processing as possible within the dataframe.
A lambda-based solution:
import base64
import hashlib

df['HK_ORIG'] = df['ORIG'].astype(str).str.encode('UTF-8')\
    .apply(lambda x: base64.b64encode(hashlib.sha1(x).digest()))
A named function solution:
def hashme(x):
    return base64.b64encode(hashlib.sha1(x).digest())

df['HK_ORIG'] = df['ORIG'].astype(str).str.encode('UTF-8')\
    .apply(hashme)
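Note that base64.b64encode returns bytes; if you would rather store a plain str in the column, decode the result. A small sketch of that variant:
df['HK_ORIG'] = df['ORIG'].astype(str).str.encode('UTF-8')\
    .apply(lambda x: base64.b64encode(hashlib.sha1(x).digest()).decode('ascii'))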
I'm using a map function to generate a new column where its value depends on the result of a column that already exists in the dataframe.
def computeTechFields(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return (row.col1, row.col2, row.col3, tech1)
delta2rdd = delta.map(computeTechFields)
The problem is that my main dataframe has more than 150 columns that I have to return with the map function so in the end I have something like this :
return (row.col1, row.col2, row.col3, row.col4, row.col5, row.col6, row.col7, row.col8, row.col9, row.col10, row.col11, row.col12, row.col13, row.col14, row.col15, row.col16, row.col17, row.col18 ..... row.col149, row.col150, row.col151, tech1)
As you can see, it is really long to write and difficult to read. So I tried to do something like this :
return (row.*, tech1)
But of course it did not work.
I know that the "withColumn" function exists but I don't know much about its performance and could not make it work anyway.
Edit (What happened with the withColumn function) :
def computeTech1(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return tech1
delta2 = delta.withColumn("tech1", computeTech1)
And it gave me this error :
AssertionError: col should be Column
I tried to do something like this :
return col(tech1)
The error was the same
I also tried :
delta2 = delta.withColumn("tech1", col(computeTech1))
This time, the error was :
AttributeError: 'function' object has no attribute '_get_object_id'
End of the edit
So my question is, how can I return all the columns + a few more within my UDF used by the map function ?
Thanks !
Not super firm with Python, so people might correct me on the syntax here, but the general idea is to make your function a UDF with a column as input, then call it inside withColumn. I used a lambda here, but with some fiddling it should also work with a named function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

computeTech1UDF = udf(
    lambda c: 0 if c != VALUE_TO_COMPARE else 1, IntegerType())
delta2 = delta.withColumn("tech1", computeTech1UDF(col("col1")))
What you tried did not work since you did not provide withColumn with a column expression (see http://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn). Using the UDF wrapper achieves exactly that.
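If performance matters, you can often skip the Python UDF entirely and let Spark evaluate the condition natively; a minimal sketch using when/otherwise, assuming VALUE_TO_COMPARE is defined:
from pyspark.sql.functions import col, when

# evaluated inside the JVM, no Python round-trip per row
delta2 = delta.withColumn(
    "tech1", when(col("col1") != VALUE_TO_COMPARE, 0).otherwise(1))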
I have a dictionary with JSON values keyed to the value of a column (name) in my data frame, and I want to add some columns to the data frame drawn from the dictionary.
I've tried to do this with something like:
df['district_name'] = data[df['name']]['district_name']
but that doesn't work at all (it gives a "Series aren't valid keys" error, which makes perfect sense; I've never quite understood the black magic that allows df['col3'] = df['col1'] + df['col2'] to work). Other answers here have led me to try something like:
df['district_name'] = df.apply(lambda row:data[row['name']]['district_name'])
This gives me KeyError: ('name', 'occurred at index Name').
How can I best accomplish this?
You are quite close. Your apply attempt failed because you left out axis=1, so pandas applied the function to each column rather than each row. The simplest fix is to map over the name column directly:
df['district_name'] = df['name'].map(lambda n: data[n]['district_name'])
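A tiny worked example with hypothetical data, just to show the shape of the lookup:
import pandas as pd

data = {'Alice': {'district_name': 'North'},
        'Bob': {'district_name': 'South'}}
df = pd.DataFrame({'name': ['Alice', 'Bob']})

df['district_name'] = df['name'].map(lambda n: data[n]['district_name'])
print(df)
#     name district_name
# 0  Alice         North
# 1    Bob         South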