How do you implement a weighted sum transform primitive in Featuretools?

I'm trying to figure out how to implement a weighted cumulative sum primitive for Featuretools. The weighting should depend on time_since_last, like
cum_sum(amount) = sum_{i} exp(-a_{i}) * amount_{i}
where i runs over rolling 6-month periods.
Above is the original question. After some trial and error I came up with the following code for my purpose,
using the data and initial setup for entities and relationships from here:
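import math
import numpy as np
import pandas as pd
import featuretools as ft
from featuretools.primitives import make_trans_primitive, MultiplyNumeric
from featuretools.variable_types import Datetime, Numeric

def weight_time_until(array, time):
    # time is the cutoff time, passed in because uses_calc_time=True
    diff = pd.DatetimeIndex(array) - time
    s = np.floor(diff.days/365/0.5)  # number of half-year periods
    aWidth = 9
    a = math.log(0.1) / ( -(aWidth -1) )
    w = np.exp(-a*s)
    return w

WeightTimeUntil = make_trans_primitive(function=weight_time_until,
                                       input_types=[Datetime],
                                       return_type=Numeric,
                                       uses_calc_time=True,
                                       description="Calc weight using time until the cutoff time",
                                       name="weight_time_until")

features, feature_names = ft.dfs(entityset = es, target_entity = 'clients',
                                 agg_primitives = ['sum'],
                                 trans_primitives = [WeightTimeUntil, MultiplyNumeric])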
When I run the above, I get close to the feature I want, but in the end I do not get it, and I do not understand why. I get the feature
SUM(loans.WEIGHT_TIME_UNTIL(loan_start))
but not
SUM(loans.loan_amount * loans.WEIGHT_TIME_UNTIL(loan_start))
What did I miss here?
I tried further.
My guess was a type mismatch, but the "types" are the same. Anyway, I tried the following:
1) es["loans"].convert_variable_type("loan_amount",ft.variable_types.Numeric)
2) loans["loan_amount_"] = loans["loan_amount"]*1.0
For both (1) and (2) I get the more promising feature:
loan_amount_ * WEIGHT_TIME_UNTIL(loan_start)
and also
loan_amount * WEIGHT_TIME_UNTIL(loan_start)
but only when I set the target entity to loans instead of clients, which was not my intention.

This primitive doesn't currently exist. However, you can create your own custom primitive to accomplish this calculation.
Here is an example that calculates the rolling sum. It can be adapted to compute a weighted sum using the appropriate pandas or Python method:
from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric

class RollingSum(TransformPrimitive):
    """Calculates the rolling sum.

    Description:
        Given a list of values, return the rolling sum.
    """
    name = "rolling_sum"
    input_types = [Numeric]
    return_type = Numeric
    uses_full_entity = True

    def __init__(self, window=1, min_periods=None):
        self.window = window
        self.min_periods = min_periods

    def get_function(self):
        def rolling_sum(values):
            """method is passed a pandas series"""
            return values.rolling(window=self.window, min_periods=self.min_periods).sum()
        return rolling_sum
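To check the numbers independently of Featuretools, the weighting from the question can be sketched in plain Python. This is only a rough illustration, not a Featuretools primitive: the function names half_year_weights and weighted_sum and the sample dates are made up here, and the age is measured as cutoff minus date (an assumption of this sketch) so that older entries receive smaller weights.

```python
import math
from datetime import date

def half_year_weights(dates, cutoff, a_width=9):
    """Weight exp(-a * s), where s counts whole half-year periods before cutoff."""
    # Same decay constant as in the question: the weight reaches 0.1 after 8 periods.
    a = math.log(0.1) / -(a_width - 1)
    weights = []
    for d in dates:
        periods = math.floor((cutoff - d).days / 365 / 0.5)
        weights.append(math.exp(-a * periods))
    return weights

def weighted_sum(amounts, dates, cutoff):
    """Sum of amount_i * weight_i over all entries."""
    weights = half_year_weights(dates, cutoff)
    return sum(w * x for w, x in zip(weights, amounts))

cutoff = date(2020, 1, 1)
dates = [date(2019, 12, 1), date(2019, 1, 1), date(2016, 1, 1)]
amounts = [100.0, 100.0, 100.0]

weights = half_year_weights(dates, cutoff)
total = weighted_sum(amounts, dates, cutoff)
print(weights)
print(total)
```

The most recent entry falls in period 0 and keeps its full weight, while the entry from four years earlier falls in period 8 and is weighted by roughly 0.1.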

Related

Issue with pd.DataFrame.apply with arguments

I want to create augmented data in a new dataframe for every row of an original dataframe.
So I've defined an augment function, which I want to use with apply as follows:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
When I call it as follows, everything works:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)
However, when I try it using the apply method, the prints and displays work fine, but the resulting dataframe shows errors:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)
This is how the output data looks after the apply call:
,data
<Error>, <Error>
<Error>, <Error>
What am I doing wrong?
Your test is very nice, thank you for the clear exposition.
I am happy to be your rubber duck.
In test A, you (successfully) mess with
testDF.iloc[0] and [1],
using kind of a Fortran-style API
for augment(), leaving a side effect result in tmp_df.
Test B is carefully constructed to
be "the same" except for the .apply() call.
So let's see, what's different?
Hard to say.
Let's go examine the docs.
Oh, right!
We're using the .apply() API,
so we'd better follow it.
Down at the end it explains:
Returns: Series or DataFrame
Result of applying func along the given axis of the DataFrame.
But you're offering return None instead.
Now, I'm not here to pass judgement on
whether it's best to have side effects
on a target df -- that's up to you.
But .apply() will be bent out of shape
until you give it something nice
to store as its own result.
Happy hunting!
Tiny little style nit.
You wrote
args=('binMap', tmp_df, 4, )
to offer a 3-tuple. Better to write
args=('binMap', tmp_df, 4)
As written it tends to suggest 1-tuple notation.
When is a trailing comma helpful?
in a 1-tuple it is essential: x = (7,)
in multiline dict / list expressions it minimizes git diffs, when inevitably another entry ('cherry'?) will later be added
fruits = [
    'apple',
    'banana',
]
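The tuple point is easy to check in an interpreter. A minimal sketch, with the string 'tmp_df' standing in for the actual dataframe:

```python
# A parenthesised value without a comma is just that value, not a tuple.
not_a_tuple = (7)
one_tuple = (7,)

# A trailing comma in a longer tuple changes nothing.
with_comma = ('binMap', 'tmp_df', 4, )
without_comma = ('binMap', 'tmp_df', 4)

print(type(not_a_tuple).__name__)   # int
print(type(one_tuple).__name__)     # tuple
print(with_comma == without_comma)  # True
```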
This change worked for me -
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
    return row
And I updated the call to apply as follows:
testDF = testDF.apply(augment, args=('binMap', tmp_df, 4, ), result_type='broadcast', axis=1)
Thank you @J_H.
If there are better ways to achieve what I'm doing, please feel free to suggest improvements.
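One possible alternative, offered only as a sketch: rather than writing into a target dataframe as a side effect, let the function return the new rows, collect them in a list, and build the dataframe once at the end. The helper fake_augment and the toy integer data below are hypothetical stand-ins for the image pipeline in the question.

```python
import pandas as pd

def fake_augment(value, i):
    """Hypothetical stand-in for the image augmentation step."""
    return value + i

def augment_rows(row, column_name, num_samples):
    """Return the original row followed by num_samples augmented copies."""
    rows = []
    for i in range(num_samples + 1):
        new_row = row.copy()
        # i == 0 keeps the original value, later copies are augmented.
        new_row[column_name] = fake_augment(row[column_name], i)
        rows.append(new_row)
    return rows

test_df = pd.DataFrame({"id": [1, 2], "binMap": [10, 20]})

collected = []
for _, row in test_df.iterrows():
    collected.extend(augment_rows(row, "binMap", num_samples=2))

# Build the result once instead of growing a dataframe row by row.
augmented_df = pd.DataFrame(collected).reset_index(drop=True)
print(augmented_df)
```

With this shape the function has no side effects, so it no longer matters what apply expects as a return value.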

How to add coefficients, p-values and relevant variable name in mlflow?

I am running a linear regression model and I would like to add the coefficient and p-value of each variable, along with the variable name, to the metrics of the mlflow output. I am new to mlflow and not very familiar with doing this. Below is part of my code:
with mlflow.start_run(run_name=p_key + '_' + str(o_key)):
    lr = LinearRegression(
        featuresCol = 'features',
        labelCol = target_var,
        maxIter = 10,
        regParam = 0.0,
        elasticNetParam = 0.0,
        solver="normal"
    )
    lr_model_item = lr.fit(train_model_data)
    lr_coefficients_item = lr_model_item.coefficients
    lr_coefficients_intercept = lr_model_item.intercept
    lr_predictions_item = lr_model_item.transform(train_model_data)
    lr_predictions_item_oos = lr_model_item.transform(test_model_data)
    rsquared = lr_model_item.summary.r2
    # Log mlflow attributes for mlflow UI
    mlflow.log_metric("rsquared", rsquared)
    mlflow.log_metric("intercept", lr_coefficients_intercept)
    for i in lr_coefficients_item:
        mlflow.log_metric('coefficients', lr_coefficients_item[i])
I would like to know whether this is possible. In the final output I should have the intercept, coefficients, p-values and the relevant variable names.
If I understand you correctly, you want to register the p-value and coefficient per variable name separately in MLflow. The difficulty with Spark ML is that all columns are generally merged into a single "features" column before being passed to a given estimator (e.g. LinearRegression). As a result, one loses track of which name belongs to which column.
We can get the names of every feature in the "features" column from your linear model by defining the following function [1]:
from itertools import chain

def feature_names(model, df):
    features_dict = df.schema[model.summary.featuresCol].metadata["ml_attr"]["attrs"].values()
    return sorted([(attr["idx"], attr["name"]) for attr in chain(*features_dict)])
The above function returns a sorted list of tuples, in which the first entry is the index of the feature in the "features" column and the second entry is the name of the actual feature.
By using the above function in your code, we can now easily match the feature names with the entries of the "features" column, and therefore register the coefficient and p-value per feature.
def has_pvalue(model):
    ''' Check if the given model supports pValues associated '''
    try:
        model.summary.pValues
        return True
    except Exception:
        return False

with mlflow.start_run():
    lr = LinearRegression(
        featuresCol="features",
        labelCol="label",
        maxIter = 10,
        regParam = 1.0,
        elasticNetParam = 0.0,
        solver = "normal"
    )
    lr_model = lr.fit(train_data)
    mlflow.log_metric("rsquared", lr_model.summary.r2)
    mlflow.log_metric("intercept", lr_model.intercept)
    for index, name in feature_names(lr_model, train_data):
        mlflow.log_metric(f"Coef. {name}", lr_model.coefficients[index])
        if has_pvalue(lr_model):
            # P-values are not always available. This depends on the model configuration.
            mlflow.log_metric(f"P-val. {name}", lr_model.summary.pValues[index])
[1]: Related Stackoverflow question
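The flattening and sorting done inside feature_names can be tried without Spark on a toy dictionary. The sketch below assumes the metadata groups attributes by type, as the chain(*...) call in the answer suggests. The attrs dictionary, the feature names and the coefficient list are all made-up values for illustration.

```python
from itertools import chain

# Hypothetical metadata, shaped like the "attrs" entry read in feature_names.
attrs = {
    "numeric": [{"idx": 1, "name": "age"}, {"idx": 2, "name": "income"}],
    "binary": [{"idx": 0, "name": "is_member"}],
}
# Hypothetical coefficient vector, ordered by feature index.
coefficients = [0.8, -1.5, 0.02]

# Flatten the per-type lists, then sort by feature index.
pairs = sorted((attr["idx"], attr["name"]) for attr in chain(*attrs.values()))

# Pair every name with the coefficient stored at its index.
named = {name: coefficients[idx] for idx, name in pairs}
print(pairs)
print(named)
```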

efficient way of calculating Monte Carlo results for different impact assessment methods in Brightway

I am trying to do a comparative Monte Carlo calculation with brightway2 using different impact assessment methods. I thought about using the switch_method method to be more efficient, since the technosphere matrix is the same for a given iteration. However, I am getting an assertion error. Code to reproduce it could look something like this:
import brightway2 as bw
bw.projects.set_current('ei35') # project with ecoinvent 3.5
db = bw.Database("ei_35cutoff")
# select two different transport activities to compare
activity_name = 'transport, freight, lorry >32 metric ton, EURO4'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE4 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE4['name'])
        break
activity_name = 'transport, freight, lorry >32 metric ton, EURO6'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE6 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE6['name'])
        break
demands = [{truckE4: 1}, {truckE6: 1}]
# impact assessment method:
recipe_midpoint=[method for method in bw.methods.keys()
                 if method[0]=="ReCiPe Midpoint (H)"]
mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
next(mc_mm)
If I try switch_method, I get the assertion error:
mc_mm.switch_method(recipe_midpoint[1])
assert mc_mm.method==recipe_midpoint[1]
mc_mm.redo_lcia()
next(mc_mm)
Am I doing something wrong here?
I usually store characterization factor matrices in a temporary dict and multiply these cfs with the LCI resulting from MonteCarloLCA directly.
import brightway2 as bw
import numpy as np
# Generate objects for analysis
bw.projects.set_current("my_mcs")
my_db = bw.Database('db')
my_act = my_db.random()
my_demand = {my_act:1}
my_methods = [bw.methods.random() for _ in range(2)]
I wrote this simple function to get characterization factor matrices for the product system I will generate in the MonteCarloLCA. It uses a temporary "sacrificial LCA" object that will have the same A and B matrices as the MonteCarloLCA.
This may seem like a waste of time, but it is only done once, and it makes the Monte Carlo loop quicker and simpler.
def get_C_matrices(demand, list_of_methods):
    """ Return a dict with {method tuple:cf_matrix} for a list of methods
    Uses a "sacrificial LCA" with exactly the same demand as will be used
    in the MonteCarloLCA
    """
    C_matrices = {}
    sacrificial_LCA = bw.LCA(demand)
    sacrificial_LCA.lci()
    for method in list_of_methods:
        sacrificial_LCA.switch_method(method)
        C_matrices[method] = sacrificial_LCA.characterization_matrix
    return C_matrices
Then:
# Create array that will store mc results.
# Shape is (number of methods, number of iterations)
my_iterations = 10
mc_scores = np.empty(shape=[len(my_methods), my_iterations])
# Instantiate MonteCarloLCA object
my_mc = bw.MonteCarloLCA(my_demand)
# Get characterization factor matrices
my_C_matrices = get_C_matrices(my_demand, my_methods)
# Generate results
for iteration in range(my_iterations):
    lci = next(my_mc)
    for i, m in enumerate(my_methods):
        mc_scores[i, iteration] = (my_C_matrices[m]*my_mc.inventory).sum()
All your results are in mc_scores. Each row corresponds to a method, each column to an MC iteration.
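The reason this works is that each sampled inventory is computed once and then scored against every method. The idea can be sketched without Brightway in plain Python, treating each characterization matrix as a set of per-flow factors (an assumption of this sketch). The method names, factor values and random inventories are invented for illustration and are not Brightway data.

```python
import random

random.seed(0)

# Hypothetical characterization factors, one per biosphere flow, per method.
factors = {
    ("method", "a"): [1.0, 2.0, 0.0],
    ("method", "b"): [0.5, 0.0, 3.0],
}
iterations = 4
scores = {method: [] for method in factors}

for _ in range(iterations):
    # Stand-in for one sampled inventory: rows are flows, columns are activities.
    inventory = [[random.random() for _ in range(2)] for _ in range(3)]
    flow_totals = [sum(row) for row in inventory]
    # The same inventory sample is scored with every method.
    for method, cfs in factors.items():
        score = sum(cf * total for cf, total in zip(cfs, flow_totals))
        scores[method].append(score)

for method, values in scores.items():
    print(method, values)
```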
Not very elegant, but try this:
iterations = 10
simulations = []
for _ in range(iterations):
    mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
    next(mc_mm)
    mcresults = []
    for i in demands:
        print(i)
        for m in recipe_midpoint[0:3]:
            mc_mm.switch_method(m)
            print(mc_mm.method)
            mc_mm.redo_lcia(i)
            print(mc_mm.score)
            mcresults.append(mc_mm.score)
    simulations.append(mcresults)
CC_truckE4 = [i[1] for i in simulations] # Climate Change, truck E4
CC_truckE6 = [i[1+3] for i in simulations] # Climate Change, truck E6
from matplotlib import pyplot as plt
plt.plot(CC_truckE4 , CC_truckE6, 'o')
If you then run a test in which you do the simulation twice for the same demand vector, by setting demands = [{truckE4: 1}, {truckE4: 1}], and plot the result, you should get a straight line. This means that you are doing dependent sampling, re-using the same tech matrix for each demand vector and for each LCIA method. I am not 100% sure of this, but I hope it answers your question.

Using while and if function together with a condition change

I am trying to use python to conduct a calculation which will sum the values in a column only for the time period that a certain condition is met.
However, the summation should begin when the conditions are met (runstat == 0 and oil >1). The summation should then stop at the point when oil == 0.
I am new to python so I am not sure how to do this.
I connected the code to a spreadsheet for testing purposes but the intent is to connect to live data. I figured a while loop in combination with an if function might work but I am not winning.
Basically I want to have the code start when runstat is zero and oil is higher than 0. It should stop summing the values of oil when the oil row becomes zero and then it should write the data to a SQL database (this I will figure out later - for now I just want to see if it can work).
This is the code I have tried so far:
import numpy as np
import pandas as pd
data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
oil = df['oiltag']
runstat = df['runstattag']
def startup(oil,runstat):
    while oil.all() > 0:
        if oil > 0 and runstat == 0:
            totaloil = sum(oil.all())
            print(totaloil)
        else:
            return None
    return
print(startup(oil.all(), runstat.all()))
It should sum the values in the column but it is returning: None
OK, so I think that what you want to do is get the subset of rows between the two conditions, then get a sum of those.
Method: Slice the dataframe to get the relevant rows and then sum.
import numpy as np
import pandas as pd
data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
def startup(dframe):
    start_row = dframe[(dframe.oiltag > 0) & (dframe.runstattag == 0)].index[0]
    end_row = dframe[(dframe.oiltag == 0) & (dframe.index > start_row)].index[0]
    subset = dframe[start_row:end_row+1] # +1 because the end slice is non-inclusive
    totaloil = subset.oiltag.sum()
    return totaloil
print(startup(df))
This code will raise an error if it can't find a subset of rows that matches your criteria. If you need to handle that case, then we could add some exception handling.
EDIT: Please note this assumes that your criteria are only expected to be met once per Excel file. If you have multiple "chunks" that you want to sum, this will need tweaking.
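For the multiple-chunk case, one possible tweak is to walk through the readings in order and keep a running total. The sketch below uses plain lists rather than a dataframe. The function name sum_chunks and the sample readings are made up, and it assumes a chunk that has not yet reached oil == 0 should not be reported.

```python
def sum_chunks(oil, runstat):
    """Return one total per chunk: start at runstat == 0 and oil > 0, stop at oil == 0."""
    totals = []
    in_chunk = False
    total = 0.0
    for o, r in zip(oil, runstat):
        if not in_chunk and r == 0 and o > 0:
            # Start condition met: begin a new running total.
            in_chunk = True
            total = 0.0
        if in_chunk:
            if o == 0:
                # Stop condition met: store the total and wait for the next start.
                totals.append(total)
                in_chunk = False
            else:
                total += o
    return totals

oil_values = [0, 5, 3, 0, 0, 2, 4, 0]
runstat_values = [1, 0, 1, 1, 1, 0, 0, 0]
chunk_totals = sum_chunks(oil_values, runstat_values)
print(chunk_totals)  # [8.0, 6.0]
```

Because it processes one reading at a time, the same loop could later be fed from live data instead of a spreadsheet column.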

Problems regarding Pyomo provided math functions

I am trying to solve a maximization problem using Pyomo which has a recursive relationship. I am trying to maximize the revenue from a battery and it involves updating the state of charge of the battery every hour (which is the recursive relationship here). I am using the following code:
import pyomo
import numpy as np
from pyomo.environ import *
import pandas as pd
model = ConcreteModel()
N = 24 #number of hours
lmpdata = np.random.randint(1,10,24) #LMP Data (to be imported from MISO/PJM)
R = 0 #discount
eta_s = 0.99 #self-discharge efficiency
eta_c = 0.95 #round-trip efficiency
gammas_min = 0.1 #fraction of energy capacity to reserve for discharging
gammas_max = 0.05 #fraction of energy capacity to reserve for charging
S_bar = 50 #energy capacity
Q_bar = 50 #energy charge/discharge rating
model.qd = Var(range(N), within = NonNegativeReals) #variables for energy sold at time t
model.qr = Var(range(N), within = NonNegativeReals) #variables for energy purchased at time t
model.obj = Objective(expr = sum((model.qd[i]-model.qr[i])*lmpdata[i]*np.exp(-R*(i+1)) for i in range(N)), sense = maximize) #objective function
model.SOC = np.zeros(N) #state of charge (s(t) in Sandia's Model)
model.SOC[0] = 25 #SOC at hour 0
#recursion relation describing the SOC
def con_rule1(model,i):
    model.SOC[i] = eta_s*model.SOC[i-1] + eta_c*model.qr[i-1] - model.qd[i-1]
    return (eta_s*model.SOC[i-1] + eta_c*model.qr[i-1] - model.qd[i-1]== model.SOC[i])
#def con_rule1(model,i):
model.con1 = Constraint(range(1,N), rule = con_rule1)
#model.con2 = Constraint(expr = eta_s*SOC[N-1] + eta_c*model.qr[N-1] - model.qd[N-1] == SOC[0]) #SOC relation for the last hour
#SOC boundaries
def con_rule2(model,i):
    return (gammas_min*S_bar <= eta_s*model.SOC[i] + eta_c*model.qr[i] - model.qd[i] <= (1-gammas_max)*S_bar)
model.con3 = Constraint(range(N), rule = con_rule2)
#limits the total energy charged over each time step to the energy
#charge limit (derived from the power limit)
#It restricts the throughput based on the power rating
def con_rule3(model,i):
    return (0 <= model.qr[i]+model.qd[i] <= Q_bar)
model.con4 = Constraint(range(N),rule = con_rule3)
def pyomo_postprocess(options=None, instance=None, results=None):
    model.qd.display()
    model.qr.display()
model.pprint()
However, when I try to run the code, I am getting the following error:
Implicit conversion of Pyomo NumericValue type `<class 'pyomo.core.kernel.expr_coopr3._SumExpression'>' to a float is
disabled. This error is often the result of using Pyomo components as
arguments to one of the Python built-in math module functions when
defining expressions. Avoid this error by using Pyomo-provided math
functions.
I could not find any reference to Pyomo's math function in its documentation. It would be great if anyone could help me solve this problem!
Pyomo defines its own set of math module functions for operations like exp, log, sin, etc. If you want to use any of these functions in your Pyomo expressions you should make sure they are the ones provided by Pyomo and not from some other Python package. I think the issue with your model is that you are using np.exp in your Objective function. The Pyomo math functions are automatically imported when you import pyomo.environ so you should be able to replace np.exp with exp to get the Pyomo-defined function.
