FeatureTools - How to Add 2 Columns Together? - python-3.x

I'm stuck. Using Featuretools, all I want to do is create a new column that sums two columns together from my dataset, creating a "stacked" feature of sorts. I want to do this for every pair of columns in my dataset.
My code looks like this:
import warnings
import featuretools as ft
import pandas as pd

# Define the function
def feature_engineering_dataset(df):
    es = ft.EntitySet(id='stockdata')
    # Make the "Date" index an actual column because defining it as the index below throws
    # a "can't find Date in index" error for some reason.
    df = df.reset_index()
    # Save some columns not used in Featuretools to concat back later
    dates = df['Date']
    tickers = df['Ticker']
    dailychange = df['DailyChange']
    classes = df['class']
    dataframe = df.drop(['Date', 'Ticker', 'DailyChange', 'class'], axis=1)
    # Define the entity
    es.entity_from_dataframe(entity_id='data', dataframe=dataframe, index='Date')  # Won't find Date so uses a numbered index. We'll re-define Date as index later
    # Pesky warnings
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    warnings.filterwarnings("once", category=ImportWarning)
    # Run deep feature synthesis
    feature_matrix, feature_defs = ft.dfs(n_jobs=-2, entityset=es, target_entity='data',
                                          chunk_size=0.015, max_depth=2, verbose=True,
                                          agg_primitives=['sum'],
                                          trans_primitives=[])
    # Now re-add previous columns because featuretools...
    df = pd.concat([dates, tickers, feature_matrix, dailychange, classes], axis=1)
    df = df.set_index(['Date'])
    # Return our new dataset!
    return df
# Now run that defined function
df = feature_engineering_dataset(df)
I'm not sure what's really happening here, but I've defined a depth of 2, so my understanding is that for every pair of columns in my dataset it will create a new column that sums the two together?
My initial dataframe has 3101 columns, and when I run this it says Built 3098 features, and the final df has 3098 columns after the concatenating, which isn't right: it should have all my original features PLUS the engineered ones.
How can I achieve what I'm after? The examples on the featuretools page and API docs are extremely confusing and deal a lot with dated examples, like "time_since_last" trans primitives and other stuff that doesn't seem to apply here. Thanks!

Thanks for the question. You can create a new column that sums two columns by using the transform primitive add_numeric. I'll go through a quick example using this data.
id time open high low close
0 2019-07-10 07:00:00 1.053362 1.053587 1.053147 1.053442
1 2019-07-10 08:00:00 1.053457 1.054057 1.053457 1.053987
2 2019-07-10 09:00:00 1.053977 1.054192 1.053697 1.053917
3 2019-07-10 10:00:00 1.053902 1.053907 1.053522 1.053557
4 2019-07-10 11:00:00 1.053567 1.053627 1.053327 1.053397
First, we create the entity set for the data.
import featuretools as ft
es = ft.EntitySet('stockdata')
es.entity_from_dataframe(
    entity_id='data',
    dataframe=df,
    index='id',
    time_index='time',
)
Now, we apply DFS with the add_numeric transform primitive, which creates a new feature for every pair of numeric columns by summing them.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='data',
    trans_primitives=['add_numeric'],
)
Then, the new engineered features are returned along with the original ones.
feature_matrix
open high low close close + high low + open high + low close + open high + open close + low
id
0 1.053362 1.053587 1.053147 1.053442 2.107029 2.106509 2.106734 2.106804 2.106949 2.106589
1 1.053457 1.054057 1.053457 1.053987 2.108044 2.106914 2.107514 2.107444 2.107514 2.107444
2 1.053977 1.054192 1.053697 1.053917 2.108109 2.107674 2.107889 2.107894 2.108169 2.107614
3 1.053902 1.053907 1.053522 1.053557 2.107464 2.107424 2.107429 2.107459 2.107809 2.107079
4 1.053567 1.053627 1.053327 1.053397 2.107024 2.106894 2.106954 2.106964 2.107194 2.106724
You can see a list of all the built-in primitives by calling the function ft.list_primitives().
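ft.list_primitives() returns an ordinary DataFrame (recent versions include columns such as name, type, and description), so it can be filtered like any other frame. For example, a quick sketch to see only the transform primitives:
primitives = ft.list_primitives()
print(primitives[primitives['type'] == 'transform'].head())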

Related

Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame"

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the linked information in the error message as well as myriad posts on this forum, but none of them address repeatedly looping through a dataframe that changes as the loop advances.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried df.copy() to no avail: the warning was still thrown.
Example code:
import pandas as pd

a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy', 'dairy', 'vegan', 'meat']
    }
)
print(a)

z = [x/(x+2) for x in range(1, 5)]
print(z)

# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion of the dataframe that will be modified. What is below
    # this is already modified and locked into the geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind+len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the model mixing below the boundary layer (key error)
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction]*z[fraction]
    # Append the original dataframe with new data:
    a.iloc[ind:(ind+loop), :] = b
    # Tried df.copy(), but still throws warning!
    # a.iloc[ind:(ind+loop), :] = b.copy()
print(a)
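For reference, a minimal sketch of a variant with no chained indexing (same a and z as above, default RangeIndex assumed): every read and write goes through a single .loc call on the original frame, which should sidestep the warning while keeping the same cumulative-mixing behaviour.
# Same mixing idea, but each read and write uses one .loc indexer on the
# original frame, so pandas never sees chained indexing.
for ind in a.index:
    loop = min(len(z), len(a) - ind)
    # .loc slicing on the default RangeIndex is label-based and inclusive, hence loop - 1
    a.loc[ind:ind + loop - 1, 'a'] = a.loc[ind:ind + loop - 1, 'a'].to_numpy() * z[:loop]
print(a)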

Pandas or Python method for removing unwanted string elements in a column, based on strings in another column

I have a problem similar to this question.
I am importing a large .csv file into pandas for a project. One column in the dataframe contains what is ultimately 4 columns of concatenated data (I can't control the data I receive): a brand name (what I want to remove), a product description, a product size, and a UPC. Please note that the brand description in Item_UPC does not always == Brand.
for example
import pandas as pd

df = pd.DataFrame({'Item_UPC': ['fubar baz dr frm prob onc dly wmn ogc 30vcp 06580-66-832',
                                'xxx stuff coll tides 20 oz 09980-66-832',
                                'hel world sambucus elder 60 chw 0392-67-491',
                                'northern cold ultimate 180 sg 06580-66-832',
                                'ancient nuts boogs 16oz 58532-42-123 '],
                   'Brand': ['FUBAR OF BAZ',
                             'XXX STUFF',
                             'HELLO WORLD',
                             'NORTHERN COLDNITES',
                             'ANCIENT NUTS']})
I want to remove the brand name from the Item_UPC column as this is redundant information, among other issues. Currently I have a function that takes in the new df, pulls out the UPC, and cleans it up to match what one finds on bottles and another database I have for a single brand, minus the last checksum digit.
def clean_upc(df):
    # take in a dataframe, expand the number of columns into a temp
    # dataframe
    temp = df["Item_UPC"].str.rsplit(" ", n=1, expand=True)
    # add columns to main dataframe from temp
    df.insert(0, "UPC", temp[1])
    df.insert(1, "Item", temp[0])
    # drop original combined column
    df.drop(columns=["Item_UPC"], inplace=True)
    # remove leading zero on and hyphens in UPC
    df["UPC"] = df["UPC"].apply(lambda x: x[1:] if x.startswith("0") else x)
    df["UPC"] = df["UPC"].apply(lambda x: x.replace('-', ''))
    col_names = df.columns
    # make all columns lower case to ease searching
    for cols in col_names:
        df[cols] = df[cols].apply(lambda x: x.lower() if type(x) == str else x)
after running this I have a data frame with three columns
UPC, Item, Brand
The data frame has over 300k rows and 2300 unique brands in it. There is also no consistent manner in which they shorten names. When I run the following code
temp = df["Item"].str.rsplit(" ", expand = True)
temp has a shape of
temp.shape
(329868, 13)
which makes manual curating a pain when most of columns 9-13 are empty.
Currently my logic is to first split Brand into two columns while dropping the first column in temp:
brand = df["brand"].str.rsplit(" ", n=1,expand = True) #produce a dataframe of two columns
temp.drop(columns= [0], inplace=True)
and then do a string replace on temp[1] to see if it contains brand[1] (via regex) and replace it with " ", or vice versa, and then concatenate temp back together:
temp["combined"] = temp[1] + temp[2]....+temp[13]
and replace the existing Item column with the combined column
df["Item"] = temp["combined"]
Or is there a better way all around? There are many brands that only have one name, which may make everything faster. I have been struggling with regex; logically it seems like this would be faster, I just have a hard time coming up with the syntax to make it work.
Because the input does not follow any well-defined rules, this looks like more of an optimization problem. You can start by stripping exact matches:
df["Item_cleaned"] = df.apply(lambda x: x.Item_UPC.lstrip(x.Brand.lower()), axis=1)
output:
Item_UPC Brand Item_cleaned
0 fubar baz dr frm prob onc dly wmn ogc 30vcp 06... FUBAR OF BAZ dr frm prob onc dly wmn ogc 30vcp 06580-66-832
1 xxx stuff coll tides 20 oz 09980-66-832 XXX STUFF coll tides 20 oz 09980-66-832
2 hel world sambucus elder 60 chw 0392-67-491 HELLO WORLD sambucus elder 60 chw 0392-67-491
3 northern cold ultimate 180 sg 06580-66-832 NORTHERN COLDNITES ultimate 180 sg 06580-66-832
4 ancient nuts boogs 16oz 58532-42-123 ANCIENT NUTS boogs 16oz 58532-42-123
This will strip exact matches and write the output to a new column, Item_cleaned. (Note that str.lstrip treats its argument as a set of characters rather than a literal prefix, which is why the abbreviated brands in this sample are also stripped; the flip side is that it can occasionally over-strip.) If your input is abbreviated, you should apply a more complex fuzzy string matching algorithm. This may be prohibitively slow, however. In that case, I would recommend a two-step method: save all rows that have been cleaned by the approach above, and do a second pass for more complicated cleaning as needed.
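If you do end up needing that second pass, here is a rough sketch of one way to do it with the standard library's difflib. The helper name, the 0.7 cutoff, and reusing the Item_cleaned column are illustrative assumptions, not something taken from your data.
import difflib

def strip_brand_fuzzy(item, brand, cutoff=0.7):
    # Hypothetical second pass: try every leading run of tokens in the item,
    # keep the run that most resembles the lowercased brand, and strip it
    # only when the similarity clears the cutoff.
    brand = brand.lower()
    tokens = item.split()
    best_i, best_score = 0, 0.0
    for i in range(1, len(tokens) + 1):
        score = difflib.SequenceMatcher(None, " ".join(tokens[:i]), brand).ratio()
        if score > best_score:
            best_i, best_score = i, score
    return " ".join(tokens[best_i:]) if best_score >= cutoff else item

# Only revisit rows the exact pass left untouched
mask = df["Item_cleaned"] == df["Item_UPC"]
df.loc[mask, "Item_cleaned"] = df.loc[mask].apply(
    lambda x: strip_brand_fuzzy(x.Item_UPC, x.Brand), axis=1)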

Append each value in a DataFrame to a np vector, grouping by column

I am trying to create a list, which will be fed as input to the neural network of a Deep Reinforcement Learning model.
What I would like to achieve:
This list should have the properties of this code's output
vec = []
lines = open("data/" + "GSPC" + ".csv", "r").read().splitlines()
for line in lines[1:]:
    vec.append(float(line.split(",")[4]))
i.e. just a flat list of values.
The original dataframe looks like:
Out[0]:
Close sma15
0 1.26420 1.263037
1 1.26465 1.263193
2 1.26430 1.263350
3 1.26450 1.263533
but by using df.transpose() I obtained the following:
0 1 2 3
Close 1.264200 1.264650 1.26430 1.26450
sma15 1.263037 1.263193 1.26335 1.263533
from here I would like to obtain a list grouped by column, of the type:
[1.264200, 1.263037, 1.264650, 1.263193, 1.26430, 1.26335, 1.26450, 1.263533]
I tried
x = np.array(df.values.tolist(), dtype = np.float32).reshape(1,-1)
but this gives me an array of floats with 1 row and 6 columns; how could I achieve a result that has the properties I am looking for?
From what I can understand, you just want a flattened version of the DataFrame's values. That can be done simply with the ndarray.flatten() method rather than reshaping it.
import pandas as pd

# Creating your DataFrame object
a = [[1.26420, 1.263037],
     [1.26465, 1.263193],
     [1.26430, 1.263350],
     [1.26450, 1.263533]]
df = pd.DataFrame(a, columns=['Close', 'sma15'])

df.values.flatten()
This gives array([1.2642, 1.263037, 1.26465, 1.263193, 1.2643, 1.26335, 1.2645, 1.263533]) as is (presumably) desired.
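If you specifically want a plain Python list (like the vec built from the CSV at the top of your question), you can chain tolist() onto the flattened array:
vec = df.values.flatten().tolist()
# vec is now a plain Python list:
# [1.2642, 1.263037, 1.26465, 1.263193, 1.2643, 1.26335, 1.2645, 1.263533]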
PS: I am not sure why you have not included the last row of the DataFrame as the output of your transpose operation. Is that an error?

Get feature names for dataframe.corr

I am using the breast cancer data set from sklearn and I need to find the correlations between features. I am able to find the correlated columns, but I am not able to present them in a "nice" way, so that they can be an input for DataFrame.drop.
Here is my code:
from sklearn.datasets import load_breast_cancer
import pandas as pd

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
corr = df.corr()
# filter to find correlations above 0.6
corr_triu = corr.where(~pd.np.tril(pd.np.ones(corr.shape)).astype(pd.np.bool))
corr_triu = corr_triu.stack()
corr_result = corr_triu[corr_triu > 0.6]
print(corr_result)
df.drop(columns=[?])
IIUC, you want the columns that correlate with some other column in the dataset, i.e. drop columns that don't appear in corr_result. So you'll want to get the unique variables from the index of corr_result, from each level. There may be repeats, so take care of that as well, for example with sets:
corr_result.index = corr_result.index.remove_unused_levels()
corr_vars = set()
corr_vars.update(corr_result.index.unique(level=0))
corr_vars.update(corr_result.index.unique(level=1))
all_vars = set(df.columns)
df.drop(columns=all_vars - corr_vars)
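Equivalently, since get_level_values() only reports the labels actually present in the filtered, stacked index, the unused-levels step can be skipped; a compact sketch:
corr_vars = set(corr_result.index.get_level_values(0)) | set(corr_result.index.get_level_values(1))
df.drop(columns=set(df.columns) - corr_vars)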

Unalignable boolean Series key provided

I keep running into this problem when trying to take a slice of a pandas DataFrame. It has four columns (including CustomerGroups and Channel) and we'll call it all_data.
I can plot it nicely with the code below:
generations = ['Boomers+', 'Gen X', 'Millennials', 'Students', 'Gen Z']
fig = plt.figure(figsize=(15, 15))
for i, generations in enumerate(generations):
    ax = plt.subplot(2, 3, i+1)
    ix = all_data['CustomerGroups'] == generations
    kmf.fit(T[ix], C[ix], label=generations)
    kmf.plot(ax=ax, legend=False)
    plt.title(generations, size=16)
    plt.xlabel('Timeline in years')
    plt.xlim(0, 5)
    if i == 0:
        plt.ylabel('Frac Remaining After $n$ Yrs')
    if i == 3:
        plt.ylabel('Frac Remaining After $n$ Yrs')
fig.tight_layout()
# fig.suptitle('Survivability of Checking Accts By Generation', size=16)
fig.subplots_adjust(top=0.88, hspace=.4)
plt.show()
However, I would like to do a couple of things that seem similar. The CustomerGroups column has NaN in it which is the reason for the manual array of generations.
It seems like every operation to slice the dataframe and remove the NaN gives me an Unalignable boolean Series key provided error when I attempt to plot it using the same code and just changing the dataframe.
Likewise, the Channel column can be either Online or Branch. Again, any way I try to split all_data by the Channel column to create new dataframes from either the Online or Branch criteria, I get the same error regarding a boolean index.
I have tried many options from other posts here (reset_index, pd.notnull, etc.), but it keeps creating the index problem when I go to plot with the same code.
What is the best method to create subsets from all_data that does not create the Unalignable boolean Series key error?
This worked in the past:
# create the slice using .copy
online = checkingRed[['Channel', 'State', 'CustYears', 'Observed',
                      'CustomerGroups', 'ProductYears']].copy()
# remove the branch from the data set
online = online.loc[online['Channel'] != 'Branch']  # use .loc for cleaner slice
# reset the index so that it is unique
online['index'] = np.arange(len(online))
online = online.set_index('index')
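For reference, a minimal sketch of one way the error can be avoided, assuming T and C were originally built from all_data (e.g. T = all_data['CustYears'], C = all_data['Observed'], both hypothetical column choices here): filter the frame first, then rebuild every Series from the same filtered frame so the boolean mask and the Series it indexes share an identical index.
from lifelines import KaplanMeierFitter

# Filter once, then derive everything from the filtered frame
subset = all_data.dropna(subset=['CustomerGroups'])   # drop rows with NaN groups
subset = subset[subset['Channel'] == 'Online']        # or 'Branch'
subset = subset.reset_index(drop=True)                # fresh, unique index

T = subset['CustYears']   # durations, rebuilt from the filtered frame
C = subset['Observed']    # event indicator, rebuilt from the filtered frame

kmf = KaplanMeierFitter()
for gen in subset['CustomerGroups'].unique():
    ix = subset['CustomerGroups'] == gen   # mask and T/C now share the same index
    kmf.fit(T[ix], C[ix], label=gen)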
