Get feature names for dataframe.corr - python-3.x

I am using the cancer data set from sklearn and I need to find the correlations between features. I am able to find the correlated columns, but I am not able to present them in a "nice" way, so that they can be used as input for DataFrame.drop.
Here is my code:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
corr = df.corr()
# keep only the upper triangle, then filter to find correlations above 0.6
corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu = corr_triu.stack()
corr_result = corr_triu[corr_triu > 0.6]
print(corr_result)
df.drop(columns=[?])

IIUC, you want the columns that correlate with some other column in the dataset, i.e., drop the columns that don't appear in corr_result. So you'll want to get the unique variables from each level of the corr_result index. There may be repeats, so take care of that as well, for example with sets:
corr_result.index = corr_result.index.remove_unused_levels()
corr_vars = set()
corr_vars.update(corr_result.index.unique(level=0))
corr_vars.update(corr_result.index.unique(level=1))
all_vars = set(df.columns)
df.drop(columns=all_vars - corr_vars)
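Equivalently, you can pull both levels of the MultiIndex in one go with get_level_values; a minimal sketch, assuming the corr_result built above:
corr_vars = set(corr_result.index.get_level_values(0)) | set(corr_result.index.get_level_values(1))
to_drop = [c for c in df.columns if c not in corr_vars]
df_reduced = df.drop(columns=to_drop)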

Related

FeatureTools - How to Add 2 Columns Together?

I'm stuck. Using Featuretools, all I want to do is create a new column that sums two columns together from my dataset, creating a "stacked" feature of sorts, and do this for all columns in my dataset.
My code looks like this:
import featuretools as ft
import pandas as pd
import warnings

# Define the function
def feature_engineering_dataset(df):
    es = ft.EntitySet(id='stockdata')
    # Make the "Date" index an actual column because defining it as the index below
    # throws a "can't find Date in index" error for some reason.
    df = df.reset_index()
    # Save some columns not used in Featuretools to concat back later
    dates = df['Date']
    tickers = df['Ticker']
    dailychange = df['DailyChange']
    classes = df['class']
    dataframe = df.drop(['Date', 'Ticker', 'DailyChange', 'class'], axis=1)
    # Define the entity
    # Won't find Date, so it uses a numbered index. We'll re-define Date as the index later.
    es.entity_from_dataframe(entity_id='data', dataframe=dataframe, index='Date')
    # Pesky warnings
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    warnings.filterwarnings("once", category=ImportWarning)
    # Run deep feature synthesis
    feature_matrix, feature_defs = ft.dfs(n_jobs=-2, entityset=es, target_entity='data',
                                          chunk_size=0.015, max_depth=2, verbose=True,
                                          agg_primitives=['sum'],
                                          trans_primitives=[])
    # Now re-add the previously saved columns because Featuretools dropped them
    df = pd.concat([dates, tickers, feature_matrix, dailychange, classes], axis=1)
    df = df.set_index(['Date'])
    # Return our new dataset!
    return df

# Now run that defined function
df = feature_engineering_dataset(df)
I'm not sure what's really happening here, but I've defined a depth of 2, so my understanding is that for every pair of columns in my dataset, it'll create a new column that sums the two together?
My initial dataframe has 3101 columns, and when I run this command it says Built 3098 features, and the final df has 3098 columns after the concatenation. That isn't right: it should have all my original features, PLUS the engineered ones.
How can I achieve what I'm after? The examples on the featuretools page and API docs are extremely confusing and deal a lot with dated examples, like "time_since_last" trans primitives and other stuff that doesn't seem to apply here. Thanks!
Thanks for the question. You can create a new column that sums two columns by using the transform primitive add_numeric. I'll go through a quick example using this data.
id time open high low close
0 2019-07-10 07:00:00 1.053362 1.053587 1.053147 1.053442
1 2019-07-10 08:00:00 1.053457 1.054057 1.053457 1.053987
2 2019-07-10 09:00:00 1.053977 1.054192 1.053697 1.053917
3 2019-07-10 10:00:00 1.053902 1.053907 1.053522 1.053557
4 2019-07-10 11:00:00 1.053567 1.053627 1.053327 1.053397
First, we create the entity set for the data.
import featuretools as ft

es = ft.EntitySet('stockdata')
es.entity_from_dataframe(
    entity_id='data',
    dataframe=df,
    index='id',
    time_index='time',
)
Now, we apply DFS using the transform primitive to add the numeric columns.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='data',
    trans_primitives=['add_numeric'],
)
Then, the new engineered features are returned along with the original ones.
feature_matrix
open high low close close + high low + open high + low close + open high + open close + low
id
0 1.053362 1.053587 1.053147 1.053442 2.107029 2.106509 2.106734 2.106804 2.106949 2.106589
1 1.053457 1.054057 1.053457 1.053987 2.108044 2.106914 2.107514 2.107444 2.107514 2.107444
2 1.053977 1.054192 1.053697 1.053917 2.108109 2.107674 2.107889 2.107894 2.108169 2.107614
3 1.053902 1.053907 1.053522 1.053557 2.107464 2.107424 2.107429 2.107459 2.107809 2.107079
4 1.053567 1.053627 1.053327 1.053397 2.107024 2.106894 2.106954 2.106964 2.107194 2.106724
You can see a list of all the built-in primitives by calling the function ft.list_primitives().
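For instance, a quick way to see which transform primitives exist (a sketch; the column names of the primitives table may differ slightly between Featuretools versions):
primitives = ft.list_primitives()
# keep only transform primitives such as add_numeric
print(primitives[primitives['type'] == 'transform'].head())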

Dynamically generating an object's name in a pandas column using a for loop (fuzzywuzzy)

Low-level python skills here (learned programming with SAS).
I am trying to apply a series of fuzzy string matching (fuzzywuzzy lib) formulas on pairs of strings, stored in a base dataframe. Now I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then append all these sub-dataframes into a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframes, the resulting values get overwritten at each turn of the loop.
Or should I create one dataframe in a single loop, taking my formula names and expressions from a dict? This runs into the same problem as above.
Here is my formulas dict:
# ratios dict: all ratio names and functions
ratios = {"ratio": fuzz.ratio,
          "partial ratio": fuzz.partial_ratio,
          "token sort ratio": fuzz.token_sort_ratio,
          "partial token sort ratio": fuzz.partial_token_sort_ratio,
          "token set ratio": fuzz.token_set_ratio,
          "partial token set ratio": fuzz.partial_token_set_ratio
          }
And here is the loop I am currently sweating over:
# for loop iterating over the ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])

    # new base df containing ratio data and calculations for the current loop turn
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
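A slightly cleaner variant of the same idea (a sketch, assuming the same df_out, base_column and target_column as above) is to collect the per-ratio frames in a list and concatenate once at the end:
frames = []
for r, rn in ratios.items():
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(lambda row: rn(row[base_column], row[target_column]), axis=1)
    frames.append(df_out1)
final_df = pd.concat(frames, ignore_index=True)
This avoids re-copying final_df on every iteration and sidesteps the need to name each sub-dataframe dynamically.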

Functions to prevent repetition of code (Pandas for Python using Jupyter Notebook)

I am new to programming and would appreciate your help.
Trying to avoid repetition of code for querying on a pandas dataframe.
x1 is the dataframe with various column names such as Hypertension, Diabetes, Alcoholism, Handicap, Age_Group, Date_Appointment
Each of the disease columns listed above contains 0 (the patient does not have the disease) or 2/3/4 (different stages of the disease).
So when I filter on '!= 0' it will list records for patients with that specific disease; as such, each disease filters out a different set of records.
I wrote the query below 4 times, replacing the word Hypertension with the other diseases, to get 4 different graphs, one per disease.
But it is not clean coding. I need help understanding which function could be used, and how to use it, to write just one query instead of four.
hyp1 = x1.query('Hypertension != 0')
i1 = hyp1.groupby('Age_Group')['Hypertension'].value_counts().plot(kind='bar', label='Hypertension', figsize=(6, 6))
plt.title('Appointments Missed by Patients with Hypertension')
plt.xlabel('Hypertension Age_Group')
plt.ylabel('Appointments missed');
Below is another set I don't know how to condense.
print('Details of all appointments')
print('')
print(df.Date_Appointment.value_counts().sort_index())
print('')
print(df.Date_Appointment.describe())
print('')
print(df.Date_Appointment.value_counts().describe())
print('')
print('Mean = ', (round(df.Date_Appointment.value_counts().mean())))
print('Median = ', (round(df.Date_Appointment.value_counts().median())))
print('Mode = ', (df.Date_Appointment.value_counts().mode()))
Would appreciate your detailed response. Thank you in advance.
Create a collection of the desired columns (here, a dict mapping each disease column to a bar color)
Iterate through them
Use f-strings (e.g. f'{disease} != 0')
diseases = {'Hypertension': 'red', 'Diabetes': 'blue', 'Alcoholism': 'green', 'Handicap': 'yellow'}
for disease, color in diseases.items():
    subset = x1.query(f'{disease} != 0')
    i1 = subset.groupby('Age_Group')[disease].value_counts().plot(kind='bar', label=disease, figsize=(6, 6), color=color)
    plt.title(f'Appointments Missed by Patients with {disease}')
    plt.xlabel(f'{disease} Age Group')
    plt.ylabel('Appointments missed')
    plt.show()
Incidentally, this would be easier with sample data to work with
For the second half, it's not clear what you want to condense or replace Date_Appointment with.
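That said, if the goal is simply to avoid repeating the same block of print statements for different columns, one option (a sketch, assuming the column holds appointment dates as in the question) is to wrap them in a small helper:
def describe_counts(df, column='Date_Appointment', title='Details of all appointments'):
    # value_counts() gives appointments per date; reuse it for all the statistics below
    counts = df[column].value_counts()
    print(title)
    print('')
    print(counts.sort_index())
    print('')
    print(df[column].describe())
    print('')
    print(counts.describe())
    print('')
    print('Mean = ', round(counts.mean()))
    print('Median = ', round(counts.median()))
    print('Mode = ', counts.mode())

describe_counts(df, 'Date_Appointment')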

How do I check if a data set is normal or not in Python?

So I'm creating a master program for machine learning from scratch in Python, and the first step I want to take is to check whether the data set is normal or not.
PS: the data set can have many features or just a single feature.
It has to be implemented in Python 3.
Also, normalizing the data can be done with the function below, right?
# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])
Thanks in advance!
Your question mixes up two different things: if your features do not come from a normal distribution, you cannot "normalize" them in the sense of changing their distribution; min-max rescaling like the function above only changes their range, not their shape. If you mean checking whether they have a mean of 0 and a standard deviation of 1, that is a different ballgame.
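If what you actually want is a statistical check for normality (rather than rescaling), a minimal sketch using SciPy's D'Agostino-Pearson test, applied column by column, could look like this:
import numpy as np
from scipy import stats

def columns_look_normal(dataset, alpha=0.05):
    # Returns one boolean per column: True means we cannot reject normality at level alpha.
    data = np.asarray(dataset, dtype=float)
    if data.ndim == 1:
        data = data.reshape(-1, 1)
    results = []
    for i in range(data.shape[1]):
        _, p_value = stats.normaltest(data[:, i])
        results.append(p_value > alpha)
    return results
Note that normaltest needs a reasonable number of samples (it warns below roughly 20 observations), and failing to reject is not proof of normality.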

Unalignable boolean Series key provided

I keep running into this problem when trying to take a slice of a pandas DataFrame. It has four columns; we'll call it all_data.
I can plot it nicely with the code below:
generations = ['Boomers+', 'Gen X', 'Millennials', 'Students', 'Gen Z']
fig = plt.figure(figsize=(15, 15))
for i, generation in enumerate(generations):
    ax = plt.subplot(2, 3, i + 1)
    ix = all_data['CustomerGroups'] == generation
    kmf.fit(T[ix], C[ix], label=generation)
    kmf.plot(ax=ax, legend=False)
    plt.title(generation, size=16)
    plt.xlabel('Timeline in years')
    plt.xlim(0, 5)
    if i == 0:
        plt.ylabel('Frac Remaining After $n$ Yrs')
    if i == 3:
        plt.ylabel('Frac Remaining After $n$ Yrs')
fig.tight_layout()
#fig.suptitle('Survivability of Checking Accts By Generation', size = 16)
fig.subplots_adjust(top=0.88, hspace=.4)
plt.show()
However, I would like to do a couple of similar things. The CustomerGroups column has NaN in it, which is the reason for the manual array of generations.
Every operation I try to slice the dataframe and remove the NaN gives me the error Unalignable boolean Series key provided when I attempt to plot it using the same code with just the dataframe changed.
Likewise, the Channel column can be either Online or Branch. Again, any way I try to split all_data by the Channel column to create new dataframes for the Online or Branch criteria, I get the same error about a boolean index.
I have tried many options from other posts here (reset_index, pd.notnull, etc.), but it keeps creating the index problem when I go to plot with the same code.
What is the best method to create subsets from all_data that does not cause the Unalignable boolean Series key error?
This worked in the past:
# create the slice using .copy()
online = checkingRed[['Channel', 'State', 'CustYears', 'Observed',
                      'CustomerGroups', 'ProductYears']].copy()
# remove the branch accounts from the data set
online = online.loc[online['Channel'] != 'Branch']  # use .loc for a cleaner slice
# reset the index so that it is unique
online['index'] = np.arange(len(online))
online = online.set_index('index')
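The unalignable-key error usually means the boolean mask was built against a DataFrame whose index no longer matches the object being sliced. A sketch of one way around it (assuming kmf is lifelines' KaplanMeierFitter, and that T and C come from columns such as CustYears and Observed, as in the snippet above): drop the NaN rows first, then rebuild T, C and the mask from the same cleaned frame so their indexes stay aligned:
clean = all_data.dropna(subset=['CustomerGroups']).reset_index(drop=True)
T = clean['CustYears']   # durations
C = clean['Observed']    # event-observed flag
for generation in clean['CustomerGroups'].unique():
    ix = clean['CustomerGroups'] == generation  # mask shares clean's index, so T[ix] and C[ix] align
    kmf.fit(T[ix], C[ix], label=generation)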
