Unalignable boolean Series key provided - python-3.x

I keep running into this problem when trying to take a slice of a pandas DataFrame. It has four columns as shown below and we'll call it all_data:
I can plot it nicely with the code below:
import matplotlib.pyplot as plt

# kmf is assumed to be a lifelines KaplanMeierFitter, with T and C the
# duration and event arrays defined earlier alongside all_data.
generations = ['Boomers+', 'Gen X', 'Millennials', 'Students', 'Gen Z']
fig = plt.figure(figsize=(15, 15))
for i, generation in enumerate(generations):  # renamed to avoid shadowing the list
    ax = plt.subplot(2, 3, i + 1)
    ix = all_data['CustomerGroups'] == generation
    kmf.fit(T[ix], C[ix], label=generation)
    kmf.plot(ax=ax, legend=False)
    plt.title(generation, size=16)
    plt.xlabel('Timeline in years')
    plt.xlim(0, 5)
    if i == 0:
        plt.ylabel('Frac Remaining After $n$ Yrs')
    if i == 3:
        plt.ylabel('Frac Remaining After $n$ Yrs')
fig.tight_layout()
#fig.suptitle('Survivability of Checking Accts By Generation', size=16)
fig.subplots_adjust(top=0.88, hspace=.4)
plt.show()
However, I would like to do a couple of things that seem similar. The CustomerGroups column contains NaN values, which is why I build the array of generations manually.
It seems like every operation I use to slice the dataframe and remove the NaN rows gives me the error Unalignable boolean Series key provided when I attempt to plot the result using the same code, just swapping in the new dataframe.
Likewise, the Channel column can be either Online or Branch. Again, any way I try to split all_data by the Channel column to create new dataframes from the Online or Branch criteria, I get the same error regarding a boolean index.
I have tried many options from other posts here (reset_index, pd.notnull, etc.), but the index problem reappears when I go to plot with the same code.
What is the best method to create subsets of all_data that does not produce the Unalignable boolean Series key error?
This worked in the past:
#create the slice using .copy()
online = checkingRed[['Channel', 'State', 'CustYears', 'Observed',
                      'CustomerGroups', 'ProductYears']].copy()

#remove the Branch rows from the data set
online = online.loc[online['Channel'] != 'Branch']  #use .loc for a cleaner slice

#reset the index so that it is unique
online['index'] = np.arange(len(online))
online = online.set_index('index')
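In general, this error means the boolean mask and the object being indexed no longer share the same index, which happens here because T and C were built from the original all_data while the mask comes from a filtered copy. A minimal sketch of one way around it, assuming CustYears and Observed are the duration and event columns (names taken from the snippet above):

# Drop the NaN rows once, then derive everything (mask, durations, events)
# from the same filtered frame so every index stays aligned.
clean = all_data.dropna(subset=['CustomerGroups']).reset_index(drop=True)
T = clean['CustYears']   # assumed duration column
C = clean['Observed']    # assumed event/censoring column
ix = clean['CustomerGroups'] == 'Gen X'
kmf.fit(T[ix], C[ix], label='Gen X')  # mask and series now share one index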

Related

Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame"

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the information linked in the error message, as well as myriad posts on this forum, but none address repeatedly looping over a dataframe that changes as you go.
What I've tried, and some possible solutions
Below is some example code. It works more or less as I want it to, but it produces the warning. Should I:
- Suppress the warning and continue working with this architecture? In that case, am I asking for trouble with unreproducible results?
- Try a different architecture altogether, such as a numpy array built from the original dataframe?
- Try df.append() or df.copy() to avoid the warning?
I have tried df.copy() to no avail - the warning was still thrown.
Example code:
import pandas as pd

a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy',
              'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy',
              'dairy', 'vegan', 'meat']
    }
)
print(a)

z = [x/(x+2) for x in range(1, 5)]
print(z)

# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion that will be
    # modified. What is below it is already modified and locked into the
    # geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind + len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the
    # model mixing below the boundary layer (key error).
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction] * z[fraction]
    # Append the original dataframe with new data:
    a.iloc[ind:(ind + loop), :] = b
    # Tried df.copy(), but it still throws the warning!
    #a.iloc[ind:(ind+loop), :] = b.copy()
print(a)
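One way to avoid the warning altogether is to skip the intermediate slice b and write through a single .iloc call on the original frame. A minimal sketch under that approach (same a and z as above; it reproduces the example's arithmetic, not necessarily the full soil model):

import numpy as np

for ind in range(len(a)):
    loop = min(len(z), len(a) - ind)
    col = a.columns.get_loc('a')
    # One vectorized write through .iloc on the original frame: no chained
    # indexing, so no SettingWithCopyWarning.
    a.iloc[ind:ind + loop, col] = (
        a.iloc[ind:ind + loop, col].to_numpy() * np.array(z[:loop]))
print(a)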

Dynamically generating an object's name in a panda column using a for loop (fuzzywuzzy)

Low-level python skills here (learned programming with SAS).
I am trying to apply a series of fuzzy string matching (fuzzywuzzy lib) formulas on pairs of strings, stored in a base dataframe. Now I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then appends all these sub-dataframes into a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframe, the resulting value gets overwritten at each turn of the loop.
Or should I create one dataframe in a single loop, taking my formula names and expressions as a dict? The trouble here is the same as above.
Here is my formulas dict:
# ratios dict: all ratios names and functions
ratios = {"ratio": fuzz.ratio,
          "partial ratio": fuzz.partial_ratio,
          "token sort ratio": fuzz.token_sort_ratio,
          "partial token sort ratio": fuzz.partial_token_sort_ratio,
          "token set ratio": fuzz.token_set_ratio,
          "partial token set ratio": fuzz.partial_token_set_ratio
          }
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])
    # new base df containing ratio data and calculations for current loop turn
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column,
                                                 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column,
                                                 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
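A slightly more idiomatic variant of the same idea (a sketch using the names assumed above) collects the per-ratio frames in a list and concatenates once at the end, which avoids growing a dataframe inside the loop and sidesteps the late-binding pitfall of redefining do_the_fuzz each iteration:

frames = []
for r, rn in ratios.items():
    df_out1 = df_out[[base_column, target_column]].copy()
    df_out1['mesure'] = r
    # Bind rn through a default argument so each lambda keeps its own ratio.
    df_out1['valeur'] = df_out.apply(
        lambda row, fn=rn: fn(row[base_column], row[target_column]), axis=1)
    frames.append(df_out1)
final_df = pd.concat(frames, ignore_index=True)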

Carrying out multiple piecewise regressions with variables from the same dataframe (but varying column pair lengths)

I'm trying to analyse and plot piecewise regressions for daily temperature and gas use. I have six columns (two corresponding to each year) within a CSV, which I am pulling in using pandas and then defining each column as a separate variable.
I found one of the answers on How to apply piecewise linear fit in Python? extremely helpful and was able to use the following code to run a breakpoint analysis and also plot a graph:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pwlf

# Importing the csv and defining columns as variables
df = pd.read_csv(PATH)
Y_A = df.Column1
X_A = df.Column2
Y_B = df.Column3
X_B = df.Column4

# Analysing breakpoints
my_pwlf_a = pwlf.PiecewiseLinFit(X_A, Y_A)
breaks_a = my_pwlf_a.fit(2)
print(breaks_a)

# Graphing
x_hat = np.linspace(X_A.min(), X_A.max(), 100)
y_hat = my_pwlf_a.predict(x_hat)
plt.figure()
plt.plot(X_A, Y_A, 'o')
plt.plot(x_hat, y_hat, '-')
plt.xlabel('X'); plt.ylabel('Y')
plt.show()
This runs with no problems and gives the desired results.
When I try to repurpose the code using my next pair of variables (Y_B and X_B) I run into problems:
my_pwlf_b = pwlf.PiecewiseLinFit(X_B, Y_B)
breaks_b = my_pwlf_b.fit(2)
print(breaks_b)
The error returned is:
ValueError: bounds should be a sequence containing real valued (min, max) pairs for each value in x
All variables are float64 and each column contains 366 rows. Thanks for any help in spotting what I'm missing!
Thanks to Zionsof for the nudge back towards the data!
Further testing showed that unequal lengths of the column pairings were the problem (e.g. Columns 1 & 2 contained 366 values while Columns 3 & 4 contained 365). I had foolishly thought that separating the columns into separate variables might fix this, but I was incorrect. Here is what I used to fix it (numpy.isfinite):
# Remove any blanks by ensuring the values are finite
Y_A = df.Column1[np.isfinite(df['Column1'])]
X_A = df.Column2[np.isfinite(df['Column2'])]
Y_B = df.Column3[np.isfinite(df['Column3'])]
X_B = df.Column4[np.isfinite(df['Column4'])]
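Note that filtering each column independently can still leave an X/Y pair misaligned if the NaNs fall on different rows. A safer sketch is to drop incomplete rows per pair, so each pair keeps identical length and row alignment:

# Drop rows where either column of a pair is NaN, keeping the pair aligned.
pair_a = df[['Column1', 'Column2']].dropna()
Y_A, X_A = pair_a['Column1'], pair_a['Column2']

pair_b = df[['Column3', 'Column4']].dropna()
Y_B, X_B = pair_b['Column3'], pair_b['Column4']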

How to label line chart with column from pandas dataframe (from 3rd column values)?

I have a data set I filtered to the following (sample data):
Name Time l
1 1.129 1G-d
1 0.113 1G-a
1 3.374 1B-b
1 3.367 1B-c
1 3.374 1B-d
2 3.355 1B-e
2 3.361 1B-a
3 1.129 1G-a
I got this data after filtering the data frame and converting it to a CSV file:
# Assigns the new data frame to "df" with the data from only three columns
header = ['Names', 'Time', 'l']
df = pd.DataFrame(df_2, columns=header)

# Sorts the data frame by column "Names" as integers
df.Names = df.Names.astype(int)
df = df.sort_values(by=['Names'])

# Changes the data to match format after converting it to int
df.Time = df.Time.astype(int)
df.Time = df.Time / 1000
csv_file = df.to_csv(index=False, columns=header, sep=" ")
Now, I am trying to graph a line for each item in the label column, with markers.
I want the column l as my line names (labels) - each as a new line, Time as my Y-axis values and Names as my X-axis values.
So, in this case, I would have 7 different lines in the graph with these labels: 1G-d, 1G-a, 1B-b, 1B-c, 1B-d, 1B-e, 1B-a.
So far I have the following additional settings, but I am not sure how to graph the lines themselves.
plt.xlim(0, 60)
plt.ylim(0, 18)
plt.legend(loc='best')
plt.show()
I tried sns.lineplot, which comes with hue, but I do not want the legend box to have a title. Also, in that case, I cannot get markers without adding a new column for style.
I also tried plt.plot, but there I am not sure how to get more lines; I can only give x and y values, which creates a single line.
If there's any other source, please let me know below.
Thanks
The final graph I want to have is like the following but with markers:
You can apply a few tweaks to seaborn's lineplot. Using some created data since your sample isn't really long enough to demonstrate:
# Create data
np.random.seed(2019)
categories = ['1G-d', '1G-a', '1B-b', '1B-c', '1B-d', '1B-e', '1B-a']
df = pd.DataFrame({'Name': np.repeat(range(1, 11), 10),
                   'Time': np.random.randn(100).cumsum(),
                   'l': np.random.choice(categories, 100)
                   })

# Plot
sns.lineplot(data=df, x='Name', y='Time', hue='l', style='l', dashes=False,
             markers=True, ci=None, err_style=None)

# Temporarily removing limits based on sample data
#plt.xlim(0, 60)
#plt.ylim(0, 18)

# Remove seaborn legend title & set new title (if desired)
ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[1:], labels=labels[1:], title='New Title', loc='best')
plt.show()
- To apply markers, you have to specify a style variable. This can be the same as hue.
- You likely want to remove dashes, ci, and err_style (as in the call above).
- To remove the seaborn legend title, you can get the handles and labels, then re-add the legend without the first handle and label. You can also specify the location here and set a new title if desired (or just remove title=... for no title).
Edits per comments:
Filtering your data to only a subset of level categories can be done fairly easily via:
categories = ['1G-d', '1G-a', '1B-b', '1B-c', '1B-d', '1B-e', '1B-a']
df = df.loc[df['l'].isin(categories)]
markers=True will fail if there are too many levels. If you are only interested in marking points for aesthetic purposes, you can simply multiply a single marker by the number of categories you are interested in (which you have already created to filter your data to categories of interest): markers='o'*len(categories).
Alternatively, you can specify a custom dictionary to pass to the markers argument:
points = ['o', '*', 'v', '^']
mult = len(categories) // len(points) + (len(categories) % len(points) > 0)
markers = {key: value for (key, value)
           in zip(categories, points * mult)}
This will return a dictionary of category-point combinations, cycling over the marker points specified until each item in categories has a point style.
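For completeness, a hypothetical call passing that dictionary through (same df and columns as the generated example above):

sns.lineplot(data=df, x='Name', y='Time', hue='l', style='l',
             dashes=False, markers=markers, err_style=None)
plt.show()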

Get feature names for dataframe.corr

I am using the breast cancer data set from sklearn and I need to find the correlations between features. I am able to find the correlated columns, but I am not able to present them in a "nice" way, so that they can be used as input for DataFrame.drop.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
corr = df.corr()

# filter to find correlations above 0.6, keeping only the upper triangle
# (pd.np was removed in pandas 1.x, so use numpy directly)
corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu = corr_triu.stack()
corr_result = corr_triu[corr_triu > 0.6]
print(corr_result)

df.drop(columns=[?])
IIUC, you want the columns that correlate with some other column in the dataset, i.e. drop the columns that don't appear in corr_result. So you'll want to get the unique variables from the index of corr_result, from each level. There may be repeats, so take care of that as well, such as with sets:
corr_result.index = corr_result.index.remove_unused_levels()
corr_vars = set()
corr_vars.update(corr_result.index.unique(level=0))
corr_vars.update(corr_result.index.unique(level=1))
all_vars = set(df.columns)
df.drop(columns=all_vars - corr_vars)
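An equivalent, slightly more compact sketch (same names as above), collecting both index levels in one step:

keep = (set(corr_result.index.get_level_values(0))
        | set(corr_result.index.get_level_values(1)))
df_reduced = df.drop(columns=df.columns.difference(list(keep)))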
