I'm trying to generate a stack plot of version data using matplotlib. I have that portion working and displaying properly, but I'm unable to get the legend to display anything other than an empty square in the corner.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import date2num

ra_ys = np.asarray(ra_ys)
# Going to generate a stack plot of the version stats
fig = plt.figure()
ra_plot = fig.add_subplot(111)
# Our x axis is going to be the dates, but we need them as numbers
x = [date2num(date) for date in dates]
# Plot the data
ra_plot.stackplot(x, ra_ys)
# Setup our legends
ra_plot.legend(ra_versions) #Also tried converting to a tuple
ra_plot.set_title("blah blah words")
print(ra_versions)
# Only want x ticks on the dates we supplied, and want them to display AS dates
ra_plot.set_xticks(x)
ra_plot.set_xticklabels([date.strftime("%m-%d") for date in dates])
plt.show()
ra_ys is a multidimensional array:
[[ 2 2 2 2 2 2 2 2 2 2 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[53 52 51 50 50 49 48 48 48 48 47]
[18 19 20 20 20 20 21 21 21 21 21]
[ 0 0 12 15 17 18 19 19 19 19 22]
[ 5 5 3 3 3 3 3 3 3 3 3]
[ 4 4 3 3 2 2 2 2 2 2 2]
[14 14 6 4 3 3 2 2 2 2 2]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 2 2 2 2 2 2 2 2 2 2 2]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 3 3 2 2 2 2 2 2 2 2 2]]
x is some dates: [734969.0, 734970.0, 734973.0, 734974.0, 734975.0, 734976.0, 734977.0, 734978.0, 734979.0, 734980.0, 734981.0]
ra_versions is a list: ['4.5.2', '4.5.7', '4.5.8', '5.0.0', '5.0.1', '5.0.10', '5.0.7', '5.0.8', '5.0.9', '5.9.105', '5.9.26', '5.9.27', '5.9.29', '5.9.31', '5.9.32', '5.9.34']
Am I doing something wrong? Can stack plots not have legends?
EDIT: I tried to print the handles and labels for the plot and got two empty lists ([] []):
handles, labels = theplot.get_legend_handles_labels()
print(handles,labels)
I then tested the same figure using the following code for a proxy handle, and it worked. So it looks like the lack of handles is the problem.
p = plt.Rectangle((0, 0), 1, 1, fc="r")
theplot.legend([p], ['test'])
So now the question is, how can I generate a variable number of proxy handles that match the colors of my stack plot?
This is the final (cleaner) approach to getting the legend. Since there are no handles, I generate a proxy artist for each stacked series. It can theoretically handle cases where colors are reused, though the result will be confusing.
import os
import sys
import matplotlib.pyplot as plt
from matplotlib.dates import date2num

def plot_version_data(title, dates, versions, version_ys, savename=None):
    print("Prepping plot for \"{0}\"".format(title))
    fig = plt.figure()
    theplot = fig.add_subplot(111)
    # Our x axis is going to be the dates, but we need them as numbers
    x = [date2num(date) for date in dates]
    # Cycle through these colors for the stacked areas
    colormap = "bgrcmy"
    theplot.stackplot(x, version_ys, colors=colormap)
    # Make proxy artists for the legend, reusing colors in the same
    # order stackplot assigned them
    p = [plt.Rectangle((0, 0), 1, 1, fc=colormap[i % len(colormap)])
         for i in range(len(versions))]
    theplot.legend(p, versions)
    theplot.set_title(title)
    # Set up the X axis - rotate labels to keep them from overlapping,
    # display like Oct-16, and trim whitespace on either end
    plt.xticks(rotation=315)
    theplot.set_xticks(x)
    theplot.set_xticklabels([date.strftime("%b-%d") for date in dates])
    plt.xlim(x[0], x[-1])
    if savename:
        print("Saving output as \"{0}\"".format(savename))
        fig.savefig(os.path.join(sys.path[0], savename))
    else:
        plt.show()
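For completeness, a minimal usage sketch; the dates, version names, and counts below are made-up placeholders. (On newer matplotlib releases you could also skip the proxy artists entirely: stackplot accepts a labels keyword, after which a plain theplot.legend() works.)
import datetime as dt
import numpy as np

dates = [dt.date(2013, 10, d) for d in (14, 15, 16)]
versions = ["4.5.2", "5.0.0", "5.0.1"]
version_ys = np.array([[2, 2, 2],      # one row of counts per version
                       [53, 52, 51],
                       [18, 19, 20]])
plot_version_data("Version adoption over time", dates, versions, version_ys)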
In this dataframe:
Feat1 Feat2 Feat3 Feat4 Labels
-46.220314 22.862856 -6.1573067 5.6060414 2
-23.80669 20.536781 -5.015675 4.2216353 2
-42.092365 25.680704 -5.0092897 5.665794 2
-35.29639 21.709473 -4.160352 5.578346 2
-37.075096 22.347767 -3.860426 5.6953945 2
-42.8849 28.03802 -7.8572545 3.3361 2
-32.3057 26.568039 -9.47018 3.4532788 2
-24.469942 27.005375 -9.301921 4.3995037 2
-97.89892 -0.38156664 6.4163384 7.234347 1
-81.96325 0.1821717 -1.2870358 4.703838 1
-78.41986 -6.766374 0.8001185 0.83444935 1
-100.68544 -4.5810957 1.6977689 1.8801615 1
-87.05412 -2.9231584 6.817379 5.4460077 1
-64.121056 -3.7892206 -0.283514 6.3084154 1
-94.504845 -0.9999217 3.2884297 6.881124 1
-61.951996 -8.960198 -1.5915259 5.6160254 1
-108.19452 13.909201 0.6966458 -1.956591 0
-97.4037 22.897585 -2.8488266 1.4105041 0
-92.641335 22.10624 -3.5110545 2.467166 0
-199.18787 3.3090565 -2.5994794 4.0802555 0
-137.5976 6.795896 1.6793671 2.2256763 0
-208.0035 -1.33229 -3.2078092 1.5177402 0
-108.225975 14.341716 1.02891 -1.8651972 0
-121.29299 18.274035 2.2891548 2.3360753 0
I want to sort the rows based on a custom order of the values in the "Labels" column.
I am able to sort in ascending order, such that the labels appear as [0 1 2], via the command
df2 = df1.sort_values(by='Labels', ascending=True)
and in descending order with ascending=False, where the labels appear as [2 1 0].
How do I go about sorting the labels as [1 0 2]?
Any help will be greatly appreciated!
Here's a way using Categorical:
df['Labels'] = pd.Categorical(df['Labels'],
categories = [1, 0, 2],
ordered=True)
df.sort_values('Labels')
Output:
Feat1 Feat2 Feat3 Feat4 Labels
11 -100.685440 -4.581096 1.697769 1.880162 1
15 -61.951996 -8.960198 -1.591526 5.616025 1
8 -97.898920 -0.381567 6.416338 7.234347 1
9 -81.963250 0.182172 -1.287036 4.703838 1
10 -78.419860 -6.766374 0.800118 0.834449 1
14 -94.504845 -0.999922 3.288430 6.881124 1
12 -87.054120 -2.923158 6.817379 5.446008 1
13 -64.121056 -3.789221 -0.283514 6.308415 1
21 -208.003500 -1.332290 -3.207809 1.517740 0
20 -137.597600 6.795896 1.679367 2.225676 0
19 -199.187870 3.309057 -2.599479 4.080255 0
18 -92.641335 22.106240 -3.511055 2.467166 0
17 -97.403700 22.897585 -2.848827 1.410504 0
16 -108.194520 13.909201 0.696646 -1.956591 0
23 -121.292990 18.274035 2.289155 2.336075 0
22 -108.225975 14.341716 1.028910 -1.865197 0
7 -24.469942 27.005375 -9.301921 4.399504 2
6 -32.305700 26.568039 -9.470180 3.453279 2
5 -42.884900 28.038020 -7.857254 3.336100 2
4 -37.075096 22.347767 -3.860426 5.695394 2
3 -35.296390 21.709473 -4.160352 5.578346 2
2 -42.092365 25.680704 -5.009290 5.665794 2
1 -23.806690 20.536781 -5.015675 4.221635 2
0 -46.220314 22.862856 -6.157307 5.606041 2
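One side effect to note: Labels is now a categorical column. If you'd rather keep plain integers once the sort is done, convert it back (a one-liner, assuming integer labels):
df['Labels'] = df['Labels'].astype(int)  # drop the categorical dtype again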
You can use an ordered Categorical, or, if you don't want to change the DataFrame, the poor man's variant: a mapping Series passed as the sort key:
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
# or
# pd.Series(range(len(order)), index=order).get
df1.sort_values(by='Labels', key=key)
Example:
df1 = pd.DataFrame({'Labels': [1,0,1,2,0,2,1]})
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
print(df1.sort_values(by='Labels', key=key))
Labels
0 1
2 1
6 1
1 0
4 0
3 2
5 2
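Note that the key parameter of sort_values is relatively new (added in pandas 1.1.0, as far as I know); on older pandas you'd fall back to the Categorical or mapping-column approaches.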
Here is another way to do it:
Create a new column using map to encode the desired order, then sort as usual.
df['sort_label'] = df['Labels'].map({1: 0, 0: 1, 2: 2})
df.sort_values('sort_label')
Feat1 Feat2 Feat3 Feat4 Labels sort_label
11 -100.685440 -4.581096 1.697769 1.880162 1 0
15 -61.951996 -8.960198 -1.591526 5.616025 1 0
8 -97.898920 -0.381567 6.416338 7.234347 1 0
9 -81.963250 0.182172 -1.287036 4.703838 1 0
10 -78.419860 -6.766374 0.800119 0.834449 1 0
14 -94.504845 -0.999922 3.288430 6.881124 1 0
12 -87.054120 -2.923158 6.817379 5.446008 1 0
13 -64.121056 -3.789221 -0.283514 6.308415 1 0
21 -208.003500 -1.332290 -3.207809 1.517740 0 1
20 -137.597600 6.795896 1.679367 2.225676 0 1
19 -199.187870 3.309057 -2.599479 4.080255 0 1
18 -92.641335 22.106240 -3.511054 2.467166 0 1
17 -97.403700 22.897585 -2.848827 1.410504 0 1
16 -108.194520 13.909201 0.696646 -1.956591 0 1
23 -121.292990 18.274035 2.289155 2.336075 0 1
22 -108.225975 14.341716 1.028910 -1.865197 0 1
7 -24.469942 27.005375 -9.301921 4.399504 2 2
6 -32.305700 26.568039 -9.470180 3.453279 2 2
5 -42.884900 28.038020 -7.857254 3.336100 2 2
4 -37.075096 22.347767 -3.860426 5.695394 2 2
3 -35.296390 21.709473 -4.160352 5.578346 2 2
2 -42.092365 25.680704 -5.009290 5.665794 2 2
1 -23.806690 20.536781 -5.015675 4.221635 2 2
0 -46.220314 22.862856 -6.157307 5.606041 2 2
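If the helper column is only needed for ordering, it can be dropped again right after the sort (a one-line sketch):
df = df.sort_values('sort_label').drop(columns='sort_label')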
I have a dataframe (edata) as given below:
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of the two variables (Domestic and Catsize) results in zero (0), i.e. for these combinations:
Domestic Catsize AND
1 0 0
0 1 0
0 0 0
The code I use to perform the process is
g=edata.groupby('Type')
q3=g.apply(lambda x:x[((x['Domestic']==0) & (x['Catsize']==0) |
(x['Domestic']==0) & (x['Catsize']==1) |
(x['Domestic']==1) & (x['Catsize']==0)
)]
['Count'].sum()
)
q3
Type
1 1
2 11
3 14
4 31
This code works fine; however, if the number of variables in the dataframe increases, the number of conditions grows rapidly. So, is there a smart way to write a condition that says: if ANDing the two (or more) variables results in zero, then perform the sum()?
You can filter first using pd.DataFrame.all negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
Before adding it back, use map to broadcast:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
How about:
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
For logical AND, the product does the trick nicely.
For logical OR, you can use sum(axis=1) with proper negation in advance.
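A self-contained sketch of both tricks, using the data from the question:
import pandas as pd

df = pd.DataFrame({'Domestic': [1, 1, 1, 0, 1, 0],
                   'Catsize':  [0, 1, 0, 1, 1, 1],
                   'Type':     [1, 1, 2, 3, 4, 4],
                   'Count':    [1, 8, 11, 14, 21, 31]})
cols = ['Domestic', 'Catsize']
# AND over the 0/1 columns is 0  <=>  their row-wise product is 0
and_is_zero = ~df[cols].prod(axis=1).astype(bool)
# OR over the 0/1 columns is 0   <=>  their row-wise sum is 0
or_is_zero = df[cols].sum(axis=1).eq(0)
print(df.loc[and_is_zero].groupby('Type')['Count'].sum())
# Type
# 1     1
# 2    11
# 3    14
# 4    31
# Name: Count, dtype: int64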
I have a dataframe with columns A, B, and flag. I want to calculate the mean of the 2 values before the flag changes from 0 to 1, record the value just before the flag changes from 0 to 1, and record the value when the flag changes from 1 to 0.
# Input dataframe
df=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,0,0,0]})
# Expected output
df_out = pd.DataFrame({'A_mean_before_flag_change':[5.5],
'B_mean_before_flag_change':[5],
'A_value_before_change_flag':[7],
'B_value_before_change_flag':[6]})
I tried to create a more general solution:
df=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,1,0,1]})
print (df)
A B flag
0 1 1 0
1 3 3 0
2 4 4 0
3 7 6 0
4 8 8 1
5 11 11 1
6 1 1 1
7 15 19 0
8 20 20 0
9 15 15 1
10 16 16 0
11 87 87 1
First create groups using a mask of the 0 values whose next flag value is 1:
m1 = df['flag'].eq(0) & df['flag'].shift(-1).eq(1)
df['g'] = m1.iloc[::-1].cumsum()
print (df)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
9 15 15 1 1
10 16 16 0 1
11 87 87 1 0
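The reversed cumsum is the trick here: summing the mask bottom-up numbers the groups from the end, and index alignment then puts the values back in row order. A tiny sketch:
import pandas as pd
m = pd.Series([False, True, False, True, False])
print(m.iloc[::-1].cumsum())  # counts True values from the bottom up
# 4    0
# 3    1
# 2    1
# 1    2
# 0    2
# dtype: int64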
Then filter out groups whose size is less than N:
N = 4
df1 = df[df['g'].map(df['g'].value_counts()).ge(N)].copy()
print (df1)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
Take the last N rows of each group:
df2 = df1.groupby('g').tail(N)
And aggregate with last and mean:
d = {'mean':'_mean_before_flag_change', 'last': '_value_before_change_flag'}
df3 = df2.groupby('g')[['A','B']].agg(['mean','last']).sort_index(axis=1, level=1).rename(columns=d)
df3.columns = df3.columns.map(''.join)
print (df3)
   A_value_before_change_flag  B_value_before_change_flag  A_mean_before_flag_change  B_mean_before_flag_change
g
2                          20                          20                      11.75                      12.75
3                           7                           6                       3.75                       3.50
I'm assuming that this needs to work for cases with more than one rising edge and that the consecutive values and averages get appended to the output lists:
# the first step is to extract the rising and falling edges using diff(), identify sections and length
df['flag_diff'] = df.flag.diff().fillna(0)
df['flag_sections'] = (df.flag_diff != 0).cumsum()
df['flag_sum'] = df.flag.groupby(df.flag_sections).transform('sum')
# then you can get the relevant indices by checking for the rising edges
rising_edges = df.index[df.flag_diff==1.0]
val_indices = [i-1 for i in rising_edges]
avg_indices = [(i-2,i-1) for i in rising_edges]
# and finally iterate over the relevant sections
df_out = pd.DataFrame()
df_out['A_mean_before_flag_change'] = [df.A.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['B_mean_before_flag_change'] = [df.B.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['A_value_before_change_flag'] = [df.A.loc[idx] for idx in val_indices]
df_out['B_value_before_change_flag'] = [df.B.loc[idx] for idx in val_indices]
df_out['length'] = [df.flag_sum.loc[idx] for idx in rising_edges]
df_out.index = rising_edges
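Hand-tracing this against the question's original dataframe (a single rising edge at index 4) should give the expected values, plus the extra length column recording the run of 1s:
   A_mean_before_flag_change  B_mean_before_flag_change  A_value_before_change_flag  B_value_before_change_flag  length
4                        5.5                        5.0                           7                           6       3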
I have a boolean column in a CSV file, for example:
1 1
2 0
3 0
4 0
5 1
6 1
7 1
8 0
9 0
10 1
11 0
12 0
13 1
14 0
15 1
You can see here that 1 is repeating every 5 lines.
I want to recognize this repeating pattern [1,0,0,0] as soon as the repetition count goes above 10, in Python (I have ~20,000 rows per file).
The pattern can start at any position.
How could I manage this in Python while avoiding long chains of if statements?
import numpy as np
import pandas as pd

# Generate 20000 random 0s and 1s
data = pd.Series(np.random.randint(0, 2, 20000))
# Keep the indices of the 1s
idx = data[data > 0].index
# Check whether the distance between the current index and the next one is 4;
# e.g. if positions 2 and 6 both hold a 1, then 6 - 2 = 4
found = []
for i, v in enumerate(idx):
    if i == len(idx) - 1:
        break
    next_value = idx[i + 1]
    if (next_value - v) == 4:
        found.append(v)
print(found)
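The same distance check can be vectorized with np.diff, a sketch that yields the same found positions without the Python loop:
gaps = np.diff(idx)                 # distances between consecutive 1-positions
found = list(idx[:-1][gaps == 4])   # starts of [1, 0, 0, 0] occurrences
print(found)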
My data:
y n Rh y2
1 1 1.166666667 1
-1 2 0.5 1
-1 3 0.333333333 1
-1 4 0.166666667 1
1 5 1.666666667 2
1 6 1.333333333 1
-1 7 0.333333333 1
-1 8 0.333333333 1
1 9 0.833333333 1
1 10 2.333333333 2
1 11 1 1
-1 12 0.166666667 1
1 13 0.666666667 1
1 14 0.833333333 1
1 15 0.833333333 1
-1 16 0.333333333 1
-1 17 0.166666667 1
1 18 2 2
1 19 0.833333333 1
1 20 1.333333333 1
1 21 1.333333333 1
-1 22 0.166666667 1
-1 23 0.166666667 1
-1 24 0.333333333 1
-1 25 0.166666667 1
-1 26 0.166666667 1
-1 27 0.333333333 1
-1 28 0.166666667 1
-1 29 0.166666667 1
-1 30 0.5 1
1 31 0.833333333 1
-1 32 0.166666667 1
-1 33 0.333333333 1
-1 34 0.166666667 1
-1 35 0.166666667 1
My code:
data = xlsread('btpdata.xlsx', 1);
A = data(1:end, 2:3);
B = data(1:end, 1);
svmStruct = svmtrain(A, B, 'showplot', true);
hold on
C = data(1:end, 2:3);
D = data(1:end, 4);
svmStruct = svmtrain(C, D, 'showplot', true);
hold off
How can I get the approximate equations of these black lines in the given MATLAB plot?
It depends on which package you used, but since it is a linear Support Vector Machine there are more or less two options:
Your trained SVM contains the equation of the line in a property coefs (sometimes called w or weights) and b (or intercept), so your line is <coefs, X> + b = 0.
Your SVM contains alphas (dual coefficients, Lagrange multipliers), and then coefs = SUM_i alphas_i * y_i * SV_i, where SV_i is the i-th support vector (the ones in circles on your plot) and y_i is its label (-1 or +1). Sometimes the alphas are already multiplied by y_i; then coefs = SUM_i alphas_i * SV_i.
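A minimal NumPy sketch of that reconstruction, with made-up alphas and support vectors (not taken from the question's model), solving <coefs, X> + b = 0 for the slope/intercept form:
import numpy as np

# Hypothetical signed dual coefficients (alpha_i * y_i) and their
# two-feature support vectors; substitute the values from your model.
alphas_y = np.array([0.7, -0.4, -0.3])
SV = np.array([[1.0, 1.2],
               [0.8, 0.4],
               [1.1, 0.3]])
b = -0.6  # bias / intercept term

w = alphas_y @ SV  # coefs = SUM_i alphas_i * y_i * SV_i
# Boundary: w[0]*x + w[1]*y + b = 0  ->  y = -(w[0]/w[1])*x - b/w[1]
slope = -w[0] / w[1]
intercept = -b / w[1]
print("y = {:.3f} * x + {:.3f}".format(slope, intercept))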
If you are trying to get the equation from the actual plot (image), then you can only read it off (it is more or less y = 0.6, meaning that coefs = [0 1] and b = -0.6). An image-analysis based approach (for an arbitrary such plot) would require:
detecting the plot area (object detection)
reading the ticks/scale (OCR + object detection) <- this would actually be the hardest part
filtering out everything non-black and performing linear regression on the points that are left, then transforming through the scale detected earlier.
I have had the same problem. To build the linear equation (y = mx + b) of the decision boundary you need the gradient (m) and the y-intercept (b). SVMStruct.Bias is the b term. The gradient is determined by the SVM beta weights, which SVMStruct does not contain, so you need to calculate them from the alphas (which are included in SVMStruct):
alphas = SVMStruct.Alpha;
SV = SVMStruct.SupportVectors;
betas = sum(alphas.*SV);
m = betas(1)/betas(2)
By the way, if your SVM has scaled the data, then I think you will need to unscale it.