I'm new to groups in pandas, and relatively new to pandas, so I hope one of you can help me with my problem.
Aim: flag outliers within a group by setting the relevant cell in the relevant column to 1. The condition is that the data point is outside a calculated group-specific limit.
Data: This is a geopandas dataframe with multiple time series with some numeric variables. Each timeseries has its own id.
Some background:
I want to determine outliers for each timeseries by
first grouping the timeseries according to timeseries id
then calculating the lower and upper limit of the variables PER group
then 'flagging' the values which are outside the limit by adding a 1 in a specific 'outlier' column
Here is the code which calculates the limits; however, I have a hard time figuring out how to set the flag:
df_timeseries['outlier'] = 0
for timeseries, group in df_timeseries.groupby('timeseries.id'):
    Q1 = group['Variable.value'].quantile(0.25)
    Q3 = group['Variable.value'].quantile(0.75)
    IQR = Q3 - Q1
    low_lim = Q1 - 1.5 * IQR
    up_lim = Q3 + 1.5 * IQR
    for value in group['Variable.value']:
        if (value < low_lim) or (value > up_lim):
            # here --> set '1' in the column 'outlier'
            pass
I tried it multiple ways, for example:
df_timeseries.loc[df_timeseries['Variable.value'] > up_lim, 'outlier'] = 1
I also tried 'apply()': instead of iterating over the tracks, I first defined a function and then applied it to the group. However, nothing really worked, and I could not find out what I am actually doing wrong. If someone can help, I would be really glad, as I have already spent a couple of hours trying to figure this out.
I would need something like:
group.loc[group['outlier']] = 1
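One possible way to get there without the inner loop is to compute the limits per group with groupby(...).transform, which returns values aligned to the original rows, and then compare row by row. This is only a minimal sketch: the column names are taken from the question, while the function name flag_iqr_outliers and the demo data are made up for illustration.

```python
import pandas as pd

def flag_iqr_outliers(df, group_col="timeseries.id", value_col="Variable.value", k=1.5):
    """Return a copy of df with an integer 'outlier' column (1 = outside the group's IQR limits)."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quantile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    low_lim = q1 - k * iqr
    up_lim = q3 + k * iqr
    out = df.copy()
    out["outlier"] = ((df[value_col] < low_lim) | (df[value_col] > up_lim)).astype(int)
    return out

# Hypothetical demo data: two timeseries, each with one obvious outlier
demo = pd.DataFrame({
    "timeseries.id": ["a"] * 6 + ["b"] * 6,
    "Variable.value": [1, 2, 3, 4, 5, 100, 10, 11, 12, 13, 14, -50],
})
flagged = flag_iqr_outliers(demo)
print(flagged)
```

If you prefer to keep the loop, the attempted .loc line most likely fails because its mask covers the whole dataframe rather than the current group. Restricting it to the group's own rows, for example with df_timeseries.loc[group[mask].index, 'outlier'] = 1 where mask is built from group['Variable.value'], should have the same effect.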
Hi everyone...
I just started with Python a couple of days ago because I need to handle some Excel data in order to automatically update certain cells in one file with data from another.
However, I'm kind of stuck since I have barely programmed before, and it's my first time using Python as well, but my job requires me to find a solution and I'm trying to make it work even though it's not my field of expertise.
I used the xlrd library, imported my file and managed to print the columns I need. However, I can't find a way to put those columns into a matrix in order to handle the data like this:
Matrix = [DataColumnA DataColumnG DataColumnH] of size [nrows x 3]
For now, I have 3 different outputs for the 3 different columns I need, but I'm trying to join them together into one big matrix.
So far my code looks like this:
import xlrd

workbook = xlrd.open_workbook("190219_serviciosWRAmanualV5.xls")
worksheet = workbook.sheet_by_name("ServiciosDWDM")
workbook2 = xlrd.open_workbook("Potencia2.xlsx")
worksheet2 = workbook2.sheet_by_name("Hoja1")
filas = worksheet.nrows
filas2 = worksheet2.nrows
columnas = worksheet.ncols
for row in range(2, filas):
    Equipo_A = worksheet.cell(row, 12).value
    Client_A = worksheet.cell(row, 13).value
    Line_A = worksheet.cell(row, 14).value
    print(Equipo_A, Line_A, Client_A)
So, as mentioned above, I have only gotten as far as printing the data in those columns.
What I mainly need to do is read the cell in the first row of Column A and look for it in the other Excel file. If the names match, I have to validate that, for the same row (in file 1), the data in both Column G and Column H is the same as the data in the second file.
If they match, I have to update Column J in the first file with the data from the second file.
My other approach is to retrieve the value of the cell in Column A and look for it in Column A of the second file, then use an if conditional to check whether Columns G and H are equal to Column C of the second file, and so on.
The thing is, I have no idea how to pinpoint the position of the cell and extract the data to build the conditional for this second approach.
I'm not sure whether building that matrix is a good approach or whether the second way is better, so any suggestion would be greatly appreciated.
Thank you in advance!
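Both approaches can be combined: build the nrows x 3 matrix as a list of rows, and turn the second file into a dictionary keyed by the name so that each lookup is a single step. The sketch below is only an illustration under assumptions: the rows are made-up stand-ins for the two sheets, the column positions are hypothetical, and the helper names build_matrix and index_by_key are not part of xlrd. With xlrd, the rows could be collected with worksheet.row_values(r) inside a loop over range(2, worksheet.nrows).

```python
def build_matrix(rows, col_indices):
    """Keep only the requested columns of each row, giving one short row per input row."""
    return [[row[c] for c in col_indices] for row in rows]

def index_by_key(matrix):
    """Map the first column (the name) to the rest of the row for fast lookup."""
    return {row[0]: row[1:] for row in matrix}

# Hypothetical stand-ins for the two sheets (name, filler, client, line) and
# (name, client, line, value to copy over).
file1_rows = [
    ["EQ-1", "x", "cli-A", "line-1"],
    ["EQ-2", "x", "cli-B", "line-2"],
    ["EQ-3", "x", "cli-C", "line-3"],
]
file2_rows = [
    ["EQ-2", "cli-B", "line-2", -3.5],
    ["EQ-1", "cli-A", "line-9", -1.0],
]

matrix1 = build_matrix(file1_rows, [0, 2, 3])                   # nrows x 3
lookup2 = index_by_key(build_matrix(file2_rows, [0, 1, 2, 3]))  # name -> [client, line, value]

updates = {}  # row number in file 1 -> new value for the target column
for row_number, (name, client, line) in enumerate(matrix1):
    match = lookup2.get(name)
    # Only update when the name exists in file 2 and both other columns agree
    if match is not None and match[0] == client and match[1] == line:
        updates[row_number] = match[2]

print(updates)
```

Note that xlrd only reads workbooks, so writing the collected updates back into the first file would need a different library.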
I'm a total beginner with CPLEX and OPL, so maybe you can help me with the coding of a mixed integer programming model.
In my case: my objective function includes a transportation cost parameter that is specific to the starting point (Hubs h), the destination (DCs i), the transported good (Products k) and the mode of transportation used (TransportOptions r).
I wrote it like this:
float transportC_Hub_DC[Hubs][DCs][Products][TransportOptions] = ...;
//transport cost of one unit of good k from starting point h to destination i using transportation option r
I would like to fill this array with its multiple dimensions from an Excel spreadsheet. At the moment my spreadsheet has the four indexes in separate columns and the specific transportation cost in another column. It looks like this:
Excel Datasheet
My problem is that I do not know how to make the programme understand how the transportation cost data are ordered. How does the programme know that in the first cell of the column "transportation cost" is the cost for the specific combination of the different indexes? So how do I tell the programme that I used h=1, i=1, k=1, r=1 in the first cell and h=1, i=1, k=1, r=2 in the second cell and not h=1, i=1, k=2, r=1 in the second cell? What do I have to write in the model or the data file in CPLEX to make this clear?
See technote http://www-01.ibm.com/support/docview.wss?rs=0&context=SSCMS55&uid=swg21401340&loc=en_US&cs=utf-8&cc=us&lang=all
The idea is to read a tuple set and then turn your tuple set into a 4D array.
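To illustrate why the tuple approach removes the ordering problem, here is a conceptual sketch in Python (not OPL, so it cannot be pasted into the model). Each row carries its own four indices plus the cost, so the position of a row in the sheet no longer matters. The rows, the index ranges and the default of 0.0 for missing combinations are all assumptions made for the example.

```python
# Hypothetical rows as they might appear in the sheet:
# (hub h, DC i, product k, transport option r, cost)
cost_rows = [
    (1, 1, 1, 1, 4.0),
    (1, 1, 2, 1, 6.5),   # rows can come in any order
    (1, 1, 1, 2, 5.0),
    (2, 1, 1, 1, 7.25),
]

hubs, dcs, products, options = [1, 2], [1], [1, 2], [1, 2]

# Sparse lookup keyed by the full index combination
cost_by_index = {(h, i, k, r): c for h, i, k, r, c in cost_rows}

# Dense 4D array; combinations missing from the sheet default to 0.0 (an assumption)
transport_cost = [
    [[[cost_by_index.get((h, i, k, r), 0.0) for r in options]
      for k in products]
     for i in dcs]
    for h in hubs
]

print(transport_cost[0][0][0][1])  # cost for h=1, i=1, k=1, r=2
```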
I have 2 series of data. For the sake of simplicity, let's say the data looks like this:
set 1:
1 3
2 3.5
3 4
4 4.5
5 5
6 5.5
7 6
8 6.5
9 7
10 7.5
set 2:
1.5 2
2.8 4.5
3.5 8
4.5 6
5.5 4.8
6.5 4
7.5 6.5
8.5 9
9.5 3
10.5 4
After charting these 2 sets, I want to get the line that follows the higher data, i.e. the black line in the attached picture. How do I get that? My actual data has thousands of data points, so doing this manually isn't possible.
Added later: Another thing I forgot to mention: in my actual data, one set has about 500 x,y values and the other has about 50. The end points have the same or similar x values, though.
Thanks for your help.
Given your information about the chart and the tables, I would do something like this:
The new series will be based on two formulas:
In Column H, I have the formula for the max value (between your two series):
=MAX(B2,E2)
In Column G, I have a formula that, based on the max value (formula above), picks which X value to use (the X value from Series 1 or 2).
=IF(H2=B2,A2,D2)
Then I can plot my graph:
Series 1, Column B
Series 2, Column E
Series 3, Column H.
All series use the X values of Column G.
Introduction
A few assumptions/comments/pitfalls/constraints regarding my solution:
Set 1 and Set 2 are in columns A through D.
The combined data set will combine the x-values of both Sets, and will have additional data points where the lines cross.
It involves several helper columns, in particular to allow you to copy/paste this across multiple worksheets with data.
I did not try to condense too much, to improve readability; some helper columns could probably be combined.
It was tested with the data set from the question, but it is difficult to guarantee all "boundary" conditions, e.g. identical data points between Set 1 and Set 2, zero overlap between the two data sets, empty data sets, etc. (I did test some of these, see my comments at the end).
Set 1 and Set 2 must be sorted (on x-values). If this is not the case, a few additional helper columns are needed to sort the data dynamically.
To better understand the solution described below, here is the resulting graph, based on the data set in the question (although I added one data point [2.5;3.75] to avoid having the data points of Set 1 and Set 2 perfectly alternating):
General solution outline / methodology
Combine both datasets in a single (sorted) column;
For all x-values, determine highest y-value, between the y-value in the Set, and the calculated y-value on the line segment from the neighboring values in the other Set (looks simple, in particular with the given example data set, but this is quite tricky to do when data sets have no alternating x-values);
Find the points (x & y values) where the lines of the graph are crossing (intersecting), let's call this Set 3
Combine and sort (on x-values) the three data sets in two columns (for x & y values).
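For readers who can step outside Excel, the same four steps can be sketched in a few lines of Python with NumPy. This is an illustration of the outline, not a translation of the spreadsheet: the function name upper_envelope is made up, both sets are assumed to be sorted on x, and, unlike the spreadsheet, the sketch only covers the x-range where the two sets overlap.

```python
import numpy as np

def upper_envelope(x1, y1, x2, y2):
    """Upper envelope of two piecewise-linear series over their overlapping x-range."""
    x1, y1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, y1, x2, y2))
    lo, hi = max(x1[0], x2[0]), min(x1[-1], x2[-1])
    # Step 1: merged, sorted x-values of both sets (restricted to the overlap)
    xs = np.union1d(x1, x2)
    xs = xs[(xs >= lo) & (xs <= hi)]
    # Step 2: evaluate both lines at every merged x by linear interpolation
    f1 = np.interp(xs, x1, y1)
    f2 = np.interp(xs, x2, y2)
    # Step 3: a sign flip of the difference between neighbours means the lines cross there
    d = f1 - f2
    cross = np.where(d[:-1] * d[1:] < 0)[0]
    t = d[cross] / (d[cross] - d[cross + 1])
    xc = xs[cross] + t * (xs[cross + 1] - xs[cross])
    # Step 4: combine, sort, and keep the higher of the two lines at each x
    x_all = np.sort(np.concatenate([xs, xc]))
    y_all = np.maximum(np.interp(x_all, x1, y1), np.interp(x_all, x2, y2))
    return x_all, y_all

# Data from the question
set1_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
set1_y = [3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
set2_x = [1.5, 2.8, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
set2_y = [2, 4.5, 8, 6, 4.8, 4, 6.5, 9, 3, 4]
env_x, env_y = upper_envelope(set1_x, set1_y, set2_x, set2_y)

# Small check case: two straight lines crossing at (1, 1)
xe, ye = upper_envelope([0, 2], [0, 2], [0, 2], [2, 0])
```

Because the interpolation does not care whether the x-values alternate, this also handles the case of about 500 points in one set and about 50 in the other.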
The details and formulas
For the formulas, I assume row 1 contains headings, and the data start on row 2. All formulas should be entered in row 2, except for a few, where I mention to put them in row 3 (because they need data from the preceding row). The result is in columns E (x-values) and F (y-values), and G through AG are helper columns.
Column E : =INDEX(AE$2:AE$30;MATCH(ROWS(AE$2:AE2);$AG$2:$AG$30;0)) This is the actual result. Gets all x-values in AE and sorts these based on an index column AG; this should actually be the last column in the logical flow, but for presentation purposes it is cleaner to have this next to the input data sets;
F : =INDEX(AF$2:AF$30;MATCH(ROWS(AF$2:AF2);$AG$2:$AG$30;0)) Same for y-values;
G : =IF(ISNA(H2);NA();COUNTIF($H$2:$H$30;"<="&H2)) Creates an index to sort the combined x-values of both data sets. You can also sort dynamically without such a helper column, but then you need a VLOOKUP or INDEX/MATCH, and I have had some bad experiences with these on long decimal numbers;
H : =IF(ROW()-1<=COUNT($A$2:$A$30);A2;IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));INDEX($C$2:$C$30;ROW()-COUNT($A$2:$A$30)-1;1);NA())) Combines x-values of both data sets, i.e. in columns A & C;
I : =IF(ROW()-1<=COUNT($B$2:$B$30);B2;IF((ROW()-1)<=(COUNT($B$2:$B$30)+COUNT($D$2:$D$30));INDEX($D$2:$D$30;ROW()-COUNT($B$2:$B$30)-1;1);NA())) Same for the y-values;
J : =IF(ROW()-1<=COUNT($A$2:$A$30);"S1";IF((ROW()-1)<=(COUNT($A$2:$A$30)+COUNT($C$2:$C$30));"S2";NA())) Assign "S1", or "S2" to each data point, as indication from which data set they come;
K : =IF(J2=J3;INTERCEPT(I2:I3;H2:H3);NA()) Determines the intercept of the line segment starting at that data point;
L : =IF(J2=J3;SLOPE(I2:I3;H2:H3);NA()) Same for slope;
M : =INDEX(H$2:H$30;MATCH(ROWS(H$2:H2);$G$2:$G$30;0)) Sorts all x-values;
N : =INDEX(I$2:I$30;MATCH(ROWS(I$2:I2);$G$2:$G$30;0)) Same for y-values
O : =INDEX(J$2:J$30;MATCH(ROWS(J$2:J2);$G$2:$G$30;0)) Same for corresponding "S1/S2" value to indicate from which data set they come;
P : =INDEX(K$2:K$30;MATCH(ROWS(K$2:K2);$G$2:$G$30;0)) Same for intercept;
Q : =INDEX(L$2:L$30;MATCH(ROWS(L$2:L2);$G$2:$G$30;0)) Same for slope;
R : =IF(O2="S1";"S2";"S1") Inversion between S1 & S2.
S : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));2);NA())} Array formula to be put in cell S3 (hence ctrl+shift+enter) that will search for the intercept of the preceding data point of the other data set.
T : {=IFERROR(INDEX($O$2:$Q2;MAX(IF($O$2:$O2=$R3;ROW($O$2:$O2)-ROW(INDEX($O$2:$O2;1;1))+1));3);NA())} Same for slope;
U : =IF(OR(ISNA(N2);NOT(ISNUMBER(S2)));NA();M2*T2+S2) Calculates the y-value on the line segment of the other data set;
V : =MAX(IFNA(U2;N2);N2) Maximum value between the original y-value and the calculated y-value on the corresponding line segment of the other data set;
W : =(V2=N2) Checks whether the y-value comes from the original data set or not;
X : =IF(O2="S1";IF(W2;"S1";"S2");IF(W2;"S2";"S1")) Determines on which data set (line) the y-value sits (S1 or S2);
Y : =IFERROR(AND((X2<>X3);COUNTIF(X3:$X$30;X2)>0);FALSE) Determines when the data sets cross (i.e. the lines on the graph intersect);
Z : =IF(Y2;(S2-P2)/(Q2-T2);NA()) Calculates x-value of intersection;
AA : =IF(Y2;Z2*Q2+P2;NA()) Calculates y-value of intersection;
AB : =COUNTIF($Z$2:$Z$30;"<="&Z2) Index to sort the newly calculated intersection points (I sort them because then combining them with the other data sets is straightforward, re-using the formula of column H);
AC : =INDEX(Z$2:Z$30;MATCH(ROWS(Z$2:Z2);$AB$2:$AB$30;0)) Sorted x-values of intersection points;
AD : =INDEX(AA$2:AA$30;MATCH(ROWS(AA$2:AA2);$AB$2:$AB$30;0)) Same for y-values;
AE : =IF(ROW()-1<=COUNT(M$2:M$30);M2;IF((ROW()-1)<=(COUNT(M$2:M$30)+COUNT(AC$2:AC$30));INDEX(AC$2:AC$30;ROW()-COUNT(M$2:M$30)-1;1);NA())) Combine x-values of Set 1, Set 2, and the intersection points;
AF : =IF(ROW()-1<=COUNT(V$2:V$30);V2;IF((ROW()-1)<=(COUNT(V$2:V$30)+COUNT(AD$2:AD$30));INDEX(AD$2:AD$30;ROW()-COUNT(V$2:V$30)-1;1);NA())) Same for y-values;
AG : =IF(ISNA(AE2);NA();COUNTIF($AE$2:$AE$30;"<="&AE2)) Creates the index to sort the resulting data set (this is used to calculate the final results in columns E & F);
All formulas go until row 30, but this needs to be changed of course, based on the actual data sets. The idea is to add these formulas to one worksheet, and then columns E through AG can be copied to all other worksheets. There are obviously quite a few #N/A values, but these are on purpose and are not errors or mistakes. On request, I can share the actual spreadsheet, so you do not have to retype all formulas.
Some additional comments
You have to modify some formulas (the sort indices) if there are identical x-values, either within Set 1 (which I will not cover here, as this seems unlikely, or would be a data input error), or between Set 1 and Set 2. The dynamic sorting does not work in that case. A workaround is to create a "synthetic" sort column, e.g. with =TEXT(H2;"0000.00000000000")&J2. This formats all numbers the same way as text, and appends S1 or S2. So this should give unique sort values, which would sort the same way as the corresponding numbers.
Empty data sets or data sets with only 1 value are not treated correctly either (the intercept formulas and finding values for the "previous" data point are meaningless in these cases).
I need to display this kind of data in one chart:
Which means:
Function A: Y= X + 10 ;
Function B: Y= X * 2 ;
but what I got from Excel is:
Actually what I want is (sorry for the drawing):
My real data is like:
If I format it like this:
I still cannot get the correct charts.
Updated to account for change of question
Your data structure is wrong to achieve this in Excel, see for example:
Put your x-axis values on the left, and don't put a header in this column; otherwise it will be plotted on your graph as a series (an extra line).
You can fill in gaps in your data with #N/A and Excel will ignore these points, joining the line to the next available point.