loc statement returns a Series and I need a single value - python-3.x

I use the loc statement to look up each value of a coordinate pair, but apparently, if you use loc with a condition it returns a Series rather than a single value. My conditions ensure that I should always get exactly one value back, but when I assign it to its field I always end up in the 'except' part of my statement. How can I make this work?
My situation is as follows: I have a dataframe containing a sequence of possession points (defined by x,y coordinates). For each line I have the following fields:
Possession | PossessionSequence | x_from | y_from
-----------+--------------------+--------+-------
         1 |                  1 |     12 |     24
         1 |                  2 |     89 |     45
         1 |                  3 |     67 |     80
         1 |                  4 |    110 |     72
         2 |                  1 |     23 |     79
         2 |                  2 |     32 |     93
Now, I want to add x_to and y_to fields to this dataframe where the values for these fields in the first records are the x_from and y_from from the second record. The x_to and y_to for the second record are the x_from and y_from from the third record and so on. So within a possession, I always need to take the values from the next possessionsequence. So I would like to get the following:
Possession | PossessionSequence | x_from | y_from | x_to | y_to
-----------+--------------------+--------+--------+------+------
         1 |                  1 |     12 |     24 |   89 |   45
         1 |                  2 |     89 |     45 |   67 |   80
         1 |                  3 |     67 |     80 |  110 |   72
         1 |                  4 |    110 |     72 |      |
         2 |                  1 |     23 |     79 |   32 |   93
         2 |                  2 |     32 |     93 |      |
When I get to the last value of the PossessionSequence (e.g. 4 for possession 1 in the dataframe above) there is no next record, and as shown the x_to and y_to values should then be left blank. I therefore wrapped the lines in a try ... except statement so that blanks are assigned when no next line is found.
So far I have come up with the following code:
# Add the TO x,y coordinates to each line (except the last one)
df['X_to'] = 0
df['Y_to'] = 0
for index, row in df.iterrows():
    current_team = df.loc[index, 'Team']
    current_posession = df.loc[index, 'Posession']
    current_sequence = df.loc[index, 'PosSeq']
    try:
        df.loc[index, 'X_to'] = df.loc[(df['Team'] == current_team) & (df['Posession'] == current_posession) & (df['PosSeq'] == current_sequence + 1), 'X_from']
        df.loc[index, 'Y_to'] = df.loc[(df['Team'] == current_team) & (df['Posession'] == current_posession) & (df['PosSeq'] == current_sequence + 1), 'Y_from']
    except:
        df.loc[index, 'X_to'] = ""
        df.loc[index, 'Y_to'] = ""
But when I run this it always ends up in the 'except' part and no 'to' coordinates are assigned. I am trying to familiarize myself with the debug mode in Visual Studio Code, and there I can see that the loc statement looking up the 'to' coordinates finds the correct value, but returns it as part of a Series rather than as a single value. How can I extract only the coordinate value from this? All help is welcome!

Sorry for the bother ... it was as simple as adding .values[0] to the lines in question (and defining the columns upfront with the correct datatype).
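For reference, here is a minimal sketch of that fix, plus a vectorized alternative that avoids the row loop entirely (column names Team, Posession, PosSeq, X_from, Y_from follow the code above; treat this as a sketch, not something tested against the real data):

# scalar lookup inside the loop: take the single matching value, not a Series
mask = ((df['Team'] == current_team)
        & (df['Posession'] == current_posession)
        & (df['PosSeq'] == current_sequence + 1))
df.loc[index, 'X_to'] = df.loc[mask, 'X_from'].values[0]

# vectorized alternative: within each Team/Posession group the "to" coordinates
# are simply the next row's "from" coordinates; the last row of each group
# automatically gets NaN
nxt = (df.sort_values(['Team', 'Posession', 'PosSeq'])
         .groupby(['Team', 'Posession'])[['X_from', 'Y_from']]
         .shift(-1))
df['X_to'] = nxt['X_from']
df['Y_to'] = nxt['Y_from']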

Related

Selecting columns by their position while using a function in pandas

My dataframe looks something like this:
import pandas as pd

frame = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'week1_values': [0, 0, 13, 39, 64],
                      'week2_values': [32, 35, 25, 78, 200]})
I am trying to apply a function to calculate the Week over Week percentage difference between two columns('week1_values' and 'week2_values') whose names are being generated dynamically.
I want to create a function to calculate the percentage difference between weeks keeping in mind the zero values in the 'week1_values' column.
My function is something like this:
def WoW(df):
    if df.iloc[:, 1] == 0:
        return (df.iloc[:, 1] - df.iloc[:, 2])
    else:
        return ((df.iloc[:, 1] - df.iloc[:, 2]) / df.iloc[:, 1]) * 100
frame['WoW%'] = frame.apply(WoW, axis=1)
When I try to do that, I end up with this error:
IndexingError: ('Too many indexers', 'occurred at index 0')
How is it that one is supposed to specify columns by their positions inside a function?
PS: Just to clarify, since the column names are being generated dynamically, I am trying to select them by their position with iloc.
Because apply with axis=1 passes each row to the function as a Series, remove the column part of the indexing:
def WoW(df):
    if df.iloc[1] == 0:
        return (df.iloc[1] - df.iloc[2])
    else:
        return ((df.iloc[1] - df.iloc[2]) / df.iloc[1]) * 100
frame['WoW%'] = frame.apply(WoW, axis=1)
Vectorized alternative:
import numpy as np

s = frame.iloc[:, 1] - frame.iloc[:, 2]
frame['WoW%1'] = np.where(frame.iloc[:, 1] == 0, s, (s / frame.iloc[:, 1]) * 100)
print (frame)
   id  week1_values  week2_values        WoW%       WoW%1
0   1             0            32  -32.000000  -32.000000
1   2             0            35  -35.000000  -35.000000
2   3            13            25  -92.307692  -92.307692
3   4            39            78 -100.000000 -100.000000
4   5            64           200 -212.500000 -212.500000
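Since the week column names are generated dynamically, one option (a sketch, assuming the two week columns are always the last two columns of the original frame, before any WoW columns are added) is to pick them up by position once and then compute the same expression:

# hypothetical: grab the two most recent week columns by position
prev_col, curr_col = frame.columns[-2], frame.columns[-1]
s = frame[prev_col] - frame[curr_col]
frame['WoW%'] = np.where(frame[prev_col] == 0, s, (s / frame[prev_col]) * 100)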
You can use pandas pct_change method to automatically compute the percent change.
s = (frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100)
frame['WoW%'] = s.mask(np.isinf(s), frame.iloc[:, -1])
output:
   id  week1_values  week2_values        WoW%
0   1             0            32   32.000000
1   2             0            35   35.000000
2   3            13            25   92.307692
3   4            39            78  100.000000
4   5            64           200  212.500000
Note however that the way you currently do it in your custom function is biased. Changes from 0->20, or 10->12, or 100->120 would all produce 20 as output, which seems ambiguous.
Suggested alternative
Use a classical percent increase, even if it leads to infinite values:
frame['WoW'] = frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100
output:
   id  week1_values  week2_values         WoW
0   1             0            32         inf
1   2             0            35         inf
2   3            13            25   92.307692
3   4            39            78  100.000000
4   5            64           200  212.500000

Finding which rows have duplicates in a .csv, but only if they have a certain amount of duplicates

I am trying to determine which sequential rows have at least 50 duplicates within one column. Then I would like to be able to read which rows have the duplicates in a summarized manner, i.e.:
start  end  total
    9   60     51
  200  260     60
I'm trying to keep the start and end separate so I can call on them independently later.
I have this to open the .csv file and read its contents:
df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)
df.groupby(0).filter(lambda x: len(x) > 0)
Which gives me this:
         0
0     52.0
1     65.0
2     52.0
3     52.0
4     52.0
...    ...
4995   8.0
4996   8.0
4997   8.0
4998   8.0
4999   8.0
5000 rows × 1 columns
I'm having a number of problems with this. 1) I'm not sure I totally understand the second function. It seems like it is supposed to group the numbers in my column together. This code:
df.groupby(0).count()
gives me this:
0
0.0
1.0
2.0
3.0
4.0
...
68.0
69.0
70.0
71.0
73.0
65 rows × 0 columns
Which I assume means that there are a total of 65 different unique identities in my column. This just doesn't tell me what they are or where they are. I thought that's what this one would do
df.groupby(0).filter(lambda x: len(x) > 0)
but if I change the 0 to anything else then it screws up my generated list.
Problem 2) I think in order to get the number of duplicates in a sequence, and which rows they are in, I would probably need to use a for loop, but I'm not sure how to build it. So far, I've been pulling my hair out all day trying to figure it out but I just don't think I know Python well enough yet.
Can I get some help, please?
UPDATE
Thanks! So this is what I have thanks to @piterbarg:
# function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df
           .reset_index()
           .shift(periods=-1)
           .groupby((df[0].diff() != 0).cumsum())  # if the diff between a row and the prev row is not 0, increase the cumulative sum
           .agg({0: 'mean', 'index': ['first', 'last', len]}))  # mean is the behavior category
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()                 # drop N/A
              .astype(int)              # type = int
              .reset_index(drop=True))
    print(df3)
out:
      0 index
   mean first last  len
0     7    32    87   56
1    19   277   333   57
2     1   785   940  156
3    30  4062  4125   64
4    29  4214  4269   56
5     7  4450  4599  150
6     1  4612  4775  164
7     7  4778  4882  105
8     8  4945  4999   56
The current issue is trying to make it so the dataframe includes the last row of my .csv. If anyone happens to see this, I would love your input!
Let's start by mocking a df:
import numpy as np
import pandas as pd

np.random.seed(314)
df = pd.DataFrame({0: np.random.randint(10, size=5000)})
# make sure we have a couple of large blocks
df.loc[300:400, 0] = 5
df.loc[600:660, 0] = 4
First we identify where the value changes from one row to the next, and group by each such run of identical values. We record where each run starts, where it finishes, and its size.
df2 = (df.reset_index()
         .groupby((df[0].diff() != 0).cumsum())
         .agg({'index': ['first', 'last', len]})
      )
Then we only pick those groups that are longer than 50
(df2.where(df2[('index', 'len')] > 50)
    .dropna()
    .astype(int)
    .reset_index(drop=True)
)
output:
  index
  first last  len
0   300  400  101
1   600  660   61
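To get exactly the start / end / total layout asked for at the top of the question, the MultiIndex columns of that result can be flattened (a sketch, assuming the filtered frame above is saved as out):

out = (df2.where(df2[('index', 'len')] > 50)
          .dropna()
          .astype(int)
          .reset_index(drop=True))
out.columns = ['start', 'end', 'total']   # flatten the ('index', ...) MultiIndex
print(out)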
For your question as to what df.groupby(0).filter(lambda x: len(x) > 0) does: as far as I can tell it does nothing. It groups by the different values in column 0 and then discards those groups whose size is 0, which by definition is none of them. So this returns your full df.
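For comparison, a non-zero threshold in filter keeps rows by how often their value occurs anywhere in the column, not by consecutive runs, so it would not answer the original question either (a sketch):

# keeps every row whose value appears more than 50 times in column 0,
# regardless of whether those occurrences are consecutive
frequent = df.groupby(0).filter(lambda x: len(x) > 50)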
Edit
Your code is not quite right; it should be:
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())
             .agg({0: 'mean', 'index': ['first', 'last', len]}))
    df3 = (df2.where(df2[('index', 'len')] > 50)
              .dropna()
              .astype(int)
              .reset_index(drop=True))
    print(df3)
Note that we define and print df3, not df2. I also amended the code to include the value that is repeated, in the mean column (sorry, the names are not very intuitive, but you can change them if you want).
first is the index where the repetition starts, last is the last index, and len is how many elements there are.
# function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())  # if the diff between a row and the prev row is not 0, increase the cumulative sum
             .agg({0: 'mean', 'index': ['first', 'last', len]}))  # mean is the behavior category
    # .shift(-1)   (attempted here, but not applied in the output below)
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()                 # drop N/A
              .astype(int)              # type = int
              .reset_index(drop=True))
    print(df3)
yields this:
      0 index
   mean first last  len
0     7    31    86   56
1    19   276   332   57
2     1   784   939  156
3    31  4061  4124   64
4    29  4213  4268   56
5     8  4449  4598  150
6     1  4611  4774  164
7     8  4777  4881  105
8     8  4944  4999   56
Which I love. I did notice that the group with 56x duplicates of '7' actually starts on row 32, and ends on row 87 (just one later in both cases, and the pattern is consistent throughout the sheet). Am I right in believing that this can be fixed with the shift() function somehow? I'm toying around with this still :D
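If that one-row offset simply reflects pandas' 0-based positions versus the 1-based row numbers in the sheet, a simpler alternative to shifting the data (a sketch, assuming that is the only cause) is to adjust the aggregated columns directly:

# hypothetical adjustment: convert 0-based pandas positions to 1-based sheet rows
df3[('index', 'first')] += 1
df3[('index', 'last')] += 1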

How to generate random numbers from different intervals that add up to a fixed sum in excel?

I need to generate 13 numbers from 13 different intervals which will add up to 1360. In the chart below, "Index" labels the 13 different numbers and "mean" is the mean (average) of each interval. The range is plus or minus 15% of the mean, as shown below. I would prefer to have the random numbers generated from a normal distribution N(mean, 7.5% of mean). I take it back: no normal distribution. Please use ±15% as hard limits of the intervals.
It will be great if anyone could figure out how to do it in excel. Algorithms will be appreciated as well.
Index  mean  15%  low  high
A       288   43  245   331
B        50    8   43    58
C       338   51  287   389
D        50    8   43    58
E        16    2   14    18
F        66   10   56    76
G       118   18  100   136
H        17    3   14    20
I        91   14   77   105
J        26    4   22    30
K       117   18   99   135
L       165   25  140   190
M        18    3   15    21
I would sort the table by increasing mean and use an extra column for a helper value (column H in the formulas below).
The idea is to maintain -- while going to the next row -- the current deviation from a perfect aim for the final target. Perfect would mean that every random value coincides with the mean for that row. If a value is 2 less than the mean, then that 2 will appear in the H column for the next row. The random number generated for that next row will then not aim for the given mean, but for 2 less than the mean. The range for the random number will appropriately be reduced so that the low/high values will never be crossed.
By first sorting the rows, we can be sure that this corrected mean will always fall within the next row's low/high range, and so it will always be possible to generate an acceptable random number there.
The final value will be calculated differently: it will be the remainder that is needed to achieve the target sum. For the same reason as above, this value is guaranteed to be within the low/high range.
The formulas used are as follows:
   | F                                                | H
---+--------------------------------------------------+------------------------------
 2 | =RANDBETWEEN(D2, E2)                             |
 3 | =RANDBETWEEN(B3+H3-C3+ABS(H3), B3+H3+C3-ABS(H3)) | =SUM($B$2:$B2)-SUM($F$2:$F2)
 4 | (copy above formula)                             | (copy above formula)
...| ...                                              | ...
13 | (copy above formula)                             | (copy above formula)
14 | =SUM($B$2:$B14)-SUM($F$2:$F13)                   |
In theory the rows do not need to be sorted first, but then the formulas cannot be copied down like above, but must reference the correct rows. That would make it quite complicated.
If it is absolutely necessary that the rows are presented in order of the Index column (A, B, C...), then use another sheet to do the above. Then in the main sheet read the value into the F column with a VLOOKUP from the other sheet. So in F2 you would have:
=VLOOKUP(A2, OtherSheet!$A$2:$F$14, 6, 0)
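For readers who prefer code over spreadsheet formulas, here is a quick sketch of the same bookkeeping in Python (using the low/high bounds from the table; the sorting by increasing mean and the forced final value follow the description above):

import random

# (index, mean, +/- range) from the table, sorted by increasing mean
rows = [('E', 16, 2), ('H', 17, 3), ('M', 18, 3), ('J', 26, 4), ('B', 50, 8),
        ('D', 50, 8), ('F', 66, 10), ('I', 91, 14), ('K', 117, 18),
        ('G', 118, 18), ('L', 165, 25), ('A', 288, 43), ('C', 338, 51)]

result = {}
deviation = 0                        # sum of means so far minus sum of values so far
for i, (idx, mean, rng) in enumerate(rows):
    if i == len(rows) - 1:
        value = mean + deviation     # last value is forced so the total matches
    else:
        low = mean + deviation - rng + abs(deviation)
        high = mean + deviation + rng - abs(deviation)
        value = random.randint(low, high)
    result[idx] = value
    deviation += mean - value

print(result, sum(result.values()))  # the 13 values always sum to 1360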
Get the random number like this
num = Int((300 - 200 + 1) * Rnd + 200)   ' between 200 and 300
Each subsequent random number needs to be drawn from the total sum minus the sum you already have, and the last one is simply whatever is left.
For example (if we have 4 numbers that sum up to 100):
A is a random number between 0 and 100 (let's say 42)
then B is a random number between 0 and (100-42) => 0 to 58 (let's say 18)
then C is a random number between 0 and (100-42-18) => 0 to 40 (let's say 25)
then, at the end, D is 100-42-18-25 => D is 15
(100-42-18-25 is the same as 100 - SUM(A,B,C))
Here is my example: generate a random number based on the low and high columns.
The formula in column F is just a RANDBETWEEN:
=RANDBETWEEN($D2,$E2)
Then you can get the result always equal to 1360 with the formula below for column G:
=F2/SUM($F$2:$F$14)*1360
So cell G15 will always be 1360, which is the sum of all those 13 values.

PowerPivot field of same Row in Calculation

I'm trying to write a formula that, for every row, returns the first matching entry.
An Example Table would be like this:
Column A | Column B | Column C | Expected Output from Formula
---------+----------+----------+-----------------------------
3        | 99       | P 18     | P 4
4        | 88       | P 144    | P 1
2        | 77       | P 2      |
2        | 77       | P 2      |
1        | 88       | P 1      | P 1
1        | 99       | P 4      | P 4
2        | 44       | P 5      |
3        | 22       | P 7      |
1        | 88       | P 99     | P 1
Now, Column D should always show the Column C value of the first row where Column A = 1 and Column B has the same value as the current row (99 for the first row, 88 for the second, 77 for the third, ...).
I tried it with the following Formula:
=CALCULATE(
    FIRSTNONBLANK('Table'[Column C]; TRUE());
    FILTER('Table'; 'Table'[Column A]=1);
    FILTER('Table'; 'Table'[Column B]='Table'[Column B])
)
Which doesn't work. No errors, but it ignores the second filter.
If I now replace "='Table'[Column B]" with a number it should match (99, 88, 77, ...), it shows the correct result. But since it is now a static number, it shows the same result in every row instead of recalculating it for each row.
Can someone help?
Try this:
= CALCULATE(FIRSTNONBLANK('Table'[Column C], TRUE()),
            FILTER(FILTER('Table', 'Table'[Column A] = 1),
                   'Table'[Column B] = EARLIER('Table'[Column B])))

Dropping all id rows if at least one cell meets a given criterion (e.g. has a missing value)

My dataset is in the following form:
clear
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
end
In my true dataset, observations are sorted by id and by year (not shown here).
What I need to do is to drop all the rows of a specific id if (at least) one of the following two conditions is met:
there is at least one missing value of var.
var decreases from one row to the next (for the same id)
So in my example what I would like to obtain is:
id var
1 20
1 21
1 32
1 34
Now, my unfortunate attempt has been to use row-wise operations together with by, in order to create a drop1 variable to be used later to subset the dataset.
Something along these lines (which is clearly wrong):
bysort id: gen drop1=1 if var[_n] < var[_n-1] | var[_n]==.
This doesn't work, and I am not even sure that I am considering the most "clean" and direct way to solve the task.
How would you proceed? Any help would be highly appreciated.
My interpretation is that you want to drop the complete group if either of the two conditions is met. I assume your dataset is sorted in some way, most likely based on another variable. Otherwise, the structure is fragile.
The logic is simple. Check for decreasing values, but leave out the first observation of each group, i.e., leave out _n == 1. The first observation would otherwise always be flagged, because var[_n-1] does not exist there and a missing value compares as larger than any number. Then, check also for missings.
clear
set more off
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
end
// maintain original sequencing
gen orig = _n
order id orig
bysort id (orig) : gen todrop = sum((var < var[_n-1] & _n > 1) | missing(var))
list, sepby(id)
by id : drop if todrop[_N]
list, sepby(id)
One way to do this is to create some indicator variable as you had attempted. If you only want to drop where var decreases from one observation to the next, you could use:
clear
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
4 .
4 2
end
gen i = id if mi(var)
bysort id : egen k = mean(i)
drop if id == k
drop i k
drop if var[_n-1] > var[_n] & _n != 1
However, if you want to get the output you supplied in the post (drop all subsequent observations where var decreases from some max value), you could try the following in place of the last line above.
local N = _N
forvalues i = 1/`N' {
    drop if var[_n-1] > var[_n] & _n != 1
}
The loop just ensures that the drop if var... line is executed enough times so that all observations where var < 34 are dropped.
