Dropping all id rows if at least one cell meets a given criterion (e.g. has a missing value) - subset

My dataset is in the following form:
clear
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
end
In my true dataset, observations are sorted by id and by year (not shown here).
What I need to do is to drop all the rows of a specific id if (at least) one of the following two conditions is met:
there is at least one missing value of var.
var decreases from one row to the next (for the same id)
So in my example what I would like to obtain is:
id var
1 20
1 21
1 32
1 34
Now, my unfortunate attempt has been to use row-wise operations together with by, in order to create a drop1 variable to be later used to subset the dataset.
Something along these lines (which is clearly wrong):
bysort id: gen drop1=1 if var[_n] < var[_n-1] | var[_n]==.
This doesn't work, and I am not even sure that I am considering the most "clean" and direct way to solve the task.
How would you proceed? Any help would be highly appreciated.

My interpretation is that you want to drop the complete group if either of two conditions is met. I assume your dataset is sorted in some way, most likely based on another variable; otherwise, the structure is fragile.
The logic is simple. Check for decreasing values, but leave out the first observation of each group, i.e., leave out _n == 1: within each group var[0] evaluates to missing, and any non-missing value compares as smaller than missing, so the first observation would otherwise always be flagged. Then also check for missing values.
clear
set more off
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
end
// maintain original sequencing
gen orig = _n
order id orig
bysort id (orig) : gen todrop = sum((var < var[_n-1] & _n > 1) | missing(var))
list, sepby(id)
by id : drop if todrop[_N]
list, sepby(id)
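For readers who want to cross-check the logic outside Stata, here is a minimal pandas sketch of the same group-level filter (the column names simply mirror the example data; this is an illustration, not the Stata answer itself):

```python
import pandas as pd

# Example data mirroring the Stata input above
df = pd.DataFrame({
    "id":  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "var": [20, 21, 32, 34, 11, None, 15, 21, 22, 1, 2, 5],
})

# Keep a group only if var has no missing values and never decreases
out = df.groupby("id").filter(
    lambda g: g["var"].notna().all() and g["var"].is_monotonic_increasing
)
print(out)  # only the id == 1 rows survive
```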

One way to do this is to create some indicator variable as you had attempted. If you only want to drop where var decreases from one observation to the next, you could use:
clear
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
4 .
4 2
end
gen i = id if mi(var)
bysort id : egen k = mean(i)
drop if id == k
drop i k
drop if var[_n-1] > var[_n] & _n != 1
However, if you want to get the output you supplied in the post (drop all subsequent observations where var decreases from some max value), you could try the following in place of the last line above.
local N = _N
forvalues i = 1/`N' {
    drop if var[_n-1] > var[_n] & _n != 1
}
The loop just ensures that the drop if var... line is executed enough times that all observations where var < 34 are dropped.
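As a hedged cross-check outside Stata, the fixed-point behaviour of that loop can be sketched in a few lines of Python (the list mirrors var after the groups with missing values are gone):

```python
# var values remaining after ids with missing values are dropped
vals = [20, 21, 32, 34, 21, 22, 1, 2, 5]

# Repeatedly drop any element smaller than its current predecessor,
# until nothing changes -- the analogue of running `drop if` in a loop.
changed = True
while changed:
    kept = [v for i, v in enumerate(vals) if i == 0 or v >= vals[i - 1]]
    changed = kept != vals
    vals = kept

print(vals)  # only the rows where var never fell below an earlier value
```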


loc statement returns series and I need a single value

I use loc to look up each value of a coordinate pair, but apparently, if you use loc with a condition, it returns a Series and not a single value. My conditions should always return only one value, but when I assign it to its field I always end up in the 'except' part of my statement. How can I make this work?
My situation is as follows: I have a dataframe containing a sequence of possession points (defined by x,y coordinates). For each line I have the following fields:
Possession | PossessionSequence | x_from | y_from
-----------+--------------------+--------+-------
1 1 12 24
1 2 89 45
1 3 67 80
1 4 110 72
2 1 23 79
2 2 32 93
Now, I want to add x_to and y_to fields to this dataframe where the values for these fields in the first records are the x_from and y_from from the second record. The x_to and y_to for the second record are the x_from and y_from from the third record and so on. So within a possession, I always need to take the values from the next possessionsequence. So I would like to get the following:
Possession | PossessionSequence | x_from | y_from | x_to | y_to
-----------+--------------------+--------+--------+------+------
1 1 12 24 89 45
1 2 89 45 67 80
1 3 67 80 110 72
1 4 110 72
2 1 23 79 32 93
2 2 32 93
Now, when I get to the last value of the possessionsequence (e.g. 4 for possession 1 in the dataframe above) there is no next record (and, as shown, the x_to and y_to values should be left blank), so I wrapped the lines in a try ... except statement so that blanks are assigned when no next line is found.
So far I have come up with the following code:
# Add the TO x,y coordinates to each line (except the last one)
df['X_to'] = 0
df['Y_to'] = 0
for index, row in df.iterrows():
    current_team = df.loc[index, 'Team']
    current_posession = df.loc[index, 'Posession']
    current_sequence = df.loc[index, 'PosSeq']
    try:
        df.loc[index, 'X_to'] = df.loc[(df['Team'] == current_team) & (df['Posession'] == current_posession) & (df['PosSeq'] == current_sequence + 1), 'X_from']
        df.loc[index, 'Y_to'] = df.loc[(df['Team'] == current_team) & (df['Posession'] == current_posession) & (df['PosSeq'] == current_sequence + 1), 'Y_from']
    except:
        df.loc[index, 'X_to'] = ""
        df.loc[index, 'Y_to'] = ""
But when I run this it always ends up in the 'except' part and no 'to' coordinates are assigned. I am trying to familiarize myself with the debug mode in Visual Studio Code, and there I see that the loc statement that looks up the 'to' coordinates comes up with the correct value, but only as part of a Series, not as a single value. How can I extract just the coordinate value from it? All help is welcome!
Sorry for the bother ... it was as simple as adding .values[0] to the lines in question (and defining the columns upfront with the correct datatype).
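For what it's worth, the per-row lookup can be avoided entirely with a grouped shift, which pulls each next row's coordinates up within a possession. This is a sketch assuming the column names from the sample table; the real data would presumably also group by Team:

```python
import pandas as pd

df = pd.DataFrame({
    "Possession":         [1, 1, 1, 1, 2, 2],
    "PossessionSequence": [1, 2, 3, 4, 1, 2],
    "x_from": [12, 89, 67, 110, 23, 32],
    "y_from": [24, 45, 80, 72, 79, 93],
})

# Within each possession, x_to/y_to are the next row's x_from/y_from;
# the last row of each possession gets NaN automatically.
df["x_to"] = df.groupby("Possession")["x_from"].shift(-1)
df["y_to"] = df.groupby("Possession")["y_from"].shift(-1)
print(df)
```

This assumes rows are already ordered by PossessionSequence within each possession, as in the question.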

Is there a way to sort a list so that rows with the same value in one column are evenly distributed?

Hoping to sort (below left) by sector but distribute evenly (below right):

Name  Sector      Name  Sector
A     1           A     1
B     1           E     2
C     1           H     3
D     4           D     4
E     2           B     1
F     2           F     2
G     2           J     3
H     3           I     4
I     4           C     1
J     3           G     2
Real data is 70+ rows with 4 sectors.
I've worked around it manually but would love to figure out how to do it with a formula in excel.
Here's a more complete (and hopefully more accurate) idea - the carouselOrder is the column I'd like to generate via a formula.
guestID  guestSector  carouselOrder
1        1            1
2        1            5
3        1            9
4        1            13
5        2            2
6        2            6
7        2            10
8        2            14
9        3            3
10       3            7
11       3            11
12       2            18
13       1            17
14       1            20
15       1            23
16       2            21
17       2            24
18       2            27
19       1            26
20       1            29
21       1            30
22       1            31
23       3            15
24       3            19
25       3            22
26       3            25
27       3            28
28       1            32
29       4            4
30       4            8
31       4            12
32       4            16
When using Office 365 you can use the following in D2: =MOD(SEQUENCE(COUNTA(A2:A11),,0),4)+1
This creates a repeating counter of the sectors 1 to 4 down the total count of rows in your data.
In C2 use the following:
=BYROW(D2#,LAMBDA(x,
INDEX(
FILTER($A$2:$A$11,$B$2:$B$11=x),
SUM(--(D$2:x=x)))))
This filters the Names that equal the sector of the current row and indexes the result to return only the entry whose position in the filter result equals the count of that same sector (D2#) up to the current row.
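The round-robin distribution itself is easy to cross-check outside Excel. Here is a minimal Python sketch of the idea (sort by occurrence-within-sector first, then by sector); the names and sectors are the ones from the small example above:

```python
from collections import defaultdict

names   = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
sectors = [ 1,   1,   1,   4,   2,   2,   2,   3,   4,   3 ]

# Key each name by (how many of its sector we've seen so far, sector);
# sorting by that key interleaves the sectors evenly.
seen = defaultdict(int)
keys = []
for s in sectors:
    keys.append((seen[s], s))
    seen[s] += 1

order = [name for _, name in sorted(zip(keys, names))]
print(order)
```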
Let's try the following approach, which doesn't require creating a helper column. I would like to explain first the logic used to build the recurrence, then the Excel formula that implements it.
If we sort the input data Name and Sector. by Sector. in ascending order, the new positions of the Name values (letters) can be calculated as follows (Table 1):
Name  Sector.Sorted  Position
A     1              1+4*0=1
B     1              1+4*1=5
C     1              1+4*2=9
E     2              2+4*0=2
F     2              2+4*1=6
G     2              2+4*2=10
H     3              3+4*0=3
J     3              3+4*1=7
D     4              4+4*0=4
I     4              4+4*1=8
The new positions of Name (letters) follows this pattern (Formula 1):
position = Sector.Sorted + groupSize * factor
where groupSize is 4 in our case and factor counts how many times the same Sector.Sorted value is repeated, starting from 0. Think about Sector.Sorted as groups, where each set of repeated values represents a group: 1,2,3 and 4.
If we are able to build the Position values, we can sort Name based on the new positions via the SORTBY(array, by_array1) function. Check the SORTBY documentation for more information on how this function works.
Here is the formula to get the Name sorted in cell E2:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
seq0, SEQUENCE(ROWS(sSector),,0), mapResult,
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW")))), factor,
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
Explanation
The name sorted represents the input data sorted by Sector. in ascending order, i.e.: SORT(A2:B11,2). The names sName and sSector represent each column of sorted.
To identify each group we need the following sequence (seq0) starting from 0, i.e. SEQUENCE(ROWS(sSector),,0).
Now we need to identify when a new group starts. We use MAP function for that and the result is represented by the name mapResult:
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW"))))
The logic is the following: if we are at the beginning of the sequence (first value of seq0), return SAME; otherwise check the current value of sSector (a) against the previous one, represented by INDEX(sSector,b). If they are the same, we are in the same group; otherwise a new group has started.
The intermediate result of mapResult is:
Name  Sector Sorted  mapResult
A     1              SAME
B     1              SAME
C     1              SAME
E     2              NEW
F     2              SAME
G     2              SAME
H     3              NEW
J     3              SAME
D     4              NEW
I     4              SAME
The first two columns are shown just for illustrative purposes; mapResult only returns the last column.
Now we just need to create a counter that restarts every time we find NEW. To do that we use the SCAN function, and the result is stored under the name factor. This value is the factor we multiply by 4 within each group (see Table 1):
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0)))
The accumulator starts in -1, because the counter starts with 0. Every time we find SAME, it increments by 1 the previous value. When it finds NEW (not equal to SAME), the accumulator is reset to 0.
Here is the intermediate result of factor:
Name  Sector Sorted  mapResult  factor
A     1              SAME       0
B     1              SAME       1
C     1              SAME       2
E     2              NEW        0
F     2              SAME       1
G     2              SAME       2
H     3              NEW        0
J     3              SAME       1
D     4              NEW        0
I     4              SAME       1
The first three columns are shown for illustrative purposes.
Now we have all the elements to build our pattern for the new positions, represented by the name pos:
MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n))
where m represents each element of Sector.Sorted and n the previously calculated factor values. As you can see, the Excel formula matches the generic formula (Formula 1 above). The intermediate result will be:
Name  Sector Sorted  mapResult  factor  pos
A     1              SAME       0       1
B     1              SAME       1       5
C     1              SAME       2       9
E     2              NEW        0       2
F     2              SAME       1       6
G     2              SAME       2       10
H     3              NEW        0       3
J     3              SAME       1       7
D     4              NEW        0       4
I     4              SAME       1       8
The previous columns are shown just for illustrative purposes. Now that we have the new positions, we are ready to sort Name by them via:
SORTBY(sName,pos)
Update
The first MAP can be removed by creating an array as input for SCAN that carries both the sSector value and the index position used to find the previous element. SCAN only allows a single array as input argument, so we combine both pieces of information in a new array. This is the formula that can be used instead:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
factor, SCAN(-1,sSector&"-"&SEQUENCE(ROWS(sSector),,0),
LAMBDA(aa,b, LET(s, TEXTSPLIT(b,"-"),item, INDEX(s,,1),
idx, INDEX(s,,2), IF(aa=-1, 0, IF(1*item=INDEX(sSector, idx), aa+1,0))))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
We use a LET function inside SCAN to calculate all the elements required for the comparison within the corresponding LAMBDA. We extract the item and the idx position used to find the previous element of sSector. Via:
1*item=INDEX(sSector, idx)
we can compare each element of sSector with the previous one, starting from the second element of sSector. We multiply item by 1 because TEXTSPLIT converts the result to text; otherwise the comparison would fail.
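The factor/position recurrence from Table 1 can be cross-checked in a few lines of Python (a sketch; the variable names are hypothetical, and groupSize is 4 as above):

```python
sectors = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]  # Sector.Sorted
group_size = 4

# factor resets to 0 whenever the sector changes, otherwise increments,
# mirroring the SCAN accumulator that starts at -1
factor, prev = [], None
for s in sectors:
    factor.append(factor[-1] + 1 if s == prev else 0)
    prev = s

# position = Sector.Sorted + groupSize * factor  (Formula 1)
pos = [s + group_size * f for s, f in zip(sectors, factor)]
print(pos)
```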

Selecting columns by their position while using a function in pandas

My dataframe looks something like this
frame = pd.DataFrame({'id':[1,2,3,4,5],
'week1_values':[0,0,13,39,64],
'week2_values':[32,35,25,78,200]})
I am trying to apply a function to calculate the Week over Week percentage difference between two columns('week1_values' and 'week2_values') whose names are being generated dynamically.
I want to create a function to calculate the percentage difference between weeks keeping in mind the zero values in the 'week1_values' column.
My function is something like this:
def WoW(df):
    if df.iloc[:,1] == 0:
        return (df.iloc[:,1] - df.iloc[:,2])
    else:
        return ((df.iloc[:,1] - df.iloc[:,2]) / df.iloc[:,1]) *100
frame['WoW%'] = frame.apply(WoW,axis=1)
When I try to do that, I end up with this error:
IndexingError: ('Too many indexers', 'occurred at index 0')
How is it that one is supposed to specify columns by their positions inside a function?
PS: Just want to clarify that since the column names are being generated dynamically, I am trying to select them by their position with the iloc function.
Because apply with axis=1 passes each row to the function as a Series, remove the column indexing:
def WoW(df):
    if df.iloc[1] == 0:
        return (df.iloc[1] - df.iloc[2])
    else:
        return ((df.iloc[1] - df.iloc[2]) / df.iloc[1]) *100
frame['WoW%'] = frame.apply(WoW,axis=1)
Vectorized alternative:
s = frame.iloc[:,1] - frame.iloc[:,2]
frame['WoW%1'] = np.where(frame.iloc[:, 1] == 0, s, (s / frame.iloc[:,1]) *100)
print (frame)
id week1_values week2_values WoW% WoW%1
0 1 0 32 -32.000000 -32.000000
1 2 0 35 -35.000000 -35.000000
2 3 13 25 -92.307692 -92.307692
3 4 39 78 -100.000000 -100.000000
4 5 64 200 -212.500000 -212.500000
You can use pandas pct_change method to automatically compute the percent change.
s = (frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100)
frame['WoW%'] = s.mask(np.isinf(s), frame.iloc[:, -1])
output:
id week1_values week2_values WoW%
0 1 0 32 32.000000
1 2 0 35 35.000000
2 3 13 25 92.307692
3 4 39 78 100.000000
4 5 64 200 212.500000
Note however that the way you currently do it in your custom function is biased. Changes from 0->20, or 10->12, or 100->120 would all produce -20 as output, which is ambiguous.
suggested alternative
use a classical percent increase, even if it leads to infinite:
frame['WoW'] = frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100
output:
id week1_values week2_values WoW
0 1 0 32 inf
1 2 0 35 inf
2 3 13 25 92.307692
3 4 39 78 100.000000
4 5 64 200 212.500000

How to use AWK command for a variable with multiple entries? [closed]

Closed 6 years ago. This question needs to be more focused; it is not currently accepting answers.
How can I use awk to get the information from an ID with various observations/variables? As an example below, I have two IDs (a, b). Both have 6 observations at different ages with different measurements.
The file is sorted based on ID and age.
The line numbers are not part of the actual data file, but the headings are!
I'd like to use awk to identify and extract {print} the measurement difference between the earliest age and the latest age of each unique ID. Looking at the following example, for ID a, I'd like to obtain 50-11=39.
id age measurement
1 a 2 11
2 a 4 20
3 a 6 19
4 a 7 89
5 a 8 43
6 a 12 50
7 b 1 15
8 b 3 23
9 b 5 30
10 b 6 33
11 b 7 45
12 b 10 60
I would highly appreciate it if you could explain in detail so that I can learn.
Ordered input data
Given that the line numbers are not part of the data but the headings are, the file looks more like:
id age measurement
a 2 11
a 4 20
a 6 19
a 7 89
a 8 43
a 12 50
b 1 15
b 3 23
b 5 30
b 6 33
b 7 45
b 10 60
This script analyzes that file as desired:
awk 'NR==1 { next }
$1 != id { if (id != "") print id, id_max - id_min; id = $1; id_min = $3; }
{ id_max = $3 }
END { if (id != "") print id, id_max - id_min; }' data
The first line skips the first line of the file.
The second line checks whether the ID has changed; if so, it checks whether there was an old ID, and if so, prints the data. It then stores the current ID and records the measurement for the minimum age.
The third line records the measurement for the current maximum age for the current ID.
The last line prints out the data for the last group in the file, if there was any data in the file.
Sample output:
a 39
b 45
Unordered input data
Even if the data is not sequenced by ID and age, the code can be adapted to work:
awk 'NR==1 { next }
{ if (group[$1]++ == 0)
{
order[++sequence] = $1
id_age_min[$1] = $2; id_val_min[$1] = $3
id_age_max[$1] = $2; id_val_max[$1] = $3
}
if ($2 < id_age_min[$1]) { id_age_min[$1] = $2; id_val_min[$1] = $3; }
if ($2 > id_age_max[$1]) { id_age_max[$1] = $2; id_val_max[$1] = $3; }
}
END { for (i = 1; i <= sequence; i++)
{
id = order[i];
print id, id_val_max[id] - id_val_min[id]
}
}' data
This skips the heading line, then tracks groups as they arrive and arranges to print the data in that order (using the group, order and sequence variables). For each row, if the group has not been seen before, it sets up the data for the current row (the row's values serve as both the minimum and the maximum).
If the age in the current row ($2) is less than the current minimum age (id_age_min[$1]), record the new age and the corresponding value. If the age in the current row is larger than the current maximum age (id_age_max[$1]), record the new age and the corresponding value.
At the end, for each ID in sequence, print out the ID and the difference between the maximum and minimum value for that ID.
Shuffled data:
id age measurement
a 12 50
b 10 60
a 4 20
b 3 23
b 5 30
a 7 89
b 6 33
b 7 45
a 8 43
a 6 19
a 2 11
b 1 15
Sample output:
a 39
b 45
It so happens that an a row still appeared before a b row, so the output is the same as before.
More data (note b appears before a this time):
id age measurement
b 10 60
a 12 50
a 4 20
b 3 23
c 2 19
d -9 20
d 10 31
e 10 31
b 5 30
a 7 89
b 6 33
e -9 20
b 7 45
a 8 43
a 6 19
f -9 -3
f -7 -1
g -2 -8
g -8 -3
a 2 11
b 1 15
Result:
b 45
a 39
c 0
d 11
e 11
f 2
g -5
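If awk is not a hard requirement, the same earliest-vs-latest difference can be cross-checked in pandas. This is only a sketch, using the shuffled a/b data from above:

```python
import pandas as pd

df = pd.DataFrame({
    "id":  ["a", "b", "a", "b", "b", "a", "b", "b", "a", "a", "a", "b"],
    "age": [12, 10, 4, 3, 5, 7, 6, 7, 8, 6, 2, 1],
    "measurement": [50, 60, 20, 23, 30, 89, 33, 45, 43, 19, 11, 15],
})

# Sort by age so first/last per id are the earliest/latest observations
g = df.sort_values("age").groupby("id")["measurement"]
diff = g.last() - g.first()
print(diff)
```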

Excel bottom up autofill/addition

I need to fill a column with a running total from the bottom up: each cell should add its own value to the total below it, and if the cell is null or 0, it should carry the value from the cell below. This is part of my backlog calculation, and I can't find/remember the formula I used last time.
A is what I have, b is what I need
A   B
21  34
6   13
3   7
1   4
1   3
    2
1   2
    1
    1
    1
    1
    1
1   1
What about the following, entered in B1 and filled down the column:
=IF(ISBLANK(A1),B2,A1+B2)
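For cross-checking, the same bottom-up running total in Python (None stands for a blank cell; the list is column A from the example):

```python
a = [21, 6, 3, 1, 1, None, 1, None, None, None, None, None, 1]

# Walk from the bottom: each non-blank A value adds to the running total,
# blanks just carry the total from the row below.
b, total = [], 0
for v in reversed(a):
    if v is not None:
        total += v
    b.append(total)
b.reverse()
print(b)
```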
