Pandas filter, group-by and then transform - python-3.x

I have a pandas dataframe, which looks like the following:
df =
a   b
a1  1
a2  0
a1  0
a3  1
a2  1
a1  1
I would like to first filter on b == 1, then group by a and count how many times each group occurs (call this column count), and finally attach that column to the original df. Each value of a is guaranteed to have b equal to 1 at least once.
Expected output:
df =
a   b  count
a1  1  2
a2  0  1
a1  0  2
a3  1  1
a2  1  1
a1  1  2
I tried:
df['count'] = df.groupby('a').b.transform('size')
But, this counts zeros as well. I want to filter for b == 1 first.
I also tried:
df['count'] = df[df['b'] == 1].groupby('a').b.transform('size')
But this introduces NaNs in the count column.
How can I do this in one line?

Apply the condition to b first, then sum the resulting booleans within each group of a:
df['b'].eq(1).groupby(df['a']).transform('sum')
Out[103]:
0 2.0
1 1.0
2 2.0
3 1.0
4 1.0
5 2.0
Name: b, dtype: float64
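As a self-contained sketch of the answer above (the DataFrame is rebuilt from the question's sample data, and the final `astype(int)` is an addition to get integer counts): comparing first keeps every row, so the result aligns with the original index. The asker's second attempt filters rows away before grouping, so the rows with b == 0 have no matching index label and end up as NaN. The `map` variant is an alternative, not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a3", "a2", "a1"],
    "b": [1, 0, 0, 1, 1, 1],
})

# Boolean mask (b == 1), summed per group of a and broadcast back to every row.
df["count"] = df["b"].eq(1).groupby(df["a"]).transform("sum").astype(int)

# Alternative: count the filtered rows per group, then map the counts onto a.
counts = df.loc[df["b"].eq(1), "a"].value_counts()
df["count_via_map"] = df["a"].map(counts)

print(df)
```

Both columns hold the per-group number of rows with b == 1, including on the rows where b is 0.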

Related

Identify and count alternating parts of a column in a (timeseries) dataframe

I am analyzing trades done in a futures contract, based on a csv file with a list of trades (columns are Side, Qty, Price, Date).
I have imported the file and sorted the trades chronologically by time. The column "Side" (BUY/SELL) is now:
B
S
S
B
B
S
S
B
B
B
B
I want to give each consecutive run of B's and each consecutive run of S's a unique number, so that I can group the individual runs of B's and S's for further analysis. For example, I want to find the average price of each run of B's and each run of S's.
In the example above there are 5 runs in total, 3 of B's and 2 of S's. The first run of B's should be 1, the second run of B's should be 3, and the last run of B's should be 5. Basically I want to add a column with this output:
1
2
2
3
3
4
4
5
5
5
5
Now I should be able to find the average price of the four B's in run number 5 using groupby with the new column as the argument and mean().
But how can I build the counter needed for this new column? I am able to identify each change using something like np.where(), diff(), abs() + cumsum() and 1 and -1, but I don't see how I can add +1 at each alternation.
Compare each value with the previous one using Series.shift and Series.ne (not equal), then take the cumulative sum of the resulting booleans with Series.cumsum:
df['new'] = df['Side'].ne(df['Side'].shift()).cumsum()
How it works:
df = df.assign(shifted=df['Side'].shift(),
               mask=df['Side'].ne(df['Side'].shift()),
               new=df['Side'].ne(df['Side'].shift()).cumsum())
print (df)
   Side shifted   mask  new
0     B     NaN   True    1
1     S       B   True    2
2     S       S  False    2
3     B       S   True    3
4     B       B  False    3
5     S       B   True    4
6     S       S  False    4
7     B       S   True    5
8     B       B  False    5
9     B       B  False    5
10    B       B  False    5
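To connect this back to the goal of averaging the price per run, here is a minimal sketch. The Side values are the ones from the question; the Price values and the column name `run` are made up for illustration.

```python
import pandas as pd

trades = pd.DataFrame({
    "Side": list("BSSBBSSBBBB"),
    # Hypothetical prices, only there so that a mean can be computed.
    "Price": [100.0, 101.0, 102.0, 99.0, 98.0, 103.0,
              104.0, 97.0, 96.0, 95.0, 94.0],
})

# True wherever Side differs from the previous row; cumsum numbers the runs.
trades["run"] = trades["Side"].ne(trades["Side"].shift()).cumsum()

# Average price of each run of B's or S's.
avg_price = trades.groupby("run")["Price"].mean()
print(avg_price)
```

The first row always compares unequal to the missing shifted value, which is why the numbering starts at 1.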

Pandas groupby value and return observation count to dataset

I have a dataset like the following:
id value
a 0
a 0
a 0
a 0
a 1
a 2
a 2
a 2
b 0
b 0
b 1
b 2
b 2
I want to groupby the "id" column and grab the number of observations in the "value" column, and return a new column in the original dataset that counts the number of times the "value" observation occurs within each id.
An example of the output I'm looking for is represented in column "output":
id value output
a 0 4
a 0 4
a 0 4
a 0 4
a 1 1
a 2 3
a 2 3
a 2 3
b 0 2
b 0 2
b 1 1
b 2 2
b 2 2
When grouping on id "a", there are 4 observations of 0, which is provided in the column "output" for each row that contains id of "a" and value of 0.
I have tried applications of groupby and apply, to no avail. Any suggestions would be very helpful. Thank you.
Update: I figured out a solution for anyone who also faces this problem, and it works well.
grouped = df.groupby(['id','value'])
df['output'] = grouped['value'].transform('count')
This will return the count of observations under each bucket and return that count to each observation that meets that criteria, as shown in the "output" column above.
Group by both id and value, then count the rows in each group:
data.groupby(['id' , 'value'])['id'].transform('count')
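A runnable sketch of the same idea on the question's sample data (the frame is called `df` here, as in the question). Note that `transform('count')` skips missing values in the counted column, whereas `transform('size')` counts every row in the group; with no NaN present, as here, the two agree.

```python
import pandas as pd

# Sample data from the question: 8 rows for id "a", 5 rows for id "b".
df = pd.DataFrame({
    "id": list("aaaaaaaabbbbb"),
    "value": [0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 1, 2, 2],
})

# Count rows per (id, value) pair and broadcast the count back to each row.
df["output"] = df.groupby(["id", "value"])["value"].transform("count")
print(df)
```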

How to use extractall in Pandas and get a new column with the extracted strings?

I have a data frame of 15 columns read from a csv file. I am trying to extract one part of the text of a column and create a new column containing that information for each row. Each row of 'phospho' should contain only one match for the pattern I pass to extractall. When I try to add the result to my data frame, I get the error:
TypeError: incompatible index of inserted column with frame index
The dataset has two columns with names and 6 columns with values (like 65.98, for example).
Ex:
accession  sequence                 modification                        phospho              CON_1 CON_2 CON_3 LIF1 LIF2 LIF3
P18767     [R].GAAQNIIPASTGAAK.[A]  1xTMT6plex[K15];1xTMT6plex[N-Term]  1xPhospho [S3(98.3)]
Here is the code:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
pmap1['phosphosites'] = a
Thanks!
I created pmap1 using the following sample data:
pmap1 = pd.DataFrame(data=[[ 'S34T44X', 1 ], [ 'E23H78Y', 2 ],
[ 'R49Y81Z', 3 ], [ 'D20U23X', 4 ]], columns=['phospho', 'nn'])
When you extract all matches:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
the result is:
0
match
0 0 S34
1 T44
1 0 E23
1 H78
2 Y
2 0 R49
1 Y81
3 0 D20
Note that:
The result is a DataFrame (with a single column named 0).
It contains eight rows, so it is not clear which row of pmap1 each match should go into.
The index is actually a MultiIndex with 2 levels:
  The first (unnamed) level is the index of the source row.
  The second level (named match) is the number of the match within that row.
  E.g. in the row with index 0, 2 matches were found: S34 (match 0) and T44 (match 1).
So you cannot directly save a as a new column of pmap1, because pmap1 has an "ordinary" index while a has a MultiIndex, which is incompatible with the index of pmap1.
This is exactly what the error message says.
If you want to somehow "add" a to pmap1, you can e.g. spread the matches
into separate columns as follows:
a2 = a.unstack()
Gives the result:
0
match 0 1 2
0 S34 T44 NaN
1 E23 H78 Y
2 R49 Y81 NaN
3 D20 NaN NaN
where the columns are a MultiIndex; to drop its first
level, run:
a2.columns = a2.columns.droplevel()
The result is:
match 0 1 2
0 S34 T44 NaN
1 E23 H78 Y
2 R49 Y81 NaN
3 D20 NaN NaN
Then you can perform the actual join, executing:
pmap1.join(a2)
The result is:
phospho nn 0 1 2
0 S34T44X 1 S34 T44 NaN
1 E23H78Y 2 E23 H78 Y
2 R49Y81Z 3 R49 Y81 NaN
3 D20U23X 4 D20 NaN NaN
If you are unhappy about numbers as column names, you can rename them as
you wish.
If you are unhappy about NaN values for "missing" matches
(in rows where fewer matches were found than in other rows),
add .fillna('') to the last instruction.
Edit
There is a shorter solution:
After creating a, you can do all of the remaining processing
with a single instruction:
pmap1.join(a[0].unstack()).fillna('')
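The steps above can be put together as a small runnable sketch using the answer's sample data. The last statement is an alternative that is not part of the original answer: since the asker expects only one match per row, `str.extract` (which returns just the first match and keeps the original index) can be assigned directly as a column. The names `wide` and `pattern` are introduced here for illustration.

```python
import pandas as pd

pmap1 = pd.DataFrame(
    data=[["S34T44X", 1], ["E23H78Y", 2], ["R49Y81Z", 3], ["D20U23X", 4]],
    columns=["phospho", "nn"],
)
pattern = r"([STEHRYD]\d*)"

# extractall returns one row per match, indexed by (source row, match number).
a = pmap1["phospho"].str.extractall(pattern)

# Spread the matches into columns 0, 1, 2 and join them onto the original frame.
wide = pmap1.join(a[0].unstack()).fillna("")

# Alternative when only the first match per row is wanted: extract keeps the
# original index, so the result can be assigned as a new column directly.
pmap1["phosphosites"] = pmap1["phospho"].str.extract(pattern, expand=False)

print(wide)
print(pmap1)
```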

Matrix with boolean values from a list of paired observations

In the below spreadsheet, the cell values represent an ID for a person. The person in column A likes the person in column B, but it may not be mutual. So, in the first row with data, person 1 likes 2. In the second row with data person 1 likes 3.
A B
1 2
1 3
2 1
2 4
3 4
4 1
I'm looking for a way to build a 4 x 4 matrix with an entry of 1 in (i,j) to indicate that person i likes person j and an entry of 0 to indicate that they don't. The example above should look like this after performing the task:
  1 2 3 4
1 0 1 1 0
2 1 0 0 1
3 0 0 0 1
4 1 0 0 0
So, reading the first row of the matrix we would interpret it like this: person 1 does not like person 1 (cell value = 0), person 1 likes person 2 (cell value = 1), person 1 likes person 3 (cell value =1), person 1 does not like person 4 (cell value = 0)
Note that the order of the pairing matters, so [4 2] does not equal [2 4].
How could this be done?
Assuming your existing data is in A1:B6, then in A10 enter:
=COUNTIFS($A$1:$A$6, ROW()-9,$B$1:$B$6, COLUMN())
This will return a 1 or a 0 depending on whether person 1 likes person 1. They don't, so you get a 0. In A10, ROW()-9 returns 1 and COLUMN() returns 1, and these are the two IDs the formula looks up.
Copy this formula across 4 columns and down 4 rows; in each cell ROW()-9 and COLUMN() return the appropriate pair of IDs, and COUNTIFS() looks for the matching pair in the data.
Personally, if this was something I had to do and my matrix was of indeterminate size, I would probably stick these formulas on a second tab, starting at A1 and use ROW() where I don't have to adjust it by 9. But for a one off on the same tab, to help check the results, the above is fine.
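If the pairs happen to be available in pandas rather than in a worksheet, the same matrix can be sketched with a cross-tabulation. This is an alternative outside Excel, not part of the answer above; the names `likes`, `people` and `matrix` are made up for illustration.

```python
import pandas as pd

# Pairs from the question: the person in column A likes the person in column B.
likes = pd.DataFrame({"A": [1, 1, 2, 2, 3, 4], "B": [2, 3, 1, 4, 4, 1]})
people = [1, 2, 3, 4]

matrix = (
    pd.crosstab(likes["A"], likes["B"])
    # Make sure every person appears as both a row and a column.
    .reindex(index=people, columns=people, fill_value=0)
    # Repeated pairs would count more than once; cap the entries at 1.
    .clip(upper=1)
)
print(matrix)
```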

excel formula depending on dynamic values in different columns

I am trying to create an Excel formula using SUM and SUMIF but cannot work out how.
Column A holds the total time allocated to each task, and each row then lists the time spent on that task on each day (columns B, C, ...).
For each day (columns B, C, ...), the formula should return the sum of the column A values for the tasks that were completed on that day, i.e. the tasks for which the sum of the row's cells up to and including that day is equal to or greater than the time allocated.
Example for one 12-hours task:
A B C D E
12 4 6 2 0
Using the formula:
A B C D E
12 4 6 2 0
0 0 0 12 0
where 12 is displayed in column D because 4 + 6 + 2 = 12(Column A)
Second example(3 tasks):
A B C D E
10 9 0 1 0
21 8 8 5 0
5 0 0 3 2
Using the formula:
A B C D E
10 9 0 1 0
21 8 8 5 0
5 0 0 3 2
0 0 0 31 5
Where:
31(Day D) = 10(Task 1 is finished that day) + 21(Task 2 is finished that day too)
5(Day E) = Task 3 is finished that day
Tried this formula (for Day B):
SUMIF(B1:B3,">=A1:A3",A1:A3)
(Sum the values in column A for the rows whose cells up to column B (in this case just B) are >= the corresponding value in column A.)
Then for column C, it would be,
SUMIF(C1:C3 + B1:B3,">=A1:A3",A1:A3)
Neither attempt worked (the first returns zero, the second is an invalid formula).
Any ideas?
Thank you.
Formula below given by user ServerS works fine:
Col B:
=IF(SUM(B2)=A2,A2,0)+IF(SUM(B3)=A3,A3,0)+IF(SUM(B4)=A4,A4,0)+IF(SUM(B5)=A5,A5,0)
Col C:
=IF(SUM(B2:C2)=A2,A2,0)+IF(SUM(B3:C3)=A3,A3,0)+IF(SUM(B4:C4)=A4,A4,0)+IF(SUM(B5:C5)=A5,A5,0)
Col D
=IF(SUM(B2:D2)=A2,A2,0)+IF(SUM(B3:D3)=A3,A3,0)+IF(SUM(B4:D4)=A4,A4,0)+IF(SUM(B5:D5)=A5,A5,0)
However, there are two drawbacks:
If new rows are added, the formula has to be adapted to include another IF(). A generic sum of IFs would be better.
The formula cannot be propagated to adjacent cells, because that would change parts of it such as "=A2,A2,0" to "=A3,A3,0", which need to stay the same.
Any other ideas that improve this, if possible, are appreciated.
You can avoid IF by using SUMPRODUCT. This method lets you insert any rows you want; just make sure the ranges are correct (e.g. A2:A5, with 5 being the last row used). I would go for this:
in column B:
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5)))
in column C :
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5+C2:C5)))-B6
in column D
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5+C2:C5+D2:D5)))-C6-B6
in column E
=SUMPRODUCT(($A$2:$A$5)*($A$2:$A$5=(B2:B5+C2:C5+D2:D5+E2:E5)))-D6-C6-B6
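For checking the expected numbers outside Excel, the same per-day totals can be sketched in pandas. This is an illustration only, not a replacement for the formulas above. It uses the second example from the question and the question's "equal to or greater than" rule, whereas the formulas above test for equality; the names `tasks`, `days` and `totals` are made up.

```python
import pandas as pd

# Second example from the question: column A is the allocated time,
# columns B..E are the hours spent on each day.
tasks = pd.DataFrame({
    "A": [10, 21, 5],
    "B": [9, 8, 0],
    "C": [0, 8, 0],
    "D": [1, 5, 3],
    "E": [0, 0, 2],
})
days = tasks[["B", "C", "D", "E"]]

# True from the day on which the running total reaches the allocated time.
done = days.cumsum(axis=1).ge(tasks["A"], axis=0)

# First day on which each task is complete (only for tasks that do finish).
finished = done.any(axis=1)
finish_day = done.astype(int).idxmax(axis=1)[finished]

# Sum the allocated time per completion day; days with no completions get 0.
totals = (
    tasks.loc[finished, "A"]
    .groupby(finish_day)
    .sum()
    .reindex(days.columns, fill_value=0)
)
print(totals)
```

Because each task is attributed to its first completion day only, no subtraction of the earlier days' totals is needed here.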
