For example, I have a table in with thousands of row:
| col
row 1 | NN
row 2 | NNP
row 3 | VB
row 4 | NNP
row 5 | NN
row 6 | NNP
row 7 | JJ
row 8 | NUM
row 9 | CW
.....
row 10000 | NNP
Can I count with condition :
IF (row 1 = NN AND row 2= NNP) AND (row 2 = NN AND row 3= NNP) AND ... AND (row 9999 = NN and row 10000 = NNP) THEN count?
Try -
=SUMPRODUCT((A1:A10000="NN")*(A2:A10001="NNP"))
This should also work:
=COUNTIFS(A1:A999,"NN",A2:A1000,"NNP")
Related
I need to drop duplicate rows in my DataFrame only if the number of duplicates is less than x (e.g. 3)
(if more than 3 duplicates, keep them !)
Sample:
where count is number of duplicates and duplicates are in col data
data | count
-------------
a | 1
b | 2
b | 2
c | 1
d | 3
d | 3
d | 3
Desired result:
data | count
-------------
a | 1
b | 1
c | 1
d | 3
d | 3
d | 3
How can i achieve this? Thanks in advance.
I believe you need chain conditions with Series.duplicated and get greater or equal values of N in boolean indexing, last set 1 for count column:
N = 3
df1 = df[~df.duplicated('data') | df['count'].ge(N)].copy()
df1.loc[df['count'] < N, 'count'] = 1
print (df1)
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
IIUC, you could do the following:
# create mask for non-duplicates and groups larger than 3
mask = (df.groupby('data')['count'].transform('count') >= 3) | ~df.duplicated('data')
# filter
filtered = df.loc[mask].drop('count', axis=1)
# reset count column
filtered['count'] = filtered.groupby('data')['data'].transform('count')
print(filtered)
Output
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
N = 3
df['count'] = df['count'].apply(lambda x: 1 if x < N else x)
result = pd.concat([df[df['count'].eq(1)].drop_duplicates(), df[df['count'].eq(N)]])
result
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
I have a dataframe with a bunch of integer values. I then compute the column totals and append it as a new row to the dataframe. So far so good.
Now I want to append another computed row where the value of each cell is the sum of cell above and the cell on the left. You can see what I mean below:
----------------------------------------------------------------
|250000 |0 |145000 |145000 |220000 |165000 |145000 |145000 |
----------------------------------------------------------------
|250000 |250000 |395000 |540000 |760000 |925000 |1070000|1215000 |
----------------------------------------------------------------
How can this be done?
I think you need Series.cumsum with select last row (total row) by DataFrame.iloc:
df = pd.DataFrame({
'B':[4,5,4],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
})
df.loc['sum'] = df.sum()
df.loc['cumsum'] = df.iloc[-1].cumsum()
#if need only cumsum row
#df.loc['cumsum'] = df.sum().cumsum()
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
sum 13 24 9 14
cumsum 13 37 46 60
In excel, I have data divided into
Year Code Class Count
2001 RAI01 LNS 9
2001 RAI01 APRP 4
2001 RAI01 3
2002 RAI01 BPR 3
2002 RAI01 BRK 3
2003 RAI01 URE 3
2003 CFCOLLTXFT APRP 2
2003 CFCOLLTXFT BPR 2
2004 CFCOLLTXFT GRL 2
2004 CFCOLLTXFT HDS 2
2005 RAI HDS 2
where I need to find the top 3 products for that particular customer for that particular year.
The real trick here is to rank each row based on a group.
Your rank is determined by your Count column (Column D).
Your group is determined by your Year and Code (I think) columns (Column A and B respectively).
You can use this gnarly sumproduct() formula to get a rank (Starting at 1) based on the Count for each Group.
So to get a ranking for each Year and Code from 1 to whatever, in a new column next to this data:
=SUMPRODUCT(($A$2:$A$50=A2)*(B2=$B$2:$B$50)*(D2<$D$2:$D$50))+1
And copy that down. Now you can AutoFilter on this to show all rows that have a rank less than 4. You can sort this on Customer, then Year and you should have a nice list of top 3 within each year/code.
Explanation of sumproduct.
Sumproduct goes row by row and applies the math that is defined for each row. When it is done it sums the results.
As an example, take the following worksheet:
+---+---+---+
| | A | B |
+---+---+---+
| 1 | 1 | 1 |
| 2 | 1 | 4 |
| 3 | 2 | 2 |
| 4 | 4 | 1 |
| 5 | 1 | 2 |
+---+---+---+
`=SUMPRODUCT((A1:A5)*(B1:B5))`
This sumproduct will take A1*B1, A2*B2, A3*B3, A4*B4, A5*B5 and then add those five results up to give you a number. That is 1 + 4 + 4 + 4 + 1 = 15
It will also work on conditional/boolean statements returning, for each row/condition a 1 or a 0 (for True and False, which is a "Boolean" value).
As an example, take the following worksheet that holds the type of publication in a library and a count:
+---+----------+---+
| | A | B |
+---+----------+---+
| 1 | Book | 1 |
| 2 | Magazine | 4 |
| 3 | Book | 2 |
| 4 | Comic | 1 |
| 5 | Pamphlet | 2 |
+---+----------+---+
=SUMPRODUCT((A1:A5="Book")*(B1:B5))
This will test to see if A1 is "Book" and return a 1 or 0 then multiple that result by whatever is B1. Then continue for each row in the range up to row 5. The result will 1+0+2+0+0 = 3. There are 3 books in the library (it's not a very big library).
For this answer's sumproduct:
So ($A$2:$A$50=A2) says to return a 1 if A2=A2 or a 0 if A2<>A2. It does that for A2 through A50 comparing it to A2, returning a 1 or a 0.
(B2=$B$2:$B$50) will test each cell B2 through B50 to see if it is equal to B2 and return a 1 or 0 for each test.
The same is true for (D2<$D$2:$D$50) but it's testing to see if the count is less than the current cells count.
So... essentially this is saying "For all the rows 1 through 50, test to find all the other rows that have the same value in Column A and B AND have a count less than this rows count. Count all of those rows up that meet that criteria, and add 1 to it. This is the rank of this row within its group."
Copying this formula has it redetermine that rank for each row allowing you to rank and filter.
Column: A | B | C | D
Row 1: Variable | Margin | Sales | Index
Row 2: banana | 2 | 20 | 1
Row 3: apple | 5 | 10 | 2
Row 4: apple | 10 | 20 | 3
Row 5: apple | 10 | 10 | 4
Row 6: banana | 10 | 15 | 5
Row 7: apple | 10 | 15 | 6
"Variable" sits in column A, row 1.
"Fruit" refers to A2:A6
"Margin" refers to B2:B6
"Sales" refers to C2:C6
"Index" refers to D2:D6
Question:
From the above table, I would like to find the row of two largest "Sales" values when Fruit = "apple" and Margin >= 10. The correct answer would be values from row 3 and 6. I have tried the following methods without success.
I have tried
=LARGE(IF(Fruit="apple",IF(Margin>=10,Sales)),{1,2}) + CSE
and this returns 20 and 15, but not the row.
I have tried
=MATCH(LARGE(IF(Fruit="apple",IF(Margin>=10,sales)),{1,2}),Sales,0)+1
but returns row 2 and 6 as the first matches to come up are the 20 and 15 from "banana" not "apple".
I have tried
=INDEX(D2:D7,LARGE(IF(Fruit="apple",IF(Margin>=10,ROW(Sales)-ROW(INDEX(Sales,1,1))+1)),{1,2}),1)
But this returns row 7 and 5 (i.e. "Index" 6 and 4) as these are just the first occurrences of "apple" starting from the bottom of the table. They are not the largest values.
Can this be done with an Excel formula or do would I need a macro? If macro, can I please get help with the macro? Thank you!
use this formula:
=INDEX(D:D,AGGREGATE(15,6,ROW($A$2:$A$7)/(($B$2:$B$7>=10)*($A$2:$A$7="apple")*($C$2:$C$7 = AGGREGATE(14,6,$C$2:$C$7/(($B$2:$B$7>=10)*($A$2:$A$7="apple")),F2))),1))
I put 1 and 2 in F2 and F3 respectively to find the first and second.
Edit #1
to deal with duplicates we need to add (COUNTIF($G$1:G1,$D$2:$D$7) = 0). The $G$1:G1 needs to refer to the cell directly above the first placement of this formula. So the formula needs to start in at least row 2.
=INDEX(D:D,AGGREGATE(15,6,ROW($A$2:$A$7)/((COUNTIF($G$1:G1,$D$2:$D$7) = 0)*($B$2:$B$7>=10)*($A$2:$A$7="apple")*($C$2:$C$7 = AGGREGATE(14,6,$C$2:$C$7/(($B$2:$B$7>=10)*($A$2:$A$7="apple")),F2))),1))
I need to create multiple Excel rows based off of a single row. For example, I currently have a single row for each personnel and there are dozens of columns that are "grouped" so to speak. So say column K is its own group, then columns M, N, O are a group, P, Q, R, are a group, etc. I need that single row to become multiple rows - one row per group of columns. So the current situation is:
1 | Smith, John | Column K | Column M | Column N | Column O | Column P | Column Q| Column R
And I need that to become:
1 | Smith, John | Column K
2 | Smith, John | Blank | Column M | Column N | Column O
3 | Smith, John | Blank | Blank | Blank | Blank | Column P | Column Q | Column R
Here's a solution. You probably can do that math a bit smarter. I can't right now. ;)
Sub splitRows()
Dim i As Integer
With Sheets(1)
For i = 2 To (.UsedRange.Columns.Count / 3)
.Range(Cells(i, 1), Cells(i, 3)).Value = .Range(Cells(1, 1), Cells(1, 3)).Value
.Range(Cells(i, (i - 1) * 3 + 1), Cells(i, (i - 1) * 3 + 3)).Value = .Range(Cells(1, (i - 1) * 3 + 1), Cells(1, (i - 1) * 3 + 3)).Value
Next
End With
End Sub
This will only work correctly for sheets where there are always groups of three columns. If not, you have to change that .UsedRange.Columns.Count / 3 part.
Cheers.