Is there a way with DataFrame objects to grab rows based on single-row conditions AND the preceding rows as well? Like 'grep -B1 ...' does on Linux

I have a pd.DataFrame object df, and I can select all the rows that match a condition on a single column. I also want to grab the row immediately preceding each matching row. The result should be a pd.DataFrame containing these rows.
I can write code to do that, and I am not asking for it (though feel free to illustrate if you have a neat, short way of doing it); I am wondering whether pandas has a built-in tool for this that I am not aware of.
An example showing what I'm looking for:
import pandas as pd
df = pd.DataFrame([{'a':1, 'b':'apples'}, {'a':5, 'b':'pears'}, {'a':2, 'b':'4 plums'},
                   {'a':9, 'b':'bananas'}, {'a':5, 'b':'cherries'}, {'a':2, 'b':'100 grapes'},
                   {'a':3, 'b':'oranges'}, {'a':8, 'b':'cherries'}])
print(df)
# prints:            | my markings here, not part of printout, showing
#    a           b   | with a '+' the rows I wish to select and why
# 0  1      apples   |
# 1  5       pears   | + - because it's a preceding row
# 2  2     4 plums   | + - because it has a number
# 3  9     bananas   |
# 4  5    cherries   | + - because it's a preceding row
# 5  2  100 grapes   | + - because it has a number
# 6  3     oranges   |
# 7  8    cherries   |
# condition would be all the rows where 'b' column has the number of items too:
df[[not x.isalpha() for x in df.b]]
# but this returns only the condition rows, of index 2 and 5, not rows
# 1, 2, 4, 5 as I want it.

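IIUC, you are looking for shift(-1). Shifting the boolean mask up by one row flags the row before each match, so OR-ing it with the original mask keeps both the matching rows and their predecessors. Passing fill_value=False keeps the shifted mask boolean instead of introducing a NaN in the last row:
c = ~df.b.str.isalpha()
df[c | c.shift(-1, fill_value=False)]
   a           b
1  5       pears
2  2     4 plums
4  5    cherries
5  2  100 grapes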
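The same idea can be extended to an arbitrary number of preceding rows, like grep -B n. The following is only a sketch: with_preceding is a hypothetical helper name, not a pandas built-in, and it assumes the mask is a boolean Series aligned with the DataFrame.

```python
import pandas as pd

def with_preceding(df, mask, n=1):
    """Return rows where mask is True plus the n rows before each of them.

    Hypothetical helper (not part of pandas); mask must be a boolean
    Series aligned with df.
    """
    keep = mask.copy()
    for k in range(1, n + 1):
        # shift(-k) moves every flag k rows up, marking the k-th preceding row
        keep = keep | mask.shift(-k, fill_value=False)
    return df[keep]

df = pd.DataFrame({
    'a': [1, 5, 2, 9, 5, 2, 3, 8],
    'b': ['apples', 'pears', '4 plums', 'bananas',
          'cherries', '100 grapes', 'oranges', 'cherries'],
})
has_number = ~df['b'].str.isalpha()
result = with_preceding(df, has_number, n=1)   # rows 1, 2, 4, 5
```

With n=2 the same call would also pick up rows 0 and 3 for this sample data.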

Related

Pandas, combine unique value from two column into one column while preserving order

I have data in four columns, as shown below. Some values in column 1 appear again in column 3. I would like to combine column 1 with column 3, removing the duplicates from column 3 while preserving the order. Column 1 is associated with column 2 and column 3 with column 4, so each item should keep its own price when the columns are merged. Any help will be appreciated.
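Input table:
| Item  | Price | Item  | Price |
|-------|-------|-------|-------|
| Car   | 105   | Truck | 54822 |
| Chair | 20    | Pen   | 1     |
| Cup   | 2     | Car   | 105   |
|       |       | Glass | 1     |
Output table:
| Item  | Price |
|-------|-------|
| Car   | 105   |
| Chair | 20    |
| Cup   | 2     |
| Truck | 54822 |
| Pen   | 1     |
| Glass | 1     |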
Thank you in advance.
After separating the input table into the left and right part, we can concatenate the left hand items with the unduplicated right hand items quite simply with boolean indexing:
import pandas as pd
from io import StringIO
# this initial section only recreates your sample input table
# (sep is passed by keyword as a raw-string regex; the variable is named
# table rather than input to avoid shadowing the Python builtin)
table = pd.read_table(StringIO("""| Item | Price | Item | Price |
|-------|-------|------|-------|
| Car | 105 | Truck| 54822 |
| Chair | 20 | Pen | 1 |
| Cup | 2 | Car | 105 |
| | | Glass| 1 |
"""), sep=r' *\| *', engine='python', usecols=[1,2,3,4], skiprows=[1], keep_default_na=False)
table.columns = list(table.columns[:2])*2
# now separate the input table into the left and right part
left = table.iloc[:,:2].replace("", pd.NA).dropna().set_index('Item')
right = table.iloc[:,2:].set_index('Item')
# finally construct the output table by concatenating without duplicates
output = pd.concat([left, right[~right.index.isin(left.index)]])
       Price
Item
Car      105
Chair     20
Cup        2
Truck  54822
Pen        1
Glass      1
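If the two halves are already available as separate DataFrames, a shorter variant is to stack them and drop repeated items. This is a sketch under that assumption (the frames are built by hand here rather than parsed from the table). Note that, unlike the isin approach, drop_duplicates would also remove items repeated within the left half itself.

```python
import pandas as pd

# Assumed starting point: the left and right halves as separate frames
left = pd.DataFrame({'Item': ['Car', 'Chair', 'Cup'],
                     'Price': [105, 20, 2]})
right = pd.DataFrame({'Item': ['Truck', 'Pen', 'Car', 'Glass'],
                      'Price': [54822, 1, 105, 1]})

# Stack left on top of right, then keep only the first occurrence of each
# Item, so left-hand rows win and the original order is preserved
output = (pd.concat([left, right], ignore_index=True)
            .drop_duplicates(subset='Item', keep='first')
            .reset_index(drop=True))
```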

Filter rows based on the count of unique values

I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts but cannot figure out how to filter on them with df.value_counts().
I want to keep the rows whose value in column A occurs at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Orange 2
Mango 1
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
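Alternatively, when the threshold is exactly 2, use duplicated with parameter keep=False, which flags every row whose value in A occurs more than once: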
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
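For an arbitrary threshold, another option is to broadcast each group's size back onto its rows with groupby and transform. A minimal sketch using the sample data from the question; min_count is an assumed name for the threshold:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Apple', 'Orange', 'Apple', 'Mango', 'Orange'],
                   'C': [4, 5, 3, 5, 1]})

min_count = 2  # assumed threshold: keep values of A occurring at least this often

# transform('size') returns, for every row, the number of rows in its group
group_sizes = df.groupby('A')['A'].transform('size')
filtered = df[group_sizes >= min_count]
```

This avoids building the intermediate list of labels, at the cost of computing a size for every row.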

Excel array formula to find row of largest values based on multiple criteria

Column: A | B | C | D
Row 1: Variable | Margin | Sales | Index
Row 2: banana | 2 | 20 | 1
Row 3: apple | 5 | 10 | 2
Row 4: apple | 10 | 20 | 3
Row 5: apple | 10 | 10 | 4
Row 6: banana | 10 | 15 | 5
Row 7: apple | 10 | 15 | 6
"Variable" sits in column A, row 1.
"Fruit" refers to A2:A7
"Margin" refers to B2:B7
"Sales" refers to C2:C7
"Index" refers to D2:D7
Question:
From the above table, I would like to find the rows of the two largest "Sales" values where Fruit = "apple" and Margin >= 10. The correct answer would be the rows with Index 3 and 6 (sheet rows 4 and 7). I have tried the following methods without success.
I have tried
=LARGE(IF(Fruit="apple",IF(Margin>=10,Sales)),{1,2}) + CSE
and this returns 20 and 15, but not the row.
I have tried
=MATCH(LARGE(IF(Fruit="apple",IF(Margin>=10,sales)),{1,2}),Sales,0)+1
but this returns rows 2 and 6, because the first matches for 20 and 15 in Sales belong to "banana", not "apple".
I have tried
=INDEX(D2:D7,LARGE(IF(Fruit="apple",IF(Margin>=10,ROW(Sales)-ROW(INDEX(Sales,1,1))+1)),{1,2}),1)
But this returns row 7 and 5 (i.e. "Index" 6 and 4) as these are just the first occurrences of "apple" starting from the bottom of the table. They are not the largest values.
Can this be done with an Excel formula, or would I need a macro? If a macro is needed, can I please get help with it? Thank you!
Use this formula:
=INDEX(D:D,AGGREGATE(15,6,ROW($A$2:$A$7)/(($B$2:$B$7>=10)*($A$2:$A$7="apple")*($C$2:$C$7 = AGGREGATE(14,6,$C$2:$C$7/(($B$2:$B$7>=10)*($A$2:$A$7="apple")),F2))),1))
I put 1 and 2 in F2 and F3 respectively to find the first and second.
Edit #1
To deal with duplicates, we need to add (COUNTIF($G$1:G1,$D$2:$D$7) = 0), which excludes any Index value already returned in the cells above. The $G$1:G1 needs to refer to the cell directly above the first placement of this formula, so the formula needs to start in at least row 2.
=INDEX(D:D,AGGREGATE(15,6,ROW($A$2:$A$7)/((COUNTIF($G$1:G1,$D$2:$D$7) = 0)*($B$2:$B$7>=10)*($A$2:$A$7="apple")*($C$2:$C$7 = AGGREGATE(14,6,$C$2:$C$7/(($B$2:$B$7>=10)*($A$2:$A$7="apple")),F2))),1))
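To sanity-check the expected result outside Excel, the same selection logic (filter on both criteria, then take the two largest Sales) can be sketched in pandas. This is not an Excel formula, only a cross-check on the sample table; the variable names are made up for the illustration.

```python
import pandas as pd

# The sample table from the question (sheet rows 2 to 7)
table = pd.DataFrame({
    'Variable': ['banana', 'apple', 'apple', 'apple', 'banana', 'apple'],
    'Margin':   [2, 5, 10, 10, 10, 10],
    'Sales':    [20, 10, 20, 10, 15, 15],
    'Index':    [1, 2, 3, 4, 5, 6],
})

# Apply both criteria first, then rank only the surviving rows by Sales
matches = table[(table['Variable'] == 'apple') & (table['Margin'] >= 10)]
top_two = matches.nlargest(2, 'Sales')
top_indexes = top_two['Index'].tolist()   # [3, 6]
```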

tabulate frequency counts including zeros

To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts. Making use of the information in this pivot table, I want to create a new table, with counts for every value between 1 and 5.
1,1
2,2
3,2
4,0
5,1
What is a good way to do this? My first thought was to use VLOOKUP, trapping any lookup error. But GETPIVOTDATA is apparently preferred for pivot tables. In any case, I failed with both approaches.
To be a bit more specific, assume my pivot table of counts is "PivotTable1" and that I have already created a one column table holding all the needed lookup keys (i.e., the numbers from 1 to 5). What formula should I put in the second column of this new table?
So starting with this:
To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts.
I then created the table like this:
X | Freq
- | ---------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
2 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
3 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
4 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
5 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
Or, in A1 mode:
X | Freq
- | -----------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F3),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F4),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F5),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F6),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F7),0)
The column X in my summary table is in column F.
Or as a table formula:
X | Freq
- | -------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
That gave me this result:
X | Freq
- | ----
1 | 1
2 | 2
3 | 2
4 | 0
5 | 1
If performance is not a major concern, you can bypass the pivot table and use the COUNTIF() function.
Create a list of all consecutive numbers that you want the counts for and use COUNTIF() for each of them with the first parameter being the range of your input numbers and the second being the number of the ordered result list:
    A   B   C   D
1   1       1   =COUNTIF(A:A,C1)
2   2       2   =COUNTIF(A:A,C2)
3   3       3   =COUNTIF(A:A,C3)
4   5       4   =COUNTIF(A:A,C4)
5   3       5   =COUNTIF(A:A,C5)
6   2
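For comparison, the "look up the count, fall back to 0" logic behind both the IFERROR and COUNTIF approaches can be sketched in a few lines of Python. This is only an illustration of the idea, not a spreadsheet solution; the variable names are made up.

```python
from collections import Counter

data = [1, 2, 3, 5, 3, 2]   # the sample column from the question
counts = Counter(data)      # plays the role of the pivot table of counts

# A Counter returns 0 for keys it has never seen, which is the same
# fallback that IFERROR(..., 0) provides around GETPIVOTDATA
freq_table = [(x, counts[x]) for x in range(1, 6)]
# freq_table == [(1, 1), (2, 2), (3, 2), (4, 0), (5, 1)]
```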

sort pandas value_counts() primarily by descending counts and secondarily by ascending values

When applying value_counts() to a series in pandas, by default the counts are sorted in descending order, however the values are not sorted within each count.
How can I have the values within each identical count sorted in ascending order?
apples 5
peaches 5
bananas 3
carrots 3
apricots 1
The output of value_counts is a series itself (just like the input), so you have available all of the standard sorting options as with any series. For example:
import pandas as pd
df = pd.DataFrame({ 'fruit':['apples']*5 + ['peaches']*5 + ['bananas']*3 +
                    ['carrots']*3 + ['apricots'] })
# DataFrame.sort was removed from pandas, so sort_values is used; the columns
# are named explicitly because the reset_index defaults differ between versions
counts = df.fruit.value_counts().rename_axis('fruit').reset_index(name='count')
counts.sort_values(['count','fruit'], ascending=[False,True])
      fruit  count
0    apples      5
1   peaches      5
2   bananas      3
3   carrots      3
4  apricots      1
I'm actually getting the same results by default, so here's a test with ascending=[False,False] to demonstrate that this is actually working as suggested.
counts.sort_values(['count','fruit'], ascending=[False,False])
      fruit  count
1   peaches      5
0    apples      5
3   carrots      3
2   bananas      3
4  apricots      1
I'm actually a bit confused about what the desired output is in terms of ascending vs. descending, but regardless, there are 4 possible combinations and you can get whichever you like by altering the ascending keyword argument.
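If you would rather keep the result as a Series, one way is to compute the label order yourself and reindex with it. A sketch, using deliberately unsorted input so that the effect is visible; the sort key (negative count, then value) is my own choice for the illustration:

```python
import pandas as pd

# Input deliberately given in an order that is not alphabetical within ties
fruit = pd.Series(['peaches'] * 5 + ['apples'] * 5 + ['carrots'] * 3 +
                  ['bananas'] * 3 + ['apricots'], name='fruit')

counts = fruit.value_counts()

# Order the labels by descending count first, then by ascending value
order = sorted(counts.index, key=lambda label: (-counts[label], label))
result = counts.loc[order]
# index order: apples, peaches, bananas, carrots, apricots
```

Negating the count in the key works only because the counts are numeric; for other mixes of directions the sort_values approach with a list for ascending is more general.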
