I have data in four column as shown below. There are some values which are present in column 1, and some value of column 1 is again duplicated in column 3. I would like to combine column 1 with 3, while removing the duplicates from column 3. I would also like to preserve the order of column. Column 1 is associated with column 2 and column 3 is associated with column 4, so it would be nice if I can move column 1 items with column 2 and column 3 items with column 4 during merge. Any help will be appreciated.
Input table:
Item
Price
Item
Price
Car
105
Truck
54822
Chair
20
Pen
1
Cup
2
Car
105
Glass
1
Output table:
Item
Price
Car
105
Chair
20
Cup
2
Truck
54822
Pen
1
Glass
1
Thank you in advance.
After separating the input table into the left and right part, we can concatenate the left hand items with the unduplicated right hand items quite simply with boolean indexing:
import pandas as pd
# this initial section only recreates your sample input table
from io import StringIO
input = pd.read_table(StringIO("""| Item | Price | Item | Price |
|-------|-------|------|-------|
| Car | 105 | Truck| 54822 |
| Chair | 20 | Pen | 1 |
| Cup | 2 | Car | 105 |
| | | Glass| 1 |
"""), ' *\| *', engine='python', usecols=[1,2,3,4], skiprows=[1], keep_default_na=False)
input.columns = list(input.columns[:2])*2
# now separate the input table into the left and right part
left = input.iloc[:,:2].replace("", pd.NA).dropna().set_index('Item')
right = input.iloc[:,2:] .set_index('Item')
# finally construct the output table by concatenating without duplicates
output = pd.concat([left, right[~right.index.isin(left.index)]])
Price
Item
Car 105
Chair 20
Cup 2
Truck 54822
Pen 1
Glass 1
I need to count the unique values of column A and filter out the column with values greater than say 2
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the unique values but not able to figure out how to filer them df.value_count()
I want to filter column A that have greater than 2, expected Dataframe
A B
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Column: A | B | C | D
Row 1: Variable | Margin | Sales | Index
Row 2: banana | 2 | 20 | 1
Row 3: apple | 5 | 10 | 2
Row 4: apple | 10 | 20 | 3
Row 5: apple | 10 | 10 | 4
Row 6: banana | 10 | 15 | 5
Row 7: apple | 10 | 15 | 6
"Variable" sits in column A, row 1.
"Fruit" refers to A2:A6
"Margin" refers to B2:B6
"Sales" refers to C2:C6
"Index" refers to D2:D6
Question:
From the above table, I would like to find the row of two largest "Sales" values when Fruit = "apple" and Margin >= 10. The correct answer would be values from row 3 and 6. I have tried the following methods without success.
I have tried
=LARGE(IF(Fruit="apple",IF(Margin>=10,Sales)),{1,2}) + CSE
and this returns 20 and 15, but not the row.
I have tried
=MATCH(LARGE(IF(Fruit="apple",IF(Margin>=10,sales)),{1,2}),Sales,0)+1
but returns row 2 and 6 as the first matches to come up are the 20 and 15 from "banana" not "apple".
I have tried
=INDEX(D2:D7,LARGE(IF(Fruit="apple",IF(Margin>=10,ROW(Sales)-ROW(INDEX(Sales,1,1))+1)),{1,2}),1)
But this returns row 7 and 5 (i.e. "Index" 6 and 4) as these are just the first occurrences of "apple" starting from the bottom of the table. They are not the largest values.
Can this be done with an Excel formula or do would I need a macro? If macro, can I please get help with the macro? Thank you!
use this formula:
=INDEX(D:D,AGGREGATE(15,6,ROW($A$2:$A$7)/(($B$2:$B$7>=10)*($A$2:$A$7="apple")*($C$2:$C$7 = AGGREGATE(14,6,$C$2:$C$7/(($B$2:$B$7>=10)*($A$2:$A$7="apple")),F2))),1))
I put 1 and 2 in F2 and F3 respectively to find the first and second.
Edit #1
to deal with duplicates we need to add (COUNTIF($G$1:G1,$D$2:$D$7) = 0). The $G$1:G1 needs to refer to the cell directly above the first placement of this formula. So the formula needs to start in at least row 2.
=INDEX(D:D,AGGREGATE(15,6,ROW($A$2:$A$7)/((COUNTIF($G$1:G1,$D$2:$D$7) = 0)*($B$2:$B$7>=10)*($A$2:$A$7="apple")*($C$2:$C$7 = AGGREGATE(14,6,$C$2:$C$7/(($B$2:$B$7>=10)*($A$2:$A$7="apple")),F2))),1))
To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts. Making use of the information in this pivot table, I want to create a new table, with counts for every value between 1 and 5.
1,1
2,2
3,2
4,0
5,1
What is a good way to do this? My first thought was to use VLOOKUP, trapping any lookup error. But GETPIVOTDATA is apparently preferred for pivot tables. In any case, I failed with both approaches.
To be a bit more specific, assume my pivot table of counts is "PivotTable1" and that I have already created a one column table holding all the needed lookup keys (i.e., the numbers from 1 to 5). What formula should I put in the second column of this new table?
So starting with this:
To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts.
I then created the table like this:
X | Freq
- | ---------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
2 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
3 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
4 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
5 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
Or, in A1 mode:
X | Freq
- | -----------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F3),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F4),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F5),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F6),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F7),0)
The column X in my summary table is in column F.
Or as a table formula:
X | Freq
- | -------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
That gave me this result:
X | Freq
- | ----
1 | 1
2 | 2
3 | 2
4 | 0
5 | 1
If performance is not a major concern, you can bypass the pivot table and use the COUNTIF() function.
Create a list of all consecutive numbers that you want the counts for and use COUNTIF() for each of them with the first parameter being the range of your input numbers and the second being the number of the ordered result list:
A B C D
1 1 1 =COUNTIF(A:A,C1)
2 2 2 =COUNTIF(A:A,C2)
3 3 3 =COUNTIF(A:A,C3)
4 5 4 =COUNTIF(A:A,C4)
5 3 5 =COUNTIF(A:A,C5)
6 2
When applying value_counts() to a series in pandas, by default the counts are sorted in descending order, however the values are not sorted within each count.
How can i have the values within each identical count sorted in ascending order?
apples 5
peaches 5
bananas 3
carrots 3
apricots 1
The output of value_counts is a series itself (just like the input), so you have available all of the standard sorting options as with any series. For example:
df = pd.DataFrame({ 'fruit':['apples']*5 + ['peaches']*5 + ['bananas']*3 +
['carrots']*3 + ['apricots'] })
df.fruit.value_counts().reset_index().sort([0,'index'],ascending=[False,True])
index 0
0 apples 5
1 peaches 5
2 bananas 3
3 carrots 3
4 apricots 1
I'm actually getting the same results by default so here's a test with ascending=[False,False] to demonstrate that this is actually working as suggested.
df.fruit.value_counts().reset_index().sort([0,'index'],ascending=[False,False])
index 0
1 peaches 5
0 apples 5
3 carrots 3
2 bananas 3
4 apricots 1
I'm actually a bit confused about exactly what desired output here in terms of ascending vs descending, but regardless, there are 4 possible combos here and you can get it however you like by altering the ascending keyword argument.