I have a dataframe like this:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I've grouped it with grouped = df.groupby('envelopeid')
I need to remove all other groups from the dataframe and keep only the groups whose messages are (CMN.00002) alone, or (CMN.00002 and CMN.00004) only.
Desired dataframe:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I tried
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and x.ne('CMN.00002' or 'CMN.00004').any()) or x.ne('CMN.00002').all()))
but it is not working properly.
Try:
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message== 'CMN.00002']
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message)== ('CMN.00002',) or tuple(x.message)== ('CMN.00002','CMN.00004'))
So I figured it out.
The resulting dataframe contains only the groups that have only the CMN.00002 message, or CMN.00002 and CMN.00004. This is what I need.
I used filter instead of transform.
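For reference, the same filter idea sketched on a trimmed copy of the sample data (envelope IDs shortened for readability); this variant compares message sets instead of tuples, so the check does not depend on the order of rows within a group:

import pandas as pd

# Trimmed reproduction of the sample data (envelope IDs shortened)
df = pd.DataFrame({
    'envelopeid': ['d55edb65', 'd55edb65', 'd55edb65',
                   '8f673ae3', '88926399', '88926399'],
    'message':    ['CMN.00002', 'CMN.00004', 'CMN.11001',
                   'CMN.00002', 'CMN.00002', 'CMN.00004'],
})

# Keep a group only if its messages are exactly {CMN.00002}
# or {CMN.00002, CMN.00004}; sets ignore row order inside the group
allowed = [{'CMN.00002'}, {'CMN.00002', 'CMN.00004'}]
outdf = df.groupby('envelopeid').filter(lambda g: set(g['message']) in allowed)
print(outdf)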
I'm working on an airline dataset. I have to calculate the number of adults, children, and infants per airline_pnr number and then append those values as columns to the data frame.
Pax_Type: passenger type (Adult (ADT), Child (CHD), Infant (INF))
+-------------+----------+
| airline_pnr |Pax_Type |
+-------------+----------+
| EIPBGB | ADT |
| EIPBGB | ADT |
| EIPBGB | CHD |
| EIPBGB | INF |
| UH7EQV | ADT |
| UH7EQV | ADT |
| YVEEW | ADT |
| YVEEW | ADT |
| DR6YWR | ADT |
| DR6YWR | ADT |
| DR6YWR | ADT |
| DR6YWR | CHD |
| DR6YWR | INF |
| QJ2ESP | ADT |
| QJ2ESP | CHD |
| JL6E9T | ADT |
| VGYD5V | ADT |
| YVEG1 | ADT |
| YVEG1 | ADT |
+-------------+----------+
Expected output:
+--------+----------+--------------+-----------------+---------------+
|air_pnr | Pax Type | no_of_adults | no_of_childrens | no_of_infants |
+--------+----------+--------------+-----------------+---------------+
| EIPBGB | ADT | 2 | 1 | 1 |
| UH7EQV | ADT | 2 | 0 | 0 |
| YVEEW | ADT | 2 | 0 | 0 |
| DR6YWR | ADT | 3 | 1 | 1 |
| QJ2ESP | ADT | 1 | 1 | 0 |
| JL6E9T | ADT | 1 | 0 | 0 |
| VGYD5V | ADT | 1 | 0 | 0 |
| YVEG1 | ADT | 2 | 0 | 0 |
+--------+----------+--------------+-----------------+---------------+
My efforts:
df = df.value_counts(['airline_pnr', 'Pax Type'])
df = df.to_frame()
df = df.rename(columns={0: "freq"})
But I'm not getting the desired results.
You can use groupby on the 'airline_pnr' and 'Pax_Type' columns and then use size(),
which counts the number of occurrences in each group.
df.groupby(['airline_pnr','Pax_Type']).size()
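If the goal is the wide layout from the expected output (one row per PNR with separate count columns), a sketch using pd.crosstab can get there; the column names below are assumptions taken from the sample data:

import pandas as pd

# Trimmed reproduction of the sample data
df = pd.DataFrame({
    'airline_pnr': ['EIPBGB', 'EIPBGB', 'EIPBGB', 'EIPBGB', 'UH7EQV', 'UH7EQV'],
    'Pax_Type':    ['ADT', 'ADT', 'CHD', 'INF', 'ADT', 'ADT'],
})

# One row per PNR, one count column per passenger type
counts = (pd.crosstab(df['airline_pnr'], df['Pax_Type'])
            .reindex(columns=['ADT', 'CHD', 'INF'], fill_value=0)
            .rename(columns={'ADT': 'no_of_adults',
                             'CHD': 'no_of_childrens',
                             'INF': 'no_of_infants'})
            .reset_index())
print(counts)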
I want to remove the duplicated names from the cells so that each Customer ID appears only on the first row of its group (visually merging the cells). This dataframe is generated after concatenating multiple dataframes.
My dataframe as under:
| | Customer ID | Category | VALUE |
| -:|:----------- |:------------- | -------:|
| 0 | GETO90 | Baby Sets | 1090.0 |
| 1 | GETO90 | Girls Dresses | 5357.0 |
| 2 | GETO90 | Girls Jumpers | 2823.0 |
| 3 | SETO90 | Girls Top | 3398.0 |
| 4 | SETO90 | Shorts | 7590.0 |
| 5 | SETO90 | Shorts | 7590.0 |
| 6 | RETO90 | Pants | 6590.0 |
| 7 | RETO90 | Pants | 6590.0 |
| 8 | RETO90 | Jeans | 8590.0 |
| 9 | YETO90 | Jeans | 9590.0 |
| 10| YETO90 | Jeans | 2590.0 |
I want to merge the first column; the expected dataframe is shown below:
| | Customer ID | Category | VALUE |
| -:|:----------- |:------------- | -------:|
| 0 | GETO90 | Baby Sets | 1090.0 |
| 1 | | Girls Dresses | 5357.0 |
| 2 | | Girls Jumpers | 2823.0 |
| 3 | SETO90 | Girls Top | 3398.0 |
| 4 | | Shorts | 7590.0 |
| 5 | | Shorts | 7590.0 |
| 6 | RETO90 | Pants | 6590.0 |
| 7 | | Pants | 6590.0 |
| 8 | | Jeans | 8590.0 |
| 9 | YETO90 | Jeans | 9590.0 |
| 10| | Jeans | 2590.0 |
Use duplicated with loc:
df.loc[df.duplicated('Customer ID'), 'Customer ID'] = ''
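For reference, a runnable sketch of that one-liner on the first few sample rows:

import pandas as pd

# First few rows of the sample data
df = pd.DataFrame({
    'Customer ID': ['GETO90', 'GETO90', 'GETO90', 'SETO90', 'SETO90', 'SETO90'],
    'Category':    ['Baby Sets', 'Girls Dresses', 'Girls Jumpers',
                    'Girls Top', 'Shorts', 'Shorts'],
    'VALUE':       [1090.0, 5357.0, 2823.0, 3398.0, 7590.0, 7590.0],
})

# duplicated() flags every repeat of a Customer ID after its first occurrence;
# .loc then blanks only those rows, leaving the first row of each group intact
df.loc[df.duplicated('Customer ID'), 'Customer ID'] = ''
print(df)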
I'm recreating a tool I made in Excel as it's getting bigger and performance is getting out of hand.
The issue is that I only have MS Access 2013 on my work laptop, and I'm fairly new to the Expression Builder in Access 2013, which has a very limited function base, to be honest.
My data has duplicates in the [Location] column, meaning that I have multiple SKUs in the same warehouse location. However, some of my calculations need to be done only once per [Location]. My solution in Excel was to make a formula (see below) that puts 1 on the first appearance of a location and 0 on subsequent appearances. That works like a charm, because summing over that [Duplicate] column while imposing multiple criteria returns the number of occurrences of those criteria, counting each location only once.
Now, the MS Access 2013 Expression Builder has neither SUM nor COUNT functions for creating a calculated column emulating my [Duplicate] column from Excel. Preferably, I would just input the raw data and let Access populate the calculated fields, rather than also inputting the calculated fields, since that would defeat my original purpose of reducing the computational cost of creating my dashboard.
The question is: how would you create a calculated column in the MS Access 2013 Expression Builder to recreate the Excel formula below?
= IF($D$2:$D3=$D4,0,1)
For the sake of reducing the file size (over 100K rows), I even replace the 0 with a blank string "".
Thanks in advance for your help,
Y
First and foremost, understand that MS Access' Expression Builder is a convenience tool for building an SQL expression. Everything in Query Design is ultimately there to build an SQL query. For this reason, you have to adopt a set-based mentality, seeing data as whole sets of related tables rather than cell by cell.
Specifically, to achieve:
putting 1 only on the first appearance of that location, putting 0 on next appearances
Consider a whole set-based approach: join to a separate aggregate query that identifies the first record of your needed grouping, then calculate the needed IIF expression. The below assumes you have an autonumber or primary key field in the table (a standard in relational databases):
Aggregate Query (save as a separate query, adjust columns as needed)
SELECT ColumnD, MIN(AutoNumberID) As MinID
FROM myTable
GROUP BY ColumnD
Final Query (join to original table and build final IIF expression)
SELECT m.*, IIF(agg.MinID = AutoNumberID, 1, 0) As Dup_Indicator
FROM myTable m
INNER JOIN myAggregateQuery agg
ON m.[ColumnD] = agg.ColumnD
To demonstrate with random data:
Original
| ID | GROUP | INT | NUM | CHAR | BOOL | DATE |
|----|--------|-----|--------------|------|-------|------------|
| 1 | r | 9 | 1.424490258 | B6z | TRUE | 7/4/1994 |
| 2 | stata | 10 | 2.591235683 | h7J | FALSE | 10/5/1971 |
| 3 | spss | 6 | 0.560461966 | Hrn | TRUE | 11/27/1990 |
| 4 | stata | 10 | -1.499272175 | eXL | FALSE | 4/17/2010 |
| 5 | stata | 15 | 1.470269177 | Vas | TRUE | 6/13/2010 |
| 6 | r | 14 | -0.072238898 | puP | TRUE | 4/1/1994 |
| 7 | julia | 2 | -1.370405263 | S2l | FALSE | 12/11/1999 |
| 8 | spss | 6 | -0.153684675 | mAw | FALSE | 7/28/1977 |
| 9 | spss | 10 | -0.861482674 | cxC | FALSE | 7/17/1994 |
| 10 | spss | 2 | -0.817222582 | GRn | FALSE | 10/19/2012 |
| 11 | stata | 2 | 0.949287754 | xgc | TRUE | 1/18/2003 |
| 12 | stata | 5 | -1.580841322 | Y1D | TRUE | 6/3/2011 |
| 13 | r | 14 | -1.671303816 | JCP | FALSE | 5/15/1981 |
| 14 | r | 7 | 0.904181025 | Rct | TRUE | 7/24/1977 |
| 15 | stata | 10 | -1.198211174 | qJY | FALSE | 5/6/1982 |
| 16 | julia | 10 | -0.265808162 | 10s | FALSE | 3/18/1975 |
| 17 | r | 13 | -0.264955027 | 8Md | TRUE | 6/11/1974 |
| 18 | r | 4 | 0.518302149 | 4KW | FALSE | 9/12/1980 |
| 19 | r | 5 | -0.053620183 | 8An | FALSE | 4/17/2004 |
| 20 | r | 14 | -0.359197116 | F8Q | TRUE | 6/14/2005 |
| 21 | spss | 11 | -2.211875193 | AgS | TRUE | 4/11/1973 |
| 22 | stata | 4 | -1.718749471 | Zqr | FALSE | 2/20/1999 |
| 23 | python | 10 | 1.207878576 | tcC | FALSE | 4/18/2008 |
| 24 | stata | 11 | 0.548902226 | PFJ | TRUE | 9/20/1994 |
| 25 | stata | 6 | 1.479125922 | 7a7 | FALSE | 3/2/1989 |
| 26 | python | 10 | -0.437245299 | r32 | TRUE | 6/7/1997 |
| 27 | sas | 14 | 0.404746106 | 6NJ | TRUE | 9/23/2013 |
| 28 | stata | 8 | 2.206741458 | Ive | TRUE | 5/26/2008 |
| 29 | spss | 12 | -0.470694096 | dPS | TRUE | 5/4/1983 |
| 30 | sas | 15 | -0.57169507 | yle | TRUE | 6/20/1979 |
SQL (uses aggregate in subquery but can be a stored query)
SELECT r.*, IIF(sub.MinID = r.ID, 1, 0) AS Dup
FROM Random_Data r
LEFT JOIN
(
SELECT r.[GROUP], MIN(r.ID) AS MinID
FROM Random_Data r
GROUP BY r.[GROUP]
) sub
ON r.[GROUP] = sub.[GROUP]
Output (notice the first row of each GROUP value is tagged 1, all others 0)
| ID | GROUP | INT | NUM | CHAR | BOOL | DATE | Dup |
|----|--------|-----|--------------|------|-------|------------|-----|
| 1 | r | 9 | 1.424490258 | B6z | TRUE | 7/4/1994 | 1 |
| 2 | stata | 10 | 2.591235683 | h7J | FALSE | 10/5/1971 | 1 |
| 3 | spss | 6 | 0.560461966 | Hrn | TRUE | 11/27/1990 | 1 |
| 4 | stata | 10 | -1.499272175 | eXL | FALSE | 4/17/2010 | 0 |
| 5 | stata | 15 | 1.470269177 | Vas | TRUE | 6/13/2010 | 0 |
| 6 | r | 14 | -0.072238898 | puP | TRUE | 4/1/1994 | 0 |
| 7 | julia | 2 | -1.370405263 | S2l | FALSE | 12/11/1999 | 1 |
| 8 | spss | 6 | -0.153684675 | mAw | FALSE | 7/28/1977 | 0 |
| 9 | spss | 10 | -0.861482674 | cxC | FALSE | 7/17/1994 | 0 |
| 10 | spss | 2 | -0.817222582 | GRn | FALSE | 10/19/2012 | 0 |
| 11 | stata | 2 | 0.949287754 | xgc | TRUE | 1/18/2003 | 0 |
| 12 | stata | 5 | -1.580841322 | Y1D | TRUE | 6/3/2011 | 0 |
| 13 | r | 14 | -1.671303816 | JCP | FALSE | 5/15/1981 | 0 |
| 14 | r | 7 | 0.904181025 | Rct | TRUE | 7/24/1977 | 0 |
| 15 | stata | 10 | -1.198211174 | qJY | FALSE | 5/6/1982 | 0 |
| 16 | julia | 10 | -0.265808162 | 10s | FALSE | 3/18/1975 | 0 |
| 17 | r | 13 | -0.264955027 | 8Md | TRUE | 6/11/1974 | 0 |
| 18 | r | 4 | 0.518302149 | 4KW | FALSE | 9/12/1980 | 0 |
| 19 | r | 5 | -0.053620183 | 8An | FALSE | 4/17/2004 | 0 |
| 20 | r | 14 | -0.359197116 | F8Q | TRUE | 6/14/2005 | 0 |
| 21 | spss | 11 | -2.211875193 | AgS | TRUE | 4/11/1973 | 0 |
| 22 | stata | 4 | -1.718749471 | Zqr | FALSE | 2/20/1999 | 0 |
| 23 | python | 10 | 1.207878576 | tcC | FALSE | 4/18/2008 | 1 |
| 24 | stata | 11 | 0.548902226 | PFJ | TRUE | 9/20/1994 | 0 |
| 25 | stata | 6 | 1.479125922 | 7a7 | FALSE | 3/2/1989 | 0 |
| 26 | python | 10 | -0.437245299 | r32 | TRUE | 6/7/1997 | 0 |
| 27 | sas | 14 | 0.404746106 | 6NJ | TRUE | 9/23/2013 | 1 |
| 28 | stata | 8 | 2.206741458 | Ive | TRUE | 5/26/2008 | 0 |
| 29 | spss | 12 | -0.470694096 | dPS | TRUE | 5/4/1983 | 0 |
| 30 | sas | 15 | -0.57169507 | yle | TRUE | 6/20/1979 | 0 |
I am trying to sort the results of sklearn.ensemble.RandomForestRegressor's feature_importances_
I have the following function:
def get_feature_importances(cols, importances):
feats = {}
for feature, importance in zip(cols, importances):
feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance')
return importances
I use it like so:
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
And I get the following results:
| PART | 0.035034 |
| MONTH1 | 0.02507 |
| YEAR1 | 0.020075 |
| MONTH2 | 0.02321 |
| YEAR2 | 0.017861 |
| MONTH3 | 0.042606 |
| YEAR3 | 0.028508 |
| DAYS | 0.047603 |
| MEDIANDIFF | 0.037696 |
| F2 | 0.008783 |
| F1 | 0.015764 |
| F6 | 0.017933 |
| F4 | 0.017511 |
| F5 | 0.017799 |
| SS22 | 0.010521 |
| SS21 | 0.003896 |
| SS19 | 0.003894 |
| SS23 | 0.005249 |
| SS20 | 0.005127 |
| RR | 0.021626 |
| HI_HOURS | 0.067584 |
| OI_HOURS | 0.054369 |
| MI_HOURS | 0.062121 |
| PERFORMANCE_FACTOR | 0.033572 |
| PERFORMANCE_INDEX | 0.073884 |
| NUMPA | 0.022445 |
| BUMPA | 0.024192 |
| ELOH | 0.04386 |
| FFX1 | 0.128367 |
| FFX2 | 0.083839 |
I thought the line importances.sort_values(by='Gini-importance') would sort them, but it does not. Why is this not working correctly?
importances.sort_values(by='Gini-importance') returns a new, sorted DataFrame, which your function discards.
You want return importances.sort_values(by='Gini-importance').
Or you could make sort_values inplace:
importances.sort_values(by='Gini-importance', inplace=True)
return importances
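Putting it together, a sketch of the corrected helper (same names as in the question):

import pandas as pd

def get_feature_importances(cols, importances):
    # One-column DataFrame of importances indexed by feature name
    feats = dict(zip(cols, importances))
    out = (pd.DataFrame.from_dict(feats, orient='index')
             .rename(columns={0: 'Gini-importance'}))
    # sort_values returns a new DataFrame, so return that result
    return out.sort_values(by='Gini-importance')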
I need to analyze weekly order frequencies over the last one-year period to find the min/max/average order frequency for each product.
Whether a product is new or old, the system should treat the first occurrence of an order in the year as that product's starting week. Min order frequency is the difference between successive ordering weeks: if the first order is in week 3 and the second is in week 6, the order frequency is 3 weeks (6 - 3). Orders can fall in any week of the past 52 weeks. Average order frequency = (52 - first order week) / number of weeks that have orders.
Attaching the Excel layout for a better understanding of the issue.
+---------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+----------------+-------------------------+-----+-----------------------------------+--+
| Product | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 | wk11 | wk12 | wk13 | wk14 | wk15 | wk16 | wk17 | wk18 | wk19 | wk20 | wk21 | wk22 | wk23 | wk24 | wk25 | wk26 | wk27 | wk28 | wk29 | wk30 | wk31 | wk32 | wk33 | wk34 | wk35 | wk36 | wk37 | wk38 | wk39 | wk40 | wk41 | wk42 | wk43 | wk44 | wk45 | wk46 | wk47 | wk48 | wk49 | wk50 | wk51 | wk52 | Order start wk | Order frequency (Weeks) | | | |
+---------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+----------------+-------------------------+-----+-----------------------------------+--+
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Min | Max | Average | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (End wk - Start week)/No of times | |
| SKU 1 | | | | | | | | | y | | y | | y | | y | | y | | y | | y | | y | y | | | y | | y | | y | | y | | | | | | y | | y | | y | | y | | y | | y | | y | | 9 | 1 | 6 | 2.15 | |
| SKU 2 | | | | | | | y | | | | | | y | | | | | | y | | | | | | y | | | | | | y | | | | | | y | | | | | | y | | | | | | y | | | | 1 | 0 | 0 | 7.29 | |
| SKU 3 | | | | | | | | | | | | | | | y | | | | | | | | | | | | | | | | y | | | | | | | | y | | | | | | | | y | | | | | | 15 | 8 | 15 | 9.25 | |
+---------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+----------------+-------------------------+-----+-----------------------------------+--+
As mentioned, @Barry Houdini elegantly solves the problem of finding the longest sequence of zeroes separated by ones here.
You only have to change it slightly to check for repeated blank cells separated by 'y'. The only catch is that you don't want to include cells before the first 'y', and (although this isn't clear) you may not want to include blank cells after the last 'y'.
The formula for MIN becomes
=MIN(IF((ROW(A$1:INDEX(A:A,COUNTA(B4:BA4)+1))>1)*(ROW(A$1:INDEX(A:A,COUNTA(B4:BA4)+1))<COUNTA(B4:BA4)+1),FREQUENCY(IF(B4:BA4="",COLUMN(B4:BA4)),IF(B4:BA4="y",COLUMN(B4:BA4)))))+1
and the formula for MAX is the same, but with MAX in place of MIN:
=MAX(IF((ROW(A$1:INDEX(A:A,COUNTA(B4:BA4)+1))>1)*(ROW(A$1:INDEX(A:A,COUNTA(B4:BA4)+1))<COUNTA(B4:BA4)+1),FREQUENCY(IF(B4:BA4="",COLUMN(B4:BA4)),IF(B4:BA4="y",COLUMN(B4:BA4)))))+1
You need to add 1 to make the results agree with the question, because @Barry's formula counts the number of blanks whereas the OP wants the interval between two successive y's. An array of ny+1 elements is generated, where ny is the number of y's: the FREQUENCY function returns an array with n+1 elements, where n is the number of cut points (bins_array in the documentation), and the column numbers of the cells containing y are used as the cut points, so there are ny of them.
These are both array formulas and need to be entered with Ctrl+Shift+Enter.
The formula for the average is just
=(COLUMNS(B4:BA4)-MATCH("y",B4:BA4,0))/COUNTA(B4:BA4)