Delete entire row if consequent data is matching in Excel 2007

My problem: if rows in my columns contain matching data, the duplicate rows should be deleted.
For example:
Before
column A column B
aaa 10
aaa 10
aaa 5
bbb 6
aaa 10
bbb 5
After
column A column B
aaa 10
aaa 5
bbb 6
bbb 5

Select all the data in columns A and B, then on the Data ribbon select Remove Duplicates.
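For anyone doing the same cleanup outside Excel, here is a minimal pandas sketch of the equivalent deduplication (the column names are made up to match the example):

import pandas as pd

df = pd.DataFrame({'A': ['aaa', 'aaa', 'aaa', 'bbb', 'aaa', 'bbb'],
                   'B': [10, 10, 5, 6, 10, 5]})
# Keep the first occurrence of each (A, B) pair, mirroring Remove Duplicates
df = df.drop_duplicates(subset=['A', 'B'], keep='first')
print(df)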


Excel formula to find specific data based on max row result

Good day beautiful people of Scotl... Stackoverflow.
I have run into an issue in Excel that I have no idea how to solve. I have tried many formulas, but I believe the problem is in my head, since I have trouble imagining the logic the solution should follow.
I have attached a screenshot to clarify my problem; a description of it follows.
Column B - data names,
Row 3 (C3:H3) - product names,
Table C4:H15 - some data (descriptions, dates, etc.).
Column I is my own addition and is not mandatory.
Desired result
I want to pull data from the table above into the table below, but when a given "DataX" appears in more than one row, I want Excel to pick the row where the most cells are filled in (I have marked them blue for each DataX).
For example, for:
Data 1 - row 4,
Data 2 - row 7,
Data 3 - (obviously) row 9,
Data 4 - row 11,
Data 5 - row 13.
If two or more rows tie (all equally empty / filled), I don't care which one is returned as the result.
What I have tried
I added a calculation (column I) that shows how many cells in each row are filled, and I tried to find a combination of VLOOKUP/HLOOKUP + MAX, but it didn't work correctly.
I also wrote VBA code for it, which worked... almost well, but then I was informed that macros are a no-go zone for this project.
Logic
I strongly believe the logic should be as follows:
Find the matching DataX,
Find the max value in column I (or compute it inside the formula),
Find the corresponding row / columns for that record.
    A   B        C      D      E      F      G      H      I
2                CAT 1  CAT 2  CAT 3  CAT 4  CAT 5  CAT 6  Count not blank
3                1      2      3      4      5      6
4       Data 1   AAA    BBB    CCC           EEE    FFF    =$H$3-COUNTBLANK(C4:H4)
5       Data 1          BBB    CCC    DDD                  =$H$3-COUNTBLANK(C5:H5)
6       Data 1   AAA    BBB                  EEE    FFF    =$H$3-COUNTBLANK(C6:H6)
7       Data 2   AAA    BBB    CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C7:H7)
8       Data 2   AAA    BBB    CCC    DDD           FFF    =$H$3-COUNTBLANK(C8:H8)
9       Data 3   AAA    BBB    CCC           EEE    FFF    =$H$3-COUNTBLANK(C9:H9)
10      Data 4                 CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C10:H10)
11      Data 4   AAA    BBB    CCC    DDD           FFF    =$H$3-COUNTBLANK(C11:H11)
12      Data 4   AAA    BBB    CCC           EEE    FFF    =$H$3-COUNTBLANK(C12:H12)
13      Data 5   AAA    BBB    CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C13:H13)
14      Data 5          BBB    CCC    DDD    EEE    FFF    =$H$3-COUNTBLANK(C14:H14)
15      Data 5   AAA    BBB           DDD    EEE    FFF    =$H$3-COUNTBLANK(C15:H15)
Hello dear son of Scotl.. overflow!
Please add this helper formula to column J (range J4:J15):
=CONCATENATE(B4,I4)
and then paste this to C19:
=INDEX(C$4:C$15,MATCH(CONCATENATE($B19,MAX(IF($B$4:$B$15=$B19,$I$4:$I$15,0))), $J$4:$J$15,0))
Enter it as an array formula, i.e. confirm with Ctrl+Shift+Enter. Then fill it across the rest of the desired range.
The numbers in my example table don't mean anything; it's the count in column I that matters.
Regards!!
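For comparison outside Excel, the same pick-the-fullest-row logic can be sketched in pandas (the column names and data here are hypothetical, just mirroring the layout above):

import pandas as pd

df = pd.DataFrame({
    'Data':  ['Data 1', 'Data 1', 'Data 2'],
    'CAT 1': ['AAA', None, 'AAA'],
    'CAT 2': ['BBB', 'BBB', 'BBB'],
    'CAT 3': ['CCC', 'CCC', None],
})
# Helper like column I: how many category cells are filled in each row
df['filled'] = df.drop(columns='Data').notna().sum(axis=1)
# For each data name, keep the row with the most filled cells (first row wins ties)
best = df.loc[df.groupby('Data')['filled'].idxmax()]
print(best)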

Pandas - Group by Company and drop rows based on dates being out of order

I have a historical data log and want to calculate the number of days between progress stages by Company (the timestamp of an earlier stage must be smaller than that of the later stage).
Company Progress Time
AAA 3. Contract 07/10/2020
AAA 2. Discuss 03/09/2020
AAA 1. Start 02/02/2020
BBB 3. Contract 11/13/2019
BBB 3. Contract 07/01/2019
BBB 1. Start 06/22/2019
BBB 2. Discuss 04/15/2019
CCC 3. Contract 05/19/2020
CCC 2. Discuss 04/08/2020
CCC 2. Discuss 03/12/2020
CCC 1. Start 01/01/2020
Expected outputs:
Progress (1. Start --> 2. Discuss)
Company Progress Time
AAA 1. Start 02/02/2020
AAA 2. Discuss 03/09/2020
CCC 1. Start 01/01/2020
CCC 2. Discuss 03/12/2020
Progress (2. Discuss --> 3. Contract)
Company Progress Time
AAA 2. Discuss 03/09/2020
AAA 3. Contract 07/10/2020
CCC 2. Discuss 03/12/2020
CCC 3. Contract 05/19/2020
I did try some stupid ways to do this, but they still need manual filtering in Excel. Below is my code:
df_stage1_stage2 = df[(df['Progress']=='1. Start')|(df['Progress']=='2. Discuss')]
pd.pivot_table(df_stage1_stage2, index=['Company','Progress'], aggfunc={'Time':min})
Can anyone help with this problem? Thanks!
Create some masks to filter out the relevant rows. m1 and m2 keep only the groups where 1. Start is the last row (since your dates are sorted by Company ascending and date descending, 1. Start should come last within each company). You can create more masks if you also need to check that 2. Discuss and 3. Contract are in order; the current logic only checks that 1. Start is in order. With the data you provided, that returns the correct output:
import numpy as np

# Last Progress value within each Company (dates sorted descending, so this should be '1. Start')
m1 = df.groupby('Company')['Progress'].transform('last')
# Label each row's group as 'keep' when its stages are in order, 'drop' otherwise
m2 = np.where(m1 == '1. Start', 'keep', 'drop')
df = df[m2 == 'keep']
df
intermediate output:
Company Progress Time
0 AAA 3. Contract 07/10/2020
1 AAA 2. Discuss 03/09/2020
2 AAA 1. Start 02/02/2020
7 CCC 3. Contract 05/19/2020
8 CCC 2. Discuss 04/08/2020
9 CCC 2. Discuss 03/12/2020
10 CCC 1. Start 01/01/2020
From there, filter as you have indicated by sorting and dropping duplicates based on a subset of the first two columns, keeping the 'first' duplicate:
final df1 and df2 output:
df1
df1 = df[df['Progress'] != '3. Contract'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')
df1 output:
Company Progress Time
2 AAA 1. Start 02/02/2020
1 AAA 2. Discuss 03/09/2020
10 CCC 1. Start 01/01/2020
9 CCC 2. Discuss 03/12/2020
df2
df2 = df[df['Progress'] != '1. Start'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')
df2 output:
Company Progress Time
1 AAA 2. Discuss 03/09/2020
0 AAA 3. Contract 07/10/2020
9 CCC 2. Discuss 03/12/2020
7 CCC 3. Contract 05/19/2020
Something like this could work, assuming an already sorted df:
(full example)
import numpy as np
import pandas as pd

data = {
    'Company': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB', 'CCC', 'CCC', 'CCC', 'CCC'],
    'Progress': ['3. Contract', '2. Discuss', '1. Start', '3. Contract', '3. Contract',
                 '2. Discuss', '1. Start', '3. Contract', '2. Discuss', '2. Discuss', '1. Start'],
    'Time': ['07-10-2020', '03-09-2020', '02-02-2020', '11-13-2019', '07-01-2019',
             '06-22-2019', '04-15-2019', '05-19-2020', '04-08-2020', '03-12-2020', '01-01-2020'],
}
df = pd.DataFrame(data)
df['Time'] = pd.to_datetime(df['Time'])
# We want to measure from the first occurrence (last date) if duplicated:
df.drop_duplicates(subset=['Company', 'Progress'], keep='first', inplace=True)
# Except for the rows of 'start', calculate the difference in days
df['days_delta'] = np.where((df['Progress'] != '1. Start'), df.Time.diff(-1), 0)
Output:
Company Progress Time days_delta
0 AAA 3. Contract 2020-07-10 123 days
1 AAA 2. Discuss 2020-03-09 36 days
2 AAA 1. Start 2020-02-02 0 days
3 BBB 3. Contract 2019-11-13 144 days
5 BBB 2. Discuss 2019-06-22 68 days
6 BBB 1. Start 2019-04-15 0 days
7 CCC 3. Contract 2020-05-19 41 days
8 CCC 2. Discuss 2020-04-08 98 days
10 CCC 1. Start 2020-01-01 0 days
If you do not want the word 'days' in the output, use:
df['days_delta'] = df['days_delta'].dt.days
First Problem
# Coerce Time to datetime
df['Time'] = pd.to_datetime(df['Time'])
# groupby().nth([-2, -1]) slices the last two rows per Company (the consecutive Start/Discuss pair)
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([-2, -1]))).sort_values(by=['Company', 'Time'], ascending=[True, True])
# Universal rule for this problem: after groupby().nth(), drop any group whose sliced rows duplicate the same Progress stage
df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]
# Calculate the Time diff() within each group, in days
df2['diff'] = df2.sort_values(by='Progress').groupby('Company')['Time'].diff().dt.days.fillna(0)
# Filter out the groups where the Start and Discuss times are in conflict
df2[~df2.Company.isin(df2.loc[df2['diff'] < 0, 'Company'].unique())]
Company Progress Time diff
1 AAA 1.Start 2020-02-02 0.0
0 AAA 2.Discuss 2020-03-09 36.0
5 CCC 1.Start 2020-01-01 0.0
4 CCC 2.Discuss 2020-03-12 71.0
Second Problem
# groupby().nth([0, 1]) slices the first two rows per Company (the consecutive Discuss/Contract pair)
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([0, 1]))).sort_values(by=['Company', 'Time'], ascending=[True, True])
# Drop any group whose sliced rows duplicate the same Progress stage
df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]
Company Progress Time
1 AAA 2.Discuss 2020-03-09
0 AAA 3.Contract 2020-07-10
5 CCC 2.Discuss 2020-04-08
4 CCC 3.Contract 2020-05-19

pandas: selecting rows whose sum equals a value in another column

Hi guys, I have a DataFrame where I want to first group rows by a column, then find rows that sum up to a given value in another column.
A    B   c
XCD  1   5
FFF  12  2
VB   3   6
XCD  8   5
AAA  2   7
AAA  5   7
XCD  4   5
VB   6   6
VB   3   6
FFF  2   2
For each unique entry in column A, say XCD, the value in column c is always the same; it represents the total sum needed for that entry. To illustrate what I need, see the final DataFrame below.
A    B  c
XCD  1  5
XCD  4  5
FFF  2  2
VB   6  6
AAA  2  7
AAA  5  7
The algorithm should select the rows whose column B values sum to the value in column c. It may select a single row as long as that row alone sums to the number in column c, but we only take the first combination that reaches the sum and leave out the rest, giving a new DataFrame.
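A minimal brute-force sketch of one way to do this: per group, try subsets from smallest to largest until one sums to the target in column c (fine for small groups, exponential in the worst case; column names A, B, c as in the question):

from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    'A': ['XCD', 'FFF', 'VB', 'XCD', 'AAA', 'AAA', 'XCD', 'VB', 'VB', 'FFF'],
    'B': [1, 12, 3, 8, 2, 5, 4, 6, 3, 2],
    'c': [5, 2, 6, 5, 7, 7, 5, 6, 6, 2],
})

def first_subset(group):
    # Target sum for this group (column c is constant within a group)
    target = group['c'].iloc[0]
    # Try subsets from smallest to largest; return the first whose B values hit the target
    for r in range(1, len(group) + 1):
        for idx in combinations(group.index, r):
            if group.loc[list(idx), 'B'].sum() == target:
                return group.loc[list(idx)]
    return group.iloc[0:0]  # no subset found: contribute nothing for this group

result = pd.concat(first_subset(g) for _, g in df.groupby('A', sort=False))
print(result)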

Split a column's values by a special character and group by pandas

I have a df like this,
Owner Messages
AAA (YY) Duplicates
AAA Missing Number; (VV) Corrected Value; (YY) Duplicates
AAA (YY) Duplicates
BBB (YY) Duplicates
BBB Missing Measure; Missing Number
When I do a normal groupby like this,
df_grouped = df.groupby(['Owner', 'Messages']).size().reset_index(name='count')
df_grouped
I get this as expected,
Owner Messages count
0 AAA (YY) Duplicates 2
1 AAA Missing Number; (VV) Corrected Value; (YY) Duplicates 1
2 BBB (YY) Duplicates 1
3 BBB Missing Measure; Missing Number 1
However, I need something like the desired output below, splitting on ; inside the Messages column.
Owner Messages count
0 AAA (YY) Duplicates 3
1 AAA Missing Number 1
2 AAA (VV) Corrected Value 1
3 BBB (YY) Duplicates 1
4 BBB Missing Measure 1
5 BBB Missing Number 1
So far, based on this post (@LeoRochael's answer), I can split the Messages column's values on ; into lists. However, I cannot get the individual counts after splitting.
Any ideas how to get my desired output?
You need to unnest your original dataframe, then just take the group size:
# Split on '; ' into columns, then stack into long form: one row per (Owner, message) pair
s = df.set_index('Owner').Messages.str.split('; ', expand=True).stack().to_frame('Messages').reset_index()
s.groupby(['Owner', 'Messages']).size()
Out[1213]:
Owner Messages
AAA (VV) Corrected Value 1
(YY) Duplicates 3
Missing Number 1
BBB (YY) Duplicates 1
Missing Measure 1
Missing Number 1
dtype: int64
from collections import Counter
import pandas as pd
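# Count (Owner, message) pairs after splitting each Messages string on '; '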
pd.Series(
Counter([(o, m) for o, M in df.values for m in M.split('; ')])
).rename_axis(['Owner', 'Message']).reset_index(name='Count')
Owner Message Count
0 AAA (VV) Corrected Value 1
1 AAA (YY) Duplicates 3
2 AAA Missing Number 1
3 BBB (YY) Duplicates 1
4 BBB Missing Measure 1
5 BBB Missing Number 1
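On newer pandas (0.25+), the same unnesting can be written with explode; a minimal self-contained sketch:

import pandas as pd

df = pd.DataFrame({
    'Owner': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'Messages': ['(YY) Duplicates',
                 'Missing Number; (VV) Corrected Value; (YY) Duplicates',
                 '(YY) Duplicates',
                 '(YY) Duplicates',
                 'Missing Measure; Missing Number'],
})
# Split each Messages string into a list, give every item its own row, then count
out = (df.assign(Messages=df['Messages'].str.split('; '))
         .explode('Messages')
         .groupby(['Owner', 'Messages'])
         .size()
         .reset_index(name='count'))
print(out)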

Column indicating count in Excel

Let's say I have an Excel sheet with the following columns
Year ProjectName
2001 AAA
2001 MMM
2001 XXX
2002 CCC
2003 KKK
2003 NNN
I want to generate a new column indicating the total number of projects in that year:
Year ProjectName NumberOfProjects
2001 AAA 3
2001 MMM 3
2001 XXX 3
2002 CCC 1
2003 KKK 2
2003 NNN 2
How do I do this?
Assuming your data is in columns A and B, add the following in column C:
=COUNTIF(A$2:A$6,A2)
(example for row 2; for other rows, change A2 as needed)
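For reference, the equivalent count column in pandas, as a minimal sketch:

import pandas as pd

df = pd.DataFrame({'Year': [2001, 2001, 2001, 2002, 2003, 2003],
                   'ProjectName': ['AAA', 'MMM', 'XXX', 'CCC', 'KKK', 'NNN']})
# Like COUNTIF: count projects per year and broadcast the count back to every row
df['NumberOfProjects'] = df.groupby('Year')['ProjectName'].transform('count')
print(df)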
