Create a column by groupby and filter in Python - python-3.x

I have a data frame with vendor, bill amount, and payment type.
I want to add a column in which I will get sum of late payment by Vendor.
Is it possible to write one line of code to get this output?
df['Paid Late by Vendor']=

You can use a combination of groupby.transform and bfill(), and assign back to a new column using assign:
df = df.assign(late_payments=df[df['Payment'].eq('Delay')].groupby('Vendor')['Amount'].transform('sum')).bfill()
Prints:
Vendor Payment Amount late_payments
0 A Ontime 91 78.0
1 A Ontime 90 78.0
2 A Delay 78 78.0
3 B Ontime 58 166.0
4 B Delay 77 166.0
5 B Ontime 96 166.0
6 B Delay 89 166.0
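Note that the bfill() step only fills the on-time rows correctly because, in this sample, every NaN row has a Delay row of the same vendor somewhere below it; with a different row order, values could bleed across vendors. A minimal sketch of an alternative, assuming the column names from the sample above and writing to the column name from the question, is to mask the on-time amounts with where() before the grouped transform:
# Keep only delayed amounts (NaN elsewhere), then broadcast the per-vendor sum to every row
df['Paid Late by Vendor'] = (df['Amount']
                             .where(df['Payment'].eq('Delay'))
                             .groupby(df['Vendor'])
                             .transform('sum'))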

Let's define the dataframe:
data = pd.DataFrame({'Vendor': ['A', 'A', 'B', 'B'],
                     'Payment': ['Ontime', 'Delay', 'Ontime', 'Delay'],
                     'Paid Late by Vendor': [20, 21, 19, 18]})
To get the results you want, you need to create a separate dataframe with the grouped values and then combine it with the original.
Since you only want a value for late payments, you first filter the data down to those records and group on the result.
reset_index() is used to turn the index back into a column (in this case the column we grouped on, Vendor):
groupedLateData = data[data['Payment']=='Delay'].groupby('Vendor')["Paid Late by Vendor"].sum().reset_index()
Then we merge the resulting dataframe with the original on the Vendor column
pd.merge(data, groupedLateData, on='Vendor')
and this would be the result (the shared 'Paid Late by Vendor' column gets _x/_y suffixes from the merge):
  Vendor Payment  Paid Late by Vendor_x  Paid Late by Vendor_y
0      A  Ontime                     20                     21
1      A   Delay                     21                     21
2      B  Ontime                     19                     18
3      B   Delay                     18                     18

Related

extract values based on two columns

I would like to extract a price value based on two other columns. In Table 1, I am given the raw data I want to draw from. In Table 2, I am given only the contract number, and I would like to find the row whose type is "Mater" and list out its price.
I've tried to use this formula but I don't think I am calling the columns correctly.
=IF(AND(Table2!A1=Table1!$A$1:$A$6,Table1$C$1:$C$6="Mater"),Table1!$D$2:$D$6,"")
Is there a formula using index match, if(and), or another one that could work in this case?
Thank you!
Table 1.
Contract   Work   Type    Cost
5321a      aaa    Labor   52
5321a      ab     Mater   57
5641a      aba    Mater   10
536451a    aae    Labor   75
2441a      aan    Labor   42
53421      aar    Mater   14
Table 2
Contract   Mater Cost
5321a      57
5641a      57
53421      14
The following should work:
=SUMIFS(Sheet1!D2:D7,Sheet1!A2:A7,A2,Sheet1!C2:C7,"Mater")
(I'm assuming the first table is on Sheet1.)

How to get multiple aggregation in a dataframe? cumsum and count columns

I need one column that aggregates using the count() function and another that uses the cumsum() function in a dataframe.
I would like to group only once, and the cumsum should be grouped by Site just like the count. How can I do this?
#I get the count by grouping site and arrived
df_arrived_gby = df.groupby(['Site','Arrived']).size().reset_index(name='Count_X')
#I do the cumsum but it should be groupby Site and Arrived same as above
#How can I do this?
df_arrived_gby['Cumsum_X'] = df_arrived_gby['Count_X'].cumsum()
print(df_arrived_gby)
Data example (the cumsum is not grouped by Site, so it keeps adding across sites):
Site Arrived Count Cumsum
198 T 30/06/2020 146 22368
199 T 31/05/2020 76 22444
200 V 05/01/2020 77 22521
201 V 05/02/2020 57 22578
First you need to get the values from the Count_X column, then you can cumsum():
df_arrived_gby['Cumsum_X'] = df_arrived_gby.Count_X.values.cumsum()
Let me know if that helps
I was able to do it using groupby on a new dataframe column, as shown below:
df_arrived_gby['Cumsum'] = df_arrived_gby.groupby(['Site'])['Count_X'].apply(lambda x: x.cumsum())
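For what it's worth, the apply(lambda ...) wrapper isn't strictly needed; a grouped cumulative sum can be taken directly (assuming the count column is named Count_X as above):
# Equivalent grouped cumulative sum without apply()
df_arrived_gby['Cumsum'] = df_arrived_gby.groupby('Site')['Count_X'].cumsum()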

Panda remove duplicates but keep relationship [duplicate]

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.
So this:
A B
1 10
1 20
2 30
2 40
3 10
Should turn into this:
A B
1 20
2 40
3 10
I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
This takes the last. Not the maximum though:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
You can do also something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
A B
1 1 20
3 2 40
4 3 10
Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
without any groupby
Try this:
df.groupby(['A']).max()
I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):
Setup:
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(Adding sort_index() to ensure equal solution):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think in your case you don't really need a groupby. I would sort column B in descending order, then drop duplicates on column A; if you want, you can also get a nice clean index like this:
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
Easiest way to do this:
# First sort this DF by column A ascending and column B descending
# Then drop the duplicate values in column A
# Optional - reset the index to get a nice data frame again
# I'm going to show it all in one step.
d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df
A B
0 1 30
1 1 40
2 2 50
3 3 42
4 1 38
5 2 30
6 3 25
7 1 32
df = df.sort_values(['A', 'B'], ascending=[True, False]).drop_duplicates(['A']).reset_index(drop=True)
df
A B
0 1 40
1 2 50
2 3 42
You can try this as well
df.drop_duplicates(subset='A', keep='last')
I referred this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.
df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()
The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
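If the integer case matters, one alternative sketch is to pick the first mode explicitly instead of relying on .any():
# Take the first value returned by mode(); works for strings and ints alike
# (assumes each group has at least one non-null value in columnB)
df.groupby('columnA').agg({'columnB': lambda x: x.mode().iloc[0]}).reset_index()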
For the original question, the corresponding approach simplifies to
df.groupby('columnA').columnB.agg('max').reset_index().
The already-posted answers address the question; I only made a small change by naming the column on which the max() function is applied, for better code readability.
df.groupby('A', as_index=False)['B'].max()
This is a very similar method to the selected answer, but sorting the data frame by multiple columns might be an easier way to code it.
First, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from the highest value to the lowest:
df.sort_values(["A", "B"], ascending=False, inplace=True)
Then drop the duplicates on column A and keep only the first item, which is already the one with the highest value:
df.drop_duplicates(subset='A', inplace=True)
This also works:
a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})
I am not going to give you the whole answer (I don't think you're looking for the parsing and writing-to-file part anyway), but a pivotal hint should suffice: use Python's set() function, and then sorted() or .sort() coupled with .reverse():
>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
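If you want it in one step, sorted() also accepts reverse=True, so the same result can be had without the separate reverse() call:
# One-step equivalent: deduplicate with set(), then sort descending
a = sorted(set([10, 60, 30, 10, 50, 20, 60, 50, 60, 10, 30]), reverse=True)
print(a)   # [60, 50, 30, 20, 10]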

Slicing specific rows of a column in pandas Dataframe

In the following data frame in pandas, I want to extract the rows corresponding to dates between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start and an end date. How could I do so?
A B
0 01/01 56
1 02/01 54
2 03/01 66
3 04/01 77
4 05/01 66
5 06/01 72
6 07/01 132
7 08/01 127
First create a list of the dates you need using date_range. I'm adding the year 2000 since you need to supply a year for this to work; I'm then cutting it off to get the desired strings. In real life you might want to pay attention to the actual year because of things like leap days.
date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d') for x in pd.date_range('2000/{}'.format(date_start),
                                                    '2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use isin and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is a datetime column, I guess you can skip the strftime part in the list comprehension to get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the brackets is a boolean array of True and False values. Only rows that fulfill the given condition (evaluate to True) will be returned. This is a great tool to have, and it works well with pandas and numpy.
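As a small sketch on the sample frame above (assuming the dates stay zero-padded 'mm/dd' strings, which compare correctly as plain text within a single year):
start_date, end_date = '03/01', '06/01'
mask = (df.A >= start_date) & (df.A <= end_date)   # element-wise boolean Series
print(df[mask])                                    # rows with dates 03/01 through 06/01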

Finding the right function in Excel to calculate

I have data like this:
    A:          B:    C:
1:  4-jan-16    117   85
2:  11-jan-16   58    11
3:  18-jan-16   101   98
...
and so on up to 2-jan-17.
I need to calculate the differences b1-c1, b2-c1, and so on for each month (c1=85=Jan, c2=111=Feb, c3=98=March).
Then I need to take the difference for every week and square it (^2).
I have no clue how to go about it; I have looked up many functions but am not sure which would do the trick. Please feel free to ask for additional details...
UPDATE
The outcome should be (adding a column D):
A: B: C: D:
1: 4-jan-16 117 85 =b1-c1
2: 11-jan-16 58 11 =b2-c1
3: 18-jan-16 101 98 =b3-c1
4: 25-jan-16 110 10 =b4-c1
5: 1-Feb-16 52 =b5-c2
For column B, I used the following (I have 2 sheets, one called BE, the other Sheet1):
=AVERAGEIFS('BE'!$H$157:$H$208,'BE'!$B$157:$B$208,">="&A7,'BE'!$B$157:$B$208,"<="&EOMONTH(Sheet1!A7,0))
I was wondering if I could use something like that to calculate the difference for every week in column D.
In D1 use the following and copy down.
=B1-INDEX($C$1:$C$4,MONTH(A1))
C1:C4 is the range for your month results. This method will only work for the one year and needs your data to start in January, as that is month 1. The formula above also assumes your data starts in row 1. If your data does not start in row 1, then you would want to modify the formula as follows:
=B5-INDEX($C$5:$C$16,MONTH(A5)+row(A5)-(row($A$5)-1))
This example assumes your data started in row 5 and you had 12 months of monthly data in column C.
