How to get the max value grouped by another column from a Pandas dataframe - python-3.x

I have the following dataframe. I would like to get, for each pipeline_name, the row where the date is the max.
Here is the dataframe:
+----+-----------------+--------------------------------------+----------------------------------+
| | pipeline_name | runid | run_end_dt |
|----+-----------------+--------------------------------------+----------------------------------|
| 0 | test_pipeline | test_pipeline_run_101 | 2021-03-10 20:01:26.704265+00:00 |
| 1 | test_pipeline | test_pipeline_run_102 | 2021-03-13 20:08:31.929038+00:00 |
| 2 | test_pipeline2 | test_pipeline2_run_101 | 2021-03-10 20:13:53.083525+00:00 |
| 3 | test_pipeline2 | test_pipeline2_run_102 | 2021-03-12 20:14:51.757058+00:00 |
| 4 | test_pipeline2 | test_pipeline2_run_103 | 2021-03-13 20:17:00.285573+00:00 |
+----+-----------------+--------------------------------------+----------------------------------+
Here is the result I want to achieve:
+----+-----------------+--------------------------------------+----------------------------------+
| | pipeline_name | runid | run_end_dt |
|----+-----------------+--------------------------------------+----------------------------------|
| 0 | test_pipeline | test_pipeline_run_102 | 2021-03-13 20:08:31.929038+00:00 |
| 1 | test_pipeline2 | test_pipeline2_run_103 | 2021-03-13 20:17:00.285573+00:00 |
+----+-----------------+--------------------------------------+----------------------------------+
In the expected result, we keep for each pipeline_name only the runid with the max run_end_dt.
Thanks

Suppose your dataframe is stored in a variable named df.
Just use the groupby() method:
df.groupby('pipeline_name', as_index=False)[['runid', 'run_end_dt']].max()
Note that this takes the max of each column independently within each group; it works here because the runid values sort in the same order as run_end_dt, but in general the max runid and the max run_end_dt may come from different rows.

Use groupby followed by a transform, then build a boolean mask of the rows that hold the max value in each group.
idx = (df.groupby(['pipeline_name'], sort=False)['run_end_dt'].transform('max') == df['run_end_dt'])
df = df.loc[idx]
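An equivalent approach uses idxmax to pick the index label of the latest run in each group, which guarantees the whole returned row belongs to the record with the max run_end_dt; a minimal sketch reconstructing the sample dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'pipeline_name': ['test_pipeline', 'test_pipeline',
                      'test_pipeline2', 'test_pipeline2', 'test_pipeline2'],
    'runid': ['test_pipeline_run_101', 'test_pipeline_run_102',
              'test_pipeline2_run_101', 'test_pipeline2_run_102',
              'test_pipeline2_run_103'],
    'run_end_dt': pd.to_datetime([
        '2021-03-10 20:01:26.704265+00:00',
        '2021-03-13 20:08:31.929038+00:00',
        '2021-03-10 20:13:53.083525+00:00',
        '2021-03-12 20:14:51.757058+00:00',
        '2021-03-13 20:17:00.285573+00:00',
    ]),
})

# idxmax returns the index label of the max run_end_dt in each group,
# so .loc selects exactly one whole row per pipeline_name
latest = df.loc[df.groupby('pipeline_name')['run_end_dt'].idxmax()]
print(latest)
```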

Related

How to use order_by together with group_by in the Django ORM, and return all fields

I am using the Django ORM with PostgreSQL. Is it possible to query with both group_by and order_by?
Here is the table:
| id | b_id | others |
| 1 | 2 | hh |
| 2 | 2 | hhh |
| 3 | 6 | h |
| 4 | 7 | hi |
| 5 | 7 | i |
I want the query result to look like this:
| id | b_id | others |
| 1 | 2 | hh |
| 3 | 6 | h |
| 4 | 7 | hi |
or
| id | b_id | others |
| 4 | 7 | hi |
| 3 | 6 | h |
| 1 | 2 | hh |
I tried:
Table.objects.annotate(count=Count('b_id')).values('b_id', 'id', 'others')
Table.objects.values('b_id', 'id', 'others').annotate(count=Count('b_id'))
Table.objects.extra(order_by=['id']).values('b_id','id', 'others')
Try a window function and a subquery in the following way:
from django.db.models import Window, F, Subquery, Count
from django.db.models.functions import FirstValue

queryset = Table.objects.annotate(count=Count('b_id')).filter(pk__in=Subquery(
    Table.objects.annotate(
        first_id=Window(expression=FirstValue('id'),
                        partition_by=[F('b_id')],
                        order_by=F('id')))
    .values('first_id')))
You can try this:
from django.db.models import Count

result = (Table.objects
          .values('b_id')
          .annotate(count=Count('b_id')))
Note that this returns only the b_id values with their counts; it does not return the other fields.
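For reference, the window-function annotation above corresponds to a FIRST_VALUE(...) OVER (PARTITION BY ...) query; a minimal sketch of that SQL using sqlite3 and the sample rows (requires SQLite 3.25+ for window functions):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (id INTEGER, b_id INTEGER, others TEXT)')
conn.executemany('INSERT INTO t VALUES (?, ?, ?)',
                 [(1, 2, 'hh'), (2, 2, 'hhh'), (3, 6, 'h'),
                  (4, 7, 'hi'), (5, 7, 'i')])

# FIRST_VALUE(id) OVER (PARTITION BY b_id ORDER BY id) mirrors the
# Window(expression=FirstValue('id'), ...) annotation in the answer:
# keep only the row whose id is the first id within its b_id group
rows = conn.execute("""
    SELECT id, b_id, others FROM (
        SELECT id, b_id, others,
               FIRST_VALUE(id) OVER (PARTITION BY b_id ORDER BY id) AS first_id
        FROM t
    ) WHERE id = first_id
    ORDER BY id
""").fetchall()
print(rows)
```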

Unique count of values in column per month

Excel-Table:
| A | B | C | D | E | F | G |
-----|----------------|-----------------|------------------|--------|---------|---------|---------|-----
1 | month&year | date | customer | | 2020-01 | 2020-03 | 2020-04 |
-----|----------------|-----------------|------------------|--------|---------|---------|---------|-----
2 | 2020-01 | 2020-01-10 | Customer A | | 3 | 2 | 4 |
3 | 2020-01 | 2020-01-14 | Customer A | | | | |
4 | 2020-01 | 2020-01-17 | Customer B | | | | |
5 | 2020-01 | 2020-01-19 | Customer B | | | | |
6 | 2020-01 | 2020-01-23 | Customer C | | | | |
7 | 2020-01 | 2020-01-23 | Customer B | | | | |
-----|----------------|-----------------|------------------|--------|---------|---------|---------|-----
8 | 2020-03 | 2020-03-18 | Customer E | | | | |
9 | 2020-03 | 2020-03-19 | Customer A | | | | |
-----|----------------|-----------------|------------------|--------|---------|---------|---------|-----
10 | 2020-04 | 2020-04-04 | Customer B | | | | |
11 | 2020-04 | 2020-04-07 | Customer C | | | | |
12 | 2020-04 | 2020-04-07 | Customer A | | | | |
13 | 2020-04 | 2020-04-07 | Customer E | | | | |
14 | 2020-04 | 2020-04-08 | Customer A | | | | |
15 | 2020-04 | 2020-04-12 | Customer A | | | | |
16 | 2020-04 | 2020-04-15 | Customer B | | | | |
In my Excel file I want to calculate the unique count of customers per month, as you can see in Cells E2:G2.
I already inserted Column A as a helper column which extracts only the month and the year from the date in Column B.
Therefore, the date formatting is the same as in the timeline in Cells E1:G1.
I guess the formula to get the unique count per month is somehow related to =COUNTIFS($A:$A,E$1), but I have no clue how to modify this formula to get the expected values.
Do you have any idea?
Here's one approach which works in Office 365 if you have access to UNIQUE (entered in E2 and filled right):
=COUNTA(UNIQUE(IF($A$2:$A$16=E$1,$C$2:$C$16,""),,FALSE))-1
For older versions, following will work with CTRL+SHIFT+ENTER (array entry)
=SUM(--(FREQUENCY(IFERROR(MATCH($A$2:$A$16&$C$2:$C$16,E$1&$C$2:$C$16,0),"a"),MATCH($A$2:$A$16&$C$2:$C$16,E$1&$C$2:$C$16,0))>0))
You can do it without any helping column.
=SUM(--(UNIQUE(FILTER($C$2:$C$16,TEXT($B$2:$B$16,"yyyy-mm")=E$1))<>""))
For older versions of Excel, use the formula below with your helper column.
=SUMPRODUCT(--($A$2:$A$16=E$1)*(1/COUNTIFS($A$2:$A$16,$A$2:$A$16,$C$2:$C$16,$C$2:$C$16)))
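For comparison, the same unique-count-per-month logic is short in pandas; a sketch assuming the sheet's date and customer columns:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-01-10', '2020-01-14', '2020-01-17', '2020-01-19',
             '2020-01-23', '2020-01-23', '2020-03-18', '2020-03-19',
             '2020-04-04', '2020-04-07', '2020-04-07', '2020-04-07',
             '2020-04-08', '2020-04-12', '2020-04-15'],
    'customer': ['Customer A', 'Customer A', 'Customer B', 'Customer B',
                 'Customer C', 'Customer B', 'Customer E', 'Customer A',
                 'Customer B', 'Customer C', 'Customer A', 'Customer E',
                 'Customer A', 'Customer A', 'Customer B'],
})

# group on the year-month prefix (the role of helper Column A in the sheet)
# and count distinct customers per month
counts = (df.assign(month=df['date'].str[:7])
            .groupby('month')['customer'].nunique())
print(counts)
```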

Remove groups from pandas where {condition}

I have dataframe like this:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I've grouped it with grouped = df.groupby('envelopeid')
And I need to remove groups from the dataframe, keeping only the groups whose messages are (CMN.00002) alone or (CMN.00002 and CMN.00004) only.
Desired dataframe:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I tried:
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and x.ne('CMN.00002' or 'CMN.00004').any()) or x.ne('CMN.00002').all()))
but it does not work properly.
Try:
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message == 'CMN.00002']
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message) == ('CMN.00002',) or tuple(x.message) == ('CMN.00002', 'CMN.00004'))
So I figured it out.
The resulting dataframe will have only the groups that contain the CMN.00002 message alone, or CMN.00002 and CMN.00004. This is what I need.
I used filter instead of transform.
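The group-level condition in the filter answer can also be written with sets, which makes the "these messages and nothing else" intent explicit and order-independent; a sketch with the sample data (envelope ids abbreviated for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'envelopeid': ['d55edb65', 'd55edb65', 'd55edb65', '5cb72b9c', '5cb72b9c',
                   'f4260b99', '8f673ae3', 'fee98470', '88926399', '88926399'],
    'message': ['CMN.00002', 'CMN.00004', 'CMN.11001', 'CMN.00002', 'CMN.00001',
                'CMN.00202', 'CMN.00002', 'TKP.00001', 'CMN.00002', 'CMN.00004'],
})

# keep a group only when its message set is exactly {CMN.00002}
# or exactly {CMN.00002, CMN.00004}
allowed = ({'CMN.00002'}, {'CMN.00002', 'CMN.00004'})
out = df.groupby('envelopeid').filter(lambda g: set(g['message']) in allowed)
print(out)
```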

Multiplying and adding values from csv file in python

I have a csv file with the following data. I want to know how to multiply the values in the Qty column with the Avg cost column and then sum the results together.
| Instrument | Qty | Avg cost |
|------------|------|-----------|
| APLAPOLLO | 1 | 878.2 |
| AVANTIFEED | 2 | 488.95 |
| BALAMINES | 3 | 308.95 |
| BANCOINDIA | 5 | 195.2 |
| DCMSHRIRAM | 4 | 212.95 |
| GHCL | 4 | 241.75 |
| GIPCL | 9 | 102 |
| JAMNAAUTO | 5 | 178.8 |
| JBCHEPHARM | 3 | 348.65 |
| KEI | 8 | 121 |
| KPRMILL | 2 | 592.65 |
| KRBL | 3 | 274.45 |
| MPHASIS | 2 | 519.75 |
| SHEMAROO | 2 | 400 |
| VOLTAMP | 1 | 924 |
Try this:
temp_sum = 0
with open('yourfile.csv', 'r') as f:
    next(f)  # skip the header row, which cannot be converted to float
    for line in f:
        word = line.split(',')
        temp_sum += float(word[1]) * float(word[2])
print(temp_sum)
import pandas

colnames = ['Instrument', 'Qty', 'Avg_cost']
data = pandas.read_csv('test.csv', names=colnames, skiprows=1)  # skip the header row
qty = data.Qty.tolist()
avg = data.Avg_cost.tolist()
mult = []
for i in range(len(qty)):
    mult.append(qty[i] * avg[i])
sum_all = sum(mult)
print(sum_all)
print(mult)
I saved the file as test.csv and did the following:
import csv

with open('/tmp/test.csv', 'r') as f:
    next(f)  # skip first row
    total = sum(int(row[1]) * float(row[2]) for row in csv.reader(f))
print('The total is {}'.format(total))
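If pandas is already in play, the per-row loop can be replaced by one vectorized expression; a sketch that inlines the question's table via io.StringIO in place of reading test.csv:

```python
import io
import pandas as pd

csv_text = """Instrument,Qty,Avg cost
APLAPOLLO,1,878.2
AVANTIFEED,2,488.95
BALAMINES,3,308.95
BANCOINDIA,5,195.2
DCMSHRIRAM,4,212.95
GHCL,4,241.75
GIPCL,9,102
JAMNAAUTO,5,178.8
JBCHEPHARM,3,348.65
KEI,8,121
KPRMILL,2,592.65
KRBL,3,274.45
MPHASIS,2,519.75
SHEMAROO,2,400
VOLTAMP,1,924
"""

df = pd.read_csv(io.StringIO(csv_text))  # in practice: pd.read_csv('test.csv')

# element-wise product of the two columns, then a single sum
total = (df['Qty'] * df['Avg cost']).sum()
print(total)
```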

Create columns from column values in Excel

I have data in Excel:
+-----------------------------+--------------------+----------+
| Name | Category | Number |
+-----------------------------+--------------------+----------+
| Alex | Portret | 3 |
| Alex | Other | 2 |
| Serge | Animals | 1 |
| Serge | Portret | 4 |
+-----------------------------+--------------------+----------+
And I want to transform it to:
+-----------+-----------+-------+---------+
| Name | Portret | Other | Animals |
+-----------+-----------+-------+---------+
| Alex | 3 | 2 | 0 |
| Serge | 4 | 0 | 1 |
+-----------+-----------+-------+---------+
How can I do it in MS Excel?
You can use a pivot table for that
Take a look at http://office.microsoft.com/en-gb/excel-help/pivottable-reports-101-HA001034632.aspx
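The same reshape is also a one-liner in pandas, for comparison; a sketch with the sample rows, where fill_value=0 supplies the zeros for missing Name/Category pairs:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alex', 'Alex', 'Serge', 'Serge'],
    'Category': ['Portret', 'Other', 'Animals', 'Portret'],
    'Number': [3, 2, 1, 4],
})

# one row per Name, one column per Category, missing pairs filled with 0
wide = df.pivot_table(index='Name', columns='Category',
                      values='Number', fill_value=0)
print(wide)
```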
