How to calculate data changes over time using Python - python-3.x

For the following dataframe, I need to calculate the change in 'uid_count' for each set of date, location_id, and uid, and include the set in the results.
# Sample DataFrame
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-01',
                            '2021-01-02', '2021-01-02', '2021-01-02'],
                   'location_id': [1001, 2001, 3001, 1001, 2001, 3001],
                   'uid': ['001', '003', '002', '001', '004', '002'],
                   'uid_count': [1, 2, 3, 2, 2, 4]})
date location_id uid uid_count
0 2021-01-01 1001 001 1
1 2021-01-01 2001 003 2
2 2021-01-01 3001 002 3
3 2021-01-02 1001 001 2
4 2021-01-02 2001 004 2
5 2021-01-02 3001 002 4
My desired results would look like:
# Desired Results
date location_id uid
2021-01-01 1001 001 0
2001 003 0
3001 002 0
2021-01-02 1001 001 1
2001 004 0
3001 002 1
I thought I could do this via groupby using the following, but the desired calculation isn't made:
# Current code:
df.groupby(['date', 'location_id', 'uid'], sort=False).apply(lambda x: (x['uid_count'].values[-1] - x['uid_count'].values[0]))
# Current results:
date location_id uid
2021-01-01 1001 001 0
2001 003 0
3001 002 0
2021-01-02 1001 001 0
2001 004 0
3001 002 0
How can I get the desired results?

The following code works with the test dataframe; I'm not certain how it will behave on a larger one.
.transform() is used to calculate the difference between consecutive occurrences of 'uid_count' for each uid, returning a Series with the same index as df.
The issue with .groupby(['date','location_id','uid']) is that each group contains only a single value, so there is nothing to difference within a group.
Remove 'uid_count' at the end, with .drop(columns='uid_count'), if desired.
import pandas as pd
# sort the dataframe
df = df.sort_values(['date', 'location_id', 'uid'])
# groupby and transform based on the difference in uid_count
uid_count_diff = df.groupby(['location_id', 'uid']).uid_count.transform(lambda x: x.diff()).fillna(0).astype(int)
# create a column in df
df['uid_count_diff'] = uid_count_diff
# set the index
df = df.set_index(['date', 'location_id', 'uid'])
# result
uid_count uid_count_diff
date location_id uid
2021-01-01 1001 001 1 0
2001 003 2 0
3001 002 3 0
2021-01-02 1001 001 2 1
2001 004 2 0
3001 002 4 1
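For reference, the lambda inside .transform() can be avoided: groupby(...).diff() computes the same per-group differences directly. A minimal sketch of that variant, assuming the same sample dataframe as above:
import pandas as pd
# same sample data as in the question
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-01',
                            '2021-01-02', '2021-01-02', '2021-01-02'],
                   'location_id': [1001, 2001, 3001, 1001, 2001, 3001],
                   'uid': ['001', '003', '002', '001', '004', '002'],
                   'uid_count': [1, 2, 3, 2, 2, 4]})
df = df.sort_values(['date', 'location_id', 'uid'])
# .diff() on the grouped Series keeps the original index, so it can be assigned straight back
df['uid_count_diff'] = (df.groupby(['location_id', 'uid'])['uid_count']
                          .diff()
                          .fillna(0)
                          .astype(int))
print(df.set_index(['date', 'location_id', 'uid']))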

Related

Pivot a column so repeated values/records are placed in 1 cell

I have the following
Input:
samples = [('001', 'RENAL', 'CHROMOPHOBE', 'KICH'),
('002', 'OVARIAN', 'HIGH_GRADE_SEROUS_CARCINOMA', 'LGSOC'),
('003', 'OVARIAN', 'OTHER', 'NaN'),
('001', 'COLORECTAL', 'ADENOCARCINOMA', 'KICH')]
labels = ['id', 'disease_type', 'disease_sub_type', 'study_abbreviation']
df = pd.DataFrame.from_records(samples, columns=labels)
df
id disease_type disease_sub_type study_abbreviation
0 001 RENAL CHROMOPHOBE KICH
1 002 OVARIAN HIGH_GRADE_SEROUS_CARCINOMA LGSOC
2 003 OVARIAN OTHER NaN
3 001 COLORECTAL ADENOCARCINOMA KICH
I want to be able to compress the repeated id (001 in this case) so that disease_type, disease_sub_type, and study_abbreviation are each merged into one cell (nested).
Output
id disease_type disease_sub_type study_abbreviation
0 001 RENAL,COLORECTAL CHROMOPHOBE,ADENOCARCINOMA KICH, KICH
1 002 OVARIAN HIGH_GRADE_SEROUS_CARCINOMA LGSOC
2 003 OVARIAN OTHER NaN
This is only for admin work, hence the simple ask, but it would help greatly when I need to merge on other datasets. Thanks again.
You could group by your 'id' column and use ','.join as the aggregation:
df.groupby('id',as_index=False).agg(','.join)
id disease_type disease_sub_type study_abbreviation
0 001 RENAL,COLORECTAL CHROMOPHOBE,ADENOCARCINOMA KICH,KICH
1 002 OVARIAN HIGH_GRADE_SEROUS_CARCINOMA LGSOC
2 003 OVARIAN OTHER NaN
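Note that this works here because the missing study_abbreviation is the literal string 'NaN'; with real NaN values, ','.join would raise a TypeError. A small sketch that guards against that case, assuming the same df as above:
# cast to str inside the aggregation so real NaN values don't break the join
out = df.groupby('id', as_index=False).agg(lambda s: ','.join(s.astype(str)))
print(out)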

how to compare two data frames based on difference in date

I have two data frames, each with an id column and a date column.
I want to find rows in both data frames that have the same id with a date difference of more than 2 days.
Normally it's helpful to include a dataframe so that the responder doesn't need to create one. :)
import pandas as pd
from datetime import timedelta
Create two dataframes:
df1 = pd.DataFrame(data={"id":[0,1,2,3,4], "date":["2019-01-01","2019-01-03","2019-01-05","2019-01-07","2019-01-09"]})
df1["date"] = pd.to_datetime(df1["date"])
df2 = pd.DataFrame(data={"id":[0,1,2,8,4], "date":["2019-01-02","2019-01-06","2019-01-09","2019-01-07","2019-01-10"]})
df2["date"] = pd.to_datetime(df2["date"])
They will look like this:
DF1
id date
0 0 2019-01-01
1 1 2019-01-03
2 2 2019-01-05
3 3 2019-01-07
4 4 2019-01-09
DF2
id date
0 0 2019-01-02
1 1 2019-01-06
2 2 2019-01-09
3 8 2019-01-07
4 4 2019-01-10
Merge the two dataframes on the 'id' column:
df_result = df1.merge(df2, on="id")
Resulting in:
id date_x date_y
0 0 2019-01-01 2019-01-02
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
3 4 2019-01-09 2019-01-10
Then subtract the two date columns and filter for differences greater than two days.
df_result[(df_result["date_y"] - df_result["date_x"]) > timedelta(days=2)]
id date_x date_y
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
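The same merge-and-filter can also be written as a single chained expression; this is just a sketch of the identical logic, assuming the df1 and df2 built above:
# same logic as above, chained; assumes df1 and df2 from the setup code
result = (
    df1.merge(df2, on="id")
       .loc[lambda d: (d["date_y"] - d["date_x"]) > pd.Timedelta(days=2)]
)
print(result)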

manipulating pandas dataframe - conditional

I have a pandas dataframe that looks like this:
ID Date Event_Type
1 01/01/2019 A
1 01/01/2019 B
2 02/01/2019 A
3 02/01/2019 A
I want to be left with:
ID Date
1 01/01/2019
2 02/01/2019
3 02/01/2019
Where my condition is:
If the ID is the same AND the dates are within 2 days of each other then drop one of the rows.
If however the dates are more than 2 days apart then keep both rows.
How do I do this?
I believe you first need to convert the values to datetimes with to_datetime, then take the per-group diff and keep the first row of each group (where the diff is null, via isnull()) together with any row whose difference from the previous row exceeds the timedelta threshold:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
s = df.groupby('ID')['Date'].diff()
df = df[(s.isnull() | (s > pd.Timedelta(2, 'd')))]
print (df)
ID Date Event_Type
0 1 2019-01-01 A
2 2 2019-01-02 A
3 3 2019-01-02 A
Check the solution with different data:
print (df)
ID Date Event_Type
0 1 01/01/2019 A
1 1 04/01/2019 B <-difference 3 days
2 2 02/01/2019 A
3 3 02/01/2019 A
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
s = df.groupby('ID')['Date'].diff()
df = df[(s.isnull() | (s > pd.Timedelta(2, 'd')))]
print (df)
ID Date Event_Type
0 1 2019-01-01 A
1 1 2019-01-04 B
2 2 2019-01-02 A
3 3 2019-01-02 A
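Since the question only asks for ID and Date in the result, a final selection can be added; a minimal sketch, assuming df is the filtered frame from above:
# assumes `df` is the filtered dataframe produced by the solution above
result = df[['ID', 'Date']].reset_index(drop=True)
print(result)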

Cannot convert object to date after groupby

I was successful with this conversion while working with a different dataset a couple of days ago. However, I cannot apply the same technique to my current dataset, which looks like this:
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
Datez Volumez
0 2016-09-19 6.300000e+07
1 2016-09-20 3.382694e+07
2 2016-09-26 4.000000e+05
3 2016-09-27 4.900000e+09
4 2016-09-28 5.324995e+08
totalHist.dtypes
Datez object
Volumez float64
dtype: object
This used to do the trick:
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'], format='%d-%m-%Y')
totalHist.dtypes
which is now giving me:
KeyError: 'Datez'
During handling of the above exception, another exception occurred:
How can I fix this? I am doing this groupby before attempting the conversion:
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
totalHist.head()
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
You can just use .rename() to rename your columns.
Generate some data (in the same format as the OP):
d = ['1/1/2018', '1/2/2018', '1/3/2018', '1/3/2018',
     '1/4/2018', '1/2/2018', '1/1/2018', '1/5/2018']
df = pd.DataFrame(d, columns=['Date'])
df['Trading_Value'] = [1000,1005,1001,1001,1002,1009,1010,1002]
print(df)
Date Trading_Value
0 1/1/2018 1000
1 1/2/2018 1005
2 1/3/2018 1001
3 1/3/2018 1001
4 1/4/2018 1002
5 1/2/2018 1009
6 1/1/2018 1010
7 1/5/2018 1002
Group by
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
print(totalHist.head())
Date Trading_Value
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Rename columns
totalHist.rename(columns={'Date': 'Datez', 'Trading_Value': 'Volumez'}, inplace=True)
print(totalHist)
Datez Volumez
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Finally, convert to datetime
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'])
print(totalHist.dtypes)
Datez datetime64[ns]
Volumez int64
dtype: object
This was done with Python 3.6.7 and pandas 0.23.4.
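If a newer pandas (0.25 or later) is available, the groupby, rename, and datetime conversion can be collapsed using named aggregation; this is only a sketch of the same result, assuming the df generated above:
# a sketch assuming the `df` generated above and pandas >= 0.25 (named aggregation)
totalHist = (
    df.groupby('Date', as_index=False)
      .agg(Volumez=('Trading_Value', 'sum'))
      .rename(columns={'Date': 'Datez'})
      .assign(Datez=lambda d: pd.to_datetime(d['Datez']))
)
print(totalHist.dtypes)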

Restructure dataframe based on given keys

I'm working on a dataset, and after all the cleaning and restructuring it looks like below.
import pandas as pd
df = pd.read_csv('data.csv', dtype={'freq_no': object, 'sequence': object, 'field': object})
print(df)
CSV URL: https://pastebin.com/raw/nkDHEXQC
id year period freq_no sequence file_date data_date field \
0 abcdefghi 2018 A 001 001 20180605 20180331 05210
1 abcdefghi 2018 A 001 001 20180605 20180331 05210
2 abcdefghi 2018 A 001 001 20180605 20180331 05210
3 abcdefghi 2018 A 001 001 20180605 20180330 05220
4 abcdefghi 2018 A 001 001 20180605 20180330 05220
5 abcdefghi 2018 A 001 001 20180605 20180330 05230
6 abcdefghi 2018 A 001 001 20180605 20180330 05230
value note_type note transaction_type
0 200.0 NaN NaN A
1 NaN B {05210_B:ABC} A
2 NaN U {05210_U:DEFF} D
3 200.0 NaN NaN U
4 NaN U {05220_U:xyz} D
5 100.0 NaN NaN D
6 NaN U {05230_U:lmn} A
I want to restructure above so that it looks like below.
Logic:
Use id, year, period, freq_no, sequence, data_date as key (groupby?)
Pivot so that each field becomes a column, and that column takes value as its values
Create a combined_note column by concatenating note (for same key)
Create a deleted column which will show which note or value was deleted based on transaction_type D.
Output:
id year period freq_no sequence file_date data_date 05210 \
0 abcdefghi 2018 A 001 001 20180605 20180331 200.0
1 abcdefghi 2018 A 001 001 20180605 20180330 NaN
05220 05230 combined_note deleted
0 NaN NaN {05210_B:ABC}{05210_U:DEFF} note{05210_U:DEFF} # because for note {05210_U:DEFF} the transaction_type was D
1 200.0 100.0 {05220_U:xyz}{05230_U:lmn} note{05220_U:xyz}|05230 # because for note {05220_U:xyz} the transaction_type is D; the field (05230) is also shown, separated by a pipe, because that row's transaction_type is D
I think this can be done by using set_index on the key and then restructuring the other columns, but I wasn't able to get the desired output.
So I ended up having to do this with a merge.
Logical Steps:
Group the DataFrame by all fields except note and value, so that the field and transaction_type columns are not affected by the aggregation.
Add a deleted column.
Build a first DataFrame that contains the aggregation of the notes (including the deleted ones).
Build a second DataFrame that pivots field and value into multiple columns.
Merge the first and second DataFrames on their index.
Code:
import pandas as pd
import io
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# url = "https://pastebin.com/raw/nkDHEXQC"
csv_string = b"""id,year,period,freq_no,sequence,file_date,data_date,field,value,note_type,note,transaction_type
abcdefghi,2018,A,001,001,20180605,20180331,05210,200,,,A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,B,{05210_B:ABC},A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,U,{05210_U:DEFF},D
abcdefghi,2018,A,001,001,20180605,20180330,05220,200,,,U
abcdefghi,2018,A,001,001,20180605,20180330,05220,,U,{05220_U:xyz},D
abcdefghi,2018,A,001,001,20180605,20180330,05230,100,,,D
abcdefghi,2018,A,001,001,20180605,20180330,05230,,U,{05230_U:lmn},A
"""
data = io.BytesIO(csv_string)
df = pd.read_csv(data, dtype={'freq_no': object, 'sequence': object, 'field': object})
# fill NaN notes with '' so the string aggregation will work
df['note'] = df['note'].fillna('')
grouped = df.groupby(
    ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date', 'field', 'transaction_type']).agg(['sum'])
grouped.columns = grouped.columns.droplevel(1)
grouped.reset_index(['field', 'transaction_type'], inplace=True)
gcolumns = ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date']
def is_deleted(note, trans_type, field):
    """Determines if a note is deleted."""
    # note: `field` is accepted but not used here; it is only needed for the '|' logic left to the reader
    deleted = []
    for val, val2 in zip(note, trans_type):
        if val != "":
            if val2 == 'D':
                deleted.append(val)
            else:
                deleted.append('')
        else:
            deleted.append('')
    return pd.Series(deleted, index=note.index)
# This call adds the deleted notes
# I am not sure about the pipe operator; I will leave that to you
grouped['deleted'] = is_deleted(grouped['note'], grouped['transaction_type'], grouped['field'])
# This aggregates all the notes and the deleted markers per key
notes = grouped.drop(['field', 'transaction_type', 'value'], axis=1).reset_index().groupby(gcolumns).agg(sum)
# pivot field/value into separate columns; pivot_table is used to take advantage of the MultiIndex
stacked_values = grouped.pivot_table(index=gcolumns, columns='field', values='value')
# finally merge notes and stacked_values on their index
final = stacked_values.merge(notes, left_index=True, right_index=True).rename(columns={'note': 'combined_note'}).reset_index()
Output:
final
id year period freq_no sequence data_date file_date 05210 05220 05230 combined_note deleted
0 abcdefghi 2018 A 001 001 20180330 20180605 NaN 200.0 100.0 {05220_U:xyz}{05230_U:lmn} {05220_U:xyz}
1 abcdefghi 2018 A 001 001 20180331 20180605 200.0 NaN NaN {05210_B:ABC}{05210_U:DEFF} {05210_U:DEFF}
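For the pipe part left to the reader, one reading of the desired output is that when a value row (one with an empty note) has transaction_type 'D', its field should be appended with a '|' prefix, so that the later note aggregation concatenates it onto the deleted notes. A sketch of a replacement for is_deleted along those lines, assuming the grouped frame and column names from the code above; treat it as untested:
def deleted_marker(note, trans_type, field):
    """Sketch: deleted notes keep their text; deleted values are referenced by '|<field>'."""
    out = []
    for n, t, f in zip(note, trans_type, field):
        if t != 'D':
            out.append('')        # nothing deleted on this row
        elif n != '':
            out.append(n)         # a deleted note: keep the note text
        else:
            out.append('|' + f)   # a deleted value: reference its field
    return pd.Series(out, index=note.index)

grouped['deleted'] = deleted_marker(grouped['note'], grouped['transaction_type'], grouped['field'])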
