Pivot a column so repeated values/records are placed in 1 cell - python-3.x

I have the following
Input:
samples = [('001', 'RENAL', 'CHROMOPHOBE', 'KICH'),
('002', 'OVARIAN', 'HIGH_GRADE_SEROUS_CARCINOMA', 'LGSOC'),
('003', 'OVARIAN', 'OTHER', 'NaN'),
('001', 'COLORECTAL', 'ADENOCARCINOMA', 'KICH')]
labels = ['id', 'disease_type', 'disease_sub_type', 'study_abbreviation']
df = pd.DataFrame.from_records(samples, columns=labels)
df
id disease_type disease_sub_type study_abbreviation
0 001 RENAL CHROMOPHOBE KICH
1 002 OVARIAN HIGH_GRADE_SEROUS_CARCINOMA LGSOC
2 003 OVARIAN OTHER NaN
3 001 COLORECTAL ADENOCARCINOMA KICH
I want to be able to compress the repeated id (001 in this case) so that disease_type, disease_sub_type, and study_abbreviation are each merged into one cell (nested).
Output
id disease_type disease_sub_type study_abbreviation
0 001 RENAL,COLORECTAL CHROMOPHOBE,ADENOCARCINOMA KICH, KICH
1 002 OVARIAN HIGH_GRADE_SEROUS_CARCINOMA LGSOC
2 003 OVARIAN OTHER NaN
This is nothing but admin work, hence the simple ask, but it would help greatly when I need to merge on other datasets. Thanks again.

You could group by your 'id' column and use ','.join as the aggregation:
df.groupby('id', as_index=False).agg(','.join)
id disease_type disease_sub_type study_abbreviation
0 001 RENAL,COLORECTAL CHROMOPHOBE,ADENOCARCINOMA KICH,KICH
1 002 OVARIAN HIGH_GRADE_SEROUS_CARCINOMA LGSOC
2 003 OVARIAN OTHER NaN
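If you would rather keep actual lists in each cell (nested) instead of comma-joined strings, the same groupby can aggregate with list; a minimal sketch of that variant:
# Aggregate every non-key column into a Python list per id
df.groupby('id', as_index=False).agg(list)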

Related

Calculate quantity using an unrelated field in DAX

ID
001
002

REQ
ID    ITEM    QUANT
001   chips   20
002   chips   100

SCHEDULE
1   001   cleaning
1   002   normal
2   001   normal
2   002   remodel
3   001   normal
3   002   remodel
4   001   remodel
4   002   cleaning

item = corn chips

id_store           1          2         3         4
001       phase    cleaning   normal    normal    remodel
          quant    0          20        20        5
002       phase    normal     remodel   remodel   cleaning
          quant    100        5         5         0
I want to calculate a quant given a store's phase: if the store is cleaning, the quant is 0; if it is remodeling, the quant is 5; otherwise it is the quant from the requirements (REQ) table.
Normally I would do this with a SWITCH statement in DAX, but the phase data is not in my table. Please assist.
It turns out a simple SWITCH statement looking at the different tables works just fine.
Num Items :=
VAR T = SUM ( REQ[Quant] )
RETURN
    SWITCH (
        TRUE (),
        VALUES ( SCHEDULE[PHASE] ) = "cleaning", 0,
        VALUES ( SCHEDULE[PHASE] ) = "remodeling", 5,
        T
    )

Can we separate data using Unique ID into the following format?

Current Format:

UNIQUE ID   NAME    AGE   DEP   RANK
001         John    10    4th   1
002         Priya   11    4th   2
003         Jack    15    5th   2
004         Jill    14    5th   1
Expected Format:

UNIQUE ID   NAME    COLUMN_NO
001         John    1
001         10      2
001         4th     3
001         1       4
002         Priya   1
002         11      2
002         4th     3
002         2       4
My starting point:
>>> df
UNIQUE ID NAME AGE DEP RANK
0 1 John 10 4th 1
1 2 Priya 11 4th 2
2 3 Jack 15 5th 2
3 4 Jill 14 5th 1
The basic transformation you need is provided by df.stack, which results in:
0  UNIQUE ID        1
   NAME          John
   AGE             10
   DEP            4th
   RANK             1
1  UNIQUE ID        2
   NAME         Priya
[...]
However, you want column UNIQUE ID to be treated separately. This can be accomplished by making it the index:
>>> df.set_index('UNIQUE ID').stack()
UNIQUE ID
1          NAME      John
           AGE         10
           DEP        4th
           RANK         1
2          NAME     Priya
           AGE         11
           DEP        4th
           RANK         2
The last missing bit is the column names: you want them renamed to numbers. This can be accomplished in two different ways: a) by re-assigning df.columns (after having moved column UNIQUE ID to the index first):
df = df.set_index('UNIQUE ID')
df.columns = range(1, 5)
or b) by using df.rename on the columns:
df = df.set_index('UNIQUE ID')
df = df.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
And finally you can convert the resulting Series back to a DataFrame. The most elegant way to get COLUMN NO into the right place is to use df.rename_axis before stacking. All together as one expression (though it is possibly better to split it up):
>>> (df.set_index('UNIQUE ID')
.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
.rename_axis('COLUMN NO', axis=1)
.stack()
.to_frame('NAME')
.reset_index())
UNIQUE ID COLUMN NO NAME
0 1 1 John
1 1 2 10
2 1 3 4th
3 1 4 1
4 2 1 Priya
5 2 2 11
6 2 3 4th
7 2 4 2
8 3 1 Jack
9 3 2 15
10 3 3 5th
11 3 4 2
12 4 1 Jill
13 4 2 14
14 4 3 5th
15 4 4 1
Things left out: reading the data, and preserving the correct types: UNIQUE ID only looks numeric, but it has leading zeros that probably need to be preserved, so parsing it as a string would be better.
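For the reading step, one way (assuming the data comes from a CSV; the file name here is a placeholder) is to force the UNIQUE ID column to string while parsing so the leading zeros survive:
import pandas as pd
# 'data.csv' is a placeholder name; dtype=str keeps leading zeros such as '001'
df = pd.read_csv('data.csv', dtype={'UNIQUE ID': str})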

How to calculate data changes over time using Python

For the following dataframe, I need to calculate the change in 'uid_count' for each set of date, location_id, uid, and include the set in the results.
# Sample DataFrame
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02'],
                   'location_id': [1001, 2001, 3001, 1001, 2001, 3001],
                   'uid': ['001', '003', '002', '001', '004', '002'],
                   'uid_count': [1, 2, 3, 2, 2, 4]})
date location_id uid uid_count
0 2021-01-01 1001 001 1
1 2021-01-01 2001 003 2
2 2021-01-01 3001 002 3
3 2021-01-02 1001 001 2
4 2021-01-02 2001 004 2
5 2021-01-02 3001 002 4
My desired results would look like:
# Desired Results
date location_id uid
2021-01-01 1001 001 0
2001 003 0
3001 002 0
2021-01-02 1001 001 1
2001 004 0
3001 002 1
I thought I could do this via groupby using the following, but the desired calculation isn't made:
# Current code:
df.groupby(['date','location_id','uid'], sort=False).apply(lambda x: (x['uid_count'].values[-1] - x['uid_count'].values[0]))
# Current results:
date location_id uid
2021-01-01 1001 001 0
2001 003 0
3001 002 0
2021-01-02 1001 001 0
2001 004 0
3001 002 0
How can I get the desired results?
The following code works with the test dataframe; I'm not certain about a larger dataframe.
.transform() is used to calculate the difference between consecutive occurrences of 'uid_count' for each uid, with the same index as df.
The issue with .groupby(['date','location_id','uid']) is that each group only contains a single value.
Remove 'uid_count' at the end with .drop(columns='uid_count'), if desired (a short example follows the result below).
import pandas as pd
# sort the dataframe
df = df.sort_values(['date', 'location_id', 'uid'])
# groupby and transform based on the difference in uid_count
uid_count_diff = df.groupby(['location_id', 'uid']).uid_count.transform(lambda x: x.diff()).fillna(0).astype(int)
# create a column in df
df['uid_count_diff'] = uid_count_diff
# set the index
df = df.set_index(['date', 'location_id', 'uid'])
# result
uid_count uid_count_diff
date location_id uid
2021-01-01 1001 001 1 0
2001 003 2 0
3001 002 3 0
2021-01-02 1001 001 2 1
2001 004 2 0
3001 002 4 1
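And, as noted above, the raw count column can be dropped at the end if only the change is wanted:
# keep only the per-group change
result = df.drop(columns='uid_count')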

How to compare two data frames based on difference in date

I have two data frames, each with an id column and a date column.
I want to find rows in both data frames that have the same id with a date difference of more than 2 days.
Normally it's helpful to include a dataframe so that the responder doesn't need to create it. :)
import pandas as pd
from datetime import timedelta
Create two dataframes:
df1 = pd.DataFrame(data={"id":[0,1,2,3,4], "date":["2019-01-01","2019-01-03","2019-01-05","2019-01-07","2019-01-09"]})
df1["date"] = pd.to_datetime(df1["date"])
df2 = pd.DataFrame(data={"id":[0,1,2,8,4], "date":["2019-01-02","2019-01-06","2019-01-09","2019-01-07","2019-01-10"]})
df2["date"] = pd.to_datetime(df2["date"])
They will look like this:
DF1
id date
0 0 2019-01-01
1 1 2019-01-03
2 2 2019-01-05
3 3 2019-01-07
4 4 2019-01-09
DF2
id date
0 0 2019-01-02
1 1 2019-01-06
2 2 2019-01-09
3 8 2019-01-07
4 4 2019-01-10
Merge the two dataframes on 'id' columns:
df_result = df1.merge(df2, on="id")
Resulting in:
id date_x date_y
0 0 2019-01-01 2019-01-02
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
3 4 2019-01-09 2019-01-10
Then subtract the two date columns and filter for a difference greater than two days.
df_result[(df_result["date_y"] - df_result["date_x"]) > timedelta(days=2)]
id date_x date_y
1 1 2019-01-03 2019-01-06
2 2 2019-01-05 2019-01-09
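If the dates can differ in either direction and you care about the absolute gap, the same filter can use .abs() on the timedelta; a small sketch under that assumption:
# keep pairs whose dates differ by more than 2 days in either direction
df_result[(df_result["date_y"] - df_result["date_x"]).abs() > timedelta(days=2)]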

Restructure dataframe based on given keys

I'm working on a dataset and after all the cleaning and restructuring I have arrived at a situation where the dataset looks like below.
import pandas as pd
df = pd.read_csv('data.csv', dtype={'freq_no': object, 'sequence': object, 'field': object})
print(df)
CSV URL: https://pastebin.com/raw/nkDHEXQC
id year period freq_no sequence file_date data_date field \
0 abcdefghi 2018 A 001 001 20180605 20180331 05210
1 abcdefghi 2018 A 001 001 20180605 20180331 05210
2 abcdefghi 2018 A 001 001 20180605 20180331 05210
3 abcdefghi 2018 A 001 001 20180605 20180330 05220
4 abcdefghi 2018 A 001 001 20180605 20180330 05220
5 abcdefghi 2018 A 001 001 20180605 20180330 05230
6 abcdefghi 2018 A 001 001 20180605 20180330 05230
value note_type note transaction_type
0 200.0 NaN NaN A
1 NaN B {05210_B:ABC} A
2 NaN U {05210_U:DEFF} D
3 200.0 NaN NaN U
4 NaN U {05220_U:xyz} D
5 100.0 NaN NaN D
6 NaN U {05230_U:lmn} A
I want to restructure the above so that it looks like the output below.
Logic:
Use id, year, period, freq_no, sequence, data_date as key (groupby?)
Transpose such that field becomes column and this column has value as its values
Create a combined_note column by concatenating note (for same key)
Create a deleted column which will show which note or value was deleted based on transaction_type D.
Output:
id year period freq_no sequence file_date data_date 05210 \
0 abcdefghi 2018 A 001 001 20180605 20180331 200.0
1 abcdefghi 2018 A 001 001 20180605 20180330 NaN
05220 05230 combined_note deleted
0 NaN NaN {05210_B:ABC}{05210_U:DEFF} note{05210_U:DEFF} #because for note 05210_U:DEFF the trans_type was D
1 200.0 100.0 {05220_U:xyz}{05230_U:lmn} note{05220_U:xyz}|05230 #because for note {05220_U:xyz} trans_type is D, we also show field (05230) here separated by pipe because for that row the trans_type is D
I think this can be done by using set_index on the key and then restructuring the other columns, but I wasn't able to get the desired output.
So I ended up having to do this with a merge.
Logical Steps:
Group the DataFrame by all fields except note and value, so that the field and transaction_type columns are not affected by the aggregation.
Add a deleted column.
Build a first DataFrame that contains the aggregation of the notes (deleted ones as well).
Build a second DataFrame that transforms field and value into multiple columns.
Merge the first and second DataFrame on the index.
Code:
import pandas as pd
import io
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# url = "https://pastebin.com/raw/nkDHEXQC"
csv_string = b"""id,year,period,freq_no,sequence,file_date,data_date,field,value,note_type,note,transaction_type
abcdefghi,2018,A,001,001,20180605,20180331,05210,200,,,A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,B,{05210_B:ABC},A
abcdefghi,2018,A,001,001,20180605,20180331,05210,,U,{05210_U:DEFF},D
abcdefghi,2018,A,001,001,20180605,20180330,05220,200,,,U
abcdefghi,2018,A,001,001,20180605,20180330,05220,,U,{05220_U:xyz},D
abcdefghi,2018,A,001,001,20180605,20180330,05230,100,,,D
abcdefghi,2018,A,001,001,20180605,20180330,05230,,U,{05230_U:lmn},A
"""
data = io.BytesIO(csv_string)
df = pd.read_csv(data, dtype={'freq_no': object, 'sequence': object, 'field': object})
# so the aggregation function will work
df['note'] = df['note'].fillna('')
grouped = df.groupby(
['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date', 'field', 'transaction_type']).agg(['sum'])
grouped.columns = grouped.columns.droplevel(1)
grouped.reset_index(['field', 'transaction_type'], inplace=True)
gcolumns = ['id', 'year', 'period', 'freq_no', 'sequence', 'data_date', 'file_date']
def is_deleted(note, trans_type, field):
"""Determines if a note is deleted"""
deleted = []
for val, val2 in zip(note, trans_type):
if val != "":
if val2 == 'D':
deleted.append(val)
else:
deleted.append('')
else:
deleted.append('')
return pd.Series(deleted, index=note.index)
# This function will add the deleted notes
# I am not sure about the pipe operator; I will leave that to you (a possible sketch follows the output)
grouped['deleted'] = is_deleted(grouped['note'], grouped['transaction_type'], grouped['field'])
# This will obtain all agg of all the notes and deleted
notes = grouped.drop(['field', 'transaction_type', 'value'], axis=1).reset_index().groupby(gcolumns).agg(sum)
# pivot the field/value pairs into separate columns,
# using pivot_table to take advantage of the multi index
stacked_values = grouped.pivot_table(index=gcolumns, columns='field', values='value')
# finally merge the notes and stacked_value on their index
final = stacked_values.merge(notes, left_index=True, right_index=True).rename(columns={'note': 'combined_note'}).reset_index()
Output:
final
id year period freq_no sequence data_date file_date 05210 05220 05230 combined_note deleted
0 abcdefghi 2018 A 001 001 20180330 20180605 NaN 200.0 100.0 {05220_U:xyz}{05230_U:lmn} {05220_U:xyz}
1 abcdefghi 2018 A 001 001 20180331 20180605 200.0 NaN NaN {05210_B:ABC}{05210_U:DEFF} {05210_U:DEFF}
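For the pipe-separated field part that the comment in the code leaves open, one possible sketch (hypothetical, not part of the original answer) is to collect, per key, the fields whose value row carries transaction_type 'D' and merge that back onto final:
# hypothetical extension: fields whose *value* row was deleted
# (transaction_type == 'D' with a non-null value), joined with '|', per key
value_deleted = (df[(df['transaction_type'] == 'D') & df['value'].notna()]
                 .groupby(gcolumns)['field']
                 .agg('|'.join)
                 .rename('value_deleted'))
final = final.merge(value_deleted.reset_index(), on=gcolumns, how='left')
# final['deleted'] could then be concatenated with '|' + value_deleted where it is not NaN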
