Checking if a value is present in another specified column of the same table - python-3.x

I want to check whether the value in a particular row of one column is present in another column.
df:
sno  id1  id2  id3
1    1,2  7    1,2,7,22
2    2    8,9  2,8,9,15,17
3    1,5  6    1,5,6,17,33
4    4         4,12,18
5         9    9,14
output:
For a particular row, the scoring logic is (pseudocode):
for i in sno:
    if id1 in id3:
        score = 50
    elif id2 in id3:
        score = 50
    if id1 in id3 and id2 in id3:
        score = 75
I finally want the score produced by that logic.

You can convert all values to sets with split and then compare them with issubset; the additional and bool(a) condition omits empty sets (created from missing values):
print (df)
   sno  id1  id2          id3
0    1  1,2    7   1,20,70,22
1    2    2  8,9  2,8,9,15,17
2    3  1,5    6  1,5,6,17,33
3    4    4  NaN      4,12,18
4    5  NaN    9         9,14
import numpy as np

def convert(x):
    # split comma-separated strings into sets; missing values become empty sets
    return set(x.split(',')) if isinstance(x, str) else set()

cols = ['id1', 'id2', 'id3']
df1 = df[cols].applymap(convert)

m1 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id1'], df1['id3'])])
m2 = np.array([a.issubset(b) and bool(a) for a, b in zip(df1['id2'], df1['id3'])])
df['new'] = np.select([m1 & m2, m1 | m2], [75, 50], np.nan)
print (df)
   sno  id1  id2          id3   new
0    1  1,2    7   1,20,70,22   NaN
1    2    2  8,9  2,8,9,15,17  75.0
2    3  1,5    6  1,5,6,17,33  75.0
3    4    4  NaN      4,12,18  50.0
4    5  NaN    9         9,14  50.0
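One reason the answer compares sets rather than raw strings: with a row like 1,20,70,22, a plain substring test gives a false positive. A minimal illustration, reusing the convert helper defined above:

# substring check: '7' is found inside '1,20,70,22' only because of '70'
print('7' in '1,20,70,22')                        # True  (false positive)

# set check: '7' is not an element of {'1', '20', '70', '22'}
print({'7'}.issubset(convert('1,20,70,22')))      # False (correct)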

Related

Join with column having the max sequence number

I have a margin table
  item  margin
0    a       3
1    b       4
2    c       5
and an item table
  item  sequence
0    a         1
1    a         2
2    a         3
3    b         1
4    b         2
5    c         1
6    c         2
7    c         3
I want to join the two tables so that the margin is only attached to the row with the maximum sequence number for each item. The desired outcome is
  item  sequence  margin
0    a         1     NaN
1    a         2     NaN
2    a         3     3.0
3    b         1     NaN
4    b         2     4.0
5    c         1     NaN
6    c         2     NaN
7    c         3     5.0
How to achieve this?
Below is the code for the margin and item tables:
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
    new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
                 .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temporary column to both frames: in df_item it is True where the sequence is maximal, and in df_margin it is True everywhere. Then merge with how='outer' and drop the temporary column:
new_df = (
    df_item.assign(
        t=df_item
        .groupby('item')['sequence']
        .transform('max')
        .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
Both produce new_df:
  item  sequence  margin
0    a         1     NaN
1    a         2     NaN
2    a         3     3.0
3    b         1     NaN
4    b         2     4.0
5    c         1     NaN
6    c         2     NaN
7    c         3     5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max()
                     .reset_index().merge(df_margin), how='left')
  item  sequence  margin
0    a         1     NaN
1    a         2     NaN
2    a         3     3.0
3    b         1     NaN
4    b         2     4.0
5    c         1     NaN
6    c         2     NaN
7    c         3     5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, how='left')
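For clarity, this is what the intermediate df_new looks like; because the final merge joins on both shared columns (item and sequence), only the max-sequence rows pick up a margin:

print(df_new)
#   item  sequence  margin
# 0    a         3       3
# 1    b         2       4
# 2    c         3       5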

calculate difference between consecutive date records at an ID level

I have a dataframe as
col 1  col 2
A      2020-07-13
A      2020-07-15
A      2020-07-18
A      2020-07-19
B      2020-07-13
B      2020-07-19
C      2020-07-13
C      2020-07-18
I want it to become the following in a new dataframe
col_3  diff_btw_1st_2nd_date  diff_btw_2nd_3rd_date  diff_btw_3rd_4th_date
A                          2                      3                      1
B                          6                    NaN                    NaN
C                          5                    NaN                    NaN
I tried grouping by col 1, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a per-group counter within col 1 and reshape with DataFrame.set_index plus Series.unstack. Then apply DataFrame.diff, drop the first, all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days in every column, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])

df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
  col 1  diff_1  diff_2  diff_3
0     A       2     3.0     1.0
1     B       6     NaN     NaN
2     C       5     NaN     NaN
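If the exact column names requested in the question are needed, a final rename can be appended; the mapping below just restates the names from the desired output and is not part of the original answer:

df = df.rename(columns={'col 1': 'col_3',
                        'diff_1': 'diff_btw_1st_2nd_date',
                        'diff_2': 'diff_btw_2nd_3rd_date',
                        'diff_3': 'diff_btw_3rd_4th_date'})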
Or use DataFrameGroupBy.diff with a counter for the new columns via DataFrame.assign, drop the NaN rows in c1 with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])

df = (df.assign(g=df.groupby('col 1').cumcount(),
                c1=df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot(index='col 1', columns='g', values='c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
  col 1  diff_1  diff_2  diff_3
0     A     2.0     3.0     1.0
1     B     6.0     NaN     NaN
2     C     5.0     NaN     NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference in nan
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
       d2-d1  d3-d2  d4-d3
col 1
A        2.0    3.0    1.0
B        6.0    NaN    NaN
C        5.0    NaN    NaN
df after assigning cumcount looks like this:
print(df)
  col 1      col 2  diff  cumcount
0     A 2020-07-13   NaN         0
1     A 2020-07-15   2.0         1
2     A 2020-07-18   3.0         2
3     A 2020-07-19   1.0         3
4     B 2020-07-13   NaN         0
5     B 2020-07-19   6.0         1
6     C 2020-07-13   NaN         0
7     C 2020-07-18   5.0         1

pandas groupby and widen dataframe with ordered columns

I have a long-form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and so can the number of days between time points:
test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})
   subject_id sample  timepoint  time_order
0           1      A         19           3
1           1      B         11           2
2           1      C          8           1
3           2      D          6           2
4           2      E          2           1
5           3      F         12           1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
   subject_id sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
0           1       C           8       B          11       A          19
1           2       E           2       D           6    null        null
5           3       F          12    null        null    null        null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
                        columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back into a column:
df = (test_df.set_index(['subject_id', 'time_order'])
             .unstack()
             .sort_index(level=[1, 0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
   subject_id sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
0           1       C         8.0       B        11.0       A        19.0
1           2       E         2.0       D         6.0     NaN         NaN
2           3       F        12.0     NaN         NaN     NaN         NaN
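The timepoint columns come out as floats (8.0, 11.0, ...) because the reshape introduces NaN; if nullable integers are preferred, one extra step can be appended (a sketch, not part of the original answer):

# Int64 keeps the missing values while restoring integer timepoints
df = df.convert_dtypes()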
Another option takes the last, second-to-last, and third-to-last row of each group; note that this relies on the rows already being ordered by descending time_order within each subject, as in the example data:
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
           sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
subject_id
1                C           8       B        11.0       A        19.0
2                E           2       D         6.0     NaN         NaN
3                F          12     NaN         NaN     NaN         NaN
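If the input is not guaranteed to arrive pre-ordered, sorting first keeps that nth-based approach correct; a small sketch under that assumption:

# sort so time_order descends within each subject (the last row per group is then time_order 1)
test_df = test_df.sort_values(['subject_id', 'time_order'], ascending=[True, False])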

Create Python function to look for ANY NULL value in a group

I am trying to write a function that will check a specified column for nulls, within a group in a dataframe. The example dataframe has two columns, ID and VALUE. Multiple rows exist per ID. I want to know if ANY of the rows for a particular ID have a NULL value in VALUE.
I have tried building the function with iterrows().
df = pd.DataFrame({'ID': [1, 2, 2, 3, 3, 3],
                   'VALUE': [50, None, 30, 20, 10, None]})

def nullValue(col):
    for i, row in col.iterrows():
        if ['VALUE'] is None:
            return 1
        else:
            return 0

df2 = df.groupby('ID').apply(nullVALUE)
df2.columns = ['ID', 'VALUE', 'isNULL']
df2
I am expecting to retrieve a dataframe with three columns, ID, VALUE, and isNULL. If any row in a grouped ID has a null, all of the rows for that ID should have a 1 under isNull.
Example:
ID  VALUE  isNULL
1   50.0        0
2    NaN        1
2   30.0        1
3   20.0        1
3   10.0        1
3    NaN        1
A quick solution, borrowed partially from this answer, is to use groupby with transform:
df = pd.DataFrame({'ID': [1, 2, 2, 3, 3, 3, 3],
                   'VALUE': [50, None, None, 30, 20, 10, None]})
df['isNULL'] = (df.VALUE.isnull().groupby([df['ID']]).transform('sum') > 0).astype(int)
Out[51]:
   ID  VALUE  isNULL
0   1   50.0       0
1   2    NaN       1
2   2    NaN       1
3   3   30.0       1
4   3   20.0       1
5   3   10.0       1
6   3    NaN       1
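A slightly more direct variant of the same idea, not from the original answer, replaces the sum-and-compare step with transform('any'):

# True if any VALUE in the ID group is null, broadcast back to every row of that group
df['isNULL'] = df['VALUE'].isnull().groupby(df['ID']).transform('any').astype(int)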

How to combine a data frame with another that contains comma separated values?

I am working with two data frames that I created from an Excel file. One data frame contains values that are separated by commas, that is,
   df1          df2
   -----------  ------------
0  LFTEG42      X,Y,Z
1  JOCOROW      1,2
2  TLR_U01      I
3  PR_UDG5      O,M
df1 and df2 are my column names. My intention is to merge the two data frames and generate the following output:
   desired result
   --------------
0  LFTEG42X
1  LFTEG42Y
2  LFTEG42Z
3  JOCOROW1
4  JOCOROW2
5  TLR_U01I
6  .....
n  PR_UDG5M
This is the code that I used but I ended up with the following result:
input_file = pd.ExcelFile(
    'C:\\Users\\devel\\Desktop_12\\Testing\\latest_Calculation'
    + str(datetime.now()).split(' ')[0] + '.xlsx')

# convert the worksheets to dataframes
df1 = pd.read_excel(input_file, index_col=None, na_values=['NA'], parse_cols="H",
                    sheetname="Analysis")
df2 = pd.read_excel(input_file, index_col=None, na_values=['NA'], parse_cols="I",
                    sheetname="Analysis")

data_frames_merged = df1.append(df2, ignore_index=True)
current result
--------------
NaN      XYZ
NaN      1,2
NaN      I
...      ...
PR_UDG5  NaN
Questions
Why did I end up receiving a NaN (not a number) value?
How can I achieve my desired result of merging these two data frames with the comma values?
I'll break it down into steps:
df = pd.concat([df1, df2], axis=1)
df.df2 = df.df2.str.split(',')
df = (df.set_index('df1').df2.apply(pd.Series).stack()
        .reset_index().drop(columns='level_1')
        .rename(columns={0: 'df2'}))
df['New'] = df.df1 + df.df2
df
Out[34]:
       df1 df2       New
0  LFTEG42   X  LFTEG42X
1  LFTEG42   Y  LFTEG42Y
2  LFTEG42   Z  LFTEG42Z
3  JOCOROW   1  JOCOROW1
4  JOCOROW   2  JOCOROW2
5  TLR_U01   I  TLR_U01I
6  PR_UDG5   O  PR_UDG5O
7  PR_UDG5   M  PR_UDG5M
Data Input:
df1
Out[36]:
       df1
0  LFTEG42
1  JOCOROW
2  TLR_U01
3  PR_UDG5
df2
Out[37]:
     df2
0  X,Y,Z
1    1,2
2      I
3    O,M
Dirty one-liner
new_df = pd.concat([df1['df1'], df2['df2'].str.split(',', expand=True).stack()
                    .reset_index(1, drop=True)], axis=1).sum(1)
0 LFTEG42X
0 LFTEG42Y
0 LFTEG42Z
1 JOCOROW1
1 JOCOROW2
2 TLR_U01I
3 PR_UDG5O
3 PR_UDG5M
Also, similar to @vaishali's approach, except using melt:
df = (pd.concat([df1, df2['df2'].str.split(',', expand=True)], axis=1)
        .melt(id_vars='df1')
        .dropna()
        .drop('variable', axis=1)
        .sum(axis=1))
0 LFTEG42X
1 JOCOROW1
2 TLR_U01I
3 PR_UDG5O
4 LFTEG42Y
5 JOCOROW2
7 PR_UDG5M
8 LFTEG42Z
Setup
df1 = pd.DataFrame(dict(A='LFTEG42 JOCOROW TLR_U01 PR_UDG5'.split()))
df2 = pd.DataFrame(dict(A='X,Y,Z 1,2 I O,M'.split()))
Getting creative
df1.A.repeat(df2.A.str.count(',') + 1) + ','.join(df2.A).split(',')
0 LFTEG42X
0 LFTEG42Y
0 LFTEG42Z
1 JOCOROW1
1 JOCOROW2
2 TLR_U01I
3 PR_UDG5O
3 PR_UDG5M
dtype: object
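On pandas 0.25 or newer, DataFrame.explode gives a simpler route to the same result; a sketch, assuming the single-column df1/df2 frames shown under Data Input above (not the Setup frames that use column A):

combined = pd.concat([df1['df1'], df2['df2'].str.split(',')], axis=1)
result = (combined.explode('df2')                      # one row per comma-separated value
                  .assign(New=lambda d: d['df1'] + d['df2'])['New']
                  .reset_index(drop=True))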
