Identify the latest series of Continuous same value in Python Pandas DataFrame

I have the following DataFrame (Date in dd-mm-yyyy format):
import pandas as pd
data={'Id':['A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'C', 'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Date':['20-10-2022', '20-10-2022', '20-10-2022', '21-10-2022', '21-10-2022', '21-10-2022',
'22-10-2022', '22-10-2022', '23-10-2022', '23-10-2022', '24-10-2022', '24-10-2022',
'25-10-2022', '25-10-2022', '26-10-2022', '26-10-2022', '26-10-2022', '27-10-2022',
'27-10-2022', '27-10-2022']}
df=pd.DataFrame.from_dict(data)
df
Id Date
0 A 20-10-2022
1 B 20-10-2022
2 C 20-10-2022
3 A 21-10-2022
4 B 21-10-2022
5 C 21-10-2022
6 B 22-10-2022
7 C 22-10-2022
8 A 23-10-2022
9 C 23-10-2022
10 B 24-10-2022
11 C 24-10-2022
12 B 25-10-2022
13 C 25-10-2022
14 A 26-10-2022
15 B 26-10-2022
16 C 26-10-2022
17 A 27-10-2022
18 B 27-10-2022
19 C 27-10-2022
This is the Final DataFrame that I want:
I have tried the following code:
# Find the first and last occurrence of each Id.
df_first_duplicate = df.drop_duplicates(subset=['Id'], keep='first')
df_first_duplicate.rename(columns = {'Date':'DateOfFirstOccurance'}, inplace = True)
df_first_duplicate.reset_index(inplace = True, drop = True)
df_last_duplicate = df.drop_duplicates(subset=['Id'], keep='last')
df_last_duplicate.rename(columns = {'Date':'DateOfLastOccurance'}, inplace = True)
df_last_duplicate.reset_index(inplace = True, drop = True)
# Merge the above two df's on key
df_merged = pd.merge(df_first_duplicate, df_last_duplicate, on='Id')
df_merged
But this is the output that I get:
Id DateOfFirstOccurance DateOfLastOccurance
0 A 20-10-2022 27-10-2022
1 B 20-10-2022 27-10-2022
2 C 20-10-2022 27-10-2022
What should I do to get the desired output?

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
records = []
for key, group in df.groupby(by='Id'):
    filt = group['Date'].diff(-1).dt.days >= -1
    filt.iloc[filt.shape[0]-1] = True
    max_false_index = filt[~filt].index.max()
    min_date = group['Date'].min() if type(max_false_index) == float else group.loc[max_false_index+1:, 'Date'].min()
    records.append([key, min_date, group['Date'].max()])
pd.DataFrame(records, columns=['Id', 'DateOfFirstOccurance', 'DateOfLastOccurance'])

Here is one way to do it.
Sort your data by Id and Date. Within each Id, pandas.Series.diff gives the gap between each row and the previous one, and dt.days turns that gap into a number of days. Comparing the gap with 1 produces a boolean Series: a gap greater than 1 marks the start of a new run, a gap equal to 1 marks a continuation. Convert True/False to 1/0 with astype(int) and build the cumulative sum. idxmax then returns the index label where the largest cumulative value is first reached, which for this data is the first or last date of the latest run.
# An explicit format avoids ambiguity between dd-mm and mm-dd
# (infer_datetime_format is deprecated in recent pandas versions).
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values(['Id', 'Date'])
out = (
    df
    .groupby('Id')['Date']
    .agg(
        first_occurence = lambda x: x[
            (x.diff().dt.days>1)
            .astype(int)
            .cumsum()
            .idxmax()
        ],
        last_occurence = lambda x: x[
            (x.diff().dt.days==1)
            .astype(int)
            .cumsum()
            .idxmax()
        ],
    )
)
print(out)
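As a hedged alternative to the two answers above, the latest run can also be found by labelling runs and keeping the last one per Id. This is only a sketch, not the method of either answer: the helper column `run` and the variable names are assumptions introduced here, and the output column names are copied from the question.

```python
import pandas as pd

data = {
    'Id': ['A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'C',
           'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Date': ['20-10-2022', '20-10-2022', '20-10-2022', '21-10-2022', '21-10-2022',
             '21-10-2022', '22-10-2022', '22-10-2022', '23-10-2022', '23-10-2022',
             '24-10-2022', '24-10-2022', '25-10-2022', '25-10-2022', '26-10-2022',
             '26-10-2022', '26-10-2022', '27-10-2022', '27-10-2022', '27-10-2022'],
}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values(['Id', 'Date'])

# A new run starts wherever the gap to the previous date of the same Id
# is not exactly one day (the first row of each Id has a NaT gap).
gap = df.groupby('Id')['Date'].diff()
starts_run = gap.ne(pd.Timedelta(days=1)).astype(int)

# 'run' is a hypothetical helper column: a per-Id running count of run starts.
df['run'] = starts_run.groupby(df['Id']).cumsum()

# Keep only the rows of each Id's last run, then take its first and last date.
latest = df[df['run'] == df.groupby('Id')['run'].transform('max')]
result = (
    latest.groupby('Id')['Date']
    .agg(DateOfFirstOccurance='min', DateOfLastOccurance='max')
    .reset_index()
)
print(result)
```

With this approach an Id whose most recent date stands alone gets that single date as both the first and the last occurrence.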

Related

Python Passing Dynamic Table Name in For Loop

table_name = []
counter=0
for year in ['2017', '2018', '2019']:
    table_name.append(f'temp_df_{year}')
    print(table_name[counter])
    table_name[counter] = pd.merge(table1, table2.loc[table2.loc[:, 'year'] == year, :], left_on='col1', right_on='col1', how='left')
    counter += 1
temp_df_2017
The print statement outputs are correct:
temp_df_2017,
temp_df_2018,
temp_df_2019
However, when I try to see what's in temp_df_2017, I get an error: name 'temp_df_2017' is not defined
I would like to create those three tables. How can I make this work?
PS: ['2017', '2018', '2019'] list will vary. It can be a list of quarters. That's why I want to do this in a loop, instead of using the merge statement 3x.
I think the easiest and most practical approach is to store the DataFrames in a dictionary keyed by name.
import pandas as pd
import numpy as np
# Create dummy data
data = np.arange(9).reshape(3,3)
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df_year_names = ['2017', '2018', '2019']
dict_of_dfs = {}
for year in df_year_names:
    df_name = f'some_name_year_{year}'
    dict_of_dfs[df_name] = df
dict_of_dfs.keys()
Out:
dict_keys(['some_name_year_2017', 'some_name_year_2018', 'some_name_year_2019'])
Then to access a particular year:
dict_of_dfs['some_name_year_2018']
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
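Applied to the merge from the question, the dictionary approach might look like the sketch below. The contents of table1 and table2 are invented here purely for illustration (the question does not show them), and `merged_by_year` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-ins for the question's table1 and table2.
table1 = pd.DataFrame({'col1': [1, 2, 3]})
table2 = pd.DataFrame({
    'col1': [1, 2, 3, 1],
    'year': ['2017', '2018', '2019', '2019'],
    'value': [10, 20, 30, 40],
})

years = ['2017', '2018', '2019']  # could equally be a list of quarters

# One merged DataFrame per year, stored under the year as the key.
merged_by_year = {
    year: table1.merge(table2.loc[table2['year'] == year], on='col1', how='left')
    for year in years
}

print(merged_by_year['2017'])
```

Because the keys come straight from the list, the loop keeps working when the list of years or quarters changes.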

Matching subset of two columns of two different dataframes

I want to compare a specific column of two different dataframes and count how many rows have a matching subset of genes and how many do not.
Condition:
If any element of small['genes of cluster'] also appears in big['genes of cluster'], the output should be: match: 1.
In the example below, only OR4F16 appears in both dataframes.
So the output is: match: 1; unmatch: 3.
file1: big <tab separated>
cl nP genes of cluster
1 11 DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D
2 4 OR4F16, OR4F3, OR4F29, LOC100132287
3 64 LOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, LINC02593, SAMD11
4 7 GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ
file2: small <tab separated>
cl nP genes of cluster
1 11 A, B, C, D
2 4 OR4F16, X, Y, Z
My Code: Python3
def genes_coordinates(big, small):
    b = pd.read_csv(big, header=0, sep="\t")
    s = pd.read_csv(small, header=0, sep="\t")
    match = 0
    unmatch = 0
    for index, row in b.iterrows():
        if row[row['genes of cluster'].isin(s['genes of cluster'])]:
            match+1
        else:
            unmatch+1
    print("match: ", match, "\nunmatch: ", unmatch)
genes_coordinates('big','small')
I would go with a pandas.merge() followed by counting by list comprehension.
import pandas as pd
df1 = pd.DataFrame({'cl':[1,2], 'nP':[11,4], 'gene of cluster':[['A', 'B', 'C', 'D'], ['OR4F16', 'X', 'Y', 'Z']]})
df2 = pd.DataFrame({'cl':[1,2,3,4], 'nP':[11,4,64,7], 'gene of cluster':[['DDX11L1', 'MIR6859-3', 'WASH7P', 'MIR1302-2', 'FAM138C', 'FAM138F', 'FAM138A', 'OR4F5', 'LOC729737', 'LOC102725121', 'FAM138D'], ['OR4F16', 'OR4F3', 'OR4F29', 'LOC100132287'], ['LOC100133331', 'LOC100288069', 'FAM87B', 'LINC00115', 'LINC01128', 'FAM41C', 'LINC02593', 'SAMD11'], ['GNB1', 'CALML6', 'TMEM52', 'CFAP74', 'GABRD', 'LOC105378591', 'PRKCZ']]})
df_m = df1.merge(df2, on=['cl', 'nP'], how='outer')
>>>df_m
cl nP gene of cluster_x gene of cluster_y
0 1 11 [A, B, C, D] [DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138...
1 2 4 [OR4F16, X, Y, Z] [OR4F16, OR4F3, OR4F29, LOC100132287]
2 3 64 NaN [LOC100133331, LOC100288069, FAM87B, LINC00115...
3 4 7 NaN [GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC10537...
# An np.nan value is an outright 'unmatch'
found = []
for x in df_m.index:
    if isinstance(df_m.iloc[x]['gene of cluster_x'], float):
        found.append(0)
    else:
        if isinstance(df_m.iloc[x]['gene of cluster_y'], float):
            found.append(0)
        elif any([y in df_m.iloc[x]['gene of cluster_y'] for y in df_m.iloc[x]['gene of cluster_x']]):
            found.append(1)
        else:
            found.append(0)
# The counts
match = sum(found)
unmatch = len(found) - match
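A set-based variant is sketched below for the case where the gene lists are still comma-separated strings, as in the tab-separated files. It rests on an assumption that differs from the merge above: each row of big is checked against every gene in small, rather than only against the row with the same cl and nP. The helper `split_genes` is a hypothetical name.

```python
import pandas as pd

big = pd.DataFrame({
    'cl': [1, 2, 3, 4],
    'nP': [11, 4, 64, 7],
    'genes of cluster': [
        'DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, '
        'OR4F5, LOC729737, LOC102725121, FAM138D',
        'OR4F16, OR4F3, OR4F29, LOC100132287',
        'LOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, '
        'LINC02593, SAMD11',
        'GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ',
    ],
})
small = pd.DataFrame({
    'cl': [1, 2],
    'nP': [11, 4],
    'genes of cluster': ['A, B, C, D', 'OR4F16, X, Y, Z'],
})


def split_genes(cell):
    """Turn 'G1, G2, G3' into the set {'G1', 'G2', 'G3'}."""
    return {gene.strip() for gene in cell.split(',')}


# Every gene that occurs anywhere in the small table.
small_genes = set().union(*small['genes of cluster'].map(split_genes))

# True for each row of big that shares at least one gene with small.
has_match = big['genes of cluster'].map(
    lambda cell: not split_genes(cell).isdisjoint(small_genes)
).astype(bool)

match = int(has_match.sum())
unmatch = int((~has_match).sum())
print("match:", match, "\nunmatch:", unmatch)
```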

Pandas - Fastest way indexing with 2 dataframes

I am developing software in Python 3 with the Pandas library.
Run time is very important; memory use much less so.
To keep the example readable I use two small dataframes named a and b, although the real ones are much larger:
a -> 50000 rows
b -> 5000 rows
I need to select values from dataframes a and b using multiple conditions:
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] ,
'a2': [1, 2, 3],
'a3': [3.14, 2.73, -23.00],
'a4': [float('nan'), float('nan'), float('nan')]
})
a
a1 a2 a3 a4
0 x 1 3.14 NaN
1 y 2 2.73 NaN
2 z 3 -23.00 NaN
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'],
'b2': [2018, 2019, 2020, 2015, 2012]
})
b
b1 b2
0 x 2018
1 y 2019
2 z 2020
3 k 2015
4 l 2012
So far my code is like this:
for index, row in a.iterrows():
    try:
        # create a key
        a1 = row["a1"]
        mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
        # check if exists
        if (len(mask.index) != 0): #not empty
            a.loc[[index], ['a4']] = mask.iloc[0]['b2']
    except KeyError: #not found
        pass
But as you can see, I'm using iterrows, which is very slow compared to other methods, and I'm modifying the DataFrame while iterating over it, which is not recommended.
Could you help me find a better way? The result should look like this:
a
a1 a2 a3 a4
0 x 1 3.14 2018
1 y 2 2.73 NaN
2 z 3 -23.00 2020
I tried things like the line below, but I couldn't make it work.
a.loc[ (a['a1'] == b['b1']) , 'a4'] = b.loc[b['b2'] != 2019]
*the real code has more conditions
Thanks!
EDIT
I benchmarked three approaches: iterrows, merge, and set_index/loc. Here is the code:
import timeit
import numpy as np
import pandas as pd
def f_iterrows():
    for index, row in a.iterrows():
        try:
            # create a key
            a1 = row["a1"]
            a3 = row["a3"]
            mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
            # check if exists
            if len(mask.index) != 0: # not empty
                a.loc[[index], ['a4']] = mask.iloc[0]['b2']
        except: # not found
            pass
def f_merge():
    a.merge(b[b.b2 != 2019], left_on='a1', right_on='b1', how='left').drop(columns=['a4', 'b1']).rename(columns={'b2': 'a4'})
def f_lock():
    df1 = a.set_index('a1')
    df2 = b.set_index('b1')
    df1.loc[:, 'a4'] = df2.b2[df2.b2 != 2019]
#variables for testing
number_rows = 100
number_iter = 100
a = pd.DataFrame({
    'a1': ['x', 'y', 'z'] * number_rows,
    'a2': [1, 2, 3] * number_rows,
    'a3': [3.14, 2.73, -23.00] * number_rows,
    'a4': [np.nan, np.nan, np.nan] * number_rows
})
b = pd.DataFrame({
    'b1': ['x', 'y', 'z', 'k', 'l'] * number_rows,
    'b2': [2018, 2019, 2020, 2015, 2012] * number_rows
})
print('For: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
print('Merge: %s s' % str(timeit.timeit(f_merge, number=number_iter)))
# Note: this line times f_iterrows again rather than f_lock,
# so the "Loc" figure repeats the iterrows measurement.
print('Loc: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
They all worked :) and the time to run is:
For: 277.9994369489998 s
Loc: 274.04929955067564 s
Merge: 2.195712725706926 s
So far Merge is the fastest.
If another option appears I will update here, thanks again.
IIUC
a.merge(b[b.b2!=2019],left_on='a1',right_on='b1',how='left').drop(columns=['a4','b1']).rename(columns={'b2':'a4'})
Out[263]:
a1 a2 a3 a4
0 x 1 3.14 2018.0
1 y 2 2.73 NaN
2 z 3 -23.00 2020.0
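Another option that may be worth timing is a lookup with Series.map. This is a minimal sketch under stated assumptions: only the single b2 != 2019 condition from the example is applied, `lookup` is a hypothetical name, and drop_duplicates keeps the first match per key to mirror the `mask.iloc[0]` in the question's loop.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'a1': ['x', 'y', 'z'],
    'a2': [1, 2, 3],
    'a3': [3.14, 2.73, -23.00],
    'a4': [np.nan, np.nan, np.nan],
})
b = pd.DataFrame({
    'b1': ['x', 'y', 'z', 'k', 'l'],
    'b2': [2018, 2019, 2020, 2015, 2012],
})

# Build a b1 -> b2 lookup from the rows of b that pass the condition,
# keeping only the first row for each key.
lookup = (
    b.loc[b['b2'] != 2019]
    .drop_duplicates(subset='b1')
    .set_index('b1')['b2']
)

# Keys of a1 that are missing from the lookup become NaN.
a['a4'] = a['a1'].map(lookup)
print(a)
```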

Multi-index pandas dataframes: find an index related to the number of unique values a column has

# import Pandas library
import pandas as pd
idx = pd.MultiIndex.from_product([['A001', 'B001','C001'],
['0', '1', '2']],
names=['ID', 'Entries'])
col = ['A', 'B']
df = pd.DataFrame('-', idx, col)
df.loc['A001', 'A'] = [10,10,10]
df.loc['A001', 'B'] = [90,84,70]
df.loc['B001', 'A'] = [10,20,10]
df.loc['B001', 'B'] = [70,86,67]
df.loc['C001', 'A'] = [20,20,20]
df.loc['C001', 'B'] = [98,81,72]
#df is a dataframe
df
The problem is as follows: how can I return the IDs that have more than one unique value in column 'A'? For the dataset above, it should return B001.
I would appreciate any help with performing operations on multi-index pandas dataframes.
Use GroupBy.transform with nunique and filter by boolean indexing; to get the values of the first level of the MultiIndex, add get_level_values with unique:
a = df[df.groupby(level=0)['A'].transform('nunique') > 1].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
Or use duplicated, which first needs the MultiIndex levels as columns via reset_index:
m = df.reset_index().duplicated(subset=['ID','A'], keep=False).values
a = df[~m].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
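The same question can also be answered by counting unique values per ID directly. This is a small sketch that builds the question's data with numeric columns from the start; the names `unique_counts` and `ids` are introduced here for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['A001', 'B001', 'C001'], ['0', '1', '2']],
    names=['ID', 'Entries'],
)
df = pd.DataFrame(
    {
        'A': [10, 10, 10, 10, 20, 10, 20, 20, 20],
        'B': [90, 84, 70, 70, 86, 67, 98, 81, 72],
    },
    index=idx,
)

# Number of distinct values of column 'A' within each ID.
unique_counts = df.groupby(level='ID')['A'].nunique()

# IDs whose column 'A' holds more than one distinct value.
ids = unique_counts[unique_counts > 1].index.tolist()
print(ids)
```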

Getting all str type elements in a pd.DataFrame

From my limited knowledge of pandas, pandas.Series.str.contains can search for a specific string in a pd.Series. But what if the dataframe is large and I just want to glance at every kind of str element in it before I do anything?
For example:
pd.DataFrame({'x1':[1,2,3,'+'],'x2':[2,'a','c','this is']})
x1 x2
0 1 2
1 2 a
2 3 c
3 + this is
I need a function to return ['+','a','c','this is']
If you are looking strictly at what are string values and performance is not a concern, then this is a very simple answer.
df.where(df.applymap(type).eq(str)).stack().tolist()
['a', 'c', '+', 'this is']
There are two possible interpretations, depending on whether numeric values saved as strings should count as strings.
Compare the difference:
df = pd.DataFrame({'x1':[1,'2.78','3','+'],'x2':[2.8,'a','c','this is'], 'x3':[1,4,5,4]})
print (df)
x1 x2 x3
0 1 2.8 1
1 2.78 a 4 <-2.78 is float saved as string
2 3 c 5 <-3 is int saved as string
3 + this is 4
import numpy as np
# flatten all values
ar = df.values.ravel()
# errors='coerce' makes pd.to_numeric return NaN for non-numeric values
L = np.unique(ar[np.isnan(pd.to_numeric(ar, errors='coerce'))]).tolist()
print (L)
['+', 'a', 'c', 'this is']
Another solution is to use a custom function that checks whether a value can be converted to float:
def is_not_float_try(value):
    try:
        float(value)
        return False
    except ValueError:
        return True
s = df.stack()
L = s[s.apply(is_not_float_try)].unique().tolist()
print (L)
['a', 'c', '+', 'this is']
If you need all values saved as strings, use isinstance:
s = df.stack()
L = s[s.apply(lambda x: isinstance(x, str))].unique().tolist()
print (L)
['2.78', 'a', '3', 'c', '+', 'this is']
You can use str.isdigit with unstack:
df[df.apply(lambda x : x.str.isdigit()).eq(0)].unstack().dropna().tolist()
Out[242]: ['+', 'a', 'c', 'this is']
Using regular expressions and set union, you could try something like:
>>> set.union(*[set(df[c][~df[c].str.findall(r'[^\d]+').isnull()].unique()) for c in df.columns])
{'+', 'a', 'c', 'this is'}
If you use a regular expression for a number in general, you could omit floating point numbers as well.
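One way to read that last remark is sketched below: match each string against a pattern for a general number and keep only the strings that do not match. The pattern `NUMBER` is an assumption about what should count as a number (optional sign, decimals, optional exponent), not something taken from the answers above.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'x1': [1, '2.78', '3', '+'],
    'x2': [2.8, 'a', 'c', 'this is'],
    'x3': [1, 4, 5, 4],
})

# Assumed definition of "a number in general": integers, decimals, exponents.
NUMBER = re.compile(r'[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?')

# Flatten the frame and keep str values that are not numeric text.
values = df.to_numpy().ravel()
non_numeric = sorted(
    {v for v in values if isinstance(v, str) and not NUMBER.fullmatch(v)}
)
print(non_numeric)
```

Strings such as '2.78' and '3' are filtered out because the whole string matches the pattern, while '+' is kept because a sign alone is not a number.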
