Get unique values of a column in between a timeperiod in pandas after groupby - python-3.x

I have a requirement where I need to find all the unique values of merchant_store_id for a user on the same stamp card within a specific time period. I grouped by stamp_card_id and user_id to get a data frame based on that condition. Now I need to find the unique merchant_store_id values of this dataframe in an interval of 10 mins from each entry.
My approach is to loop over that grouped dataframe, find all the indexes in each group, and then build a new dataframe from the time of each index to index + 60 mins and find the unique merchant_store_id values in it. If the number of unique merchant_store_id values is > 1, I append that slice to a final dataframe. The problem with this approach is that it works fine for small data, but for data of around 20,000 rows it raises a memory error on Linux and keeps running indefinitely on Windows. Below is my code:
fi_df = pd.DataFrame()
for i in df.groupby(["stamp_card_id", "merchant_id", "user_id"]):
    user_df = i[1]
    if len(user_df) > 1:
        # get list of unique indexes in that groupby df
        index = user_df.index.values
        for ind in index:
            # slice a one-hour window starting at this entry's timestamp
            fdf = user_df[ind:ind + np.timedelta64(1, 'h')]
            if len(fdf.merchant_store_id.unique()) > 1:
                fi_df = fi_df.append(fdf)
fi_df.drop_duplicates(keep="first").to_csv(csv_export_path)
Sample Data after group by is:
((117, 209, 'oZOfOgAgnO'), stamp_card_id stamp_time stamps_record_id user_id \
0 117 2018-10-14 16:48:03 1756 oZOfOgAgnO
1 117 2018-10-14 16:54:03 1759 oZOfOgAgnO
2 117 2018-10-14 16:58:03 1760 oZOfOgAgnO
3 117 2018-10-14 17:48:03 1763 oZOfOgAgnO
4 117 2018-10-14 18:48:03 1765 oZOfOgAgnO
5 117 2018-10-14 19:48:03 1767 oZOfOgAgnO
6 117 2018-10-14 20:48:03 1769 oZOfOgAgnO
7 117 2018-10-14 21:48:03 1771 oZOfOgAgnO
8 117 2018-10-15 22:48:03 1773 oZOfOgAgnO
9 117 2018-10-15 23:08:03 1774 oZOfOgAgnO
10 117 2018-10-15 23:34:03 1777 oZOfOgAgnO
merchant_id merchant_store_id
0 209 662
1 209 662
2 209 662
3 209 662
4 209 662
5 209 662
6 209 663
7 209 664
8 209 662
9 209 664
10 209 663 )
I have also tried the resampling method, but then the data is bucketed into fixed clock intervals, so the case where a user hits multiple merchant_store_id values near the end of an hour is missed.
Any help would be appreciated. Thanks.

If those are datetimes you can filter with the following:
filtered_set = set(df[df["stamp_time"]>=x][df["stamp_time"]<=y]["col of interest"])
df[df["stamp_time"]>=x] filters the df
adding [df["stamp_time"]<=y] filters the filtered df
["merchant_store_id"] captures just the specified column (series)
and finally set() returns the unique list (set)
Specific to your code:
x = datetime(lowerbound) #pseudo-code
y = datetime(upperbound) #pseudo-code
filtered_set = set(fi_df[fi_df["stamp_time"]>=x][fi_df["stamp_time"]<=y]["col of interest"])
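As a side note, the two chained filters can be combined into a single boolean mask (or Series.between), which avoids indexing the already-filtered frame with a mask built from the original one. A minimal sketch, assuming stamp_time is a datetime column and using made-up bounds:
import pandas as pd

# hypothetical window bounds; substitute your own timestamps
x = pd.Timestamp("2018-10-14 16:48:03")
y = x + pd.Timedelta(hours=1)

# make sure the column really is datetime, then filter with one combined mask
fi_df["stamp_time"] = pd.to_datetime(fi_df["stamp_time"])
mask = (fi_df["stamp_time"] >= x) & (fi_df["stamp_time"] <= y)
filtered_set = set(fi_df.loc[mask, "merchant_store_id"])

# equivalent, using Series.between
filtered_set = set(fi_df.loc[fi_df["stamp_time"].between(x, y), "merchant_store_id"])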

Related

test/train splits in pycaret using a column for grouping rows that should be in the same split

My dataset contains a column I need to use for splitting the rows into groups, in such a way that rows belonging to the same group are not divided between train and test but sent as a whole to one of the splits, using PyCaret.
10 row sample for clarification:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
2 23 34 233
2 623 22 888
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
every unique group_id should be sent, in full, to one of the splits, like this (using an 80/20 split):
TRAIN SET:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
TEST SET:
group_id measure1 measure2 measure3
2 23 34 233
2 623 22 888
You can try the following per the documentation
https://pycaret.readthedocs.io/en/latest/api/classification.html
fold_strategy = "groupkfold"
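A minimal sketch of how that could be passed to setup(), assuming a PyCaret version whose setup() accepts fold_strategy and fold_groups, and a hypothetical target column named "target" (the group column name group_id comes from the sample above):
from pycaret.classification import setup

clf = setup(
    data=df,                      # DataFrame containing group_id, the measures and the target
    target="target",              # hypothetical target column name
    fold_strategy="groupkfold",   # group-aware cross-validation folds
    fold_groups="group_id",       # column whose values define the groups
)

Note that this controls how the cross-validation folds are built; whether the initial train/test split is also group-aware depends on the PyCaret version.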
One solution could look like this:
import numpy as np
import pandas as pd
from itertools import combinations

def is_possible_sum(numbers, n):
    for r in range(len(numbers)):
        for combo in combinations(numbers, r + 1):
            if sum(combo) == n:
                return combo
    print('Desired split not possible')
    raise ArithmeticError

def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    occurrences = table[col_identifier].value_counts().to_dict()
    required = sum(occurrences.values()) * train_fraction
    lengths = is_possible_sum(occurrences.values(), required)
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]  # prevents the same ID from being selected twice
                break
    train = table[table[col_identifier].isin(train_ids)]
    test = table[~table[col_identifier].isin(train_ids)]
    return train, test

if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')
Some remarks:
This is probably the least elegant way to do it...
It uses an ungodly amount of for loops and is probably slow for larger dataframes. It also doesn't randomize the split.
Much of this is because the mapping from group_id to the number of samples with that group_id can't simply be inverted, since several groups may share the same count. You could probably do this with numpy arrays as well, but I doubt the overall structure would be much different.
First function taken from here: How to check if a sum is possible in array?
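As an alternative outside PyCaret, scikit-learn's GroupShuffleSplit gives a randomized, group-aware split. A sketch, assuming the data sits in a DataFrame df with a group_id column; note that train_size here refers to the fraction of groups, not of rows:
from sklearn.model_selection import GroupShuffleSplit

# one randomized split with roughly 80% of the groups going to train
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["group_id"]))

train = df.iloc[train_idx]
test = df.iloc[test_idx]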

Count the number of labels on IOB corpus with Pandas

From my IOB corpus such as:
mention Tag
170
171 467 O
172
173 Vincennes B-LOCATION
174 . O
175
176 Confirmation O
177 des O
178 privilèges O
179 de O
180 la O
181 ville B-ORGANISATION
182 de I-ORGANISATION
183 Tournai I-ORGANISATION
184 1 O
185 ( O
186 cf O
187 . O
188 infra O
189 , O
I am trying to compute simple statistics such as the total number of annotated mentions, totals by label, etc.
After loading my dataset with pandas I got this:
df = pd.Series(data['Tag'].value_counts(), name="Total").to_frame().reset_index()
df.columns = ['Label', 'Total']
df
Output :
Label Total
0 O 438528
1 36235
2 B-LOCATION 378
3 I-LOCATION 259
4 I-PERSON 234
5 I-INSTALLATION 156
6 I-ORGANISATION 150
7 B-PERSON 144
8 B-TITLE 94
9 I-TITLE 89
10 B-ORGANISATION 68
11 B-INSTALLATION 62
12 I-EVENT 8
13 B-EVENT 2
First of all, how could I get a representation similar to the one above, but grouped across the IOB prefixes, for example:
Label, Total
PERSON, 300
LOCATION, 154
ORGANISATION, 67
etc.
and secondly, how can I exclude the "O" and empty-string labels from my output? I tested .mask() and .where() on my Series but it fails.
Thank you for your leads.
remove B-, I- parts, groupby, sum
df['Label'] = df['Label'].str[2:]
df.groupby(['Label']).sum()
For the second part, just keep the rows in which the length of the Label string is greater than 2:
df.loc[df['Label'].str.len() > 2]
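Putting both parts together, a minimal sketch assuming df is the frame with the Label and Total columns shown above:
# drop the "O" and empty labels, strip the B-/I- prefix, then aggregate per entity type
filtered = df.loc[df['Label'].str.len() > 2].copy()
filtered['Label'] = filtered['Label'].str[2:]
totals = filtered.groupby('Label')['Total'].sum().sort_values(ascending=False)
print(totals)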

How to iterate grouped objects in Pandas

I have DataFrame called df_grouped grouped by the "Chr" column as shown below.
Index Trait Chr p_adj ind
422 94 C10.1 1 21.660747 0
470 140 C10.1 1 10.859806 1
471 141 C10.1 1 24.434861 2
472 142 C10.1 1 10.962972 3
473 143 C10.1 1 32.396856 4
... ... ... ... ... ...
1710 15 Pro 22 47.523313 5458
1711 16 Pro 22 48.683401 5459
1713 18 Pro 22 49.804377 5460
1715 20 Pro 22 7.311224 5461
1704 9 Pro 22 15.566230 5462
Now I want to loop through the grouped object and return a DataFrame called group that contains all the data from Chr 1 to Chr 22. Unfortunately, group only ends up holding Chr 22. How can I solve this problem?
x_labels = []
x_labels_pos = []
groupin = []
for num, (name, group) in enumerate(df_grouped):
    # fig = px.scatter(group, x='ind', y='p_adj', color="Chr")
    x_labels.append(name)
    x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0]) / 2))
    groupin.append(group)
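After the loop, the variable group only holds the last group (Chr 22), which is why only that chromosome shows up. Since every group is already collected in groupin, one way to get a single DataFrame covering Chr 1 to Chr 22 is to concatenate them, as in this sketch (assuming df_grouped came from df.groupby('Chr')):
import pandas as pd

# stitch the per-chromosome groups back into one DataFrame
all_chr = pd.concat(groupin)
print(all_chr['Chr'].unique())  # should list chromosomes 1 through 22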

filtering and transposing the dataframe in python3

I made a csv file using pandas and I am trying to use it as input for the next step. When I open the file with pandas it looks like this example:
Unnamed: 0 Class_Name Probe_Name small_example1.csv small_example2.csv small_example3.csv
0 0 Endogenous CCNO 196 32 18
1 1 Endogenous MYC 962 974 1114
2 2 Endogenous CD79A 390 115 178
3 3 Endogenous FSTL3 67 101 529
4 4 Endogenous VCAN 943 735 9226
I want to make a plot, to do so, I have to change the data structure.
1- I want to remove the Unnamed column
2- then I want to make a data frame for a heatmap. To do so I want to use the columns "Probe_Name", "small_example1.csv", "small_example2.csv" and "small_example3.csv"
3- I also want to transpose the data frame.
here is the expected output:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
I tried to do that using the following code:
df = pd.read_csv('myfile.csv')
result = df.transpose()
but it does not return what I want. Do you know how to fix it?
df.drop(['Unnamed: 0','Class_Name'],axis=1).set_index('Probe_Name').T
Result:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
Here's a suggestion:
Changes 1 & 2 can be tackled in one go:
df = df.loc[:, ["Probe_Name", "small_example1.csv", "small_example2.csv", "small_example3.csv"]] # This only retains the specified columns
In order for change 3 (transposing) to work as desired, the column Probe_Name needs to be set as your index:
df = df.set_index("Probe_Name", drop=True)
df = df.transpose()
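As a side note, the Unnamed: 0 column is just the old index that was written out with the CSV; it can be avoided at read time, as in this sketch (assuming the file shown above is myfile.csv):
import pandas as pd

# treat the first CSV column as the index instead of a data column
df = pd.read_csv('myfile.csv', index_col=0)
result = df.drop(columns='Class_Name').set_index('Probe_Name').T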

pandas df merge avoid duplicate column names

The question is: when merging two dfs that both have a column called A, the result is a df with A_x and A_y. I am wondering how to keep A from one df and discard the other, so that I don't have to rename A_x back to A after the merge.
Just filter your dataframe columns before merging.
df1 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(0,100,12),'C':list('ABCD')*3})
df2 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(100,1000,12),'C':list('ABCD')*3})
df1.merge(df2[['Key','A']], on='Key')
Output: (Note: C is not duplicated)
A_x C Key A_y
0 60 A 0 440
1 65 B 1 731
2 76 C 2 596
3 67 D 3 580
4 44 A 4 477
5 51 B 5 524
6 7 C 6 572
7 88 D 7 984
8 70 A 8 862
9 13 B 9 158
10 28 C 10 593
11 63 D 11 177
It depends on whether you need the columns with duplicated names appended to the final merged DataFrame.
If you do, add the suffixes parameter to merge:
print (df1.merge(df2, on='Key', suffixes=('', '_')))
If not, use @Scott Boston's solution above.
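A small sketch of the "keep A from one df, discard the other" case, reusing df1 and df2 from the first answer: drop the overlapping columns from the side you don't want before merging, so the surviving columns keep their plain names:
# keep A and C from df1; drop them from df2 so no _x/_y suffixes appear
merged = df1.merge(df2.drop(columns=['A', 'C']), on='Key')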
