test/train splits in pycaret using a column for grouping rows that should be in the same split - scikit-learn

My dataset contains a column I need to use for splitting by groups: rows belonging to the same group should not be divided between train and test, but sent as a whole to one of the splits, using PyCaret.
10 row sample for clarification:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
2 23 34 233
2 623 22 888
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
Every unique group_id should be sent in full to one of the splits, like this (using an 80/20 split):
TRAIN SET:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
TEST SET:
group_id measure1 measure2 measure3
2 23 34 233
2 623 22 888

You can try the following per the documentation
https://pycaret.readthedocs.io/en/latest/api/classification.html
fold_strategy = "groupkfold"
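A minimal sketch of how that might look in setup (the target column name "target" is a placeholder; fold_strategy and fold_groups control the cross-validation folds, not the initial train/test holdout):

from pycaret.classification import setup, compare_models

# sketch only: 'target' is a placeholder label column, df is the frame from the question
s = setup(
    data=df,
    target="target",
    fold_strategy="groupkfold",  # group-aware CV folds
    fold_groups="group_id",      # column defining which rows belong together
)
best = compare_models()

This keeps groups together inside the cross-validation folds; if the initial train/test holdout must also respect groups, one option is to split by group yourself (for example as in the answer below) and pass the held-out part via setup's test_data argument.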

One solution could look like this:
import numpy as np
import pandas as pd
from itertools import combinations

def is_possible_sum(numbers, n):
    # return the first combination of group sizes that sums exactly to n
    for r in range(len(numbers)):
        for combo in combinations(numbers, r + 1):
            if sum(combo) == n:
                return combo
    print('Desired split not possible')
    raise ArithmeticError

def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    occurrences = table[col_identifier].value_counts().to_dict()
    required = sum(occurrences.values()) * train_fraction
    lengths = is_possible_sum(occurrences.values(), required)
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]  # prevents the same ID from being selected twice
                break
    train = table[table[col_identifier].isin(train_ids)]
    test = table[~table[col_identifier].isin(train_ids)]
    return train, test

if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')
Some remarks:
This is probably the least elegant way to do it...
It uses an ungodly amount of for loops and is probably slow for larger dataframes. It also doesn't randomize the split.
Much of this is because the mapping from group_id to the number of samples with that group_id can't simply be inverted, as several groups may share the same count. You could probably do this with numpy arrays as well, but I doubt the overall structure would be much different.
First function taken from here: How to check if a sum is possible in array?
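As a sketch of an alternative to the subset-sum approach above, scikit-learn's GroupShuffleSplit gives a randomized group-wise hold-out (note that train_size here is the fraction of groups, not of rows, so the row split only approximates 80/20 when group sizes differ):

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({'Group_ID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
                   'Measurement': np.random.random(10)})

# one shuffled split; rows sharing a Group_ID always land on the same side
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df['Group_ID']))
train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]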

Related

Filter rows of a dataframe based on certain conditions in pandas

While I was handling the dataframe in pandas, I got some unexpected cells with values like these:
E_no          E_name
6654-0984     Elvin-Johnson
430           Fred
663/547/900   Banty/Shon/Crio
87            Arif
546           Zerin
322,76        Chris,Deory
In some rows, more than one E_no and E_name have been assigned, although each cell is supposed to hold a single employee.
My data consists of E_no and E_name; both columns need to be separated into different rows.
What I want is:
E_no   E_name
6654   Elvin
0984   Johnson
430    Fred
663    Banty
547    Shon
900    Crio
87     Arif
546    Zerin
322    Chris
76     Deory
Separate those values and put them in different rows.
Please help me with this so that I can proceed further; it would also be really helpful if someone could explain the logic, i.e. how to think about this problem.
Thanks in advance.
Let me know if you're facing any difficulty understanding the problem.
Similar to Divyaansh's solution. Just use split, explode and merge.
import pandas as pd
df = pd.DataFrame({'E_no':['6654-0984','430','663/547/900','87','546', '322,76'],
'E_name':['Elvin-Johnson','Fred','Banty/Shon/Crio','Arif','Zerin','Chris,Deory']})
#explode each column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
#Merge both the columns together
df2 = pd.merge(x,y,left_index=True,right_index=True)
#print the modified dataframe
print (df2)
Output of this will be:
Original Dataframe:
E_no E_name
0 6654-0984 Elvin-Johnson
1 430 Fred
2 663/547/900 Banty/Shon/Crio
3 87 Arif
4 546 Zerin
5 322,76 Chris,Deory
Modified Dataframe:
E_no E_name
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
Alternatively, you can create a new dataframe with the values from x and y.
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
#Create a new dataframe with the new values from x and y
df3 = pd.DataFrame({'E_no':x,'E_name':y})
print (df3)
Same result as before.
Or this:
#explode each column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index()
y = df['E_name'].str.split(r'[-,/]').explode().reset_index()
#Create a new dataframe with the new values from x and y
df3 = pd.DataFrame({'E_no':x['E_no'],'E_name':y['E_name']})
print (df3)
Or you can do:
#explode each column
x = df['E_no'].str.split(r'[-,/]').explode().reset_index(drop=True)
y = df['E_name'].str.split(r'[-,/]').explode().reset_index(drop=True)
df4 = pd.DataFrame([x,y]).T
print (df4)
Split, flatten, recombine, rename:
a = [item for sublist in df.E_no.str.split(r'\W').tolist() for item in sublist]
b = [item for sublist in df.E_name.str.split(r'\W').tolist() for item in sublist]
df2 = pd.DataFrame(list(zip(a, b)), columns=df.columns)
Output:
E_no E_name
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
You should provide multiple delimiters in the read_csv arguments. Since the separator is a regular expression, pandas uses the Python parser engine for it; passing engine='python' explicitly avoids the fallback warning:
pd.read_csv("filepath", sep='-|/|,| ', engine='python')
This is the best I can help with right now without the data tables.
I think this is actually rather tricky. Here's a solution in which we use the E_no column to build a column of regexes that we will then use to split the two original columns into parts. Finally we construct a new DataFrame from those parts. This method ensures that the second column's format matches the first's.
import re
import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"E_no": "6654-0984", "E_name": "Elvin-Johnson"},
        {"E_no": "430", "E_name": "Fred"},
        {"E_no": "663/547/900", "E_name": "Banty/Shon/Crio"},
        {"E_no": "87", "E_name": "Arif"},
        {"E_no": "546", "E_name": "Zerin"},
        {"E_no": "322,76", "E_name": "Chris,Deory"},
        {"E_no": "888+88", "E_name": "FIRST+SEC|OND"},
        {"E_no": "999|99", "E_name": "TH,IRD|FOURTH"},
    ]
)

def get_pattern(e_no, delimiters=None):
    # build a regex that captures the parts of e_no separated by its own delimiters
    if delimiters is None:
        delimiters = "-/,|+"
    delimiters = "|".join(re.escape(d) for d in delimiters)
    non_match_delims = f"(?:(?!{delimiters}).)*"
    delim_parts = re.findall(f"{non_match_delims}({delimiters})", e_no)
    pattern_parts = []
    for delim_part in delim_parts:
        delim = re.escape(delim_part)
        pattern_parts.append(f"((?:(?!{delim}).)*)")
        pattern_parts.append(delim)
    pattern_parts.append("(.*)")
    return "".join(pattern_parts)

def extract_items(row, delimiters=None):
    # split both columns of a row with the pattern derived from E_no
    pattern = get_pattern(row["E_no"], delimiters)
    nos = re.search(pattern, row["E_no"]).groups()
    names = re.search(pattern, row["E_name"]).groups()
    return (nos, names)

nos, names = map(
    lambda L: [e for tup in L for e in tup],
    zip(*df.apply(extract_items, axis=1))
)

print(pd.DataFrame({"E_no": nos, "E_names": names}))
E_no E_names
0 6654 Elvin
1 0984 Johnson
2 430 Fred
3 663 Banty
4 547 Shon
5 900 Crio
6 87 Arif
7 546 Zerin
8 322 Chris
9 76 Deory
10 888 FIRST
11 88 SEC|OND
12 999 TH,IRD
13 99 FOURTH
Here is my approach:
1) Replace - and / with a comma (,) in both columns.
2) Split each field on the comma (,), expand it, then stack it.
3) Reset the index of each dataframe.
4) Finally, merge the two dataframes into one finalDF.
a = {'E_no': ['6654-0984', '430', '663/547/900', '87', '546', '322,76'],
     'E_name': ['Elvin-Johnson', 'Fred', 'Banty/Shon/Crio', 'Arif', 'Zerin', 'Chris,Deory']}
df = pd.DataFrame(a)
df1 = df['E_no'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df2 = df['E_name'].str.replace('-|/', ',', regex=True).str.split(',', expand=True).stack().reset_index()
df1.drop(['level_0', 'level_1'], axis=1, inplace=True)
df1.rename(columns={0: 'E_no'}, inplace=True)
df2.drop(['level_0', 'level_1'], axis=1, inplace=True)
df2.rename(columns={0: 'E_name'}, inplace=True)
finalDF = pd.merge(df1, df2, left_index=True, right_index=True)
Output:

How to replace rows with character value by integers in a column in pandas dataframe?

I am working on a large dataset. The problem I am facing is that some columns should contain only integer values; however, as the dataset is uncleaned, there are a few rows where 'characters' appear along with the integers. Here I am trying to illustrate the problem with a small pandas dataframe example.
I have the following dataframe:
Index  l1  l2    l3
0      1   123   23
1      2   Z3V   343
2      3   321   21
3      4   AZ34  345
4      5   432   3
With dataframe code :
l1,l2,l3 = [1,2,3,4,5], [123, 'Z3V', 321, 'AZ34', 432], [23,343,21,345,3]
data = pd.DataFrame(zip(l1,l2,l3), columns=['l1', 'l2', 'l3'])
print(data)
Here, as you can see, column 'l2' at row indexes 1 and 3 has 'characters' along with integers. I want to find such rows in this particular column and print them. Later I want to replace them with integer values like 100, using a different replacement value for each distinct string; for example, I am replacing instances of 'Z3V' with 100 and instances of 'AZ34' with 101. My point is to replace character-containing values with integers. If 'Z3V' occurs again in column 'l2', there too I will replace it with 100.
Expected output:
Index  l1  l2    l3
0      1   123   23
1      2   100   343
2      3   321   21
3      4   101   345
4      5   432   3
As you can see, the two instances where there were characters have been replaced with 100 and 101 respectively.
How do I get this expected output?
You could do:
import pandas as pd
import numpy as np
# setup
l1, l2, l3 = [1, 2, 3, 4, 5, 6], [123, 'Z3V', 321, 'AZ34', 432, 'Z3V'], [23, 343, 21, 345, 3, 3]
data = pd.DataFrame(zip(l1, l2, l3), columns=['l1', 'l2', 'l3'])
# find all non numeric values across the whole DataFrame
mask = data.applymap(np.isreal)
rows, cols = np.where(~mask)
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
# apply the replacements
res = data.replace(replacements)
print(res)
Output
l1 l2 l3
0 1 123 23
1 2 101 343
2 3 321 21
3 4 100 345
4 5 432 3
5 6 101 3
Note that I added an extra row to verify the desired behaviour; now the data DataFrame looks like:
l1 l2 l3
0 1 123 23
1 2 Z3V 343
2 3 321 21
3 4 AZ34 345
4 5 432 3
5 6 Z3V 3
By changing this line:
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
you can change the replacement values as you see fit.
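For instance, if you want the exact values from the question rather than auto-numbered ones, a hand-written mapping works the same way (a small sketch reusing the data frame from above):

# explicit mapping instead of the enumerate-based one
replacements = {'Z3V': 100, 'AZ34': 101}
res = data.replace(replacements)
print(res)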

filtering and transposing the dataframe in python3

I made a csv file using pandas and am trying to use it as input for the next step. When I open the file using pandas, it looks like this example:
example:
Unnamed: 0 Class_Name Probe_Name small_example1.csv small_example2.csv small_example3.csv
0 0 Endogenous CCNO 196 32 18
1 1 Endogenous MYC 962 974 1114
2 2 Endogenous CD79A 390 115 178
3 3 Endogenous FSTL3 67 101 529
4 4 Endogenous VCAN 943 735 9226
I want to make a plot; to do so, I have to change the data structure.
1- I want to remove the Unnamed column.
2- Then I want to make a data frame for a heatmap, using the columns "Probe_Name", "small_example1.csv", "small_example2.csv" and "small_example3.csv".
3- I also want to transpose the data frame.
Here is the expected output:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
I tried to do that using the following code:
df = pd.read_csv('myfile.csv')
result = df.transpose()
but it does not return what I want. Do you know how to fix it?
df.drop(['Unnamed: 0','Class_Name'],axis=1).set_index('Probe_Name').T
Result:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
Here's a suggestion:
Changes 1 & 2 can be tackled in one go:
df = df.loc[:, ["Probe_Name", "small_example1.csv", "small_example2.csv", "small_example3.csv"]] # This only retains the specified columns
In order for change 3 (transposing) to work as desired, the column Probe_Name needs to be set as your index:
df = df.set_index("Probe_Name", drop=True)
df = df.transpose()
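Putting the three steps together, a minimal end-to-end sketch (assuming the file is called myfile.csv, as in the question):

import pandas as pd

df = pd.read_csv('myfile.csv')
# keep only the heatmap columns, index by probe name, then transpose
result = (df.loc[:, ["Probe_Name", "small_example1.csv",
                     "small_example2.csv", "small_example3.csv"]]
            .set_index("Probe_Name")
            .T)
print(result)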

Efficient way to perform iterative subtraction and division operations on pandas columns

I have the following dataframe:
A B C Result
0 232 120 9 91
1 243 546 1 12
2 12 120 5 53
I want to perform operations of the following kind:
A B C Result A-B/A+B A-C/A+C B-C/B+C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000
which I am doing using
df['A-B/A+B']=(df['A']-df['B'])/(df['A']+df['B'])
df['A-C/A+C']=(df['A']-df['C'])/(df['A']+df['C'])
df['B-C/B+C']=(df['B']-df['C'])/(df['B']+df['C'])
which I believe is a very crude and ugly way to do it.
How can I do it in a cleaner way?
You can do the following:
# take all columns except the last one ('Result')
colnames = df.columns.tolist()[:-1]

# compute (a - b) / (a + b) for every pair of the remaining columns
for i, c in enumerate(colnames):
    for k in range(i + 1, len(colnames)):
        df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])

# check result
print(df)
A B C Result A_B A_C B_C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000
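A more compact sketch of the same pairwise loop, using itertools.combinations so the index bookkeeping disappears (the column naming here is my own choice):

from itertools import combinations

for a, b in combinations(df.columns[:-1], 2):
    df[f'{a}-{b}/{a}+{b}'] = (df[a] - df[b]) / (df[a] + df[b])
print(df)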
This is a perfect case to use DataFrame.eval:
cols = ['A-B/A+B','A-C/A+C','B-C/B+C']
x = pd.DataFrame([df.eval(col).values for col in cols], columns=cols)
df.assign(**x)
A B C Result A-B/A+B A-C/A+C B-C/B+C
0 232 120 9 91 351.482759 786.753086 122.000000
1 243 546 1 12 240.961207 243.995885 16.583333
2 12 120 5 53 128.925000 546.998168 124.958333
The advantage of this method with respect to the other solution is that it does not depend on the order of the operation signs that appear in the column names; rather, as mentioned in the documentation, it is used to:
Evaluate a string describing operations on DataFrame columns.
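One caveat: eval applies normal operator precedence, so 'A-B/A+B' is computed as A - (B/A) + B, which is what the output above shows. If the parenthesized ratios from the question are the goal, the expressions can be spelled out explicitly, e.g.:

exprs = {'A-B/A+B': '(A - B) / (A + B)',
         'A-C/A+C': '(A - C) / (A + C)',
         'B-C/B+C': '(B - C) / (B + C)'}
for name, expr in exprs.items():
    df[name] = df.eval(expr)
print(df)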

Get unique values of a column in between a timeperiod in pandas after groupby

I have a requirement where I need to find all the unique merchant_store_id values of a user on the same stamp card within a specific time period. I grouped by stamp_card_id and user_id to get the data frame based on that condition. Now I need to find the unique merchant_store_id values of this dataframe in an interval of 10 minutes from each entry.
My approach is to loop over that grouped dataframe, find all the indexes of each group, create a new dataframe from the time of each index to index + 60 minutes, and then find the unique merchant_store_id values in it. If the number of unique merchant_store_id values is > 1, I append that dataframe to a final dataframe. The problem with this approach is that it works fine for small data, but for data of around 20,000 rows it gives a memory error on Linux and keeps running forever on Windows. Below is my code:
fi_df = pd.DataFrame()
for i in df.groupby(["stamp_card_id", "merchant_id", "user_id"]):
    user_df = i[1]
    if len(user_df) > 1:
        # get list of unique indexes in that groupby df
        index = user_df.index.values
        for ind in index:
            fdf = user_df[ind:ind + np.timedelta64(1, 'h')]
            if len(fdf.merchant_store_id.unique()) > 1:
                fi_df = fi_df.append(fdf)
fi_df.drop_duplicates(keep="first").to_csv(csv_export_path)
Sample Data after group by is:
((117, 209, 'oZOfOgAgnO'), stamp_card_id stamp_time stamps_record_id user_id \
0 117 2018-10-14 16:48:03 1756 oZOfOgAgnO
1 117 2018-10-14 16:54:03 1759 oZOfOgAgnO
2 117 2018-10-14 16:58:03 1760 oZOfOgAgnO
3 117 2018-10-14 17:48:03 1763 oZOfOgAgnO
4 117 2018-10-14 18:48:03 1765 oZOfOgAgnO
5 117 2018-10-14 19:48:03 1767 oZOfOgAgnO
6 117 2018-10-14 20:48:03 1769 oZOfOgAgnO
7 117 2018-10-14 21:48:03 1771 oZOfOgAgnO
8 117 2018-10-15 22:48:03 1773 oZOfOgAgnO
9 117 2018-10-15 23:08:03 1774 oZOfOgAgnO
10 117 2018-10-15 23:34:03 1777 oZOfOgAgnO
merchant_id merchant_store_id
0 209 662
1 209 662
2 209 662
3 209 662
4 209 662
5 209 662
6 209 663
7 209 664
8 209 662
9 209 664
10 209 663 )
I have tried the resampling method as well, but then I get the data bucketed by time, and the condition of a user hitting multiple merchant_store_id values is missed at the boundary of each hour.
Any help would be appreciated. Thanks.
If those are datetimes, you can filter with the following:
filtered_set = set(df[df["stamp_time"]>=x][df["stamp_time"]<=y]["col of interest"])
df[df["stamp_time"]>=x] filters the df
adding [df["stamp_time"]<=y] filters the filtered df
["merchant_store_id"] captures just the specified column (series)
and finally set() returns the unique list (set)
Specific to your code:
x = datetime(lowerbound) #pseudo-code
y = datetime(upperbound) #pseudo-code
filtered_set = set(fi_df[fi_df["stamp_time"]>=x][fi_df["stamp_time"]<=y]["col of interest"])
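Equivalently, Series.between builds the same time-window mask in one step (a sketch; x and y are the same placeholder bounds as above):

mask = fi_df["stamp_time"].between(x, y)
filtered_set = set(fi_df.loc[mask, "merchant_store_id"])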
