Select k rows with the highest value of a given column - python-3.x

Let's suppose you have a pandas dataframe with col1 and you want to keep only the k samples with the highest value of col1. How can you do that?
Notice I'm not asking for the single maximum value, but rather something like sorting by col1, keeping the top k samples, and discarding the rest.

k = 10  # some number
df.sort_values('col1', ascending=False).head(k)
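An equivalent that avoids sorting the whole frame would be DataFrame.nlargest (assuming col1 is the actual column name; ties are kept in order of appearance with the default keep='first'):

df.nlargest(k, 'col1')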

Related

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1. This real dataframe has 100k records.
For example, column A is related to the 3rd and 4th rows of text because it is labeled as 1. In the same way, A is not related to the 1st and 2nd rows of text because it is labeled as 0.
What I need to do is to sample this dataframe in a way that I have the same, or about the same, number of occurrences of each feature.
In this case, feature C has only one occurrence, so I need to filter all the other columns in a way that I have one text with A, one text with B, one text with C, etc.
The best would be that I could set, for example, n=100, meaning I want to sample in a way that I end up with 100 records covering all the features.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s in each column.
For example, with a final dataset of 1k records, I would like to have all columns from A to the final column, each with about the same numbers of 1s and 0s. To accomplish this I will need to randomly discard rows of text and id only.
The approach I was trying was to look at the feature with the lowest 1 and 0 counts and then use this value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum value as a threshold to filter the text column. I don't know how to do this filtering step.
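A minimal sketch of that lowest-count threshold idea (assuming id and text are the only non-label columns) might look like:

# hypothetical sketch: downsample each label to the count of the rarest label
label_cols = df.columns.difference(['id', 'text'])
threshold = int(df[label_cols].sum().min())   # rarest label count (1 for C in the toy example)
sampled = pd.concat(
    [df[df[col] == 1].sample(n=threshold) for col in label_cols]
).drop_duplicates()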
Thanks
The exact output you expect is unclear, but assuming you want to get 1 random row per letter with a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))
 .stack()
 .groupby(level=-1).sample(n=1)
 .reset_index()
)
NB: you can rename the columns if needed (a possible sketch follows the output below).
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0

How to get the number of columns in a DataFrame row that are above a threshold

I have a simple Python 3.8 DataFrame with 8 columns (simply labeled 0, 1, 2, etc.) and approx. 3500 rows. I want a subset of this DataFrame where at least 2 columns in each row are above 1. I would prefer not to have to check each column individually, but to be able to check all columns at once. I know I can use .any(1) to check all the columns, but I need at least 2 columns to meet the threshold, not just one. Any help would be appreciated. Sample code below:
import pandas as pd
df = pd.DataFrame({0: [1, 1, 1, 1, 100],
                   1: [1, 3, 1, 1, 1],
                   2: [1, 3, 1, 1, 4],
                   3: [1, 1, 1, 1, 1],
                   4: [3, 4, 1, 1, 5],
                   5: [1, 1, 1, 1, 1]})
The easiest way I can think of to sort/filter later would be to create another column at the end, df[9], that houses the count:
df[9] = df.apply(lambda x: x.count() if x > 2, axis=1)
This code doesn't work, but I feel like it's close?
df[(df>1).sum(axis=1)>=2]
Explanation:
(df>1).sum(axis=1) gives the number of columns in that row that are greater than 1.
Then with >=2 we filter those rows with at least 2 columns that meet the condition, which we counted as explained in the previous bullet.
The value of x in the lambda is a Series, which can be indexed like this.
df[9] = df.apply(lambda x: x[x > 2].count(), axis=1)
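As a side note, the same count column can be built without apply; this vectorized form (a suggestion, not part of the original answer) should match the lambda above:

df[9] = (df > 2).sum(axis=1)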

Calculate the average of values in a column based on target values

I have this sample of a data frame. I would like to find an average value for each assignment type according to the target value. For example, for rows with a Pass result, I want to calculate their average C and T. There are about 5 Cs and 3 Ts based on time; id_assigment differentiates between them, as you can see, but I want to find the average for each C and T for each class value. For example, the average of C with id 45 for Pass rows, for Fail rows, etc. How can I calculate these averages?
Try pd.pivot_table
pd.pivot_table(df, columns=['id_assignemnt', 'assignemnt_type', 'result'], values='score')
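pivot_table defaults to aggfunc='mean', so this already returns average scores. An equivalent groupby sketch, assuming the same column names (and spellings) as in the snippet above:

df.groupby(['result', 'assignemnt_type', 'id_assignemnt'])['score'].mean()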

How to select a bunch of rows

I have a dataframe with multiple columns. I want to select a bunch of rows where column B has consecutive 1s, and check whether, in these rows, column A has any value equal to 0.04; if so, I need this bunch of rows and want to extract the start and end values of column A for it.
Here is my dataframe
Here is my desired output:
Build a consecutivity column with .diff().abs().cumsum().bfill(), filter out the groups that do not satisfy the specific conditions (x['B'].eq(1).any() and x['A'].eq(0.04).any()), then group by that consecutivity column and use agg to extract the first and last values of A.
df['temp'] = df.B.diff().abs().cumsum().bfill()
df.groupby('temp').filter(lambda x: (x['B'].eq(1).any() and x['A'].eq(0.04).any()))\
  .groupby('temp').agg({'A': ['first', 'last']})
Out:
          A
      first  last
temp
3.0   344.0  39.9

Returning Multiple Columns from FuzzyWuzzy token_set_ratio

I am attempting to perform some fuzzy matching across two datasets containing lots of addresses.
I am iterating through the list of addresses in df and finding the 'most matching' one from another dataframe (df3):
for index, row in df.iterrows():
    test_address = row['Full_Address']
    first_comp = fuzz.token_set_ratio(df3.Full_Address, test_address)
Taking the row output returns the full address from df, but I can't come up with a way to return the corresponding 'matched' address from df3.
Can anyone give a pointer please?
df ~ 18k rows
df3 ~ 2.5M rows
Which obviously presents limitations.
I have tried using np.meshgrid to create a list of value pairs, getting the ratio for each pair and then selecting rows greater than the threshold.
I also tried the following, but with the dataset size it takes an age:
matched_names = []
for row1 in df.index:
    name1 = df.get_value(row1, "Full_Address")
    for row2 in df3.index:
        name2 = df3.get_value(row2, "Full_Address")
        matched_token = fuzz.token_set_ratio(name1, name2)
        if matched_token > 80:
            matched_names.append([name1, name2, matched_token])
print(matched_names)
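Since no answer is recorded here, one possible sketch for getting the matched address back is fuzzywuzzy's process.extractOne with the same scorer; note it still scans all of df3 for each row of df, so it does not remove the 18k x 2.5M scale problem:

import pandas as pd
from fuzzywuzzy import fuzz, process  # sketch assumes fuzzywuzzy is installed

choices = df3['Full_Address'].tolist()

def best_match(address):
    # extractOne returns the highest-scoring choice and its score
    match, score = process.extractOne(address, choices, scorer=fuzz.token_set_ratio)
    return pd.Series({'matched_address': match, 'score': score})

result = df.join(df['Full_Address'].apply(best_match))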
