calculate percentage of occurrences in column pandas - python-3.x

I have a column with thousands of rows and I want to select the most significant ones. Let's say I want to select all the rows that would represent 90% of my sample. How would I do that?
I have a dataframe with 2 columns: one for product_id, and one showing whether the product was purchased or not (the value is 0 or 1):
product_id  purchased
a           1
b           0
c           0
d           1
a           1
...         ...
With df['product_id'].value_counts() I can get all my product_ids ranked by number of occurrences.
Let's say I now want to find how many product_ids I should consider in my future analysis so that they represent 90% of the total occurrences.
Is there a way to do that?

If you want all product_id values whose cumulative share of the counts stays under 0.9, use:
s = df['product_id'].value_counts(normalize=True).cumsum()
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
Or, if you want to sort all rows by their counts and take the first 90% of them:
s1 = df['product_id'].map(df['product_id'].value_counts()).sort_values(ascending=False)
df2 = df.loc[s1.index[:int(len(df) * 0.9)]]
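For illustration, here is a minimal runnable sketch of the first approach using toy data shaped like the question's sample (not the real data):
import pandas as pd

df = pd.DataFrame({'product_id': ['a', 'b', 'c', 'd', 'a'],
                   'purchased':  [1, 0, 0, 1, 1]})

# cumulative share of occurrences, most frequent products first
s = df['product_id'].value_counts(normalize=True).cumsum()

# number of product_ids needed to stay under the 90% mark
print((s < 0.9).sum())

# rows belonging to those products
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
print(df1)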

Related

Pandas - Get the first n rows if a column has a specific value

I have a DataFrame that has 5 columns, including User and MP.
I need to extract a sample of n rows for each User, n being a percentage based on the User (if a User has 1000 entries and n is 5, select the first 50 rows and go to the next User). After that I have to add all the samples to a new DataFrame. Also, if a User has multiple values in the column MP, for example 2 values, select 2.5% for one value and 2.5% for the other.
Somehow my logic isn't that good (I started with the first step, without adding the logic for multiple MPs):
df = pd.read_excel("Results/fullData.xlsx")
dfSample = pd.DataFrame()
uniqueValues = df['User'].unique()
print(uniqueValues)
n = 5
for u in uniqueValues:
    sm = df["User"].str.count(u).sum()
    print(sm)
    for u in df['User']:
        sample = df.head(int(sm*(n/100)))
        #print(sample)
        dfSample.append(sample)
print(dfSample)
dfSample.to_excel('testFinal.xlsx')
Check the example below. It is intentionally verbose for the sake of understanding. The column that solves the problem is "ROW_PERC". You can filter on it based on the requirement (50% of rows, 25% of rows, etc.) for each USR/MP.
import pandas as pd
df = pd.DataFrame({'USR':[1,1,1,1,2,2,2,2],'MP':['A','A','A','A','B','B','A','A'],"COL1":[1,2,3,4,5,6,7,8]})
df['USR_MP_RANK'] = df.groupby(['USR','MP']).rank()
df['USR_MP_RANK_MAX'] = df.groupby(['USR','MP'])['USR_MP_RANK'].transform('max')
df['ROW_PERC'] = df['USR_MP_RANK']/df['USR_MP_RANK_MAX']
df
Output:
   USR MP  COL1  USR_MP_RANK  USR_MP_RANK_MAX  ROW_PERC
0    1  A     1          1.0              4.0      0.25
1    1  A     2          2.0              4.0      0.50
2    1  A     3          3.0              4.0      0.75
3    1  A     4          4.0              4.0      1.00
4    2  B     5          1.0              2.0      0.50
5    2  B     6          2.0              2.0      1.00
6    2  A     7          1.0              2.0      0.50
7    2  A     8          2.0              2.0      1.00
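To then keep, say, the first 50% of rows within every USR/MP group, filter on that column:
df_top_half = df[df['ROW_PERC'] <= 0.5]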

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
df = pd.DataFrame({'id':[10,20,30,40],'text':['some text','another text','random stuff', 'my cat is a god'],
'A':[0,0,1,1],
'B':[1,1,0,0],
'C':[0,0,0,1],
'D':[1,0,1,0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1, and about 100k records.
For example, the column A is related to the 3rd and 4th rows of text, because it is labeled as 1. In the same way, A is not related to the 1st and 2nd rows of text, because there it is labeled as 0.
What I need to do is to sample this dataframe in a way that I end up with the same, or about the same, number of occurrences of each feature.
In this case, the feature C has only one occurrence, so I need to filter all the other columns in a way that I have one text with A, one text with B, one text with C, etc.
The best would be: I can set, for example, n=100, meaning I want to sample in a way that I have 100 records with each of the features.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s per column.
For example, with a final data set of 1k records, I would like to have all columns from A to the final column with about the same numbers of 1 and 0. To accomplish this I will only need to randomly discard rows (id and text).
The approach I was trying was to look at the feature with the lowest count of 1s and then use this value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum value as a threshold to filter the text column. I don't know how to do this filtering step.
Thanks
The exact output you expect is unclear, but assuming you want to get 1 random row per letter with 1 you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))
 .stack()
 .groupby(level=-1).sample(n=1)
 .reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
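For instance, assuming the chained expression above was assigned to a variable out (a hypothetical name), the autogenerated column names could be replaced like this:
out = out.rename(columns={'level_2': 'label', 0: 'value'})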

Randomly select and assign values to given number of rows in python dataframe

How can I randomly select a given number of rows in a python dataframe and assign values to them?
Col B contains only 1's and 0's.
Suppose I have a dataframe as below
Col A Col B
A 0
B 0
A 0
B 0
C 0
A 0
B 0
C 0
D 0
A 0
I aim to randomly choose 5% of the rows and change the value of Col B to 1. I saw df.sample(), but that won't allow me to make in-place changes to the column data.
You can try the standard-library random module, which has its own sample function:
import random

# sample 5% of the row positions without replacement; dividing the size by 20 gives 5%
randindx = random.sample(range(0, dataframe['Col B'].size), dataframe['Col B'].size // 20)
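The sampled positions can then be written back with iloc, so the assignment works regardless of the index labels (a sketch, assuming dataframe is the frame from the question):
# set Col B to 1 at the sampled row positions
dataframe.iloc[randindx, dataframe.columns.get_loc('Col B')] = 1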
You can first use the sample method to get the random 5% of examples and get hold of their indices like so:
samples_indices = df.sample(frac=0.05, replace=False).index
With the knowledge of the indices, loc method can be used to update the values corresponding to the examples.
df.loc[samples_indices, 'Col B'] = 1
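As a quick end-to-end check, a sketch with made-up data (100 rows so that 5% is a whole number of rows):
import pandas as pd
import numpy as np

# hypothetical stand-in for the real data: 100 rows, Col B all zeros
df = pd.DataFrame({'Col A': np.random.choice(list('ABCD'), size=100),
                   'Col B': 0})

samples_indices = df.sample(frac=0.05, replace=False).index
df.loc[samples_indices, 'Col B'] = 1
print(df['Col B'].sum())  # 5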

Calculations within groupby in pandas, python

My data set contains house prices for 4 different house types (A, B, C, D) in 4 different countries (USA, Germany, UK, Sweden). A house price movement can be of only three types (Upward, Downward, and Not Changed). I want to calculate a diffusion index (DI) for the different house types (A, B, C, D) in the different countries (USA, Germany, UK, Sweden) based on house price.
The formula that I want to use to calculate the diffusion index (DI) is:
DI = (Total Number of Upward * 1 + Total Number of Downward * 0 + Total Number of Not Changed * 0.5) / (Total Number of Upward + Total Number of Downward + Total Number of Not Changed)
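For example, if a country/house-type combination has 6 Upward, 2 Downward, and 2 Not Changed observations, then DI = (6*1 + 2*0 + 2*0.5) / 10 = 0.7.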
Here is my data:
and the expected result is:
I really need your help.
Thanks.
You can do this by using groupby. Assuming your file is named test.xlsx:
df = pd.read_excel('test.xlsx')
df = df.replace({'Upward':1,'Downward':0,'Notchanged':0.5})
df.groupby('Country').mean().reset_index()
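As a quick illustration of why the group mean reproduces the DI formula, here is a sketch with made-up data standing in for the Excel file (column names and values are assumptions): after the replacement, every cell holds its per-observation score, so the mean over a group is exactly the DI.
import pandas as pd

# hypothetical stand-in for test.xlsx
df = pd.DataFrame({'Country': ['USA', 'USA', 'Germany', 'Germany'],
                   'A': ['Upward', 'Downward', 'Upward', 'Notchanged'],
                   'B': ['Notchanged', 'Upward', 'Downward', 'Downward']})

df = df.replace({'Upward': 1, 'Downward': 0, 'Notchanged': 0.5})
print(df.groupby('Country').mean().reset_index())
#    Country     A     B
# 0  Germany  0.75  0.00
# 1      USA  0.50  0.75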

pandas - grouping values by pair of columns and pivoting

I've been struggling to think what to do here; pivoting and melting and whatnot don't seem to be working out. I was trying to join the names of the to/from destinations together and then re-order the combined names, but it was a total mess.
My data concerns flows from one location to another, it's in the format:
pd.DataFrame(columns=['from_location','to_location','flow'],data =[['a','b',1],['b','a',3]])
from_location to_location flow
0 a b 1
1 b a 3
but my output needs to be the format:
pd.DataFrame(columns=['connection','flow','back flow','net'],data =[['a -> b',1,3,2]])
connection flow back flow net
0 a -> b 1 3 2
Are there any nice built-in functions that can rearrange things like this? I'm not even sure what keywords to search for.
Use:
import numpy as np

#df = df.sort_values(['from_location','to_location'])
df1 = pd.DataFrame(np.sort(df[['from_location','to_location']], axis=1),
                   columns=list('ab'), index=df.index)
s = df1['a'] + ' -> ' + df1['b']
df2 = df.groupby(s)['flow'].agg(['first','last']).assign(net=lambda x: x['last'] - x['first'])
print(df2)
first last net
a -> b 1 3 2
Explanation:
If necessary, first apply sort_values, in case some paired rows are swapped.
Sort the two columns within each row with numpy.sort and join them with a separator.
Then groupby the joined values and aggregate with agg using first and last.
Last, if you need the difference of the columns, add a new column with assign.
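If the column names should match the desired output exactly, one optional final step is:
df2 = (df2.rename(columns={'first': 'flow', 'last': 'back flow'})
          .rename_axis('connection')
          .reset_index())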
