How can I randomly select a given number of rows in a Python dataframe and assign values to them?
Col B contains only 1's and 0's.
Suppose I have a dataframe as below
Col A Col B
A 0
B 0
A 0
B 0
C 0
A 0
B 0
C 0
D 0
A 0
I aim to randomly choose 5% of the rows and change the value of Col B to 1. I saw df.sample(), but that won't allow me to make in-place changes to the column data.
You can try the random library. random has its own sample function.
import random
randindx = random.sample(range(0, dataframe['Col B'].size), dataframe['Col B'].size // 20)
Considering 5%, you need to divide by 20.
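A minimal sketch of how those sampled positions could then be applied, assuming the Col A/Col B layout from the question (the dataframe name is just illustrative):
import random
import pandas as pd

dataframe = pd.DataFrame({'Col A': list('ABABCABCDA'), 'Col B': [0] * 10})

# pick row positions without replacement; with only 10 rows, size // 20 would be 0,
# so this toy example forces at least one pick
n_pick = max(1, dataframe['Col B'].size // 20)
randindx = random.sample(range(0, dataframe['Col B'].size), n_pick)

# set Col B to 1 at the sampled positions
dataframe.loc[dataframe.index[randindx], 'Col B'] = 1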
You can first use the sample method to get a random 5% of the rows and get hold of their indices like so:
samples_indices = df.sample(frac=0.05, replace=False).index
With those indices in hand, the loc method can be used to update the corresponding values.
df.loc[samples_indices, 'Col B'] = 1
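Putting both steps together on the question's toy frame (a sketch; note that with only 10 rows, frac=0.05 rounds down to zero sampled rows, so a larger frame or fraction is needed to see an effect):
import pandas as pd

df = pd.DataFrame({'Col A': list('ABABCABCDA'), 'Col B': [0] * 10})

# indices of a random 5% of the rows (no replacement)
samples_indices = df.sample(frac=0.05, replace=False).index

# set Col B to 1 for those rows only
df.loc[samples_indices, 'Col B'] = 1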
I have a dataframe with two classes (A or B) and marks and I want to present the mark ranges per class.
Dataframe:
Class Mark Department
A 74.0 1
A 73.0 2
B 72.0 1
A 75.0 1
B 64.0 2
What I want to achieve:
Class Mark Range
A 73.0-75.0
B 64.0-72.0
I was thinking of using the min and max (creating a new field for the range), but as a start, I tried to just group it:
df['count'] = 1
result = df.pivot_table('count', index='Mark', columns='Class', aggfunc='sum').fillna(0)
which is complex and I abandoned this quickly.
I then kept only two columns in my dataframe (Mark and Class) and used the following:
df[['Mark','Class']].values
And now I just have to create the Mark Range column. I was wondering whether there was a simpler way, without all these steps, to pivot the data and get the range (min and max of one column grouped by another).
We can use GroupBy.apply to get the min and max per group and represent them as a string with an f-string:
df = (
    df.groupby('Class')['Mark'].apply(lambda x: f'{x.min()}-{x.max()}')
      .reset_index(name='Mark Range')
)
Class Mark Range
0 A 73.0-75.0
1 B 64.0-72.0
Simple but ugly:
temp = df.groupby('Class')['Mark'].agg(['min', 'max'])
temp['range'] = temp['min'].map(str) + '-' + temp['max'].map(str)
Result of doing temp[['range']]:
range
Class
A 73.0-75.0
B 64.0-72.0
If you are interested in using pivot_table:
df_new = (df.pivot_table('Mark', 'Class', aggfunc=lambda x: f'{x.min()}-{x.max()}')
.add_suffix(' Range').reset_index())
Out[1543]:
Class Mark Range
0 A 73.0-75.0
1 B 64.0-72.0
As per your comment: to add Department, just use the list ['Class', 'Department'] for the index, as follows:
df_new = (df.pivot_table('Mark', ['Class', 'Department'],
aggfunc=lambda x: f'{x.min()}-{x.max()}')
.add_suffix(' Range').reset_index())
Out[259]:
Class Department Mark Range
0 A 1 74.0-75.0
1 A 2 73.0-73.0
2 B 1 72.0-72.0
3 B 2 64.0-64.0
I've got a Dataframe like this:
df = pd.DataFrame(np.reshape(np.arange(0,9), (3,3)))
print(df)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
I'd like to normalize two of the columns against a reference column. For example, if I chose df[0] as my reference column, then df[1] and df[2] would also have a mean of 3 and a standard deviation of 3.
What's the best way to do this?
You can standardize each column by its own mean and standard deviation, then rescale and shift by the standard deviation and mean of the reference column ref:
ref = 0
means = df.mean()
stds = df.std()
(df - means) / stds * stds[ref] + means[ref]
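A quick sanity check on the example frame (a sketch; it just re-runs the transform above and prints the resulting column statistics):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.arange(0, 9), (3, 3)))

ref = 0
means = df.mean()
stds = df.std()

# standardize each column, then rescale/shift to the reference column's std and mean
normalized = (df - means) / stds * stds[ref] + means[ref]

print(normalized.mean())  # every column now has mean 3.0
print(normalized.std())   # every column now has std 3.0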
I have a dataframe with 500K rows, seven day columns (D1 to D7), and start-day and end-day columns.
I want to search for a value (e.g. 0) within the range [startDay, endDay] of each row.
For example, for id_1, startDay=1 and endDay=7, so I should search columns D1 to D7.
For id_2, startDay=4 and endDay=7, so I should search columns D4 to D7.
However, I couldn't manage to search over a different column range for each row.
As mentioned above:
if startDay > endDay, I should see "-999";
else, I need to find the first zero within the day range. For example, for id_3 the first zero is in column D2 (day 2) and its startDay is 1, so I want to see 2 - 1 = 1 (day of the first zero minus startDay);
if I cannot find a 0, I want to see "8".
Here is my data:
import pandas as pd

data = {
'D1':[0,1,1,0,1,1,0,0,0,1],
'D2':[2,0,0,1,2,2,1,2,0,4],
'D3':[0,0,1,0,1,1,1,0,1,0],
'D4':[3,3,3,1,3,2,3,0,3,3],
'D5':[0,0,3,3,4,0,4,2,3,1],
'D6':[2,1,1,0,3,2,1,2,2,1],
'D7':[2,3,0,0,3,1,3,2,1,3],
'startDay':[1,4,1,1,3,3,2,2,5,2],
'endDay':[7,7,6,7,7,7,2,1,7,6]
}
data_idx = ['id_1','id_2','id_3','id_4','id_5',
'id_6','id_7','id_8','id_9','id_10']
df = pd.DataFrame(data, index=data_idx)
What I want to see:
df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)
You can create a boolean array to check, in each row, which 'Dx' columns fall between 'startDay' and 'endDay' and have a value equal to 0. For the first two conditions, you can use np.ufunc.outer with np.less_equal and np.greater_equal, such as:
import numpy as np
arr_bool = (np.less_equal.outer(df.startDay, range(1, 8))      # which columns Dx are at or after startDay
            & np.greater_equal.outer(df.endDay, range(1, 8))   # which columns Dx are at or before endDay
            & (df.filter(regex='D[0-9]').values == 0))         # which values in the Dx columns are 0
Then you can use np.argmax to find the first True per row. By adding 1 and subtracting 'startDay', you get the values you are looking for. Then you can handle the other conditions with np.select, replacing values by -999 where df.startDay >= df.endDay and by 8 where the row of arr_bool has no True, such as:
df_need = pd.DataFrame((np.argmax(arr_bool, axis=1) + 1 - df.startDay).values,
                       index=data_idx, columns=['need'])
df_need.need = np.select(condlist=[df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                         choicelist=[-999, 8],
                         default=df_need.need)
print (df_need)
need
id_1 0
id_2 1
id_3 1
id_4 0
id_5 8
id_6 2
id_7 -999
id_8 -999
id_9 8
id_10 1
One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select rather than df.startDay > df.endDay as in your question; you can change it to the strict comparison, but then you get 8 instead of -999 in this case.
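For reference, a sketch of the same np.select call with the strict comparison from the question (only the first condition changes; id_7 then falls into the "no zero found" case and gets 8):
df_need.need = np.select(condlist=[df.startDay > df.endDay, ~arr_bool.any(axis=1)],
                         choicelist=[-999, 8],
                         default=(np.argmax(arr_bool, axis=1) + 1 - df.startDay).values)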
I have a DataFrame with YEAR and RACEETHN as a MultiIndex. I want to count the number of "1" values (note, the data are not only 0s and 1s, so I cannot just sum) for each YEAR and RACEETHN combination, for each column variable.
I am able to count where value = 1 for each column by doing this:
(df_3.ACSUPPSV == 1).sum()
(df_3.PSEDSUPPSV == 1).sum()
I want to do this with groupby, but am unable to get it to work. I've tried the following code to test if I could do it on a single column 'ACSUPPSV', and it did not work:
df.groupby(['YEAR', 'RACEETHN']).loc[df.ACSUPPSV == 1, 'ACSUPPSV'].count()
I exported the data to Excel and was able to calculate this with a quick "COUNTIF" formula, but I know there must be a way to do this in pandas.
Would appreciate it if someone had a better way to do this than exporting to Excel! :)
I think you need agg with a custom function to count only the 1 values:
df_3 = pd.DataFrame({'ACSUPPSV':[1,1,1,1,0,1],
'PSEDSUPPSV':[1,1,0,1,0,0],
'BUDGETSV':[1,0,1,1,1,0],
'YEAR':[2000,2000,2001,2000,2000,2000],
'RACEETHN':list('aaabbb')}).set_index(['YEAR','RACEETHN'])
print (df_3)
ACSUPPSV BUDGETSV PSEDSUPPSV
YEAR RACEETHN
2000 a 1 1 1
a 1 0 1
2001 a 1 1 0
2000 b 1 1 1
b 0 1 0
b 1 0 0
df2 = df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())
print (df2)
ACSUPPSV BUDGETSV PSEDSUPPSV
YEAR RACEETHN
2000 a 2 1 2
b 2 2 1
2001 a 1 1 0
Old answer:
df_3[((df_3.ACSUPPSV == 1) & (df_3.PSEDSUPPSV == 1))].groupby(['YEAR', 'RACEETHN']).size()
df_3.query('ACSUPPSV == 1 & PSEDSUPPSV == 1').groupby(['YEAR', 'RACEETHN']).size()
More general:
cols = ['ACSUPPSV','PSEDSUPPSV']
df_3[(df_3[cols] == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
For all columns:
df_3[(df_3 == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
EDIT:
Or maybe you need:
df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())
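As a small follow-up, a sketch of an equivalent without the Python-level lambda: compare the whole frame to 1 and sum the booleans per index level (assuming the df_3 defined above):
df2 = df_3.eq(1).groupby(level=['YEAR', 'RACEETHN']).sum()
print(df2)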