Best way to count cells in pd DataFrame based on value - python-3.x

I have a pd DataFrame mat. From this DataFrame, I want to count the cells that satisfy a condition (value > 0.5 in this case). To do so, I used:
mat[mat[:] > 0.5].count().sum()
This seems like a lot of code for a simple application. Is this the most efficient way to get the count?

Use sum on the boolean mask to count the True values:
(mat > 0.5).sum()
If you need the total count across all columns:
np.sum(mat > 0.5).sum()
(mat > 0.5).sum().sum()
Or, if you convert the values to a 2D numpy array, np.sum returns a scalar:
np.sum(mat.to_numpy() > 0.5)
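For example, a minimal sketch on a small DataFrame (the frame itself is made up for illustration):
import numpy as np
import pandas as pd

# small example frame, values chosen only for illustration
mat = pd.DataFrame({'a': [0.2, 0.7, 0.9], 'b': [0.6, 0.1, 0.4]})
print((mat > 0.5).sum())              # per-column counts of values > 0.5
print((mat > 0.5).sum().sum())        # total count -> 3
print(np.sum(mat.to_numpy() > 0.5))   # same total as a scalar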

Related

Elementwise comparison of numpy arrays to yield output arrays

We have several "in_arrays" like
in_1=np.array([0.4,0.7,0.8,0.3])
in_2=np.array([0.9,0.8,0.6,0.4])
I need to create two outputs like
out_1=np.array([0,0,1,0])
out_2=np.array([1,1,0,0])
So an element of an output array is 1 if the corresponding value in its input array is greater than 0.5 AND greater than the values of the other arrays at that position. What is an efficient way to do this?
You can aggregate all the input arrays in a single matrix, where each row represents a particular input array. That way it is possible to calculate all the output arrays again as a single matrix.
The code could look something like this:
import numpy as np
# input matrix corresponding to the example input arrays given in the question
in_matrix = np.array([[0.4,0.7,0.8,0.3], [0.9,0.8,0.6,0.4]])
out_matrix = np.zeros(in_matrix.shape)
# each element in the array is the maximal value of the corresponding column in input_matrix
max_values = np.max(in_matrix, axis=0)
# compute the values in the output matrix row by row
for n, row in enumerate(in_matrix):
    out_matrix[n] = np.logical_and(row > 0.5, row == max_values)
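As a side note (not part of the original answer), the loop can also be replaced by a broadcasted comparison; a minimal sketch assuming the same inputs:
import numpy as np

in_matrix = np.array([[0.4, 0.7, 0.8, 0.3], [0.9, 0.8, 0.6, 0.4]])
# 1 where the value is both > 0.5 and the column maximum, 0 elsewhere
out_matrix = np.logical_and(in_matrix > 0.5, in_matrix == in_matrix.max(axis=0)).astype(int)
# out_matrix[0] -> [0 0 1 0], out_matrix[1] -> [1 1 0 0]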

Get the indexes of the adjacent elements smaller or equal to a threshold in Python

I have a sorted array of distances, e.g.:
d = np.linspace(0.5, 50, 200)
and I want to iteratively get the indexes of adjacent elements that are at a distance smaller than or equal to L for each element of d.
Is there any simple way of doing it?
You can use numpy.argwhere after checking that consecutive differences are less than or equal to L:
import numpy as np
np.argwhere((d[1:]-d[:-1])<=L)
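For example, with d from the question and an assumed threshold L (the value 0.25 is picked purely for illustration):
import numpy as np

d = np.linspace(0.5, 50, 200)   # sorted distances from the question
L = 0.25                        # example threshold, assumed for illustration
# indexes i such that d[i+1] - d[i] <= L
idx = np.argwhere((d[1:] - d[:-1]) <= L)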

create variable size array

I would like to create a numpy array like the one below, except I want to be able to create an array of variable shape. So for the one below it would be n=3. Is there a slick way to do this with numpy, or do I need a for loop?
output data:
import numpy as np
np.array([[1,0,0],[0,1,0],[0,0,1],[1,1,1],[0,0,0]])
Say you want to create an array named d with row rows and col columns, with every element initialized to 0:
d = [[0 for x in range(col)] for y in range(row)]
You can access any element i, j with d[i][j].
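If the goal is specifically the n=3 example above (identity rows followed by a row of ones and a row of zeros), a numpy sketch without an explicit loop might look like this; the stacking order is an assumption based on the sample output:
import numpy as np

n = 3  # variable size
# n identity rows, then a row of ones, then a row of zeros
arr = np.vstack([np.eye(n, dtype=int),
                 np.ones((1, n), dtype=int),
                 np.zeros((1, n), dtype=int)])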

Apply this function to a 2D numpy Matrix vector operations only

guys, I have this function
def averageRating(a,b):
    avg = (float(a)+float(b))/2
    return round(avg/25)*25
Currently, I am looping over my np array, which is just a 2D array of numerical values. What I want is for "a" to be the 1st array and "b" to be the 2nd array, to get the average per row, and to return just an array with those values. I have used mean but could not find a way to also apply the rounding part, i.e. round(avg/25)*25.
My goal is to get rid of looping and replace it with a vectorized operations because of how slow looping is.
Sorry for the question new to python and numpy.
def averageRating(a,b):
    avg = (np.average(a,axis=1) + np.average(b,axis=1))/2
    return np.round(avg,0)
This should do what you are looking for if I understand the question correctly. Specifying axis = 1 in np.average will give the average of the rows (axis = 0 would be the average of the columns). And the 0 in np.round will round to 0 decimal places, changing it will change the number of decimal places you round to. Hope that helps!
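If the intent is instead an elementwise version of the original function (average the corresponding entries of a and b, then round to the nearest multiple of 25), a vectorized sketch might look like this (function name assumed):
import numpy as np

def average_rating_vectorized(a, b):
    # elementwise average, rounded to the nearest multiple of 25
    avg = (np.asarray(a, dtype=float) + np.asarray(b, dtype=float)) / 2
    return np.round(avg / 25) * 25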
def averageRating(a, b):
    averages = []
    for i in range(len(a)):
        averages.append((a[i] + b[i]) / 2)
    return averages
Given that your arrays are of equal length, this should be a simple solution.
It doesn't eliminate the for loop; however, it will be computationally cheaper than the current approach.

Pandas: Filling random empty rows with data

I have a dataframe with several currently-empty columns. I want a fraction of these filled with data drawn from a normal distribution, while all the rest are left blank. So, for example, if 60% of the elements should be blank, then 60% would be, while the other 40% would be filled. I already have the normal distribution, via numpy, but I'm trying to figure out how to choose random rows to fill. Currently, the only way I can think of involves FOR loops, and I would rather avoid that.
Does anyone have any ideas for how I could fill empty elements of a dataframe at random? I have a bit of the code below, for the random numbers.
data.loc[data['ColumnA'] == 'B', 'ColumnC'] = np.random.normal(1000, 500, rowsB).astype('int64')
piRSquared's advice is good; we are left guessing exactly what you want to solve.
Still, having just looked through some of the latest unanswered pandas questions, there are worse.
import pandas as pd
import numpy as np
# some redundancy here as I make an empty dataframe, pretending I start, like you, with a DataFrame
df = pd.DataFrame(index = range(11),columns=list('abcdefg'))
num_cells = np.prod(df.shape)
# make an array with the numbers 1 to num_cells
arr = np.arange(1, num_cells + 1)
#inplace shuffle - this is the key randomization operation
np.random.shuffle(arr)
arr = arr.reshape(df.shape)
# place the shuffled values, normalized to the number of cells, into my dataframe
df = pd.DataFrame(index=df.index, columns=df.columns, data=arr / float(num_cells))
# use applymap to keep 40% of the cells as ones and set the other 60% to nan
df = df.applymap(lambda x: 1 if x > 0.6 else np.nan)
# now sample a full set from the normal distribution;
# multiplying by nan nullifies the sampled value, whilst multiplying by 1 retains it
df * np.random.normal(1000,500,df.shape)
Thus you are left with a random 40% of the cells containing a draw from your normal distribution.
If your dataframe were large you could rely on the statistical stability of the uniform rand() function. Here I didn't do that, and instead determined explicitly how many cells fall above and below the threshold.
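As a side note, a shorter sketch of the same idea (names and fill fraction assumed, not part of the original answer) draws a uniform boolean mask and fills only the masked cells:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(11), columns=list('abcdefg'), dtype=float)
mask = np.random.rand(*df.shape) < 0.4            # True for roughly 40% of cells
samples = np.random.normal(1000, 500, df.shape)
# keep NaN where the mask is False, fill a normal draw where it is True
df = pd.DataFrame(np.where(mask, samples, np.nan), index=df.index, columns=df.columns)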
