Pandas: Filling random empty rows with data - python-3.x

I have a dataframe with several currently-empty columns. I want a fraction of these filled with data drawn from a normal distribution, while the rest are left blank. So, for example, if 60% of the elements should be blank, then 60% would be blank and the other 40% would be filled. I already have the normal distribution, via numpy, but I'm trying to figure out how to choose random rows to fill. Currently, the only way I can think of involves FOR loops, and I would rather avoid that.
Does anyone have any ideas for how I could fill empty elements of a dataframe at random? I have a bit of the code below, for the random numbers.
data.loc[data['ColumnA'] == 'B', 'ColumnC'] = np.random.normal(1000, 500, rowsB).astype('int64')

piRSquared's advice is good: we are left guessing exactly what to solve. Still, having just looked through some of the latest unanswered pandas questions, there are worse ones.
import pandas as pd
import numpy as np
# Some redundancy here as I make an empty dataframe - pretending that, like you, I start with a DataFrame.
df = pd.DataFrame(index=range(11), columns=list('abcdefg'))
num_cells = np.prod(df.shape)
# Make a 1-D array with the numbers 1 to num_cells.
arr = np.arange(1, num_cells + 1)
# In-place shuffle - this is the key randomization operation.
np.random.shuffle(arr)
arr = arr.reshape(df.shape)
# Place the shuffled values, normalized to the number of cells, into the dataframe.
df = pd.DataFrame(index=df.index, columns=df.columns, data=arr / float(num_cells))
# Use applymap to keep 40% of the cells as ones and set the other 60% to NaN.
df = df.applymap(lambda x: 1 if x > 0.6 else np.nan)
# Now sample a full set from the normal distribution.
# Multiplying by NaN nullifies the sampled value, while multiplying by 1 retains it.
df * np.random.normal(1000, 500, df.shape)
Thus you are left with a random 40% of the cells containing a draw from your normal distribution.
If your dataframe were large, you could rely on the uniform rand() function giving you approximately the right proportions. Here I didn't do that, and instead fixed explicitly how many cells fall above and below the threshold by shuffling a known set of values.
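For a larger frame, the simpler approach alluded to in that last paragraph - trusting a uniform random mask to hit roughly the requested fraction - could look like the sketch below. This is only an illustrative alternative, reusing the 40% fill fraction and the normal(1000, 500) parameters from the example above.
import numpy as np
import pandas as pd
df = pd.DataFrame(index=range(11), columns=list('abcdefg'))
# Keep each cell with probability 0.4; rand() draws uniform values in [0, 1).
keep = np.random.rand(*df.shape) < 0.4
# Draw a full normal sample and blank out the cells we are not keeping.
values = np.random.normal(1000, 500, df.shape)
filled = pd.DataFrame(np.where(keep, values, np.nan), index=df.index, columns=df.columns)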

Related

Z-score normalization in pandas DataFrame (python)

I am using python3 (spyder), and I have a table which is an object of type "pandas.core.frame.DataFrame". I want to z-score normalize the values in that table (from each value subtract the mean of its row and divide by the sd of its row), so that each row has mean=0 and sd=1. I have tried 2 approaches.
First approach
from scipy.stats import zscore
zetascore_table=zscore(table,axis=1)
Second approach
rows=table.index.values
columns=table.columns
import numpy as np
for i in range(len(rows)):
    for j in range(len(columns)):
        table.loc[rows[i], columns[j]] = (table.loc[rows[i], columns[j]] - np.mean(table.loc[rows[i], ])) / np.std(table.loc[rows[i], ])
table
Both approaches seem to run, but when I check the mean and sd of each row they are not 0 and 1 as they are supposed to be, but other float values. I don't know what the problem could be.
Thanks in advance for your help!
The code below calculates a z-score for each value in a column of a pandas df. It then saves the z-score in a new column (here, called 'num_1_zscore'). Very easy to do.
from scipy.stats import zscore
import pandas as pd
# Create a sample df
df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})
# Calculate the zscores and drop zscores into new column
df['num_1_zscore'] = zscore(df['num_1'])
display(df)
Sorry - thinking about it, I found an easier way to calculate the z-score (subtract the mean of each row and divide the result by the sd of the row) than the for loops:
table = table.T  # need to transpose it since the functions work column-wise
sd = np.std(table)
mean = np.mean(table)
numerator = table - mean  # numerator in the formula for the z-score
z_score = numerator / sd
z_norm_table = z_score.T  # transpose again and we have the initial table, but with all the values z-scored by row
I checked, and now the mean of each row is 0 or very close to it and the sd is 1 or very close to 1, so this works for me. Sorry, I have little experience with coding, and sometimes easy things take a lot of trials before I figure out how to solve them.
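The same row-wise normalization can also be written without the double transpose by using pandas' axis-aware arithmetic. A minimal sketch, assuming table is an all-numeric DataFrame; ddof=0 is used so the result matches np.std's default:
import numpy as np
import pandas as pd
table = pd.DataFrame(np.random.rand(4, 6))  # stand-in for the real table
row_mean = table.mean(axis=1)
row_sd = table.std(axis=1, ddof=0)
# sub/div with axis=0 aligns the per-row statistics against the row index.
z_norm_table = table.sub(row_mean, axis=0).div(row_sd, axis=0)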

{Python} - [Pandas] - How sum columns by condition less than in columns name

First, explaining the dataframe: the values of the columns '0-156', '156-234', '234-546' ... '>76830' are the percentage distribution for each range of distances in meters, totaling 100%.
The column 'Cell Name' identifies the data element the other columns refer to, and the column 'Distance' is the column that will trigger the desired sum.
I need to sum the values of the columns '0-156', '156-234', '234-546' ... '>76830' whose ranges start below the value of the 'Distance' (meters) column.
Below is the creation code for testing.
import pandas as pd
# initialize list of lists
data = [['Test1',0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
['Test2',0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
['Test3',0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Cell Name','0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','>76830','Distance'])
Example of what should be done:
The value of the column 'Distance' = 158.412286772863, so I would have to sum the values of the columns '0-156' and '156-234', totaling 19.43116162%.
Thanks so much!
As I understand it, you want to sum up all the percentage values in a row where the lower value of the column description (in the case of '0-156' it would be 0, in the case of '156-234' it would be 156, and so on...) is smaller than the value in the distance column.
First I would suggest that you transform your string-like column names into values. As an example:
lowerlimit=df.columns[2]
>>'156-234'
Then read the string only up to the '-' and turn it into a number:
int(lowerlimit[:lowerlimit.find('-')])
>> 156
You can loop this over all your columns and build a new row of lower limits; a short list comprehension like the one below would do it.
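As an illustrative sketch (not the answerer's code), the lower limits could be collected from the question's original df like this; the '>76830' column has no '-', so it is handled separately:
# Lower limit of each range column; skip the non-range columns.
lower_limits = [int(c.split('-')[0]) if '-' in c else int(c.lstrip('>'))
                for c in df.columns if c not in ('Cell Name', 'Distance')]
# lower_limits -> [0, 156, 234, 546, 1014, 1950, 3510, 6630, 14430, 30030, 53430, 76830]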
For a bit more simplicity I left out the first column of your example, and added an extra first row with the lower limits of each column, which you could generate as described above. Then this code works:
data = [[0, 156, 234, 546, 1014, 1950, 3510, 6630, 14430, 30030, 53430, 76830, 1e-23],
        [0.36516562, 19.065996, 49.15094, 24.344206, 0.49186087, 1.24217, 5.2812457, 0.05841639, 0, 0, 0, 0, 158.4122868],
        [0.20406325, 10.664485, 48.70978, 14.885571, 0.46103176, 8.75815, 14.200708, 2.1162114, 0, 0, 0, 0, 192.553074],
        [0.13483211, 0.6521175, 6.124511, 41.61725, 45.0036, 5.405257, 1.0494527, 0.012979688, 0, 0, 0, 0, 1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','76830-','Distance'])
df['lastindex']=None
df['sum']=None
After creating essentially your dataframe, I add two columns, 'lastindex' and 'sum'.
Then I search each row for the last index whose lower limit is below the distance given in that row (df.iloc[i, -3]); afterwards I sum up the respective columns in that row.
import numpy as np
for i in np.arange(1, len(df)):
    df.at[i, 'lastindex'] = np.where(df.iloc[0, :-3] < df.iloc[i, -3])[0][-1]
    df.at[i, 'sum'] = sum(df.iloc[i][0:df.at[i, 'lastindex'] + 1])
I hope, this is helpful. Best, lepakk
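A loop-free variant of the same idea, offered only as a hedged sketch against the question's original df (the one with the 'Cell Name' and 'Distance' columns): parse the lower limits out of the column names, broadcast them against each row's Distance, and zero out the columns that start at or beyond it before summing.
import numpy as np
range_cols = [c for c in df.columns if c not in ('Cell Name', 'Distance')]
lower = np.array([int(c.split('-')[0]) if '-' in c else int(c.lstrip('>')) for c in range_cols])
# Rows keep a percentage only where that range's lower limit is below the row's Distance.
mask = lower[None, :] < df['Distance'].values[:, None]
df['sum'] = (df[range_cols].values * mask).sum(axis=1)
# For the example row with Distance = 158.41..., this sums '0-156' and '156-234' to 19.43116162.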

Finding nearest Zip-Code Given Lat Long and List of Zip Codes w/ Lat Long

I have a left data frame of over 1 million lat/long observations. I have another data frame (the right) of 43,191 zip codes, each with a central lat/long.
My goal is to run each row of the 1 million lat/longs against the entire zip code data frame, take the distance to each zip code, and then return the zip code with the minimum distance. I want to take a loop approach, since there is too much data to do a cartesian join.
I understand this will probably be a lengthy operation, but I only need to do it once. I am just trying to do it in a way that doesn't take days and won't give me a memory error.
The database with the lat/long zip codes lives here:
https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/
I have tried to join the left table with the right in a cartesian setting but that creates over 50 billion rows so that isn't going to work.
Some dummy data:
import geopy.distance as gd
import pandas as pd
import os
import numpy as np
df = pd.DataFrame(np.array([[42.801104,-76.827879],[38.187102,-83.433917],
[35.973115,-83.955932]]), columns = ['Lat', 'Long'])
for index, row in df.iterrows():
    # vincenty needs two points; the zip-code centroid to compare against still has to be supplied here
    gd.vincenty((row['Lat'], row['Long']))
My goal is to create the loop so that a single row of the left frame iterates over the 43,000 rows of the right frame, calculates each distance, takes the minimum of that result set (probably a list of some sort), and then returns the corresponding zip code in a new column.
I am a bit lost, as I would typically just do this with a cartesian join and calculate everything in one go, but I have too much data volume to do that.
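One loop-free way to do this kind of nearest-neighbour lookup - a sketch under assumptions, not the asker's code - is a BallTree with the haversine metric from scikit-learn; zips below stands in for the linked zip-code dataset, and its 'Zip', 'Latitude' and 'Longitude' column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree
# df holds the observation points; zips holds the ~43k zip-code centroids (tiny dummies here).
df = pd.DataFrame({'Lat': [42.801104, 38.187102, 35.973115],
                   'Long': [-76.827879, -83.433917, -83.955932]})
zips = pd.DataFrame({'Zip': ['13071', '40351', '37920'],
                     'Latitude': [42.76, 38.20, 35.93],
                     'Longitude': [-76.55, -83.43, -83.87]})
# The haversine metric expects (lat, lon) in radians.
tree = BallTree(np.radians(zips[['Latitude', 'Longitude']].values), metric='haversine')
dist, idx = tree.query(np.radians(df[['Lat', 'Long']].values), k=1)
df['nearest_zip'] = zips['Zip'].values[idx[:, 0]]
df['dist_km'] = dist[:, 0] * 6371.0  # scale the unit-sphere distance to kilometres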

Replacing NaN Values in a Pandas DataFrame with Different Random Uniform Variables

I have a uniform distribution in a pandas dataframe column with a few NaN values I'd like to replace.
Since the data is uniformly distributed, I decided that I would like to fill the null values with random uniform samples drawn from a range of the column's min and max values. I used the following code to get the random uniform sample:
df_copy['ep'] = df_copy['ep'].fillna(value=np.random.uniform(3, 331))
Of course, using pd.DataFrame.fillna() replaces all existing NaNs with the same value. I would like each NaN to be a different value. I assume that a for loop could get the job done, but I am unsure how to write such a loop to specifically handle these NaN values. Thanks for the help!
It looks like you are doing this on a Series (column), but the same implementation would work on a DataFrame:
Sample Data:
series = pd.Series(range(100))
series.loc[2] = np.nan
series.loc[10:15] = np.nan
Solution:
series.mask(series.isnull(), np.random.uniform(3, 331, size=series.shape))
Use boolean indexing with DataFrame.loc:
m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = np.random.uniform(3, 331, size=m.sum())
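If the 3 and 331 bounds should come from the column itself, as the question describes, they can be read off with min() and max() first; a small sketch with made-up data:
import numpy as np
import pandas as pd
df_copy = pd.DataFrame({'ep': [5.0, np.nan, 120.0, np.nan, 300.0]})
lo, hi = df_copy['ep'].min(), df_copy['ep'].max()
m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = np.random.uniform(lo, hi, size=m.sum())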

Apply this function to a 2D numpy matrix using vector operations only

guys, I have this function
def averageRating(a, b):
    avg = (float(a) + float(b)) / 2
    return round(avg / 25) * 25
Currently, I am looping over my np array, which is just a 2D array of numerical values. What I want is for "a" to be the 1st array and "b" to be the 2nd array, and to get the average per row; the return I want is just an array of the values. I have used mean, but could not find a way to adapt it to keep the rounding part, round(avg/25)*25.
My goal is to get rid of looping and replace it with a vectorized operations because of how slow looping is.
Sorry for the question; I'm new to Python and numpy.
def averageRating(a, b):
    avg = (np.average(a, axis=1) + np.average(b, axis=1)) / 2
    return np.round(avg, 0)
This should do what you are looking for if I understand the question correctly. Specifying axis = 1 in np.average will give the average of the rows (axis = 0 would be the average of the columns). And the 0 in np.round will round to 0 decimal places, changing it will change the number of decimal places you round to. Hope that helps!
def averageRating(a, b):
    averages = []
    for i in range(len(a)):
        averages.append((a[i] + b[i]) / 2)
    return averages
Given that your arrays are of equal length, this should be a simple solution.
It doesn't eliminate the for loop, but it will still be computationally cheaper than the current approach.
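To keep the original round-to-the-nearest-25 behaviour while avoiding the Python loop entirely, the whole thing can be expressed with element-wise NumPy operations. A minimal sketch, assuming a and b are equal-shaped numeric arrays:
import numpy as np
def average_rating(a, b):
    # Element-wise average of the two arrays, snapped to the nearest multiple of 25.
    avg = (np.asarray(a, dtype=float) + np.asarray(b, dtype=float)) / 2
    return np.round(avg / 25) * 25
a = np.array([[10, 80], [55, 100]])
b = np.array([[30, 70], [45, 90]])
print(average_rating(a, b))  # [[ 25.  75.] [ 50. 100.]]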
