Z-score normalization in pandas DataFrame (python) - python-3.x

I am using Python 3 (Spyder), and I have a table whose type is "pandas.core.frame.DataFrame". I want to z-score normalize the values in that table (from each value, subtract the mean of its row and divide by the sd of its row), so that each row has mean = 0 and sd = 1. I have tried two approaches.
First approach
from scipy.stats import zscore
zetascore_table=zscore(table,axis=1)
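A side note (an assumption about a possible pitfall, not confirmed by the question): scipy.stats.zscore returns a plain NumPy array, so the row/column labels are lost. A small sketch of wrapping the result back into a DataFrame:
import pandas as pd
from scipy.stats import zscore
# Keep the original index and columns after scipy returns a NumPy array
zetascore_table = pd.DataFrame(zscore(table, axis=1),
                               index=table.index, columns=table.columns)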
Second approach
import numpy as np

rows = table.index.values
columns = table.columns
for i in range(len(rows)):
    for j in range(len(columns)):
        # subtract the row mean and divide by the row standard deviation
        table.loc[rows[i], columns[j]] = (table.loc[rows[i], columns[j]]
                                          - np.mean(table.loc[rows[i], :])) / np.std(table.loc[rows[i], :])
table
Both approaches seem to work, but when I check the mean and sd of each row they are not 0 and 1 as they are supposed to be, but other float values. I don't know what the problem could be.
Thanks in advance for your help!

The code below calculates a z-score for each value in a column of a pandas df. It then saves the z-score in a new column (here, called 'num_1_zscore'). Very easy to do.
from scipy.stats import zscore
import pandas as pd
# Create a sample df
df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})
# Calculate the zscores and drop zscores into new column
df['num_1_zscore'] = zscore(df['num_1'])
display(df)

Sorry, thinking about it I found an easier way to calculate the z-score (subtract the mean of each row and divide the result by the sd of the row) than the for loops:
import numpy as np

table = table.T  # transpose, since the functions work column-wise
sd = np.std(table)
mean = np.mean(table)
numerator = table - mean  # numerator in the formula for the z-score
z_score = numerator / sd
z_norm_table = z_score.T  # transpose back: the initial table with all values z-scored by row
I checked, and now the mean of each row is 0 or very close to 0 and the sd is 1 or very close to 1, so this works for me. Sorry, I have little experience with coding and sometimes easy things require a lot of trials before I figure out how to solve them.
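For reference, a quick way to check the result (a small sketch; note that np.std uses ddof=0 while DataFrame.std defaults to ddof=1, so the two can disagree slightly):
# Row means should be ~0 and row standard deviations ~1
print(z_norm_table.mean(axis=1))
print(z_norm_table.std(axis=1, ddof=0))  # ddof=0 matches np.std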

Related

How to Subtract a column from another column if a condition is met, otherwise subtract from a different column?

I'm working with trading data and Pandas. Given a 4-column OHLC pandas DataFrame that is 100 rows in length, I'm trying to calculate if an "Upper Shadow" exists or not for an individual row and store the result in its own column. To calculate if an "Upper Shadow" exists all you have to do is take the high (H) value of the row and subtract the open (O) value if the close (C) value is less than the open value. Otherwise, you have to subtract the close value.
Right now I'm naively doing this in a for loop where I iterate over each row with an if statement.
for index, row in df.iterrows():
    if row["close"] >= row["open"]:
        df.at[index, "upper_shadow"] = float(row["high"]) - float(row["close"])
    else:
        df.at[index, "upper_shadow"] = float(row["high"]) - float(row["open"])
Is there a better way to do this?
You can use np.maximum to calculate the maximum of close and open in a vectorized way:
import numpy as np
df['upper_shadow'] = df['high'] - np.maximum(df['close'], df['open'])
I think @Psidom's solution is what you are looking for. However, the following piece of code is another way of writing what you already have, using apply with a lambda:
df["upper_shadow"] = df.apply(lambda row: float(row["high"]) - float(row["close"]) if row["close"] >= row["open"] else float(row["high"]) - float(row["open"]),axis=1)

Split a column based on a delimiter and then unpivot the result while preserving other columns

I need to split a column into multiple rows and then unpivot them while preserving one or more other columns. How can I achieve this in Python 3?
See the example below:
import numpy as np
import pandas as pd

data = np.array(['a0', 'a1,a2', 'a2,a3'])
pk = np.array([1, 2, 3])
df = pd.DataFrame({'data': data, 'PK': pk})
df
df['data'].apply(lambda x: pd.Series(str(x).split(","))).stack()
What I need is:
data  PK
a0    1
a1    2
a2    2
a2    3
a3    3
Is there any way to achieve this without merge and resetting indexes as mentioned here?
Convert the column data into a list and explode the data frame (this answer uses PySpark; spark is assumed to be an existing SparkSession).
Data
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import col

data = np.array(['a0', 'a1,a2', 'a2,a3'])
pk = np.array([1, 2, 3])
df = pd.DataFrame({'data': data, 'PK': pk})
df = spark.createDataFrame(df)
Solution
df.withColumn('data', F.explode(F.split(col('data'), ','))).show()
Explode is the keyword to search for (thanks to wwnde for pointing it out), and this can be done easily in Python with existing libraries.
The first step is converting the delimited column into a list:
df = df.assign(Data=df.data.str.split(","))
and then exploding it:
df.explode('Data')
If you are reading from Excel and pandas detects the delimited numbers as integers, and you need to run the explode in several steps, the same pattern applies; a sketch follows.
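A minimal sketch of that case (hypothetical data; the idea is to cast to string, split, explode, and cast back to int):
import pandas as pd
# Hypothetical frame where the delimited values are numbers
df = pd.DataFrame({'PK': [1, 2, 3], 'data': ['1', '2,3', '3,4']})
df = df.assign(data=df['data'].astype(str).str.split(','))  # split into lists
df = df.explode('data')                                     # one row per list element
df['data'] = df['data'].astype(int)                         # cast back to int
print(df)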

Finding nearest Zip-Code Given Lat Long and List of Zip Codes w/ Lat Long

I have a left data frame of over 1 million lat/long observations. I have another data frame (the right) of 43191 zip codes, each with a central lat/long.
My goal is to run each row of the 1 million lat/longs against the entire zip-code data frame, take the distance to each zip-code centroid, and then return the zip code corresponding to the minimum distance. I want to take a loop approach since there is too much data to do a cartesian join.
I understand this will probably be a lengthy operation but I only need to do it once. I am just trying to do it in a way that doesn't take days and won't give me a memory error.
The database with the lat/long zip codes lives here:
https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/
I have tried to join the left table with the right in a cartesian setting but that creates over 50 billion rows so that isn't going to work.
Some dummy data:
import geopy.distance as gd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[42.801104, -76.827879],
                            [38.187102, -83.433917],
                            [35.973115, -83.955932]]), columns=['Lat', 'Long'])
for index, row in df.iterrows():
    gd.vincenty((row['Lat'], row['Long']))  # incomplete: vincenty needs a second point to measure against
My goal is to create a loop in which a single row of the left frame is compared against all 43,000 rows of the right frame, each distance is calculated, the minimum of that result set (probably a list of some sort) is taken, and the corresponding zip code is returned in a new column.
I am a bit lost as I typically would just do this with a cartesian join and calculate everything in one go but I have too much data volume to do that.
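A minimal sketch of the loop described above (assumptions: the zip-code frame is called zips with columns 'Zip', 'Lat', 'Long', and geopy's geodesic is used in place of vincenty, which has been removed from recent geopy releases):
import pandas as pd
from geopy.distance import geodesic

# Hypothetical stand-in for the 43191-row zip-code frame
zips = pd.DataFrame({'Zip': ['13165', '40351', '37920'],
                     'Lat': [42.90, 38.19, 35.92],
                     'Long': [-76.80, -83.43, -83.94]})

def nearest_zip(lat, long):
    # Distance from one observation to every zip-code centroid, then keep the closest
    dists = zips.apply(lambda z: geodesic((lat, long), (z['Lat'], z['Long'])).km, axis=1)
    return zips.loc[dists.idxmin(), 'Zip']

df['nearest_zip'] = df.apply(lambda row: nearest_zip(row['Lat'], row['Long']), axis=1)
For a million rows this loop will still be slow; an index structure such as scipy.spatial.cKDTree over the zip-code coordinates is the usual way to speed it up, but the sketch matches the loop approach asked about.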

Python: Convert time expressed in seconds to datetime for a series

I have a column of times expressed as seconds since Jan 1, 1990, that I need to convert to a DateTime. I can figure out how to do this for a constant (e.g. add 10 seconds), but not a series or column.
I eventually tried writing a loop to do this one row at a time. (Probably not the right way, and I'm new to Python.)
This code works for a single row:
from datetime import datetime, timedelta

def addSecs(secs):
    fulldate = datetime(1990, 1, 1)
    fulldate = fulldate + timedelta(seconds=secs)
    return fulldate

b = addSecs(intag112['outTags_1_2'].iloc[1])
print(b)
2018-06-20 01:05:13
Does anyone know an easy way to do this for a whole column in a dataframe?
I tried this:
for i in range(len(intag112)):
    intag112['TransactionTime'].iloc[i] = addSecs(intag112['outTags_1_2'].iloc[i])
but it errored out.
If you want to do something with a column (series) in a DataFrame, you can use the apply method, for example:
import datetime
# New column 'datetime' is created from old 'seconds'
df['datetime'] = df['seconds'].apply(lambda x: datetime.datetime.fromtimestamp(x))
Check the documentation for more examples. Overall advice: try to think in terms of vectors (or series) of values. Most operations in pandas can be done on an entire series or even the whole dataframe.
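For the specific "seconds since Jan 1, 1990" case in the question, the whole column can also be converted in one vectorized call (a sketch using the question's column names):
import pandas as pd
# Interpret the column as seconds elapsed since the 1990-01-01 origin
intag112['TransactionTime'] = pd.to_datetime(intag112['outTags_1_2'], unit='s',
                                              origin=pd.Timestamp('1990-01-01'))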

Pandas: Filling random empty rows with data

I have a dataframe with several currently-empty columns. I want a fraction of these filled with data drawn from a normal distribution, while all the rest are left blank. So, for example, if 60% of the elements should be blank, then 60% would be, while the other 40% would be filled. I already have the normal distribution, via numpy, but I'm trying to figure out how to choose random rows to fill. Currently, the only way I can think of involves for loops, and I would rather avoid that.
Does anyone have any ideas for how I could fill empty elements of a dataframe at random? I have a bit of the code below, for the random numbers.
data.loc[data['ColumnA'] == 'B', 'ColumnC'] = np.random.normal(1000, 500, rowsB).astype('int64')
piRSquared's advice is good: we are left guessing what to solve. Having just looked through some of the latest unanswered pandas questions, there are worse.
import pandas as pd
import numpy as np

# Some redundancy here as I make an empty dataframe - pretending I start, like you, with a DataFrame.
df = pd.DataFrame(index=range(11), columns=list('abcdefg'))
num_cells = np.prod(df.shape)

# Make a 1-D array with the numbers 1 to the number of cells, then reshape it to the frame's shape.
arr = np.arange(1, num_cells + 1)

# In-place shuffle - this is the key randomization operation.
np.random.shuffle(arr)
arr = arr.reshape(df.shape)

# Place the shuffled values, normalized to the number of cells, into my dataframe.
df = pd.DataFrame(index=df.index, columns=df.columns, data=arr / float(num_cells))

# Use applymap to keep 40% of the cells as ones and set the other 60% to NaN.
df = df.applymap(lambda x: 1 if x > 0.6 else np.nan)

# Now sample a full set from the normal distribution.
# Multiplying by NaN nullifies the sampled value, whilst multiplying by 1 retains it.
df * np.random.normal(1000, 500, df.shape)
Thus you are left with a random 40% of the cells containing a draw from your normal distribution.
If your dataframe were large, you could rely on the stability of the uniform rand() function instead. Here I didn't do that; I determined explicitly how many cells fall above and below the threshold.
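For completeness, that rand()-based variant would look roughly like this (a sketch, filling about 40% of the cells):
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(11), columns=list('abcdefg'))
values = pd.DataFrame(np.random.normal(1000, 500, df.shape),
                      index=df.index, columns=df.columns)
mask = np.random.rand(*df.shape) < 0.4   # each cell has a 40% chance of being filled
df = values.where(mask)                  # keep the sampled value where True, NaN elsewhere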
