I have a large pandas dataframe with about 80 columns. Each of the 80 columns reports daily traffic statistics for a website (the columns are the websites).
Since I don't want to work with the raw traffic statistics, I would rather normalize all of my columns (except for the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.
Date A B ...
10/10/2010 100.0 402.0 ...
11/10/2010 250.0 800.0 ...
12/10/2010 800.0 2000.0 ...
13/10/2010 400.0 1800.0 ...
That being said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. I am sorry I cannot provide the full data.
First, turn your Date column into an index.
dates = df.pop('Date')
df.index = dates
Then use either z-score normalization:
df1 = (df - df.mean())/df.std()
or min-max scaling:
df2 = (df-df.min())/(df.max()-df.min())
I would probably advise z-score normalization, because min-max scaling is highly susceptible to outliers.
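To tie it together on the sample rows above, here is a minimal sketch (the column names A and B are just those from the example) that also stretches the min-max result to the 0-100 range asked about:

import pandas as pd

# sample data shaped like the question's example
df = pd.DataFrame({
    'Date': ['10/10/2010', '11/10/2010', '12/10/2010', '13/10/2010'],
    'A': [100.0, 250.0, 800.0, 400.0],
    'B': [402.0, 800.0, 2000.0, 1800.0],
})

# equivalent to the pop/index assignment above
df = df.set_index('Date')

# min-max scaling stretched to 0-100 instead of 0-1
scaled = (df - df.min()) / (df.max() - df.min()) * 100

# z-score standardization for comparison
standardized = (df - df.mean()) / df.std()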
Related
I have two time series with 40 rows each. Typically, I can calculate Cohen's d from as few as 2 rows. Is it statistically acceptable to plot Cohen's d of the two time series to determine the point or year at which their similarity/dissimilarity stabilizes? Thanks for the answer.
I've got an OHLCV financial dataset I'm working with that's approx 1.5M rows. It's for a single security back to 2000 at 1-minute resolution. The dataset had all the zero-volume timestamps removed, so it has gaps in the trading day, which I don't want for my purposes.
Raw data (df1) looks like:
Timestamp, Open, High, Low, Close, Volume
To fill in all the zero-volume timestamps I created an empty trading calendar (df2) using pandas_market_calendars (which is an amazing time saver) and I've then used df3 = pd.merge_asof(df2, df1, on='Timestamp', direction='nearest') to fill all the timestamps. This is the behaviour I wanted for the price data (OHLC) but not for volume. I need all the 'filled' timestamps to show zero volume, so I figured a lambda function would suit (as below) to check whether each timestamp was in the original dataframe (df1) or not.
tss = df1.Timestamp.to_numpy()
df2['Adj_Volume'] = df2.apply(lambda x: x['Volume'] if x['Timestamp'] in tss else 0, axis=1)
I ran this for an hour, then two hours, then five hours and it still hadn't finished. To try to work out what's going on, I then used tqdm (progress_apply) and it estimates that it'll take 100 hours to complete! I'm running the conda dist of Jupyter Notebook on a 2014 MacBook Air (1.7 GHz, 8 GB RAM), which isn't a supercomputer, but 100 hours seems wacky.
If I cut down tss and df3 to a single year (~50k rows), it'll run in ~5 mins. However, this doesn't scale linearly to the full dataset: 100 hours vs. 100 mins (5 mins x 20 years, 2000-2019). Slicing the dataframe up into years in a Python-level loop and then joining them again afterwards feels clunky, but I can't think of another way.
Is there a smarter way to do this that takes advantage of vectorised operations and can be run on the entire dataset in a single operation?
Cheers
Maybe you could try the np.where function together with the .isin method?
import numpy as np
df2['Volume'] = np.where(df2['Timestamp'].isin(tss), df2['Volume'], 0)
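As a minimal sketch of how that would slot into the workflow from the question (the tiny df1 and df2 below are just stand-ins for the raw bars and the trading calendar; in practice those come from the real data and pandas_market_calendars):

import numpy as np
import pandas as pd

# stand-in for the raw 1-minute bars with zero-volume rows already removed
df1 = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2019-01-02 09:30', '2019-01-02 09:32']),
    'Close': [100.0, 100.5],
    'Volume': [1200, 800],
})

# stand-in for the full trading calendar
df2 = pd.DataFrame({
    'Timestamp': pd.date_range('2019-01-02 09:30', periods=4, freq='1min'),
})

# fill prices from the nearest original bar, as in the question
df3 = pd.merge_asof(df2, df1, on='Timestamp', direction='nearest')

# vectorised replacement for the row-wise apply: keep the merged volume
# only where the timestamp existed in the original data, else force zero
tss = df1['Timestamp'].to_numpy()
df3['Adj_Volume'] = np.where(df3['Timestamp'].isin(tss), df3['Volume'], 0)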
I have a dataset which contains gender as male and female. I have converted male to 1 and female to 0 using pandas functionality, and the column now has data type int8. Now I want to normalize columns such as weight and height. So what should be done with the gender column: should it be normalized or not? I am planning to use it in a linear regression.
So I think you are mixing up normalization with standardization.
Normalization:
rescales your data into a range of [0, 1]
Standardization:
rescales your data to have a mean of 0 and a standard deviation of 1.
Back to your question:
For your gender column, the values already range between 0 and 1, so your data is already "normalized". So your question should be whether you can standardize your data, and the answer is: yes, you could, but it doesn't really make sense. This question was already discussed here: Should you ever standardise binary variables?
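As a minimal sketch of the usual practice (normalize the continuous columns, leave the 0/1 gender column alone), using hypothetical column names weight, height and gender:

import pandas as pd

df = pd.DataFrame({
    'gender': [1, 0, 1, 0],           # already 0/1, leave as-is
    'weight': [82.0, 61.5, 95.2, 70.3],
    'height': [180.0, 165.0, 192.0, 171.0],
})

continuous = ['weight', 'height']

# min-max normalization to [0, 1] for the continuous columns only
df[continuous] = (df[continuous] - df[continuous].min()) / (
    df[continuous].max() - df[continuous].min()
)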
Trying to get some big panel data from Excel into Python so I can do some GMM / cross-sectional panel data regression analysis (think the scikit-learn package). I have moved my data from Excel to Python, but the format is not yet right for regression analysis (see below). The scikit-learn website has some datasets to play with, but it is not really helpful for discussing formats and how to get your own data into a similar, regression-ready shape.
Does anyone have any experience taking Excel (.xlsx) data and getting it into Python 'regression-ready'?
I have already done my needed regression analysis in R and Stata, but I would like to get better at using Python for regression analysis, since it has some nice attributes.
Here is my dataframe format so far, from Excel to Python
(this is truncated from a 10,000 x 60 dataset):
BANKS YEARS CIR DSF EQCUS EQLI EQNT EQUITY
0 CR1 2005 65.46 927915.00 28.553 23.948 37.542 264946.50
1 CR1 2006 65.98 1026491.00 30.491 26.584 36.143 312986.00
2 CR1 2007 60.26 1437615.00 27.003 23.413 28.238 388197.20
3 CR1 2008 58.08 1605464.00 24.024 20.160 25.828 385696.80
4 CR1 2009 65.21 1538570.00 28.160 22.850 27.907 433267.30
5 CR1 2010 54.45 1822863.00 31.009 24.555 28.274 565254.60
6 CR1 2011 57.38 2075505.00 30.905 24.861 29.618 641440.50
7 CR1 2012 62.12 2533641.00 29.595 24.509 28.883 749821.50
Data types:
>>>df.dtypes
BANKS object
YEARS int64
CIR float64
DSF float64
EQCUS float64
EQLI float64
EQNT float64
EQUITY float64
There is Unicode in the column names (I don't think scikit-learn likes that!)
>>>df.columns.tolist()
[u'BANKS', u'YEARS', u'CIR', u'DSF', u'EQCUS', u'EQLI', u'EQNT', u'EQUITY']
I'm not sure which columns you're including in the regression, or what errors you're getting, but you can't use categorical variables in regressions (like 'BANKS'). You need to convert the categorical var to dummy vars (binary 0/1) and exclude the original categorical variable from your regression.
I also don't believe you can include rows with missing data points, so you either need to impute the data or drop the rows (df.fillna or df.dropna in pandas).
You may want to consider using pandas to manage datasets in python. It's a package you can install and import in python, and makes python behave more like R or STATA. There's a nice tutorial here: http://pandas.pydata.org/pandas-docs/stable/10min.html
Pandas even has a function for converting categorical variables into dummy variables: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
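For what it's worth, here is a minimal, hedged sketch of the whole path from an .xlsx file to a fitted regression; the file name panel.xlsx is hypothetical, the column names are from your printout, and the target column (CIR here) is an arbitrary choice just for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression

# read the panel data straight from Excel (hypothetical file name)
df = pd.read_excel('panel.xlsx')

# drop rows with missing values (or impute with df.fillna instead)
df = df.dropna()

# turn the categorical BANKS column into 0/1 dummy columns
df = pd.get_dummies(df, columns=['BANKS'], drop_first=True)

# pick an arbitrary target for illustration and use the rest as features
y = df['CIR']
X = df.drop(columns=['CIR'])

model = LinearRegression().fit(X, y)
print(model.score(X, y))   # R^2 on the training data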
Hope that helps...
I have a set of data with over 15,000 records in Excel, from a measurement tool that finds trends over large areas. I'm not interested in trends across the data as a whole, but rather in the records closest to each other, to get a sense of how noisy the data is (how much variation there is between neighbouring records). Essentially I want something like the average standard deviation of the 15,000 or so records when looked at only 20 records at a time.
The hope is that the data values trend gradually rather than changing suddenly from record to record and thus looking noisy. If I add a chart and use the "Moving Average" trendline, it visually shows roughly how noisy the data looks across the 15,000+ records. However, I was hoping to get a numeric value to rate how noisy the data is vs. other datasets. Any ideas on what I could do here with formulas built into Excel or by adding some add-in? Let me know if I need to explain this any better.
Could you calculate your moving average for your 20 sample window, then use the difference between each point and the expected value to calculate a variance?
Hard to do tables here, but here is a sample of what I mean
Actual Measured Expected Variance
5 5.44 4.49 0.91
6 4.34 5.84 2.26
7 8.45 7.07 1.90
8 6.18 7.84 2.75
9 8.89 9.10 0.04
10 11.98 10.01 3.89
The "measured" values were determined as
measured = actual + (rand() - 0.5) * 4
The "expected" values were calculated from a moving average (the table was pulled from the middle of the data set).
The variance is simply the square of expected minus measured.
Then you could calculate an average variance as a summary statistic.
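Outside of Excel, the same idea is easy to express in pandas; here is a minimal sketch of it (the 'value' column and the random-walk data are just stand-ins for your exported measurements):

import numpy as np
import pandas as pd

# stand-in for the ~15,000 exported measurements
df = pd.DataFrame({'value': np.random.default_rng(0).normal(0, 1, 15000).cumsum()})

# centred 20-record moving average as the "expected" value
expected = df['value'].rolling(window=20, center=True).mean()

# squared deviation of each record from its local expectation
variance = (df['value'] - expected) ** 2

# a single number rating how noisy the series is
noise_score = variance.mean()
print(noise_score)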
A moving average is the correct approach, but you need a critical element: order. Do you have a date/time variable or a sequence number?
Use the OFFSET function to set up your window. If you want a window of roughly 20 records (here 21, centred on C15), your formula will look something like AVERAGE(OFFSET(C15,-10,0,21)). This is your moving average.
Relate that to C15, whether additively or multiplicatively, and you'll have your distance from the local trend. All we need now is your tolerance.