Scaling the data still gives NaN in Python

The problem is that after removing all the columns in my data whose values are identical across rows and applying the min-max scaling formula, (x - x.min()) / (x.max() - x.min()), I still get a column of NaNs.
P.S. I removed these constant columns because if I keep them, then after the scaling x.max() - x.min() is 0 and all of those columns end up full of NaN values.
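(For illustration, a minimal sketch with made-up data showing why a constant column turns into NaN under this formula:)
import pandas as pd

x = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [5.0, 5.0, 5.0]})   # 'b' is constant
scaled = (x - x.min()) / (x.max() - x.min())
print(scaled)   # 'a' is scaled to [0, 1]; 'b' is 0/0 on every row, i.e. all NaN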
So, what I am doing is the following:
I have train and test data sets, kept separately. Once I import them into a Jupyter notebook, I create a function that shows me which columns contain exactly the same value in every row.
def uniques(df):
    # yield the names of columns that contain only a single unique value
    for e in df.columns:
        if len(pd.unique(df[e])) == 1:
            yield e
Then I check which are the constant columns:
col_test_=uniques(test_st1)
col_test=list(col_test_)
col_test
Result:
['uswrf_s1_3','uswrf_s1_6','uswrf_s1_10','uswrf_s1_11','uswrf_s1_13','uswrf_s1_5']
Then I get all indices of these columns:
for i in list(col_test):
    idx_col = test_st1.columns.get_loc(i)
    print("All values in column {} of the test data are the same".format(idx_col))
Result:
All values in column 220 of the test data are the same
All values in column 445 of the test data are the same
All values in column 745 of the test data are the same
All values in column 820 of the test data are the same
All values in column 970 of the test data are the same
All values in column 1120 of the test data are the same
Then I drop these columns because I will not need them after I apply the min-max scaling.
for j in col_test:
    test_st1 = test_st1.drop(j, axis=1)
Basically I do the same for the train partition.
Next I apply the min-max scaling formula to both the train and test partitions, using the train statistics:
train_1 = (train_st1 - train_st1.min()) / (train_st1.max() - train_st1.min())
test_1 = (test_st1 - train_st1.min()) / (train_st1.max() - train_st1.min())
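(As a sanity check, a minimal sketch reusing the variable names above: since both partitions are scaled with the train statistics, any column whose denominator is zero in the train set becomes all NaN in both partitions.)
denom = train_st1.max() - train_st1.min()
print(denom[denom == 0].index.tolist())   # train columns that will turn into all-NaN after scaling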
Having removed the constant columns, I assumed there would be no NaN columns after the normalization. However, when I check for columns with NaN values, the following happens:
a=uniques(test_1)
b=list(a)
b
Result:
['uswrf_s1_3']
Checking which column is that:
test_1.columns.get_loc('uswrf_s1_3')
Result:
1126
How come I get a column of NaNs after the scaling, given that I removed all the columns whose values are identical across rows?

Related

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1. This real dataframe has 100k records.
For example, column A is related to the 3rd and 4th rows of text because they are labeled 1. In the same way, A is not related to the 1st and 2nd rows of text because they are labeled 0.
What I need to do is sample this dataframe so that I end up with the same, or about the same, number of examples per feature.
In this case, feature C has only one occurrence, so I need to filter all the other columns so that I have one text with A, one text with B, one text with C, etc.
The best would be that I could set, for example, n=100, meaning I want to sample so that I have 100 records covering all the features.
This dataset is a multilabel training dataset and is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of columns with 1 and 0.
For example, with a final dataset of 1k records, I would like to have all columns from A to the final column, each with about the same numbers of 1s and 0s. To accomplish this I will need to randomly discard text rows and ids only.
The approach I was trying was to look at the feature with the lowest counts of 1s and 0s and then use that value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum as a threshold to filter the text column. I don't know how to do this filtering step.
Thanks
The exact output you expect is unclear, but assuming you want to get one random row per letter with a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))        # turn the 0s into NaN so stack() drops them
 .stack()                         # long format: one row per (id, text, label) carrying a 1
 .groupby(level=-1).sample(n=1)   # one random row per label
 .reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
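As a possible extension (an assumption on my part, not part of the original answer): if the labels have enough rows carrying a 1, you could draw more than one text per label, capping at the number available:
(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))
 .stack()
 .groupby(level=-1, group_keys=False)
 .apply(lambda g: g.sample(n=min(100, len(g))))   # up to 100 rows per label
 .reset_index()
)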

Excel Array formula to count moving average outliers

I've tried a few things on this and settled on a 'cheap' solution. I wanted to know if this can be done directly and more elegantly.
Problem Statement and Sample Data
Assume we have a table in Excel with ~200 columns and a large number of rows (~10k).
Sample Data:
identifier  val1  val2  val3  ...  val200
ID_1        100   102   34    ...  89
We want to add a column at the end that shows how many "moving average" outliers exist. A moving average outlier is defined as a point outside the range (mean - 2 * std dev, mean + 2 * std dev), where the mean and std dev are calculated from the previous 10 values (hence a moving average outlier).
We will not test the first 10 values. But from val11 onwards, the previous 10 values form the window and we test whether the value is an outlier.
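(Not an Excel answer, but for illustration here is the same rolling-window check sketched in pandas, with made-up data; the 10-value window and the 2-sigma band follow the definition above:)
import numpy as np
import pandas as pd

vals = pd.Series(np.random.default_rng(0).normal(100, 10, 200))   # one row of 200 values

mean = vals.rolling(10).mean().shift(1)   # mean of the 10 values before each point
std = vals.rolling(10).std().shift(1)     # sample std dev of the same window
outliers = (vals < mean - 2 * std) | (vals > mean + 2 * std)
print(int(outliers.sum()))                # count of moving average outliers in this row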
My Solution so far
I created another table of the same dimensions as the original. In the cells from val11 to val200 (all those columns), I put the formula below into the new table. Then I can simply sum across the columns in each row of the new table.
Assume val11 is in cell X2 of the "shocks" worksheet (for the first row):
=IF(OR(shocks!X2<AVERAGEA(shocks!D2:W2)-2*STDEVA(shocks!D2:W2),shocks!X2>AVERAGEA(shocks!D2:W2)+2*STDEVA(shocks!D2:W2)),1,"")
But if possible, I want to avoid having a second table since it bloats and slows down the file. Any help would be greatly appreciated.

Combine time series data for boxplot from multiple csv files

I have multiple time series in CSV files from NetLogo model runs. I would like to join those series into one dataframe so that I can do a boxplot showing the variation across simulation runs. The x values in each CSV are the time iterations (integers). The y values are the values of a particular measure in the model, e.g., population count. I can join the CSVs with concat, but there are repeated column names for the y variables. My thought is to combine columns with the same name into one column holding a list of numbers (the y values). Then I can pass that x, y to boxplot to plot that variable across time with its variation (median, etc.). Data is of the form:
x population groups color
0 0 0.00 0.00 0.00
1 1 74.47 42.48 40.96
2 2 74.46 42.48 40.96
would become
x population groups color
0 0 [0.00, 1.2] [0.00, 5] [0.00, 4]
1 1 [74.47, 3.2] [42.48, 55] [40.96, 55]
2 2 [74.46, Nan] [42.48, NaN] [40.96, NaN]
There are multiples of this dataframe from different CSV files (thousands). The maximum x (time) value can differ between runs/CSVs.
How do I combine the dataframes so that I get one dataframe with a list of y values for a given y column at each x value? There will be NaNs for some y values for runs that ended early. Note that there are multiple y columns, and each column is a separate boxplot (overlaid on the same plot).
I have tried concat, join, and merge, and have not been able to convert multiple columns with the same or different names into one column holding a list of values rather than a single value.
Or, is there even a better way to do what I want to do with the data?
The answer ended up being simpler than I expected. Insight into how to do this came from this answer.
Make a list of the time series dataframes: dl = [d1, d2, d3, ...]
Concatenate the dataframes: dn = pd.concat(dl, axis=1)
Create a new column with the list of values:
dn['new'] = dn['data column name'].values.tolist()
This generates the new column with the list of values that I can now use to make a box plot.
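Putting the three steps together on two made-up runs (column names as in the example data above; real code would build the list from the CSV files):
import pandas as pd

d1 = pd.DataFrame({'x': [0, 1, 2], 'population': [0.00, 74.47, 74.46]}).set_index('x')
d2 = pd.DataFrame({'x': [0, 1], 'population': [1.2, 3.2]}).set_index('x')

dl = [d1, d2]                              # one dataframe per model run
dn = pd.concat(dl, axis=1)                 # repeated 'population' columns, NaN where a run ended early
dn['population_all'] = dn['population'].values.tolist()   # one list of y values per x
print(dn['population_all'])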

Column of NaN created when concatenating series into dataframe

I've created an output variable a = pd.Series(), then run a number of simulations using a for loop that appends the results of each simulation, temporarily stored in x, to a as successive columns, each renamed to the simulation number (starting at the zero-th position), using the following code:
a = pandas.concat([a, x.rename(sim_count)], axis=1)
For some reason, the resulting dataframe includes a column of "NaN" values to the left of my first column of simulated results that I can't get rid of, as follows (example shows the results of three simulations):
0 0 1 2
0 NaN 0.136799 0.135325 -0.174987
1 NaN -0.010517 0.108798 0.003726
2 NaN 0.116757 0.030352 0.077443
3 NaN 0.148347 0.045051 0.211610
4 NaN 0.014309 0.074419 0.109129
Any idea how to prevent this column of NaN values from being generated?
Basically, by creating your output variable via pd.Series() you are creating an empty dataset. This carries over into the concatenation: the empty dataset is given the same number of rows as x, and the only way pandas can represent that "empty" series is as a column of NaN values. When you concatenate you are effectively saying: add my new dataframe/series onto the "empty" series, and the empty series just becomes NaN.
A more effective way of doing this is to assign "a" to an empty DataFrame and then concatenate.
a = pd.DataFrame()
a = pandas.concat([a, x.rename(sim_count)], axis=1)
You might be asking yourself why this works while using pd.Series() forces a column of NaNs. My understanding is that the DataFrame creates an empty place in memory for the data to be added (i.e. you are putting your new data INTO an empty dataframe), whereas when you do pd.concat([pd.Series(), x.rename(sim_count)], axis=1) you are telling pandas that the empty series (pd.Series()) is important and should be retained, and that the new data should be added ONTO "a". Hence the column of NaNs.
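A minimal demo of the difference (made-up numbers; behaviour as seen in the pandas versions I have used):
import pandas as pd

x = pd.Series([0.14, -0.01, 0.12])

a = pd.Series(dtype=float)                  # empty Series as the starting point (dtype given to avoid a warning)
print(pd.concat([a, x.rename(0)], axis=1))  # an extra all-NaN column appears on the left

a = pd.DataFrame()                          # empty DataFrame as the starting point
print(pd.concat([a, x.rename(0)], axis=1))  # only the simulated column remains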

How to select a bunch of rows

I have a dataframe with multiple columns. I want to select a bunch of rows where column B has consecutive 1s, and check whether column A has any value equal to 0.04 within those rows. If so, I need that bunch of rows and want to extract the start value and end value of column A for it.
Here are my dataframe and my desired output (both were shown as images in the original post and are not reproduced here).
Build a consecutiveness column with .diff().abs().cumsum().bfill(), filter out the consecutive groups that do not satisfy the conditions (x['B'].eq(1).any() and x['A'].eq(0.04).any()), then group by that column and use agg to extract the first and last values of A for each remaining group:
df['temp'] = df.B.diff().abs().cumsum().bfill()
df.groupby('temp').filter(lambda x: (x['B'].eq(1).any() and x['A'].eq(0.04).any()))\
.groupby('temp').agg({'A':['first','last']})
Out:
          A
      first  last
temp
3.0   344.0  39.9
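A self-contained toy example of the same pattern (values made up, not the asker's data), to show what the consecutiveness column and the filter do:
import pandas as pd

df = pd.DataFrame({
    'A': [0.10, 0.04, 0.20, 0.50, 0.60, 0.04, 0.30],
    'B': [0,    1,    1,    0,    0,    1,    1],
})

df['temp'] = df.B.diff().abs().cumsum().bfill()           # label each run of consecutive B values
result = (df.groupby('temp')
            .filter(lambda x: x['B'].eq(1).any() and x['A'].eq(0.04).any())
            .groupby('temp')
            .agg({'A': ['first', 'last']}))               # start and end of A for each kept run
print(result)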
