I've created an output variable 'a = pd.Series()', then run a number of simulations in a for loop that appends the results of each simulation, temporarily stored in 'x', to 'a' in successive columns. Each column is renamed to match the simulation number, starting at the zero-th position, using the following code:
a = pd.concat([a, x.rename(sim_count)], axis=1)
For some reason, the resulting dataframe includes a column of "NaN" values to the left of my first column of simulated results that I can't get rid of, as follows (example shows the results of three simulations):
     0         0         1         2
0  NaN  0.136799  0.135325 -0.174987
1  NaN -0.010517  0.108798  0.003726
2  NaN  0.116757  0.030352  0.077443
3  NaN  0.148347  0.045051  0.211610
4  NaN  0.014309  0.074419  0.109129
Any idea how to prevent this column of NaN values from being generated?
Basically, by creating your output variable via pd.Series() you start from an empty Series. That empty Series is carried into the concatenation as a column of its own, with the same number of rows as x. The only way pandas can represent this "empty" column is with NaN values. When you concatenate, you are effectively saying: I want to add my new series onto the "empty" series, and the empty series just gets NaN.
A more effective way of doing this is to initialize "a" as an empty dataframe and then concatenate:
a = pd.DataFrame()
a = pd.concat([a, x.rename(sim_count)], axis=1)
You might be asking yourself why this works while pd.Series() forces a column of NaNs. My understanding is that the dataframe creates an empty place in memory for the data to be added (i.e. you are putting your new data INTO an empty dataframe), whereas when you do pd.concat([pd.Series(), x.rename(sim_count)], axis=1) you are telling pandas that the empty series (pd.Series()) is important and should be retained, and that the new data should be added ONTO "a". Hence the column of NaNs.
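Another option that sidesteps the empty starting object altogether is to collect each simulation's Series in a list and concatenate once after the loop. The following is only a sketch: run_simulation is a hypothetical stand-in for whatever produces 'x', and the row count and seed are made up for illustration.

```python
import numpy as np
import pandas as pd


def run_simulation(rng, n_rows=5):
    # Hypothetical stand-in for one simulation run; returns a Series like 'x'.
    return pd.Series(rng.normal(size=n_rows))


rng = np.random.default_rng(0)

results = []
for sim_count in range(3):
    x = run_simulation(rng)
    # Name each Series after its simulation number; the name becomes the column label.
    results.append(x.rename(sim_count))

# One concat at the end: there is no empty placeholder, so no extra NaN column.
a = pd.concat(results, axis=1)
print(a)
```

Concatenating once at the end also avoids re-copying the growing frame on every iteration.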
I have a dataframe with only one column (headerless). I want to add another empty column to it having the same number of rows.
To make it clearer, currently, the size of my data frame is 1050 (since only one column), I want the new size to be 1050*2 with the second column being completely empty.
A pandas DataFrame always has columns, so to add a new column filled with missing values by default, use the current number of columns as the new column's label:
import numpy as np
import pandas as pd

s = pd.Series([2,3,4])
df = s.to_frame()
df[len(df.columns)] = np.nan
# for a one-column df this is the same as:
# df[1] = np.nan
print (df)
   0   1
0  2 NaN
1  3 NaN
2  4 NaN
I have multiple time series data in csv files from Netlogo model runs. I would like to join those series into one dataframe so that I can do a boxplot to see variations from different simulation model runs. X values in each csv are the time iterations (integers). The y values are the values of a particular measure in the model, e.g., population count. So, I can join the csvs with concat. There are repeated column names for the y variables. My thought is to combine columns with the same name into one column as a list of numbers (y values). Then I can pass that x, y to boxplot to plot that variable across time with its variations - median, etc. Data is of the form:
   x  population  groups  color
0  0        0.00    0.00   0.00
1  1       74.47   42.48  40.96
2  2       74.46   42.48  40.96
would become
   x  population    groups        color
0  0  [0.00, 1.2]   [0.00, 5]     [0.00, 4]
1  1  [74.47, 3.2]  [42.48, 55]   [40.96, 55]
2  2  [74.46, NaN]  [42.48, NaN]  [40.96, NaN]
There are multiples of this dataframe from different csv files (thousands). The x axis value can have a different maximum time value for different runs / csvs.
How do I combine dataframes such that I get one dataframe with a list of y values for a given y (column) for each x value? There will be NaNs for some y values for runs that ended early. Note that there are multiple y columns, and that each column is a separate boxplot (overlaid on the same plot).
I have tried concat, join, merge, and not been able to convert multiple columns with the same or different names into one column with a list of values rather than a single value.
Or, is there even a better way to do what I want to do with the data?
The answer ended up being simpler than I expected. Insight into how to do this came from this answer.
Make a list of the time series dataframes: dl = [d1,d2,d3,...]
Concatenate the dataframes: dn = pd.concat(dl, axis=1)
Create a new column with the list of values:
dn['new'] = dn['data column name'].values.tolist()
This generates the new column with the list of values that I can now use to make a box plot.
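The steps above can be sketched on toy data as follows. The two small frames, their values, and the choice to index each run by 'x' before concatenating are assumptions made for illustration; the list column is built as a separate Series here rather than assigned back onto the frame, which is a small variation on the step above.

```python
import pandas as pd

# Two made-up runs; the second one ends one time step early.
d1 = pd.DataFrame({"x": [0, 1, 2], "population": [0.00, 74.47, 74.46]}).set_index("x")
d2 = pd.DataFrame({"x": [0, 1], "population": [1.2, 3.2]}).set_index("x")

dl = [d1, d2]
# Side-by-side concat aligns on x; the shorter run is padded with NaN.
dn = pd.concat(dl, axis=1)

# dn["population"] selects every column with that name (one per run),
# so each row turns into a list with one value per run.
runs = pd.Series(dn["population"].values.tolist(), index=dn.index, name="population")
print(runs)
```

Each list can then be passed to a box plot after dropping its NaN entries.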
I am working on a simple time series linear regression using statsmodels.api.OLS, and am running these regressions on groups of data based on an identifier variable. I have been able to get the grouped regressions working, but am now looking to merge the results of the regressions back into the original dataframe and am getting index errors.
A simplified version of my original dataframe, which we'll call "df" looks like this:
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
My function to conduct the regressions is as follows:
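import pandas as pd
import statsmodels.api as sm

def ols_reg(df, xcol, ycol):
    x = df[xcol]
    y = df[ycol]
    x = sm.add_constant(x)  # add an intercept term
    model = sm.OLS(y, x, missing='drop').fit()
    predictions = model.predict()  # fitted values for the rows used in the fit
    return pd.Series(predictions)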
I then define a variable that stores the results of conducting this function on my dataset, grouping by the id column. This code is as follows:
var = df.groupby('id').apply(ols_reg,
xcol='time',ycol='value')
This returns a Series of the predicted linear values that has the same length as the original dataset, and looks like the following:
id
a   0    0.5
    1    1
    2    2.5
    3    3
b   0    0.5
    1    1
    2    2.5
    3    3
The column starting with 0.5 (ignore the values; not the actual output) is the column with predicted values from the regression. As the return on the function shows, this is a pandas Series.
I now want to merge these results back into the original dataframe, to look like the following:
id value time results
a 1 1 0.5
a 1.5 2 1
a 2 3 2.5
a 2.5 4 3
b 1 1 0.5
b 1.5 2 1
b 2 3 2.5
b 2.5 4 3
I've tried a number of methods, such as setting a new column in the original dataset equal to the series, but get the following error:
TypeError: incompatible index of inserted column with frame index
Any help on getting these results back into the original dataframe would be greatly appreciated. There are a number of other posts that correspond to this topic, but none of the solutions worked for me in this instance.
UPDATE:
I've solved this with a relatively simple method, in which I converted the series to a list, and just set a new column in the dataframe equal to the list. However, I would be really curious to hear if others have better/different/unique solutions to this problem. Thanks!
To avoid losing the row positions when inserting predictions in place of the missing values, you can use this approach. For example:
X_train: The train data, a pandas dataframe corresponding to the known real results (in y_train).
X_test: The test data, a pandas dataframe without corresponding known real results. These need to be predicted.
y_train: The train data, a pandas Series with the real known results.
Prediction: The prediction, a pandas Series object.
To get the complete data merged into one pandas dataframe, first put the known part together:
# merge train part of the data into a dataframe
X_train = X_train.sort_index()
y_train = y_train.sort_index()
result = pd.concat([X_train,X_test])
# if need to convert numpy array to pandas series:
# prediction = pd.Series(prediction)
# fill the missing 'specie' values, in row order, with the predictions
result.loc[result['specie'].isnull(), 'specie'] = prediction.values
This does the job as long as the column has no missing values other than the ones being predicted.
The problem is that after removing all the columns in my data with equal values on the rows and applying the formula for max-min scaling which is: (x-x.min())/(x.max()-x.min()) I still got a column with NaNs.
P.S. I removed these constant columns because if I keep them and then do the scaling, x.max()-x.min() will be 0 and all these columns after the scaling will have NaN values.
So, what I am doing is the following:
I have train and test data sets separately. Once I import them in jupyter notebook, I create a function to show me which columns have exactly the same values on the rows.
def uniques(df):
    for e in df.columns:
        if len(pd.unique(df[e]))==1:
            yield e
Then I check which are the constant columns:
col_test_=uniques(test_st1)
col_test=list(col_test_)
col_test
Result:
['uswrf_s1_3','uswrf_s1_6','uswrf_s1_10','uswrf_s1_11','uswrf_s1_13','uswrf_s1_5']
Then I get all indices of these columns:
for i in list(col_test):
    idx_col=test_st1.columns.get_loc(i)
    print ("All values in column {} of the test data are the same".format(idx_col))
Result:
All values in column 220 of the test data are the same
All values in column 445 of the test data are the same
All values in column 745 of the test data are the same
All values in column 820 of the test data are the same
All values in column 970 of the test data are the same
All values in column 1120 of the test data are the same
Then I drop these columns because I would not need them after I apply the min-max scaling.
for j in col_test:
    test_st1=test_st1.drop(j,axis=1)
Basically I do the same for the train partition.
Next I apply the formula for max min scaling to both train and test partitions with respect to the train data:
train_1= (train_st1-train_st1.min())/(train_st1.max()-train_st1.min())
test_1 = (test_st1-train_st1.min())/(train_st1.max()-train_st1.min())
After I got rid of the columns with the same values I supposed that there won't be any columns with NaNs after the normalization. However, when I check if there is any column with NaN values then the following happens:
a=uniques(test_1)
b=list(a)
b
Result:
['uswrf_s1_3']
Checking which column is that:
test_1.columns.get_loc('uswrf_s1_3')
Result:
1126
How come I got a column with NaNs after the scaling bearing in mind that I got rid of all the columns whose values on the rows are completely the same?
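One possible explanation, sketched on made-up data: arithmetic between a DataFrame and a Series such as train_st1.min() aligns on column labels and keeps the union of them. If a column was dropped from the test set (because it was constant there) but not from the train set, it comes back in the scaled test frame as an all-NaN column appended at the end, and uniques() reports it because an all-NaN column has a single unique value. The frames and column names below are hypothetical.

```python
import pandas as pd

# "b" varies in train, but was constant in test and therefore dropped there.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"a": [2.0, 3.0]})

# train.min() still has an entry for "b", so alignment re-creates "b" as NaN.
naive = (test - train.min()) / (train.max() - train.min())
print(naive)

# Restrict both sides to the columns they share before scaling.
shared = train.columns.intersection(test.columns)
lo, hi = train[shared].min(), train[shared].max()
fixed = (test[shared] - lo) / (hi - lo)
print(fixed)
```

Dropping the same set of columns from both partitions before scaling would have the same effect.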
I have an Excel spreadsheet with columns of values that represent different variables in an experimental setup. For example, one column in my data may be called "reaction time" and consequently contain values representative of time in milliseconds. If a problem occurs during the trial and no value is recorded for the reaction time, Matlab calls this "NaN." I know that I can use:
data = xlsread('filename.xlsx')
reaction_time = data(:,3)
average_reaction_time = mean(reaction_time, 'omitnan')
This will return the average values listed in the "reaction time" column of my spreadsheet (column 3). It skips over anything that isn't a number (NaN, in the case of an error during the experiment).
Here's what I need help with:
In addition to excluding NaNs, I also need to be able to leave out some values. For example, one type of error results in the printing of a "1 ms" reaction time, and this is consequently printed in the spreadsheet. How can I specify that I need to leave out NaNs, "1"s, and any other values?
Thanks in advance,
Mickey
One option for you might be to try the standardizeMissing function to replace the values that you want to exclude with NaN prior to using mean with 'omitnan'. For instance:
>> x = 1:10;
>> x = standardizeMissing(x, [3 4 5]); % Treat 3, 4, and 5 as missing values
x =
1 2 NaN NaN NaN 6 7 8 9 10
>> y = mean(x, 'omitnan');
If you read your Excel sheet into a table, standardizeMissing can replace the values with NaN only in the column you care about if you use the DataVariables Name-Value pair.