Merge regression results back to original dataframe - python-3.x

I am working on a simple time series linear regression using statsmodels.api.OLS, and am running these regressions on groups of data based on an identifier variable. I have been able to get the grouped regressions working, but am now looking to merge the results of the regressions back into the original dataframe and am getting index errors.
A simplified version of my original dataframe, which we'll call "df" looks like this:
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
My function to conduct the regressions is as follows:
def ols_reg(df, xcol, ycol):
x = df[xcol]
y = df[ycol]
x = sm.add_constant(x)
model = sm.OLS(y, x, missing='drop').fit()
predictions = model.predict()
return pd.Series(predictions)
I then define a variable that stores the results of conducting this function on my dataset, grouping by the id column. This code is as follows:
var = df.groupby('id').apply(ols_reg,
xcol='time',ycol='value')
This returns a Series of the predicted linear values that has the same length as the original dataset, and looks like the following:
id
a 0 0.5
1 1
2 2.5
3 3
b 0 0.5
1 1
2 2.5
3 3
The column starting with 0.5 (ignore the values; not the actual output) is the column with predicted values from the regression. As the return on the function shows, this is a pandas Series.
I now want to merge these results back into the original dataframe, to look like the following:
id value time results
a 1 1 0.5
a 1.5 2 1
a 2 3 2.5
a 2.5 4 3
b 1 1 0.5
b 1.5 2 1
b 2 3 2.5
b 2.5 4 3
I've tried a number of methods, such as setting a new column in the original dataset equal to the series, but get the following error:
TypeError: incompatible index of inserted column with frame index
Any help on getting these results back into the original dataframe would be greatly appreciated. There are a number of other posts that correspond to this topic, but none of the solutions worked for me in this instance.

UPDATE:
I've solved this with a relatively simple method, in which I converted the series to a list, and just set a new column in the dataframe equal to the list. However, I would be really curious to hear if others have better/different/unique solutions to this problem. Thanks!

To not loose the position when inserting prediction in the missing values you can use this approach, in example:
X_train: The train data is a pandas dataframe corresponding to the known real results (in y_train).
X_test: The test data is a pandas dataframe without corresponding known real results. Need to predict.
y_train: The train data is pandas serie with real known results
Prediction: The prediction is a pandas series object
To get the complete data merged in one pandas dataframe first get the known part together:
# merge train part of the data into a dataframe
X_train = X_train.sort_index()
y_train = y_train.sort_index()
result = pd.concat([X_train,X_test])
# if need to convert numpy array to pandas series:
# prediction = pd.Series(prediction)
# here is the magic
result['specie'][result['specie'].isnull()] = prediction.values
If there is no missing value would do the job.

Related

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
df = pd.DataFrame({'id':[10,20,30,40],'text':['some text','another text','random stuff', 'my cat is a god'],
'A':[0,0,1,1],
'B':[1,1,0,0],
'C':[0,0,0,1],
'D':[1,0,1,0]})
Here I have columns from Ato D but my real dataframe has 100 columns with values of 0and 1. This real dataframe has 100k reacords.
For example, the column A is related to the 3rd and 4rd row of text, because it is labeled as 1. The Same way, A is not related to the 1st and 2nd rows of text because it is labeled as 0.
What I need to do is to sample this dataframe in a way that I have the same or about the same number of features.
In this case, the feature C has only one occurrece, so I need to filter all others columns in a way that I have one text with A, one text with B, one text with Cetc..
The best would be: I can set using for example n=100 that means I want to sample in a way that I have 100 records with all the features.
This dataset is a multilabel dataset training and is higly unbalanced, I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of columns with 1 and 0
For example. with a final data set with 1k records, I would like to have all columns from A to the final_column and all these columns with the same numbers of 1 and 0. To accomplish this I will need to random discard text rows and id only.
The approach I was trying was to look to the feature with the lowest 1 and 0 counts and then use this value as threshold.
Edit 1: One possible way I thought is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum value as threshold to filter the text column. I dont know how to do this filtering step
Thanks
The exact output you expect is unclear, but assuming you want to get 1 random row per letter with 1 you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
.set_index(['id', 'text'])
.replace(0, float('nan'))
.stack()
.groupby(level=-1).sample(n=1)
.reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0

Combine time series data for boxplot from multiple csv files

I have multiple time series data in csv files from Netlogo model runs. I would like to join those series into one dataframe so that I can do a boxplot to see variations from different simulation model runs. X values in each csv are the time iterations (integers). The y values are the values of a particular measure in the model, e.g., population count. So, I can join the csvs with concat. There are repeated column names for the y variables. My thought is to combine columns with the same name into one column as a list of numbers (y values). Then I can pass that x, y to boxplot to plot that variable across time with its variations - median, etc. Data is of the form:
x population groups color
0 0 0.00 0.00 0.00
1 1 74.47 42.48 40.96
2 2 74.46 42.48 40.96
would become
x population groups color
0 0 [0.00, 1.2] [0.00, 5] [0.00, 4]
1 1 [74.47, 3.2] [42.48, 55] [40.96, 55]
2 2 [74.46, Nan] [42.48, NaN] [40.96, NaN]
There are multiples of this dataframe from different csv files (thousands). The x axis value can have a different maximum time value for different runs / csvs.
How do I combine dataframes such that I get one dataframe with a list of y values for a given y (column) for each x value. There will be NaNs for some y values for runs that ended early. Note that are multiple y columns. Note that each column is a separate boxplot (overlayed on the same plot).
I have tried concat, join, merge, and not been able to convert multiple columns with the same or different names into one column with a list of values rather than a single value.
Or, is there even a better way to do what I want to do with the data?
The answer ended up being simpler than I expected. Insight into how to do this came from this answer.
Make a list of the time series dataframes: dn = [d1,d2,d3,...]
Concatenate the dataframes: dn = pd.concat(dl, axis=1)
Create a new column with the list of values:
dn['new'] = dn['data column name'].values.tolist()
This generates the new column with the list of values that I can now use to make a box plot.

Dask apply with custom function

I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider for example the following
N=10000
df = pd.DataFrame({'col_1':np.random.random(N), 'col_2': np.random.random(N) })
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1 and I follow the solution from here
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df,col,bins,labels):
return df.assign(bin_num = pd.cut(df[col],bins,labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here)
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions so I guess that somehow the apply is working on each one individually.
However, If I want the mean and use mean
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean?
Maybe this warning is the key (Dask doc: SeriesGroupBy.apply) :
Pandas’ groupby-apply can be used to to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution! It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). By casting the category column to another column type (float, int, str), then the groupby will work correctly.
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np
def test_f(df, col, bins, labels):
return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
#MRocklin or #TomAugspurger
Would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)

Column of NaN created when concatenating series into dataframe

I've created an output variable 'a = pd.Series()', then run a number of simulations using a for loop that append the results of the simulation, temporarily stored in 'x', to 'a' in successive columns, each renamed to coincide with the simulation number, starting at the zero-th position, using the following code:
a = pandas.concat([a, x.rename(sim_count)], axis=1)
For some reason, the resulting dataframe includes a column of "NaN" values to the left of my first column of simulated results that I can't get rid of, as follows (example shows the results of three simulations):
0 0 1 2
0 NaN 0.136799 0.135325 -0.174987
1 NaN -0.010517 0.108798 0.003726
2 NaN 0.116757 0.030352 0.077443
3 NaN 0.148347 0.045051 0.211610
4 NaN 0.014309 0.074419 0.109129
Any idea how to prevent this column of NaN values from being generated?
Basically, by creating your output variable via pd.Series() you are creating an empty dataset. This is carried over in the concatenation, with the empty dataset's size being defined as the same size (well, same number of rows) as x[sim_count]. The only way Python/Pandas knows to represent this "empty" series is by using a series of NaN values. When you concatenate you are effectively saying: I want to add my new dataframe/series onto the "empty" series...and the empty series just gets NaN.
A more effective way of doing this is to assign "a" to a dataframe then concatenate.
a = pd.DataFrame()
a = pandas.concat([a, x.rename(sim_count)], axis=1)
You might be asking yourself why this works and using pd.Series() forces a column of NaNs. My understanding is the dataframe creates an empty place in memory for the data to be added (i.e. you are putting your new data INTO an empty dataframe), whereas when you do pd.concat([pd.Series(), x.rename(sim_count)], axis1) you are telling pandas that the empty series (pd.Series()) is important and should be retained, and that the new data should be added ONTO "a". Hence the column of NaNs.

Difference between two different pandas columns

I build a function for do a difference between two different pandas columns from different data set. First data set contain the predict value and the second data set contain observed. The problem is that the row of two data set are different and for do a difference I must use the ID of row.
The function is:
def difference(data1,data2):
for i in range(data1.shape[0]):
e_id=data2.iloc[i,0]
p_oss =data1.iloc[int(e_id),9]
diff= p_oss - data2.iloc[i,1]
return diff
difference(df,evaluation)
Where: data1: observed value data2:predict value
The error of functionis:
IndexError: single positional indexer is out-of-bounds
The observed data set is structured like this:
ID Attribute1 Attribute2 ... Prime
1 N 10 123
2 S 10 128
3 N 8 26
4 S 12 567
..
n N 15 5
The predict data set is structured like this:
ID Prime
4 566.89
1 123.03
2 127.95
3 26.01
...
The ID of predict data set change because I use a function (train_test_split) to split the originally df in train e test set.
I want an output like this:
ID difference
1 0.03
2 0.05
3 0.1
4 0.11
..

Resources