Concat dataframe to multi index dataframe with gradient values - python-3.x
I have a Multi-index dataframe with multiple test result values.
For further data analysis I want to add the derivative to the dataframe.
I tried to calculate it with a lambda function directly after grouping the dataframe. Grouping (mean values) is required because of the noise in the sampling.
Later I want to delete the rows from my dataframe where the derivative is <= 0.
The simplified Multi-index dataframe looks like this:
import pandas as pd
arrays = [['LS13', 'LS13', 'LS13', 'LS13','LS14','LS14','LS14','LS14','LS14','LS14','LS14','LS14'],[0, 2, 2.5, 3,0,2,5,5.5,6,6.5,7,7.5]]
index = pd.MultiIndex.from_arrays(arrays, names=('File', 'Flow Rate Setpoint [l/s]'))
df = pd.DataFrame({('Flow Rate [l/s]','mean') : [-0.057,2.089,2.496,3.011,0.056,2.070,4.995,5.519,6.011,6.511,7.030,7.499],('Time [s]','mean') : [42.225,104.909,165.676,226.446,42.225,104.918,469.560,530.328,591.100,651.864,712.660,773.034],('Shear Stress [Pa]','mean') : [-0.698,5.621,7.946,11.278,-0.774,6.557,40.610,48.370,54.685,58.414,58.356,56.254]},index=index)
If I run my code:
import numpy as np
xls = ['LS13', 'LS14']
gradient = [pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls]
Now I want to concat gradient to df on axis=1. The column title could be df[('Gradient','values')].
So my pd.Series looks like:
Gradient
values
0 0.100808
1 0.069048
2 0.04654
3 0.054801
0 0.116941
1 0.087431
2 0.149521
3 0.115805
4 0.082639
5 0.030213
6 -0.017938
7 -0.034806
The next step would be to drop the rows where ('Gradient','values') <= 0; in my example these are the 'LS14' rows at setpoints 7 and 7.5.
When I tried to concatenate both Dataframe df and Series gradient (I'm aware that the indexes are different)
merged = pd.concat([pd.DataFrame(df),pd.Series(gradient)], axis=1 , ignore_index = True)
Errors are usually one of the following:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
TypeError: cannot concatenate object of type "<class 'list'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I would also assume there is an easier way to get this done with a lambda function that is simply applied in place.
merged = pd.concat([df, pd.Series([gradient], name=('Gradient','value'))], axis=1)
I would have expected that to work, but I also get a mismatch error:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
When I try:
df[("Gradient","value")] =pd.Series([pd.Series(np.gradient(df.loc[(i),('Shear Stress [Pa]','mean')],df.loc[(i),('Time [s]','mean')])) for i in xls])
The 'Gradient','value' column gets correctly added to the dataframe but the values are again NaN.
You can try groupby().apply():
def get_gradients(x):
    gradients = np.gradient(x[('Shear Stress [Pa]', 'mean')], x[('Time [s]', 'mean')])
    return pd.Series(gradients, index=x.index)

df[('Gradient','Value')] = (df.groupby('File', group_keys=False)
                              .apply(get_gradients)
                           )
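The sketch below puts the answer together with the question's stated next step (dropping rows whose gradient is <= 0). It rebuilds a reduced version of the sample frame (only the time and shear stress columns) so it can run on its own. The variable name `filtered` is an assumption made for illustration. The earlier attempts most likely produced NaN because assigning a Series to a column aligns on the index, and a Series with a default 0..n index does not line up with the (File, setpoint) MultiIndex; returning a Series that reuses each group's index avoids that.

```python
import numpy as np
import pandas as pd

arrays = [
    ["LS13"] * 4 + ["LS14"] * 8,
    [0, 2, 2.5, 3, 0, 2, 5, 5.5, 6, 6.5, 7, 7.5],
]
index = pd.MultiIndex.from_arrays(arrays, names=("File", "Flow Rate Setpoint [l/s]"))
df = pd.DataFrame(
    {
        ("Time [s]", "mean"): [42.225, 104.909, 165.676, 226.446, 42.225, 104.918,
                               469.560, 530.328, 591.100, 651.864, 712.660, 773.034],
        ("Shear Stress [Pa]", "mean"): [-0.698, 5.621, 7.946, 11.278, -0.774, 6.557,
                                        40.610, 48.370, 54.685, 58.414, 58.356, 56.254],
    },
    index=index,
)


def get_gradients(group):
    # d(shear stress)/d(time) within one file, keeping the group's index labels
    values = np.gradient(
        group[("Shear Stress [Pa]", "mean")].to_numpy(),
        group[("Time [s]", "mean")].to_numpy(),
    )
    return pd.Series(values, index=group.index)


# The returned Series carries the original MultiIndex, so assignment aligns row by row
df[("Gradient", "Value")] = df.groupby("File", group_keys=False).apply(get_gradients)

# Keep only rows with a strictly positive gradient (drops rows where it is <= 0)
filtered = df[df[("Gradient", "Value")] > 0]
print(filtered)
```

With the sample data, the two rows removed are the LS14 rows at setpoints 7 and 7.5, which matches the rows the question wanted to drop.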
Related
Pandas Dataframe Matrix Multiplication using @
I am attempting to perform matrix multiplication between a Pandas DataFrame and a Pandas Series. I've set them up like so:
c = pd.DataFrame({
    "s1":[.04, .018, .0064],
    "s2":[.018, .0225, .0084],
    "s3":[.0064, .0084, .0064],
})
x = pd.Series([0,0,0], copy = False)
I want to perform x @ c @ x, but I keep getting a ValueError: matrices are not aligned error. Am I not setting up my matrices properly? I'm not sure where I am going wrong.
x @ c returns a Series object whose index differs from that of x. You can use the underlying NumPy arrays to do the calculation:
(x @ c).values @ x.values
# 0.39880000000000004
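As a small illustration of the alignment issue described in this answer, the sketch below uses the matrix from the question with made-up values for x (the vector [1, 2, 3] is an assumption, not taken from the question). It shows that the intermediate product is labelled by the column names of c, and that dropping to NumPy arrays sidesteps label alignment.

```python
import numpy as np
import pandas as pd

c = pd.DataFrame({
    "s1": [0.04, 0.018, 0.0064],
    "s2": [0.018, 0.0225, 0.0084],
    "s3": [0.0064, 0.0084, 0.0064],
})
x = pd.Series([1.0, 2.0, 3.0])  # illustrative values only

# The intermediate result is indexed by c's column labels (s1, s2, s3),
# which no longer match x's default 0, 1, 2 index.
partial = x @ c
print(partial.index.tolist())

# Multiplying the raw arrays ignores labels and works positionally.
result = partial.to_numpy() @ x.to_numpy()

# Equivalent computation done entirely in NumPy, for comparison
expected = x.to_numpy() @ c.to_numpy() @ x.to_numpy()
print(result, expected)
```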
Floating point error when converting numpy ndarray to pandas dataframe index
I am creating an index for my dataframe in steps of 0.01 using numpy.linspace:
new_index = np.linspace(0, 75.6 // 0.01 / 100, 7561)
This gives me the expected result, an array containing 7561 values in 0.01 steps from 0 to 75.60:
print(new_index)
[0.000e+00 1.000e-02 2.000e-02 ... 7.558e+01 7.559e+01 7.560e+01]
However, when I use this ndarray as the index for my dataframe, the numbers are slightly off:
testdf = pd.DataFrame(np.ones(7561), index=np.linspace(0, tables[0]["Dehnung"].max() // 0.01 / 100, 7561), columns=["Ones"])
display(testdf)
print(testdf.index)
This causes the resulting dataframe to behave badly when I concatenate it with other dataframes that have the same index in 0.01 steps. Is there any way I can fix this?
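One possible approach, assuming the mismatch comes from ordinary binary floating-point rounding (0.01 has no exact binary representation), is to round every index to the grid resolution before combining frames. The sketch below is not from the original thread; the frame and variable names are hypothetical, and the end value 75.6 is taken from the question's description.

```python
import numpy as np
import pandas as pd

n = 7561  # number of 0.01 steps from 0 to 75.60, as in the question

# Two ways of producing "the same" grid can differ in the last floating-point bits.
from_linspace = np.linspace(0, 75.6, n)
from_arange = np.arange(n) / 100

# Rounding both to the grid resolution makes the labels identical.
idx_a = pd.Index(np.round(from_linspace, 2))
idx_b = pd.Index(np.round(from_arange, 2))

a = pd.DataFrame({"a": np.ones(n)}, index=idx_a)
b = pd.DataFrame({"b": np.zeros(n)}, index=idx_b)

# With matching labels, concat aligns row by row instead of creating extra rows.
merged = pd.concat([a, b], axis=1)
print(merged.shape)
```

An alternative design, under the same assumption, would be to keep an integer index (hundredths) and divide by 100 only for display, which avoids float labels altogether.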
What type of data does the predict method of a sklearn LinearRegression instance accept as parameters?
I am working through a linear regression example with scikit-learn, but I am confused about the predict method. In the scikit-learn documentation you will see this:
my_Linear_Model.predict(self, X)
Parameters: X : array_like or sparse matrix, shape (n_samples, n_features)
Samples.
Note: array_like does not give me enough information about what type of data the predict method can receive. Remember that with pandas we deal with Series and DataFrame objects. I want to know the different types of array the predict method can receive.
Note: array_like does not give me enough information about what type of data the predict method can receive. Remember that with pandas we deal with Series and DataFrame objects.
For linear regression in scikit-learn, your columns need to be numeric (int or float); other types cannot be read.
If you have a dataframe df:
df
A  B   C  target
1  1   1       1
2  3  -1      10
you select your array X directly as columns from your dataframe:
X = df[['A', 'B', 'C']]
Your target variable y is also selected from your dataframe:
y = df[['target']]
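As a rough way to see what "array_like with shape (n_samples, n_features)" means in practice, the sketch below checks which objects NumPy can convert to a 2-D numeric array. It uses only NumPy and pandas; scikit-learn's own input validation does more than np.asarray, so this is an approximation rather than a description of the library's internals. The small frame mirrors the example in the answer, and the names `candidates` and `single_feature` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [1, 3], "C": [1, -1], "target": [1, 10]})

# Different containers holding the same two samples with three features each
candidates = {
    "DataFrame": df[["A", "B", "C"]],
    "ndarray": df[["A", "B", "C"]].to_numpy(),
    "list of lists": [[1, 1, 1], [2, 3, -1]],
}

# Each converts to a numeric array of shape (n_samples, n_features)
shapes = {name: np.asarray(X).shape for name, X in candidates.items()}
print(shapes)

# A single Series is 1-D, so one feature needs reshaping to (n_samples, 1)
single_feature = np.asarray(df["A"]).reshape(-1, 1)
print(single_feature.shape)
```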
'numpy.ndarray' object has no attribute 'iterrows' while predicting value using lstm in python
I have a dataset with four inputs and am trying to predict the next value of X1 from the combination of previous input values. My four inputs are X1, X2, X3, X4. To predict the next X1, these four inputs are combined as:
X1 + X2 - X3 - X4
I wrote this code inside the class. Then I wrote the code to run the LSTM. After that I wrote the code to predict the value, and it gave me this error. Can anyone help me solve this problem?
My code:
def model_predict(data):
    pred=[]
    for index, row in data.iterrows():
        val = row['X1']
        if np.isnan(val):
            data.iloc[index]['X1'] = pred[-1]
            row['X1'] = pred[-1]
        f = row['X1','X2','X3','X4']
        s = row['X1'] - row['X2'] + row['X3'] -row['X4']
        val = model.predict(s)
        pred.append(val)
    return np.array(pred)
After the LSTM code I wrote the code to predict the value:
pred = model_predict(x_test_n)
This gave me the error:
---> 5 pred = model_predict(x_test_n)
def model_predict(data):
    pred=[]
--> for index, row in data.iterrows():
        val = row['X1']
        if np.isnan(val):
AttributeError: 'numpy.ndarray' object has no attribute 'iterrows'
Apparently, the data argument of your function is a NumPy array, not a DataFrame. As an np.ndarray, data also has no named columns.
One possible solution, keeping the argument as np.ndarray, is to:
iterate over rows of this array using np.apply_along_axis(),
refer to columns by indices (instead of names).
Another solution is to create a DataFrame from data, setting proper column names, and iterate over its rows.
One possible way to write the code without a DataFrame
Assume that data is a NumPy array with 4 columns, containing respectively X1, X2, X3 and X4:
[[ 1  2  3  4]
 [10  8  1  3]
 [20  6  2  5]
 [31  3  3  1]]
Then your function can be:
def model_predict(data):
    s = np.apply_along_axis(lambda row: row[0] + row[1] - row[2] - row[3], axis=1, arr=data)
    return model.predict(s)
Note that:
s - all input values to your model - can be computed in a single instruction, calling apply_along_axis for each row (axis=1),
the predictions can also be computed "all at once", passing a NumPy vector - just s.
For demonstration purposes, compute s and print it.
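A minimal sketch of that demonstration, using the 4-column sample array from this answer. The model itself is left out because it is not shown in the thread. The column-slicing variant at the end (`s_vec`) is an addition for comparison, not part of the original answer.

```python
import numpy as np

# Sample array: columns are X1, X2, X3, X4
data = np.array([
    [1, 2, 3, 4],
    [10, 8, 1, 3],
    [20, 6, 2, 5],
    [31, 3, 3, 1],
])

# Apply X1 + X2 - X3 - X4 to every row (axis=1 hands each row to the lambda)
s = np.apply_along_axis(lambda row: row[0] + row[1] - row[2] - row[3], axis=1, arr=data)
print(s)  # [-4 14 19 30]

# The same result with plain column slicing, without a per-row Python call
s_vec = data[:, 0] + data[:, 1] - data[:, 2] - data[:, 3]
print(np.array_equal(s, s_vec))
```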
Find SARIMAX AIC and pdq values, statsmodels
I'm trying to find the values of p, d, q and the seasonal values of P, D, Q using statsmodels (imported as sm) in Python. The data set I'm using is a CSV file that contains time series data of energy consumption recorded over three years. The file was split into a smaller data frame in order to work with it. Here is what df_test.head() looks like:
                      time  total_consumption
122400 2015-05-01 00:01:00            106.391
122401 2015-05-01 00:11:00            120.371
122402 2015-05-01 00:21:00            109.292
122403 2015-05-01 00:31:00             99.838
122404 2015-05-01 00:41:00             97.387
Here is my code so far:
#Importing the timeserie data set from local file
df = pd.read_csv(r"C:\Users\path\Name of the file.csv")
#Rename the columns, put time as index and assign datetime to the column time
df.columns = ["time","total_consumption"]
df['time'] = pd.to_datetime(df.time)
df.set_index('time')
#Select test df (there is data from the 2015-05-01 2015-06-01)
df_test = df.loc[(df['time'] >= '2015-05-01') & (df['time'] <= '2015-05-14')]
#Find minimal AIC value for the ARIMA model integers
p = range(0,2)
d = range(0,2)
q = range(0,2)
pdq = list(itertools.product(p,d,q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p,d,q))]
warnings.filterwarnings("ignore")
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(df_test, order=param, seasonal_order=param_seasonal, enforce_stationarity=False, enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
When I try to run the code as it is, the program doesn't even seem to enter the for loop. But when I take out the try: except: continue, the program gives me this error message:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
How can I remedy that, and is there a way to automate the process so that it directly outputs the parameters with the lowest AIC value (without having to look through all the possibilities)? Thanks!