Appending PyTorch objects to a pandas DataFrame - python-3.x

I want to append the tensor objects c and s to the empty DataFrame df_data1_cluster:
df_data1_cluster = pd.DataFrame(columns = ["cluster", "text"])
label, center = detect_clusters(torch.as_tensor(embeddings), 50)
for c, s in zip(label, phrases):
    df_data1_cluster.append(c,s)
This results in an error:
TypeError: cannot concatenate object of type '<class 'torch.Tensor'>'; only Series and DataFrame objs are valid

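Hi, this is a pandas problem rather than a PyTorch one: the append command requires you to append a DataFrame, a Series, or a dict-like row, not a tensor (and DataFrame.append was removed in pandas 2.0). There are other ways to insert a row into a DataFrame, for example assigning to a new label with .loc:
import pandas as pd
import torch
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})
# Add a row at the next integer label; .item() turns a 0-d tensor into a plain Python number
df.loc[len(df)] = [torch.tensor(4).item(), 'green']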
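A minimal sketch of the loop from the question, assuming label is a 1-D tensor of cluster ids and phrases is a list of strings of the same length. The values below are made-up stand-ins so the snippet runs without torch; with a real tensor, label.tolist() should give plain Python ints. Collecting the rows first and building the DataFrame once avoids growing it row by row.

```python
import pandas as pd

# Hypothetical stand-ins for the question's data; with a real tensor use label.tolist()
label = [0, 1, 0]
phrases = ["red apple", "blue sky", "green apple"]

# Build one dict per row, then create the DataFrame in a single step
rows = [{"cluster": c, "text": s} for c, s in zip(label, phrases)]
df_data1_cluster = pd.DataFrame(rows, columns=["cluster", "text"])
print(df_data1_cluster)
```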

Related

Pandas Series of dates to vlines kwarg in mplfinance plot

import numpy as np
import pandas as pd
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'],
                   'w': [5, 7],
                   'n': [11, 8]})
df.reset_index()
print(list(df.loc[:,'dt'].values))
gives: ['2021-2-13', '2022-2-15']
NEEDED: [('2021-2-13'), ('2022-2-15')]
Important (in reply to a question in the comments): the "NEEDED" format is what mplfinance accepts for the vlines argument of plot (I checked). I need to draw vertical lines at specified dates on the x-axis of the chart.
import mplfinance as mpf
RES['Date'] = RES['Date'].dt.strftime('%Y-%m-%d')
my_vlines=RES.loc[:,'Date'].values # does NOT work
fig, axlist = mpf.plot( ohlc_df, type="candle", vlines= my_vlines, xrotation=30, returnfig=True, figsize=(6,4))
It only works with an explicit list: my_vlines= [('2022-01-18'), ('2022-02-25')]
SOLVED: Oh, it turns out to be simple after all:
my_vlines=list(RES.loc[:,'Date'].values)
Your question asks for a list of NumPy arrays, but your desired output looks like tuples. If you need tuples, note that it is the comma that makes the tuple, not the parentheses, so you would do something like this:
desired_format = [(x,) for x in list(df.loc[:,'dt'].values)]
If you want NumPy arrays, you could do this:
desired_format = [np.array(x) for x in list(df.loc[:,'dt'].values)]
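To illustrate the point about the comma, here is a small sketch using the dates from the question. It shows that ('2021-2-13') is just a parenthesized string, which is why the plain list in the "SOLVED" edit already matches the "NEEDED" format.

```python
dates = ['2021-2-13', '2022-2-15']

# Parentheses alone do not create a tuple, so this is the same list of strings
plain = [(x) for x in dates]

# The trailing comma creates one-element tuples
tuples = [(x,) for x in dates]

print(plain == dates)
print(type(plain[0]), type(tuples[0]))
```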
I think I understand your problem. Please see the example code below and let me know if this resolves your problem. I expanded on your dataframe to meet mplfinance plot criteria.
import pandas as pd
import numpy as np
import mplfinance as mpf
df = pd.DataFrame({'dt': ['2021-2-13', '2022-2-15'],'Open': [5,7],'Close': [11, 8],'High': [21,30],'Low': [7, 3]})
df['dt']=pd.to_datetime(df['dt'])
df.set_index('dt', inplace = True)
mpf.plot(df, vlines = dict(vlines = df.index.tolist()))

Join common text to multiple dataframes in one syntax

Is there a way to join text to several DataFrames in one statement? I have 10 different DataFrames to which I need to join the same text. So far I have done it separately for each one, e.g.:
import pandas as pd
df1 = pd.DataFrame({'Col1': [40, 30, 20], 'COl2': [50, 10, 5]})
name = ['Sam']
lis = []
for i in name:
    lis.append(i)
df = pd.DataFrame({'i': lis}) #Creating a dataframe to append the name
df1 = df1.join(df)
df1.join(df)
df2.join(df) ...... so on
I want to do it in one statement, by making a list of DataFrames and joining the text to all of them:
[df1,df2,df3,df4].join(df)
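A list has no join method, but a list comprehension gets close to the one-statement form asked for. This is a sketch only: df2 is a made-up second frame, and the choice between join and assign depends on whether the text should appear in one row or in every row.

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': [40, 30, 20], 'COl2': [50, 10, 5]})
df2 = pd.DataFrame({'Col1': [1, 2, 3], 'COl2': [4, 5, 6]})  # hypothetical second frame
df = pd.DataFrame({'i': ['Sam']})

frames = [df1, df2]

# join aligns on the index, so 'Sam' lands in row 0 and the other rows get NaN
joined = [d.join(df) for d in frames]

# Alternative: assign broadcasts the same text to every row
labelled = [d.assign(i='Sam') for d in frames]

print(joined[0])
print(labelled[0])
```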

Iterating over columns from two dataframes to estimate correlation and p-value

I am trying to estimate Pearson's correlation coefficient and p-value from the corresponding columns of two DataFrames. I have managed to write this code so far, but it only gives me the result for the last columns. I need some help with this code. I also want to save the outputs in a new DataFrame.
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame(pd.read_excel('15_Oct_Yield_A.xlsx'))
df_2= pd.DataFrame(pd.read_excel('Oct_Z_index.xlsx'))
for column in df_1.columns[1:]:
    for column in df_2.columns[1:]:
        x = (df_1[column])
        y = (df_2[column])
    correl = stats.pearsonr(x, y)
Your looping setup is incorrect in a couple of ways. You are using the same variable name in both for-loops, so the inner loop overwrites the outer loop's variable. You are also computing correl outside of your inner loop.
What you want to do is loop over the columns with 1 loop, assuming that both data frames have the same column names. If they do not, you will need to take extra steps to find the common column names and then iterate over them.
Something like this should work:
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame({ 'A': ['dog', 'pig', 'cat'],
                      'B': [0.25, 0.50, 0.75],
                      'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({ 'A': ['bird', 'monkey', 'rat'],
                      'B': [0.20, 0.60, 0.90],
                      'C': [0.80, 0.50, 0.10]})
results = dict()
for column in df_1.columns[1:]:
    correl = stats.pearsonr(df_1[column], df_2[column])
    results[column] = correl
print(results)
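The question also asks to save the outputs in a new DataFrame. A minimal sketch of one way to do that, reusing the example frames from the answer and assuming both frames share the same column names. The output column names r and p_value are illustrative choices, not something from the original post.

```python
import pandas as pd
import scipy.stats as stats

df_1 = pd.DataFrame({'A': ['dog', 'pig', 'cat'],
                     'B': [0.25, 0.50, 0.75],
                     'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({'A': ['bird', 'monkey', 'rat'],
                     'B': [0.20, 0.60, 0.90],
                     'C': [0.80, 0.50, 0.10]})

rows = []
for column in df_1.columns[1:]:
    # pearsonr returns the correlation coefficient and the two-sided p-value
    r, p = stats.pearsonr(df_1[column], df_2[column])
    rows.append({'column': column, 'r': float(r), 'p_value': float(p)})

# One row per compared column, indexed by column name
summary = pd.DataFrame(rows).set_index('column')
print(summary)
```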

How do I convert a pandas DataFrame into a NumPy array

Below is a snippet that converts data into a NumPy array. It is then converted to a Pandas DataFrame where I intend to process it. I'm attempting to convert it back to a NumPy array. I'm failing at this. Badly.
import pandas as pd
import numpy as np
from pprint import pprint
data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0)
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
npArray = np.asarray(data, coordinatesType)
df = pd.DataFrame(data = npArray)
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_numpy(coordinatesType)
pprint(mutatedNpArray)
# don't supply dtype for kicks
pprint(df.to_numpy())
This yields crazytown:
array([[('2020-11-01T00:00:00', 1.6041888e+18),
('1970-01-01T00:00:01', 1.0000000e+00)],
[('2020-11-02T00:00:00', 1.6042752e+18),
('1970-01-01T00:00:02', 2.0000000e+00)]],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
array([[Timestamp('2020-11-01 00:00:00'), 1.0],
[Timestamp('2020-11-02 00:00:00'), 2.0]], dtype=object)
I realize a DataFrame is really a fancy NumPy array under the hood, but I'm passing back to a function that accepts a simple NumPy array. Clearly I'm not handling dtypes correctly and/or I don't understand the data structure inside my DataFrame. Below is what the function I'm calling expects:
array([('2020-11-01T00:00:00', 1.000 ),
       ('2020-11-02T00:00:00', 2.000 )],
      dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
I'm really lost on how to do this. Or what I should be doing instead.
Help!
As @hpaul suggested in the comments, I tried the following:
# ...
df = df.set_index('timestamp')
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_records(coordinatesType)
# ...
All good!
Besides the to_records approach mentioned in comments, you can do:
df.apply(tuple, axis=1).to_numpy(coordinatesType)
Output:
array([('2020-11-01T00:00:00', 1.), ('2020-11-02T00:00:00', 2.)],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
Considerations:
I believe the issue here is related to the difference between the original array and the dataframe.
The shape of your original NumPy array is (2,), where each value is a tuple. When creating the DataFrame, both df.shape and df.to_numpy().shape are (2, 2), so the structured dtype is applied to every cell rather than to every row and does not work as expected. When converting rows to tuples in a pd.Series, you get back the original shape of (2,).
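The shape difference described above can be checked directly. This is a small sketch using the question's data (with the ISO "T" separator in the timestamps); it only compares shapes and field names, and the exact datetime unit of the record array may depend on the pandas version.

```python
import numpy as np
import pandas as pd

coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
data = [('2020-11-01T00:00:00', 1.0), ('2020-11-02T00:00:00', 2.0)]

# Structured array: one record per row
npArray = np.asarray(data, dtype=coordinatesType)
df = pd.DataFrame(data=npArray)

# Plain to_numpy(): one cell per row and column
flat = df.to_numpy()

# to_records(): back to one record per row, with named fields
records = df.to_records(index=False)

print(npArray.shape, flat.shape, records.shape)
print(records.dtype.names)
```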

Use lmfit Model - function has dataframe as argument

I want to use lmfit in order to fit my data.
The function I am using has only one argument, features. The contents of features vary (both columns and values), so I can't initialize parameters.
I tried to create a dataframe as here, but I can't use the guess method because this is for LorentzianModel and I just want to use Model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
from sklearn.linear_model import LinearRegression
df = {'a': [0, 0.2, 0.3], 'b':[14, 10, 9], 'target':[100, 200, 300]}
df = pd.DataFrame(df)
X = df[['a', 'b']]
y = df[['target']]
model = LinearRegression().fit(X, y)
features = pd.DataFrame({"a": np.array([0, 0.11, 0.36]),
                         "b": np.array([10, 14, 8])})
def eval_custom(features):
    res = model.predict(features)
    return res
x_val = features[["a"]].values
def calling_func(features, x_val):
    pred_custom = eval_custom(features)
    df = pd.DataFrame({'x': np.squeeze(x_val), 'y': np.squeeze(pred_custom)})
    themodel = lmfit.Model(eval_custom)
    params = themodel.guess(df['y'], x=df['x'])
    result = themodel.fit(df['y'], params, x = df['x'])
    result.plot_fit()
calling_func(features, x_val)
The model function needs to take independent variables and the individual model parameters as arguments. You're wrapping all of that into a single pandas Dataframe and then sending that. Don't do that.
If you need to create a dataframe from the current values of the model, do that inside your model function.
Also: a generic model function does not have a working guess function. Use model.make_params() and definitely, definitely (no exceptions, nope not ever) provide actual initial values for every parameter.
