sklearn LedoitWolf covariance with missing - scikit-learn

I am trying to estimate the shrunk covariance matrix with the Ledoit and Wolf (2004) method for the dataframe test:
from sklearn.covariance import LedoitWolf
LedoitWolf().fit(test).covariance_
I get the following error message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
In fact, test looks like the following:
array([[ 0.003675, -0.085502, 0.023041, 0.0112 , -0.0048 , -0.022344,
-0.0048 , -0.1328 , nan, -0.152373, 0.061867],
[ 0.038778, 0.187608, 0.065888, 0.137032, -0.0047 , 0.081014,
-0.215226, 0.018556, nan, 0.179208, 0.047932],
[ 0.0466 , 0.012729, 0.128468, 0.162117, 0.020236, 0.138267,
0.5966 , 0.132964, nan, 0.054852, 0.0216 ],
[ 0.011873, -0.188127, 0.027068, 0.102509, 0.405091, 0.141985,
-0.212333, -0.012 , nan, 0.241872, -0.05522 ],
[ 0.02705 , -0.180671, -0.0042 , -0.014895, 0.382897, -0.080633,
-0.109463, -0.126649, nan, 0.099504, 0.164631],
[-0.0037 , -0.122748, -0.022748, -0.068565, -0.131142, -0.101602,
0.172771, -0.026956, nan, 0.278179, -0.0037 ],
[ 0.026003, 0.087592, -0.031484, 0.180671, nan, 0.135235,
-0.2043 , -0.061443, nan, 0.17214 , 0.004589],
[ 0.010006, -0.1797 , 0.035704, 0.083105, -0.166862, 0.008905,
-0.0672 , -0.0047 , nan, -0.044968, 0.006411],
[-0.1294 , -0.2544 , -0.00242 , 0.005452, -0.0044 , -0.031986,
-0.071067, -0.069265, nan, 0.021406, 0.040657],
[-0.030886, -0.1291 , 0.0159 , -0.121173, nan, -0.060838,
-0.075529, 0.025312, nan, 0.033636, -0.037433],
[ 0.014349, 0.138857, 0.005804, -0.169746, nan, 0.133405,
0.149846, 0.024571, -0.026772, -0.023394, 0.110943],
[-0.040036, 0.746 , 0.182408, 0.04898 , nan, 0.103383,
0.062667, 0.012667, -0.025277, 0.045689, 0.118887],
[ 0.054652, -0.241695, 0.128631, -0.016179, nan, 0.081248,
-0.1286 , -0.031378, 0.039878, 0.114743, 0.005659],
[-0.058546, 0.40275 , -0.061894, -0.016239, 0.030048, -0.105195,
0.067929, -0.0035 , -0.0035 , 0.033325, -0.06772 ],
[ 0.132122, 0.049533, -0.038684, -0.081219, -0.0038 , 0.008779,
0.062867, -0.015229, -0.0038 , -0.034569, 0.00522 ]])
Is there a workaround to use LedoitWolf with missing data, other than imputing the missing values with (say) the median or mean? Unfortunately, I am not allowed to make arbitrary assumptions about the missing data.
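One option that avoids inventing values, sketched here rather than taken from the question, is to discard the incomplete parts of the data before fitting. In test, only the last two rows are complete, so plain row-wise deletion would leave too little data. Dropping the mostly empty ninth column first keeps 10 of the 15 rows. The helper name drop_sparse_then_incomplete and the max_nan_frac threshold are hypothetical, and deletion is only harmless if values are missing completely at random, which is itself an assumption.

```python
import numpy as np

def drop_sparse_then_incomplete(X, max_nan_frac=0.5):
    """Drop columns that are mostly NaN, then drop rows that still contain NaN.

    Hypothetical helper: the 0.5 threshold is an arbitrary illustration.
    """
    X = np.asarray(X, dtype=float)
    nan_mask = np.isnan(X)
    # Keep columns whose fraction of NaN entries is at most max_nan_frac.
    keep_cols = nan_mask.mean(axis=0) <= max_nan_frac
    # Among the kept columns, keep only rows without any NaN.
    keep_rows = ~nan_mask[:, keep_cols].any(axis=1)
    return X[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols

# Small made-up array standing in for `test`.
demo = np.array([
    [1.0, np.nan, 0.2],
    [0.5, np.nan, 0.1],
    [0.3, np.nan, np.nan],
    [0.9, 1.5, 0.4],
])
clean, rows, cols = drop_sparse_then_incomplete(demo)
print(clean.shape)
```

The cleaned array contains no NaN, so it can be passed to LedoitWolf().fit(clean). The resulting covariance covers only the retained columns.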

Related

Error parsing parquet file and importing data to sql table using pandas dataframe

I get the error below while trying to read a parquet file and import the data into a Microsoft SQL Server table. I couldn't figure out where the problem is; any suggestions would be helpful.
code snippet:
from pandas import read_parquet
import numpy as np
df = read_parquet('<file-path>', engine='fastparquet')
df = df.fillna(value=np.nan)
cols = "],[".join([str(i) for i in df.columns.tolist()])
for index, row in df.iterrows():
    sql = "INSERT INTO " + obj.table + "([" + cols + "]) VALUES (" + "?," * (len(row) - 1) + "?)"
    try:
        cur.execute(sql, tuple(row))
    except:
        print(sql, tuple(row))
        tb.print_exc()
        break
conn.commit()
Error (printed record that failed insert):
<For debugging>
INSERT INTO table(col1, col2 col3 ...) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) (294958881.0, 0.0, 71142106.0, 5.0, 4.0, 4.0, 0.0, 4804102.0, 1.0, 1.0, Timestamp('2020-01-30 12:00:01.590000'), Timestamp('2020-01-30 12:05:12.480000'), nan, nan, nan, nan, nan, Timestamp('2020-01-30 12:05:00'), Timestamp('2020-01-30 12:05:12.420000'), Timestamp('2020-01-30 12:05:12.420000'), nan, nan, nan, 130864939.0, 1.0, 1.0, 0.0, 0.0, 253199575.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, nan, 0.0, 0.0, nan, nan, nan, nan, 1.0, 1.0, '63D861B100000001', Timestamp('2023-01-31 00:32:49.265000'), Timestamp('2023-01-31 00:32:49'))
Traceback (most recent call last):
File "C:\temp\file.py", line 293, in method01
cur.execute(sql, tuple(row))
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect.
Parameter 16 (""): The supplied value is not a valid instance of data type float. Check the source data for invalid values.
An example of an invalid value is data of numeric type with scale greater than precision. (8023) (SQLExecDirectW)')
I don't see any invalid float values of the kind the error reports, so I am not sure what I'm missing. All my target columns are nullable.
I have the following two methods working, with some caveats:
Method 1: (Based on suggestion in one of the comments)
Using SqlAlchemy, pyodbc
import sqlalchemy as sa, pyodbc
con = sa.create_engine('mssql+pyodbc://'
+self.server+'/'
+self.database+
'?driver=SQL+Server+Native+Client+11.0'
, fast_executemany=True
, echo=False)
from pandas import read_parquet
df = read_parquet('<file-path>', engine='fastparquet')
df.to_sql(name=self.table, con=con, if_exists='append', chunksize=1000, index=False)
con.dispose()
Observations using method 1:
Slow loading.
No issues with NaN values from pandas df (My original problem).
5x slower than method 2, when tested with 100MB compressed parquet file.
Method 2: (If you are not worried about NaNs in the dataframe and need a faster load time)
using pyodbc
import pyodbc
con = pyodbc.connect("DRIVER={SQL Server Native Client 11.0};"
"SERVER=" + self.server + ";"
"DATABASE=" + self.database + ";"
"Trusted_Connection=yes;")
cur = con.cursor()
from pandas import read_parquet
df = read_parquet('<file-path>', engine='fastparquet')
df.fillna(value='', inplace=True)
# Column list for the INSERT statement, built as in the first snippet
cols = "],[".join([str(i) for i in df.columns.tolist()])
for index, row in df.iterrows():
    sql = "INSERT INTO " + self.table + "([" + cols + "]) VALUES (" + "?," * (len(row) - 1) + "?)"
    cur.execute(sql, tuple(row))
con.commit()
con.close()
Observations using method 2:
Fast loading.
No issues with NaN values, as we are replacing them with empty strings.
5x faster than method 1, when tested with 100MB compressed parquet file.
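The error message is consistent with float NaN values being passed as query parameters, which SQL Server does not accept for a float column. A possible alternative to filling with empty strings, sketched under that assumption, is to map NaN to None so that pyodbc sends NULL instead. The helper name nan_to_none is hypothetical, and it only handles float NaN (missing timestamps such as NaT would need separate handling).

```python
import math

def nan_to_none(values):
    """Return a tuple with float NaN entries replaced by None.

    Hypothetical helper: None is what DB-API drivers such as pyodbc
    translate to SQL NULL when used as a query parameter.
    """
    return tuple(
        None if isinstance(v, float) and math.isnan(v) else v
        for v in values
    )

# A shortened, made-up row in the style of the failing record above.
row = (294958881.0, float("nan"), "63D861B100000001", 1.0)
params = nan_to_none(row)
print(params)
```

In the loop from the question this would be used as cur.execute(sql, nan_to_none(row)), keeping the nullable target columns NULL rather than empty strings.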

Why does LedoitWolf return zeros for off-diagonal elements?

import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf
a = np.array([[ 0.6278, -1.1273 ],
[ 0.2323, 0.4533 ],
[ 0.3234, 1.5356 ],
[1.7473 , -0.3113 ],
[-0.3525 , 0.2577 ]])
Empirical_cov = EmpiricalCovariance().fit(a)
Empirical_cov.covariance_
array([[ 0.48009421, -0.23144627],
[-0.23144627, 0.77341954]])
LedoitWolf_cov = LedoitWolf().fit(a)
LedoitWolf_cov.covariance_
array([[ 1., -0.],
[-0., 1.]])
Why is LedoitWolf giving me zeros for the off-diagonal elements? I have a few larger datasets in which this happens.
Is this a bug? The empirical covariance is non-zero, so shouldn't LedoitWolf also be non-zero? I read that using LedoitWolf would be slightly more accurate than empirical covariance, but what is the point if most covariance elements turn out to be zero?

Error converting covariance to correlation using scipy

I am trying to convert a covariance matrix (from scipy.optimize.curve_fit) to a correlation matrix using the method here:
https://math.stackexchange.com/questions/186959/correlation-matrix-from-covariance-matrix
My test data is from here https://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html
My code is here
import numpy as np
S = [[1.0, 1.0, 8.1],
[1.0, 16.0, 18.0],
[8.1, 18.0, 81.0] ]
S = np.array(S)
diag = np.sqrt(np.diag(np.diag(S)))
gaid = np.linalg.inv(diag)
corl = gaid * S * gaid
print(corl)
I was expecting to see [[1. 0.25 0.9 ], [0.25 1. 0.5 ], [0.9 0.5 1. ]] but instead get [[1. 0. 0.], [0. 1. 0.], [0. 0. 1.]]. I am obviously doing something silly but just not sure what so all suggestions gratefully received - thanks!
You've probably figured it out by now, but you have to use the @ operator for matrix multiplication in numpy. The operator * performs element-wise multiplication.
So
corl = gaid @ S @ gaid
gives the answer you are looking for.
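As a quick check of the difference, the sketch below runs both products on the test matrix from the question and also shows an equivalent form that divides by the outer product of the standard deviations, which avoids building and inverting a diagonal matrix. The function name cov_to_corr is a hypothetical helper, not part of numpy or scipy.

```python
import numpy as np

def cov_to_corr(S):
    """Convert a covariance matrix to a correlation matrix (hypothetical helper)."""
    S = np.asarray(S, dtype=float)
    std = np.sqrt(np.diag(S))       # standard deviations
    return S / np.outer(std, std)   # corr[i, j] = S[i, j] / (std[i] * std[j])

S = np.array([[1.0, 1.0, 8.1],
              [1.0, 16.0, 18.0],
              [8.1, 18.0, 81.0]])

D_inv = np.diag(1.0 / np.sqrt(np.diag(S)))

corr_matmul = D_inv @ S @ D_inv    # matrix products: the intended result
elementwise = D_inv * S * D_inv    # element-wise products: identity matrix
corr_outer = cov_to_corr(S)        # same result without matrix products

print(corr_matmul)
print(elementwise)
```

The element-wise version collapses to the identity because D_inv is zero off the diagonal, which matches the output reported in the question.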

Keras 'Error when checking input' when trying to predict multiple values

I have a net with a length 4 input vector, length 2 output vector. I am trying to predict multiple inputs simultaneously. If I just want to predict one, I would do the following and it works:
x = numpy.array( [ [1,2,3,4] ] )  # renamed from `in`, which is a reserved word in Python
self.model.predict(x)
# prediction = [ [1,2] ]
However, when I try to pass in multiple inputs I get ValueError: Error when checking input: expected dense_1_input to have shape (4,) but got array with shape (1,)
x = numpy.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
])
# OR
x = numpy.array([
    [[1, 2, 3, 4]],
    [[1, 2, 3, 4]],
])
self.model.predict(x)
# ERR
What am I doing wrong?
Edit:
Code =
model = Sequential()
model.add(Dense(24, input_dim=4, activation='relu'))
model.add(Dense(24, activation='relu'))
model.add(Dense(4, activation='linear'))
model.compile(loss='mse',
optimizer=Adam(lr=self.learning_rate))
print(batch_arr[:,3][0])
predictions = self.model.predict(batch_arr[:,3][0])
print(predictions)
print(batch_arr[:,3])
predictions = model.predict(batch_arr[:,3])
Output =
[[-0.00441936 -0.20398824 -0.08134908 0.09739554]]
[[ 0.01860509 -0.01136071]]
[array([[-0.00441936, -0.20398824, -0.08134908, 0.09739554]])
array([[-0.00517939, 0.38975933, -0.11951023, -0.9718224 ]])
array([[0.00272119, 0.0025476 , 0.002645 , 0.03973542]])
array([[-0.00421809, -0.01006362, -0.07795483, -0.16971247]])
array([[-0.00904593, 0.19332681, -0.10655871, -0.64757587]])
array([[ 0.00654432, 0.00347247, -0.15332555, -0.47302148]])
array([[-0.01921821, -0.17354519, -0.20207744, -0.58569029]])
array([[ 0.00661377, 0.20038962, -0.16278598, -0.80983334]])
array([[-0.00348096, 0.18171964, -0.07072813, -0.38913168]])
array([[-0.01268919, -0.00548544, -0.08286095, -0.27108632]])
array([[ 0.01077598, -0.19254374, -0.004982 , 0.33175341]])
array([[-4.37101750e-04, -5.68196965e-01, -1.99532537e-01,
1.10581883e-01]])
array([[ 0.00657382, -0.19263146, -0.00402872, 0.33368607]])
array([[ 0.00677398, 0.19760551, -0.00076944, -0.25153403]])
array([[ 0.00261579, 0.19642629, -0.13894668, -0.71894379]])
array([[-0.0221003 , 0.37477368, -0.03765055, -0.63564477]])
array([[-0.0110009 , 0.37599703, -0.0574645 , -0.66318148]])
array([[ 0.00277214, 0.19763152, 0.00343971, -0.25211181]])
array([[-9.31810654e-05, -2.06245307e-01, -8.09019674e-02,
1.47356796e-01]])
array([[ 0.00709025, -0.37636771, -0.19725323, -0.11396513]])
array([[ 0.00015344, -0.01233088, -0.07851076, -0.11956039]])
array([[ 0.01077811, -0.18439307, -0.19043179, -0.34107231]])
array([[-0.01460483, 0.18019651, -0.05036345, -0.35505252]])
array([[-0.0127989 , 0.19071515, -0.08828268, -0.58871071]])
array([[ 0.01072609, 0.00249456, -0.00580012, 0.0409061 ]])
array([[ 0.01062156, 0.00782762, -0.17898265, -0.57245695]])
array([[-0.01180104, -0.37085843, -0.1973209 , -0.23782701]])
array([[-0.00849912, -0.00780031, -0.07940117, -0.21980343]])
array([[ 0.00672477, 0.00246062, -0.00160252, 0.04165408]])
array([[-0.02268911, -0.36534914, -0.21379125, -0.36284594]])
array([[-0.00865513, -0.20170279, -0.08379724, 0.0468145 ]])
array([[-0.0256848 , 0.17922475, -0.03098346, -0.33335449]])]
#ERR
Edit: When I print out the shape of batch_arr[:,3] I get (32,), not (32,4) as I expected. So I'm guessing the numpy array does not know the shape of its inner arrays. Is there an easy way to fix that? It might be the root of the problem.
The issue was the way that I had created my numpy array. I created it with indices of variable size, and thus it didn't know it was shaped (32,4), only that it was (32,). Reformulating the logic to ensure that the array is always a set width from the beginning allowed the array to be a (32,4), which allowed the prediction to work.
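If restructuring the array creation is not convenient, another possible route (not the fix described above) is to stack the inner arrays into one regular 2-D array just before predicting. The sketch below reproduces the situation with a small object array of (1, 4) arrays; the variable names are made up for illustration.

```python
import numpy as np

# An object array whose elements are separate (1, 4) arrays,
# similar to what batch_arr[:, 3] looked like in the question.
states = np.empty(3, dtype=object)
for i in range(3):
    states[i] = np.arange(4, dtype=float).reshape(1, 4) + i

print(states.shape)   # numpy only sees the outer dimension

# Stack the inner arrays row by row into one regular float array.
batch = np.vstack(list(states))
print(batch.shape, batch.dtype)
```

The stacked batch has one row per sample and four columns, which is the shape a model built with input_dim=4 expects.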

How to vectorize this function in py3

I seem to be unable to find out how to vectorize this py3 loop
import numpy as np
a = np.array([-72, -10, -70, 37, 68, 9, 1, -3, 2, 3, -6, -4, ], np.int16)
result = np.array([-72, -10, -111, -23, 1, -2, 1, -3, 1, 2, -5, -5, ], np.int16)
b = np.copy(a)
for i in range(2, len(b)):
    b[i] += int( (b[i-1] + b[i-2]) / 2)
assert (b == result).all()
I tried playing with np.convolve and pandas.rolling_apply but couldn't get it working. Maybe this is the time to learn about c-extensions?
It would be great to get the time for this down to something like 50..100ms for input arrays of ~500k elements.
@hpaulj asked in his answer for a closed expression of b[k] in terms of a[:k]. I didn't think it existed, but I worked a bit on it and indeed found that the closed form contains a bunch of Jacobsthal numbers, as @Divakar pointed out.
Here is one closed form:
Here J_n is the n-th Jacobsthal number. Expanding it as
J_n = (2^n - (-1)^n) / 3
one ends up with an expression that I can imagine using in a vectorized implementation ...
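To make the Jacobsthal connection concrete, the sketch below checks the formula for J_n against the recurrence J_n = J_(n-1) + 2 J_(n-2), and then checks one way these numbers show up: if the integer truncation is ignored, the weight that b[i] puts on a[i-k] (for elements past the first two) is J_(k+1) / 2^k. The function names are hypothetical, and this illustrates the float version of the recurrence only, not the closed form referred to above.

```python
def jacobsthal_closed(n):
    """Jacobsthal number via J_n = (2^n - (-1)^n) / 3 (the division is exact)."""
    return (2**n - (-1)**n) // 3

def jacobsthal_recurrence(count):
    """First `count` Jacobsthal numbers via J_n = J_(n-1) + 2 * J_(n-2)."""
    values = [0, 1]
    while len(values) < count:
        values.append(values[-1] + 2 * values[-2])
    return values[:count]

def lag_weights(count):
    """Weight of a[i-k] in b[i] for the float recurrence
    b[i] = a[i] + (b[i-1] + b[i-2]) / 2, for k = 0 .. count-1."""
    weights = [1.0, 0.5]
    while len(weights) < count:
        weights.append(0.5 * (weights[-1] + weights[-2]))
    return weights[:count]

n_terms = 10
closed = [jacobsthal_closed(n) for n in range(n_terms + 1)]
weights = lag_weights(n_terms)
from_jacobsthal = [closed[k + 1] / 2**k for k in range(n_terms)]

print(closed[:8])
print(weights[:5])
```

The weights decay toward 2/3 rather than zero, while the powers of two in the denominators grow without bound, which is why a direct evaluation for 500k elements is awkward.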
Most numpy code operates on the whole array at once. It does iterate in C code, but buffered in such a way that it doesn't matter which element is processed first.
Here changes to b[2] affect the value calculated for b[3], and so on down the line.
add.at and other such ufunc methods do unbuffered calculations. This allows you to add some value repeatedly to one element. I played a bit with it in this case, but no luck so far.
cumsum and cumprod are also handy for problems where values depend on earlier ones.
Is it possible to generalize the calculation, so as to define b[i] in terms of all the a[:i]? We know b[2] as a function of a[:2], but what of b[3]?
Even if we get this working for floats, it might be off when doing integer divisions.
I think you already have the sane solution. Any other vectorization would rely on floating point calculations and it would be really difficult to keep track of the error accumulation. For example say you want to have a matrix vector multiplication: for the first seven terms the matrix would look like
array([[ 1. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0.5 , 0.5 , 1. , 0. , 0. , 0. , 0. ],
[ 0.25 , 0.75 , 0.5 , 1. , 0. , 0. , 0. ],
[ 0.375 , 0.625 , 0.75 , 0.5 , 1. , 0. , 0. ],
[ 0.3125 , 0.6875 , 0.625 , 0.75 , 0.5 , 1. , 0. ],
[ 0.34375, 0.65625, 0.6875 , 0.625 , 0.75 , 0.5 , 1. ]])
The relationship can be described as the iterative formula
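                      [ a[i-2] ]
b[i] = [0.5, 0.5, 1]  [ a[i-1] ]
                      [ a[i]   ]
where a[i-2] and a[i-1] are the entries already updated in earlier steps (that is, b[i-2] and b[i-1]).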
That defines a series of elementary matrices of the form of an identity matrix with
[0 ... 0.5 0.5 1 0 ... 0]
on the i-th row. Successive multiplication gives the matrix above for the first seven terms. There is indeed a subdiagonal structure, but the terms get too small very quickly. As you have shown, 2 to the power 500k is not fun.
In order to keep track of floating point noise, an iterative solution is required which is what you have anyways.
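The construction described above can be sketched for the first seven terms as follows. The code multiplies the elementary matrices together, applies the result to the start of the array from the question, and compares it with the integer loop. The function names are made up for illustration, and building the matrix this way is far too slow for long inputs; it is only meant to show where the float and integer results part ways.

```python
import numpy as np

def recurrence_matrix(n):
    """Product of elementary matrices, each an identity with
    [0.5, 0.5, 1] ending on the diagonal of row i."""
    M = np.eye(n)
    for i in range(2, n):
        E = np.eye(n)
        E[i, i - 2] = 0.5
        E[i, i - 1] = 0.5
        M = E @ M   # apply step i after all earlier steps
    return M

def integer_loop(a):
    """The original loop, with truncation toward zero at every step."""
    b = [int(v) for v in a]
    for i in range(2, len(b)):
        b[i] += int((b[i - 1] + b[i - 2]) / 2)
    return b

a = [-72, -10, -70, 37, 68, 9, 1]
M = recurrence_matrix(len(a))
float_b = M @ np.array(a, dtype=float)
int_b = integer_loop(a)

print(M[-1])
print(float_b)
print(int_b)
```

The two results agree up to index 2 and differ from index 3 onward, where the loop truncates -60.5 to -60. That truncation feeds into every later element, so a float matrix product cannot reproduce the integer result exactly.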
