Truncated bytearray using pyodbc in Linux

I'm fitting a Kalman filter with pykalman, then pickling the filter and saving it into Sybase as a binary column that is long enough to hold it. I'm using pyodbc for the connection.
The script runs on a Linux server. If I then fetch the filter from my Windows desktop and unpickle it, everything works fine. However, if I fetch the same filter on Linux and try to unpickle it, I get an error saying the data is truncated. In IPython I can see that only the first 255 bytes come back.
In [16]: x = cur.execute('select top 1 filter from kalman_filters where qdate = "20180115"').fetchone()
In [17]: x[0]
Out[17]: b'\x80\x03cpykalman.standard\nKalmanFilter\nq\x00)\x81q\x01}q\x02(X\x13\x00\x00\x00transition_matricesq\x03cnumpy.core.multiarray\n_reconstruct\nq\x04cnumpy\nndarray\nq\x05K\x00\x85q\x06C\x01bq\x07\x87q\x08Rq\t(K\x01K\x01K\x01\x86q\ncnumpy\ndtype\nq\x0bX\x02\x00\x00\x00f8q\x0cK\x00K\x01\x87q\rRq\x0e(K\x03X\x01\x00\x00\x00<q\x0fNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x10b\x89C\x08\x00\x00\x00\x00\x00\x00\xf0?q\x11tq\x12bX\x14\x00\x00\x00observation_matric'
In [18]: len(x[0])
Out[18]: 255
In [19]: type(x[0])
Out[19]: bytes
If I do the same on Windows, the full value comes back correctly.
In [7]: len(x[1])
Out[7]: 827
In [8]: x[1]
Out[8]: b'\x80\x03cpykalman.standard\nKalmanFilter\nq\x00)\x81q\x01}q\x02(X\x13\x00\x00\x00transition_matricesq\x03cnumpy.core.multiarray\n_reconstruct\nq\x04cnumpy\nndarray\nq\x05K\x00\x85q\x06C\x01bq\x07\x87q\x08Rq\t(K\x01K\x01K\x01\x86q\ncnumpy\ndtype\nq\x0bX\x02\x00\x00\x00f8q\x0cK\x00K\x01\x87q\rRq\x0e(K\x03X\x01\x00\x00\x00<q\x0fNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x10b\x89C\x08\x00\x00\x00\x00\x00\x00\xf0?q\x11tq\x12bX\x14\x00\x00\x00observation_matricesq\x13h\x04h\x05K\x00\x85q\x14h\x07\x87q\x15Rq\x16(K\x01K\x01K\x01\x86q\x17h\x0e\x89C\x08\x00\x00\x00\x00\x00\x00\xf0?q\x18tq\x19bX\x15\x00\x00\x00transition_covarianceq\x1ah\x04h\x05K\x00\x85q\x1bh\x07\x87q\x1cRq\x1d(K\x01K\x01K\x01\x86q\x1eh\x0e\x89C\x08\x99\xa2#\x03Y\xa0%?q\x1ftq bX\x16\x00\x00\x00observation_covarianceq!h\x04h\x05K\x00\x85q"h\x07\x87q#Rq$(K\x01K\x01K\x01\x86q%h\x0e\x89C\x08\xa4\xd4\x1fF\xd09D?q&tq\'bX\x12\x00\x00\x00transition_offsetsq(h\x04h\x05K\x00\x85q)h\x07\x87q*Rq+(K\x01K\x01\x85q,h\x0e\x89C\x08\x00\x00\x00\x00\x00\x00\x00\x00q-tq.bX\x13\x00\x00\x00observation_offsetsq/h\x04h\x05K\x00\x85q0h\x07\x87q1Rq2(K\x01K\x01\x85q3h\x0e\x89C\x08\x00\x00\x00\x00\x00\x00\x00\x00q4tq5bX\x12\x00\x00\x00initial_state_meanq6h\x04h\x05K\x00\x85q7h\x07\x87q8Rq9(K\x01K\x01\x85q:h\x0e\x89C\x08\x18WQ\x07\x1bK\'\xbfq;tq<bX\x18\x00\x00\x00initial_state_covarianceq=h\x04h\x05K\x00\x85q>h\x07\x87q?Rq#(K\x01K\x01K\x01\x86qAh\x0e\x89C\x08\x1e"\x12\xd5\xa6\xbc\x00?qBtqCbX\x0c\x00\x00\x00random_stateqDNX\x07\x00\x00\x00em_varsqE]qF(h\x1ah!h6h=eX\x0b\x00\x00\x00n_dim_stateqGK\x01X\t\x00\x00\x00n_dim_obsqHK\x01ub.'
In [9]: type(x[1])
Out[9]: bytes
So I'm pretty sure this is a pyodbc/ODBC settings issue. Any idea what it could be?
Thanks
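One avenue worth checking, assuming the Linux side goes through FreeTDS (a common setup for Sybase on Linux): old TDS protocol versions (4.2) cap varchar/varbinary values at 255 bytes, which matches the truncation here, so forcing a newer protocol version may help. A minimal sketch; driver name, server, port, and credentials below are placeholders, not values from the question:
import pyodbc

# Force a newer TDS protocol version in the connection string (placeholders throughout)
conn = pyodbc.connect(
    "DRIVER={FreeTDS};SERVER=myserver;PORT=5000;DATABASE=mydb;"
    "UID=user;PWD=secret;TDS_Version=5.0"  # 5.0 is the Sybase ASE protocol version
)
cur = conn.cursor()
row = cur.execute(
    "select top 1 filter from kalman_filters where qdate = '20180115'"
).fetchone()
print(len(row[0]))  # should exceed 255 if the protocol version was the culprit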

Related

Pandas dataframe float index not self-consistent

I need/want to work with float indices in pandas, but I get a KeyError when running something like this:
import numpy as np
import pandas as pd
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
df[df.index[0]]
I have seen some errors regarding precision, but shouldn't this work?
You get the KeyError because df[df.index[0]] would try to access a column with label 1.1 in this case - which does not exist here.
What you can do is use loc or iloc to access rows based on indices:
import numpy as np
import pandas as pd
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
# to access e.g. the first row use
df.loc[df.index[0]]
# or more general
df.iloc[0]
# 5.4 1.531411
# 6.7 -0.341232
# Name: 1.1, dtype: float64
In principle, if you can, avoid equality comparisons on floating point numbers for the reason you already came across: precision. The 1.1 displayed to you might be != 1.1 for the computer, simply because representing it exactly would require infinite precision. Most of the time it will still work, because tolerance checks kick in; for example, when the difference between the compared numbers is smaller than about 1e-6.
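A quick way to see the precision issue in isolation (a small illustration, using numpy for the tolerance-based check):
import numpy as np

x = 0.1 + 0.2
print(x)                   # 0.30000000000000004
print(x == 0.3)            # False: exact comparison fails
print(np.isclose(x, 0.3))  # True: equal within a small tolerance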

read the scipy.beta distribution parameters from a scipy.stats._continuous_distns.beta_gen object

Having an instance of the beta object, how do I get back the parameters a and b?
There are properties a and b, but it seems they mean something other than I expected:
>>> import scipy
>>> scipy.__version__
'0.19.1'
>>> from scipy import stats
>>> my_beta = stats.beta(a=1, b=5)
>>> my_beta.a, my_beta.b
(0.0, 1.0)
Is there a way to get the parameters of the distribution? I could always fit a huge rvs sample but that seems silly :)
When you create a "frozen" distribution with a call such as my_beta = stats.beta(a=1, b=5), the positional and keyword arguments are saved as the attributes args and kwds, respectively, on the returned object. So in your case, you can access those values in the dictionary my_beta.kwds:
In [10]: from scipy import stats
In [11]: my_beta = stats.beta(a=1, b=5)
In [12]: my_beta.kwds
Out[12]: {'a': 1, 'b': 5}
The attributes my_beta.a and my_beta.b are, as you guessed, something different. They define the end points of the support of the probability distribution:
In [13]: my_beta.a
Out[13]: 0.0
In [14]: my_beta.b
Out[14]: 1.0
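For completeness, if the shape parameters are passed positionally rather than as keywords, they end up in args instead of kwds; a quick sketch:
from scipy import stats

my_beta = stats.beta(1, 5)   # positional shape parameters
print(my_beta.args)          # (1, 5)
print(my_beta.kwds)          # {}
print(my_beta.a, my_beta.b)  # 0.0 1.0 -- still the support end points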

Speed up writing billions of rows to HDF5

This is a continuation of the scenario I tried to discuss in my question https://stackoverflow.com/questions/33251445/tips-to-store-huge-sensor-data-in-hdf5-using-pandas. Please read the question for more details about what follows.
Since the linked question above was closed as the subject was too broad, I did not get a chance to gather ideas from people more experienced at handling hundreds of gigabytes of data. I do not have any experience with that whatsoever, and I am learning as I go. I have apparently made some mistake somewhere, because my method is taking way too long to complete.
The data is as I described in the linked question above. I decided to create a node (group) for each sensor (with the sensor ID as the node name, under root) to store the data generated by each of the 260k sensors I have. The file will end up with 260k nodes, and each node will have a few GB of data stored in a Table under it. The code that does all the heavy lifting is as follows:
with pd.HDFStore(hdf_path, mode='w') as hdf_store:
    for file in files:
        # Read CSV files in Pandas
        fp = os.path.normpath(os.path.join(path, str(file).zfill(2)) + '.csv')
        df = pd.read_csv(fp, names=data_col_names, skiprows=1, header=None,
                         chunksize=chunk_size, dtype=data_dtype)
        for chunk in df:
            # Manipulate date & epoch to get it in human readable form
            chunk['DATE'] = pd.to_datetime(chunk['DATE'], format='%m%d%Y', box=False)
            chunk['EPOCH'] = pd.to_timedelta(chunk['EPOCH']*5, unit='m')
            chunk['DATETIME'] = chunk['DATE'] + chunk['EPOCH']
            # Group on sensor to store in HDF5 file
            grouped = chunk.groupby('Sensor')
            for group, data in grouped:
                data.index = data['DATETIME']
                hdf_store.append(group, data.loc[:, ['R1', 'R2', 'R3']])

    # Adding sensor information as metadata to nodes
    for sens in sensors:
        try:
            hdf_store.get_storer(sens).attrs.metadata = sens_dict[sens]
            hdf_store.get_storer(sens).attrs['TITLE'] = sens
        except AttributeError:
            pass
If I comment out the line hdf_store.append(group, data.loc[:,['R1', 'R2', 'R3']]), the body of for chunk in df: takes about 40-45 seconds per iteration. (The chunk size I am reading is 1M rows.) With that line included (that is, if each grouped chunk is actually written to the HDF5 file), each iteration takes about 10-12 minutes. I am completely baffled by the increase in execution time and have no idea what is causing it.
Please give me some suggestions to resolve the issue. Note that I cannot afford execution times that long. I need to process about 220 GB of data in this fashion. Later I need to query that data, one node at a time, for further analysis. I have spent over 4 days researching the topic, but I am still as stumped as when I began.
#### EDIT 1 ####
Including df.info() for a chunk containing 1M rows.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 7 columns):
SENSOR 1000000 non-null object
DATE 1000000 non-null datetime64[ns]
EPOCH 1000000 non-null timedelta64[ns]
R1 1000000 non-null float32
R2 773900 non-null float32
R3 483270 non-null float32
DATETIME 1000000 non-null datetime64[ns]
dtypes: datetime64[ns](2), float32(3), object(1), timedelta64[ns](1)
memory usage: 49.6+ MB
Of these, only DATETIME, R1, R2, R3 are written to the file.
#### EDIT 2 ####
Including pd.show_versions()
In [ ] : pd.show_versions()
Out [ ] : INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.2
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.1
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
You are constantly building the index on the rows as you write them. It is much more efficient to write all of the rows first, THEN create the index.
See the documentation on creating an index here.
On the append operations pass index=False; this will turn off indexing.
Then, when you are finally finished, run the following on each node, where store is your HDFStore:
store.create_table_index('node')
This operation will take some time, but will be done once rather than continuously. This makes a tremendous difference because the creation can take into account all of your data (and move it only once).
You might also want to ptrepack your data (either before or after the indexing operation), to reset the chunksize. I wouldn't specify it directly, rather set chunksize='auto' to let it figure out an optimal size AFTER all of the data is written.
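Applied to the question's loop, the change might look roughly like this (a sketch only; the CSV-reading loop is omitted and names follow the question's code):
with pd.HDFStore(hdf_path, mode='w') as hdf_store:
    for chunk in df:  # df is the chunked read_csv iterator from the question
        for group, data in chunk.groupby('Sensor'):
            data.index = data['DATETIME']
            # index=False skips building the table index on every append
            hdf_store.append(group, data.loc[:, ['R1', 'R2', 'R3']], index=False)

    # once everything is written, build each node's index a single time
    for sens in sensors:
        hdf_store.create_table_index(sens, optlevel=9, kind='full')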
So this should be a pretty fast operation (even with indexing).
In [38]: N = 1000000
In [39]: df = DataFrame(np.random.randn(N,3).astype(np.float32),columns=list('ABC'),index=pd.date_range('20130101',freq='ms',periods=N))
In [40]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2013-01-01 00:00:00 to 2013-01-01 00:16:39.999000
Freq: L
Data columns (total 3 columns):
A 1000000 non-null float32
B 1000000 non-null float32
C 1000000 non-null float32
dtypes: float32(3)
memory usage: 19.1 MB
In [41]: store = pd.HDFStore('test.h5',mode='w')
In [42]: def write():
....: for i in range(10):
....: dfi = df.copy()
....: dfi.index = df.index + pd.Timedelta(minutes=i)
....: store.append('df',dfi)
....:
In [43]: %timeit -n 1 -r 1 write()
1 loops, best of 1: 4.26 s per loop
In [44]: store.close()
In [45]: pd.read_hdf('test.h5','df').info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000000 entries, 2013-01-01 00:00:00 to 2013-01-01 00:25:39.999000
Data columns (total 3 columns):
A float32
B float32
C float32
dtypes: float32(3)
memory usage: 190.7 MB
Versions
In [46]: pd.__version__
Out[46]: u'0.17.0'
In [49]: import tables
In [50]: tables.__version__
Out[50]: '3.2.2'
In [51]: np.__version__
Out[51]: '1.10.1'

Convert list of numpy.float64 to float in Python quickly

What is the fastest way of converting a list of elements of type numpy.float64 to type float? I am currently using the straightforward for loop iteration in conjunction with float().
I came across this post: Converting numpy dtypes to native python types. However, my question isn't about how to convert a single value; it's about how best to convert an entire list from one type to another as quickly as possible (in this specific case, numpy.float64 to float). I was hoping for some secret Python machinery that I hadn't come across that could do it all at once :)
The tolist() method should do what you want. If you have a numpy array, just call tolist():
In [17]: a
Out[17]:
array([ 0. , 0.14285714, 0.28571429, 0.42857143, 0.57142857,
0.71428571, 0.85714286, 1. , 1.14285714, 1.28571429,
1.42857143, 1.57142857, 1.71428571, 1.85714286, 2. ])
In [18]: a.dtype
Out[18]: dtype('float64')
In [19]: b = a.tolist()
In [20]: b
Out[20]:
[0.0,
0.14285714285714285,
0.2857142857142857,
0.42857142857142855,
0.5714285714285714,
0.7142857142857142,
0.8571428571428571,
1.0,
1.1428571428571428,
1.2857142857142856,
1.4285714285714284,
1.5714285714285714,
1.7142857142857142,
1.857142857142857,
2.0]
In [21]: type(b)
Out[21]: list
In [22]: type(b[0])
Out[22]: float
If, in fact, you really have a Python list of numpy.float64 objects, then @Alexander's answer is great, or you could convert the list to an array and then use the tolist() method. E.g.
In [46]: c
Out[46]:
[0.0,
0.33333333333333331,
0.66666666666666663,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
In [47]: type(c)
Out[47]: list
In [48]: type(c[0])
Out[48]: numpy.float64
@Alexander's suggestion, a list comprehension:
In [49]: [float(v) for v in c]
Out[49]:
[0.0,
0.3333333333333333,
0.6666666666666666,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
Or, convert to an array and then use the tolist() method.
In [50]: np.array(c).tolist()
Out[50]:
[0.0,
0.3333333333333333,
0.6666666666666666,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
If you are concerned with the speed, here's a comparison. The input, x, is a python list of numpy.float64 objects:
In [8]: type(x)
Out[8]: list
In [9]: len(x)
Out[9]: 1000
In [10]: type(x[0])
Out[10]: numpy.float64
Timing for the list comprehension:
In [11]: %timeit list1 = [float(v) for v in x]
10000 loops, best of 3: 109 µs per loop
Timing for conversion to numpy array and then tolist():
In [12]: %timeit list2 = np.array(x).tolist()
10000 loops, best of 3: 70.5 µs per loop
So it is faster to convert the list to an array and then call tolist().
You could use a list comprehension:
floats = [float(np_float) for np_float in np_float_list]
Out of the possible solutions I've come across (big thanks to Warren Weckesser and Alexander for pointing out the best approaches), I timed my current method against the one presented by Alexander, since what I have is a true list of numpy.float64 elements that I want to convert to float quickly.
Two approaches are compared: a list comprehension and basic for-loop iteration.
First here's the code:
import time
import numpy

list1 = []
for i in range(0, 1000):
    list1.append(numpy.float64(i))

list2 = []
t_init = time.time()
for num in list1:
    list2.append(float(num))
t_1 = time.time()
list2 = [float(np_float) for np_float in list1]
t_2 = time.time()

print("t1 run time: {}".format(t_1 - t_init))
print("t2 run time: {}".format(t_2 - t_1))
I ran four times to give a quick set of results:
>>> run 1
t1 run time: 0.000179290771484375
t2 run time: 0.0001533031463623047
Python 3.4.0
>>> run 2
t1 run time: 0.00018739700317382812
t2 run time: 0.0001518726348876953
Python 3.4.0
>>> run 3
t1 run time: 0.00017976760864257812
t2 run time: 0.0001513957977294922
Python 3.4.0
>>> run 4
t1 run time: 0.0002455711364746094
t2 run time: 0.00015997886657714844
Python 3.4.0
Clearly to convert a true list of numpy.float64 to float, the optimal approach is to use python's list comprehension.

getting indices in numpy

Can someone find out what is wrong with the code below?
import numpy as np
data = np.recfromcsv("data.txt", delimiter=" ", names=['name', 'types', 'value'])
indices = np.where((data.name == 'david') * data.types.startswith('height'))
mean_value = np.mean(data.value[indices])
I want to calculate mean of weight and height for david and mark as follows:
david>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
mark>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
From the text (data.txt) file:
david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170
I am using python 3.2 and numpy 1.8
The above code raises the following TypeError:
TypeError: startswith first arg must be bytes or a tuple of bytes, not numpy.str_
With Python 3.2 and numpy 1.7, this line works:
indices = np.where((data.name == b'david') * data.types.startswith(b'height'))
data displays as:
rec.array([(b'david', b'weight_2005', 50),...],
dtype=[('name', 'S5'), ('types', 'S11'), ('value', '<i4')])
type(data.name[0]) is <class 'bytes'>.
b'height' works in Python2.7 as well.
Another option is to convert all the data to unicode (Python 3 strings):
dtype=[('name','U5'), ('types', 'U11'), ('value', '<i4')]
dataU=data.astype(dtype=dtype)
indices = np.where((dataU.name == 'david') * dataU.types.startswith('height'))
or
data = np.recfromtxt('data.txt', delimiter=" ",
names=['name', 'types', 'value'], dtype=dtype)
It looks like recfromcsv does not take a dtype, but recfromtxt does.
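Putting it together, one way to get the requested means (a sketch that assumes the exact file layout shown in the question; np.char.startswith is used so it works directly on the byte-string columns):
import numpy as np

data = np.recfromcsv("data.txt", delimiter=" ",
                     names=['name', 'types', 'value'])

# mean of the weight_* and height_* rows for each person
for person in (b'david', b'mark'):
    for prefix in (b'weight', b'height'):
        mask = (data.name == person) & np.char.startswith(data.types, prefix)
        print(person, prefix, data.value[mask].mean())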
