Speed up writing billions of rows to HDF5 - python-3.x

This is a continuation of the scenario I tried to discuss in my question https://stackoverflow.com/questions/33251445/tips-to-store-huge-sensor-data-in-hdf5-using-pandas. Please read the question for more details about what follows.
Since the linked question above was closed as the subject was too broad, I did not get a chance to gather ideas from people more experienced at handling hundreds of gigabytes of data. I do not have any experience with that whatsoever, and I am learning as I go. I have apparently made some mistake somewhere, because my method is taking way too long to complete.
The data is as I described in the linked question above. I decided to create a node (group) for each sensor (with the sensor ID as the node name, under root) to store the data generated by each of the 260k sensors I have. The file will end up with 260k nodes, and each node will have a few GB of data stored in a Table under it. The code that does all the heavy lifting is as follows:
with pd.HDFStore(hdf_path, mode='w') as hdf_store:
    for file in files:
        # Read CSV files in Pandas
        fp = os.path.normpath(os.path.join(path, str(file).zfill(2)) + '.csv')
        df = pd.read_csv(fp, names=data_col_names, skiprows=1, header=None,
                         chunksize=chunk_size, dtype=data_dtype)
        for chunk in df:
            # Manipulate date & epoch to get it in human readable form
            chunk['DATE'] = pd.to_datetime(chunk['DATE'], format='%m%d%Y', box=False)
            chunk['EPOCH'] = pd.to_timedelta(chunk['EPOCH']*5, unit='m')
            chunk['DATETIME'] = chunk['DATE'] + chunk['EPOCH']
            # Group on Sensor to store in HDF5 file
            grouped = chunk.groupby('Sensor')
            for group, data in grouped:
                data.index = data['DATETIME']
                hdf_store.append(group, data.loc[:, ['R1', 'R2', 'R3']])
    # Adding sensor information as metadata to nodes
    for sens in sensors:
        try:
            hdf_store.get_storer(sens).attrs.metadata = sens_dict[sens]
            hdf_store.get_storer(sens).attrs['TITLE'] = sens
        except AttributeError:
            pass
If I comment out the line hdf_store.append(group, data.loc[:, ['R1', 'R2', 'R3']]), the body of the for chunk in df: loop takes about 40-45 seconds per iteration (the chunk size I am reading is 1M rows). With the line included, that is, when each grouped chunk is actually written to the HDF5 file, each iteration takes about 10-12 minutes. I am completely baffled by the increase in execution time and do not know what is causing it.
Please give me some suggestions to resolve the issue. Note that I cannot afford execution times that long. I need to process about 220 GB of data in this fashion. Later I need to query that data, one node at a time, for further analysis. I have spent over 4 days researching the topic, but I am still as stumped as when I began.
#### EDIT 1 ####
Including df.info() for a chunk containing 1M rows.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 7 columns):
SENSOR 1000000 non-null object
DATE 1000000 non-null datetime64[ns]
EPOCH 1000000 non-null timedelta64[ns]
R1 1000000 non-null float32
R2 773900 non-null float32
R3 483270 non-null float32
DATETIME 1000000 non-null datetime64[ns]
dtypes: datetime64[ns](2), float32(3), object(1), timedelta64[ns](1)
memory usage: 49.6+ MB
Of these, only DATETIME, R1, R2, R3 are written to the file.
#### EDIT 2 ####
Including pd.show_versions()
In [ ] : pd.show_versions()
Out [ ] : INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.2
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.1
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None

You are constantly re-indexing the rows as you write them. It is much more efficient to write all of the rows first, THEN create the index.
See the documentation on creating an index here.
On the append operations pass index=False; this will turn off indexing.
Then, when you are finally finished, run the following on each node (assuming store is your HDFStore):
store.create_table_index('node')
This operation will take some time, but will be done once rather than continuously. This makes a tremendous difference because the creation can take into account all of your data (and move it only once).
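Applied to the loop in the question, the change might look roughly like the sketch below (it reuses the names from the question's code, hdf_store, group, data and sensors, and is an outline rather than a drop-in replacement; the optlevel/kind arguments are optional tuning values, just one possible choice):
# Inside the per-group loop: a pure append, with no on-disk index maintenance.
hdf_store.append(group, data.loc[:, ['R1', 'R2', 'R3']], index=False)

# After every file and chunk has been written, build each node's index exactly once.
for sens in sensors:
    hdf_store.create_table_index(sens, optlevel=9, kind='full')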
You might also want to ptrepack your data (either before or after the indexing operation), to reset the chunksize. I wouldn't specify it directly, rather set chunksize='auto' to let it figure out an optimal size AFTER all of the data is written.
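For reference, ptrepack is the command-line utility that ships with PyTables; a typical invocation might look something like the line below (the flag spellings are PyTables', with --chunkshape=auto doing the automatic chunk sizing and --propindexes carrying existing indexes over):
ptrepack --chunkshape=auto --propindexes in.h5:/ out.h5:/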
So this should be a pretty fast operation (even with indexing).
In [38]: N = 1000000
In [39]: df = DataFrame(np.random.randn(N,3).astype(np.float32),columns=list('ABC'),index=pd.date_range('20130101',freq='ms',periods=N))
In [40]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2013-01-01 00:00:00 to 2013-01-01 00:16:39.999000
Freq: L
Data columns (total 3 columns):
A 1000000 non-null float32
B 1000000 non-null float32
C 1000000 non-null float32
dtypes: float32(3)
memory usage: 19.1 MB
In [41]: store = pd.HDFStore('test.h5',mode='w')
In [42]: def write():
....: for i in range(10):
....: dfi = df.copy()
....: dfi.index = df.index + pd.Timedelta(minutes=i)
....: store.append('df',dfi)
....:
In [43]: %timeit -n 1 -r 1 write()
1 loops, best of 1: 4.26 s per loop
In [44]: store.close()
In [45]: pd.read_hdf('test.h5','df').info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000000 entries, 2013-01-01 00:00:00 to 2013-01-01 00:25:39.999000
Data columns (total 3 columns):
A float32
B float32
C float32
dtypes: float32(3)
memory usage: 190.7 MB
Versions
In [46]: pd.__version__
Out[46]: u'0.17.0'
In [49]: import tables
In [50]: tables.__version__
Out[50]: '3.2.2'
In [51]: np.__version__
Out[51]: '1.10.1'

Related

Writing a CSV, or reading a CSV changes my pandas data frame from float16 into float64. How can I avoid this?

I have a data frame test_df with the following information
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 4097 entries, index to 4095
dtypes: float16(4096), object(1)
memory usage: 800.9+ KB
Clearly, all the columns are of data type float16 except the first one, and the total size of the data frame is about 800 KB. Now I save this data frame as a CSV file using Watson Studio as follows:
# import the lib
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space()
# save the data frame as csv
wslib.save_data("test_df.csv", test_df.to_csv(index=False, header=False).encode())
Checking the size of the CSV, it is suddenly 1.8 MB. For some reason the size doubled.
Now, when reading in the same CSV again with the following code
import itc_utils.flight_service as itcfs
readClient = itcfs.get_flight_client()
nb_data_request = {
    'data_name': """test_df.csv""",
    'interaction_properties': {
        #'row_limit': 500,
        'infer_schema': 'true',
        'infer_as_varchar': 'false'
    }
}
flightInfo = itcfs.get_flight_info(readClient, nb_data_request=nb_data_request)
test_df = itcfs.read_pandas_and_concat(readClient, flightInfo, timeout=10000)
test_df.index.name = None
# rename first column to 'index'
test_df.rename(columns = {'COLUMN1':'index'}, inplace = True)
# rename the rest of columns as a consecutive integer
new_columns = {}
for i in range(len(test_df.columns)-1):
    new_columns[test_df.columns[i+1]] = str(i)
test_df = test_df.rename(columns=new_columns)
And checking info now gives
test_df.info()
Time taken: 0.0468 minutes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 4097 entries, index to 4095
dtypes: float64(4096), object(1)
memory usage: 3.1+ MB
Now the data type is float64 and the memory usage is 3.1 MB. How can I avoid this?
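No fix is recorded above, but the core of the behaviour is generic pandas/CSV behaviour rather than anything Watson-Studio-specific: CSV is plain text and carries no dtype metadata, so whatever reads it back has to infer the types and will normally default to float64, and the text representation of each value usually takes more bytes than a 2-byte float16, which is why the file grows. A minimal, self-contained sketch of the round trip and of casting back down after the read (the cast-back step is just one possible workaround, not the answer to the question above):
import io
import numpy as np
import pandas as pd

# A float16 frame round-tripped through CSV text comes back as float64,
# because CSV stores no dtype information.
df = pd.DataFrame(np.random.rand(100, 4).astype(np.float16))
csv_text = df.to_csv(index=False)

df2 = pd.read_csv(io.StringIO(csv_text))
print(df2.dtypes)                # float64 across the board
df2 = df2.astype(np.float16)     # cast back down explicitly
print(df2.dtypes)                # float16 again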

Speed up getting distance between two lat and lon

I have two DataFrames containing Lat and Lon columns. I want to find the distance from one (Lat, Lon) pair to ALL the (Lat, Lon) pairs in another DataFrame and get the minimum. The package that I am using is geopy. The code is as follows:
from geopy import distance
import numpy as np
distanceMiles = []
count = 0
for id1, row1 in df1.iterrows():
    target = (row1["LAT"], row1["LON"])
    count = count + 1
    print(count)
    for id2, row2 in df2.iterrows():
        point = (row2["LAT"], row2["LON"])
        distanceMiles.append(distance.distance(target, point).miles)
    closestPoint = np.argmin(distanceMiles)
    distanceMiles = []
The problem is that df1 has 168K rows and df2 has 1200 rows. How do I make it faster?
geopy.distance.distance uses the geodesic algorithm by default, which is rather slow but more accurate. If you can trade accuracy for speed, you can use great_circle, which is ~20 times faster:
In [4]: %%timeit
...: distance.distance(newport_ri, cleveland_oh).miles
...:
236 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %%timeit
...: distance.great_circle(newport_ri, cleveland_oh).miles
...:
13.4 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
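(For reference, the two points in the timings above are the Newport, RI and Cleveland, OH coordinates from the geopy documentation example; roughly the following setup is assumed:)
newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)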
Also you may use multiprocessing to parallelize the computation:
from multiprocessing import Pool
from geopy import distance
import numpy as np
def compute(points):
    target, point = points
    return distance.great_circle(target, point).miles

with Pool() as pool:
    for id1, row1 in df1.iterrows():
        target = (row1["LAT"], row1["LON"])
        distanceMiles = pool.map(
            compute,
            (
                (target, (row2["LAT"], row2["LON"]))
                for id2, row2 in df2.iterrows()
            )
        )
        closestPoint = np.argmin(distanceMiles)
Leaving this here in case anyone needs it in the future:
If you need only the minimum distance, then you don't have to bruteforce all the pairs. There are some data structures that can help you solve this in O(n*log(n)) time complexity, which is way faster than the bruteforce method.
For example, you can use a generalized KNearestNeighbors (with k=1) algorithm to do exactly that, given that you pay attention to your points being on a sphere, not a plane. See this SO answer for an example implementation using sklearn.
There seem to be a few libraries that solve this too, like sknni and GriSPy.
Here's also another question that talks a bit about the theory.
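As a concrete illustration of that idea (my own sketch, not code from the linked answer), scikit-learn's BallTree supports the haversine metric; it expects coordinates in radians and returns distances on the unit sphere, so the result is scaled by an assumed Earth radius:
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (approximation)

# Dummy data shaped like the question: 168K query points, 1200 reference points.
df1 = pd.DataFrame({'LAT': np.random.uniform(30, 50, 168000),
                    'LON': np.random.uniform(-120, -70, 168000)})
df2 = pd.DataFrame({'LAT': np.random.uniform(30, 50, 1200),
                    'LON': np.random.uniform(-120, -70, 1200)})

# Build the tree on the smaller frame, then query every row of the larger one at once.
tree = BallTree(np.radians(df2[['LAT', 'LON']].values), metric='haversine')
dist, idx = tree.query(np.radians(df1[['LAT', 'LON']].values), k=1)

df1['closest_idx'] = idx[:, 0]                         # positional index into df2
df1['closest_miles'] = dist[:, 0] * EARTH_RADIUS_MILES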
This should run much faster if you utilize itertools instead of explicit for loops. Inline comments should help you understand what's happening at each step.
import itertools
import numpy as np
import pandas as pd
from geopy import distance
#Creating 2 sample dataframes with 10 and 5 rows of lat, long columns respectively
df1 = pd.DataFrame({'LAT':np.random.random(10,), 'LON':np.random.random(10,)})
df2 = pd.DataFrame({'LAT':np.random.random(5,), 'LON':np.random.random(5,)})
#Zip the 2 columns to get (lat, lon) tuples for target in df1 and point in df2
target = list(zip(df1['LAT'], df1['LON']))
point = list(zip(df2['LAT'], df2['LON']))
#Product function in itertools does a cross product between the 2 iteratables
#You should get things of the form ( ( lat, lon), (lat, lon) ) where 1st is target, second is point. Feel free to change the order if needed
product = list(itertools.product(target, point))
#starmap(function, parameters) maps the distance function to the list of tuples. Later you can use i.miles for conversion
geo_dist = [i.miles for i in itertools.starmap(distance.distance, product)]
len(geo_dist)
50
geo_dist = [42.430772028845716,
44.29982320107605,
25.88823239877388,
23.877570442142783,
29.9351451072828,
...]
Finally,
If you are working with a massive dataset, then I would recommend using multiprocessing library to map the itertools.starmap to different cores and asynchronously compute the distance values. Python Multiprocessing library now supports starmap.
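A rough sketch of that combination (the coordinates here are made up for illustration, and great_circle can be swapped in for distance.distance as discussed above; Pool.starmap unpacks each (target, point) tuple into the two arguments of the worker function):
from multiprocessing import Pool
import itertools
from geopy import distance

def geo_miles(target, point):
    # Each (target, point) tuple from itertools.product is unpacked by starmap.
    return distance.distance(target, point).miles

if __name__ == "__main__":
    target = [(41.49008, -71.312796), (41.499498, -81.695391)]
    point = [(40.7128, -74.0060), (34.0522, -118.2437)]
    pairs = itertools.product(target, point)
    with Pool() as pool:
        geo_dist = pool.starmap(geo_miles, pairs)
    print(geo_dist)   # 4 distances, in the same order as the product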
If you need to check all the pairs by brute force, I think the following approach is the best you can do.
Looping directly on columns is usually slightly faster than iterrows, and the vectorized approach replacing the inner loop saves time too.
count = 0
for lat1, lon1 in zip(df1["LAT"], df1["LON"]):
    target = (lat1, lon1)
    count = count + 1
    # print(count)  # printing is also time-expensive
    df2['dist'] = df2.apply(lambda row: distance.distance(target, (row['LAT'], row['LON'])).miles, axis=1)
    closestpoint = df2['dist'].min()     # if you want the minimum distance
    closestpoint = df2['dist'].idxmin()  # if you want the position (index) of the minimum
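If even the apply call is too slow, the great-circle distance can be computed with plain NumPy broadcasting instead of geopy (a sketch of the haversine formula with a fixed Earth radius, so it trades a little accuracy for speed; df2 and target are the same objects as in the snippet above):
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2, radius_miles=3958.8):
    # Inputs in degrees; lat2/lon2 may be full arrays, broadcasting handles the rest.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_miles * np.arcsin(np.sqrt(a))

# Distance from one target to every row of df2 in a single vectorized call:
dists = haversine_miles(target[0], target[1], df2["LAT"].values, df2["LON"].values)
closestpoint = dists.min()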

Efficiently create categorical data frame with nulls

I want to create a categorical data frame with nulls and set the categories before expanding the index. The index is very large, so I want to avoid the memory spike, but I cannot seem to manage it.
Example:
# memory spike
df = pd.DataFrame(index=list(range(0, 1000)), columns=['a', 'b'])
df.info(memory_usage='deep')
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
a 0 non-null object
b 0 non-null object
dtypes: object(2)
memory usage: 70.3 KB
Convert to Categorical:
for _ in df.columns:
    df[_] = df[_].astype('category')
# set categories for columns
df['a'] = df['a'].cat.add_categories(['d', 'e', 'f'])
df['b'] = df['b'].cat.add_categories(['g', 'h', 'i'])
# check memory usage
df.info(memory_usage='deep')
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
a 0 non-null category
b 0 non-null category
dtypes: category(2)
memory usage: 9.9 KB
Is there a way to do this while avoiding the memory spike?
If the data frame is created by the DataFrame constructor, the columns can be initialized as category types.
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
cat_type1 = CategoricalDtype(["d", "e", "f"])
cat_type2 = CategoricalDtype(["g", "h", "i"])
index = pd.Index(range(1000))
df = pd.DataFrame({"a": pd.Series([np.nan] * len(index), dtype=cat_type1, index=index),
"b": pd.Series([np.nan] * len(index), dtype=cat_type2, index=index)},
index=index)
Another alternative solution is the following.
cols = ["a", "b"]
index = pd.Index(range(1000))
df = pd.DataFrame({k: [np.nan] * len(index) for k in cols}, index=index, dtype="category")
df["a"].cat.set_categories(["d", "e", "f"], inplace=True)
df["b"].cat.set_categories(["g", "h", "i"], inplace=True)
If the data frame is created via methods such as read_csv, the dtype keyword parameter can be used to make sure the output columns have the desired data types from the start, rather than converting after the data frame is created, which leads to more memory consumption.
df = pd.read_csv("file.csv", dtype={"a": cat_type1, "b": cat_type2})
Here, the category values can also be directly inferred from the data by passing in dtype={"a": "category"}. Specifying the categories beforehand can save the inference overhead and also let the parser check the data values match the specified category values. It is also necessary if some category values do not occur in the data.
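For example (a small sketch reusing the cat_type1/cat_type2 definitions above and the same hypothetical file.csv):
# Let the parser infer the categories from the file's contents...
df_inferred = pd.read_csv("file.csv", dtype={"a": "category", "b": "category"})

# ...or fix them up front, which also keeps categories that never occur in the file.
df_fixed = pd.read_csv("file.csv", dtype={"a": cat_type1, "b": cat_type2})
print(df_fixed["a"].cat.categories)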

Truncated bytearray using pyodbc in Linux

I'm using a Kalman filter from pykalman, then I pickle the filter and save it into Sybase in a binary column that is sufficiently long. I'm using pyodbc as the connection.
I run the script on a Linux server. If I then fetch the same filter from my Windows desktop and unpickle it, it works fine. However, if I fetch the same filter on Linux and try to unpickle it, it says the data is truncated. In IPython I can see that it's only getting the first 255 bytes.
In [16]: x = cur.execute('select top 1 filter from kalman_filters where qdate = "20180115"').fetchone()
In [17]: x[0]
Out[17]: b'\x80\x03cpykalman.standard\nKalmanFilter\nq\x00)\x81q\x01}q\x02(X\x13\x00\x00\x00transition_matricesq\x03cnumpy.core.multiarray\n_reconstruct\nq\x04cnumpy\nndarray\nq\x05K\x00\x85q\x06C\x01bq\x07\x87q\x08Rq\t(K\x01K\x01K\x01\x86q\ncnumpy\ndtype\nq\x0bX\x02\x00\x00\x00f8q\x0cK\x00K\x01\x87q\rRq\x0e(K\x03X\x01\x00\x00\x00<q\x0fNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x10b\x89C\x08\x00\x00\x00\x00\x00\x00\xf0?q\x11tq\x12bX\x14\x00\x00\x00observation_matric'
In [18]: len(x[0])
Out[18]: 255
In [19]: type(x[0])
Out[19]: bytes
If I do the same in Windows, it gets it correctly.
In [7]: len(x[1])
Out[7]: 827
In [8]: x[1]
Out[8]: b'\x80\x03cpykalman.standard\nKalmanFilter\nq\x00)\x81q\x01}q\x02(X\x13\x00\x00\x00transition_matricesq\x03cnumpy.core.multiarray\n_reconstruct\nq\x04cnumpy\nndarray\nq\x05K\x00\x85q\x06C\x01bq\x07\x87q\x08Rq\t(K\x01K\x01K\x01\x86q\ncnumpy\ndtype\nq\x0bX\x02\x00\x00\x00f8q\x0cK\x00K\x01\x87q\rRq\x0e(K\x03X\x01\x00\x00\x00<q\x0fNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x10b\x89C\x08\x00\x00\x00\x00\x00\x00\xf0?q\x11tq\x12bX\x14\x00\x00\x00observation_matricesq\x13h\x04h\x05K\x00\x85q\x14h\x07\x87q\x15Rq\x16(K\x01K\x01K\x01\x86q\x17h\x0e\x89C\x08\x00\x00\x00\x00\x00\x00\xf0?q\x18tq\x19bX\x15\x00\x00\x00transition_covarianceq\x1ah\x04h\x05K\x00\x85q\x1bh\x07\x87q\x1cRq\x1d(K\x01K\x01K\x01\x86q\x1eh\x0e\x89C\x08\x99\xa2#\x03Y\xa0%?q\x1ftq bX\x16\x00\x00\x00observation_covarianceq!h\x04h\x05K\x00\x85q"h\x07\x87q#Rq$(K\x01K\x01K\x01\x86q%h\x0e\x89C\x08\xa4\xd4\x1fF\xd09D?q&tq\'bX\x12\x00\x00\x00transition_offsetsq(h\x04h\x05K\x00\x85q)h\x07\x87q*Rq+(K\x01K\x01\x85q,h\x0e\x89C\x08\x00\x00\x00\x00\x00\x00\x00\x00q-tq.bX\x13\x00\x00\x00observation_offsetsq/h\x04h\x05K\x00\x85q0h\x07\x87q1Rq2(K\x01K\x01\x85q3h\x0e\x89C\x08\x00\x00\x00\x00\x00\x00\x00\x00q4tq5bX\x12\x00\x00\x00initial_state_meanq6h\x04h\x05K\x00\x85q7h\x07\x87q8Rq9(K\x01K\x01\x85q:h\x0e\x89C\x08\x18WQ\x07\x1bK\'\xbfq;tq<bX\x18\x00\x00\x00initial_state_covarianceq=h\x04h\x05K\x00\x85q>h\x07\x87q?Rq#(K\x01K\x01K\x01\x86qAh\x0e\x89C\x08\x1e"\x12\xd5\xa6\xbc\x00?qBtqCbX\x0c\x00\x00\x00random_stateqDNX\x07\x00\x00\x00em_varsqE]qF(h\x1ah!h6h=eX\x0b\x00\x00\x00n_dim_stateqGK\x01X\t\x00\x00\x00n_dim_obsqHK\x01ub.'
In [9]: type(x[1])
Out[9]: bytes
So I'm pretty sure it's a pyodbc/odbc settings issue. Any idea what this is?
Thanks

Slow performance of timedelta methods

Why does .dt.days take 100 times longer than .dt.total_seconds()?
df = pd.DataFrame({'a': pd.date_range('2011-01-01 00:00:00', periods=1000000, freq='1H')})
df.a = df.a - pd.to_datetime('2011-01-01 00:00:00')
df.a.dt.days # 12 sec
df.a.dt.total_seconds() # 0.14 sec
.dt.total_seconds is basically just a multiplication, and can be performed at numpythonic speed:
def total_seconds(self):
    """
    Total duration of each element expressed in seconds.
    .. versionadded:: 0.17.0
    """
    return self._maybe_mask_results(1e-9 * self.asi8)
Whereas if we abort the days operation, we see it's spending its time in a slow listcomp with a getattr and a construction of Timedelta objects (source):
    360     else:
    361         result = np.array([getattr(Timedelta(val), m)
--> 362                            for val in values], dtype='int64')
    363     return result
    364
To me this screams "look, let's get it correct, and we'll cross the optimization bridge when we come to it."
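If the per-element Timedelta construction is the bottleneck in practice, a common workaround (my sketch, not part of the answer) is to stay vectorized and derive the day count arithmetically:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': pd.date_range('2011-01-01 00:00:00', periods=1000000, freq='1H')})
df.a = df.a - pd.to_datetime('2011-01-01 00:00:00')

# Neither of these builds a Python Timedelta per element:
days_float = df.a / np.timedelta64(1, 'D')                          # fractional days
days_int = df.a.values.astype('timedelta64[D]').astype(np.int64)    # whole days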

Resources