EDIT: I edited my original post to include a simpler example.
I use SciPy's differential evolution (DE) to optimize certain parameters.
I would like to use all the PC processors for this task, so I try to use the option "workers=-1".
The condition required is that the function called by DE must be picklable.
If I run the example in https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html#scipy.optimize.differential_evolution, the optimisation works.
from scipy.optimize import rosen, differential_evolution
import pickle
import dill
bounds = [(0,2), (0, 2)]
result = differential_evolution(rosen, bounds, updating='deferred',workers=-1)
result.x, result.fun
(array([1., 1.]), 0.0)
But if I define a custom function 'Ros_custom', the optimisation crashes (it doesn't give a result):
def Ros_custom(X):
    x = X[0]
    y = X[1]
    a = 1. - x
    b = y - x*x
    return a*a + b*b*100
result = differential_evolution(Ros_custom, bounds, updating='deferred',workers=-1)
If I try to pickle.dumps and pickle.loads 'Ros_custom', I get the same behaviour (the optimisation crashes with no answer).
If I use dill:
Ros_pick_1=dill.dumps(Ros_custom)
Ros_pick_2=dill.loads(Ros_pick_1)
result = differential_evolution(Ros_pick_2, bounds, updating='deferred',workers=-1)
result.x, result.fun
I get the following error message:
PicklingError: Can't pickle <function Ros_custom at 0x0000020247F04C10>: it's not the same object as __main__.Ros_custom
My questions are:
Why do I get the error, and is there a way to make 'Ros_custom' picklable so that I can use all the PC processors in DE?
Thank you in advance for any advice.
Two things:
I'm not able to reproduce the error you are seeing unless I first pickle/unpickle the custom function.
There's no need to pickle/unpickle the custom function before passing it to the solver.
This seems to work for me. Python 3.6.12 and scipy 1.5.2:
>>> from scipy.optimize import rosen, differential_evolution
>>> bounds = [(0,2), (0, 2)]
>>>
>>> def Ros_custom(X):
... x = X[0]
... y = X[1]
... a = 1. - x
... b = y - x*x
... return a*a + b*b*100
...
>>> result = differential_evolution(Ros_custom, bounds, updating='deferred',workers=-1)
>>> result.x, result.fun
(array([1., 1.]), 0.0)
>>>
>>> result
fun: 0.0
message: 'Optimization terminated successfully.'
nfev: 4953
nit: 164
success: True
x: array([1., 1.])
>>>
I can even nest a function inside of the custom objective:
>>> def foo(a,b):
... return a*a + b*b*100
...
>>> def custom(X):
... x,y = X[0],X[1]
... return foo(1.-x, y-x*x)
...
>>> result = differential_evolution(custom, bounds, updating='deferred',workers=-1)
>>> result
fun: 0.0
message: 'Optimization terminated successfully.'
nfev: 4593
nit: 152
success: True
x: array([1., 1.])
So, for me at least, the code works as expected.
You should have no need to serialize/deserialize the function ahead of its use in scipy. Yes, the function needs to be picklable, but scipy will do that for you. Basically, what's happening under the covers is that your function will get serialized, passed to multiprocessing as bytes, then distributed to the processors, then unpickled and used on the target processors.
Like this, for 4 sets of inputs, run one per processor:
>>> import multiprocessing as mp
>>> res = mp.Pool().map(custom, [(0,1), (1,2), (4,9), (3,4)])
>>> list(res)
[101.0, 100.0, 4909.0, 2504.0]
>>>
Older versions of multiprocessing had difficulty serializing functions defined in the interpreter, and often needed to have the code executed in a __main__ block. If you are on Windows, this is still often the case... and you might also need to call mp.freeze_support(), depending on how the code in scipy is implemented.
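For example, on Windows the script might be structured roughly like this (a minimal sketch; the if __name__ == '__main__': guard is the important part, and freeze_support() is only strictly needed for frozen executables):
from multiprocessing import freeze_support
from scipy.optimize import differential_evolution

def Ros_custom(X):
    x, y = X[0], X[1]
    a = 1. - x
    b = y - x*x
    return a*a + b*b*100

bounds = [(0, 2), (0, 2)]

if __name__ == '__main__':
    # The guard keeps worker processes from re-running the optimisation on import;
    # freeze_support() is harmless here and only matters for frozen executables.
    freeze_support()
    result = differential_evolution(Ros_custom, bounds,
                                    updating='deferred', workers=-1)
    print(result.x, result.fun)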
I tend to like dill (I'm the author) because it can serialize a broader range of objects than pickle. However, as scipy uses multiprocessing, which uses pickle... I often choose to use mystic (I'm the author), which uses multiprocess (I'm the author), which uses dill. Very roughly, the codes are equivalent, but they all work with dill instead of pickle.
>>> from mystic.solvers import diffev2
>>> from pathos.pools import ProcessPool
>>> diffev2(custom, bounds, npop=40, ftol=1e-10, map=ProcessPool().map)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 42
Function evaluations: 1720
array([1.00000394, 1.00000836])
With mystic, you get some additional nice features, like a monitor:
>>> from mystic.monitors import VerboseMonitor
>>> mon = VerboseMonitor(5,5)
>>> diffev2(custom, bounds, npop=40, ftol=1e-10, itermon=mon, map=ProcessPool().map)
Generation 0 has ChiSquare: 0.065448
Generation 0 has fit parameters:
[0.769543181527466, 0.5810893880113548]
Generation 5 has ChiSquare: 0.065448
Generation 5 has fit parameters:
[0.588156685059123, -0.08325052939774935]
Generation 10 has ChiSquare: 0.060129
Generation 10 has fit parameters:
[0.8387858177101133, 0.6850849855634057]
Generation 15 has ChiSquare: 0.001492
Generation 15 has fit parameters:
[1.0904350077743412, 1.2027007403275813]
Generation 20 has ChiSquare: 0.001469
Generation 20 has fit parameters:
[0.9716429877952866, 0.9466681129902448]
Generation 25 has ChiSquare: 0.000114
Generation 25 has fit parameters:
[0.9784047411865372, 0.9554056558210251]
Generation 30 has ChiSquare: 0.000000
Generation 30 has fit parameters:
[0.996105436348129, 0.9934091068974504]
Generation 35 has ChiSquare: 0.000000
Generation 35 has fit parameters:
[0.996589586891175, 0.9938925277204567]
Generation 40 has ChiSquare: 0.000000
Generation 40 has fit parameters:
[1.0003791956048833, 1.0007133195321427]
Generation 45 has ChiSquare: 0.000000
Generation 45 has fit parameters:
[1.0000170425596364, 1.0000396089375592]
Generation 50 has ChiSquare: 0.000000
Generation 50 has fit parameters:
[0.9999013984263114, 0.9998041148375927]
STOP("VTRChangeOverGeneration with {'ftol': 1e-10, 'gtol': 1e-06, 'generations': 30, 'target': 0.0}")
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 54
Function evaluations: 2200
array([0.99999186, 0.99998338])
>>>
All of the above are running in parallel.
So, in summary, the code should work as-is (and without pre-pickling) -- unless maybe you are on Windows, where you might need to use freeze_support and run the code in the __main__ block.
Writing the function in a separate file worked for me.
Create rosen_custom.py with this code inside:
import numpy as np
def rosen(x):
    x = np.array(x)
    r = np.sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0,
               axis=0)
    return r
Then use it in DE:
from scipy.optimize import differential_evolution
from rosen_custom import rosen
import numpy as np
bounds = [(0,2), (0, 2), (0, 2), (0, 2), (0, 2)]
result = differential_evolution(rosen, bounds,
                                updating='deferred', workers=-1)
print(result.x, result.fun)
I want to do the following:
Fill NaN values in a single column using values within a specific range.
The range I want to use is the mean of the non-NaN values in the column +/- one standard deviation of those values.
NOTE: If possible, I would like to be able to use multiples of the std dev by simply multiplying it by a constant.
I thought I had it (see full code below) but the output from print(df['C'].describe()) shows that
I am generating values well outside my desired range. In fact, I am generating numbers outside
the original min and max of the column, which is definitely not what I want.
import pandas as pd
import numpy as np
import sys
print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m') # Bold red
display_settings = {
'max_columns': 15,
'max_colwidth': 60,
'expand_frame_repr': False, # Wrap to multiple pages
'max_rows': 50,
'precision': 6,
'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format
for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)
df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '\n')
# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65) # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75) # The percentage of non-NaN values.
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe(), '\n')
def fillNaN_with_unifrand(col):
    a = col.values
    m = np.isnan(a)  # mask of NaNs
    mu, sigma = col.mean(), col.std()
    a[m] = np.random.normal(mu, sigma, size=m.sum())
    return col
# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '\n')
print(df.isnull().sum(), '\n')
print(df['C'].describe())
Output of print(df['C'].describe()):
Starting:
count 130.000000
mean 462.446154
std 290.760432
min 7.000000
25% 187.500000
50% 433.000000
75% 671.250000
max 992.000000
Name: C, dtype: float64
Ending:
count 200
mean 517
std 298
min -187
25% 281
50% 544
75% 763
max 1218
Name: C, dtype: float64
Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.
Well, this is not how statistics works. A Gaussian (normal) distribution has a mean and a std, but values can be drawn far away from mean +- std; they are just less likely. By definition of a normal distribution, about 68 % of all values are within +- 1*std, about 95 % are within +- 2*std, and so on. The question is: what do you want to do with the outliers? Set them to mean +- std, or draw again?
Case 1: Set outliers to min/max
This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.
import numpy as np
from matplotlib import pyplot as plt

mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000)  # I used a size of 2000 as an example
a[a < (mu - sigma)] = mu - sigma
a[a > (mu + sigma)] = mu + sigma
plt.hist(a, bins=12, edgecolor='black')
plt.show()
Case 2: Truncated Normal Distribution
What you usually want is the truncated normal distribution. It creates a distribution with an upper and a lower boundary. You find this function in the scipy.stats module. It works a bit differently though: you first create the distribution by normalizing the lower and upper clips, and then you draw a number of random variates (rvs) from it, like this:
from matplotlib import pyplot as plt
import scipy.stats as stats
mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()
Multiples of sigma are easily implemented: you can just change your lower and upper clip like
lower_clip = mu-x*sigma
with x being your constant.
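Applied to the original question, a minimal sketch of a NaN-filling function that uses the truncated normal might look like this (the function name and the constant k are my own):
import numpy as np
import pandas as pd
import scipy.stats as stats

def fill_nan_truncnorm(col, k=1.0):
    """Fill NaNs in a Series with draws from a normal truncated at mean +/- k*std."""
    a = col.values
    m = np.isnan(a)                    # mask of NaNs
    mu, sigma = col.mean(), col.std()  # pandas skips NaNs here by default
    dist = stats.truncnorm(-k, k, loc=mu, scale=sigma)
    a[m] = dist.rvs(size=m.sum())
    return col

# e.g. df['C'] = fill_nan_truncnorm(df['C'], k=1.0)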
I want to use the Huber simultaneous scale and mean estimator found here: http://www.statsmodels.org/dev/generated/statsmodels.robust.scale.Huber.html, but here is the error:
In [1]: from statsmodels.robust.scale import huber
In [2]: huber([1,2,1000,3265,454])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-80c7d73a4467> in <module>()
----> 1 huber([1,2,1000,3265,454])
/usr/local/lib/python3.5/dist-packages/statsmodels/robust/scale.py in __call__(self, a, mu, initscale, axis)
132 scale = tools.unsqueeze(scale, axis, a.shape)
133 mu = tools.unsqueeze(mu, axis, a.shape)
--> 134 return self._estimate_both(a, scale, mu, axis, est_mu, n)
135
136 def _estimate_both(self, a, scale, mu, axis, est_mu, n):
/usr/local/lib/python3.5/dist-packages/statsmodels/robust/scale.py in _estimate_both(self, a, scale, mu, axis, est_mu, n)
176 else:
177 return nmu.squeeze(), nscale.squeeze()
--> 178 raise ValueError('joint estimation of location and scale failed to converge in %d iterations' % self.maxiter)
179
180 huber = Huber()
ValueError: joint estimation of location and scale failed to converge in 30 iterations
The weird thing is that it depends on the input:
In [3]: huber([1,2,1000,3265])
Out[3]: (array(1067.0), array(1744.3785635989168))
Is it a bug, or did I do something wrong here?
Thanks
EDIT: I knew about the tol and maxiter parameters; what you say works in that case, but here is an example where it doesn't:
In [1]: a=[4.3498776644415429, 16.549773154535362, 4.6335866963356445, 8.2581784707468771, 1.3508951981036594, 1.2918098244960199, 5.734
...: 9939516388453, 0.41663442483143953, 4.5632532990486077, 8.1020487048604473, 1.3823829480004797, 1.7848176927929804, 4.3058348043
...: 423473, 0.9427710734983884, 0.95646846668018171, 0.75309469901235238, 8.4689505489677011, 0.77420558084543778, 0.765060223824508
...: 45, 1.5673666392992407, 1.4109878442590897, 0.45592078018861532, 4.71748181503082, 0.65942167325205436, 0.19099796838644958, 1.0
...: 979997466466069, 4.8145761128848106, 0.75417363824157768, 5.0723603274833362, 0.30627007428414721, 4.8178689054947981, 1.5383475
...: 959362511, 0.7971041296695851, 4.689826268915076, 8.6704498595703274, 0.56825576954483947, 8.0383098149129708, 0.394000842811084
...: 22, 0.89827542590321019, 8.5160701523615785, 9.0413284666560934, 1.3590549231652516, 8.355489609767794, 4.2413169378427682, 4.84
...: 97143419119348, 4.8566372637376292, 0.80979444214378904, 0.26613505510736446, 1.1525345100417608, 4.9784132426823824, 1.07663603
...: 91211101, 1.9604545887151259, 0.77151237419054963, 1.2302626325699455, 0.846912462599126, 0.85852710339862037, 0.380355420248302
...: 99, 4.7586522644359093, 0.46796412732813891, 0.52933680009769146, 5.2521765047159708, 0.71915381047435945, 1.3502865819436387, 0
...: .76647272458736559, 1.1206637428992841, 0.72560665950851866, 4.4248008256265781, 4.7984989298357457, 1.0696617588880453, 0.71104
...: 701759920497, 0.46986438176394463, 0.71008686283792688, 0.40698839770374351, 1.0015132141773508, 1.3825224746094535, 0.932562703
...: 04709066, 8.8896053101317687, 0.64148877800521564, 0.69250319745644506, 4.7187793763802919, 5.0620089438920939, 5.17105647739872
...: 33, 9.5341720525579809, 0.43052713463119635, 0.79288845392647533, 0.51059695992994469, 0.48295891743804287, 0.93370512281086504,
...: 1.7493284310512855, 0.62744557356984221, 5.0965146009791704, 0.12615625248684664, 1.1064189602023351, 0.33183381198282491, 4.90
...: 32450273833179, 0.90296573725985785, 1.2885647882049298, 0.84669066664867576, 1.1481783837280477, 0.94784483590946278, 9.8019240
...: 792478755, 0.91501030105202807, 0.57121190468293803, 5.5511993201050887, 0.66054793663263078, 9.6626055869916065, 5.262806161853
...: 6908, 9.5905100705465696, 0.70369230764306401, 8.9747551552440186, 1.572014845182425, 1.9571634928868149, 0.62030418652298325, 0
...: .3395356767840213, 0.48287760518144929, 4.7937042347984198, 0.74251393675618682, 0.87369567300592954, 4.5381205696031586, 5.2673
...: 192797619084]
In [2]: from statsmodels.robust.scale import huber, Huber
In [3]: Huber(maxiter=10000,tol=1e-1)(a)
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py:168: RuntimeWarning: invalid value encountered in sqrt
/ (n * self.gamma - (a.shape[axis] - card) * self.c**2))
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py:164: RuntimeWarning: invalid value encountered in less_equal
subset = np.less_equal(np.fabs((a - mu)/scale), self.c)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-4b9929ff84bb> in <module>()
----> 1 Huber(maxiter=10000,tol=1e-1)(a)
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py in __call__(self, a, mu, initscale, axis)
132 scale = tools.unsqueeze(scale, axis, a.shape)
133 mu = tools.unsqueeze(mu, axis, a.shape)
--> 134 return self._estimate_both(a, scale, mu, axis, est_mu, n)
135
136 def _estimate_both(self, a, scale, mu, axis, est_mu, n):
/usr/lib/python3.6/site-packages/statsmodels/robust/scale.py in _estimate_both(self, a, scale, mu, axis, est_mu, n)
176 else:
177 return nmu.squeeze(), nscale.squeeze()
--> 178 raise ValueError('joint estimation of location and scale failed to converge in %d iterations' % self.maxiter)
179
180 huber = Huber()
ValueError: joint estimation of location and scale failed to converge in 10000 iterations
Sorry, this was my original error, but because "a" is long I tried to recreate the error with a smaller array. In this case, I don't think maxiter and tol are to blame.
The number of iterations allowed, maxiter, can be changed when using the Huber class.
e.g. this works
>>> from statsmodels.robust.scale import huber, Huber
>>> Huber(maxiter=200)([1,2,1000,3265,454])
(array(925.6483958529737), array(1497.0624070525248))
It is also possible to change the threshold parameter for the norm function when using the class. In very small samples like this the estimate might be very sensitive to the threshold parameter.
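For example, a sketch of tweaking the threshold along with maxiter/tol (the specific values here are illustrative only and may or may not converge for a given sample):
from statsmodels.robust.scale import Huber

# c is the tuning threshold of the Huber norm (default 1.5); a larger c
# downweights fewer observations, which together with maxiter/tol can
# change whether the joint location/scale iteration converges.
est = Huber(c=2.0, maxiter=500, tol=1e-06)
# location, scale_ = est([1, 2, 1000, 3265, 454])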
As an alternative, we can use the RLM model and regress on a constant; both the thresholds and the algorithm are different, but it should produce similar robust results. In the new example the estimate for the scale is between the standard deviation and the robust MAD, while the mean estimate is larger than the median but smaller than the mean.
>>> import numpy as np
>>> from statsmodels.robust import norms, scale
>>> from statsmodels.robust.robust_linear_model import RLM
>>> res = RLM(a, np.ones(len(a)), M=norms.HuberT(t=1.5)).fit(scale_est=scale.HuberScale(d=1.5))
>>> res.params, res.scale
(array([ 2.47711987]), 2.5218278029435406)
>>> np.median(a), scale.mad(a)
(1.1503564468849041, 0.98954533464908301)
>>> np.mean(a), np.std(a)
(2.8650886010542269, 3.0657561979615977)
The resulting weights show that some of the high values are downweighted:
>>> widx = np.argsort(res.weights)
>>> np.asarray(a)[widx[:10]]
array([ 16.54977315, 9.80192408, 9.66260559, 9.59051007,
9.53417205, 9.04132847, 8.97475516, 8.88960531,
8.67044986, 8.51607015])
I am not familiar with the details of the implementation of the Huber joint mean-scale estimator.
One possible reason for the convergence failure is that the distribution of the values is bunched in 3 groups with one extra outlier at 16, visible when plotting the histogram. This could result in a convergence cycle with the iterative solver where the third group is either included or excluded. But that is just a guess.
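A quick sketch to look at that grouping, reusing the list a from the question (not part of the original answer):
import matplotlib.pyplot as plt

# The values bunch roughly around 1, 5 and 9, with a single outlier near 16.5
plt.hist(a, bins=30, edgecolor='black')
plt.show()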
I am trying to find the equation of a line within a DF
Here is a fake data set to explain:
Clicks Sales
5 10
5 11
10 16
10 20
10 18
15 28
15 26
... ...
100 200
What I am trying to do:
Calculate the equation of the line between the points so that I can input a number of clicks and get a predicted number of sales at any level. The thing I am trying to wrap my brain around is that I have many different line functions (e.g. there are multiple sales values for each number of clicks). How can I iterate through my DF to calculate just one aggregate line function?
Here's what I have, but it only accepts ONE input at a time; I would like to create an average or aggregate...
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def slope(self, target):
        # Slope between this point and the target point (rise over run)
        return (target.y - self.y) / (target.x - self.x)

    def y_int(self, target):  # <= here's the magic
        return self.y - self.slope(target)*self.x

    def line_function(self, target):
        slope = self.slope(target)
        y_int = self.y_int(target)
        def fn(x):
            return slope*x + y_int
        return fn
a = Point(5, 10) # I am stuck here since - what to input!?
b = Point(10, 16) # I am stuck here since - what to input!?
line = a.line_function(b)
print(line(x=10))
Use the scipy function scipy.stats.linregress to fit your data.
Maybe also check https://en.wikipedia.org/wiki/Linear_regression to better understand linear regression.
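A minimal sketch of what that could look like on the sample data (the variable names are my own):
import pandas as pd
from scipy.stats import linregress

df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
                   'Sales':  [10, 11, 16, 20, 18, 28, 26, 200]})

# Fit a single least-squares line through all (Clicks, Sales) pairs at once
fit = linregress(df['Clicks'], df['Sales'])

def predict_sales(clicks):
    # Predicted sales for a given number of clicks
    return fit.slope * clicks + fit.intercept

print(fit.slope, fit.intercept)
print(predict_sales(10))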
You could group by Clicks and take the average of the Sales per group:
In [307]: sales = df.groupby('Clicks')['Sales'].mean(); sales
Out[307]:
Clicks
5 10.5
10 18.0
15 27.0
100 200.0
Name: Sales, dtype: float64
Then form the piecewise linear interpolating function based on
the groupwise-averaged data above using interpolate.interp1d:
from scipy import interpolate
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
For example,
import numpy as np
import pandas as pd
from scipy import interpolate
import matplotlib.pyplot as plt
df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
'Sales': [10, 11, 16, 20, 18, 28, 26, 200]})
sales = df.groupby('Clicks')['Sales'].mean()
Once you have the groupwise-averaged sales, you can compute the interpolated sales in a number of ways. One way is to use np.interp:
newx = [10]
print(np.interp(newx, sales.index, sales.values))
# [ 18.] <-- The interpolated sales when the number of clicks is 10 (newx)
The problem with np.interp is that you are passing sales.index and sales.values to np.interp every time you call it -- it has no memory of the interpolating function. It is re-computing the interpolating function every time you call it.
If you have scipy, then you could create the interpolating function once and then use it as many times as you like later:
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
print(fn(newx))
# [ 18.]
For example, you could evaluate the interpolating function at a whole bunch of points (and plot the result) like this:
newx = np.linspace(5, 100, 100)
plt.plot(newx, fn(newx))
plt.plot(df['Clicks'], df['Sales'], 'o')
plt.show()
Pandas Series (and DataFrames) have an interpolate method too. To use it, you reindex the Series to include the points where you wish to interpolate:
In [308]: sales.reindex(sales.index.union([14]))
Out[308]:
5 10.5
10 18.0
14 NaN
15 27.0
100 200.0
Name: Sales, dtype: float64
and then interpolate fills in the interpolated values where the Series is NaN:
In [295]: sales.reindex(sales.index.union([14])).interpolate('values')
Out[295]:
5 10.5
10 18.0
14 25.2 # <-- interpolated value
15 27.0
100 200.0
Name: Sales, dtype: float64
But I think it is perhaps not appropriate for your problem since it does not
return just the interpolated values you are looking for; it returns a whole
Series.
This is a continuation of the scenario I tried to discuss in my question https://stackoverflow.com/questions/33251445/tips-to-store-huge-sensor-data-in-hdf5-using-pandas. Please read the question for more details about what follows.
Since the linked question above was closed as the subject was too broad, I did not get a chance to gather ideas from people more experienced at handling hundreds of gigabytes of data. I do not have any experience with that whatsoever, and I am learning as I go. I have apparently made some mistake somewhere, because my method is taking way too long to complete.
The data is as I described in the linked question above. I decided to create a node (group) for each sensor (with the sensor ID as the node name, under root) to store the data generated by each of the 260k sensors I have. The file will end up with 260k nodes, and each node will have a few GB of data stored in a Table under it. The code that does all the heavy lifting is as follows:
with pd.HDFStore(hdf_path, mode='w') as hdf_store:
    for file in files:
        # Read CSV files in Pandas
        fp = os.path.normpath(os.path.join(path, str(file).zfill(2)) + '.csv')
        df = pd.read_csv(fp, names=data_col_names, skiprows=1, header=None,
                         chunksize=chunk_size, dtype=data_dtype)
        for chunk in df:
            # Manipulate date & epoch to get it in human readable form
            chunk['DATE'] = pd.to_datetime(chunk['DATE'], format='%m%d%Y', box=False)
            chunk['EPOCH'] = pd.to_timedelta(chunk['EPOCH']*5, unit='m')
            chunk['DATETIME'] = chunk['DATE'] + chunk['EPOCH']
            # Group on Sensor to store in HDF5 file
            grouped = chunk.groupby('Sensor')
            for group, data in grouped:
                data.index = data['DATETIME']
                hdf_store.append(group, data.loc[:, ['R1', 'R2', 'R3']])
    # Adding sensor information as metadata to nodes
    for sens in sensors:
        try:
            hdf_store.get_storer(sens).attrs.metadata = sens_dict[sens]
            hdf_store.get_storer(sens).attrs['TITLE'] = sens
        except AttributeError:
            pass
If I comment out the line hdf_store.append(group, data.loc[:,['R1', 'R2', 'R3']]), the bit under for chunk in df: takes about 40 - 45 seconds to finish processing an iteration. (The chunk size I am reading is 1M rows.) But with the line included in the code (that is if the grouped chunk is being written to HDF file) the code takes about 10 - 12 minutes for each iteration. I am completely baffled by the increase in execution time. I do not know what is causing that to happen.
Please give me some suggestions to resolve the issue. Note that I cannot afford execution times that long. I need to process about 220 GB of data in this fashion. Later I need to query that data, one node at a time, for further analysis. I have spent over 4 days researching the topic, but I am still as stumped as when I began.
#### EDIT 1 ####
Including df.info() for a chunk containing 1M rows.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 7 columns):
SENSOR 1000000 non-null object
DATE 1000000 non-null datetime64[ns]
EPOCH 1000000 non-null timedelta64[ns]
R1 1000000 non-null float32
R2 773900 non-null float32
R3 483270 non-null float32
DATETIME 1000000 non-null datetime64[ns]
dtypes: datetime64[ns](2), float32(3), object(1), timedelta64[ns](1)
memory usage: 49.6+ MB
Of these, only DATETIME, R1, R2, R3 are written to the file.
#### EDIT 2 ####
Including pd.show_versions()
In [ ] : pd.show_versions()
Out [ ] : INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.2
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.1
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
You are constantly indexing the rows as you write them. It is much more efficient to write all of the rows, THEN create the index.
See the documentation on creating an index here.
On the append operations pass index=False; this will turn off indexing.
Then, when you are finally finished, run the following (on each node), assuming store is your HDFStore:
store.create_table_index('node')
This operation will take some time, but will be done once rather than continuously. This makes a tremendous difference because the creation can take into account all of your data (and move it only once).
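Applied to the loop in the question, a rough sketch (reusing hdf_store, grouped, and sensors from the original code) might be:
# Inside the chunk loop: append without building the table index each time
for group, data in grouped:
    data.index = data['DATETIME']
    hdf_store.append(group, data.loc[:, ['R1', 'R2', 'R3']], index=False)

# After all files have been processed: build the index once per node
for sens in sensors:
    hdf_store.create_table_index(sens)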
You might also want to ptrepack your data (either before or after the indexing operation), to reset the chunksize. I wouldn't specify it directly, rather set chunksize='auto' to let it figure out an optimal size AFTER all of the data is written.
So this should be a pretty fast operation (even with indexing).
In [38]: N = 1000000
In [39]: df = DataFrame(np.random.randn(N,3).astype(np.float32),columns=list('ABC'),index=pd.date_range('20130101',freq='ms',periods=N))
In [40]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2013-01-01 00:00:00 to 2013-01-01 00:16:39.999000
Freq: L
Data columns (total 3 columns):
A 1000000 non-null float32
B 1000000 non-null float32
C 1000000 non-null float32
dtypes: float32(3)
memory usage: 19.1 MB
In [41]: store = pd.HDFStore('test.h5',mode='w')
In [42]: def write():
....: for i in range(10):
....: dfi = df.copy()
....: dfi.index = df.index + pd.Timedelta(minutes=i)
....: store.append('df',dfi)
....:
In [43]: %timeit -n 1 -r 1 write()
1 loops, best of 1: 4.26 s per loop
In [44]: store.close()
In [45]: pd.read_hdf('test.h5','df').info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000000 entries, 2013-01-01 00:00:00 to 2013-01-01 00:25:39.999000
Data columns (total 3 columns):
A float32
B float32
C float32
dtypes: float32(3)
memory usage: 190.7 MB
Versions
In [46]: pd.__version__
Out[46]: u'0.17.0'
In [49]: import tables
In [50]: tables.__version__
Out[50]: '3.2.2'
In [51]: np.__version__
Out[51]: '1.10.1'