TA-Lib Python Finance Library - Applying to New Data

I applied TA-Lib to 5,000 stocks on daily charts and saved the result set to a file/database.
Now, at the end of each day, new data arrives: one new row per stock. How do we deal with the new data?
Given that each indicator has its own lookback (I'm using the defaults for now), do I need to pull the last X days of data back into a pandas frame, reapply the indicator, and then save only the latest row with the TA value? Or should a program loop indefinitely, keeping the pandas frames in a cache, applying TA, and saving the last row?
Can people comment on how this is being used? The indicators will be applied to daily, 4-hour, 1-hour, and 1-minute data.
Please share ideas, and code if any, on how best to deal with this.
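To make the first option concrete, here is a minimal sketch of what I mean, using RSI as an example; load_recent_bars and save_indicator_row are hypothetical stand-ins for my file/database layer:

import talib

RSI_PERIOD = 14
LOOKBACK = 100  # comfortably more bars than the indicator's own lookback

def update_symbol(symbol):
    # Hypothetical helper: returns a DataFrame with a "close" column,
    # oldest row first, covering at least LOOKBACK bars.
    bars = load_recent_bars(symbol, LOOKBACK)
    rsi = talib.RSI(bars["close"].to_numpy(dtype="float64"),
                    timeperiod=RSI_PERIOD)
    # Everything before the last value was saved on earlier runs,
    # so persist only the newest row.
    save_indicator_row(symbol, bars.index[-1], {"rsi": rsi[-1]})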

TA-Lib is a C library with a Python wrapper around it, and it is not designed to process newly received data without recalculating everything from scratch. That's why I forked the TA-Lib library, renamed the fork TA-Lib RT, and implemented modified TA functions that accept (and return) state objects. So one can call an indicator with only the new data plus the state object obtained from calling that indicator on the previous data. As a result, the user can process data piece by piece without recalculation. The fork's code can be found here.
I also tried to make a Python wrapper and managed (I don't know Python) to produce a proof-of-concept version. Its code is here. There is also a discussion of this proof of concept on the pages of the original TA-Lib Python wrapper project. The problem is that, due to the Python-to-C overhead, calling a TA function for each new value from Python is about 60 times slower than calling it once for the whole data set. But if you can cache incoming data and process it in bigger pieces, you can get to within about 4x of the original functions.
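To illustrate the batching idea with the standard TA-Lib wrapper (not my fork's state API, whose exact signatures I won't reproduce here): buffer incoming closes and recompute once per batch, amortizing the per-call overhead. A sketch, with a bounded history so the recomputation window stays small:

import numpy as np
import talib

class BatchedRSI:
    def __init__(self, period=14, batch=16, history=512):
        # history must stay comfortably larger than period + batch
        self.period, self.batch, self.history = period, batch, history
        self.closes = []
        self.pending = 0

    def push(self, close):
        # Returns the newly computed RSI values once a full batch has
        # arrived, otherwise None.
        self.closes.append(close)
        del self.closes[:-self.history]  # keep memory and recompute time bounded
        self.pending += 1
        if self.pending < self.batch:
            return None
        self.pending = 0
        values = talib.RSI(np.asarray(self.closes, dtype=np.float64),
                           timeperiod=self.period)
        return values[-self.batch:]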

Related

Python Dash table live updates: any alternative to Interval?

I am making a dashboard with a table using Python Dash. I want it to be updated every time I gather data in the background. The problem is that I can't use dcc.Interval, as my data gathering can sometimes take a bit longer, so I can't set periodic updates.
Are there any alternatives?
Can I use the change in the table data itself as a trigger to fire the callback again? I tried that, but the code below is not working.
@app.callback(Output('table', 'data'),
              Output('table', 'style_data_conditional'),
              Input('table', 'data'))
def updateTable(ignore):
    return get_data()  # would need to return values for both Outputs
The table can't change itself, or it would just create an infinite loop.
Some options:
You could set the interval to something much larger, like every one or two minutes.
You could use a button to manually trigger an update each time the user clicks it (see the sketch after this list).
You could also look into using web sockets. See some examples here and here. That would require something external to be able to send a message to your Dash app, though.
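A minimal sketch of the button approach, assuming Dash 2.x; get_data is your hypothetical helper returning the table rows, and the component IDs are illustrative:

from dash import Dash, Input, Output, dash_table, html

app = Dash(__name__)
app.layout = html.Div([
    html.Button("Refresh", id="refresh-btn"),
    dash_table.DataTable(id="table"),
])

@app.callback(Output("table", "data"),
              Input("refresh-btn", "n_clicks"),
              prevent_initial_call=True)
def refresh_table(n_clicks):
    return get_data()  # hypothetical helper returning a list of row dicts

if __name__ == "__main__":
    app.run_server(debug=True)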

Share Python objects between processes in Python 3

Here I created a producer-consumer program: the parent process (producer) creates many child processes (consumers), then the parent process reads a file and passes the data to the child processes.
But here comes a performance problem: passing messages between processes costs too much time (I think).
For example, with 200 MB of original data, reading and preprocessing in the parent process costs less than 8 seconds, but just passing the data to the child processes via multiprocessing.Pipe costs another 8 seconds, while the child processes do the remaining work in only another 3-4 seconds.
So a complete run costs less than 18 seconds, and more than 40% of that time is spent on communication between processes, which is much more than I expected. I also tried multiprocessing.Queue and Manager; they were worse.
I work with Windows 7 / Python 3.4.
I have googled for several days; POSH may be a good solution, but it can't be built with Python 3.4.
So I have three questions:
1. Is there any way to share Python objects directly between processes in Python 3.4, as POSH does?
2. Is it possible to pass a "pointer" to an object to a child process, so that the child process can recover the "pointer" back into a Python object?
3. multiprocessing.Array may be a valid solution, but if I want to share a complex data structure, such as a list, how would it work? Should I make a new class based on it and provide list-like interfaces?
Edit 1:
I tried the 3rd way, but it worked even worse.
I defined these values:
p_pos = multiprocessing.Value('i')               # producer write position
c_pos = multiprocessing.Value('i')               # consumer read position
databuff = multiprocessing.Array('c', buff_len)  # shared buffer
and two functions:
send_data(msg)
get_data()
In the send_data function (parent process), it copies msg into databuff and sends the start and end positions (two integers) to the child process via a pipe.
Then, in the get_data function (child process), it receives the two positions and copies the msg out of databuff.
In the end, it cost twice as much as just using the pipe #_#
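For reference, a compact sketch of that pattern, simplified to a single fixed message (the real code manages p_pos/c_pos and wraparound); only the two positions cross the pipe, while the payload lives in shared memory:

import multiprocessing as mp

def consumer(conn, databuff):
    while True:
        pos = conn.recv()                 # a (start, end) tuple, or None to stop
        if pos is None:
            break
        start, end = pos
        msg = bytes(databuff[start:end])  # copy the payload out of shared memory
        # ... process msg ...

if __name__ == "__main__":
    buff_len = 1 << 20
    databuff = mp.Array("c", buff_len, lock=False)  # shared byte buffer
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=consumer, args=(child_conn, databuff))
    proc.start()
    msg = b"example payload"
    databuff[0:len(msg)] = msg       # producer writes into shared memory
    parent_conn.send((0, len(msg)))  # only two small integers cross the pipe
    parent_conn.send(None)
    proc.join()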
Edit 2:
Yes, I tried Cython, and the result looks good.
I just changed my Python script's suffix to .pyx and compiled it, and the program sped up by 15%.
No doubt, I ran into the "Unable to find vcvarsall.bat" and "The system cannot find the file specified" errors; I spent a whole day solving the first one and was blocked by the second.
Finally, I found Cyther, and all my troubles were gone ^_^
I was in your place five months ago. I looked around a few times, but my conclusion is that multiprocessing with Python has exactly the problem you describe:
Pipes and Queues are good, but in my experience not for big objects.
Manager() proxy objects are slow, except for arrays, and those are limited. If you want to share a complex data structure, use a Namespace as is done here: multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes
Manager() has the shared list you are looking for (see the sketch after this list): https://docs.python.org/3.6/library/multiprocessing.html
There are no pointers or real memory management in Python, so you can't share selected memory cells.
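A minimal sketch of that shared list (simple, but every element access is a round trip to the manager process, which is why it is slow):

import multiprocessing as mp

def worker(shared, i):
    shared[i] = i * i  # each access is a proxy call into the manager process

if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.list([0] * 4)
        procs = [mp.Process(target=worker, args=(shared, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(list(shared))  # [0, 1, 4, 9]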
I solved this kind of problem by learning C++, but that's probably not what you want to read...
To pass data (especially big numpy arrays) to a child process, I think mpi4py can be very efficient, since it works directly on buffer-like objects.
An example of using mpi4py to spawn processes and communicate (also using trio, but that's another story) can be found here.

Progress bar position increment in a multithreaded download

I made a multithreaded download application, and now I have to show the progress of each downloading thread, as in IDM: when data is downloaded, the progress bar is notified about the downloaded data, and, as you know, each thread's position in the progress bar has to begin from a specified offset. Now the questions are:
How can I increment the progress position according to the downloaded data? It is pretty simple in a single-threaded application using IDHTTPWORK; can I use the same method in a multithreaded application, or is there another simple method?
Do I need to synchronise the instructions that increment the position?
Suppose you have N downloads of known size, M[i] bytes each. Before you start downloading, sum these values to get the total number of bytes to be downloaded, M.
While the threads are working, they keep track of how many bytes have been downloaded so far, say m[i]. Then, at any point in time, the proportion of the task that is complete is:
Sum(m[i]) / M
You can update the progress from the main thread using a timer. Each time the timer fires, calculate the sum of the m[i] counts. There's no need for synchronisation here so long as the m[i] values are aligned; any data races are benign.
Now, m[i] might not literally be stored in an array. You might have an array of download-thread objects, each of which stores all the information relating to its download, including m[i].
Alternatively, you can use the same sort of synchronized updating as you would in single-threaded code: remove the timer and update from the main thread whenever you receive new progress information. However, with a lot of threads there is a lot of synchronization, and that can potentially lead to contention. The lock-free approach above would be my preference, even though it involves polling on a timer.
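The question is Delphi-flavoured, but the timer/polling structure is language-neutral; here is a minimal Python sketch of the same idea, with the worker loop standing in for real download code:

import threading
import time

class Download:
    def __init__(self, total_bytes):
        self.total = total_bytes  # M[i]
        self.done = 0             # m[i]: written only by this download's thread

def worker(d):
    while d.done < d.total:
        time.sleep(0.01)  # stand-in for receiving a chunk
        d.done = min(d.done + 4096, d.total)

downloads = [Download(100_000), Download(250_000), Download(50_000)]
threads = [threading.Thread(target=worker, args=(d,)) for d in downloads]
for t in threads:
    t.start()

total = sum(d.total for d in downloads)    # M
while any(t.is_alive() for t in threads):
    done = sum(d.done for d in downloads)  # Sum(m[i]); the races are benign
    print(f"progress: {100 * done / total:.1f}%")
    time.sleep(0.5)                        # the "timer" tick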
You can take a look at the subclassed MFC list controls developed in the article Michael Dunn wrote 15 years ago: Articles/79/Neat-Stuff-to-Do-in-List-Controls-Using-Custom-Dra on codeproject dot com.
If you implement one of them, say CXListCtrl* pListCtrl, at thread-creation time, then the progress reporting of that thread becomes as simple as making calls such as:
pListCtrl->SetProgress(mItem, 0);
when it's time to start showing progress, and
pListCtrl->SetProgress(mItem, i);
when you're i% done.
Actually, if you just want the progress bar functionality and don't care about what's under the hood, you could obtain and use, without modification (or license issues), the XListCtrl class from the Work Queue article at Articles/3607/Work-Queue on that same site.

Implementation of realtime break iterator

I'm interested in modifying the break iterator data (zh) while my program is running, as the user adds new words. This means that the data cannot be packaged ahead of time and must be generated as I go. Can I use something like udata_setAppData or udata_setCommonData to achieve this? I expect the .dat for the break iterator to change 2-3 times a day, so loading time should not be the critical issue.
Here's the pseudocode:
1. Start program
2. Generate .dat-like data from database for break iterators
3. Load into icu as zh break iterator
If the user makes a change to the database
4. Drop current .dat for zh break iterator
5. Regenerate .dat-like data
6. Reload
Is this possible? I think it is almost possible if I have a way of replacing U_ICUDAT_BRKITR on the fly.
Update: it seems that to pull this off, I must use code from gencmn to generate the new .dat file.
There is no API to customize the dictionary.

Progress bar and multiple threads, decoupling GUI and logic - which design pattern would be the best?

I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
The graph can be scrolled - in this case new portions of data need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series: one for the dataset update and one for the graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that reports the global progress. At the moment I can think of MVC and Observer/Observable, but it's all a little bit blurry :) Maybe somebody could point me in the right direction. Thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed - not just the "amount" of data but also the "type" of data - and B) the number of CPUs: 2 of the steps scaled extremely well with increasing core counts, 2 steps ran in 2 threads, and 2 steps were effectively single-threaded.
The mix of data had a much larger impact on the execution time of each step than the number of cores did.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function considered both the data under analysis and the number of CPUs. Based on run-time data from my own 4-core machine, each function returned the number of milliseconds its step was expected to take, on my machine:
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now, given this information, you effectively know what percentage of the total execution time each step is supposed to take. If, say, step 1 is supposed to take 40% of the execution time, you need to find a way to emit 40 1% events from that algorithm. If its for-loop is processing 100,000 items, you could do:
for (int i = 0; i < numItems; i++) {
    // with percentageOfTotalForThisStep == 40, this fires 40 evenly spaced events
    if (i % (numItems / percentageOfTotalForThisStep) == 0) emitProgressEvent();
    // .. do the actual processing ..
}
This gave us a silky-smooth progress bar that performed flawlessly. Your implementation technology may offer different forms of scaling and different progress-bar features, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratios (which are the only really important thing here), so you can see how your local hardware differs from the one I had.
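A sketch of the same scheme in Python; f1..f6, data, cpu_count and update_progress_bar are hypothetical placeholders for the per-step heuristics and the UI:

estimates = [f(data, cpu_count) for f in (f1, f2, f3, f4, f5, f6)]
total_ms = sum(estimates)
weights = [e / total_ms for e in estimates]  # share of the bar each step owns

progress = 0.0  # overall completion, 0..1

def run_step(step_body, weight, num_items):
    # Run one step of num_items work units, mapping its ticks onto its share.
    global progress
    base = progress
    for i in range(num_items):
        step_body(i)  # one unit of this step's work
        progress = base + weight * (i + 1) / num_items
        update_progress_bar(progress)  # e.g. throttled to 1% increments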
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker and, in cases such as yours, numeric properties representing the progress and the expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based; if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need some way of notifying its owning Operation of its current progress through whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever your programming language/threading framework provides).
So while the actual work is being performed off in separate threads, the corresponding Operation objects in the "main" thread are continually updated with their workers' progress. The progress bar can then update itself accordingly, mapping the sum of the Operations' "expected" amounts to its total, and the sum of the Operations' "progress" amounts to its current position, in whatever way makes sense for your progress bar framework.
Obviously there's a ton of other considerations and work that needs to be done in actually implementing this, but I hope this gives you the gist of it.
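A bare-bones sketch of the Operation idea in Python (the names are illustrative):

import threading

class Operation:
    # Main-thread handle for one worker; the worker reports progress into it.
    def __init__(self, expected_units):
        self.expected = expected_units  # time, bytes, or calculation count
        self.done = 0
        self._lock = threading.Lock()

    def report(self, units):
        # Called from the worker thread whenever it completes some units.
        with self._lock:
            self.done = min(self.done + units, self.expected)

def overall_progress(operations):
    # Map all workers onto a single 0..1 progress bar.
    total = sum(op.expected for op in operations)
    done = sum(op.done for op in operations)
    return done / total if total else 1.0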
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download-manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer design questions like this, I first look at similar or analogous problems in other applications and how they are solved there. So I would suggest you do some research into other applications that display complex progress (like the download-manager example) and try to adapt an existing solution to your application.
Sorry I can't offer a more specific design; this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series-processing threads and reports status by updating the summary bar.
