We are working on algorithmic trading software in C#. We monitor the market price and, based on certain conditions, want to buy the stock.
User input is taken from the GUI (WPF) and sent to the back end for monitoring.
The back end receives data continuously from the stock exchange and checks whether the user-entered price is met within certain limits and conditions. If all are satisfied, we buy / sell the stock (in futures, FUT).
Now, I want to design my Back end service.
I need the Task Parallel Library or a custom thread pool, where I create my tasks / threads / pool when the application starts (maybe incrementally, or a fixed number, say 5000).
All will be in a waiting state.
Once a user creates an algorithm, we will activate one thread from the pool, which monitors the price for each incoming string. If it matches, we buy / sell and then go back into the waiting state. (I don't want to create and destroy the threads / tasks, as that is time-consuming.)
So please, can you help me in this regard? Is the above approach good, or is there another approach?
I am stuck with this idea and not able to think outside the box on this.
The above approach is definitely not "good"
Given the idea above, the architecture is wrong in many cardinal aspects. If your Project aspires to survive in 2017+ markets, try to learn from the mistakes already made in the years 2007-2016.
The percentages demonstrate the NBBO flutter for all U.S. Stocks from 2007-01 ~ 2012-01. ( Lower values mean better NBBO stability; higher values, instability. ) ( courtesy NANEX )
Financial Markets operate on nanosecond scales
Yes, a few inches of glass-fibre signal propagation transport delay decide on PROFIT or LOSS.
If you plan to trade in Stock Markets, your system will observe the HFT crowd doing the dirty practice of Quote Stuffing and Vacuum-Cleaning 'em right in front of your nose, at such scales that your single-machine multi-threaded execution will just move through thin air, or fall into a gap already created many microseconds before your decision took place on your localhost CPU.
The rise of HFT from 2007-01 ~ 2012-01 ( courtesy NANEX ).
You may read more about the illusion of liquidity here.
See the expansion of Quotes against the level of Trades ( courtesy NANEX ).
Even if one decides to trade in a single instrument, on FX the times are prohibitively short ( more than 20% of the ToB Bids are changed in less than 2 ms and do not arrive at your localhost before your trading algorithm could react accordingly ).
If your TAMARA-measurements at your localhost are similar to this, simply forget about trading in any HF/MF/LF-HFT instruments -- you simply do not see the real market ( the tip of the iceberg ) -- as the +20% of price-events happen in the very first column ( 1 .. 2 ms ), where you do not see any single event at all!
5000 threads is bad; don't ever do that. You'll lose more performance to context switching than you gain from parallel execution. Traditionally, the number of threads for your application should by default equal the number of cores in your system. There are other possible variants, but they probably aren't the best option for you.
So you can use a ThreadPool with a work-item method containing an infinite loop. This is very low-level, but you keep control over what is going on in your system. A callback function could update the UI so the user is notified about the trading results.
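A minimal sketch of that idea, assuming a shared BlockingCollection of incoming ticks and a caller-supplied check (all names here are illustrative, not from the question):

using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class TickDispatcher
{
    // Incoming market data; Post() is called by the exchange feed handler.
    private readonly BlockingCollection<string> _ticks = new BlockingCollection<string>();

    public void Post(string tick) => _ticks.Add(tick);

    public void Start(Action<string> checkAndTrade)
    {
        // One long-lived work item per core; each permanently occupies a pool thread.
        for (int i = 0; i < Environment.ProcessorCount; i++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                // The "infinite loop": GetConsumingEnumerable() blocks while no tick is queued.
                foreach (string tick in _ticks.GetConsumingEnumerable())
                    checkAndTrade(tick);
            });
        }
    }
}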
However, since you say you can use the TPL, I suggest considering these two options for your case:
Use a collection of tasks running forever to check for new trading requests. You should still tune the number of simultaneously running tasks, because you probably don't want them fighting each other for CPU time. As LongRunning tasks are created with a dedicated background thread, many of them will degrade your application's performance as well. In this approach you could introduce a strategy-pattern implementation for the algorithm being run inside the task.
Set up a TPL Dataflow process within your application. For this approach you should encapsulate the info about the algorithm inside a DTO object and introduce a pipeline (a sketch follows after the list):
A BufferBlock for storing all the incoming requests. Maybe you can use a BroadcastBlock here, if you want to check the sell and buy options in parallel. You can link the blocks with a boolean predicate so that different blocks process different types of requests.
An ActionBlock (maybe one block per user algorithm) for running the algorithmic check for the pattern on which you base the decision.
An ActionBlock for storing all the buy / sell requests for data that successfully passed the algorithm.
A BufferBlock for the UI reaction, with Reactive Extensions (an introductory book for Rx, if you aren't familiar with it).
This solution still has to be tuned with block-creation options, and you should measure how exactly your data flows across the trading algorithm, the speed of the decision making, and the overall performance. You should examine the defaults for the TPL Dataflow blocks, which you can find in the official documentation. Another good place to start is Stephen Cleary's introductory blog posts (Part 1, Part 2, Part 3) and chapter 4 about this library in his book.
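A hedged sketch of that pipeline (the TradeRequest DTO and the routing predicate are hypothetical; the blocks and LinkTo overloads are the library's own):

using System;
using System.Threading.Tasks.Dataflow;

public class TradeRequest
{
    public string Symbol { get; set; }
    public decimal Price { get; set; }
    public bool IsBuy { get; set; }
}

public static class TradePipeline
{
    public static BufferBlock<TradeRequest> Build(Predicate<TradeRequest> algorithmMatches)
    {
        var input = new BufferBlock<TradeRequest>();
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        // One ActionBlock per side; the LinkTo predicates route the requests.
        var buy  = new ActionBlock<TradeRequest>(r => { /* place the buy order */ });
        var sell = new ActionBlock<TradeRequest>(r => { /* place the sell order */ });

        input.LinkTo(buy, link, r => algorithmMatches(r) && r.IsBuy);
        input.LinkTo(sell, link, r => algorithmMatches(r) && !r.IsBuy);
        input.LinkTo(DataflowBlock.NullTarget<TradeRequest>()); // drop non-matching requests

        return input;
    }
}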
With C# 5.0, the natural approach is to use async methods running on top of the default thread pool.
This way, you create Tasks quite often, but the most prominent cost of that is in GC, and unless you have very high performance requirements, that cost should be acceptable.
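For instance, a minimal sketch (the BufferBlock feed, the names, and the limit check are assumptions, not prescribed by this answer): each user algorithm becomes one async monitoring loop, and the await releases the pool thread whenever no tick is pending.

using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public sealed class PriceMonitor
{
    private readonly BufferBlock<decimal> _ticks = new BufferBlock<decimal>();

    // Called by the market-data feed handler for every incoming price.
    public void Post(decimal price) => _ticks.Post(price);

    public async Task RunAsync(decimal limitPrice)
    {
        while (true)
        {
            decimal price = await _ticks.ReceiveAsync(); // no thread is blocked while waiting
            if (price <= limitPrice)
            {
                // conditions met: submit the buy / sell order, then keep monitoring
            }
        }
    }
}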
I think you would be better off with an event loop, and if you need to scale, you can always shard by stock.
I am reading ZeroMQ - The Guide, currently on Chapter 4, for those of you who know it.
http://zguide.zeromq.org/page:all
I am working in Python with the pyzmq binding.
The author says that we should forget everything we know about concurrent programming, never use locks or critical sections, etc.
Right now I am doing a pet project for fun with ZeroMQ. I have a piece of data which is shared between some threads (don't worry, my threads don't pass sockets). They are sharing a single database.
My question is:
Should I put a lock around that piece of data to avoid race conditions, as one normally would in order to serialize access, or is this something to avoid when using ZeroMQ because better alternatives exist?
I remember the author saying that one should always share data between threads using inproc:// or ipc:// ( for processes ), but I am not sure how that fits here.
A lot of thrilling FUN in doing this, indeed
Yes, Pieter HINTJENS does not advise any step outside of his Zen-of-Zero. Share nothing, lock nothing ... as you have noticed many times in his great book already.
The cornerstone of the problem is protecting the [TIME]-domain consistent reflections of the Database-served pieces of data.
In distributed-system, the [TIME]-domain problem spans a non-singular landscape, so more problems spring out of the box.
If we know there is no life-saving technology connected to the system ( regulatory + safety validations happily do not apply here ), there could be a few tricks to solve the game with ZeroMQ and without any imperative locks.
Use .setsockopt( { ZMQ.CONFLATE | ZMQ.IMMEDIATE }, 1 ) infrastructure, with .setsockopt( ZMQ.AFFINITY, ... ) and .setsockopt( ZMQ.TOS, ... ) latency-tweaked hardware-infrastructure + O/S + kernel setups end-to-end, where possible.
A good point to notice is also that the Python threading model still goes under GIL-lock-stepping, so the principal collision avoidance is already in place.
An indeed very hard-pet-project :( ZeroMQ cannot and will not avoid problems )
Either
co-locate the decision-making process(es), so as to have an almost-zero latency on taking decisions on updated data,
or
permit non-local decision taking, but equip it with a set of robust rules, based on the latency of the ( principally uncompensated -- impossible to compensate until a Managed Quantum Entanglement API gets published and works indeed at our service ) [TIME]-domain stamped event-notifications -- thus having also a rules-controlled chance to moderate the DB data-reflection's consistency corner cases, where an "earlier"-served DB-READ was delivered "near" or "after" a known DB-WRITE changed the value -- both visible to the remote observer at almost the same time.
Database's own data-consistency is maintained by the DBMS-engine per se. No need to care here.
Let's imagine the Database-accesses being mediated "through" ZeroMQ communication tools. The risk is not in the performance-scaling, where ZeroMQ enjoys an almost-linear scaling, the problem is in the said [TIME]-domain consistency of the served "answer" anywhere "behind" the DBMS-engine's perimeter, the more once we got into the distributed-system realms.
Why?
A DBMS-engine's "locally" data-consistent "answer" to a read-request, served by the DBMS-engine right at a moment of UTC:000,000,000.000 000 000 [s] will undergo a transport-class specific journey, along the intended distribution path, but -- due to principal reasons -- does not get delivered onto a "remote"-platform until UTC:000,000,000.??? ??? ??? [s] ( depending on the respective transport-class and intermediating platforms' workloads ).
Next, there may be and will be an additional principal induced latency, caused by the workloads of otherwise uncoordinated processes' requests, principally concurrent in their respective appearance, that later get somehow aligned into a pure-serial queue, be it due to the ZeroMQ Context()'s job or someone else's. The dynamics of queue management and resource (un-)availability during these phases add another layer of latency uncertainty.
Put together, one may ( and ought to ) fight like a herd of lions to shave off latency costs along the { up | down }-paths, yet getting a minimum + predictable latency is more a wish than an easily achievable low-hanging fruit.
DB-READs may seem principally easy to serve, as they might appear lock-free and un-orchestrated among themselves, yet the very first DB-WRITE-request may invalidate some of the already scheduled "answer"-s that were not yet sent out onto the wire ( and each such piece of data ought to have been updated / replaced by the DBMS-local [TIME]-domain "freshmost" piece of data -- as no one is willing to dog-fight, the less to further receive shots, from a plane that was already known to have been shot down a few moments ago, is she ... ? )
These are lessons already learnt while collecting experience from smart distributed-system designs ( Real-Time Multiplayer / Massive Ops SIMs being the best of the best ).
The inproc:// transport-class is the best tool for co-located decision taking, yet it will not help you in Python-based ecosystems ( ref. GIL-lock-stepping enforces a pure-serial execution, and the latency of a GIL-release goes ~3~5 orders of magnitude above almost-"direct" RAM reads/writes ).
The ipc:// transport-class sockets may span inter-process communications co-located on the same host, yet if one side goes in Python, the GIL-lock-stepping will still "chop" your efforts to minimise the accumulated latencies, as regular blocking will appear in GIL-interval steps, and added latency-jitter is the worst thing a real-time distributed-system designer could dream of :o)
I need an architectural opinion and approaches to the following problem:
INTRO:
We have a table of ~4M rows called Purchases. We also have a table of ~5k rows called Categories. In addition, we have a table of ~4k SubCategories. We are using T-SQL to store the data. At a user's request ( during runtime ), the server receives a request of about 10-15 N possibilities of parameters. Based on the parameters, we take purchases, sort them by categories and subcategories, and do some computing. Some of the "computing" process includes filtering, sorting, rearranging fields of purchases, subtracting purchases from each other, adding some other purchases to each other, finding savings, etc... This process is user-specific, therefore every user WILL get different data back, based on their role.
Problem:
This process takes about 3-5 minutes and we want to cut it down. Previously, this processing was done in-memory, in the browser, via web workers (JS). We moved away from that as the memory started to get really large and most browsers started to fail on load. Then we moved the service to the server (NodeJS), which processed the request on the fly via child processes. Reason for child processes: the computing process goes through a for loop about 5,000 times ( once for each category ) and does the above-mentioned "computing". Via child processes we were able to distribute the work across a number of child processes, which gave us somewhat better results if we ran on at least 16 cores ( 16 child processes ). Current processing time is down to about 1.5-2 minutes, but we want to see whether we have better options.
I understand it's hard to fully understand our goals without seeing any code, so let me ask specifically: what are some ways of doing computations on semi-big data at runtime?
Some thoughts we had:
using SQL in-memory tables and doing the computations in SQL
using Azure Batch services
using bigger machines ( ~32-64 cores; this may be our best shot if we cannot get any other ideas, but of course the cost increases drastically, and we accept the fact that it will )
stepping into the Hadoop ecosystem ( or other big-data ecosystems )
some other useful facts:
our purchases data is about ~1GB ( becoming a little too large for in-memory computing )
We are thinking of doing pre-computing and caching on Redis, to have SOME data ready for the client ( we are going to use the parameter sets in their accounts to pre-compute every day, yet clients tend to change those parameters frequently, therefore we have to have some efficient way of handling data that is NOT cached and pre-computed )
If there is more information we can provide to better understand our dilemma, please comment and I will provide as much info as possible. There would be too much code to paste here for one to fully understand the algorithms, therefore I want to try to describe our problem in words if possible.
Never decide about technology before being sure about workflow's critical-path
This will never help you achieve ( a yet unknown ) target.
Not knowing the process's critical path, no one can calculate any speedup from whatever architecture one may aggressively "sell" you or just "recommend" you to follow because it is "geeky/nerdy/sexy/popular" - whatever one likes to hear.
What would you get from such pre-mature decision?
Typically a mix of both the budgeting ( co$t$ ) and Project-Management ( sliding time-scales ) nightmares:
additional costs ( new technology also means new skills to learn, new training costs, new delays for the team to re-shape and re-adjust and grow into mature use of the new technology at performance levels better than those of the currently used tools, etc. )
risks of choosing a "popular" brand which, on the other hand, does not exhibit any of the powers the marketing texts were promising ( but once having paid the initial costs of entry, there is no other way than to bear the risk of never achieving the intended target, possibly due to overestimated performance benefits & underestimated costs of transformation & heavily underestimated costs of operations & maintenance )
What would you say if you could use a solution where "better options" remain your options:
you can start now, with the code you are currently using without a single line of code changed
you can start now with a still YOUR free-will based gradual path of performance scaling
you can avoid all the risks of (mis-)investing into any extra-premium-cost "super-box", and rather stay on the safe side and re-use cheap, massively in-service-tested / fine-tuned / deployment-proven COTS hardware units ( a common dual-CPU + a-few-GB machine, commonly used by the thousands in datacentres )
you can scale up to any level of performance you need, growing CPU-bound processing performance gradually from the start, hassle-free, up to some ~1k ~2k ~4k ~8k CPUs as needed -- yes, up to many thousands of CPUs that your current workers' code can immediately use to deliver the immediate benefit of the increased performance, thus leaving your teams free hands and more time for thorough work on possible design improvements and code re-factoring for even better performance envelopes, if the current workflow, having been "passively" just smart-distributed to say ~1000, later ~2000 or ~5000 CPU-cores ( still without a single SLOC changed ), does not suffice on its own
you can scale up -- again, gradually, on an as-needed basis, hassle-free -- up to ( almost ) any size of the in-RAM capacity, be it on Day 1 ~8TB, ~16TB, ~32TB, ~64TB, jumping to ~72TB or ~128TB next year, if needed -- all that keeping your budget always ( almost ) linear and fully adjusted by your performance plans and actual customer-generated traffic
you can isolate and focus your R&D efforts not on (re)-learning "new"-platform(s), but purely into process (re)-design for further increasing the process performance ( be it using a strategy of pre-computing, where feasible, be it using smarter fully-in-RAM layouts for even faster ad-hoc calculations, that cannot be statically pre-computed )
What would business owners say to such ROI-aligned strategy?
If one makes the CEO + CFO "buy" any new toy, well, that is cool for hacking this today and that tomorrow, but such an approach will never make shareholders any happier than throwing ( their ) money into the river Nile.
If one can show the ultimately efficient Project plan, where most of the knowledge and skills are focused on the business-aligned target while at the same time protecting the ROI, that would make your CEO + CFO happy, and I guarantee all your shareholders too, wouldn't it?
So, which way would you decide to go?
This topic isn't really new, but just in case... As far as my experience can tell, your T-SQL DB might be your bottleneck here.
Have you measured the performance of your SQL queries? What do you compute on the SQL Server side? On the Node.js side?
A good start would be to measure the response time of your SQL queries, revamp your queries, work on indexes, and dig into how your DB query engine works if needed. Sometimes a small tuning of the DB settings does the trick!
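The question's service is Node.js, but as a language-neutral illustration of that first step, here is a minimal C# sketch (the connection string and query are placeholders):

using System;
using System.Data.SqlClient;
using System.Diagnostics;

class QueryTimer
{
    static void Main()
    {
        const string connStr = "Server=...;Database=...;Integrated Security=true"; // placeholder
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();
            var sw = Stopwatch.StartNew();
            using (var cmd = new SqlCommand("SELECT ... FROM Purchases WHERE ...", conn)) // placeholder query
            using (var reader = cmd.ExecuteReader())
            {
                int rows = 0;
                while (reader.Read()) rows++;   // drain the result set so the timing is end-to-end
                sw.Stop();
                Console.WriteLine(rows + " rows in " + sw.ElapsedMilliseconds + " ms");
            }
        }
    }
}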
I want to approximate the Worst Case Execution Time (WCET) for a set of tasks on linux. Most professional tools are either expensive (1000s $), or don't support my processor architecture.
Since I don't need a tight bound, my line of thought is that I:
disable frequency scaling
disable unnecessary background services and tasks
set the program affinity to run on a specified core
run the program 50,000 times with various inputs
profile it and store the total number of cycles it took to execute
Given the largest clock-cycle count and knowing the core frequency, I can get an estimate.
Is this a sound practical approach?
Secondly, to account for interference from other tasks, I will run the whole task set ( 40 tasks ) in parallel, with each task randomly assigned a core, and do the same thing 50,000 times.
Once I get the estimate, a 10% safety margin will be added to account for unforeseeable interference and untested paths. This 10% margin has been suggested in the paper "Approximation of Worst Case Execution Time in Preemptive Multitasking Systems" by Corti, Brega and Gross.
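A minimal C# sketch of that measurement harness, under stated assumptions: it measures wall-clock time via Stopwatch rather than raw cycle counts (cycles then follow from the core frequency), and TaskUnderTest is a placeholder for the real task with varied inputs.

using System;
using System.Diagnostics;

class WcetHarness
{
    static void Main()
    {
        // Pin the process to core 0 so the runs are comparable.
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)1;

        long worstTicks = 0;
        for (int run = 0; run < 50000; run++)
        {
            var sw = Stopwatch.StartNew();
            TaskUnderTest(run);                 // placeholder: the task under test, varied input
            sw.Stop();
            if (sw.ElapsedTicks > worstTicks) worstTicks = sw.ElapsedTicks;
        }

        double worstMs = worstTicks * 1000.0 / Stopwatch.Frequency;
        Console.WriteLine("observed worst case: " + worstMs + " ms; with 10% margin: " + (worstMs * 1.1) + " ms");
    }

    static void TaskUnderTest(int seed) { /* ... */ }
}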
Some comments:
1) Even attempting to compute worst case bounds in this way means making assumptions that there aren't uncommon inputs that cause tasks to take much more or even much less time. An extreme example would be a bug that causes one of the tasks to go into an infinite loop, or that causes the whole thing to deadlock. You need something like a code review to establish that the time taken will always be pretty much the same, regardless of input.
2) It is possible that the input data does influence the time taken to some extent. Even if this isn't apparent to you, it could happen because of the details of the implementation of some library function that you call. So you need to run your tests on a representative selection of real life data.
3) When you have got your 50K test results, I would draw some sort of probability plot - see e.g. http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm and links off it. I would be looking for isolated points that show that in a few cases some runs were suspiciously slow or suspiciously fast, because the code review from (1) said there shouldn't be runs like this. I would also want to check that adding 10% to the maximum seen takes me a good distance away from the points I have plotted. You could also plot time taken against different parameters from the input data to check that there wasn't any pattern there.
4) If you want to try a very sophisticated approach, you could try fitting a statistical distribution to the values you have found - see e.g. https://en.wikipedia.org/wiki/Generalized_Pareto_distribution. But plotting the data and looking at it is probably the most important thing to do.
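As a rough complement to those plots, a hedged C# sketch (assuming the 50K per-run times have already been collected into an array of milliseconds): it reports the extremes and flags the isolated slow runs mentioned above.

using System;
using System.Linq;

static class RunStats
{
    public static void Summarize(double[] timesMs)
    {
        var sorted = timesMs.OrderBy(t => t).ToArray();
        double max = sorted[sorted.Length - 1];
        double p999 = sorted[(int)(0.999 * (sorted.Length - 1))]; // 99.9th percentile
        Console.WriteLine("min=" + sorted[0] + "  p99.9=" + p999 + "  max=" + max + "  max+10%=" + max * 1.1);
        // Isolated points far above the 99.9th percentile are the suspiciously slow runs to investigate.
        if (max > 1.5 * p999)
            Console.WriteLine("warning: the maximum looks like an outlier; review those runs before trusting the margin");
    }
}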
I'm writing an app using sms as communication.
I have chosen to subscribe to an sms-gateway, which provides me with an API for doing so.
The API has functions for sending as well as pulling new messages. It does however not have any kind of push functionality.
In order to do my queries most efficiently, I'm seeking data on how long people wait before they answer a text message - as a probability function.
Extra info:
The application is as interactive as can be, so I suppose the times will be pretty similar to real-life human-human communication.
I don't believe differences in personal style will have a big impact on the right times and frequencies to query, so average data should be fine.
Update
I'm impressed and honored by the many great answers received. I have concluded that my best shot will be a few adaptable heuristics, including exponential (or maybe polynomial) backoff.
All along I will be gathering statistics for later analysis. Maybe something will show up. I think I will get a head start on the algorithm for generating poll frequencies from a probability distribution. That'll be fun.
Thanks again, many times.
In the absence of any real data, the best solution may be to write the code so that the application adjusts the wait time based on current history of response times.
The basic idea is as follows:
Step 1: Set an initial frequency of pulling once every x seconds.
Step 2: Pull messages at the above frequency for a duration y.
Step 3: If you discover that messages are always waiting for you to pull, decrease x; otherwise increase x.
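A minimal sketch of that loop (the halving/doubling factors and the bounds are arbitrary choices, not from this answer; pullMessages is assumed to return how many messages the pull retrieved):

using System;
using System.Threading.Tasks;

public class AdaptivePoller
{
    private TimeSpan _interval = TimeSpan.FromSeconds(10);           // initial x
    private static readonly TimeSpan Min = TimeSpan.FromSeconds(2);
    private static readonly TimeSpan Max = TimeSpan.FromMinutes(5);

    public async Task RunAsync(Func<Task<int>> pullMessages)
    {
        while (true)
        {
            int received = await pullMessages();
            // Messages were already waiting: poll faster. Nothing there: back off.
            _interval = received > 0
                ? TimeSpan.FromTicks(Math.Max(Min.Ticks, _interval.Ticks / 2))
                : TimeSpan.FromTicks(Math.Min(Max.Ticks, _interval.Ticks * 2));
            await Task.Delay(_interval);
        }
    }
}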
Several design considerations:
Adjust forever, or stop after some time
You can repeat steps 2 and 3 forever, in which case the application dynamically adjusts itself to SMS patterns. Alternatively, you can stop after some time to reduce application overhead.
Adjustment criteria: Per customer or across all customers
You can choose to do the adjustment in step 3 on a per-customer basis or across all customers.
I believe GMail's SMTP service works along the same lines.
Well, I would suggest finding some statistics on daily SMS/text-messaging usage by geographical location and age group, and coming up with a daily average; it won't be an exact measurement for everyone, though.
Good question.
Consider that people might have multiple tasks and that answering a text message might be one of those tasks. If each of those tasks takes an amount of time that is exponentially distributed, the time to get around to answering the text message is the sum of those task-completion times. The sum of n iid exponential random variables has a Gamma distribution.
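Concretely (a standard result, stated here for completeness): if each task time is $X_i \sim \mathrm{Exp}(\lambda)$, independent and identically distributed, then

$$\sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \lambda),$$

the Erlang distribution, with density $f(t) = \lambda^n t^{n-1} e^{-\lambda t} / (n-1)!$ for $t \ge 0$.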
The number of tasks ahead of the reply to the text also has a discrete distribution - let's say it's Poisson. I don't have the time to derive the resulting distribution, but simulating it using @Risk, I get either a Weibull or a Gamma distribution.
SMS is a store-and-forward messaging service, so you have to add in the delay that can be introduced by the various SMSCs (Short Message Service Centers) along the way. If you are connecting to one of the big aggregation houses (Sybase, TNS, mBlox, etc.) or commercial bulk-SMS providers (Clickatell, etc.), then you need to allow for the message to traverse their network as well as the carrier's network. If you are using a smaller shop, then most likely they are using a GSM modem (or modems), and there is a throughput limit on the messages they can receive and process (as well as push out).
All that said, if you are using a direct connection or one of the big guys, MO (mobile-originated) messages coming to you as a CP (content provider) should take less than 5 seconds. Add to that the time it takes the mobile subscriber to reply.
From anecdotal evidence from services I've worked on before, where the mobile subscriber needs to provide a simple reply, it usually comes within 10 seconds or not at all.
If you are polling for specific replies I would poll at 5 and 10 seconds then apply an exponential back off.
All of this is from a North American point-of-view. Europe will be fairly close, but places like Africa, Asia will be a bit slower as the networks are a bit slower. (unless you are connected directly to the operator and even then some of them are slow).
I'm looking for a design pattern that would fit my application design.
My application processes large amounts of data and produces some graphs.
Data processing (fetching from files, CPU-intensive calculations) and graph operations (drawing, updating) are done in separate threads.
The graph can be scrolled - in that case new portions of data need to be processed.
Because there can be several series on a graph, multiple threads can be spawned (two threads per series: one for the dataset update and one for the graph update).
I don't want to create multiple progress bars. Instead, I'd like to have a single progress bar that reports the global progress. At the moment I can think of MVC and Observer/Observable, but it's a little bit blurry :) Maybe somebody could point me in the right direction, thanks.
I once spent the best part of a week trying to make a smooth, non-hiccupy progress bar over a very complex algorithm.
The algorithm had 6 different steps. Each step had timing characteristics that were seriously dependent on A) the underlying data being processed, not just the "amount" of data but also the "type" of data and B) 2 of the steps scaled extremely well with increasing number of cpus, 2 steps ran in 2 threads and 2 steps were effectively single-threaded.
The mix of data effectively had a much larger impact on execution time of each step than number of cores.
The solution that finally cracked it was really quite simple. I made 6 functions that analyzed the data set and tried to predict the actual run-time of each analysis step. The heuristic in each function analyzed both the data sets under analysis and the number of cpus. Based on run-time data from my own 4 core machine, each function basically returned the number of milliseconds it was expected to take, on my machine.
f1(..) + f2(..) + f3(..) + f4(..) + f5(..) + f6(..) = total runtime in milliseconds
Now, given this information, you effectively know what percentage of the total execution time each step is supposed to take. If step 1 is supposed to take 40% of the execution time, you basically need to find a way to emit 40 one-percent events from that algorithm. Say the for loop is processing 100,000 items; you could probably do:
// Emit one 1% event per stride: for a step worth 40% of the total,
// percentageOfTotalForThisStep == 40, so 40 events spread over numItems items.
int stride = Math.Max(1, numItems / percentageOfTotalForThisStep);
for (int i = 0; i < numItems; i++)
{
    if (i % stride == 0) emitProgressEvent();
    // .. do the actual processing ..
}
This algorithm gave us a silky smooth progress bar that performed flawlessly. Your implementation technology can have different forms of scaling and features available in the progress bar, but the basic way of thinking about the problem is the same.
And yes, it did not really matter that the heuristic reference numbers were worked out on my machine - the only real problem is if you want to change the numbers when running on a different machine. But you still know the ratios (which are the only really important thing here), so you can see how your local hardware runs differently from the one I had.
Now the average SO reader may wonder why on earth someone would spend a week making a smooth progress bar. The feature was requested by the head salesman, and I believe he used it in sales meetings to get contracts. Money talks ;)
In situations with threads or asynchronous processes/tasks like this, I find it helpful to have an abstract type or object in the main thread that represents (and ideally encapsulates) each process. So, for each worker thread, there will presumably be an object (let's call it Operation) in the main thread to manage that worker, and obviously there will be some kind of list-like data structure to hold these Operations.
Where applicable, each Operation provides the start/stop methods for its worker, and in some cases - such as yours - numeric properties representing the progress and expected total time or work of that particular Operation's task. The units don't necessarily need to be time-based, if you know you'll be performing 6,230 calculations, you can just think of these properties as calculation counts. Furthermore, each task will need to have some way of updating its owning Operation of its current progress in whatever mechanism is appropriate (callbacks, closures, event dispatching, or whatever mechanism your programming language/threading framework provides).
So while your actual work is being performed off in separate threads, a corresponding Operation object in the "main" thread is continually being updated/notified of its worker's progress. The progress bar can update itself accordingly, mapping the total of the Operations' "expected" times to its total, and the total of the Operations' "progress" times to its current progress, in whatever way makes sense for your progress bar framework.
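A hedged sketch of that mapping (the Operation shape here is illustrative, not prescribed by the answer):

using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Operation
{
    public double ExpectedWork { get; set; }   // e.g. predicted milliseconds or calculation count
    public double CompletedWork { get; set; }  // updated as the worker reports progress
}

public static class ProgressAggregator
{
    // Maps the sum of per-Operation progress onto a single 0..1 fraction for the bar.
    public static double OverallFraction(IReadOnlyCollection<Operation> operations)
    {
        double expected = operations.Sum(o => o.ExpectedWork);
        double done = operations.Sum(o => o.CompletedWork);
        return expected > 0 ? Math.Min(1.0, done / expected) : 0.0;
    }
}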
Obviously there's a ton of other considerations/work that needs be done in actually implementing this, but I hope this gives you the gist of it.
Multiple progress bars aren't such a bad idea, mind you. Or maybe a complex progress bar that shows several threads running (like download manager programs sometimes have). As long as the UI is intuitive, your users will appreciate the extra data.
When I try to answer such design questions I first try to look at similar or analogous problems in other application, and how they're solved. So I would suggest you do some research by considering other applications that display complex progress (like the download manager example) and try to adapt an existing solution to your application.
Sorry I can't offer more specific design, this is just general advice. :)
Stick with Observer/Observable for this kind of thing. Some object observes the various series processing threads and reports status by updating the summary bar.