I have a situation where I will be collecting many time-series metrics (electricity used, hours used, hours idle) from operating equipment in a manufacturing plant. I need to create many different rollup numbers on individual and grouped assets. For example, I need to compute the min, max, and average electricity used over 1, 5, 10, and 30 days for a given machine, and the same types of metrics for different groups of machines. Many of the calculated values are derived from the raw values retrieved from the assets.
What is the best approach for calculating these values within a Cassandra environment?
Do I need to create 'batch jobs' that execute the calculations?
It seems as if there are some built-in data types (counters) in Cassandra, but they appear to have some issues (judging from comments on Stack Overflow).
Has anyone integrated Cassandra with Twitter Storm or something similar to constantly update the counters?
Thanks
There's an open-source project called Blueflood that does exactly this. You could likely use it directly out of the box to cover your use case, or fork the repo and modify it as necessary.
Documentation and homepage: http://blueflood.io/
Source-code: https://github.com/rackerlabs/blueflood
IRC: #blueflood on Freenode
(Disclaimer: I am a contributor to the project)
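If you do end up rolling your own rollups instead of (or alongside) Blueflood, a common pattern is to keep the raw readings in one table and have a scheduled batch job write the pre-computed min/max/avg values into a rollup table keyed by machine and window size. Below is a minimal sketch using the gocql driver; the keyspace, table, and column names (metrics.raw_usage, metrics.rollups, machine_id, kwh, ts) are hypothetical placeholders, not an existing schema.

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// rollup recomputes min/max/avg electricity use for one machine over the
// last `days` days and writes the result into a rollup table.
// Assumes machine_id is the partition key and ts a clustering column;
// all table and column names are illustrative only.
func rollup(session *gocql.Session, machineID string, days int) error {
	since := time.Now().AddDate(0, 0, -days)

	iter := session.Query(
		`SELECT kwh FROM metrics.raw_usage WHERE machine_id = ? AND ts >= ?`,
		machineID, since).Iter()

	var kwh, sum, min, max float64
	count := 0
	for iter.Scan(&kwh) {
		if count == 0 || kwh < min {
			min = kwh
		}
		if count == 0 || kwh > max {
			max = kwh
		}
		sum += kwh
		count++
	}
	if err := iter.Close(); err != nil {
		return err
	}
	if count == 0 {
		return nil // nothing to roll up for this window
	}

	return session.Query(
		`INSERT INTO metrics.rollups (machine_id, window_days, min_kwh, max_kwh, avg_kwh, computed_at)
		 VALUES (?, ?, ?, ?, ?, ?)`,
		machineID, days, min, max, sum/float64(count), time.Now()).Exec()
}

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "metrics"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Recompute each rollup window for one machine.
	for _, days := range []int{1, 5, 10, 30} {
		if err := rollup(session, "machine-42", days); err != nil {
			log.Printf("rollup over %d days failed: %v", days, err)
		}
	}
}
```

Whether this runs as a cron-style batch job or is fed continuously from a stream processor such as Storm is mostly a question of how fresh the rollups need to be. Note that Cassandra's counter type only supports increments/decrements, so it can keep running totals but cannot give you min/max/avg on its own.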
I want to run a live event with 100k viewers or more for 3 hours... How can I calculate the cost?
I have read the documentation, but the Azure docs are quite ambiguous regarding pricing.
There is really no shortcut where you can just enter an audience size and get a result - different factors will impact your approach, including bitrates, support level, number of events, redundancy, the geographic reach required, etc.
A typical approach is to build a model, perhaps in Excel, and test different combinations and scenarios in the Azure online cost calculator.
You can then experiment with changing details like regions, support level, etc., to see the impact.
I have a bunch of profiles of my application (200+ profiles per week), and I want to analyze them to extract some performance information.
I don't know what information can be gleaned from these files - maybe the slowest functions, or the hot-spot code across different release versions?
I have no relevant experience in this field; has anyone done this?
Well, it depends on the kind of profiles you have; you can capture a number of different kinds of profiles.
In general most people are interested in the CPU profile and the heap profile, since they let you see which functions/lines consume the most CPU and allocate the most memory.
There are a number of ways you can visualize the data and/or drill into it. You should take a look at the help output of go tool pprof and the pprof blog article to get going.
As for the number of profiles: in general, people analyze them one by one. If you want an average over multiple samples, you can merge them with pprof-merge.
You can subtract one profile from another to see relative changes by using the -diff_base and -base flags.
Yet another approach is to extract percentages per function from each profile and see whether they change over time. You can do this by parsing the output of go tool pprof -top {profile}. You can correlate this data with known events, like software updates or periods of high demand, and see whether one function starts taking up more resources than the others (it might not scale very well).
If you ever get to the point where you want to develop custom tools for analysis, you can use the pprof library to parse the profiles for you.
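As a rough illustration of that last point, here is a minimal sketch that uses github.com/google/pprof/profile to parse one profile and total the sampled values per leaf function. The file name cpu.pprof is an assumption, and using the last sample value as the "cost" is a simplification (for CPU profiles it is typically cpu/nanoseconds); adapt both to your own profiles.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("cpu.pprof") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		log.Fatal(err)
	}

	// Sum the last sample value (for a CPU profile this is typically
	// cpu/nanoseconds) per leaf function.
	totals := map[string]int64{}
	for _, s := range p.Sample {
		if len(s.Location) == 0 || len(s.Location[0].Line) == 0 {
			continue // skip samples without symbol information
		}
		fn := s.Location[0].Line[0].Function.Name
		totals[fn] += s.Value[len(s.Value)-1]
	}

	for fn, v := range totals {
		fmt.Printf("%-60s %d\n", fn, v)
	}
}
```

Running something like this over your 200+ weekly profiles and dumping one row per (profile, function) into a CSV gives you a time series you can chart and correlate with releases, much as described above.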
What is the best statistical test for an effect across groups when the data are nested but may not be normally distributed? I get a highly significant effect using the Kruskal-Wallis test, but it takes no account of the fact that the data points come from several locations, each contributing data for several years, and that within every year the data were pooled into age groups.
I think you can categorize the data by year and restructure it so that it is no longer nested, which makes it easier to process. I agree that the Kruskal-Wallis test is a good choice for testing the cross-group effect.
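For reference, and to make the concern about nesting concrete: the Kruskal-Wallis statistic is computed purely from the pooled ranks of all N observations,

H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1),

where R_i is the rank sum and n_i the size of group i. Nothing in the statistic sees which location or year an observation came from, so any accounting for the nesting has to happen in how the data are structured before the test, as suggested above.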
Background -
I am not a programmer.
I do trade spot forex on an intraday basis.
I am willing to learn programming
Specific Query -
I would like to know how to export, in real time, the 'top of book' price and volume data into Excel, as displayed on the LMAX level 2 widget/frame at -
https://s3-eu-west-1.amazonaws.com/lmax-widget/website-widget-quote-prof-flex.html?a=rTWcS34L5WRQkHtC
In essence I am looking to export:
1) price and volume data where the coloured flashes occur;
2) price and volume data for when the coloured flashes do not occur.
I understand that 1) and 2) will encompass all the top-of-book prices and volumes. However, I would like to keep 1) and 2) separate/distinguished as far as data collection is concerned.
Time period for which the collected data is intended to be stored: 2-3 hours.
What kind of languages do I need to know to do the above?
I understand that I need to be an advanced Excel user too.
Long term goals -
I intend to use the above information to make discretionary intraday trading decisions.
In the long run I will get more involved with creating an algo or indicator to help with the decision making process, which would include the information above.
I have understood that one needs to know how to code to get involved in activities such as the above. Hence I have started learning C++, mostly to get a feel for coding.
I have been searching all over the web as to where to start in this endeavor. However I am quite confused and overwhelmed with all the information.
Hence apart from the specific data export query, any additional guidelines would also be helpful.
As of now I use MT4 to trade. Hence I believe that, to do the above, I will need more than just MT4.
Any help would be highly appreciated.
Yes, MetaTrader 4 is still not able (in spite of all the marketing and PR efforts around white-labelled terminals' OrderBook add-ons) to provide OrderBook-L2/DoM data to your MQL4 / New-MQL4 algorithm for any decision making. Third-party software integration is needed to make MQL4 code aware of the real-time L2/DoM data.
The LMAX widget has an impressive look & feel; however, for your Excel export it would take a lot of programming effort to re-use it in an automated scanner that produces the data for 1) and 2), and there may be further, non-technical obstacles in the form of legal / operational restrictions on running an automated scanner against such a data source. To give an example, one data-publisher's policy restricts automated options-pricing scanners for options on { FTSE | CAC | AMS | DAX } to re-visiting the online published data no more than once per quarter of an hour, on pain of being blocked / black-listed. So care and proper data-source engineering are called for.
The size of the data collection is another issue. Excel has limits on the number of rows/columns that can be imported, and the larger the data files, the more likely CSV imports are to hit those limits. L2/DoM data collected for 2-3 hours for just one single FX major may go beyond such a limit, as there are many records per second (tens, if not hundreds, with just a few milliseconds between them) - at, say, 100 records per second, 3 hours is already about 1.08 million rows, more than a single Excel worksheet can hold (1,048,576 rows). Static files of collected data records of that size typically take several minutes just to be written to disk, so a proper distributed-processing data-flow design and non-blocking file-I/O engineering are a must.
Real-time system design is the right angle from which to approach the problem, rather than treating it as just a programming-language exercise. Having mastered a programming language is a great move; nevertheless, robust real-time system design - and trading software is such a domain - requires, with all respect, a lot more insight and hands-on experience than making MQL4 code run multi-threaded and multi-processed with a few DLL services for a cloud/grid-based distributed processing system.
How much real-time traffic is expected to be there?
For just a raw idea of what the market can produce per second, per millisecond, per microsecond, consider a NYNEX traffic analysis for one instrument: a single second can have a wild relief, and a 5-msec sampling view shows the bursts in even finer detail (the accompanying traffic charts are omitted here).
How to export
Check whether the data-source owner legally permits your automated processing.
Create your own real-time DataPump software, independent of the HTML-wrapped widget.
Create your own 'DB-store' to efficiently off-load the scanned data records from the real-time DataPump (a minimal sketch follows this list).
Test the live data-source >> DataPump >> DB-store pipeline for performance & robustness, making sure it can serve an error-free 24/6 duty for several FX majors in parallel.
Integrate your DataPump-fed DB-store as a local data source for on-line/off-line interactions with your preferred { MT4 | Excel | quantitative-analytics } package.
Add monitoring for any production-environment irregularity in your real-time processing pipeline, which may range from network issues, VPN / hosting issues, and data-source availability issues to an unexpected change in the scanned data-source's format or access conditions.
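To make the DataPump / DB-store steps above a bit more concrete, here is a minimal sketch of a polling loop that appends timestamped top-of-book records to a local CSV-style store that Excel can later import. The FetchTopOfBook function, the Quote fields, and the 250 ms polling interval are all assumptions standing in for whatever real-time feed you are actually permitted to scan - this is a sketch of the data flow, not a working LMAX client.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"time"
)

// Quote is a hypothetical top-of-book record.
type Quote struct {
	BidPrice, AskPrice   float64
	BidVolume, AskVolume float64
	Flashed              bool // true when a coloured flash (fresh update) occurred
}

// FetchTopOfBook is a placeholder for your real data source; the LMAX widget
// itself is not a sanctioned API, so this must be replaced by a feed you are
// permitted to scan.
func FetchTopOfBook() (Quote, error) {
	return Quote{}, fmt.Errorf("not implemented: plug in a permitted data source")
}

func main() {
	out, err := os.OpenFile("ticks.csv", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	w := csv.NewWriter(out)
	defer w.Flush()

	ticker := time.NewTicker(250 * time.Millisecond) // polling interval is an assumption
	defer ticker.Stop()

	for range ticker.C {
		q, err := FetchTopOfBook()
		if err != nil {
			log.Printf("fetch failed: %v", err)
			continue
		}
		// One row per poll; Excel can later import or link to this file.
		record := []string{
			time.Now().UTC().Format(time.RFC3339Nano),
			fmt.Sprintf("%.5f", q.BidPrice), fmt.Sprintf("%.1f", q.BidVolume),
			fmt.Sprintf("%.5f", q.AskPrice), fmt.Sprintf("%.1f", q.AskVolume),
			fmt.Sprintf("%t", q.Flashed),
		}
		if err := w.Write(record); err != nil {
			log.Printf("write failed: %v", err)
		}
		w.Flush()
	}
}
```

At a polling rate like this, 2-3 hours of one instrument stays far below Excel's row limit; a genuinely tick-by-tick L2 feed, as noted above, needs a proper store rather than one flat CSV file.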
I'm curious to know how much people value data aggregation. I'm truly curious - if you don't mind, let me know how important this really is to you personally in your work environment, and whether you have to work directly with data aggregation in your line of work.
Really interested to hear your feedback.
If you persist data (e.g. store it in a database), chances are that the data will be used by managers, statisticians, stakeholders, etc., to analyze the workings of their software-supported undertaking and make executive decisions. This analysis can only take place by methods of aggregation. There's no one in the world who can look at a million rows of raw data and glean insight. The data has to be summed, averaged, standard-deviated, etc., to make any sense to a human being.
A few examples of areas where data aggregation is important:
Public Health (CDC, WHO)
Marketing
Advertising
Politics
Organizational Management
Space Exploration
lol. Take your pick!
Very important, what else is there to say?
I work at a large hospital, and not only do we have numerous departments using Analysis Services cubes we developed, but they rely heavily on the daily totals and the different aggregations they can derive from those cubes by simply browsing. Without the very basic capability of aggregating over some portion of your data, you might as well write it on paper (IMO).
Say you have data on every individual sale.
Looking at these individual purchases could be interesting at some level (e.g. when a customer comes in and wants a refund).
However, I cannot send those 20 million records to my boss at the end of each month and say, "Here's how much we sold this month".
This data needs to be aggregated and summarized at various levels. The business would not operate if the marketing guys couldn't get an aggregate for each product, the regional boss couldn't get a total aggregate over a time period, and so on.
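As a toy illustration of that monthly summary, here is a minimal sketch that rolls "one row per sale" up into a total and a count per product per month in memory. In practice the database or an OLAP cube does this for you (essentially a GROUP BY on product and month); the Sale fields and sample values here are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Sale is a hypothetical raw record: one row per individual sale.
type Sale struct {
	Product string
	Amount  float64
	When    time.Time
}

// key identifies one cell of the rollup: a product within a month.
type key struct {
	Product string
	Month   string // e.g. "2015-07"
}

func main() {
	sales := []Sale{
		{"widget", 9.99, time.Date(2015, 7, 1, 0, 0, 0, 0, time.UTC)},
		{"widget", 19.99, time.Date(2015, 7, 3, 0, 0, 0, 0, time.UTC)},
		{"gadget", 4.50, time.Date(2015, 8, 2, 0, 0, 0, 0, time.UTC)},
	}

	totals := map[key]float64{}
	counts := map[key]int{}
	for _, s := range sales {
		k := key{s.Product, s.When.Format("2006-01")}
		totals[k] += s.Amount
		counts[k]++
	}

	// "Here's how much we sold this month", per product.
	for k, total := range totals {
		fmt.Printf("%s %-8s  %d sales  %.2f total\n", k.Month, k.Product, counts[k], total)
	}
}
```

The point is the shape of the result: a handful of rows a boss can read, instead of 20 million raw records.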
Our databases have millions of rows, and of course we rely on aggregation for management information; not to use it would put too heavy a load on the database when running large reports, which would impact heavily on the users of the database. I can't think of many cases where a database contains business-critical information that managers use to make decisions and aggregation would not be needed for the management reports.
I view data aggregation as being like data in a grid that you can group, order, and sort by columns. In a large grid of data, this is very important. It's really the difference between looking at a pile of numbers and looking at meaningful data.