Hazelcast Jet - Pattern matching - hazelcast-jet

Does hazelcast jet support pattern matching use cases like few of the even stream processors support.
If not, will this be included in the future release.
Here is an use ase for pattern matching
Find the trader who sold 'AAA' stock followed by bought
'BBB' stock in 10 seconds time frame.

Related

How to design a real time charging system together with spark and nosql databases

I would like to design a system that
Will be reading CDR (call data records) files and inserts them
into a nosql database. To achieve this spark streaming with Cassandra as nosql looks promising as the files will keep coming
Will be able to calculate real time price by rating the duration and called number or just kilobytes in case of data and store the total so far chargable amount for the current billcycle. I need a nosql that i will be both inserting rated cdrs and updating total so far chargable amount for the current billcycle for that msisdn in that cdr.
In case rate plans are updated for a specific subscription, for the current billcycle all the cdrs using that price plan needs to be recalculated and total so far amount needs to be calculated for all the customers
Notes:
Msisdns are unique for each subscription with one to one relation.
Within a month One msisdn can have up to 100000 cdrs.
I have been going through the nosql databases so far i am thinking to
use cassandra but i am still not sure how to design the database to
optimize for this business case.
Please also consider while one cdr is being processed in one node,
the other cdr for the same msisdn can be processed in another node
at the same time and both nodes doing the above logic.
The question is indeed very broad - StackOverflow is a meant to cover more specific technical questions and not debate architectural aspects of an entire system.
Apart from this, let me attempt to address some of the aspects of your questions:
a) Using streaming for CDR processing:
Spark Streaming is indeed a tool of choice for incoming CDRs, typically delivered over a message queueing system such as Kafka. It allows windowed operations, whcih com in handy when you need to calculate call charges over a set period (hours, days etc..). You can very easily combine existing static records, such as price plans from other databases, with your incoming CDRs in windowed operations. All of that in a robust and expansive API.
b) using Cassandra as a store
Cassandra has excellent scaling capabilities with instantaneous row access - for that, it's an absolute killer. However, in the case of TelCo industry setting, I would seriously question using it for anything else than MSISDN lookups and credit checks. Cassandra is essentially a columnar KV storage, and trying to store multi dimensional, essentially relational records such as price plans, contracts and the lot will give you lots of headaches. I would suggest storing your data in different stores, depending on the use cases. These could be:
CDR raw records in HDFS -> CDRs can be plentiful, and if you need to reprocess them, collecting them from HDFS will be more efficient
Bill summaries in Cassandra -> the itemized bill summaries is the result of CDR as initially processed by Spark Streaming. These are essentially columnar and can be perfectly stored in Cassandra
MSISDN and Credit information -> as mentioned above, that is also a perfect use case for Cassandra
price plans -> these are multi dimensional, more document oriented, and should be stored in databases that support such structures. You can perfectly use Postgres with JSON for that, as you you wouldn't expect more than a handful of plans.
To conclude this, you're actually looking at a classic lambda use case with Spark Streaming for immediate processing of incoming CDRs, and batch processing with regular Spark on HDFS for post processing, for instance when you're recalculating CDR costs after plan changes.

Hazelcast or AppScale to manage parallel computational tasks over a shared dataset

Starting out on a new project and looking for advice on a suitable platform. Current thinking is between Hazelcast or AppScale, given our team’s combined (but limited) experience covers an older version of Hazelcast and GAE. Both can also apparently be setup on EC2, which may be the easiest way to meet the CPU demand we expect.
Problem Profile
1). Our data consists of many small records stored by date (but not always time). Some are small numerical records (business stats, looks like daily weather info or stock market prices) and some are bulky text (log file entries). Data volumes not huge, in the region of hundreds/day between 1k and 50k each.
2). Very very large number of instances of computationally expensive numerical models (think monte-carlo sims) operate constantly over fixed-size windows of the same data.
3). A number of monitoring agents make data available.
4). Larger (longer periods of time) sets of the same data to be processed offline once daily.
With Hazelcast we would add incoming data to maps and use the Executor service to run models over the shared data. Likely use of Tomcat to provide minimal front end access to the grid as required.
With AppScale we would add tables per data-type and use the Task Queues API to frame the numerical models. Servlets deployed to AppScale as per GAE to provide front end.
Question
Should we use AppScale or Hazelcast for requirements like this? That is - for the problem as stated, are there any stand-out factors for/against either platform that we should consider?
If you prefer/require a distributed, service-oriented programming model (bag of tasks) then the answer is AppScale. If you prefer/require a parallel programming model (single machine abstraction) then the answer is Hazelcast. AppScale is also a complete cloud platform (vs only a datastore) which enables you to do more things with your app as it evolves. If you go with AppScale, you can adjust the timing restriction on the tasks and customize the platform with the libraries you want to use, for your computationally expensive methods.

Azure Table Storage Partition Key

Two somewhat related questions.
1) Is there anyway to get an ID of the server a table entity lives on?
2) Will using a GUID give me the best partition key distribution possible? If not, what will?
we have been struggling for weeks on table storage performance. In short, it's really bad, but early on we realized that using a randomish partition key will distribute the entities across many servers, which is exactly what we want to do as we are trying to achieve 8000 reads per second. Apparently our partition key wasn't random enough, so for testing purposes, I have decided to just use a GUID. First impression is it is waaaaaay faster.
Really bad get performance is < 1000 per second. Partition key is Guid.NewGuid() and row key is the constant "UserInfo". Get is execute using TableOperation with pk and rk, nothing else as follows: TableOperation retrieveOperation = TableOperation.Retrieve(pk, rk); return cloudTable.ExecuteAsync(retrieveOperation). We always use indexed reads and never table scans. Also, VM size is medium or large, never anything smaller. Parallel no, async yes
As other users have pointed out, Azure Tables are strictly controlled by the runtime and thus you cannot control / check which specific storage nodes are handling your requests. Furthermore, any given partition is served by a single server, that is, entities belonging to the same partition cannot be split between several storage nodes (see HERE)
In Windows Azure table, the PartitionKey property is used as the partition key. All entities with same PartitionKey value are clustered together and they are served from a single server node. This allows the user to control entity locality by setting the PartitionKey values, and perform Entity Group Transactions over entities in that same partition.
You mention that you are targeting 8000 requests per second? If that is the case, you might be hitting a threshold that requires very good table/partitionkey design. Please see the article "Windows Azure Storage Abstractions and their Scalability Targets"
The following extract is applicable to your situation:
This will provide the following scalability targets for a single storage account created after June 7th 2012.
Capacity – Up to 200 TBs
Transactions – Up to 20,000 entities/messages/blobs per second
As other users pointed out, if your PartitionKey numbering follows an incremental pattern, the Azure runtime will recognize this and group some of your partitions within the same storage node.
Furthermore, if I understood your question correctly, you are currently assigning partition keys via GUID's? If that is the case, this means that every PartitionKey in your table will be unique, thus every partitionkey will have no more than 1 entity. As per the articles above, the way Azure table scales out is by grouping entities in their partition keys inside independent storage nodes. If your partitionkeys are unique and thus contain no more than one entity, this means that Azure table will scale out only one entity at a time! Now, we know Azure is not that dumb, and it groups partitionkeys when it detects a pattern in the way they are created. So if you are hitting this trigger in Azure and Azure is grouping your partitionkeys, it means your scalability capabilities are limited to the smartness of this grouping algorithm.
As per the the scalability targets above for 2012, each partitionkey should be able to give you 2,000 transactions per second. Theoretically, you should need no more than 4 partition keys in this case (assuming that the workload between the four is distributed equally).
I would suggest you to design your partition keys to group entities in such a way that no more than 2,000 entities per second per partition are reached, and drop using GUID's as partitionkeys. This will allow you to better support features such as Entity Transaction Group, reduce the complexity of your table design, and get the performance you are looking for.
Answering #1: There is no concept of a server that a particular table entity lives on. There are no particular servers to choose from, as Table Storage is a massive-scale multi-tenant storage system. So... there's no way to retrieve a server ID for a given table entity.
Answering #2: Choose a partition key that makes sense to your application. just remember that it's partition+row to access a given entity. If you do that, you'll have a fast, direct read. If you attempt to do a table- or partition-scan, your performance will certainly take a hit.
See
http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx for more info on key selection (Note the numbers are 3 years old, but the guidance is still good).
Also this talk can be of some use as far as best practice : http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/WAD-B406#fbid=lCN9J5QiTDF.
In general a given partition can support up to 2000 tps, so spreading data across partitions will help achieve greater numbers. Something to consider is that atomic batch transactions only apply to entities that share the same partitionkey. Additionally, for smaller requests you may consider disabling Nagle as small requests may be getting held up at the client layer.
From the client end, I would recommend using the latest client lib (2.1) and Async methods as you have literally thousands of requests per second. (the talk has a few slides on client best practices)
Lastly, the next release of storage will support JSON and JSON no metadata which will dramatically reduce the size of the response body for the same objects, and subsequently the cpu cycles needed to parse them. If you use the latest client libs your application will be able to leverage these behaviors with little to no code change.

Design of Partitioning for Azure Table Storage

I have some software which collects data over a large period of time, approx 200 readings per second. It uses an SQL database for this. I am looking to use Azure to move a lot of my old "archived" data to.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding as to how this will scale however; so was hoping somebody might be able to clear this up:
For performance Azure will/might split my table by partitionkey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
Few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or I need to retrieve the data for all metrics for a particular date/time range? If this is the case then you're looking at full table scan. Obviously you could avoid this by doing multiple queries (one query / PartitionKey)
Do I need to see the most latest results first or I don't really care. If it's former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also since PartitionKey is a string value, you may want to convert int value to a string value with some "0" prepadding so that all your ids appear in order otherwise you'll get 1, 10, 11, .., 19, 2, ...etc.
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only and not the RowKey. Within a Partition, RowKey serves as unique key. Windows Azure will try and keep data with the same PartitionKey in the same node but since each node is a physical device (and thus has size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition– a table partition are all of the entities in a
table with the same partition key value, and usually tables have many
partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the
20,000 entities/second, which is the overall account target described
above.
Now you mentioned that you've 10 - 20 different metric points and for for each metric point you'll write a maximum of 1 record per minute that means you would be writing a maximum of 20 entities / minute / table which is well under the scalability target of 2000 entities / second.
Now the question remains of reading. Assuming a user would read a maximum of 24 hours worth of data (i.e. 24 * 60 = 1440 points) per partition. Now assuming that the user gets the data for all 20 metrics for 1 day, then each user (thus each table) will fetch a maximum 28,800 data points. The question that is left for you I guess is how many requests like this you can get per second to meet that threshold. If you could somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.

Can I detect conflicts when writing to Cassandra?

Is there some timestamp/counter that can be used to validate that in a read-modify-write cycle, the data in the row did not change between reading and modifying?
In other words, can I read some kind of ID while reading the row, and when I write it back tell Cassandra what that ID was, and the write then fails if the ID changed since then? (Which amounts to saying that some other write took place after I read the data)
Each column in cassandra is a Tuple (or a triplet) that contains a name, value and a timestamp. The timestamp of the column represents the last time it was modified. If you have 100's of nodes, whichever node has an update with a the most recent timestamp will win. This is how Eventual Consistency is achieved.
zznate has a good presentation: Introduction to Apache Cassandra for Java Developers where this topic is referenced (slide 37)
Accessing timestamp of a Cassandra column
In summary, you don't need "some kind of ID" when you have the ability to retrieve the timestamp for a given column representing the last time it was modified. However, at scale, with 100's of nodes, how can you be sure that the node you are connecting to, has the most up to date column? (refer back to the zznate presentation)
Point is, you can't, without enabling transactions:
Cassandra - transaction support
Cassandra Transaction with ZooKeeper - Does this work?
how to integrate cassandra with zookeeper to support transactions
And many more: cassandra & transactions

Resources