Hazelcast or AppScale to manage parallel computational tasks over a shared dataset

We are starting out on a new project and looking for advice on a suitable platform. Current thinking is between Hazelcast and AppScale, given our team’s combined (but limited) experience covers an older version of Hazelcast and GAE. Both can also apparently be set up on EC2, which may be the easiest way to meet the CPU demand we expect.
Problem Profile
1). Our data consists of many small records stored by date (but not always time). Some are small numerical records (business stats, similar to daily weather info or stock market prices) and some are bulky text (log file entries). Data volumes are not huge: in the region of hundreds of records per day, between 1k and 50k each.
2). A very large number of instances of computationally expensive numerical models (think Monte Carlo simulations) operate constantly over fixed-size windows of the same data.
3). A number of monitoring agents make data available.
4). Larger (longer periods of time) sets of the same data to be processed offline once daily.
With Hazelcast we would add incoming data to maps and use the Executor service to run models over the shared data. Likely use of Tomcat to provide minimal front end access to the grid as required.
With AppScale we would add tables per data-type and use the Task Queues API to frame the numerical models. Servlets deployed to AppScale as per GAE to provide front end.
Question
Should we use AppScale or Hazelcast for requirements like this? That is - for the problem as stated, are there any stand-out factors for/against either platform that we should consider?

If you prefer/require a distributed, service-oriented programming model (bag of tasks) then the answer is AppScale. If you prefer/require a parallel programming model (single machine abstraction) then the answer is Hazelcast. AppScale is also a complete cloud platform (vs only a datastore) which enables you to do more things with your app as it evolves. If you go with AppScale, you can adjust the timing restriction on the tasks and customize the platform with the libraries you want to use, for your computationally expensive methods.
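As a rough illustration of the bag-of-tasks model mentioned above, the following is a minimal sketch in plain Python, not Hazelcast or AppScale code. The dataset, the window size, and the toy bootstrap model in `run_model` are all hypothetical stand-ins for the real records and Monte Carlo models; the point is only that each task is independent and reads the same fixed-size window of shared data.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shared dataset: one numeric record per day (values are made up).
shared_data = [100.0 + day * 0.5 for day in range(60)]

WINDOW = 30  # fixed-size window, in records (an assumption for this sketch)


def run_model(task_id, window, trials=500):
    """Toy Monte Carlo model: resample the window with replacement and
    average the resampled means."""
    rng = random.Random(task_id)  # per-task seed keeps each task reproducible
    total = 0.0
    for _ in range(trials):
        sample = [rng.choice(window) for _ in window]
        total += sum(sample) / len(sample)
    return task_id, total / trials


def run_bag_of_tasks(data, n_tasks=8, workers=4):
    """Submit independent model runs over the most recent window of data."""
    window = data[-WINDOW:]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_model, i, window) for i in range(n_tasks)]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    for task_id, estimate in sorted(run_bag_of_tasks(shared_data).items()):
        print(task_id, round(estimate, 2))
```

Threads are used here only to keep the sketch short; in CPython they do not give CPU parallelism, so a process pool or a real task queue would take their place for the actual workload.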

Related

Fast and Scalable Real-Time Application (Is Hazelcast Jet a good way?)

Currently, our architecture uses Hazelcast IMDG to share information about user activity among several server nodes.
Our map has the following structure: [key:String|value: CustomObject].
Now we want to expand our product's functionality and develop a real-time dashboard that processes a real-time data stream, performing:
Complex Aggregation
Continuous Query
etc.
At the end of the process we want to “send” the result to a Vert.x Eventbus and then to a socket layer (SockJS), in order to show the data in a dashboard.
We need to set up a scalable and fast system that can handle a large amount of data, such as thousands of events per second.
The first image represents our current (old) architecture; the second image represents our “target” architecture.
Old Architecture
Target Architecture
What do you think about target architecture?
Is the role of Hazelcast Jet correct, or is there another way to perform these operations (for example, with Hazelcast IMDG only)?
Thanks in advance.
Looks like a good fit for Hazelcast Jet. You will probably use Sources.mapJournal() to process entries as they are added to the IMap. You can aggregate into sliding windows easily. Writing a Vert.x Event Bus sink should be straightforward with SinkBuilder. Thousands of events per second is a low figure, though it depends on how much work you do for each event.
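To make the idea of sliding-window aggregation concrete, here is a minimal sketch in plain Python. It is not Hazelcast Jet code and does not use its API; the class name `SlidingWindowAggregator` and the event shape `(timestamp, value)` are assumptions made for illustration. It keeps a running count and average over the most recent `size` time units and assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowAggregator:
    """Keeps a count and average of events inside the last `size` time units."""

    def __init__(self, size):
        self.size = size
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Add one event, then evict events that fell out of the window.

        Assumes timestamps arrive in non-decreasing order.
        """
        self.events.append((timestamp, value))
        self.total += value
        while self.events and self.events[0][0] <= timestamp - self.size:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def snapshot(self):
        """Return (count, average) for the current window."""
        count = len(self.events)
        average = self.total / count if count else 0.0
        return count, average


if __name__ == "__main__":
    window = SlidingWindowAggregator(size=10)
    for ts, value in [(0, 2.0), (5, 4.0), (12, 6.0)]:
        window.add(ts, value)
    print(window.snapshot())  # the event at ts=0 has been evicted
```

A stream processor does the same bookkeeping per key and in parallel across nodes; each snapshot is the kind of result that would then be pushed to the event bus.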

What timeseries database to select for large number of records?

I have a scenario where I need to store about 100,000 input records per second. The records are time series data.
I need to run aggregations, other analytics, and some machine learning algorithms over the data continuously. Performance is the key factor here, as I am looking for near real-time results.
What would you recommend as database engine?
Take a look at ClickHouse analytical database. It can accept millions of rows per second. It can scan billions of rows per second on a single computer. It scales horizontally to multiple nodes. It fits time series workloads.
If you still need time series database, then try VictoriaMetrics. It is built on ClickHouse ideas, so it is fast and resource-efficient.
I am adding my own solution...
ClickHouse is definitely a very strong option. However, for a new project I am now evaluating OmniSci, an open source GPU database. Its open source version is limited to a single GPU node (up to 16 GPU devices; with OEM Tesla cards having 64GB per device you can get 1TB of VRAM, though of course not as cheaply as with ClickHouse). It is simply a SQL database on steroids (a JDBC driver exists) with a Kafka data source.
OmniSci also has a cross-dashboarding solution, which requires a license, but with it you can have real-time dashboarding over, say, 20-50 billion time series records (8-16 GPUs) and multi-dashboard real-time analytics without any kind of pre-aggregation.
But it will cost money...
If you want to go purely open source, my second candidate is NVIDIA's RAPIDS framework, which implements cuDF (CUDA DataFrame, similar to a Spark data structure). You could use it to keep your data window (append new records, delete obsolete ones), together with cuxfilter, which is similar to OmniSci but more of a framework; with a skilled frontend developer you can achieve something very similar to, or the same as, OmniSci.
Of course, you can implement your own solution on top of Cassandra with an appropriate data model for your use case. This may get you the best results, tailored to your needs.
You could look at KairosDB (https://kairosdb.github.io/), which is a time series database on top of Apache Cassandra; I got 50k writes per second on a medium-sized single (but bare metal) node.
It's quite well documented (https://kairosdb.github.io/docs/build/html/CassandraSchema.html) and it has aggregators out of the box (https://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html).
OpenTSDB was slower in my tests. Influx looks promising, but I have no experience with it myself: https://github.com/influxdata/influxdb

Is it better to create many small Spark clusters or a smaller number of very large clusters

I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as csv and json files. The directory structure of my Google bucket will look something like this:
root_dir
    web_logs
        \input (subdirectory)
        \output (subdirectory)
    network_logs (same subdirectories as web_logs)
    system_logs (same subdirectories as web_logs)
The directory structure under the \input directories is arbitrary. Spark jobs pick up all of their data from the \input directory and place it in the \output directory. There is an arbitrary number of *_logs directories.
My current plan is to split the entire wrangling task into about 2000 jobs and use the Cloud Dataproc API to spin up a cluster for each job, run it, and shut the cluster down. Another option would be to create a smaller number of very large clusters and send the jobs to those instead.
The first approach is being considered because each individual job is taking about an hour to complete. Simply waiting for one job to finish before starting the other will take too much time.
My questions are: 1) besides the cluster startup costs, are there any downsides to taking the first approach? and 2) is there a better alternative?
Thanks so much in advance!
Besides startup overhead, the main other consideration when using single-use clusters per job is that some jobs might be more prone to "stragglers" where data skew leads to a small number of tasks taking much longer than other tasks, so that the cluster isn't efficiently utilized near the end of the job. In some cases this can be mitigated by explicitly downscaling, combined with the help of graceful decommissioning, but if a job is shaped such that many "map" partitions produce shuffle output across all the nodes but there are "reduce" stragglers, then you can't safely downscale nodes that are still responsible for serving shuffle data.
That said, in many cases, simply tuning the size/number of partitions to occur in several "waves" (i.e. if you have 100 cores working, carving the work into something like 1000 to 10,000 partitions) helps mitigate the straggler problem even in the presence of data skew, and the downside is on par with startup overhead.
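The effect of running the work in several "waves" can be sketched with a small simulation. This is a toy model in plain Python, not Spark code: the record costs are made up, the skew is assumed to come from a contiguous heavy region of the input (a single hot key would not be split this way), and scheduling is simplified to "each free core takes the next partition".

```python
import heapq


def makespan(task_costs, cores):
    """Finish time when each free core picks up the next task in order."""
    finish_times = [0.0] * cores
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


def partition_costs(record_costs, n_partitions):
    """Split records into contiguous, equally sized partitions and sum each."""
    size = len(record_costs) // n_partitions
    return [sum(record_costs[i * size:(i + 1) * size]) for i in range(n_partitions)]


# Hypothetical skewed input: the first 1,000 of 10,000 records are 5x as expensive.
record_costs = [5] * 1_000 + [1] * 9_000
CORES = 100

if __name__ == "__main__":
    for n_partitions in (100, 1_000):
        total_time = makespan(partition_costs(record_costs, n_partitions), CORES)
        print(n_partitions, "partitions ->", total_time)
```

With one partition per core, the job lasts as long as the heaviest partition (500 time units in this toy input) while most cores sit idle. With ten waves of smaller partitions the same work finishes in 140 time units, which equals total cost divided by core count.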
Despite the overhead of startup and stragglers, though, usually the pros of using new ephemeral clusters per-job vastly outweigh the cons; maintaining perfect utilization of a large shared cluster isn't easy either, and the benefits of using ephemeral clusters include vastly improved agility and scalability, letting you optionally adopt new software versions, switch regions, switch machine types, incorporate brand-new hardware features (like GPUs) if they become needed, etc. Here's a blog post by Thumbtack discussing the benefits of such "job-scoped clusters" on Dataproc.
A slightly different architecture, if your jobs are very short (i.e. if each one only runs a couple of minutes, which amplifies the downside of startup overhead) or the straggler problem is unsolvable, is to use "pools" of clusters. This blog post touches on using "labels" to easily maintain pools of larger clusters where you still tear down and create clusters regularly to ensure agility of version updates, adopting new hardware, etc.
You might want to explore my solution for Autoscaling Google Dataproc Clusters
The source code can be found here

Massive query with predicate questions

I am working on a project to migrate my repository to Hazelcast.
I need to find some documents by data range, store type and store ids.
During my tests I got 90k throughput using one c3.large instance, but when I execute the same test with more instances the result decreases significantly (10 instances: 500k; 20 instances: 700k).
These numbers were the best I could get by tuning some properties:
hazelcast.query.predicate.parallel.evaluation
hazelcast.operation.generic.thread.count
hz:query
I have tried changing the instance type to c3.2xlarge to get more processing power, but the numbers don't justify the price.
How can I optimize Hazelcast to be faster in this scenario?
My use case doesn't use map.get(key), only map.values(predicate).
Settings:
Hazelcast 3.7.1
Map as Data Structure;
Complex object using IdentifiedDataSerializable;
Map index configured;
Only 2000 documents on map;
Hazelcast embedded configured by Spring Boot Application (singleton);
All instances in same region.
Test
Gatling
New Relic as service monitor.
Any help is welcome. Thanks.
If your use case only involves map.values with a predicate, I would strongly suggest using the object type as the in-memory storage format. This way, no serialization is involved during query execution.
On the other hand, it is normal to get very high numbers when you only have 1 member, because no data moves across the network. As a potential improvement, I would check EC2 instances with higher network capacity. For example, c3.8xlarge has a 10 Gbit network, compared to the "High" network performance that comes with c3.2xlarge.
I can't promise how much of an increase you will get, but I would definitely try these changes first.
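The figures quoted in the question can also be read per instance, which makes the scaling loss explicit. The sketch below assumes the quoted numbers (90k, 500k, 700k) are total throughput in the same, unstated unit; the function name `scaling_report` is made up for this example.

```python
def scaling_report(baseline_total, runs):
    """Return per-instance throughput and efficiency relative to one instance.

    `runs` maps instance count -> total throughput, in the same unit as
    `baseline_total` (the unit is not stated in the question).
    """
    report = {}
    for instances, total in runs.items():
        per_instance = total / instances
        efficiency = per_instance / baseline_total
        report[instances] = (per_instance, efficiency)
    return report


if __name__ == "__main__":
    # Figures from the question, read as total throughput (an assumption).
    report = scaling_report(90_000, {10: 500_000, 20: 700_000})
    for instances, (per_instance, efficiency) in report.items():
        print(f"{instances} instances: {per_instance:,.0f} per instance, "
              f"{efficiency:.0%} of the single-instance rate")
```

Under that reading, each member delivers roughly 56% of the single-instance rate at 10 instances and roughly 39% at 20, which is consistent with the cost of moving data across the network growing with cluster size.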

Real time analytics Time series Database

I'm looking for a distributed time series database that is free to use in a cluster setup, is production ready, and fits well in the Hadoop ecosystem.
I have an IoT project with around 150k sensors that send data every 10 minutes or every hour, so I'm looking for a time series database with useful functions such as metric aggregation, downsampling, and pre-aggregation (rollups). I found a comparison in this Google spreadsheet document: time series database comparative.
I have tested OpenTSDB, and the data model of the HBase row key really suits my use case, but the functions that still need to be developed for my use case are:
aggregate multiples metrics
do rollups
I have also tested KairosDB, which is a fork of OpenTSDB with a richer API that uses Cassandra as its backend storage. Its API does everything I'm looking for: downsampling, rollups, querying multiple metrics, and a lot more.
I have also tested Warp10.io and Apache Phoenix, which, as I read here (Hortonworks link), will be used by Ambari Metrics, so I assume it is well suited for time series data too.
My question is: as of now, what is the best time series database for real-time analytics with query response times under 1 s for all types of requests? For example: we want the average of the aggregated data sent by 50 sensors over a period of 5 years, resampled by month.
I assume such requests can't be answered in under 1 s, so I believe we need some rollup/pre-aggregation mechanism for them, but I'm not sure, because there are a lot of tools out there and I can't decide which one suits my needs best.
I'm the lead for Warp 10 so my answer can be considered opinionated.
Given your projected data volume, 150k sensors sending data every 10 minutes, that is a mean of 250 datapoints per second and less than 40B datapoints over a period of 5 years. Such a volume can easily fit on a simple Warp 10 standalone instance, and if you later need a larger infrastructure you can migrate to a distributed Warp 10 based on Hadoop.
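The volume estimate above can be checked with a few lines of arithmetic. This sketch assumes every sensor reports exactly once per 10 minutes and ignores leap days.

```python
SENSORS = 150_000
INTERVAL_SECONDS = 10 * 60          # one datapoint per sensor every 10 minutes
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap days
YEARS = 5

rate = SENSORS / INTERVAL_SECONDS        # mean datapoints per second
total = rate * SECONDS_PER_YEAR * YEARS  # datapoints accumulated over 5 years

print(rate)   # 250.0
print(total)  # 39420000000.0, i.e. about 39.4 billion
```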
In terms of requests, if your data is already resampled, fetching 5 years of monthly data for 50 sensors is only 3000 datapoints. Warp 10 can do that in far less than 1 s, and doing the automatic rollups is just a matter of scheduling WarpScript code to run monthly, nothing fancy.
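To show what such a monthly rollup computes, here is a minimal sketch in plain Python. It is not WarpScript and does not use any Warp 10 API; the function name `monthly_rollup`, the `(unix_timestamp, value)` point format, and the sample values are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_rollup(points):
    """Average raw (unix_timestamp, value) points per calendar month (UTC)."""
    sums = defaultdict(lambda: [0.0, 0])  # (year, month) -> [sum, count]
    for ts, value in points:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        bucket = sums[(dt.year, dt.month)]
        bucket[0] += value
        bucket[1] += 1
    return {month: total / count for month, (total, count) in sorted(sums.items())}


if __name__ == "__main__":
    jan_1_2020 = 1577836800  # 2020-01-01T00:00:00Z
    feb_1_2020 = 1580515200  # 2020-02-01T00:00:00Z
    raw = [(jan_1_2020, 10.0), (jan_1_2020 + 600, 20.0), (feb_1_2020, 30.0)]
    print(monthly_rollup(raw))

    # Size of the pre-aggregated result set discussed above:
    print(50 * 12 * 5)  # 50 sensors x 12 months x 5 years = 3000 datapoints
```

Once the rollup runs on a schedule, the 5-year query reads only the monthly values, not the billions of raw points.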
Lastly, in terms of integration with the Hadoop ecosystem, Warp 10 is on top of things with integration of the WarpScript language in Pig, Spark, Flink and Storm. With the Warp10InputFormat you can fetch data from a Warp 10 platform or you can load data using any other InputFormat and then manipulate them using WarpScript.
At OVH we are heavy users of #OvhMetrics, which relies on Warp10/HBase, and we provide a protocol abstraction with OpenTSDB/WarpScript/PromQL/...
I have no personal stake in Warp10, but it has been a great success for us, both on the scaling challenge and for the use cases that WarpScript can cover.
Most of the time we don't even leverage the Hadoop/Flink integration, because our customers' needs are addressed easily with the real-time WarpScript API.
For real-time analytics, you can try Druid, an open source project maintained by Apache, or you can also check out databases specialized for IoT: GridDB and CrateDB. The best way is to test these databases yourself and see if they suit your needs. You can also connect these databases as a sink to Kafka.
When you are dealing with an IoT project, you need to forecast whether you will have to maintain a large data set in the future or whether you are happy with downsampled data. Some TSDBs, like InfluxDB, have good compression, but others may not scale beyond tens of terabytes, so if you think you need to scale big, look for one with a scale-out architecture.
