How big can batches get in Flink or Spark?

I am currently working on a framework for an analysis application for a large-scale experiment. The experiment has about 40 instruments, each generating about 1 GB/s of data with nanosecond timestamps. The data is intended to be analysed in time chunks.
For the implementation I would like to know how big such a "chunk", i.e. batch, can get before Flink or Spark stop processing the data. I think it goes without saying that I intend to collect the processed data back afterwards.

For live data analysis
In general, there is no hard limit on how much data you can process with these systems. It all depends on how many nodes you have and what kind of query you run.
Since it sounds like you mainly want to aggregate per instrument over a given time window, your maximum scale-out is limited to 40; that's the maximum number of machines you could throw at the problem. The question then becomes how big your time chunks are and how complex the aggregations get. Assuming your aggregation requires all data of a window to be present, the system needs to hold 1 GB per second of window length. So if your window is one hour, the system needs to hold at least 3.6 TB of data.
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you only need to calculate small aggregates (like sums or averages), main memory shouldn't become an issue.
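To make the windowed-aggregation idea concrete, here is a minimal sketch using Spark Structured Streaming (one possible choice); the Kafka topic, schema and window length are assumptions for illustration, not part of the original question:

```python
# Minimal sketch: per-instrument aggregation over event-time windows in
# Spark Structured Streaming. Topic, schema and window size are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("instrument-window-agg").getOrCreate()

schema = StructType([
    StructField("instrument_id", StringType()),
    StructField("ts", TimestampType()),   # event timestamp from the instrument
    StructField("value", DoubleType()),
])

# Hypothetical Kafka source carrying JSON-encoded readings.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "instrument-readings")
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Small aggregates (count/average) per instrument and 1-hour window; the
# watermark bounds how much window state has to be kept in memory.
agg = (events
       .withWatermark("ts", "10 minutes")
       .groupBy(F.window("ts", "1 hour"), "instrument_id")
       .agg(F.count("*").alias("n"), F.avg("value").alias("avg_value")))

query = agg.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```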
For old data analysis
When analysing old data, the system can do batch processing and has many more options for handling the volume, including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that, or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystems and how well they suit you. Which languages do you want to use? It feels like Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.
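If you end up exploring the old data with PySpark in a notebook, a per-chunk aggregation could look roughly like the following; the Parquet path and column names are made up for illustration:

```python
# Minimal sketch: batch analysis of already recorded data in a PySpark
# notebook session. The Parquet path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("offline-chunk-analysis").getOrCreate()

df = spark.read.parquet("hdfs:///experiment/run-042/")   # hypothetical location

# Bucket the fine-grained timestamps into one-hour chunks and aggregate
# per instrument; only small per-group state is needed for these aggregates.
hourly = (df
          .withColumn("hour", F.date_trunc("hour", F.col("ts")))
          .groupBy("instrument_id", "hour")
          .agg(F.count("*").alias("n"), F.avg("value").alias("avg_value")))

hourly.orderBy("instrument_id", "hour").show(20)
```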

Related

Does Spark's reduceByKey use a constant amount of memory, or a linear one in the number of keys?

As far as I know, there are external sorting solutions, e.g. in Hadoop MapReduce, that allow a constant amount of memory, and no more, to be used when sorting/grouping data by key before piping it through an aggregation function for each key.
Assume that the reduce state is of constant size as well, as it is for addition.
Is this constant-memory grouping/sorting available in Apache Spark or Flink as well, and if so, is there any specific configuration or programmatic way of asking for this constant-memory mode of processing in the case of reduceByKey or aggregateByKey?
Both systems need to perform the operation this way implicitly, as the Java processes only get a fixed amount of main memory. Note that when the data to sort gets much larger than the available memory, data needs to be spilled to disk. In the case of sorting, and depending on your query, it may mean that the complete dataset has to be materialized across main memory and disk.
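For illustration, here is a minimal PySpark sketch of the reduceByKey case with an additive (commutative and associative) function, where Spark only ever keeps one running value per key and spills to disk when its shuffle buffers fill up; the example data is obviously made up:

```python
# Sketch: reduceByKey with an additive function keeps one running value per
# key, combining partial sums map-side before the shuffle; anything beyond
# the configured executor memory is spilled to disk by Spark itself.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-by-key-sum").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Addition is commutative and associative, so the per-key reduce state is a
# single number regardless of how many values a key has.
sums = pairs.reduceByKey(lambda x, y: x + y)
print(sums.collect())   # e.g. [('a', 4), ('b', 6)]
```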
If you are asking whether you can limit the memory consumption of a specific operator, then things look much more complicated. You could restrict your application to that one specific operation and use the global memory settings to limit its consumption, but that would result in a complicated setup.
Do you have a specific use case in mind, where you would need to limit the memory of a specific operation?
Btw, you can consider Spark and Flink to supersede Hadoop MapReduce. There are just a couple of edge cases where MapReduce may be able to beat the next-generation systems.

What does it mean to be write heavy in Cassandra?

I am kind of confused about what it means to be write heavy in Cassandra. Is this relative to the read frequency on a particular cluster, or something absolute?
E.g. let's say in our environment we have around 500 writes per second and maybe 100 queries per second.
In this case writes are relatively higher than reads. But if we look at just the writes from a Cassandra perspective, is this really high?
We have SSDs in prod with around 64 GB of RAM.
Writes are much faster than reads, and the defaults are tuned around data models that write more than they read, i.e. that denormalize the data.
4k writes/sec per core is kind of a standard rule of thumb that is safe (per host). With a big host like that, if tuned correctly, it's perfectly reasonable to hit ~100k-200k writes a second (if you don't run out of disk). When you introduce reads that drops significantly, but it's really a factor of the data model and the client; it should be relatively easy to do 20k a second with 50% reads/writes, and it can definitely go higher. With a bad data model, though, all of that is out the window.
If you're doing below 1k reads+writes a second per host, you shouldn't have many issues unless your data model is an anti-pattern, and even then, with that load it might work out.
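As a rough illustration of driving write throughput from a client, here is a sketch using the DataStax Python driver's concurrent-execution helper; the keyspace, table, schema and concurrency level are assumptions you would adapt to your cluster:

```python
# Sketch: pushing writes with the DataStax Python driver's concurrency helper.
# Keyspace, table, schema and concurrency level are hypothetical.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")          # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)")

rows = [("sensor-1", i, float(i)) for i in range(100000)]

# Keep ~200 requests in flight instead of waiting on each write in turn;
# sustained in-flight concurrency is what gets you into the tens of
# thousands of writes per second on a well-sized host.
results = execute_concurrent_with_args(session, insert, rows, concurrency=200)
print(sum(1 for ok, _ in results if ok), "writes succeeded")
```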

Count distinct in infinite stream

I am looking for a way to build a streaming application that can withstand millions of events per second and output a distinct count of those events in real time. As this stream is not bounded by any time window, it obviously has to be backed by some storage. However, I cannot find the best way to do this while maintaining a good level of abstraction (meaning that I want a framework to handle storing and counting for me, otherwise I don't need a framework at all). The preferred storage options for me are Cassandra and Redis (ideally both).
The options I've considered are Flink, Spark and Kafka Streams. I do know the differences between them, but I still can't pick the best solution. Can someone advise? Thanks in advance.
Regardless of which solution you choose, if you can tolerate the count not being 100% accurate (but very, very close), you can have your operator use HyperLogLog (there are Java implementations available). This allows you to avoid keeping data about each individual item, drastically reducing your memory usage.
Assuming Flink, the necessary state is quite small (< 1 MB), so you can easily use the FsStateBackend, which is heap-based and checkpoints to the file system, allowing you to reduce serialization overhead.
Again assuming you go with Flink, using the ContinuousEventTimeTrigger you can also get a view into how many unique items are currently being tracked.
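To give a feel for the HyperLogLog footprint, here is a small Python sketch using the datasketch package (one implementation choice among several; in a Flink job you would hold an equivalent Java sketch in keyed state):

```python
# Sketch of the HyperLogLog idea using the `datasketch` package (one of
# several implementations; a Flink job would keep an equivalent Java sketch
# in keyed state and checkpoint it with the FsStateBackend).
from datasketch import HyperLogLog

hll = HyperLogLog(p=14)   # 2**14 registers, roughly 1% relative error

for i in range(1000000):
    event_id = "event-%d" % i          # stand-in for the real event key
    hll.update(event_id.encode("utf-8"))

# The sketch stays a few KB no matter how many distinct items it has seen.
print("approximate distinct count:", int(hll.count()))
```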
I'd suggest reconsidering the choice of storage system. Using an external system is significantly slower than using local state. Flink applications maintain state locally on the JVM heap or in RocksDB (on disk) and can checkpoint it at regular intervals to persistent storage such as HDFS. This state can grow very big (tens of TBs) and still be maintained efficiently because checkpoints can be done incrementally and asynchronously. This gives much better performance than sending a query to an external system for each record.
If you still prefer Redis or Cassandra, you can use Flink's AsyncIO operator to send asynchronous requests to improve the throughput of your application.

How to exhaust a machine's resources with RethinkDB?

I'm asking this question because I would like to understand how I can run RethinkDB better, meaning what kind of hardware and filesystem it should be running on, and which other system configurations maximize its throughput.
I'm trying to fill a table as fast as I can with documents that look like {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one nodejs process doing this and I can insert on average 10k documents per second, which is pretty good.
But while this is happening, rethinkdb is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the nodejs process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another nodejs process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load keeps the same.
I fired up iotop and I do see a lot of activity there, but not what I expected. I configured two SSDs in a RAID0, and a quick dd test says I can write and read at about 800 MBps. That's far above the actual read and write speeds iotop reports (average read ~14 MBps, average write ~50 MBps).
So how can I exhaust my machine's resources? What does RethinkDB need to run faster? Why doesn't it use more resources and achieve higher throughput?
More information on what it's running on: it's an EX40SSD from Hetzner, two SSDs in a software RAID0, ext4 filesystem (tomorrow I'll try mounting with noatime to see if it's better). The RethinkDB configuration is all defaults; the inserts are done to a table that has only one shard and one replica. Please feel free to ask anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not batching the writes and instead writing one document at a time (which generally approximates real-world use cases a little better). Then start hundreds of parallel clients instead of just one or two, and have all of them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM; I recommend something in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, and up to a certain limit it also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.
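Putting the first two suggestions together, a sketch of such a benchmark client with the RethinkDB Python driver might look like this; the table name, batch size and iteration count are assumptions, and you would start several copies of the script to get concurrent clients:

```python
# Sketch of the suggestions above with the RethinkDB Python driver: larger
# batches and soft durability. Table name, batch size and loop count are
# hypothetical; run several copies of this script for concurrent clients.
import random
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect("localhost", 28015, db="test")

BATCH_SIZE = 5000          # much larger than the original 200

counter = 0
for _ in range(100):
    batch = [{"n": counter + i, "rand": random.random()}
             for i in range(BATCH_SIZE)]
    counter += BATCH_SIZE
    r.table("bench").insert(batch, durability="soft").run(conn)

conn.close()
```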

HIVE/HDFS for realtime storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements,
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequently)
b. Partial data for a user needs to be accessed periodically (more frequently). E.g. I need the sensor data collected over the last hour/day/week/month for analysis/reporting.
I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive to such a use case? I am concerned that while the distributed storage needs would be met, it seems more suited to data warehousing applications than to real-time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
I think HBase can be a good option for you. In fact, there's already an open-source implementation on top of HBase that solves a similar problem and that you might want to use. Take a look at OpenTSDB. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable. Thanks to HBase's scalability, OpenTSDB allows you to collect many thousands of metrics from thousands of hosts and applications, at a high rate (every few seconds). OpenTSDB will never delete or downsample data and can easily store billions of data points. As a matter of fact, StumbleUpon uses it to keep track of hundreds of thousands of time series and collects over 600 million data points per day in their main production datacenter.
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply be running several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would lookup all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
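As a sketch of what such a data model could look like (the keyspace, table and column names are assumptions, not a prescribed schema), here is the per-sensor partition layout and a time-slice query via the Python driver:

```python
# Sketch of a per-sensor time-series layout in Cassandra and the "slice of
# time for one sensor" query, via the Python driver. Keyspace, table and
# column names are hypothetical.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sensordata
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition per (sensor, day), rows clustered by timestamp: a time slice
# for one sensor becomes a single contiguous read from one partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensordata.readings (
        sensor_id text,
        day       text,
        ts        timestamp,
        value     blob,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

rows = session.execute(
    "SELECT ts, value FROM sensordata.readings "
    "WHERE sensor_id = %s AND day = %s AND ts >= %s",
    ("sensor-42", "2015-06-01", datetime(2015, 6, 1, 11, 0)))
for row in rows:
    print(row.ts, row.value)
```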
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.
