Controlling batch size to hint scheduler performance - apache-spark

I would like to manually tune how big my mini-batches are (in terms of cardinality). A way to set the maximum number of events would be enough, but a way to set both a maximum and a minimum would be better.
The reason I want to play with this is that I know for a fact that my processing code does not scale linearly.
In my particular case I'm not doing time aggregation, so I don't really care about time-frame aggregation; I care about draining the "input queue" as quickly as possible, by hinting to the engine how many elements to process at a time.
However, if there's no way to set the max/min batch cardinality directly, I could probably work around the limitation with a dummy time-aggregation approach, by stamping my input data before Spark consumes it.
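For reference, the closest knob I'm aware of is the receiver/Kafka rate limit, which, multiplied by the batch interval, effectively caps how many events land in a batch. A rough sketch with made-up numbers (not necessarily what I'd actually use):

// Rough sketch of bounding batch cardinality via rate limits (numbers are made up).
// maxRate (records per second per receiver) times the batch interval gives an upper
// bound on how many events end up in a single mini-batch.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BoundedBatches")
  .set("spark.streaming.receiver.maxRate", "10000")            // cap the receiver input rate
  .set("spark.streaming.kafka.maxRatePerPartition", "5000")    // same idea for the direct Kafka source
  .set("spark.streaming.backpressure.enabled", "true")         // let Spark adapt the rate dynamically

val ssc = new StreamingContext(conf, Seconds(5))               // at most ~50,000 events per batch per receiver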
Thanks

Related

How big can batches in Flink respectively Spark get?

I am currently working on a framework for an analysis application for a large-scale experiment. The experiment involves about 40 instruments, each generating about a GB/s with ns timestamps. The data is intended to be analysed in time chunks.
For the implementation I would like to know how big such a "chunk", aka batch, can get before Flink or Spark stops processing the data. I think it goes without saying that I intend to re-collect the processed data.
For live data analysis
In general, there is no hard limit on how much data you can process with these systems. It all depends on how many nodes you have and what kind of query you have.
As it sounds like you would mainly want to aggregate per instrument on a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then, the question arises of how big your time chunks are and how complex the aggregations become. Assuming that your aggregation requires all data of a window to be present, the system needs to hold 1 GB per second. So if your window is one hour, the system needs to hold at least 3.6 TB of data.
If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so that would be the practical limit. Flink can spill almost all data to disk, but then disk I/O becomes a bottleneck.
If you only need to calculate small values (like sums or averages), main memory shouldn't become an issue.
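To make "aggregate per instrument on a given time window" concrete, here is a minimal Spark Structured Streaming sketch; the schema, paths, window length, and watermark are assumptions, not part of the question:

// Minimal sketch of a per-instrument windowed aggregation.
// Schema, file source, window length, and watermark are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, window}

val spark = SparkSession.builder.appName("InstrumentWindows").getOrCreate()

val readings = spark.readStream
  .schema("instrumentId STRING, eventTime TIMESTAMP, value DOUBLE")  // assumed schema
  .json("/data/incoming")                                            // illustrative source

// One aggregate per instrument per one-hour window; the per-key parallelism
// is naturally bounded by the ~40 instruments.
val hourly = readings
  .withWatermark("eventTime", "10 minutes")
  .groupBy(col("instrumentId"), window(col("eventTime"), "1 hour"))
  .agg(avg("value").as("avgValue"))

hourly.writeStream
  .outputMode("append")                                  // emitted once the watermark closes a window
  .format("parquet")
  .option("path", "/data/hourly")
  .option("checkpointLocation", "/data/checkpoints")
  .start()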
For old data analysis
When analysing old data, the system can do batch processing and has many more options to handle the volume, including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that, or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.
I'd rather look at the ecosystem and at which one suits you better. Which languages do you want to use? It feels like using Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.

Spark count dataframe to estimate output partitions, then write, efficiently without caching?

As my Spark program runs on more data, I think it is crashing because I'm picking up the default number of output partitions for aggregation, namely 200. I've learned how to control this, but ideally I would set the number of output partitions based on the amount of data I'm writing. Herein lies the conundrum: I need to first call count() on the dataframe, and then write it. That means I may re-read it from S3 twice. I could cache and then count, but I've seen Spark crash when I cache this data; caching seems to use the most resources, whereas if I just write it, it can do something more optimal.
So my questions are: do you think this is a decent approach, doing a count first (the count is a proxy for the size on disk), or should you just hard-code some numbers and change them when needed? And if I am going to count first, is there some clever way to optimize things so that the count and write share work, other than caching the whole dataframe?
Yes, the count approach is actually the correct way to go. Ideally you want your RDD partitions to be of some considerable size, like 50MB, before writing. Otherwise you will end up with the "small file problem".
Now, if you have a lot of data, caching it in memory could be hard. You could try MEMORY_AND_DISK, but then the data will spill to disk and cause a slowdown.
I have faced this predicament multiple times, and every time I have chosen a "magic number" for the number of partitions. The number is parameterized, so when I need to change it I don't need to change the code; I just pass a different parameter.
If you know your data size is generally in a particular range, you could hard-code the partition number. It is not ideal, but it gets the job done.
You could also publish metrics, like the size of the data in S3, and raise an alarm if it breaches some threshold; then someone could change the partition number manually.
In general, if you keep the partition number moderately high, like 5000 for approximately 500GB of data, it works for a large range, i.e. from 300GB to 1.2TB of data. This means you probably don't need to change the partition number too often if you have a moderate inflow of data.
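A sketch of the count-then-repartition approach with a parameterized target partition size; the average row size and the 50MB target are assumptions to tune for your own data:

// Sketch: derive the number of output partitions from a count, then write.
// avgRowBytes and targetPartitionBytes are assumptions you would tune for your data.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("SizedOutput").getOrCreate()

def writeWithSizedPartitions(df: DataFrame,
                             outputPath: String,
                             avgRowBytes: Long = 200L,
                             targetPartitionBytes: Long = 50L * 1024 * 1024): Unit = {
  val rows = df.count()                                  // first pass over the source
  val estimatedBytes = rows * avgRowBytes
  val numPartitions = math.max(1, (estimatedBytes / targetPartitionBytes).toInt)
  df.repartition(numPartitions)                          // second pass re-reads the source unless cached
    .write
    .mode("overwrite")
    .parquet(outputPath)
}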

Temporary speed improvements

In our workflow, we have little ongoing work in ArangoDB (~1% CPU use). For about 30 minutes of the day, usage spikes and we need it to be more performant (bringing a 3s query down to 1s).
Instead of moving up the instance it's hosted on, is there a way to get more out of Arango temporarily during peak times? Would this be clustering, or should we just look into temporarily boosting the instance that it's on?
Accumulating the above suggestions, plus adding some more that fit the generic nature of this question:
If possible, split the read/write workload, either in time, by holding back writes, or by switching to a new collection for the new writes.
Make sure indices are properly set (use explain).
Try whether query profiling can help you improve the performance.

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be possible and viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
The function is documented here: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
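A minimal sketch of that incremental form; the socket source, durations, and checkpoint path are illustrative only:

// Sketch of reduceByKeyAndWindow with an inverse reduce function over a one-day window.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DailyWindowedCounts")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoints")   // checkpointing is required for the invFunc variant

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1L))

val dailyCounts = pairs.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // fold in batches entering the window
  (a: Long, b: Long) => a - b,   // "inverse reduce" batches leaving the window
  Minutes(60 * 24),              // window length: one day
  Minutes(5)                     // slide interval
)

dailyCounts.print()
ssc.start()
ssc.awaitTermination()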
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
It's hard to deal with late data when you're streaming over exactly a day. If your data is tagged with dates, then you need to wait until all your data arrives. A 1-day window in streaming would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.

Don't understand the Cassandra JMX MBean metrics

There are several questions about the Cassandra JMX MBeans.
1. How is the Mean attribute calculated?
I have monitored metrics with JConsole, and I see the values of the Write MBean.
There is a Mean attribute in the Write MBean, and I don't know how the value is computed in Cassandra; I doubt that the value is right.
I made a JUnit test:
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;
Timer latency = new Timer();
latency.update(timeTaken, TimeUnit.MILLISECONDS);
I input three values: 0, 1 and 2 (milliseconds),
and expected the mean value to be 1000 microseconds, but it is actually 1131.752 microseconds, which confuses me a lot.
3. There are the Mean attribute, the 50thPercentile attribute, and so on,
but I cannot get an instantaneous value when I want to see quick changes in the cluster; none of these attributes reflects the changes.
It's important to note that the latencies are estimates, not exact values. Cassandra cannot store every latency that has occurred, or it would run out of memory, so it keeps an approximate reservoir or histogram of all the latencies (depending on the version) that it uses to calculate the statistics. Assuming you're on C* 2.2 or later, it stores a histogram of the latencies and can calculate the approximate mean, min, max and percentiles within a given error %.
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/EstimatedHistogram.java#L227 is the mean calculation. Since each bucket represents a range of latencies and the calculation uses the high end of each bucket, the reported mean will, at worst, be higher than the actual latency.
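As a toy illustration of that bucketed mean (this is not the real EstimatedHistogram code, and the bucket boundaries here are invented), a few latencies of 0, 1 and 2 ms can easily produce a mean above the exact 1000 microseconds:

// Toy illustration of a mean computed from a bucketed latency histogram.
// Not the real EstimatedHistogram code; bucket offsets are invented for the example.
// Each bucket counts latencies up to its offset, and the offset (the high end) is
// used as the representative value, so the estimated mean is biased upward.
val bucketOffsets = Array(500L, 1000L, 1500L, 2000L)   // bucket upper bounds in microseconds
val bucketCounts  = Array(1L, 1L, 0L, 1L)              // latencies of 0, 1 and 2 ms land here

val total = bucketCounts.sum
val estimatedMean = bucketOffsets.zip(bucketCounts)
  .map { case (offset, count) => offset * count }
  .sum.toDouble / total                                // ~1166.7 us in this toy case, above the exact 1000 us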
Before 2.2 this was kept differently (see http://metrics.dropwizard.io/3.1.0/ for details).
Aside: the mean is a pretty bad statistic to go by for latencies, so don't put too much stock in it; percentiles are better to look at.
