After establishing my temporal database, I used timeline.characterize_static() to calculate my DLCA results and found that the CO2 emission result had blown up exponentially.
Since my activity involves complex processes, ranging from mining and manufacturing to recycling, and I have a temporal distribution for almost every process, the Timeline calculation seems to multiply the temporal distributions of all the datasets by each other's amounts. Should I put all the upstream processes into one activity to avoid double counting? Or is there another way to deal with the complex upstream processes of an activity?
I would like to manually tune how big my mini-batches (in terms of cardinality) are. A way to set max number of events would be enough, but if there's a way to set max/min that would be better.
The reason I want to mess around with this is because I know for a fact that my processing code does not scale linearly.
In my particular case I'm not doing time aggregation, so I don't really care about time-frame aggregation; I only care about depleting the "input queue" as soon as possible (by hinting to the engine how many elements to process at a time).
However, if there's no way to set the max/min batch cardinality directly, I could probably work around the limitation with a dummy time aggregation approach, by stamping my input data before Spark consumes it.
Thanks
We are exploring using Cassandra as a way to store time-series data, so this may be somewhat of a noob question. One of the use cases is to read data from a Kafka stream, look for matches, and increment a counter (e.g. 5 customers have clicked through link alpha on page beta, so increment (beta, alpha) by 5). However, we expect a very wide degree of parallelism to keep up with the load, so there may be more than one consumer reading from Kafka at the same time.
My question is: How would Cassandra resolve multiple simultaneous writes to a given counter from multiple sources?
It's my understanding that multiple writes to the counter with different timestamps will be added to the counter in the timestamp order received. However, if there were a simultaneous write with the exact same timestamp, would the LWW model of Cassandra throw out one of those counter increments?
If we were to have a large cluster (100+ nodes), ALL or QUORUM writes may not be sufficiently performant to keep up with the message traffic. Writes with THREE would seem likely to result in a situation where process #1 writes to nodes A, B, and C, but process #2 might write to X, Y, and Z. Would LWT work here, or do they not play well with counter activity?
I would try out a proof of concept and benchmark it; it will most likely work just fine. Counters are not super performant in Cassandra though, especially if there will be a lot of contention.
Counters are not like normal writes with simple LWW; they use Paxos with some pessimistic locking and specialized caches. The partition lock contention will slow them down some, and Paxos is an expensive process with multiple network hops and reads before writes.
Use QUORUM, and don't try to do something funky with consistency levels on counters, especially before benchmarking to know whether you need it. A 100-node cluster should be able to handle a lot as long as you're not trying to update all the same partitions constantly.
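To illustrate the difference between a last-write-wins cell and a counter, here is a toy Python model. It is only a sketch and not Cassandra's implementation: `LwwRegister`, `ToyCounter` and `run_consumers` are hypothetical names, and the lock merely stands in for the per-partition locking mentioned above. The point it shows is that increments are applied as deltas, so two increments carrying the same timestamp are both counted, whereas an LWW cell keeps only one of two competing writes.

```python
import threading


class LwwRegister:
    """Toy last-write-wins cell: keeps the value with the highest timestamp."""

    def __init__(self):
        self.value = 0
        self.timestamp = -1

    def write(self, value, timestamp):
        # Tie-breaking here is arbitrary (first write wins on equal timestamps);
        # the point is only that one of the two writes is discarded.
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp


class ToyCounter:
    """Toy counter: every increment is a delta applied under a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self, delta, timestamp):
        # The timestamp is accepted but deliberately ignored: deltas are summed.
        with self._lock:
            self.value += delta


def run_consumers(counter, n_consumers=8, increments_each=1000):
    """Simulate several stream consumers incrementing the same counter."""
    def work():
        for _ in range(increments_each):
            counter.increment(5, timestamp=42)  # same timestamp on purpose

    threads = [threading.Thread(target=work) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


register = LwwRegister()
register.write(5, timestamp=42)
register.write(7, timestamp=42)   # same timestamp: dropped by the toy LWW rule
total = run_consumers(ToyCounter())
```

In this model the contention cost is visible as well: every consumer has to wait for the same lock, which is why hammering a single counter partition from many writers is the case worth benchmarking.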
What would be some considerations for choosing stateless sliding-window operations (e.g. reduceByKeyAndWindow) vs. choosing to keep state (e.g. via updateStateByKey or the new mapWithState) when handling a stream of sequential, finite event sessions with Spark Streaming?
For example, consider the following scenario:
A wearable device tracks physical exercises performed by the wearer. The device automatically detects when an exercise starts and emits a message; it emits additional messages while the exercise is in progress (e.g. heart rate); and finally, it emits a message when the exercise is done.
The desired result is a stream of aggregated records per exercise session. i.e. all events of the same session should be aggregated together (e.g. so that each session could be saved in a single DB row). Note that each session has a finite length, but the entire stream from multiple devices is continuous. For convenience, let's assume the device generates a GUID for each exercise session.
I can see two approaches for handling this use-case with Spark Streaming:
Using non-overlapping windows, and keeping state. A state is saved per GUID, with all events matching it. When a new event arrives, the state is updated (e.g. using mapWithState), and in case the event is "end of exercise session", an aggregated record based on the state will be emitted, and the key removed.
Using overlapping sliding windows, and keeping only the first sessions. Assume a sliding window of length 2 and interval 1 (see diagram below). Also assume that the window length is 2 x (maximal possible exercise time). On each window, events are aggregated by GUID, e.g. using reduceByKeyAndWindow. Then, all sessions which started in the second half of the window are dropped, and the remaining sessions are emitted. This enables using each event exactly once, and ensures that all events belonging to the same session will be aggregated together.
Diagram for approach #2:
Only sessions starting in the areas marked with \\\ will be emitted.
-----------
|window 1 |
|\\\\|    |
-----------
     -----------
     |window 2 |
     |\\\\|    |
     -----------
          -----------
          |window 3 |
          |\\\\|    |
          -----------
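The emit rule for approach #2 can be sketched outside Spark as a small Python simulation. This is only an illustration under stated assumptions: events are hypothetical `(guid, timestamp, kind)` tuples, `MAX_SESSION` is an assumed upper bound on exercise duration, and `sessions_for_window` is a made-up helper rather than a Spark API. Each window aggregates events by GUID and keeps only the sessions whose start event falls in the first half of the window.

```python
from collections import defaultdict

MAX_SESSION = 10            # assumed upper bound on exercise duration
WINDOW = 2 * MAX_SESSION    # window length
SLIDE = MAX_SESSION         # sliding interval (half the window)


def sessions_for_window(events, window_start):
    """Aggregate one window and keep only sessions that start in its first half."""
    window_end = window_start + WINDOW
    by_guid = defaultdict(list)
    for guid, ts, kind in events:
        if window_start <= ts < window_end:
            by_guid[guid].append((ts, kind))

    emitted = {}
    for guid, items in by_guid.items():
        items.sort()
        first_ts, first_kind = items[0]
        # A session is guaranteed complete in this window only if its "start"
        # event lies in the first half; later starts are left to the next window.
        if first_kind == "start" and first_ts < window_start + SLIDE:
            emitted[guid] = [kind for _, kind in items]
    return emitted


events = [
    ("a", 1, "start"), ("a", 4, "hr"), ("a", 8, "end"),
    ("b", 9, "start"), ("b", 15, "hr"), ("b", 18, "end"),   # spans two slides
    ("c", 12, "start"), ("c", 14, "end"),
]

results = {}
for start in range(0, 30, SLIDE):
    for guid, kinds in sessions_for_window(events, start).items():
        assert guid not in results      # each session is emitted exactly once
        results[guid] = kinds
```

Session "b" crosses the boundary between two slides but is emitted only by the first window, because the second window sees its tail without the start event. Session "c" starts in the second half of the first window and is therefore deferred to the next one.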
Pros and cons I see:
Approach #1 is less computationally expensive, but requires saving and managing state (e.g. if the number of concurrent sessions increases, the state might get larger than memory). However if the maximal number of concurrent sessions is bounded, this might not be an issue.
Approach #2 is twice as expensive (each event is processed twice) and has higher latency (2 x maximal exercise time), but it is simpler and more easily manageable, as no state is retained.
What would be the best way to handle this use case - is any of these approaches the "right" one, or are there better ways?
What other pros/cons should be taken into consideration?
Normally there is no single right approach; each has tradeoffs. I'd therefore add one more approach to the mix and outline my take on the pros and cons of each, so you can decide which one is more suitable for you.
External state approach (approach #3)
You can accumulate state of the events in external storage. Cassandra is quite often used for that. You can handle final and ongoing events separately for example like below:
val stream = ...
// Split the stream: events that close a session vs. everything else.
val ongoingEventsStream = stream.filter(event => !isFinalEvent(event))
val finalEventsStream = stream.filter(event => isFinalEvent(event))
ongoingEventsStream.foreachRDD { rdd => /* accumulate state in Cassandra */ }
finalEventsStream.foreachRDD { rdd => /* finalize state in Cassandra, move to final destination if needed */ }
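To make the split between ongoing and final events concrete, here is a minimal Python sketch of the same idea without Spark or Cassandra. Everything in it is an assumption for illustration: `InMemoryStore` is a dictionary-backed stand-in for the external store (not a real Cassandra client), `is_final_event` is a hypothetical predicate, and events are plain dicts. It also assumes that a session's final event arrives in the same batch as, or a later batch than, its other events.

```python
from collections import defaultdict


def is_final_event(event):
    """Hypothetical predicate: the device marks the last message of a session."""
    return event["kind"] == "end"


class InMemoryStore:
    """Stand-in for an external store such as Cassandra (not a real client)."""

    def __init__(self):
        self.ongoing = defaultdict(list)   # guid -> accumulated events
        self.finished = {}                 # guid -> aggregated record

    def accumulate(self, event):
        self.ongoing[event["guid"]].append(event)

    def finalize(self, event):
        # Remove the session from the ongoing area and write one aggregated row.
        events = self.ongoing.pop(event["guid"], []) + [event]
        rates = [e["hr"] for e in events if "hr" in e]
        self.finished[event["guid"]] = {
            "events": len(events),
            "avg_hr": sum(rates) / len(rates) if rates else None,
        }


def process_batch(batch, store):
    """Handle one micro-batch: ongoing events first, then final events."""
    for event in batch:
        if not is_final_event(event):
            store.accumulate(event)
    for event in batch:
        if is_final_event(event):
            store.finalize(event)


store = InMemoryStore()
process_batch([{"guid": "a", "kind": "start"},
               {"guid": "a", "kind": "hr", "hr": 120}], store)
process_batch([{"guid": "a", "kind": "hr", "hr": 140},
               {"guid": "a", "kind": "end"}], store)
```

After the second batch, session "a" has been moved out of the ongoing area and exists only as one aggregated record, which mirrors the "one DB row per session" goal of the question.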
trackStateByKey approach (approach #1.1)
It might potentially be the optimal solution for you, as it removes the drawbacks of updateStateByKey, but considering that it was only just released as part of Spark 1.6 (where the final API name is mapWithState), it could be risky as well (for some reason it is not widely advertised). You can use the link as a starting point if you want to find out more.
Pros/Cons
Approach #1 (updateStateByKey)
Pros
Easy to understand or explain (to rest of the team, newcomers, etc.) (subjective)
Storage: Better usage of memory, as it stores only the latest state of each exercise
Storage: Will keep only ongoing exercises, and discard them as soon as they finish
Latency is limited only by performance of each micro-batch processing
Cons
Storage: If the number of keys (concurrent exercises) is large, the state may not fit into the memory of your cluster
Processing: It will run the updateState function for each key within the state map, so if the number of concurrent exercises is large, performance will suffer
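The per-key processing cost in the last point can be illustrated with a toy Python model of the two update styles. This is a sketch of the behaviour described here, not Spark code: `update_all_keys` imitates an updateStateByKey-style pass that visits every key held in state, `update_touched_keys` imitates a mapWithState-style pass that visits only keys with new data, and all names are hypothetical.

```python
def update_all_keys(state, batch, update):
    """updateStateByKey-style pass: the update function runs for every key in state."""
    calls = 0
    new_values = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)
    for key in set(state) | set(new_values):
        calls += 1
        result = update(new_values.get(key, []), state.get(key))
        if result is None:
            state.pop(key, None)      # returning None drops the key
        else:
            state[key] = result
    return calls


def update_touched_keys(state, batch, update):
    """mapWithState-style pass: only keys that received data are visited."""
    calls = 0
    for key, value in batch:
        calls += 1
        state[key] = update([value], state.get(key))
    return calls


def count_events(values, previous):
    """Example update function: running count of events per session."""
    return (previous or 0) + len(values)


state_a = {f"session-{i}": 1 for i in range(1000)}   # 1000 ongoing exercises
state_b = dict(state_a)
batch = [("session-1", "hr"), ("session-2", "hr")]   # only two are active now

calls_all = update_all_keys(state_a, batch, count_events)
calls_touched = update_touched_keys(state_b, batch, count_events)
```

Both passes end with the same state, but the first one invokes the update function once per stored session while the second invokes it once per incoming event, which is the gap that grows with the number of concurrent exercises.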
Approach #2 (window)
While it is possible to achieve what you need with windows, it looks significantly less natural in your scenario.
Pros
Processing: In some cases (depending on the data) it might be more efficient than updateStateByKey, due to updateStateByKey's tendency to run the update on every key even if there are no actual updates
Cons
"Maximal possible exercise time" sounds like a huge risk: it could be a pretty arbitrary duration, since it depends on human behaviour. Some people might forget to "finish" an exercise. It also depends on the kind of exercise and could range from seconds to hours, so you would want low latency for quick exercises but would have to keep latency as high as the longest exercise that could possibly exist
It feels harder to explain to others how it will work (subjective)
Storage: It will have to keep all data within the window frame, not only the latest state. It will also free the memory only when the window slides away from this time slot, not when the exercise actually finishes. While this might not be a huge difference if you keep only the last two time slots, it will increase if you try to achieve more flexibility by sliding the window more often.
Approach #3 (external state)
Pros
Easy to explain, etc. (subjective)
Pure stream-processing approach, meaning that Spark is responsible for acting on each individual event but does not try to store state, etc. (subjective)
Storage: Not limited by memory of the cluster to store state - can handle huge number of concurrent exercises
Processing: State is updated only when there are actual updates to it (unlike updateStateByKey)
Latency is similar to updateStateByKey and only limited by the time required to process each micro-batch
Cons
Extra component in your architecture (unless you already use Cassandra for your final output)
Processing: By default it is slower than processing purely in Spark, as the state is not in memory and you need to transfer the data over the network
You'll have to implement exactly-once semantics when writing data to Cassandra (for the case of a worker failure during foreachRDD)
Suggested approach
I'd try the following:
test the updateStateByKey approach on your data and your cluster
see if memory consumption and processing time are acceptable even with a large number of concurrent exercises (as expected at peak hours)
fall back to the approach with Cassandra if they are not
I think another drawback of the third approach is that the RDDs are not received chronologically, considering that they run on a cluster.
ongoingEventsStream.foreachRDD { /* accumulate state in Cassandra */ }
Also, what about checkpointing and driver node failure? In that case, do you read the whole data again? I'm curious to know how you want to handle this.
I guess mapWithState may be a better approach when you consider all these scenarios.
Good day
My system consumes a stream of trades and consists of multiple processing steps (jobs). Essentially, it processes the stream of trades and computes some metric over them (VaR). The source stream can be split into certain chunks (trades are grouped into portfolios). Currently I have a home-grown solution and am looking to switch to Samza or Flink.
Trades                           Processors      Output
{<portfolio done>, t3, t2, t1}   -> A->B->...    -> Measure
The fundamental question I struggle with is this: I need to understand when a portfolio is fully processed and the figure in the output is complete. Basically, I need to correlate the output (the aggregated measure) with the input (the trades that were processed to produce this measure) so that one can say "this figure reflects these trades". Ideally, when all portfolios are processed, I need to record the final figure.
The processing stages are non-trivial: they create intermediary streams (like return vectors), do some filtering, etc., so simple counting doesn't work.
I feel that my problem should be fairly standard, but I can't find either an algorithm or an implementation in Samza or Flink.
Closest question I found is this one: https://mail-archives.apache.org/mod_mbox/samza-dev/201510.mbox/%3CCAA5FViH_2cz9CcS4VEo49HoWHF++QKyPB5t7y+bDCoVynZBqtg#mail.gmail.com%3E
I have recently gotten involved with an old BI solution (SQL Server, SSIS, SSAS). One dimension is very bloated, with 50-ish attributes, and it processes slowly. I want to break it down into at least 2-3 dimensions to reduce processing time. My concern is that all pivot tables and other front-end reporting utilising these attributes will break and need redesigning. We are a big company, and tons and tons of Excel sheets etc. currently use this dimension.
Is there a way to split the dimension while maintaining references and filters to the affected attributes?
I would rethink this approach. I would expect splitting a dimension to increase processing time, not reduce it. SQL will need to run 2-3 queries to get the data (instead of 1), and SSAS will need to build and check its dimension-fact relationships 2-3 times (instead of 1).
I would have a look at whether the time is being spent running the SQL queries to gather the info, or in SSAS's processing of that data. You can get a rough feel by watching Task Manager while that dimension is being processed - if the SQL queries are efficient then the sqlserver.exe process should only spike up in CPU briefly, before msmdsrv.exe takes over.