Apache Beam Wall Time keeps increasing - python-3.x

I have a Beam pipeline that reads from a pubsub topic, does some small transformations and then writes the events to some BigQuery tables.
The transforms are light on processing, maybe removing a field or something else, but, as you can see from the image below, the Wall Time is very high for some steps. What can actually cause this?
Every element is actually a tuple of the form ((str, str, str), {**dict with data}). By this key we actually try to do a naive deduplication by taking the latest event by this key.
Basically, whatever I add after the Get latest element per key step is slow, and tagging is also slow, even though it just adds a tag to the element.

By "slow" I assume you mean how many elements it processes per second?
There are two things going on here. First, I assume that Get latest element per key contains a GroupByKey of sorts. This involves a global shuffle, with all elements being sent over the network to other workers, to ensure all elements with a given key are grouped onto the same worker. This IO can be expensive, at least in terms of wall time.
Secondly, steps that don't need worker-to-worker communication are "fused" which couples their throughputs. E.g. if one has DoFnA followed by DoFnB followed by DoFnC, the processing proceeds by passing the first element through DoFnA, then passing those outputs to DoFnB and subsequently DoFnC before getting the second element to pass to DoFnA. This means that if one of the Fns (or the reading or writing) has a bounded throughput, all of them will.
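The element-by-element ordering described above can be illustrated with a small plain-Python sketch. This is not Beam code: `make_step` and `run_fused` are hypothetical helpers, and the three lambdas merely stand in for DoFnA, DoFnB and DoFnC.

```python
from typing import Callable, Iterable, List, Tuple

Log = List[Tuple[str, int]]

def make_step(name: str, fn: Callable[[int], int], log: Log) -> Callable[[int], int]:
    """Wrap fn so that every call is recorded as (step name, input value)."""
    def step(value: int) -> int:
        log.append((name, value))
        return fn(value)
    return step

def run_fused(elements: Iterable[int], steps: List[Callable[[int], int]]) -> List[int]:
    """Push each element through every step before reading the next element."""
    outputs = []
    for element in elements:
        value = element
        for step in steps:
            value = step(value)
        outputs.append(value)
    return outputs

log: Log = []
steps = [
    make_step("DoFnA", lambda x: x + 1, log),
    make_step("DoFnB", lambda x: x * 2, log),
    make_step("DoFnC", lambda x: x - 3, log),
]
results = run_fused([1, 2], steps)
step_order = [name for name, _ in log]
print(step_order)
```

The recorded order is DoFnA, DoFnB, DoFnC for the first element and then the same again for the second, so a slow step anywhere in the chain delays every other step in it.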

Related

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
This function is documented as: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
It's hard to deal with late data when you're streaming exactly over a day. If your data is tagged with dates, then you need to wait till all your data arrives. A 1 day window in streaming would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.
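The "reduce the new data, inverse reduce the old data" idea from the quoted documentation can be sketched in plain Python. This is only an illustration of the technique, not Spark's implementation: `sliding_counts` is a hypothetical function, and it keeps one pre-reduced summary per batch so that it can subtract that batch when it leaves the window.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator

Counts = Dict[str, int]

def sliding_counts(batches: Iterable[Counts], window_len: int) -> Iterator[Counts]:
    """Yield per-key counts over the last window_len batches, updated incrementally."""
    window: Deque[Counts] = deque()
    totals: Counts = {}
    for batch in batches:
        # "Reduce": add the batch that enters the window.
        for key, n in batch.items():
            totals[key] = totals.get(key, 0) + n
        window.append(batch)
        # "Inverse reduce": subtract the batch that leaves the window.
        if len(window) > window_len:
            expired = window.popleft()
            for key, n in expired.items():
                totals[key] -= n
                if totals[key] == 0:
                    del totals[key]
        yield dict(totals)

batches = [{"a": 1}, {"a": 2, "b": 1}, {"b": 3}, {}]
snapshots = list(sliding_counts(batches, window_len=2))
print(snapshots)
```

Each slide touches only the entering and the leaving batch, so the work per interval does not grow with the window length. The retained state is the running totals plus one summary per batch still inside the window.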

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are bounded for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform and the counter update together? Is the counter type really needed if there's basically no race condition on the column, given that it is associated with a single user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time, then you could rely on plain ints to perform the job.
The problem, however, is that you will need to perform a read-before-write, which is an anti-pattern. You could solve this as well, e.g. by skipping the read part: cache your ints and perform in-memory updates followed by writes only. This is viable by coupling your system with a caching server (e.g. Redis).
And thinking about it, you would still need to read these counters at some point, because if the number of messages a free user can send is bounded by some value, then you need to perform a check when they log in, try to send a new message, look at the dashboard, etc., and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to directly count them with a SELECT COUNT... type query, even if this could become pretty inefficient very soon in the Cassandra world.
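A minimal sketch of the cached-int approach, under stated assumptions: `QuotaTracker` is a hypothetical class, its `cache` dict stands in for a cache server such as Redis, and its `writes` list stands in for writes sent to the database. It also assumes only one update per user is in flight at a time, as discussed above.

```python
from typing import Dict, List, Tuple

class QuotaTracker:
    """Keeps per-user message counts in memory and writes them out without reading first."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.cache: Dict[str, int] = {}          # stand-in for a cache such as Redis
        self.writes: List[Tuple[str, int]] = []  # stand-in for writes sent to the database

    def try_send(self, user: str) -> bool:
        count = self.cache.get(user, 0)
        if count >= self.limit:
            return False  # quota reached: block the action
        count += 1
        self.cache[user] = count
        # Write the new absolute value, so no database read is needed beforehand.
        self.writes.append((user, count))
        return True

tracker = QuotaTracker(limit=2)
outcomes = [tracker.try_send("alice") for _ in range(3)]
print(outcomes, tracker.writes)
```

The third attempt is rejected from the cached value alone, and only absolute values are ever written.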

How does hazelcast-jet achieve anything different from hazelcast EntryProcessors

How does hazelcast-jet achieve anything vastly different from what was earlier achievable by submitting EntryProcessors on keys in an IMap?
Curious to know.
Quoting the InfoQ article on Jet:
Sending a runnable to a partition is analogous to the work of a single DAG vertex. The advantage of Jet comes from the ability to have the vertex transform the data it reads, producing items which no longer belong to the same partition, then reshuffle them while sending to the downstream vertex so they are again correctly partitioned. This is essential for any kind of map-reduce operation where the reducing unit must observe all the data items with the same key. To minimize network traffic, Jet can first reduce the data slice produced on the local member, then send only one item per key to the remote member that combines the partial results.
And note that this is just an advantage in the context of the same or similar use cases currently covered by entry processors. Jet can take data from any source and make use of the whole cluster's computational resources to process it.
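The re-keying step described in the quote can be sketched in plain Python. This is not Jet code: the two "members", the purchase data and the `partition_of` helper are all hypothetical. Entries start out partitioned by customer, the transform emits items keyed by product, each member pre-reduces its own slice, and one partial count per key is then sent to the member that owns that key.

```python
import zlib
from collections import Counter
from typing import Dict, List

def partition_of(key: str, num_members: int) -> int:
    """Stable stand-in for the cluster's key-to-member assignment."""
    return zlib.crc32(key.encode("utf-8")) % num_members

# Each member holds entries partitioned by the ORIGINAL key (customer id).
members: List[Dict[str, List[str]]] = [
    {"cust-1": ["apple", "pear"], "cust-2": ["apple"]},
    {"cust-3": ["pear", "apple", "plum"]},
]

def local_stage(entries: Dict[str, List[str]]) -> Counter:
    """Re-key the entries by product and pre-reduce them on the local member."""
    partial: Counter = Counter()
    for products in entries.values():
        partial.update(products)
    return partial

def combine_stage(partials: List[Counter], num_members: int) -> List[Counter]:
    """Send one partial count per key to the member owning that key, then merge."""
    owners = [Counter() for _ in range(num_members)]
    for partial in partials:
        for product, count in partial.items():
            owners[partition_of(product, num_members)][product] += count
    return owners

partials = [local_stage(entries) for entries in members]
owners = combine_stage(partials, num_members=len(members))
totals = sum(owners, Counter())
print(totals)
```

An entry processor only sees the entries of its own partition, so it could produce the per-member partials but not the final per-product totals.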

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge of Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to the mappers and the driver is (somehow) similar to the reducer. As there is only one driver program, there will be one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle by the stage splits either in the UI or via a toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_+_)
In the above, the data loading (textFile), splitting (flatMap), and mapping all run in one stage, as each can be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it needs to combine all of the data to perform the operation. However, this operation has to be associative for a reason: each node performs the operation defined in reduceByKey locally first, and only the partial results are merged afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data does NOT pull back to the driver, it merely moves to nodes that have the same keys so that it can have its final value merged.
Now, if you use an action such as reduce or worse yet, collect, then you will NOT get an RDD back which means the data pulls back to the driver and you will need room for it.
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey
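To make the "combine locally, then merge by key" behaviour concrete, here is a rough plain-Python simulation of the word count. It is not PySpark and does not reflect Spark's internals: `combine_locally` and `reduce_by_key` are hypothetical helpers, and a CRC-based hash stands in for the HashPartitioner.

```python
import zlib
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, int]
Op = Callable[[int, int], int]

def combine_locally(pairs: List[Pair], op: Op) -> Dict[str, int]:
    """Apply op within one partition, before any data moves over the network."""
    out: Dict[str, int] = {}
    for key, value in pairs:
        out[key] = op(out[key], value) if key in out else value
    return out

def reduce_by_key(partitions: List[List[Pair]], op: Op, num_out: int) -> List[Dict[str, int]]:
    """Shuffle the local partial results by key hash and merge them with op."""
    out: List[Dict[str, int]] = [{} for _ in range(num_out)]
    for part in partitions:
        for key, value in combine_locally(part, op).items():
            target = out[zlib.crc32(key.encode("utf-8")) % num_out]
            target[key] = op(target[key], value) if key in target else value
    return out

lines = [["a b a"], ["b c", "a"]]  # two input partitions of text lines
partitions = [[(w, 1) for line in part for w in line.split(" ")] for part in lines]
result = reduce_by_key(partitions, lambda x, y: x + y, num_out=2)

merged: Dict[str, int] = {}
for part in result:
    merged.update(part)
print(merged)
```

The result stays spread over `num_out` output partitions, and nothing is gathered in one place until the final loop, which plays the role of an action such as collect. Because addition is associative, merging the partial sums gives the same totals as summing all the raw pairs.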

Cassandra: rotating lists

Suppose I store a list of events in a Cassandra row, implemented with composite columns:
{
event:123 => 'something happened'
event:234 => 'something else happened'
}
It's almost fine by me, and, as far as I understand, that's a common pattern. Comparing to having a single column event with the jsonized list, that scales better since it's easy to add a new item to the list without reading it first and then writing back.
However, now I need to implement these two requirements:
I don't want to add a new event if the last added one is the same,
I want to keep only N last events.
Is there any standard way of doing that with the best possible performance? (Any storage schema changes are ok).
Checking whether or not things already exist, or checking how many exist and removing extra items, are both read-modify-write operations, and they don't fit very well with the constraints of Cassandra.
One way of keeping only the N last events is to make sure they are ordered so that you can do a range query and read the N last (for example by prefixing the column key with a timestamp/TimeUUID). This wouldn't remove the outdated events; you need to do that as a separate process. But by doing it this way, the code that queries the data will only see the last N, which is the real requirement if I interpret things correctly. The garbage collection of old events is just an optimization to avoid keeping things that will never be needed again.
If the requirement isn't a strict N events, but events that are not older than T you can of course use the TTL feature, but I assume that it's not an option for you.
The first requirement is trickier. You can do a read before every write and check if you have an item, but that would be slow, and unless you do some kind of locking outside of Cassandra there is no guarantee that two writers won't both do a read and then both do a write, so that neither sees the other's write. Maybe that's not a problem for you, but there's no good way around it. Cassandra doesn't do CAS.
The way I've handled similar situations when using Cassandra is to keep a cache in the application nodes of what has been written, and check that before writing. You then need to make sure that each application node sees all events for the same row, and that events for the same row aren't distributed over multiple application nodes. One way of doing that is to have a message queue system in front of your application nodes, and divide the event stream over several queues by the same key as you use as row key in the database.
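The routing-plus-cache arrangement described above could look roughly like this. It is a sketch under stated assumptions: `queue_for` and `Writer` are hypothetical names, a CRC-based hash stands in for whatever the queueing system uses to split the stream, and the `written` list stands in for writes to Cassandra.

```python
import zlib
from typing import Dict, List, Tuple

def queue_for(row_key: str, num_queues: int) -> int:
    """Pick a queue from the row key so one row is always handled by the same node."""
    return zlib.crc32(row_key.encode("utf-8")) % num_queues

class Writer:
    """One application node: remembers the last event written for each row."""

    def __init__(self) -> None:
        self.last_written: Dict[str, str] = {}
        self.written: List[Tuple[str, str]] = []  # stand-in for writes to Cassandra

    def handle(self, row_key: str, event: str) -> bool:
        if self.last_written.get(row_key) == event:
            return False  # same as the last event for this row: skip the write
        self.last_written[row_key] = event
        self.written.append((row_key, event))
        return True

writers = [Writer() for _ in range(3)]
stream = [("u1", "login"), ("u1", "login"), ("u2", "login"),
          ("u1", "logout"), ("u1", "login")]
for row_key, event in stream:
    writers[queue_for(row_key, len(writers))].handle(row_key, event)

u1_events = [e for w in writers for k, e in w.written if k == "u1"]
print(u1_events)
```

Because all events for a row pass through one node, the duplicate check needs no read from the database and no locking between nodes. The cache would be lost if a node restarts, so a real system needs some way to rebuild it.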
