My Apache Beam pipeline takes an infinite stream of messages. Each message fans out into N elements (N is ~1000 and differs for each input). For each element produced by the previous stage there is a map operation, and the resulting N elements should be reduced using a top-1 operation (elements are grouped by the original message that was read from the queue). The result of the top-1 is saved to external storage. In Spark I can easily do this by reading messages from the stream and creating an RDD per message that does the map + reduce. Since Apache Beam does not have nested pipelines, I can't see a way to implement this in Beam with an infinite stream input. Example:
Infinite stream elements: A, B
Step 1 (fan out, N = 3): A -> A1, A2, A3
(N = 2): B -> B1, B2
Step 2 (map): A1, A2, A3 -> A1', A2', A3'
B1, B2 -> B1', B2'
Step 3 (top1): A1', A2', A3' -> A2'
B1', B2' -> B2'
Output: A2', B2'
There is no dependency between the A and B elements. A2' and B2' are the top elements within their respective groups. The stream is infinite. The map operation can take anywhere from a couple of seconds to a couple of minutes, so windowing with a watermark sized for the slowest map operation would make the overall pipeline much slower for fast map operations. A nested pipeline would help, because then I could create a pipeline per message.
It doesn't seem like you'd need a 'nested pipeline' for this. Let me show you what that looks like in the Beam Python SDK (it's similar for Java):
For example, for the dummy operation of appending a number and an apostrophe to a string (e.g. "A" => "A1'"), you'd do something like this:
def my_fn(value):
  def _inner(elm):
    return (elm, elm + str(value) + "'")  # A KV pair, keyed by the original element
  return _inner

# my_stream has [A, B]
pcoll_1 = (my_stream
           | beam.Map(my_fn(1)))

pcoll_2 = (my_stream
           | beam.Map(my_fn(2)))

pcoll_3 = (my_stream
           | beam.Map(my_fn(3)))

def top_1(elms):
  # CoGroupByKey emits (key, (iterable_1, iterable_2, iterable_3));
  # taking the max here stands in for whatever "top" means in your case.
  key, groups = elms
  return key, max(v for group in groups for v in group)

result = ((pcoll_1, pcoll_2, pcoll_3)
          | beam.CoGroupByKey()
          | beam.Map(top_1))
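Since N differs per message in your case, a fixed trio of Maps may not fit; here is a minimal sketch of a variable fan-out instead, assuming hypothetical helpers n_for(msg) and mapped(msg, i) standing in for your per-message fan-out size and the expensive map step:

import apache_beam as beam

def fan_out(msg):
    # n_for and mapped are hypothetical stand-ins for your fan-out size and map step
    for i in range(n_for(msg)):
        yield msg, mapped(msg, i)  # keyed by the original message

result = (my_stream
          | beam.FlatMap(fan_out)                 # variable N per message
          | beam.combiners.Top.LargestPerKey(1))  # top-1 within each message's group

In a streaming pipeline this combine still needs the windowing/triggering discussed in the next answer.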
So here is a sketch of a working solution. I will most likely be editing it for any mistakes I make in understanding the question. (P.S. the template code is in Java.) Assuming that input is your stream source:
PCollection<Messages> msgs = input.apply(
    Window.<Messages>into(FixedWindows.of(Duration.standardSeconds(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // fire the moment you see an element
            .withEarlyFirings(AfterPane.elementCountAtLeast(1))
            // optional, since you have a small window
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(60))
        .discardingFiredPanes());
This would allow you to read a stream of Messages, which could be a string, a HashMap, or even a list. Observe that you are telling Beam to fire a pane for every element it receives, with a maximum window of 1 second. You can change this if you want to fire every 10 messages with a window of a minute, etc.
After that, you would need to write two classes that extend DoFn:
PCollection<Element> top = msgs.apply(ParDo.of(new ExtractElements()))
.apply(ParDo.of(new TopElement()));
Where Element can be a String, an int, double, etc.
Finally, you would write each Element to storage with:
top.apply(ParDo.of(new ParsetoString()))
.apply(TextIO.write().withWindowedWrites()
.withNumShards(1)
.to(filename));
This gives you roughly one file for every message, which may be a lot. Sadly, you cannot append to an existing file, unless you add a windowing step that groups all the elements into one list and writes that out.
Of course, there is a hacky way to do it without windowing, which I will explain if this use case does not work out for you (or if you are curious).
Let me know if I missed anything! :)
I was trying to implement permutation-to-cycles in Haskell without using monads. The problem is as follows: given a permutation of the numbers [1..n], output the corresponding disjoint cycles. The function is declared like
permToCycles :: [Int] -> [[Int]]
For the input:
permToCycles [3,5,4,1,2]
The output should be
[[3,4,1],[5,2]]
By the definition of cyclic permutations, the algorithm itself is straightforward. Since [3,5,4,1,2] is a permutation of [1,2,3,4,5], we start from the first element 3 and follow the orbit until we get back to 3; in this example that gives the cycle 3 -> 4 -> 1 -> 3. We continue doing so until all elements have been traversed. Thus the output is [[3,4,1],[5,2]].
Using this idea, it is fairly easy to implement in any imperative language, but I have trouble doing it in Haskell. I found something similar in the module Math.Combinat.Permutations, but the implementation of the function permutationToDisjointCycles uses monads, which are not easy to understand as a beginner.
I was wondering if I could implement it without monads. Any help is appreciated.
UPDATE: Here is the function implemented in Python.
def permToCycles(perm):
    pi_dict = {i + 1: perm[i]
               for i in range(len(perm))}  # permutation as a dictionary
    cycles = []
    while pi_dict:
        first_index = next(iter(pi_dict))  # take the first key
        this_elem = pi_dict[first_index]   # the first element in perm
        next_elem = pi_dict[this_elem]     # next element according to the orbit
        cycle = []
        while True:
            cycle.append(this_elem)
            # delete the item in the dict when adding to cycle
            del pi_dict[this_elem]
            this_elem = next_elem
            if next_elem in pi_dict:
                # continue the cycle
                next_elem = pi_dict[next_elem]
            else:
                # end the cycle
                break
        cycles.append(cycle)
    return cycles

print(permToCycles([3, 5, 4, 1, 2]))
The output is
[[3,4,1],[5,2]]
I think the main obstacle to implementing it in Haskell is how to keep track of the marked (or unmarked) elements. In Python this is easily done using a dictionary, as shown above. Also, in functional programming we tend to replace loops with recursion, but here I have trouble seeing the recursive structure of this problem.
Let's start with the basics. You hopefully started with something like this:
permutationToDisjointCycles :: [Int] -> [[Int]]
permutationToDisjointCycles perm = ...
We don't actually want to recur on the input list so much as we want to use an index counter. In this case, we'll want a recursive helper function, and the next step is to just go ahead and call it, providing whatever arguments you think you'll need. How about something like this:
permutationToDisjointCycles perm = cycles [] 0
where
cycles :: [Int] -> Int -> [[Int]]
cycles seen ix = ...
Instead of declaring a pi_dict variable as in Python, we'll start with a seen list as an argument (I flipped it around to tracking what's been seen, because that ends up being a little easier). We do the same with the counting index, which I've called ix here. Let's consider the cases:
cycles seen ix
| ix >= length perm = -- we've reached the end of the list
| ix `elem` seen = -- we've already seen this index
| otherwise = -- we need to generate a cycle.
That last case is the interesting one and corresponds to the inner while loop of the Python code. Another while loop means, you guessed it, more recursion! Let's make up another function that we think will be useful, passing along as arguments what would have been variables in Python:
| otherwise = let c = makeCycle ix ix in c : cycles (c ++ seen) (ix+1)
makeCycle :: Int -> Int -> [Int]
makeCycle startIx currentIx = ...
Because it's recursive, we'll need a base case and recursive case (which corresponds to the if statement in the Python code which either breaks the loop or continues it). Rather than use the seen list, it's a little simpler to just check if the next element equals the starting index:
makeCycle startIx currentIx =
    if next == startIx
      then -- base case
      else -- recursive call, where we attach an index onto the cycle and recur
  where
    next = perm !! currentIx
I left a couple holes that need to be filled in as an exercise, and this version works on 0-indexed lists rather than 1-indexed ones like your example, but the general shape of the algorithm is there.
As a side note, the above algorithm is not super efficient. It uses lists for both the input and the "seen" list, and lookups in lists always take O(n) time. One very simple performance improvement is to immediately convert the input list perm into an array/vector, which has constant-time lookups, and then use that instead of perm !! currentIx at the end.
The next improvement is to change the "seen" list into something more efficient. To match the idea of your Python code, you could change it to a Set (or even a HashSet), which has logarithmic-time lookups (or expected constant time for a HashSet).
The code you found in Math.Combinat.Permutations actually uses an array of Booleans for the "seen" list and then uses the ST monad to do imperative-style mutation on that array. This is probably even faster than using a Set or HashSet, but as you could tell yourself, the readability of the code suffers a bit.
I'm facing a problem and I feel like there's a solution in Graph theory or Graph databases. My knowledge in these fields is very limited. I'm hoping someone can recognise my problem and perhaps point me to the name of a technique used to solve it.
Simplified Example:
I am dealing with time-series of states. A simple example, where there are only two states:
TS State
t0 T
t1 F
t2 F
t3 F
t4 T
t5 T
t6 T
t7 F
t... ...
I could convert this into a graph with two nodes (T and F), where the "dwell time" in each state is an attribute (in brackets):
T(1) -> F(3) -> T(3) -> F(1)
An example of my problem is to write a "query" that extracts any sub-sequence matching this pattern F(>=2) -> T(<10).
In my example above, my query would extract the sub-sequence:
F(3) -> T(3)
But if it were present in the dataset, the query could also extract sequences like:
F(2) -> T(8)
F(20) -> T(3)
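For what it's worth, here is a minimal sketch in plain Python, just to pin down the semantics: it run-length-encodes the raw state series into the (state, dwell-time) form above, and runs a predicate-based query for F(>=2) -> T(<10). All names are illustrative.

from itertools import groupby

def encode(states):
    # "TFFFTTTF" -> [("T", 1), ("F", 3), ("T", 3), ("F", 1)]
    return [(s, len(list(run))) for s, run in groupby(states)]

def matches(runs, pattern):
    # pattern is a list of (state, predicate-on-dwell-time) pairs
    out = []
    for i in range(len(runs) - len(pattern) + 1):
        window = runs[i:i + len(pattern)]
        if all(s == ps and pred(d) for (s, d), (ps, pred) in zip(window, pattern)):
            out.append(window)
    return out

runs = encode("TFFFTTTF")
print(matches(runs, [("F", lambda d: d >= 2), ("T", lambda d: d < 10)]))
# -> [[('F', 3), ('T', 3)]]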
The example I've put up is simplified: there are more than two states, and more advanced queries would allow loops, where these loops could be constrained either by the overall time spent in the loop or by the number of iterations allowed. E.g.:
`T(>2) -> [loops of F(1)->T(1)] -> T(<10)`
where the loop might be constrained to take no more than 10 iterations, or no more than 10 time units.
The icing on the cake would be to find sequences like this
T(n)->F(<n)
This translates as: sequences that start with T (and stay in T for n time units), followed by the F state, where it stays in F for less than n time units (i.e., the F is shorter than the preceding T).
What I tried:
I originally thought of converting this to a string and using a regex to extract matches. A regex could express the structural part of what I need, but falls short of comprehending arithmetic like "greater than". I guess I could keep my raw time series of states (TFFFTTTF) and run a regex on that... but it seems pretty ugly.
The fields of Natural Language Processing, graph theory, and graph databases come to mind as ones that might deal with similar problems.
I don't know how I would encode the "duration of state" attribute in my graph. I don't know if there's some sort of "industry-standard" query language for sub-sequence searches in graph databases.
Questions:
- Is there a framework for solving these sub-sequence extraction problems, and if so, what is it called? Is there a "best practice"?
- How should I structure my data?
- Is there a query language for querying sub-sequences in a database of sequences?
I might flip the problem around. You've indicated that this is time-series data. Given that, I would create a new state node every time the state changes, encode the "dwell" time in the previous node, and link the new node to the previous state node, creating a linked list in the graph database. With this structure, your pattern query becomes simple.
Objectivity/DB is a schema-based object/graph database with a complete set of graph navigational query capabilities. It has its own query language called Declarative Objectivity, or DO.
We start with a schema definition:
UPDATE SCHEMA {
  CREATE CLASS State {
    label     : String,
    dwellTime : INTEGER { Storage: B32 },
    prev      : Reference { referenced: State, Inverse: next },
    next      : Reference { referenced: State, Inverse: prev }
  }
};
Then we can execute a DO query like the following:
MATCH p = (:State {label == 'T' AND dwellTime > 5})
-->(:State {label == 'F' AND dwellTime > 5})
-->(:State {label == 'T' AND dwellTime < 2})
-->(:State {label == 'T' AND dwellTime > 100})
-->(:State {label == 'F' AND dwellTime > 100})
RETURN p;
This kind of query will find all of the "TFTTF" patterns that meet the specified dwell times.
def circularArrayRotation(a, k, queries):
    temp = a + a
    indexToCountFrom = len(a) - k
    for val in queries:
        print(temp[indexToCountFrom + val])
I am using this code to perform the rotation. The function takes the list as a, the number of times it needs to be rotated as k, and finally queries, a list containing the indices whose values are needed after all the rotations.
My code works for all the cases except some of the bigger ones.
Where am I going wrong?
link: https://www.hackerrank.com/challenges/circular-array-rotation/problem
You'll probably run into a timeout when you concatenate large lists with temp = a + a.
Instead, don't create a new list, but use the modulo operator in your loop:
print(a[(indexToCountFrom+val) % len(a)])
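For completeness, the whole function with that change might look like this (a sketch with the same signature as in the question):

def circularArrayRotation(a, k, queries):
    indexToCountFrom = len(a) - k % len(a)  # k % len(a) also guards against k > len(a)
    for val in queries:
        print(a[(indexToCountFrom + val) % len(a)])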
The code below represents a toy example of the problem I am trying to solve.
Imagine that we have an original stream of data, originalStream, and that the goal is to apply two very different kinds of processing to it. As an example here, one process will multiply each element by 2, weight it by its index, and sum the result (dataProcess1), and the other will multiply each element by 4 and sum the result (dataProcess2). Obviously the operations would not be so simple in real life...
The idea is to use jOOλ to duplicate the stream and apply one operation to each of the two resulting streams. However, the trick is that I want to run the two data processes in different threads. Since originalStream.duplicate() is not thread-safe out of the box, the code below will fail to reliably give the right result, which should be: result1 = 570; result2 = 180. Instead, the code may unpredictably fail with an NPE, yield a wrong result, or (sometimes) even give the right result...
The question is how to minimally modify the code such that it will become thread-safe.
Note that I do not want to first collect the stream into a list and then generate two new streams. Instead, I want to stay with streams until they are eventually collected at the end of the data processing. It may not be the most efficient or the most logical thing to want, but I think it is nevertheless conceptually interesting. Note also that I wish to keep using org.jooq.lambda.Seq (group: 'org.jooq', name: 'jool', version: '0.9.12') as much as possible, as the real data-processing functions use methods specific to this library that are not present in regular Java streams.
Seq<Long> originalStream = seq(LongStream.range(0, 10));
Tuple2<Seq<Long>, Seq<Long>> duplicatedOriginalStream = originalStream.duplicate();
ExecutorService executor = Executors.newFixedThreadPool(2);
List<Future<Long>> res = executor.invokeAll(Arrays.asList(
() -> duplicatedOriginalStream.v1.map(x -> 2 * x).zipWithIndex().map(x -> x.v1 * x.v2).reduce((x, y) -> x + y).orElse(0L),
() -> duplicatedOriginalStream.v2.map(x -> 4 * x).reduce((x, y) -> x + y).orElse(0L)
));
executor.shutdown();
System.out.printf("result1 = %d\tresult2 = %d\n", res.get(0).get(), res.get(1).get());
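Not a jOOλ answer, but to pin down the behaviour being asked for, here is a minimal Python sketch of duplicating a stream for concurrent consumers without collecting it first: a single producer thread pushes each element into one queue per consumer. All names here are illustrative, not part of any library.

import threading, queue

def duplicate(source, n=2, sentinel=None):
    # Fan one iterable out to n consumers, each fed through its own queue.
    queues = [queue.Queue() for _ in range(n)]
    def pump():
        for item in source:
            for q in queues:
                q.put(item)
        for q in queues:
            q.put(sentinel)  # signal end of stream
    threading.Thread(target=pump, daemon=True).start()
    return [iter(q.get, sentinel) for q in queues]

s1, s2 = duplicate(range(10))
print(sum(2 * x * i for i, x in enumerate(s1)))  # mirrors dataProcess1 -> 570
print(sum(4 * x for x in s2))                    # mirrors dataProcess2 -> 180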
I want to chain multiple iterables, everything with lazy evaluation (speed is crucial), to do the following:
read many integers from a single huge line of stdin
split() that line
convert the resulting strings to int
compute the diff between successive ints
... and some further things not shown here
The real example is more complex, here's a simplified example:
Here's a sample line of stdin:
2 13 4 16 16 15 22 17 8 8 7 6
(For debugging purposes, instream below might point to sys.stdin, or an opened filehandle)
You can't simply chain generators since map() returns a (lazily-evaluated) list:
import itertools
gen1 = map(int, (map(str.split, instream))) # CAN'T CHAIN DIRECTLY
The least complicated working solution I found is this, can it surely not be simplified?
gen1 = map(int, itertools.chain.from_iterable(itertools.chain(map(str.split, instream))))
Why the hell do I need to wrap the result of map(str.split, instream) in itertools.chain.from_iterable(itertools.chain(...)) just to process it? It sort of defeats the purpose!
Is manually defining my generators faster?
An explicit ("manual") generator expression should be preferred over using map and filter. It is more readable to most people, and more flexible.
If I understand your question, this generator expression does what you need:
gen1 = ( int(x) for line in instream for x in line.split() )
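Your step 4 (diffs between successive ints) can stay lazy too, for example with itertools.pairwise (Python 3.10+; older versions can use the tee-based recipe from the itertools docs):

import io
from itertools import pairwise  # Python 3.10+

instream = io.StringIO("2 13 4 16 16 15 22 17 8 8 7 6")
gen1 = (int(x) for line in instream for x in line.split())
diffs = (b - a for a, b in pairwise(gen1))
print(list(diffs))  # [11, -9, 12, 0, -1, 7, -5, -9, 0, -1, -1]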
You could build your generator by hand:
import string

def gen1(stream):
    # presuming that stream is of type io.TextIOBase
    s = ""
    c = stream.read(1)
    while len(c) > 0:
        if c not in string.digits:
            if len(s) > 0:
                i = int(s)
                yield i
            s = ""
        else:
            s += c
        c = stream.read(1)
    if len(s) > 0:
        i = int(s)
        yield i

import io
g = gen1(io.StringIO("12 45 6 7 88"))
for x in g:  # dangerous if stream is unlimited
    print(x)
Which is certainly not the most beautiful code, but it does what you want.
Explanations:
If your input is indefinitely long you have to read it in chunks (or character wise).
Whenever you encounter a non-digit (whitespace), you convert the characters you have read until that point into an integer and yield it.
You also have to consider what happens when you reach the EOF.
My implementation is probably not very performant, because it reads character-wise. Reading in chunks would speed it up significantly.
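For instance, a chunked variant might look like this (a sketch: split each chunk on whitespace and carry a possibly incomplete trailing token over to the next chunk):

import io

def gen1_chunked(stream, chunksize=4096):
    tail = ""  # possibly incomplete token from the previous chunk
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        parts = (tail + chunk).split()
        if not chunk[-1].isspace():
            # the chunk may have ended mid-number; keep the last token for later
            tail = parts.pop()
        else:
            tail = ""
        for p in parts:
            yield int(p)
    if tail:
        yield int(tail)

print(list(gen1_chunked(io.StringIO("12 45 6 7 88"), chunksize=4)))  # [12, 45, 6, 7, 88]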
EDIT as to why your approach will never work:
map(str.split, instream)
does simply not do what you appear to think it does. map applies the given function str.split to each element of the iterator given as the second parameter. In your case that is a stream, i.e. a file object, and for sys.stdin specifically an io.TextIOBase object. That can indeed be iterated over: line by line, which is emphatically NOT what you want! In effect you iterate over your input line by line and split each line into words, so the map generator iterates over (many) lists of words, NOT over A single list of words. That is why you have to chain them together to get one flat sequence to iterate over.
Also, the itertools.chain() in itertools.chain.from_iterable(itertools.chain(map(...))) is redundant. itertools.chain chains its arguments (each an iterable) together into one iterator. You only give it one argument, so there is nothing to chain together; it essentially returns an iterator over the map object unchanged.
itertools.chain.from_iterable(), on the other hand, takes a single argument, which is expected to be an iterable of iterables (e.g. a list of lists), and flattens it into one iterator.
EDIT2
import io, itertools
instream = io.StringIO("12 45 \n 66 7 88")
gen1 = itertools.chain.from_iterable(map(str.split, instream))
gen2 = map(int, gen1)
list(gen2)
returns
[12, 45, 66, 7, 88]