I have a raw log file and I would like to extract relationships/behavioral patterns between events.
An important point is that I do not have an ActivityId/GroupId/SessionId by which I could cluster the events (as in process mining), so a pattern can start at any moment in the log and end at any moment.
My question is: what kinds of techniques are available to extract behavioral states from a log like this:
t1, event1
t2, event2
t2, event3
t3, event4
...
t5, event11
t[N] is the timestamp, and the periodicity is not constant (one event can happen after 1 minute, the next after 5 minutes, and then 4 events can happen at the same time after 20 minutes),
so that I could say: give me all possible sequences that lead to event10.
Ideally I would like something that can produce the following outcome, something that could later be described as a number of different state machines (meaning some events can be skipped in between, and I only wait for the conditions I care about):
event5 -> event6 -> event7
event2 -> (NOT event6 + event7) -> event10
event1 -> (event8 + event9) -> (event10 + event11) -> event13
What would be the possible techniques to extract this from a stream of events?
Most of the approaches I know look at the data from a bag-of-events perspective and then search for similar patterns inside the bags, but what if I do not have this grouping and still want to extract repeated processes/patterns?
Frequent Sequence Mining.
A variant of frequent itemset mining that takes temporal order into account.
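As a minimal sketch of the idea in Python (the event names, the 10-minute sliding window standing in for the missing session ids, and the support threshold are all assumptions for illustration), one can slide a time window over the unsegmented log and count the ordered subsequences that recur:

from collections import Counter
from itertools import combinations

# Hypothetical unsegmented log: (timestamp in minutes, event) pairs.
log = [(1, 'event1'), (2, 'event2'), (2, 'event3'), (3, 'event4'),
       (25, 'event5'), (26, 'event6'), (27, 'event7'),
       (60, 'event5'), (62, 'event6'), (63, 'event7')]

WINDOW = 10      # assumed: events within 10 minutes may belong together
MIN_SUPPORT = 2  # assumed: a sequence must recur to count as a pattern

def windows(log, width):
    # Overlapping time windows stand in for the missing ActivityId/SessionId.
    for i, (t0, _) in enumerate(log):
        yield [e for t, e in log[i:] if t - t0 <= width]

counts = Counter()
for w in windows(log, WINDOW):
    subsequences = set()
    for length in (2, 3):
        # combinations() preserves the temporal order within the window.
        subsequences.update(combinations(w, length))
    counts.update(subsequences)  # count each subsequence once per window

for seq, support in counts.most_common():
    if support >= MIN_SUPPORT:
        print(' -> '.join(seq), '(support %d)' % support)

This brute-force counting explodes combinatorially; real frequent-sequence miners such as GSP, PrefixSpan, or SPADE prune the candidate space instead, but the contract is the same: an unlabeled event stream in, frequent ordered patterns (like the event5 -> event6 -> event7 example above) out.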
I am trying to understand how Queue Runners work. I know that queue runners are used to fill a queue more efficiently, running several enqueuing threads in parallel. But what does the resulting queue look like? What exactly is happening?
Let me make my question more specific:
I have a number of files whose content I want to put in one big queue.
file 1: A, B, C
file 2: D, E, F
file 3: G, H, I
I follow the TensorFlow input pipeline: I first create a list of filenames, then a queue of filenames using tf.train.string_input_producer(), a reader operation and a decoder. In the end I want to put my serialized examples into an example queue that can be shared with the graph. For this I use a QueueRunner:
qr = tf.train.QueueRunner(queue, [enqueue_op] * numberOfThreads)
and I add it to the QUEUE_RUNNERS collection:
tf.train.add_queue_runner(qr)
My enqueue_op enqueues one example at a time. So, when I use e.g. numberOfThreads = 2, what does the resulting queue look like? In which order is the file content read into the queue? For example, does the queue look something like
q = [A, B, C, D, E, F, G, H, I] (so despite the parallel processing the content of the files is not mixed in the queue)
Or does the queue rather look like
q = [A, D, B, E, C, F, G, H, I] ?
For reference, here is the pipeline code I am using:

def get_batch(file_list, batch_size, input_size, num_enqueuing_threads):
    # Filename queue feeding the reader.
    file_queue = tf.train.string_input_producer(file_list)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(file_queue)
    # Per-timestep features of each SequenceExample.
    sequence_features = {
        'inputs': tf.FixedLenSequenceFeature(shape=[input_size],
                                             dtype=tf.float32),
        'labels': tf.FixedLenSequenceFeature(shape=[],
                                             dtype=tf.int64)}
    _, sequence = tf.parse_single_sequence_example(
        serialized_example, sequence_features=sequence_features)
    length = tf.shape(sequence['inputs'])[0]
    # Example queue; PaddingFIFOQueue pads variable-length sequences
    # up to the longest one in the batch.
    queue = tf.PaddingFIFOQueue(
        capacity=1000,
        dtypes=[tf.float32, tf.int64, tf.int32],
        shapes=[(None, input_size), (None,), ()])
    # The same enqueue op is handed to num_enqueuing_threads threads.
    enqueue_ops = [queue.enqueue([sequence['inputs'],
                                  sequence['labels'],
                                  length])] * num_enqueuing_threads
    tf.train.add_queue_runner(tf.train.QueueRunner(queue, enqueue_ops))
    return queue.dequeue_many(batch_size)
I actually find the animated figure on the same page you mention rather illustrative.
There is a first queue, the Filename Queue, that feeds the file-reading threads. Within each of these threads, files are read sequentially -- or rather, according to the order followed by your reader. Each of these threads feeds another queue, the Example Queue, whose output is consumed by your training.
In your question, you are interested in the resulting order of the examples in the Example Queue.
The answer depends on how the reader works. The best practice, and the most efficient way (illustrated in the animated figure above), is to send a sample to the Example Queue as soon as it is ready. This streaming functionality is what the various Reader variants provided by TensorFlow offer, precisely to promote and ease this best practice. However, it is certainly possible to write a reader that enqueues all of its samples at once at the end, via enqueue_many.
In all cases, since the readers run in separate, non-synchronized threads, you do not have any guarantee on the order of their respective outputs. For example, in the standard streaming case, the examples may be interleaved regularly, or not -- virtually anything can happen.
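To see that nondeterminism concretely, here is a minimal sketch with plain Python threads and queue.Queue standing in for the TensorFlow queues (the file contents are the hypothetical A..I from the question):

import queue
import threading

# Hypothetical decoded file contents from the question.
files = {'file1': ['A', 'B', 'C'],
         'file2': ['D', 'E', 'F'],
         'file3': ['G', 'H', 'I']}

filename_queue = queue.Queue()
for name in files:
    filename_queue.put(name)

example_queue = queue.Queue()

def reader():
    # Each reader thread streams one example at a time into the shared
    # queue, like an enqueue op driven by a QueueRunner thread.
    while True:
        try:
            name = filename_queue.get_nowait()
        except queue.Empty:
            return
        for example in files[name]:
            example_queue.put(example)

threads = [threading.Thread(target=reader) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([example_queue.get() for _ in range(example_queue.qsize())])

Run it a few times: sometimes the output comes file by file, sometimes interleaved like [A, D, B, E, ...]. Exactly as above, no ordering is guaranteed.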
The scenario is as follows:
We have 3 tasks: T1, T2 and T3. T1 is a time-consuming process whose output is used by T2. The order of operations is T1-T2-T3.
In node.js terms it could be thought of as follows:
T1: fs.readFile(filename, mode, callback); // most expensive computation
T2: take the file content from T1 and parse it according to some logic.
T3: generate a report based on the parsed content.
Note: I am expecting an answer on how to implement T1 asynchronously, or whether it can only be done in a synchronous way. :)
You may have the option of not reading the file all at once, but instead doing e.g. line-based parsing and firing your events after each line is read.
This will likely complicate your logic quite a lot, and whether it is worth the effort really depends on the costs of T2 and T3. (It would likely only help if T2 and T3 are also somewhat costly and can be executed on a different thread.)
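A sketch of that line-based idea (written in Python to keep one language across the examples here; parse_line and add_to_report are hypothetical stand-ins for T2 and T3):

def parse_line(line):
    # T2: parse one line according to your logic (hypothetical).
    return line.strip().split(',')

def add_to_report(report, record):
    # T3: fold the parsed record into the report (hypothetical).
    report.append(record)

def process(filename):
    report = []
    # T1: instead of reading the whole file at once, stream it line by
    # line and fire the downstream steps as soon as each line is ready.
    with open(filename) as f:
        for line in f:
            add_to_report(report, parse_line(line))
    return report

In node.js the same shape can be had with a readline interface over a read stream, running the T2/T3 work inside the 'line' event handler.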
I am actually looking into some prediction algorithms. My question is: I have a set of threads in a process, let's say T1, T2, T3, T4. Initially I get some request, based on which I run these threads in an order, say T2-T1-T3-T4, and for another request T3-T1-T2-T4, and so on for another N iterations. If I want to predict the order of execution for M future requests, which algorithm can I use and how can I predict it?
Problem
Summary: apply a function f in parallel to each element of an array, where f is NOT thread-safe.
I have a set of elements E to process, let's say a queue of them.
I want to process all these elements in parallel using the same function f(E).
Now, ideally I could call a map-based parallel pattern, but the problem has the following constraints:
Each element is a pair of two objects (E = (A, B)).
Two elements may share an object (E1 = (A1, B1); E2 = (A1, B2)).
The function f cannot process two elements that share an object, so E1 and E2 cannot be processed in parallel.
What is the right way of doing this?
My thoughts so far:
1. Trivial thought: keep a set of active As and Bs, and start processing an element only when no other thread is already using its A or its B.
So, when you hand an element to a thread, you add its A and B to the active set.
Pick the first element; if its objects are not in the active set, spawn a new thread, otherwise push the element to the back of the queue.
Do this until the queue is empty.
Will this cause a deadlock? Ideally, when one element finishes processing, other elements become available again, right?
2. The other thought is to build a graph of these connected objects.
Each node represents an object (an A or a B), and each element is an edge connecting its A and its B; then process the data in such a way that we know overlapping elements never run at the same time. A sketch of this idea follows below.
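Here is that sketch (hypothetical string names for the objects): treat each element as an edge and greedily pack the edges into rounds whose elements share no object; each round can then run fully in parallel with no locks at all.

def conflict_free_rounds(elements):
    # Greedy edge scheduling: put each element into the first round whose
    # members do not touch its A or its B, opening a new round if needed.
    rounds = []
    for element in elements:
        for members, used in rounds:
            if element[0] not in used and element[1] not in used:
                members.append(element)
                used.update(element)
                break
        else:
            rounds.append(([element], set(element)))
    return [members for members, _ in rounds]

# e.g. conflict_free_rounds([('A1', 'B1'), ('A1', 'B2'), ('A2', 'B1')])
#      -> [[('A1', 'B1')], [('A1', 'B2'), ('A2', 'B1')]]

This is essentially a greedy edge coloring of the conflict graph, with rounds playing the role of colors.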
Questions
How can we achieve this best?
Is there a standard pattern to do this?
Is there a problem with these approaches?
Not necessary, but if you could point out the TBB methods to use, that would be great.
The "best" approach depends on a lot of factors here:
How many elements E do you have, and how much work does f(E) need? --> Check whether it is really worth processing the elements in parallel (if you need a lot of locking and don't have much work to do, you will probably slow the process down by working in parallel).
Is there any possibility of changing the design so that f(E) becomes thread-safe?
How many distinct As and Bs are there? Is there any logic to which elements E share specific instances of A and B? --> If you can sort the elements E into separate lists where each A and B appears in only a single list, then you can process these lists in parallel without any further locking.
If there are many different As and Bs and you don't share too many of them, you may want to take the trivial approach where you simply lock each A and B on entry and wait until you get the lock.
Whenever you do "lock and wait" with multiple locks, it is very important that you always take the locks in the same order (e.g. always A first and B second), because otherwise you may run into deadlocks. This locking order needs to be observed everywhere (a single place in the whole application that uses a different order can cause a deadlock).
Edit: Also, if you do "try lock", you need to ensure that the order is always the same. Otherwise you can cause a livelock:
thread 1 locks A
thread 2 locks B
thread 1 tries to lock B and fails
thread 2 tries to lock A and fails
thread 1 releases lock A
thread 2 releases lock B
Goto 1 and repeat...
The chances that this actually repeats endlessly are relatively slim, but it should be avoided anyway.
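A minimal sketch of the fixed-order locking in Python (the question mentions TBB/C++, but the rule is language-independent; I assume the shared objects can be given a total order, here by name):

import threading

def make_locks(objects):
    # One lock per shared object, created up front.
    return {obj: threading.Lock() for obj in objects}

def process(element, f, locks):
    a, b = element
    # Always acquire the two locks in one global order (sorted by name),
    # so every thread takes them in the same sequence -- this rules out
    # the deadlock and the try-lock livelock described above.
    first, second = sorted((a, b))
    with locks[first]:
        with locks[second]:
            f(element)  # f never runs on two elements sharing an object

elements = [('A1', 'B1'), ('A1', 'B2'), ('A2', 'B1')]  # hypothetical
locks = make_locks({obj for e in elements for obj in e})
threads = [threading.Thread(target=process, args=(e, print, locks))
           for e in elements]
for t in threads:
    t.start()
for t in threads:
    t.join()

In TBB, the same discipline would apply with e.g. tbb::spin_mutex and its scoped locks.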
Edit 2: In principle, I guess I would just split E(Ax, Bx) into different lists based on Ax (e.g. one list for all Es that share the same A), then process these lists in parallel with locking on B (there you can still try-lock and continue with the next element if the required B is already in use).
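A sketch of that split-by-A variant (again Python for illustration; f is a placeholder): every list shares a single A, so each list gets its own thread and only B needs a lock.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import threading

def process_grouped_by_a(elements, f):
    # Group by A: each group runs on one thread, so no two threads ever
    # share an A. B can still be shared across groups, hence one lock per B.
    groups = defaultdict(list)
    for a, b in elements:
        groups[a].append((a, b))
    b_locks = {b: threading.Lock() for _, b in elements}

    def run_group(group):
        for element in group:
            with b_locks[element[1]]:  # serialize only on the shared B
                f(element)

    with ThreadPoolExecutor() as pool:
        list(pool.map(run_group, groups.values()))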
Assume there are 2 threads performing operations on a shared queue q. The lines of code for each thread are numbered, and initially the queue is empty.
Thread A:
A1) q.enq(x)
A2) q.deq()
Thread B:
B1) q.enq(y)
Assume that the order of execution is as follows:
A1) q.enq(x)
B1) q.enq(y)
A2) q.deq()
and as a result we get y (i.e. q.deq() returns y)
This execution is based on a well-known book and is said to be sequentially consistent. Notice that the method calls don't overlap. How is that even possible? I believe that Thread A executed A1 without actually updating the queue until it proceeded to line A2, but that's just my guess. I'm even more confused when I look at this explanation from the Java Language Specification:
Sequential consistency is a very strong guarantee that is made about visibility and ordering in an execution of a program. Within a sequentially consistent execution, there is a total order over all individual actions (such as reads and writes) which is consistent with the order of the program, and each individual action is atomic and is immediately visible to every thread.
If that were the case, we would have dequeued x.
I'm sure I'm somehow wrong. Could somebody shed some light on this?
Note that the definition of sequential consistency says "consistent with program order", not "consistent with the order in which the program happens to be executed".
It goes on to say:
If a program has no data races, then all executions of the program will appear to be sequentially consistent.
(my emphasis of "appear").
Java's memory model does not enforce sequential consistency. As the JLS says:
If we were to use sequential consistency as our memory model, many of the compiler and processor optimizations that we have discussed would be illegal. For example, in the trace in Table 17.3, as soon as the write of 3 to p.x occurred, subsequent reads of that location would be required to see that value.
So Java's memory model doesn't actually support sequential consistency. Just the appearance of sequential consistency. And that only requires that there is some sequentially consistent order of actions that's consistent with program order.
Clearly there is some execution of threads A and B that could result in A2 returning y, specifically:
B1) q.enq(y)
A1) q.enq(x)
A2) q.deq()
So, even if the program happens to be executed in the order you specified, there is an order in which it could have been executed that is "consistent with program order" for which A2 returns y. Therefore, a program that returns y in that situation still gives the appearance of being sequentially consistent.
Note that this shouldn't be interpreted as saying that it would be illegal for A2 to return x, because there is a sequentially consistent sequence of operations that is consistent with program order that could give that result.
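Both points can be checked mechanically. A small sketch (in Python, with illustrative operation labels) that enumerates every interleaving consistent with each thread's program order and simulates the FIFO queue:

def interleavings(a, b):
    # All total orders that keep each thread's internal (program) order.
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

thread_a = [('A1', 'enq', 'x'), ('A2', 'deq', None)]
thread_b = [('B1', 'enq', 'y')]

for order in interleavings(thread_a, thread_b):
    q, result = [], None
    for label, op, value in order:
        if op == 'enq':
            q.append(value)
        else:
            result = q.pop(0)  # FIFO dequeue
    print([label for label, _, _ in order], '-> deq() returns', result)

The three interleavings return x, x, and y respectively: since at least one order consistent with program order yields y, the observed execution is sequentially consistent, and since another yields x, that result would be equally legal.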
Note also that this appearance of sequential consistency only applies to correctly synchronized programs. If your program is not correctly synchronized (i.e. has data races) then all bets are off.