GroupBy then ObserveOn loses items - multithreading

Try this in LinqPad:
Observable
    .Range(0, 10)
    .GroupBy(x => x % 3)
    .ObserveOn(Scheduler.NewThread)
    .SelectMany(g => g.Select(x => g.Key + " " + x))
    .Dump()
The results are clearly non-deterministic, but in every case I fail to receive all 10 items. My current theory is that the items are going through the grouped observable unobserved as the pipeline marshals to the new thread.

LINQPad doesn't know that you're running all of these threads. It reaches the end of the code immediately (remember, Rx statements don't always act synchronously; that's the idea!), waits a few milliseconds, then tears down the AppDomain and all of its threads, including those that haven't caught up yet. Try adding a Thread.Sleep at the end to give the new threads time to catch up.
As an aside, Scheduler.NewThread is a very inefficient scheduler. EventLoopScheduler (which creates exactly one thread) or Scheduler.TaskPool (which uses the TPL pool, as if you had created a Task for each item) are much more efficient. Of course, since you only have 10 items in this case, Scheduler.Immediate is the best choice.

It appears that the problem is a timing gap between GroupBy emitting a new group and the downstream subscription to that group actually being established. If you increase the number of iterations from 10 to 100, you should start seeing some results come through once the subscriptions are in place.
Also, if you change the GroupBy to .Where(x => x % 3 == 0), you will likely notice that no values are lost, because there are no dynamically created group observables that need new observers attached.
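The timing gap described above can be illustrated outside Rx with a minimal sketch. It assumes a hand-rolled HotSubject class (hypothetical, not part of any Rx library) that, like a group produced by GroupBy, forwards values only to observers that are subscribed at the moment of emission. It is written in JavaScript purely for brevity; nothing here is Rx.NET API.

```javascript
// Hypothetical minimal "hot" subject: it does not replay past values.
class HotSubject {
  constructor() {
    this.observers = [];
  }
  subscribe(observer) {
    this.observers.push(observer);
  }
  next(value) {
    // Only observers registered at this moment see the value.
    this.observers.forEach((observer) => observer(value));
  }
}

const group = new HotSubject();
const received = [];

// The producer pushes values before the consumer has subscribed,
// as can happen while the group notification is still being
// marshalled to another thread.
group.next(0);
group.next(3);

// The consumer subscribes late.
group.subscribe((value) => received.push(value));

group.next(6);
group.next(9);

console.log(received); // [ 6, 9 ]
```

The first two values are gone for good, which matches the symptom of receiving fewer than 10 items.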

Related

Counting the number of values emitted before the Observable completes?

Attempting to verify that an observable emits a certain number of events before it completes. This is pseudo code:
o.pipe(count).subscribe(count=>
expect(count).toEqual(4));
Thoughts?
The count operator works as follows:
Counts the number of emissions on the source and emits that number when the source completes (source)
So you can use it like so:
obs.pipe(count()).subscribe(totalEmissions => expect(totalEmissions).toEqual(4))
Note that count only emits once the source completes. If the source never completes, the counting never finishes, so you can't measure how many events occurred "so far" this way.
You can, however, take note of the "index" of each emission using tap:
let count = 0
obs.pipe(tap(() => console.log("emitted! Index: " + count++))).subscribe(obsValue => {/*...*/})
I'm not sure which is your use case, but that's how you can do it.
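To make the completion requirement concrete, here is a minimal sketch that does not use RxJS at all. countEmissions and the two source functions are hypothetical stand-ins for the count operator and for observables; they only mimic the "emit the total on completion" behaviour described above.

```javascript
// Hypothetical stand-in for an observable: a function that takes an observer,
// calls observer.next(...) zero or more times, and may call observer.complete().
function countEmissions(source, onCount) {
  let total = 0;
  source({
    next: () => { total += 1; },
    // The total is reported only when the source signals completion.
    complete: () => onCount(total),
  });
}

const completing = (observer) => {
  ['a', 'b', 'c', 'd'].forEach((value) => observer.next(value));
  observer.complete();
};

const neverCompleting = (observer) => {
  ['a', 'b'].forEach((value) => observer.next(value));
  // No complete() call, so no count is ever reported.
};

const results = [];
countEmissions(completing, (total) => results.push(total));
countEmissions(neverCompleting, (total) => results.push(total));

console.log(results); // [ 4 ]
```

Only the completing source produces a count, so an assertion placed in the subscribe callback would silently never run for the other one.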

sqs.receiveMessage not receiving even when messages in queue

So I have 3 lambdas. The first has an API event trigger; it pulls down around 50,000 objects and pushes them all to a queue.
The second lambda reads from the queue, 10 messages at a time, in a loop of 30 iterations. That is, it reads, does stuff, invokes the third lambda, returns a promise, then reads again, for a total of up to 300 messages read in the time the lambda executes.
The third lambda takes the information from the queue and hits another endpoint with it.
The issue is in the second lambda. First I call a function that returns the number of messages in the queue, and if it's more than zero I read them. However, even if there are 20,000 messages in the queue, the read often comes back with nothing. I'm not sure why.
I have WaitTimeSeconds set to 20 for long polling. Any help would be greatly appreciated. The docs claim I can read up to 3,000 messages per second with a FIFO queue, and I'm having trouble getting anywhere near that performance.
Here's the code:
exports.handler = (event, context, callback) => {
    const sqs = new AWS.SQS({ region: process.env.AWS_REGION });
    getMessageCount(sqs)
        .then((messageCount) => {
            if (messageCount > 0) {
                mapSeries(range(0, 30), getMessages(sqs))
                    .then((messageRes) => {
                        callback(null, messageRes);
                    })
                    .catch(e => Promise.reject(e));
            }
            callback(null, 'No more messages');
        })
        .catch((e) => {
            callback(e);
        });
};
getMessageCount makes a call to sqs.getQueueAttributes and returns a promise that receives the number of messages.
mapSeries allows the loop to wait for the previous promise to be resolved/rejected before iterating and on each iteration it calls getMessages which calls sqs.receiveMessage and invokes the 3rd lambda with the data.
Any perspective on this is appreciated, thank you!
As I understand your question, the problem lies in getting the number of messages in the queue. If you had also included the code of getMessageCount(sqs), we could have determined which attributes you are retrieving from SQS.
Three attributes are relevant for getting the message count in SQS:
ApproximateNumberOfMessages - Returns the approximate number of visible messages in a queue.
ApproximateNumberOfMessagesNotVisible - Returns the approximate number of messages that are in flight, that is, received by a consumer but not yet deleted and not yet past their visibility timeout.
If you want to include the messages that are waiting to be added, you can consider the following attribute as well.
ApproximateNumberOfMessagesDelayed - Returns the approximate number of messages that are waiting to be added to the queue.
By considering all of these attributes, you can get a much more accurate count from SQS.
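As a rough sketch of combining these counters: the function below sums the three attributes from an object shaped like a getQueueAttributes response, where attribute values arrive as strings. The function name approximateTotalMessages and the sample numbers are made up for illustration, and the sketch assumes all three names were requested via AttributeNames.

```javascript
// Sums the three approximate counters from a getQueueAttributes-style response.
// SQS returns attribute values as strings, so they are parsed before adding.
function approximateTotalMessages(response) {
  const attributes = (response && response.Attributes) || {};
  const names = [
    'ApproximateNumberOfMessages',
    'ApproximateNumberOfMessagesNotVisible',
    'ApproximateNumberOfMessagesDelayed',
  ];
  // Missing or non-numeric attributes count as 0.
  return names.reduce(
    (sum, name) => sum + (parseInt(attributes[name], 10) || 0),
    0
  );
}

// Sample response with made-up numbers, shaped like the SDK result.
const sampleResponse = {
  Attributes: {
    ApproximateNumberOfMessages: '120',
    ApproximateNumberOfMessagesNotVisible: '30',
    ApproximateNumberOfMessagesDelayed: '5',
  },
};

console.log(approximateTotalMessages(sampleResponse)); // 155
```

Keep in mind that all three values are approximations, so the sum is a hint for sizing the work and not a guarantee that a given receive call will return messages.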
Also, if I may suggest: I implemented a similar system, but without checking the count. I retrieve 10 messages at a time via polling, process them, and delete them from the queue. As in your example, you can repeat this 30 times. If the getMessages(sqs) function returns an empty set, you can assume that the queue is empty (how safe that assumption is depends on whether you are using short polling or long polling). Either way, checking the number of messages at every step seems redundant. This is based on your example, and it might differ depending on the use case.
Read through the API documentation: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/SQS.html#receiveMessage-property
Parameters:
MaxNumberOfMessages — (Integer)
The maximum number of messages to return. Amazon SQS never returns more messages than this value (however, fewer messages might be returned). Valid values are 1 to 10. Default is 1.
Wrap your code in a while loop and anticipate a frequent case of 0 messages since 0 is fewer than 1 to 10.
Something like...
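// Run this inside an async function. getSQSMessages is a placeholder for a wrapper around sqs.receiveMessage.
var messages = [];
while (messages.length < NUMBER_OF_MSGS_YOU_REALLY_WANT) {
    var new_messages = await getSQSMessages(NUMBER_OF_MSGS_YOU_REALLY_WANT - messages.length);
    // Messages can be missing when a poll returns nothing, so default to an empty array.
    var received = new_messages.Data.Messages || [];
    if (received.length > 0) {
        // Spread the batch so that messages holds individual messages, not nested arrays.
        messages.push(...received);
    }
}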

Use First() and Repeat() without restarting whole stream RxJS

I am building a trading bot using RxJS. For that I have to convert ticker data from a socket connection into candles that are emitted every x seconds.
I created the socketObservable like this
const subscribeObservable = Observable.fromEventPattern(h => bittrex.websockets.subscribe(['USDT-BTC'], h))
const clientCallBackObservable = Observable.fromEventPattern(h => bittrex.websockets.client(h))
const socketObservable = clientCallBackObservable
    .flatMap(() => subscribeObservable)
    .filter(subscribtionData => subscribtionData && subscribtionData.M === 'updateExchangeState')
    .flatMap(exchangeState => Observable.from(exchangeState.A))
    .filter(marketData => marketData.Fills.length > 0)
    .map(marketData => marketData && marketData.Fills)
This works fine: when I connect to the client, I flatMap to the subscription connection.
Then I have the candleObservable, which is causing problems:
export const candleObservable = (promise, timeFrame = TIME_FRAME) =>
    promise
        .scan((acc, curr) => [...acc, ...curr])
        .skipWhile(exchangeData => dateDifferenceInSeconds(exchangeData) < timeFrame)
        // take first after skipping
        .first()
        // first will complete the stream, so we repeat it
        .repeat()
        // we create candle data from the timeFrame array
        .map(fillsData => createCandle(fillsData))
        // accumulate candles
        .scan((acc, curr) => [...[acc], curr])
What I am trying to achieve is to accumulate data until I have enough for a full candle, which can span x seconds. Then I would like to take that emission and reset the scan function so that I start on a new candle. Then I create the candle and accumulate it in another scan.
My problem is that when I call repeat(), my socketObservable also gets subscribed to again. I do not know if this causes any overhead with the node-bittrex-api, but I would like to avoid it.
I have tried putting the candle-accumulating part in a flatMap or similar, but couldn't get any of that to work.
Do you know how I can avoid having repeat() restart the whole stream, or another way of making candles where I can accumulate and then reset the accumulator after the first emit?
From what you've described it sounds like you have an observable you want to cut up into buckets of some kind based on some condition. In general, the reduction of a stream to another stream with fewer elements (without filtering) is referred to as "backpressure". In your specific case, it sounds like the backpressure operator you'd be interested in is buffer. The buffer operator can accept an observable as an argument that functions as a "closing selector", i.e. emissions in this observable can be used to regulate when you tie off one buffer and start a new one.
I'd suggest replacing your scan, skipWhile, first, and repeat with a buffer call, passing in a closing selector that will yield a value when your "TIME_FRAME" expires. This should be easy to express as an observable either using timer (in the case of a fixed amount) or a debounced version of the driving stream (if you want to stop when there's a pause in the data). If your buffer is strictly time-based, there's even a specialization of buffer called bufferTime that handles this. Because you'll wind up with an observable of arrays (rather than raw values), you'll likely want to replace your final scan with a regular array reduce.
It's hard to give concrete code without a simpler example to work with. I'd urge you to consult the sample code for the various backpressure operators to see if you can find something similar to what you're attempting to achieve.
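As a rough illustration of the bucketing idea, and not of the RxJS API itself, the sketch below groups an array of already-received fills into time windows and reduces each window to a candle. This is the shape of output a time-based buffer followed by an array reduce would give on a live stream. The timestamp and price fields, the bucketByTimeFrame and toCandle helpers, and the sample data are all assumptions. Unlike bufferTime, which cuts on wall-clock intervals, this sketch starts each window at the first fill that arrives after the previous window has expired.

```javascript
// Groups fills into windows of timeFrameMs milliseconds.
// Assumes fills are sorted by timestamp (in milliseconds).
function bucketByTimeFrame(fills, timeFrameMs) {
  const buckets = [];
  let windowStart = null;
  for (const fill of fills) {
    if (windowStart === null || fill.timestamp - windowStart >= timeFrameMs) {
      // The previous window has expired: start a new one at this fill.
      windowStart = fill.timestamp;
      buckets.push([]);
    }
    buckets[buckets.length - 1].push(fill);
  }
  return buckets;
}

// Reduces one bucket of fills to an open/high/low/close candle.
function toCandle(bucket) {
  const prices = bucket.map((fill) => fill.price);
  return {
    open: prices[0],
    high: Math.max(...prices),
    low: Math.min(...prices),
    close: prices[prices.length - 1],
  };
}

// Made-up sample data.
const fills = [
  { timestamp: 0, price: 100 },
  { timestamp: 400, price: 105 },
  { timestamp: 900, price: 98 },
  { timestamp: 1000, price: 101 },
  { timestamp: 1700, price: 103 },
];

const candles = bucketByTimeFrame(fills, 1000).map(toCandle);
console.log(candles);
```

Because each window is a plain array, the accumulator is reset for free at every window boundary, which is the part that first() plus repeat() was being used for.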

How to save data using multiple threads in grails-2.4.4 application using thread pool

I have a multithreaded program running some logic to come up with rows of data that I need to save in my Grails (2.4.4) application. I am using a fixed thread pool with 30 threads. The skeleton of my program is below. My expectation is that each thread calculates all the attributes and saves them as a row in the table. However, the end result is that some random rows are not saved. On repeating the exercise, a different set of rows is missing from the table. So each time this is attempted, some set of rows is NOT saved in the table at all. GORMInstance.errors did not reveal any errors, so I have no clue what is incorrect in this program.
ExecutorService exeSvc = Executors.newFixedThreadPool(30)
for (obj in list){
    exeSvc.execute({-> finRunnable obj} as Callable)
}
Also, here's the runnable program that the above snippet invokes.
def finRunnable = {obj ->
    for (item in LIST-1){
        for (it in LIST-2){
            for (i in LIST-3){
                rowdata = calculateValues(item, it, i);
                GORMInstance instance = new GORMInstance();
                instance.withTransaction{
                    instance.attribute1=rowdata[0];
                    instance.attribute2=rowdata[1];
                    ......so on..
                    instance.save(flush:true)/*without flush:true, I am
                    running into HeuristicCompletion exception. So I need it
                    here. */
                }//endTransaction
            }//forloop 3
        }//forloop 2
    }//forloop 1
}//runnable closure

Parallel.ForEach Ordered Execution

I am trying to execute parallel functions on a list of objects using the new C# 4.0 Parallel.ForEach function. This is a very long maintenance process. I would like to make it execute in the order of the list so that I can stop and continue execution at the previous point. How do I do this?
Here is an example. I have a list of objects: a1 to a100. This is the current order:
a1, a51, a2, a52, a3, a53...
I want this order:
a1, a2, a3, a4...
I am OK with some objects being run out of order, as long as I can find a point in the list where I can say that all objects before this point were run. I read the parallel programming C# whitepaper and didn't see anything about it. There isn't a setting for this in the ParallelOptions class.
Do something like this:
int current = 0;
object lockCurrent = new object();
Parallel.For(0, list.Count,
    new ParallelOptions { MaxDegreeOfParallelism = MaxThreads },
    (ii, loopState) => {
        // So the way Parallel.For works is that it chunks the task list up with each thread getting a chunk to work on...
        // e.g. [1-1,000], [1,001- 2,000], [2,001-3,000] etc...
        // We have prioritized our job queue such that more important tasks come first. So we don't want the task list to be
        // broken up, we want the task list to be run in roughly the same order we started with. So we ignore the passed-in
        // loop variable and just increment our own counter.
        int thisCurrent = 0;
        lock (lockCurrent) {
            thisCurrent = current;
            current++;
        }
        dothework(list[thisCurrent]);
    });
You can see how when you break out of the parallel for loop you will know the last list item to be executed, assuming you let all threads finish prior to breaking. I'm not a big fan of PLINQ or LINQ. I honestly don't see how writing LINQ/PLINQ leads to maintainable source code or readability.... Parallel.For is a much better solution.
If you use ParallelLoopState.Break to terminate the loop, then you are guaranteed that all indices below the returned value will have been executed. This is about as close as you can get. The example here uses For, but ForEach has similar overloads.
int n = ...
var result = new double[n];
var loopResult = Parallel.For(0, n, (i, loopState) =>
{
    if (/* break condition is true */)
    {
        loopState.Break();
        return;
    }
    result[i] = DoWork(i);
});
if (!loopResult.IsCompleted &&
    loopResult.LowestBreakIteration.HasValue)
{
    Console.WriteLine("Loop encountered a break at {0}",
        loopResult.LowestBreakIteration.Value);
}
In a ForEach loop, an iteration index is generated internally for each element in each partition. Execution takes place out of order but after break you know that all the iterations lower than LowestBreakIteration will have been completed.
Taken from "Parallel Programming with Microsoft .NET" http://parallelpatterns.codeplex.com/
Available on MSDN. See http://msdn.microsoft.com/en-us/library/ff963552.aspx. The section "Breaking out of loops early" covers this scenario.
See also: http://msdn.microsoft.com/en-us/library/dd460721.aspx
For anyone else who comes across this question: if you're looping over an array or list (rather than an IEnumerable), you can use the overload of Parallel.ForEach that gives the element index, which lets you maintain the original order too.
string[] MyArray; // array of stuff to do parallel tasks on
string[] ProcessedArray = new string[MyArray.Length];
Parallel.ForEach(MyArray, (ArrayItem,loopstate,ArrayElementIndex) =>
{
    string ProcessedArrayItem = TaskToDo(ArrayItem);
    ProcessedArray[ArrayElementIndex] = ProcessedArrayItem;
});
As an alternative suggestion, you could record which objects have been run and then filter the list when you resume execution to exclude the objects that have already run.
If this needs to persist across application restarts, you can store the IDs of the already executed objects (I assume here that the objects have some unique identifier).
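The record-and-filter idea is language-agnostic, so here is a minimal sketch of it in JavaScript for brevity; the same shape carries over to C# with a HashSet. The id field, the remainingWork helper, and the sample IDs are assumptions made for illustration, and loading or storing the completed IDs is left out.

```javascript
// Returns the items that still need to run, given the IDs already completed.
function remainingWork(items, completedIds) {
  const done = new Set(completedIds);
  return items.filter((item) => !done.has(item.id));
}

const items = [{ id: 'a1' }, { id: 'a2' }, { id: 'a3' }, { id: 'a4' }];

// IDs recorded during a previous, interrupted run (completion order is arbitrary).
const completedIds = ['a3', 'a1'];

const todo = remainingWork(items, completedIds);
console.log(todo.map((item) => item.id)); // [ 'a2', 'a4' ]
```

With this approach the execution order no longer matters for resuming, because the filter works on identity and not on position in the list.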
For anybody looking for a simple solution, I have posted 2 extension methods (one using PLINQ and one using Parallel.ForEach) as part of an answer to the following question:
Ordered PLINQ ForAll
Not sure if the question was altered, as my earlier comment seems wrong.
Here is an improved version. Basically, remember that parallel jobs run in an order that is out of your control.
E.g. printing 10 numbers might result in 1,4,6,7,2,3,9,0.
If you want to be able to stop your program and continue later, problems like this usually end up being solved by batching the workload and keeping a log of what was done.
Say you had to check 10,000 numbers for primality.
You could loop in batches of size 100 and keep one log per batch: log1, log2, log3, ...
log1 = 0..99
log2 = 100..199
Be sure to set some marker so that you know whether a batch job was finished.
It's a general approach, since the question isn't that exact either.
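A minimal sketch of the batching-with-markers idea, again in JavaScript for brevity. The makeBatches and firstUnfinishedBatch helpers and the list of finished markers are hypothetical; in a real job the markers would be written to a log or database as each batch completes.

```javascript
// Splits the index range [0, total) into batches of batchSize.
function makeBatches(total, batchSize) {
  const batches = [];
  for (let start = 0; start < total; start += batchSize) {
    batches.push({ start, end: Math.min(start + batchSize, total) - 1 });
  }
  return batches;
}

// Returns the first batch that has no "finished" marker, or null if all are done.
function firstUnfinishedBatch(batches, finishedStarts) {
  const finished = new Set(finishedStarts);
  return batches.find((batch) => !finished.has(batch.start)) || null;
}

const batches = makeBatches(10000, 100);

// Markers written after each batch completed in an earlier run.
// Batch 300 finished before batch 200, which is normal for parallel work.
const finishedStarts = [0, 100, 300];

const resumeAt = firstUnfinishedBatch(batches, finishedStarts);
console.log(resumeAt); // { start: 200, end: 299 }
```

Everything before resumeAt.start is known to be complete, which is the "point in the list" the question asks for; the items inside a batch can still be processed in parallel in any order.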

Resources