Use First() and Repeat() without restarting whole stream RxJS - node.js

I am building a trading bot using RxJS. For that i have to convert ticker data from a socket connection to candles that is getting emitted every x seconds.
I created the socketObservable like this
const subscribeObservable = Observable.fromEventPattern(h => bittrex.websockets.subscribe(['USDT-BTC'], h))
const clientCallBackObservable = Observable.fromEventPattern(h => bittrex.websockets.client(h))
const socketObservable = clientCallBackObservable
.flatMap(() => subscribeObservable)
.filter(subscribtionData => subscribtionData && subscribtionData.M === 'updateExchangeState')
.flatMap(exchangeState => Observable.from(exchangeState.A))
.filter(marketData => marketData.Fills.length > 0)
.map(marketData => marketData && marketData.Fills)
Which works fine - when i connect to the client i flatMap to the subscription connection.
Then i have the candleObservable that is causing problems
export const candleObservable = (promise, timeFrame = TIME_FRAME) =>
promise
.scan((acc, curr) => [...acc, ...curr])
.skipWhile(exchangeData => dateDifferenceInSeconds(exchangeData) < timeFrame)
// take first after skipping
.first()
// first will complete the stream, so we repeat it
.repeat()
// we create candle data from the timeFrame array
.map(fillsData => createCandle(fillsData))
// accumulate candles
.scan((acc, curr) => [...[acc], curr])
What i am trying to achieve is to accumulate data until i have for a full candle that can be x seconds. Then i would like to take that emit and reset the scan function so i start for a new candle. Then i create the candle and accumulate it in another scan.
My problem is that when i call repeat() my socketObservable also gets called again. I do not know if this causes any overhead with the node-bittrex-api but i would like to avoid it.
I have tried putting the accumulating candle part in a flatMap or similar but couldn't get anyt of that to work.
Do you know how i can avoid to repeat() the whole stream or another way of make candles where i can accumulate and then reset the accumulator after first emit?

From what you've described it sounds like you have an observable you want to cut up into buckets of some kind based on some condition. In general, the reduction of a stream to another stream with fewer elements (without filtering) is referred to as "backpressure". In your specific case, it sounds like the backpressure operator you'd be interested in is buffer. The buffer operator can accept an observable as an argument that functions as a "closing selector", i.e. emissions in this observable can be used to regulate when you tie off one buffer and start a new one.
I'd suggest replacing your scan, skipWhile, first, and repeat with a buffer call, passing in a closing selector that will yield a value when your "TIME_FRAME" expires. This should be easy to express as an observable either using timer (in the case of a fixed amount) or a debounced version of the driving stream (if you want to stop when there's a pause in the data). If your buffer is strictly time-based, there's even a specialization of buffer called bufferTime that handles this. Because you'll wind up with an observable of arrays (rather than raw values), you'll likely want to replace your final scan with a regular array reduce.
It's hard to give concrete code without a simpler example to work with. I'd urge you to consult the sample code for the various backpressure operators to see if you can find something similar to what you're attempting to achieve.

Related

Nodejs compute gets slow after query big list from Mongodb

I am using mongoose to query a really big list from Mongodb
const chat_list = await chat_model.find({}).sort({uuid: 1}); // uuid is a index
const msg_list = await message_model.find({}, {content: 1, xxx}).sort({create_time: 1});// create_time is a index of message collection, time: t1
// chat_list length is around 2,000, msg_list length is around 90,000
compute(chat_list, msg_list); // time: t2
function compute(chat_list, msg_list) {
for (let i = 0, len = chat_list.length; i < len; i++) {
msg_list.filter(msg => msg.uuid === chat_list[i].uuid)
// consistent handling for every message
}
}
for above code, t1 is about 46s, t2 is about 150s
t2 is really to big, so weird.
then I cached these list to local json file,
const chat_list = require('./chat-list.json');
const msg_list = require('./msg-list.json');
compute(chat_list, msg_list); // time: t2
this time, t2 is around 10s.
so, here comes the question, 150 seconds vs 10 seconds, why? what happened?
I tried to use worker to do the compute step after mongo query, but the time is still much bigger than 10s
The mongodb query returns a FindCursor that includes arrayish methods like .filter() but the result is not an Array.
Use .toArray() on the cursor before filtering to process the mongodb result set like for like. That might not make the overall process any faster, as the result set still needs to be fetched from mongodb, but compute will be similar.
const chat_list = await chat_model
.find({})
.sort({uuid: 1})
.toArray()
const msg_list = await message_model
.find({}, {content: 1, xxx})
.sort({create_time: 1})
.toArray()
Matt typed faster than I did, so some of what was suggested aligns with part of this answer.
I think you are measuring and comparing something different than what you are expecting and implying.
Your expectation is that the compute() function takes around 10 seconds once all of the data is loaded by the application. This is (mostly) demonstrated by your second test, apart from the fact that that test includes the time it takes to load the data from the local files. But you're seeing that there is a difference of 104 seconds (150 - 46) between the completion of message_model.find() and compute() hence leading to the question.
The key thing is that successfully advancing from the find against message_model is not the same thing as retrieving all of the results. As #Matt notes, the find() will return with a cursor object once the initial batch of results are ready. That is very different than retrieving all of the results. So there is more work (apparently ~94 seconds worth) left to do from the two find() operations to further iterate the cursors and retrieve the rest of the results. This additional time is getting reported inside of t2.
Ass suggested by #Matt, calling .toArray() should shift that time back into t1 as you are expecting. Also sounds like it may be more correct due to ambiguity with .filter() functions.
There are two other things that catch my attention. The first is: why are you retrieving all of this data client-side to do the filtering there? Perhaps you would like to do this uuid matching inside of the database via $lookup?
Secondly, this comment isn't clear to me:
// create_time is a index of message collection, time: t1
create_time itself is a field here, existent or not, that you are requesting an ascending sort against.
You are taking data from 2 tables, then with for loop you are comparing ID using filter function, what is happening now is your loop will be executed 2000 time and so the filter function also which contains 90000 records.
So take a worst case scenario here lets consider 2000 uuid you are getting is not inside the msg_list, here you are executing loop 2000*90000 even though you are not getting data.
It wan't take more than 10 to 15 secs if use below code.
//This will generate array of uuid present in message_model
const msg_list = await message_model.find({}, {content: 1, xxx}).sort({create_time: 1}).distinct("uuid");
// Below query will match all uuid present in msg_list array with chat_list UUID
const chat_list = await chat_model.find({uuid:{$in:msg_list}}).sort({uuid: 1});
The above result is doing same as you have done in your code with filter function and loop but this is proper and fastest way to receive the data you required.

Why would you use the spread operator to spread a variable onto itself?

In the Google Getting started with Node.js tutorial they perform the following operation
data = {...data};
in the code for sending data to Firestore.
You can see it on their Github, line 63.
As far as I can tell this doesn't do anything.
Is there a good reason for doing this?
Is it potentially future proofing, so that if you added your own data you'd be less likely to do something like data = {data, moreData}?
#Manu's answer details what the line of code is doing, but not why it's there.
I don't know exactly why the Google code example uses this approach, but I would guess at the following reason (and would do the same myself in this situation):
Because objects in JavaScript are passed by reference, it becomes necessary to rebuild the 'data' object from it's constituent parts to avoid the original data object being further modified by the ref.set(data) call on line 64 of the example code:
await ref.set(data);
For example, in MongoDB, when you pass an object into a write or update method, Mongo will actually modify the object to add extra properties such as the datetime it was insert into a collection or it's ID within the collection. I don't know for sure if Firestore does the same, but if it doesn't now, it's possible that it may in future. If it does, and if your original code that calls the update method from Google's example code goes on to further manipulate the data object that it originally passed, that object would now have extra properties on it that may cause unexpected problems. Therefore, it's prudent to rebuild the data object from the original object's properties to avoid contamination of the original object elsewhere in code.
I hope that makes sense - the more I think about it, the more I'm convinced that this must be the reason and it's actually a great learning point.
I include the full original function from Google's code here in case others come across this in future, since the code is subject to change (copied from https://github.com/GoogleCloudPlatform/nodejs-getting-started/blob/master/bookshelf/books/firestore.js at the time of writing this answer):
// Creates a new book or updates an existing book with new data.
async function update(id, data) {
let ref;
if (id === null) {
ref = db.collection(collection).doc();
} else {
ref = db.collection(collection).doc(id);
}
data.id = ref.id;
data = {...data};
await ref.set(data);
return data;
}
It's making a shallow copy of data; let's say you have a third-party function that mutates the input:
const foo = input => {
input['changed'] = true;
}
And you need to call it, but don't want to get your object modified, so instead of:
data = {life: 42}
foo(data)
// > data
// { life: 42, changed: true }
You may use the Spread Syntax:
data = {life: 42}
foo({...data})
// > data
// { life: 42 }
Not sure if this is the particular case with Firestone but the thing is: spreading an object you get a shallow copy of that obj.
===
Related: Object copy using Spread operator actually shallow or deep?

Counting the number of values emitted before the Observable completes?

Attempting to verify that an observable emits a certain number of events before it completes. This is pseudo code:
o.pipe(count).subscribe(count=>
expect(count).toEqual(4));
Thoughts?
The count operator works as follows:
Counts the number of emissions on the source and emits that number when the source completes (source)
So you can use it like so:
obs.pipe(count()).subscribe(totalEmissions => expect(totalEmissions).toEqual(4))
Note that you can't really measure how many events occured before the original observable completed, because if it didn't complete then you didn't finish counting!
You can, however, take note of the "index" of each emission using tap:
let count = 0
obs.pipe(tap(() => console.log("emitted! Index: " + count++))).subscribe(obsValue => {/*...*/})
I'm not sure which is your use case, but that's how you can do it.

Thread pool with Apps Script on Spreadsheet

I have a Google Spreadsheet with internal AppsScript code which process each row of the sheet and perform an urlfetch with the row data. The url will provide a value which will be added to the values returned by each row processing..
For now the code is processing 1 row at a time with a simple for:
var spreadsheet = SpreadsheetApp.getActiveSpreadsheet();
var sheet = spreadsheet.getActiveSheet();
var range = sheet.getDataRange();
for(var i=1 ; i<range.getValues().length ; i++) {
var payload = {
// retrieve data from the row and make payload object
};
var options = {
"method":"POST",
"payload" : payload
};
var result = UrlFetchApp.fetch("http://.......", options);
var text = result.getContentText();
// Save result for final processing
// (with multi-thread function this value will be the return of the function)
}
Please note that this is only a simple example, in the real case the working function will be more complex (like 5-6 http calls, where the output of some of them are used as input to the next one, ...).
For the example let's say that there is a generic "function" which executes some sort of processing and provides a result as output.
In order to speed up the process, I'd like to try to implement some sort of "multi-thread" processing, so I can process multiple rows in the same time.
I already know that javascript does not offer a multi-thread handling, but I read about WebWorker which seems to create an async processing of a function.
My goal is to obtain some sort of ThreadPool (like 5 threads at a time) and send every row that need to be processed to the pool, obtaining as output the result of each function.
When all the rows finished the processing, a final action will be performed gathering all the results of each function.
So the capabilities I'm looking for are:
managed "ThreadPool" where I can submit an N amount of tasks to be performed
possibility to obtain a resulting value from each task processed by the pool
possibility to determine that all the tasks has been processed, so a final "event" can be executed
I already see that there are some ready-to-use libraries like:
https://www.hamsters.io/wiki#thread-pool
http://threadsjs.readthedocs.io/en/latest/
https://github.com/andywer/threadpool-js
but they work with NodeJS. Due to AppsScript nature, I need a more simplier approach, which is provided by native JS. Also, it seems that minified JS are not accepted by AppsScript editor, so I also need the "expanded" version.
Do you know a simple ThreadPool in JS where I can submit a function to be execute and I get back a Promise for the result?

Map three different functions to Observable in Node.js

I am new to Rxjs. I want to follow best practices if possible.
I am trying to perform three distinct functions on the same data that is returned in an observable. Following the 'streams of data' concept, I keep on thinking I need to split this Observable into three streams and carry on.
Here is my code, so I can stop talking abstractly:
// NotEmptyResponse splits the stream in 2 to account based on whether I get an empty observable back.
let base_subscription = RxNode.fromStream(siteStream).partition(NotEmptyResponse);
// Success Stream to perform further actions upon.
let successStream = base_subscription[0];
// The Empty stream for error reporting
let failureStream = base_subscription[1];
//Code works up until this point. I don't know how to split to 3 different streams.
successStream.filter(isSite)
.map(grabData)// Async action that returns data
/*** Perform 3 separate actions upon data that .map(grabData) returned **/
.subscribe();
How can I split this data stream into three, and map each instance of the data to a different function?
In fact partition() operator internally just calls filter() operator twice. First to create an Observable from values matching the predicate and then for values not matching the predicate.
So you can do the exact same thing with filter() operator:
let obs1 = base_subscription.filter(val => predicate1);
let obs2 = base_subscription.filter(val => predicate2);
let obs3 = base_subscription.filter(val => predicate3);
Now you have three Observables, each of them emitting only some specific values. Then you can carry on with your existing code:
obs2.filter(isSite)
.map(grabData)
.subscribe();
Just be aware that calling subscribe() triggers the generating values from the source Observable. This doesn't have to be always like this depending on what Observable you use. See “Hot” and “Cold” Observables in the documentation. Operator connect() might be useful for you depending on your usecase.

Resources