We are in the process of integrating Stream to power our notifications module.
When looking at the usage metrics in the dashboard, we see a suspiciously large number of feed updates:
As you can see, we have around 9K feed updates per day.
Those 9K daily feed updates don't make sense, since right now our backend code does not create any activities.
The only Stream API calls happen when a new user registers: we create a new stream of type 'notification' for that user and make this new stream follow a single admin stream of type 'flat':
const notifications = client.feed('notifications', userId);
await notifications.follow('user', 'admin');
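For reference, client here is created server-side roughly like this (the key and secret shown are placeholders):

// Server-side Stream client initialization (placeholder credentials).
const stream = require('getstream');
const client = stream.connect('YOUR_API_KEY', 'YOUR_API_SECRET');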
So, for example, if 200 new users registered today, the admin's flat stream would gain an additional 200 followers.
As of today, we have:
4722 streams of type 'notification'
1 stream of type 'flat'
These are the only interactions we have with Stream's API, and we don't understand where all those feed updates in the dashboard are coming from.
(Maybe these follow commands count as feed updates?)
We have something very similar happening. We have a dev app for testing, and a day ago roughly 1K "read feed" operations on the notification feed group suddenly showed up in the log. That's impossible, since we haven't rolled the feature out, and in any case this is the dev app, where we read the feed manually at most 10 times via Postman through our backend to getstream.
The correct ops in the log show the client as stream-python-client-2.11.0, which makes sense.
The incorrect ops in the log show the client as stream-javascript-client-browser-unknown, which does not make sense.
Further, the timestamps of the incorrect ops are all clustered within a short time window.
This has not happened since, and has not happened in the production app yet.
Related
I am serving my users data fetched from an external API. I don't know when this API will have new data, so what would be the best approach to handle that using Node, for example?
I have tried setInterval and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint to check for new data every minute, but it might only have new data every five minutes or more.
The thing is, this external API isn't run by me. Is hitting it every minute the only way to check for updates? Is there any module that can do that in Node or any approach that fits better?
Use case 1: Call a weather API for every city in the country and save data to my DB only when it is going to rain in a given city.
Use case 2: Send a notification to the user the moment a given Philips Hue lamp is turned on, without having to hit the endpoint to check whether it is on.
I appreciate the time to discuss this.
If this external API has no means of notifying you when there's new data, then the only thing you can do is to "poll" it to check for new data.
You will have to decide what an "efficient design" for polling is in your specific application and given the type of data and the needs of the client (what is an acceptable latency for new data).
You also need to be sure that your service is not violating any terms of service with your polling scheme or running afoul of rate limiting that may deny you access to the server if you use it "too much".
Is hitting it every minute the only way to check for updates?
Unless the API offers some notification feature, there is no scheme other than polling at some interval. Polling every minute is fairly quick. Do your clients really need information that is less than a minute old? Or would it really make no difference if the information was as much as 5 minutes old?
For example, in your example of weather, a client wouldn't really need temperature updates more often than probably every 10-15 minutes.
Is there any module that can do that in Node or any approach that fits better?
No, not really. You'll probably just use some sort of timer (either a repeated setTimeout() or a setInterval()) in a node.js app to repeatedly carry out your API operations.
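For example, a repeated setTimeout() polling loop might look roughly like this (a sketch; the URL, the rainExpected field, and the interval are placeholders for your own API and persistence logic):

// Sketch: poll an external API on a fixed interval with a re-armed setTimeout.
// Node 18+ has a global fetch; on older versions use a library such as axios or node-fetch.
const POLL_INTERVAL_MS = 60 * 1000; // once a minute -- consider 5+ minutes if freshness allows

async function pollOnce() {
  try {
    const res = await fetch('https://api.example.com/weather?city=Phoenix'); // placeholder endpoint
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const data = await res.json();
    if (data.rainExpected) { // placeholder field -- whatever "new data" means for you
      console.log('New data, saving record:', data); // e.g. write to your DB here
    }
  } catch (err) {
    console.error('Poll failed:', err.message);
  } finally {
    // Re-arm only after the previous request finishes, so a slow response
    // never causes overlapping requests the way a bare setInterval can.
    setTimeout(pollOnce, POLL_INTERVAL_MS);
  }
}

pollOnce();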
Use case: Call a weather API for every city in the country and save data to my DB only when it is going to rain in a given city.
Trying to pre-save every possible piece of data from an external API is probably a losing proposition. You're essentially trying to "scrape" all the data from the external API. That is likely against the terms of service and will likely also run afoul of rate limits. And, it's just not very practical.
Instead, you will probably want to fetch data on demand (when a client requests data for Phoenix, then, and only then, do you start collecting data for Phoenix), and once demand for a certain type of data (temperatures in a particular city) is established, you might want to pre-cache that data more regularly so you can notify clients of changes. If, after a while, no clients are asking for data from Phoenix, you stop requesting updates for Phoenix until a client establishes demand again.
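A hedged sketch of that fetch-on-demand idea, using a simple in-memory map with a TTL (the city key, the 10-minute TTL, and the fetchWeather call are all placeholders):

// Sketch: fetch on demand and cache with a TTL, so the external API is only
// polled for cities that clients are actually asking about.
const CACHE_TTL_MS = 10 * 60 * 1000; // treat weather as fresh for 10 minutes
const cache = new Map(); // city -> { data, fetchedAt }

async function fetchWeather(city) {
  // Placeholder for the real external API call.
  const res = await fetch(`https://api.example.com/weather?city=${encodeURIComponent(city)}`);
  return res.json();
}

async function getWeather(city) {
  const entry = cache.get(city);
  if (entry && Date.now() - entry.fetchedAt < CACHE_TTL_MS) {
    return entry.data; // still fresh -- no call to the external API
  }
  const data = await fetchWeather(city);
  cache.set(city, { data, fetchedAt: Date.now() });
  return data;
}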
I have tried setInterval and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint to check for new data every minute, but it might only have new data every five minutes or more.
Making a remote network request is not a CPU intensive operation, even if you're doing it every minute. node.js uses non-blocking networking so most of the time during a network request, node.js isn't doing anything and isn't using the CPU at all. The only time the CPU would be briefly used is when you first send the API request and then when you receive back the result from the API call and need to process it.
Whether you really need to "poll" every minute depends upon the data and the needs of the client. I'd ask yourself if your app will work just fine if you check for new data every 5 minutes.
The method I would use to update would live outside of the code, in a scheduled batch/PowerShell/bash file. In Windows you can schedule tasks based upon time of day or duration since the last run, so you could run a simple command that kills your application for five minutes, runs npm update, and then restarts your application before closing the shell.
That way you're staying out of your API and keeping code to a minimum, and if your code lives inside that Node package being updated, it'll be there and ready once you make serious application changes or need to take the server down for maintenance and updates to the low-level code.
This is a lightweight solution and a method I've used once or twice at my workplace. There are lots of options out there, and if this isn't what you're looking for I can keep looking for you.
In my first foray into any computing in the cloud, I was able to follow Mark West's instructions on how to use AWS Rekognition to process images from a security camera that are dumped into an S3 bucket and provide a notification if a person was detected. His code was set up for the Raspberry Pi camera, but I was able to adapt it to my IP camera by having it FTP the triggered images to my Synology NAS and using CloudSync to mirror them to the S3 bucket. A step function calls Lambda functions per the figure below, and I get an email within 15 seconds with a list of detected labels and the image attached.
The problem is the camera will upload one image per second as long as the condition is triggered, and if there is a lot of activity in front of the camera, I can quickly rack up a few hundred emails.
I'd like to insert a function between make-alert-decision and nodemailer-send-notification that checks whether an email notification was sent within the last minute. If not, it would proceed to nodemailer-send-notification right away; if so, it would store the list of labels and the path to the attachment in an array and then send a single email with all of the attachments once 60 seconds had passed.
I know I have to store the data externally, and I came across this article explaining the benefits of different methods of caching data. I also thought I could examine the timestamps of the files uploaded to S3 and compare the time elapsed between the two most recent uploads to decide whether to proceed or batch the file for later.
Being completely new to AWS, I am looking for advice on which method makes the most sense from a complexity and cost perspective. I can live with the lag involved in any of the methods discussed in the article; I just don't know how to proceed, as I've never used, or even heard of, some of these services.
Thanks!
You can use an SQS queue to which the make-alert-decision Lambda sends a message with each label and the path to the attachment.
The nodemailer-send-notification Lambda would be a consumer of that queue, but executed on a regular schedule.
You can schedule that Lambda to run every minute, read all the messages from the queue (deleting them right away, or setting a suitable visibility timeout and deleting them afterwards), collect the list of attachments, and send a single email. You would then get a single email with all the attachments every 60 seconds.
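A rough sketch of that scheduled consumer Lambda (the queue URL, SMTP settings, addresses, and the assumed { labels, attachmentPath } message shape are all placeholders):

// Sketch: a Lambda triggered every minute that drains up to one batch of queued
// alerts and sends a single combined email. Queue URL and SMTP config are placeholders.
const { SQSClient, ReceiveMessageCommand, DeleteMessageBatchCommand } = require('@aws-sdk/client-sqs');
const nodemailer = require('nodemailer');

const sqs = new SQSClient({});
const QUEUE_URL = process.env.ALERT_QUEUE_URL; // placeholder environment variable

exports.handler = async () => {
  // ReceiveMessage returns at most 10 messages per call; a real implementation
  // would loop until the queue is empty and delete in batches of 10.
  const { Messages = [] } = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
  }));
  if (Messages.length === 0) return;

  // Assumed message shape from make-alert-decision: { labels: [...], attachmentPath: '...' }
  const alerts = Messages.map((m) => JSON.parse(m.Body));

  const transporter = nodemailer.createTransport({ host: 'smtp.example.com', port: 587 }); // placeholder SMTP
  await transporter.sendMail({
    from: 'alerts@example.com',
    to: 'me@example.com',
    subject: `Camera alerts: ${alerts.length} detection(s) in the last minute`,
    text: alerts.map((a) => a.labels.join(', ')).join('\n'),
    attachments: alerts.map((a) => ({ path: a.attachmentPath })),
  });

  // Delete the messages only after the combined email has been sent.
  await sqs.send(new DeleteMessageBatchCommand({
    QueueUrl: QUEUE_URL,
    Entries: Messages.map((m, i) => ({ Id: String(i), ReceiptHandle: m.ReceiptHandle })),
  }));
};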
I have a bunch of user-generated messages with timestamps, message text, profile images, and other stuff. All clients (phones) using my Web API can request the latest messages, then scroll down and request older items. Obviously, the top messages are the hottest data in the whole list, so I want a cache with a caching policy and a clear notion of whether newly requested messages are hot or not.
I created a stateless service with MemoryCache and now use it for my purposes. Are there any pitfalls I should take into account while working with it? Apart, of course, from the point that I have five nodes, and a user can hit a service instance that has no cache populated; in that case the service goes to the data-layer service and loads the data from it.
UPD #1
Forgot to mention that this list of messages is updated with new entries from time to time.
UPD #2
I wrapped MemoryCache in an IReliableDictionary implementation and passed it off inside a stateful service with my own StateManager implementation. Every time a request doesn't find an item in the collection, I go to Azure Storage and retrieve the actual data. After I finished, I realized that my experiment is not useful, because there is no way to scale such an approach: if my app has a fixed set of partitioned Reliable Services working as a cache, I have no way to grow them by scaling out my Service Fabric cluster. If load increases, after some time this fact will hit me in the face :)
I still do not know a more efficient way to cache my super-hot, most-read messages. And I still have doubts about the Reliable Actors approach: it creates a huge amount of replicated data.
I think this is an ideal use of an actor.
The actor will be garbage collected after a period of time, so data won't stay in memory.
One actor per user.
On this documentation page the following limitation of Application Insights is documented:
Up to 500 telemetry data points per second per instrumentation key (that is, per application). This includes both the standard telemetry sent by the SDK modules, and custom events, metrics and other telemetry sent by your code.
However, it doesn't explain what the implications of that limit are.
a) Does it buffer and throttle, but still persist all data eventually? So say - 1000 data points get pushed within a second - it will persist the first 500, then wait for a bit and push the other 500?
or
b) Does it just drop/not log data? So say - 1000 data points get pushed within a second and only the first 500 will be persisted and the other 500 not (ever)?
It is the latter (b), with the caveat that ALL data will start to be throttled in this case, i.e. once a rate above 500 data points per second (100 for free apps, please see https://azure.microsoft.com/en-us/documentation/articles/app-insights-data-retention-privacy/ for details) is detected, the data collection endpoint will start rejecting all data from this instrumentation key until the rate is back under 500.
EDIT: Further information from Bret Grinslade:
The current implementation averages over one minute -- so if you send 30K in 1 minute (500*60) it will throttle your application. The HTTP response will tell the SDK to retry later. If the incoming rate never comes down, the response will tell the SDK to drop the data. We are working on other features to improve this experience -- pre-aggregation on the client, improved burst data rates, etc.
A bit more detail on top of Alex's response. The current implementation averages over one minute -- so if you send 30K in 1 minute (500*60) it will throttle your application. The HTTP response will tell the SDK to retry later. If the incoming rate never comes down, the response will tell the SDK to drop the data. We are working on other features to improve this experience -- pre-aggregation on the client, improved burst data rates, etc.
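In the meantime, if you're bumping into the limit from a Node.js app, one existing way to cut the volume is fixed-rate sampling in the SDK. A minimal sketch, assuming the applicationinsights npm package (the instrumentation key and event name are placeholders):

// Sketch: keep only ~25% of telemetry client-side to stay further below the per-key limit.
const appInsights = require('applicationinsights');

appInsights.setup('<your-instrumentation-key>').start();
appInsights.defaultClient.config.samplingPercentage = 25; // send roughly 1 in 4 items

// Custom telemetry is sampled along with the standard modules' telemetry.
appInsights.defaultClient.trackEvent({ name: 'my-custom-event', properties: { source: 'example' } });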
AI now has an ingestion throttling limit of 16K EPS: https://learn.microsoft.com/en-us/azure/application-insights/app-insights-pricing
Here's the scenario. I'm not working with real-time data. Instead, I get data from my electric company for the past day's electric usage. Specifically, each day I can get the number of kWh for each hour on the clock during the past day.
So, I'd like to load this past information into Event Hub each following day. Is this doable? Does Event Hub support loading past information, or is it only and forever about real-time streaming data, with no ability to load past data in?
I'm afraid this is the case, as I've not seen any date specification in what limited API documentation I could find for it. I'd like to confirm, though...
Thanks,
John
An Azure Event Hub is really meant for short-term storage. By default you may only retain data for up to 7 days, after which the data will be deleted based upon an append timestamp created when the message first entered the Event Hub. It is therefore not practical to use an Azure Event Hub for data that's older than 7 days.
An Azure Event Hub is meant for message/event management, not long-term storage. A possible solution would be to write the Event Hub data to an Azure SQL server or blob storage for long-term storage, and then use Azure Stream Analytics (an event processor) to join the active stream with the legacy data that has accumulated on the SQL server. Also note that you can reference the appended attribute; it's called "EventEnqueuedUtcTime". Keep in mind that it reflects the server time, whose clock may differ from the date/time of the actual measurement.
As for appending a date/time: if you are sending the message as JSON, simply include it as a key and value. Example message with time: { "Time": "My UTC time here" }
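For example, with the @azure/event-hubs package, sending the previous day's hourly readings with their measurement time inside the body could look roughly like this (connection string, hub name, and field names are placeholders):

// Sketch: send yesterday's hourly kWh readings, carrying the actual measurement
// time inside each event body. Connection string and hub name are placeholders.
const { EventHubProducerClient } = require('@azure/event-hubs');

async function sendYesterdaysReadings(readings) {
  // readings: [{ hour: 0..23, kwh: number, measuredAtUtc: '2023-01-01T13:00:00Z' }, ...]
  const producer = new EventHubProducerClient(process.env.EVENTHUB_CONNECTION_STRING, 'meter-readings');
  const batch = await producer.createBatch();
  for (const r of readings) {
    // Event Hubs itself ignores this Time field; only your downstream processing uses it.
    batch.tryAdd({ body: { Time: r.measuredAtUtc, Hour: r.hour, Kwh: r.kwh } });
  }
  await producer.sendBatch(batch);
  await producer.close();
}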
A streaming system of this type doesn't care about times a particular application may wish to apply to the items. There simply isn't any processing that happens based on a time field unless your code does it.
Each message sent is an EventData which contains a message with an arbitrary set of bytes. You can easily include a date/time in that serialized data structure, but EventHubs won't care about it. There is no sorting performed or fixed ordering other than insertion order within a partition which is defined by the sequence number. While the enqueued time is available it's mostly useful for monitoring how far behind in processing you are.
As to the rest of your problem, I'd agree with the comment that EventHubs may not really be the best choice. You can certainly load data into it once per day, but if it's really only 24 data points per day, it's not the appropriate technology choice unless it's a prototype/tech demo for a system that's eventually supposed to have a whole load of smart meters reporting to it with fair frequency. (Note also that EventHubs cost $11/month minimum, Service Bus Queue $10/month minimum, and AWS SQS $0 minimum.)