logstash vs a task queue for nodejs application - node.js

I have a nodejs web server where there seem to be a couple of bottlenecks preventing it from fully functioning peak load.
logging multiple events to our SQL server
logging multiple events to our elastic cluster
under heavy load , both SQL and elastic seem to reject my requests and/or considerably slow performance. So I've decided to reduce the load on these DBs via Logstash (for elastic) and an async task queue (for SQL)
Since i'm working on limited time , i'm wondering if i can solve these issues with just an async task queue (like kue) where i can push both SQL and elastic logs.
Do i need to implement logstash as well? or does an async queue solve the same thing as logstash.

You can try Async library's Queue feature and try and make it run on a child process or much better in a separate server as a Microservice for queues. That way you can move the load to another location, giving you a boost on your app server.

As you mentioned you are using azure I would strongly recommend using their queue solution plus a couple of azure functions to handle the read from the queue and processing.
I've rolled my own solution before using node.js and rabbitmq with node workers to read from the queue and write to elastic search because the solution couldn't go into the cloud.
It works and is robust but it takes quite a bit of configuration and custom code that is a maintenance nightmare. Ripped that out as soon as I could.
The benefits of using the azure service are:
Very little bespoke configuration is required.
Less custom code === less bugs & maintainence
scaling isn't an issue, some huge businesses rely on this
no 2am support calls, if azure is down they are going to fix it... fast
much cheaper, unless the throughput is massive and constant the scaling model is much cheaper and azure functions are perfect as you won't have running servers sitting there doing nothing when the queue is empty.
Same could be said for AWS and Google Cloud.

Related

.NetCore/NEST - ElasticSearch I/O issue or concurrency throttle

I'm having some issues for weeks now i can't resolve after many tries.
I have a .NET Core application which is, basically, indexing and searching documents all day.
This app allows concurrency calls to Elastic for all my use cases so when i have a big load (like 1k documents in a few seconds) i'm calling Elastic with no limits (i didn't find any limit for it, maybe i should limit threads..).
I have 2 ElasticSearch nodes on Production which have pretty good configurations for my load (according to Elastic prerequisites), 16Go RAM, 4vCPU
Sometimes, it seems that many calls are failing with that kind of error (i'm using ApplicationInsights) :
I assume NEST is doing an underlying HTTP connection to Elastic server that just timeout/failed.
When i'm looking the dependencies at the same time, i can see calls that run for over 100 000ms (but 4ms for a "normal" call)
Here an indexing example :
Our team checked all http route through network and we can't find any network issue.
We definitely understood it was an Elastic issue when we just restarted the Elastic services on both nodes and all started again to work.
Both nodes had a low CPU and RAM usage.
We "tried" to look at Elastic logs but we aren't very comfortable with that actually but with our "knowledge" on that we didn't find anything "unusual".
We're indexing documents one by one, never bulked. We can't change that because documents are received from an external system thanks to a ServiceBus queue and we can't control the load. So one ServiceBus message is an individual PUT call to Elastic servers.
I can't reproduce it easily to be honnest, but i would like to have some feedbacks by people using NEST/ElasticSearch and how they managed their threadPool.
Should I limit Elastic calls (using Semaphore for example) ?
Should I change Elastic server configuration ?
Please let me know if you need more information
Thanks for your help

Recommended Azure service to replace Azure functions

We have a service running as an Azure function (Event and Service bus triggers) that we feel would be better served by a different model because it takes a few minutes to run and loads a lot of objects in memory and it feels like it loads it every time it gets called instead of keeping in memory and thus performing better.
What is the best Azure service to move to with the following goals in mind.
Easy to move and doesn't need too many code changes.
We have long term goals of being able to run this on-prem (kubernetes might help us here)
Appreciate your help.
To achieve first goal:
Move your Azure function code inside a continuous running Webjob. It has no max execution time and it can run continuously caching objects in its context.
To achieve second goal (On-premise):
You need to explain this better, but a webjob can be run as a console program on-premise, also you can wrap it into a docker container to move it from on-premise to any cloud but if you need to consume messages from an Azure Service Bus you will need an On-Premise-Azure approach connecting your local server to the cloud with a VPN or expressroute.
Regards.
There are a couple of ways to solve the said issue, each with slightly higher amount of change from where you are.
If you are just trying to separate out the heavy initial load, then you can do it once in a Redis Cache instance and then reference it from there.
If you are concerned about how long your worker can run, then Webjobs (as explained above) can work, however, that is something I'd suggest avoiding since its not where Microsoft is putting its resources. Rather look at durable functions. Here an orchestrator function can drive a worker function. (Even here be careful, that since durable functions retain history after running for very very very long times, the history tables might get too large - so probably program in something like, restart the orchestrator after say 50,000 runs (obviously the number will vary based on your case)). Also see this.
If you want to add to this, the constrain of portability then you can run this function in a docker image that can be run in an AKS cluster in Azure. This might not work well for durable functions (try it out, who knows :) ), but will surely work for the worker functions (which would cost you the most compute anyways)
If you want to bring the workloads completely on-prem then Azure functions might not be a good choice. You can create an HTTP server using the platform of your choice (Node, Python, C#...) and have that invoke the worker routine. Then you can run this whole setup inside an image on an AKS cluster on prem and to the user it looks just like a load balanced web-server :) - You can decide if you want to keep the data on Azure or bring it down on prem as well, but beware of egress costs if you decide to move it out once you've moved it up.
It appears that the functions are affected by cold starts:
Serverless cold starts within Azure
Upgrading to the Premium plan would move your functions to pre-warmed instances, which should counter the problem you are experiencing:
Pre-warmed instances for Azure Functions
However, if you potentially want to deploy your function/triggers to on-prem, you should spin them out as microservices and deploy them with containers.
Currently, the fastest way would probably be to deploy the containerized triggers via Azure Container Instances if you don't already have a Kubernetes Cluster running. With some tweaking, you can deploy them on-prem later on.
There are few options:
Move your function app on to premium. But it will not help u a lot at the time of heavy load and scale out.
Issue: In that case u will start facing cold startup issues and problem will be persist in heavy load.
Redis Cache, it will resolve your most of the issues as the main concern is heavy loading.
Issue: If your system is multitenant system then your Cache become heavy during the time.
Create small micro durable functions. It will be not the answer of your Q as u don't want lots of changes but it will resolve your most of the issues.

Best practice for Sequelize database migrations on GAE

I have built an application using Express, Postgres, and Sequelize on Google App Engine and I'm having some trouble running a longer migration. This migration simply dumps the data from one of my large tables into elastic search.
As of right now, I have been running my migrations in the pre-start command as such
npm i && sequelize db:migrate
but I notice that Google App Engine has been running my migration over and over again due to the auto-scaling nature of the instances. Is there a better practice for running migrations? Is there a way to only run this migration once and prevent auto-scaling for just the pre-start command?
First is necessary understand how App Engine handle the scalling types:
Automatic scaling creates instances based on request rate, response latencies, and other application metrics. You can specify thresholds for each of these metrics, as well as a minimum number instances to keep running at all times.
Basic scaling creates instances when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.
Manual scaling specifies the number of instances that continuously run regardless of the load level. This allows tasks such as complex initializations and applications that rely on the state of the memory over time.
I recommend you to choose the manual scaling in order to set the specific number of instances you need/want, or, if you are going to use automatic scaling just pay attention to the limits (max/min (idle) instances) to set specific limits. However, it is up to you choosing the configuration that best suits your requirements.
Being this said, regardless the scaling method you choose, it seems that your script is being restarted every time your GAE scales, or, that it is the script telling your application to repeat the process over and over. It could be useful if you share details on how you are executing your script and what it does, in order to get a better perspective.
A possible workaround for this task could be port the functionality of the migration script itself into the body of an admin-protected handler in the GAE app which can be triggered with a HTTP request for a particular URL.
I think is possible to split the potentially long-running migration operation into a sequence of smaller operations (using push task queues), much more GAE-friendly.
Also, I suggest you take a look on this thread.
However I understand you want to migrate your data from PostgreSQL to one Elasticsearch table, I found out this tutorial where is recommended create a CSV file from your PostgreSQL database, then you can pass the data from CSV to the Json format, this is because you can use the service Elasticdump format your Json file as Elastic Search document,these steps are on Node JS, therefore you can create a script on App Engine or in Cloud functions depending of data size and execute the import, by example:
# import
node_modules/elasticdump/bin/elasticdump --input=formatted.json --output=http://localhost:9200/

Running Locust in distributed mode on Azure functions

I am building a small utility that packages Locust - performance testing tool (https://locust.io/) and deploys it on azure functions. Just a fun side project to get some hands on with the serverless craze.
Here's the git repo: https://github.com/amanvirmundra/locust-serverless.
Now I am thinking that it would be great to run locust test in distributed mode on serverless architecture (azure functions consumption plan). Locust supports distributed mode but it needs the slaves to communicate with master using it's IP. That's the problem!!
I can provision multiple functions but I am not quite sure how I can make them talk to each other on the fly(without manual intervention).
Thinking out loud:
Somehow get the IP of the master function and pass it on to the slave functions. Not sure if that's possible in Azure functions, but some people have figured a way to get an IP of azure function using .net libraries. Mine is a python version but I am sure if it can be done using .net then there would be a python way as well.
Create some sort of a VPN and map a function to a private IP. Not sure if this sort of mapping is possible in azure.
Some has done this using AWS Lambdas (https://github.com/FutureSharks/invokust). Ask that person or try to understand the code.
Need advice in figuring out what's possible at the same time keeping things serverless. Open to ideas and/or code contributions :)
Update
This is the current setup:
The performance test session is triggered by an http request, which takes in number of requests to make, the base url, and no. of concurrent users to simulate.
Locustfile define the test setup and orchestration.
Run.py triggers the tests.
What I want to do now, is to have master/slave setup (cluster) for a massive scale perf test.
I would imagine that the master function is triggered by an http request, with a similar payload.
The master will in turn trigger slaves.
When the slaves join the cluster, the performance session would start.
What you describe doesn't sounds like a good use-case for Azure Functions.
Functions are supposed to be:
Triggered by an event
Short running (max 10 minutes)
Stateless and ephemeral
But indeed, Functions are good to do load testing, but the setup should be different:
You define a trigger for your Function (e.g. HTTP, or Event Hub)
Each function execution makes a given amount of requests, in parallel or sequentially, and then quits
There is an orchestrator somewhere (e.g. just a console app), who sends "commands" (HTTP call or Event) to trigger the Function
So, Functions are "multiplying" the load as per schedule defined by the orchestrator. You rely on Consumption Plan scalability to make sure that enough executions are provisioned at any given time.
The biggest difference is that function executions don't talk to each other, so they don't need IPs.
I think the mentioned example based on AWS Lambda is just calling Lambdas too, it does not setup master-client lambdas talking to each other.
I guess my point is that you might not need that Locust framework at all, and instead leverage the built-in capabilities of autoscaled FaaS.

Migrating Task Queues to Cloud Functions

We're using Google App Engine Standard Environment for our application. The runtime we are using is Python 2.7. We have a single service which uses multiple versions to deploy the app.
Most of our long-running tasks are done via Task Queues. Most of those tasks do a lot of Cloud Datastore CRUD operations. Whenever we have to send the results back to the front end, we use Firebase Cloud Messaging for that.
I wanted to try out Cloud Functions for those tasks, mostly to take advantage of the serverless architecture.
So my question is What sort of benefits can I expect if I migrate the tasks from Task Queues to Cloud Functions? Is there any guideline which tells when to use which option? Or should we stay with Task Queues?
PS: I know that migrating a code which is written in Python to Node.js will be a trouble, but I am ignoring this for the time being.
Apart from the advantage of being serverless, Cloud Functions respond to specific events "glueing" elements of your architecture in a logical way. They are elastic and scale automatically - spinning up and down depending on the current demand (therefore they incur costs only when they are actually used). On the other hand Task Queues are a better choice if managing execution concurrency is important for you:
Push queues dispatch requests at a reliable, steady rate. They
guarantee reliable task execution. Because you can control the rate at
which tasks are sent from the queue, you can control the workers'
scaling behavior and hence your costs.
This is not possible with Cloud Functions which handle only one request at a time and run in parallel. Another thing for which Task Queues would be a better choice is handling retry logic for the operations that didn't succeed.
Something you can also do with Cloud Functions together with App Engine Cron jobs is to run the function based on a time interval, not an event trigger.
Just as a side note, Google is working on implementing Python to Cloud Functions also. It is not known when that will be ready, however it will be surely announced in Google Cloud Platform Blog.

Resources