Node.js memory leak, despite constant Heap + RSS sizes

According to my server monitoring, my memory usage is creeping up over time.
After ~4 weeks of uptime, it ends up causing problems / crashing (which makes sense, given that I'm on EC2 with m1.large instances => 8GB RAM, and RAM seems to be increasing at about 1.5 GB / week).
If I restart my node.js app, the memory usage resets.
Yet I'm keeping track of my memory usage via process.memoryUsage(), and even after ~1 week, I'm seeing:
{"rss":"693 Mb","heapTotal":"120 Mb","heapUsed":"79 Mb"}
What am I missing? Clearly the leak is in node, yet the process seems to not be aware of it...
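Figures like the ones quoted above can be produced with a small sampler along these lines. This is only a minimal sketch: the helper names `toMb`, `formatMemoryUsage` and `startSampling` and the one-minute interval are assumptions for illustration, not code from the original post. Note that `process.memoryUsage()` only describes the Node.js process itself (its resident set size and V8 heap), so it is worth comparing against what the server-level monitoring reports for the whole machine.

```javascript
// Convert a byte count to whole megabytes, matching the "693 Mb" style above.
function toMb(bytes) {
  return `${Math.round(bytes / (1024 * 1024))} Mb`;
}

// Pick the three fields shown in the question from a process.memoryUsage() result.
function formatMemoryUsage(usage) {
  return {
    rss: toMb(usage.rss),
    heapTotal: toMb(usage.heapTotal),
    heapUsed: toMb(usage.heapUsed),
  };
}

// Log one sample per interval (hypothetical default: 60 seconds).
// unref() keeps the timer from holding the process open on its own.
function startSampling(intervalMs = 60000) {
  const timer = setInterval(() => {
    console.log(JSON.stringify(formatMemoryUsage(process.memoryUsage())));
  }, intervalMs);
  timer.unref();
  return timer;
}

// Print a single sample immediately.
console.log(JSON.stringify(formatMemoryUsage(process.memoryUsage())));
```

Calling `startSampling()` once at application startup would then log one line per minute, which can be lined up against the monitoring graph.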

You can try the node-memwatch module, which helps with leak detection and heap diffing in Node.
A heap diff would look similar to:
{
"before": { "nodes": 11625, "size_bytes": 1869904, "size": "1.78 mb" },
"after": { "nodes": 21435, "size_bytes": 2119136, "size": "2.02 mb" },
"change": { "size_bytes": 249232, "size": "243.39 kb", "freed_nodes": 197,
"allocated_nodes": 10007,
"details": [
{ "what": "String",
"size_bytes": -2120, "size": "-2.07 kb", "+": 3, "-": 62
},
{ "what": "Array",
"size_bytes": 66687, "size": "65.13 kb", "+": 4, "-": 78
},
{ "what": "LeakingClass",
"size_bytes": 239952, "size": "234.33 kb", "+": 9998, "-": 0
}
]
}
}

Related

Unable to get memory used percentage from management API

I need to get the memory used by an Azure VM, but I am not getting it.
Tried this https://management.azure.com/subscriptions/XXXXXXXXXXXXXXXXXXXX/resourceGroups/XXXXXXXXXXXX/providers/Microsoft.Compute/virtualMachines/XXXXXXX/providers/microsoft.insights/metrics?timespan=2019-03-31T11:30:00.000Z/2020-09-14T11:00:00.000Z&interval=P1D&metricnames=\Memory\% Committed Bytes In Use&aggregation=Average&api-version=2018-01-01&metricnamespace=azure.vm.windows.guestmetrics
The response I am getting:
{
"cost": 0,
"timespan": "2020-08-14T11:00:00Z/2020-09-14T11:00:00Z",
"interval": "P1D",
"value": [
{
"id": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/xxxxxxxxxxxxx/providers/Microsoft.Compute/virtualMachines/xxxxxxx/providers/Microsoft.Insights/metrics/\Memory\% Committed Bytes In Use",
"type": "Microsoft.Insights/metrics",
"name": {
"value": "\Memory\% Committed Bytes In Use",
"localizedValue": "\Memory\% Committed Bytes In Use"
},
"unit": "Unspecified",
"timeseries": [],
"errorCode": "Success"
}
],
"namespace": "azure.vm.windows.guestmetrics",
"resourceregion": "westus2"
}
Make sure you have enabled guest-level monitoring for the Azure virtual machine, then try again.
See https://docs.bmc.com/docs/capacityoptimization/btco115/collecting-additional-metrics-using-guest-os-diagnostics-890312716.html

Slow Elasticsearch indexing using join datatype

We have an index with a join datatype and the indexing speed is very slow.
At best we are indexing 100 documents/sec, but mostly around 50/sec; the time varies depending on the document size. We are using multiple threads with .NET NEST when indexing, but both batching and single inserts are pretty slow. We have tested various batch sizes but are still not getting any speed to speak of. Even with only small documents containing "metadata" it is slow, and the speed drops radically as the document size increases. Document size in this solution can vary from small up to 6 MB.
What can we expect when indexing with the join datatype? How much of a penalty should we expect from using it? We did of course try to avoid it when designing the index, but we did not find any way around it. Any tips or tricks?
We are using a 3-node cluster in Azure, all nodes with 32 GB of RAM and premium SSD disks. The Java heap size is set to 16 GB. Swapping is disabled. Memory usage on the VMs is stable at about 60% of total, but CPU usage is very low (< 10%). We are running Elasticsearch v. 6.2.3.
A short version of the mapping:
"mappings": {
"log": {
"_routing": {
"required": true
},
"properties": {
"body": {
"type": "text"
},
"description": {
"type": "text"
},
"headStepJoinField": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"head": "step"
}
},
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"statusId": {
"type": "keyword"
},
"stepId": {
"type": "keyword"
}
}
}
}

Azure Gremlin edge traversal suspiciously high (Out() step) RU cost

I have a weird issue, where doing an out-operation on a few edges causes my RU cost to triple. Hope someone can help me shed light on why + what I can do to mitigate it.
I have a Graph in CosmosDB, where there are two types of vertex labels: "Profile" and "Score". Each profile has 0 or 1 score-vertices via a "ProfileHasAggregatedScore" edge. The partitionKey is the ID of the Profile.
If I make the following queries, the RU cost currently is:
g.V().hasLabel('Profile').out('ProfileHasAggregatedScore')
>78 RU (8 scores found)
And for reference, the cost of getting all vertices of a type is:
g.V().hasLabel('Profile')
>28 RU (110 profiles found)
g.E().hasLabel('ProfileHasAggregatedScore')
>11 RU (8 edges found)
g.V().hasLabel('AggregatedRating')
>11 RU (8 scores found)
And the cost of fetching a single vertex or edge is:
g.V('aProfileId').hasLabel('Profile')
>4 RU (1 found)
g.E('anEdgeId')
> 7 RU
g.V('aRatingId')
> 3.5 RU
Can someone please help me understand why making a traversal with only a few vertices along the way (see the execution profile at the bottom) is more expensive than searching for everything? And is there something I can do to prevent it? Adding a has-filter with the partitionKey does not seem to help. It seems odd that traversing/finding 16 more elements (8 edges and 8 vertices) after finding 110 vertices triples the cost of the operation.
(NB: With 1000 profiles, the cost of doing 1 traversal along an edge to the score node is 2200 RU. This seems high, considering the emphasis the Azure team puts on it being scalable.)
Execution profile, in case it helps (it seems most of the time is spent finding the edges with the out() step):
[
{
"gremlin": "g.V().hasLabel('Profile').out('ProfileHasAggregatedScore').executionProfile()",
"totalTime": 46,
"metrics": [
{
"name": "GetVertices",
"time": 13,
"annotations": {
"percentTime": 28.26
},
"counts": {
"resultCount": 110
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 110,
"size": 124649,
"time": 2.47
}
]
},
{
"name": "GetEdges",
"time": 26,
"annotations": {
"percentTime": 56.52
},
"counts": {
"resultCount": 8
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 8,
"size": 5200,
"time": 6.22
},
{
"fanoutFactor": 1,
"count": 0,
"size": 49,
"time": 0.88
}
]
},
{
"name": "GetNeighborVertices",
"time": 7,
"annotations": {
"percentTime": 15.22
},
"counts": {
"resultCount": 8
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 8,
"size": 6303,
"time": 1.18
}
]
},
{
"name": "ProjectOperator",
"time": 0,
"annotations": {
"percentTime": 0
},
"counts": {
"resultCount": 8
}
}
]
}
]

Azure App Service Autoscale Fails to Scale In

My app service has failed to scale-in after scaling-out. This seems to be a pattern I've been trying to troubleshoot for several months.
I've tried the following but none have worked:
My scale condition is based on CPU and memory. However, I've never seen CPU go past 12%, so I'm assuming it's actually scaling based on memory.
Set the scale out condition to memory over 90% over a 5 minute average with 10 min. cooldown and scale in condition for memory under 70% over a 5 minute average. This doesn't seem to make sense since if my memory utilization is already at 90%, I'm really having underlying memory leaks and should have already scaled out.
Set the scale out condition to memory over 80% over a 60 minute average with 10 min. cooldown and scale in condition for memory under 60% over a 5 minute average. This makes more sense, as I've seen memory usage burst over a few hours only to drop.
Expected behavior: App service autoscaling will reduce instance count after 5 minutes where memory usage drops below 60%.
Question:
What is the ideal threshold on a metric to scale smoothly by if my baseline CPU remains at an average of roughly 6% and memory at 53%? Meaning, what are the best minimum values to scale in by and the best maximum values to scale out by without worrying about anti-patterns such as flapping? A larger gap of 20% between the thresholds makes more sense to me.
Alternative solution:
Given the amount of troubleshooting involved with what's marketed as being as simple as "push button scaling", it's almost not worth the headache of the configuration vagueness (you can't even use IIS metrics like connection count without a custom PowerShell script!). I'm considering disabling autoscaling because of its unpredictability, keeping 2 instances running for automatic load balancing, and scaling manually.
Autoscale Configuration:
{
"location": "East US 2",
"tags": {
"$type": "Microsoft.WindowsAzure.Management.Common.Storage.CasePreservedDictionary, Microsoft.WindowsAzure.Management.Common.Storage"
},
"properties": {
"name": "CPU and Memory Autoscale",
"enabled": true,
"targetResourceUri": "/redacted",
"profiles": [
{
"name": "Auto created scale condition",
"capacity": {
"minimum": "1",
"maximum": "10",
"default": "1"
},
"rules": [
{
"scaleAction": {
"direction": "Increase",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT10M"
},
"metricTrigger": {
"metricName": "MemoryPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "GreaterThanOrEqual",
"statistic": "Average",
"threshold": 80,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT1H"
}
},
{
"scaleAction": {
"direction": "Decrease",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT5M"
},
"metricTrigger": {
"metricName": "MemoryPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "LessThanOrEqual",
"statistic": "Average",
"threshold": 60,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT10M"
}
},
{
"scaleAction": {
"direction": "Increase",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT5M"
},
"metricTrigger": {
"metricName": "CpuPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "GreaterThanOrEqual",
"statistic": "Average",
"threshold": 60,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT1H"
}
},
{
"scaleAction": {
"direction": "Decrease",
"type": "ChangeCount",
"value": "1",
"cooldown": "PT5M"
},
"metricTrigger": {
"metricName": "CpuPercentage",
"metricNamespace": "",
"metricResourceUri": "/redacted",
"operator": "LessThanOrEqual",
"statistic": "Average",
"threshold": 40,
"timeAggregation": "Average",
"timeGrain": "PT1M",
"timeWindow": "PT10M"
}
}
]
}
],
"notifications": [
{
"operation": "Scale",
"email": {
"sendToSubscriptionAdministrator": false,
"sendToSubscriptionCoAdministrators": false,
"customEmails": [
"redacted"
]
},
"webhooks": []
}
],
"targetResourceLocation": "East US 2"
},
"id": "/redacted",
"name": "CPU and Memory Autoscale",
"type": "Microsoft.Insights/autoscaleSettings"
}
For the CpuPercentage metric you have a scale-out action when it goes above 60 and a scale-in action when it goes below 40, and the difference between the two is very small. This can cause a behavior described as flapping, which will prevent autoscale's scale-in action from kicking in. The MemoryPercentage rule that you have configured has a similar issue.
You should have a difference of at least 40 between your scale-out and scale-in thresholds to avoid flapping. More details on flapping are in https://learn.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-autoscale-best-practices#choose-the-thresholds-carefully-for-all-metric-types (search for the word "flapping").
I have exactly the same problem, and I've come to believe that autoscaling back to one instance the way we want to do it is currently not possible.
My current workaround is to scale in to 1 instance with a second profile that repeats every day between 23:55 and 00:00.
Just to reiterate the problem, I have the following scenario, which is basically identical to yours:
Memory baseline of the App Service is 50%
Scale out 1 instance when avg(memory) > 80%
Scale in 1 instance when avg(memory) < 60%
Scaling out from 1 instance to 2 instances will work correctly when the average memory percentage exceeds 80%. But scaling in to 1 instance will never work because the memory baseline is too high.
After reading the Best Practices, my understanding is that before scaling in, autoscale estimates the resulting memory percentage and checks that no scale-out rule would be triggered by it.
So if the average memory percentage drops to 50% for two instances, the scale-in rule is triggered and autoscale estimates the resulting memory usage to be 2 * 50% / 1 = 100%, which would of course trigger the scale-out rule, and thus it will not scale in.
It should however work when scaling from 3 to 2 instances: 3 * 50% / 2 = 75% which is smaller than the 80% of the scale out rule.
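The estimate described above can be written out as a small check. This is a sketch of the arithmetic as this answer understands it, not Azure's actual implementation; the function names `projectedAfterScaleIn` and `canScaleIn` are made up for illustration, and treating a projection exactly equal to the threshold as blocking is an assumption.

```javascript
// Projected average usage per instance if one instance is removed,
// assuming the total load is spread evenly over the remaining instances.
function projectedAfterScaleIn(instances, avgPercent) {
  return (instances * avgPercent) / (instances - 1);
}

// Scale-in is allowed only if the projected usage stays below the
// scale-out threshold (assumption: reaching the threshold also blocks).
function canScaleIn(instances, avgPercent, scaleOutThreshold) {
  if (instances <= 1) return false; // nothing left to remove
  return projectedAfterScaleIn(instances, avgPercent) < scaleOutThreshold;
}

// 2 instances at 50% with scale-out at 80%: projection is 100%, so no scale-in.
console.log(projectedAfterScaleIn(2, 50), canScaleIn(2, 50, 80)); // 100 false

// 3 instances at 50% with scale-out at 80%: projection is 75%, so scale-in is allowed.
console.log(projectedAfterScaleIn(3, 50), canScaleIn(3, 50, 80)); // 75 true
```

Under this model, a baseline of 50% can only scale in from 2 instances to 1 if the scale-out threshold is above 100%, which is why the last step never happens.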
I have the same issue here. My app needs only one instance, and I have an autoscaling configuration like:
Scale out
When br-empresa (Average) CpuPercentage > 85 Increase instance count by 1
Or Br-Empresa (Average) MemoryPercentage > 85 Increase instance count by 1
Scale in
When br-empresa (Average) CpuPercentage <= 75 Decrease instance count by 1
And Br-Empresa (Average) MemoryPercentage <= 75 Decrease instance count by 1
And the baseline for memory is 60%.
The scale-out logic works pretty well, but the app never scales in, even if memory falls to 60%: (60% * 2) / 1 = 120%.
For memory or CPU metrics, the current flapping estimate doesn't make sense.

ArangoDB freezes when page faults increase

I am using ArangoDB with Node.js and the arangojs driver. One of the collections has 10,000,000 documents.
Sometimes the page fault count goes up (to 150 or 500), and ArangoDB freezes and doesn't respond to query requests. The ArangoDB web panel freezes as well.
My server config is:
RAM: 6 GB
CPU: 8 core
(According to the web panel, ArangoDB is using 4.76 GB (83.90 %) of the 6 GB of RAM.)
UPDATE1
This is the output of /_api/collection/AdsStatics/figures
{
"id": "191689719157",
"name": "AdsStatics",
"isSystem": false,
"doCompact": true,
"isVolatile": false,
"journalSize": 33554432,
"keyOptions": {
"type": "traditional",
"allowUserKeys": true
},
"waitForSync": false,
"indexBuckets": 8,
"count": 7816780,
"figures": {
"alive": {
"count": 7815806,
"size": 3563838968
},
"dead": {
"count": 306,
"size": 167464,
"deletion": 0
},
"datafiles": {
"count": 104,
"fileSize": 3530743672
},
"journals": {
"count": 1,
"fileSize": 33554432
},
"compactors": {
"count": 0,
"fileSize": 0
},
"shapefiles": {
"count": 0,
"fileSize": 0
},
"shapes": {
"count": 121,
"size": 56520
},
"attributes": {
"count": 24,
"size": 56
},
"indexes": {
"count": 3,
"size": 1660594864
},
"lastTick": "10044860034955",
"uncollectedLogfileEntries": 985,
"documentReferences": 0,
"waitingFor": "-",
"compactionStatus": {
"message": "checked datafiles, but no compaction opportunity found",
"time": "2016-02-24T08:29:27Z"
}
},
"status": 3,
"type": 2,
"error": false,
"code": 200
}
Thanks
It seems that your system is running out of memory. The datafiles for the one collection are 3,530,743,672 bytes in size, and the indexes are 1,660,594,864 bytes. That is about 5.2 GB for this one collection alone.
arangod will need further memory for its WAL, the V8 contexts and temporary query results in order to operate properly.
Provided the system has 6 GB of total RAM and the OS and other processes need some RAM, too, it looks like you're running out of memory.
I am wondering if you're seeing some sort of swapping activity, which would explain why (all) operations would get extremely slow.
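The sum above can be checked directly from the figures output. This is a back-of-the-envelope sketch only: the variable names are made up for illustration, and reading "6 GB" as 6 GiB is an assumption.

```javascript
// Values copied from the /_api/collection/AdsStatics/figures output above.
const datafilesBytes = 3530743672; // figures.datafiles.fileSize
const indexesBytes = 1660594864;   // figures.indexes.size

// Assumption: the server's "6 GB" of RAM means 6 GiB.
const totalRamBytes = 6 * 1024 ** 3;

// Memory needed to hold this one collection's datafiles and indexes.
const collectionBytes = datafilesBytes + indexesBytes;

// Format a byte count as decimal gigabytes with two decimals.
const toGB = (bytes) => (bytes / 1e9).toFixed(2);

console.log(`collection: ${collectionBytes} bytes (${toGB(collectionBytes)} GB)`);
console.log(`share of RAM: ${((100 * collectionBytes) / totalRamBytes).toFixed(1)} %`);
```

Whatever remains after this collection has to cover the WAL, the V8 contexts, temporary query results, the OS and every other process.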
