Our migration application is written in Java and migrates data into Hyperledger Fabric 2.2, using the Fabric Java SDK to connect to the HLF network.
We process data in batches of 4,000 records via a scheduler that runs at a 1-minute interval.
We migrated 40,000 records in 10 minutes, after which 993.689 MB was occupied by the Docker containers (peers, orderers, CouchDB, CA, chaincode).
We then stopped the Java process and left the system idle; after 1 hour only around 200 MB had been released.
We are trying to migrate 2.5 million records. After the scheduler has run multiple times, the system grinds to a halt (no memory available).
We have confirmed that the JVM itself is not leaking memory.
I want to understand why the Docker containers are not releasing memory. Where is the leak happening?
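In similar setups, much of the "occupied" memory turns out to be Linux page cache from CouchDB/ledger writes, which is charged to each container's cgroup and is only reclaimed under memory pressure. One hedged way to bound it is to put explicit memory limits on the heavier containers, so the kernel reclaims cache before the host runs dry. A sketch of a Compose fragment (the service names, images, and limit values here are placeholders, not your actual topology):

```yaml
# Hypothetical docker-compose fragment; adjust names and limits to your network
services:
  couchdb0:
    image: couchdb:3.1
    mem_limit: 512m     # cgroup cap: the kernel drops page cache before exceeding it
  peer0:
    image: hyperledger/fabric-peer:2.2
    mem_limit: 1g
  orderer0:
    image: hyperledger/fabric-orderer:2.2
    mem_limit: 512m
```

With limits in place, `docker stats` should show usage plateauing at the caps instead of growing with every batch, which also helps distinguish reclaimable cache from a genuine leak.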
Related
I have 3000+ messages per minute being delivered to an Azure Service Bus Queue. In AKS I had 10 instances of a pod that implements a dotnet core 5 Web Job that triggers off of a ServiceBusTrigger (Microsoft.Azure.WebJobs.Extensions.ServiceBus, Version=5.0.0-beta.2).
public async Task RunAsync([ServiceBusTrigger("minutedata", Connection = "AzureWebJobsServiceBusConnection")] ServiceBusReceivedMessage message, ServiceBusMessageActions messageActions, ILogger log)
Up until recently my AKS cluster had four nodes running around 25% CPU capacity. This was running 1.18.2 of Kubernetes. This configuration was handling the message load OK - all new messages were being processed within 1 minute of receipt, before the next messages were delivered. It was also running 40+ other pods OK at this CPU level.
After an upgrade to Kubernetes 1.20.5 I noticed that the queue length started to grow rapidly indicating that the messages were no longer being processed within 1 minute by the 10 processor instances. I also noticed that the average CPU % for each node was over 90%.
I scrambled to try to fix the situation so at least the queue length would settle and the backlog be processed. This included adding more nodes (up to 8 in total) and more processor instances (up to 30).
Once the queue was stabilized I started trying to figure out what was going on and to draw back the nodes and processor instances. I managed to get back to 5 nodes and 20 processor instances in order to stay on top of the 3000+ messages coming in per minute.
However, the nodes are now running at ~60% CPU average, and I'm using double the number of processor instances, just to keep up with a messaging throughput that was previously being handled by 4 nodes (running at 25%) and 10 processors.
I have re-written the processor (Web Job) to use the latest libraries (which is why I'm using Microsoft.Azure.WebJobs.Extensions.ServiceBus, Version=5.0.0-beta.2; I would never normally use a beta library in production), and this has had no significant impact. I even created a new Service Bus namespace and moved over to a new queue in case the original one had somehow been corrupted. Again, no impact.
I'm beginning to believe that my focus should be on what changed from version 1.18.2 to 1.20.5 of Kubernetes, but I'm not sure where to start. I read somewhere that Kubernetes deprecated Docker after 1.20, but I'm not sure if I need to do anything differently given that I'm using AKS (surely AKS will be automatically configured to run Docker containers running dotnet core apps).
Another observation is that the messaging throughput drops off every 15 minutes and recovers within 2 minutes.
I was wondering if there's a default setting in Kubernetes that is now affecting the number of connections that can be made from each node that might be impacting the processors ability to receive and process messages from the service bus queue.
Anyway, I'm pretty much at a loss of how to proceed. Any ideas?
UPDATE
I've also noticed that AKS updates were rolled out the same day I started seeing performance issues.
https://github.com/Azure/AKS/blob/master/CHANGELOG.md
I wonder if this just comes down to containerd differences: https://learn.microsoft.com/en-us/azure/aks/cluster-configuration#containerd-limitationsdifferences
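A quick way to confirm which runtime the upgraded nodes are actually on is the CONTAINER-RUNTIME column of `kubectl get nodes -o wide`. This sketch assumes no live cluster is at hand and just parses a sample line of that output to show what to look for:

```shell
# On a live cluster you would run:  kubectl get nodes -o wide
# The last column is the container runtime; after upgrading past 1.19,
# AKS node pools typically report containerd:// rather than docker://.
sample='aks-nodepool1-0  Ready  agent  10d  v1.20.5  10.240.0.4  <none>  Ubuntu 18.04  5.4.0-azure  containerd://1.4.4'
echo "$sample" | awk '{print $NF}'
```

If the nodes do report containerd, the limitations/differences page linked above is the right place to compare behavior against the old Moby runtime.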
I'm currently trying to deploy a Node.js app in Docker containers. I need to deploy 30 of them, but at some point they start behaving strangely: some of them freeze.
I am currently running Docker for Windows 18.03.0-ce, build 0520e24302. My computer specs (CPU and memory):
i5-4670K
24 GB of RAM
My Docker default machine resource allocation is the following:
Allocated RAM: 10 GB
Allocated vCPUs: 4
My Node application runs on Alpine 3.8 and Node.js 11.4 and mostly makes HTTP requests every 2-3 seconds.
When I deploy 20 containers, everything runs like a charm: the application does its job, and I can see activity on every container through the logs and activity stats.
The problem comes when I deploy more than 20 containers: some of the previously deployed containers stop their activity (0% CPU usage, logs frozen). When everything is deployed (30 containers), Docker starts blocking the activity of some containers and later unblocks them in order to block others (the blocking/unblocking looks random, almost sequential). I waited to see what would happen: some of the containers are able to resume their activity, while others are stuck forever (still running, but with no more activity).
It's important to note that I applied the following resource restrictions to each of my containers:
MemoryReservation: 160 MB
Memory soft limit: 160 MB
NanoCPUs: 250000000 (0.25 CPUs)
I had to increase my Docker machine's resource allocation and decrease each container's allocation because Docker was using almost 100% of my CPU; maybe I made a mistake in my configuration. I tried tweaking those values, but no success: I still have containers freezing.
I'm kind of lost right now.
Any help would be appreciated, even a little one. Thank you in advance!
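One thing worth checking (an observation from the numbers in the question, not a diagnosis): 30 containers at 0.25 CPU each ask for more CPU than the 4 vCPUs allocated to the Docker VM, so the scheduler has to starve some containers. A back-of-envelope sketch:

```shell
# Assumptions taken from the question: 30 containers, NanoCPUs=250000000 each,
# and a Docker VM with 4 vCPUs allocated.
containers=30
cpus_each=0.25                       # 250000000 NanoCPUs = 0.25 CPU
vm_vcpus=4
total=$(awk -v n="$containers" -v c="$cpus_each" 'BEGIN{print n*c}')
echo "requested ${total} CPUs, available ${vm_vcpus}"
```

At 20 containers the request is 5 CPUs, already slightly over the allocation, but close enough that time-slicing hides it; at 30, the oversubscription (7.5 vs 4) is large enough that some containers can look frozen while others get scheduled.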
I have a very weird memory issue (which is what a lot of people will most likely say ;-)) with Spark running in standalone mode inside a Docker container.
Our setup is as follows: we have a Docker container in which a Spring Boot application runs Spark in standalone mode. This Spring Boot app also contains a few scheduled tasks (managed by Spring) that trigger Spark jobs. The Spark jobs scrape a SQL database, shuffle the data a bit, and then write the results to a different SQL table (writing the results doesn't go through Spark). Our current data set is very small (the table contains a few million rows).
The problem is that the Docker host (a CentOS VM) running the container crashes after a while because its memory gets exhausted. I have currently limited Spark's memory usage to 512 MB (I set both executor and driver memory), and in the Spark UI I can see that the largest job only takes about 10 MB of memory. I know that Spark runs best with 8 GB of memory or more available; I have tried that as well, but the results are the same.
After digging a bit further, I noticed that Spark eats up all the buffer/cache memory on the machine. After clearing this manually by forcing Linux to drop caches (echo 2 > /proc/sys/vm/drop_caches, which clears the dentries and inodes), the cache usage drops considerably, but if I don't keep doing this regularly the cache usage slowly climbs until all memory is used as buffer/cache.
Does anyone have an idea what I might be doing wrong / what is going on here?
Big thanks in advance for any help!
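For what it's worth, page cache is normally reclaimable on demand, so high buff/cache by itself shouldn't crash a host; the number to watch is MemAvailable (the kernel's estimate of what it can free), not MemFree. A small sketch for checking this on the CentOS VM (assumes a Linux /proc):

```shell
# MemAvailable accounts for reclaimable page cache; MemFree does not.
# If MemAvailable stays healthy while buff/cache grows, the cache is not the killer.
grep -E '^(MemTotal|MemFree|MemAvailable|Cached):' /proc/meminfo
```

If MemAvailable really does trend toward zero, something the kernel cannot reclaim is pinning memory — often a cgroup limit or anonymous memory growing outside the JVM heap — rather than the page cache itself.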
I am trying to better understand scaling a Node.js server on Heroku. I have an app that handles large amounts of data and have been running into some memory issues.
If a Node.js server is upgraded to a 2x dyno, does this mean my application will automatically be able to handle up to 1 GB (1,024 MB) of RAM on a single thread? My understanding is that a single Node thread has a memory limit of ~1.5 GB, which is above the limit of a 2x dyno.
Now, let's say I upgrade to a performance-M dyno (2.5 GB of memory): would I need to use clustering to take full advantage of the 2.5 GB?
Also, if a single request to my Node.js app asks for a large amount of data and, while being processed, exceeds the memory allocated to that cluster worker, will the process use some of the memory allocated to another worker, or will it just throw an error?
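As a sketch of the sizing involved (the 80% headroom figure is my own assumption, not a Heroku recommendation): V8's default old-space limit is well below 2.5 GB, so on a performance-M dyno a single process won't use the whole dyno unless you raise the flag explicitly; clustering is the alternative when you want the memory spread across workers instead.

```shell
# Hypothetical sizing for a performance-M dyno (2.5 GB), leaving ~20% headroom
# for the stack, buffers, and other non-heap memory. server.js is a placeholder.
dyno_ram_mb=2560
heap_mb=$(( dyno_ram_mb * 80 / 100 ))
echo "node --max-old-space-size=${heap_mb} server.js"
```

On exceeding its heap limit, a worker does not borrow from its siblings; V8 aborts that process with an out-of-memory error, so the request fails rather than spilling over.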
I am new to the ELK stack; I just installed it to test-drive it for our production systems' log management and started pushing logs (IIS & Event) from 10 Windows VMs using nxlog.
Since the installation, I have been receiving around 25K hits per 15 minutes according to my Kibana dashboard. The size of /var/lib/elasticsearch/ has grown to around 15 GB in just 4 days.
I am facing serious performance issues: the Elasticsearch process is eating up all my CPU and around 90% of my memory.
The Elasticsearch service got stuck previously, and /etc/init.d/elasticsearch stop/start/restart wasn't even working. The process kept running even after I tried to kill it with the kill command. A system reboot brought the machine back to the same condition. I ended up deleting all the indices with a curl command, and now I am able to restart Elasticsearch.
I am using a standard A3 Azure instance (7 GB RAM, 4 cores) for this ELK setup.
Please guide me in tuning my ELK stack to achieve good performance. Thanks.
Since you are using 7 GB of RAM, your JVM heap size for Elasticsearch should be no more than 3.5 GB (half of the available RAM).
For more information, read the Elasticsearch documentation on heap sizing.
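A sketch of that rule of thumb (half of RAM, capped at roughly 31 GB so compressed object pointers stay enabled). The ES_HEAP_SIZE variable shown is how older 1.x/2.x releases set the heap; on current releases you would set -Xms/-Xmx in jvm.options instead:

```shell
# Heap sizing rule of thumb, using the 7 GB A3 instance from the question.
ram_gb=7
heap=$(awk -v r="$ram_gb" 'BEGIN { h = r / 2; if (h > 31) h = 31; print h }')
echo "ES_HEAP_SIZE=${heap}g"
```

Setting the minimum and maximum heap (-Xms and -Xmx) to the same value also avoids resize pauses, which matters on a box that is already CPU-starved.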