MariaDB connection very slow in NodeJS - node.js

I'm currently using MariaDB version 10.5.12 and the mariadb-connector-nodejs package to interact with it from NodeJS.
The pool is created as follows:
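// The connector is published on npm as "mariadb"
const MariaDB = require("mariadb");

const pool = MariaDB.createPool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASS,
  database: process.env.DB_DATABASE,
  connectionLimit: 25 // maximum number of pooled connections
});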
A simple request such as connection.query("SELECT * FROM Users");, where the Users table contains only 5 users, can take anywhere between 90ms and 650ms.
Profiling the query in MariaDB says the query itself only takes a few nanoseconds.
The time a request takes seems to be pretty random: the first request could take 100ms and the second one, issued right after on the same connection, could take 600ms, which I feel is way too high for a query that takes less than a millisecond to process.
I'm unable to find where the issue is coming from. I have applied all the changes mysqltuner recommended, and I've tried disabling name resolution with no visible speedup (I assume the lookups are being cached, as all requests come from the same IP).

top indicates there is NO Swap space available.
Consider enabling 6G of swap space to survive busy situations with minimal delay and a surviving system.
Please share the code generally used to Connect, Process, Close connections.
Your threads_connected count indicates the Close is being missed and has left 83 threads connected in 10 days.
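As a rough illustration of the Connect, Process, Close pattern asked about above, here is a minimal sketch that always returns the connection to the pool and also records how long the pool acquisition and the query round trip each take. It assumes `pool` is the object returned by MariaDB.createPool() in the question; `timedQuery` is a hypothetical helper name, not part of the connector.

```javascript
// Sketch only: `pool` is assumed to come from MariaDB.createPool().
// timedQuery is a hypothetical helper, not a connector API.
async function timedQuery(pool, sql, params = []) {
  const t0 = process.hrtime.bigint();
  let conn;
  try {
    conn = await pool.getConnection(); // wait for a pooled connection
    const t1 = process.hrtime.bigint();
    const rows = await conn.query(sql, params); // round trip to the server
    const t2 = process.hrtime.bigint();
    return {
      rows,
      acquireMs: Number(t1 - t0) / 1e6, // time spent getting a connection
      queryMs: Number(t2 - t1) / 1e6, // time spent on the query itself
    };
  } finally {
    // Always hand the connection back, even if the query throws.
    if (conn) await conn.release();
  }
}
```

Calling `await timedQuery(pool, "SELECT * FROM Users")` a few times in a row should show whether the delay sits in acquiring a connection or in the query round trip, which narrows down where to look next.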
Suggestions to consider for your my.cnf [mysqld] section:
log_error=/var/log/mysql/mariadb-error.log # from 0 to allow viewing ERRORS detected.
innodb_max_dirty_pages_pct_lwm=1 # from 0 percent to enable pre-flushing
innodb_max_dirty_pages_pct=1 # from 90 percent to minimize innodb_buffer_pool_pages_dirty of 367.
innodb_adaptive_hash_index=ON # from OFF to minimize deadlocks
max_connect_errors=10 # from 100 to frustrate hackers/crackers after 10 attempts.
connect_timeout=30 # from 10 seconds to be more tolerant of connection attempts.
Observations:
There were 120,604 connections in 10 days and 28,984 aborted_connects, about 1/4 of the events. Either you have a lot of unhappy people trying to use your system, or they are hackers trying to break in.

Related

How to determine Redis memory leak?

Our redis servers are, since yesterday, gradually (200MB/hour) using more memory, while the amount of keys (330K) and their data (132MB redis-rdb-tools) stay about the same.
Output of redis-cli info shows 6.89G used memory?!
redis_version:2.4.10
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.6
process_id:3437
uptime_in_seconds:296453
uptime_in_days:3
lru_clock:1905188
used_cpu_sys:8605.03
used_cpu_user:1480.46
used_cpu_sys_children:1035.93
used_cpu_user_children:3504.93
connected_clients:404
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:7400076728
used_memory_human:6.89G
used_memory_rss:7186984960
used_memory_peak:7427443856
used_memory_peak_human:6.92G
mem_fragmentation_ratio:0.97
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:1672
bgsave_in_progress:0
last_save_time:1403172198
bgrewriteaof_in_progress:0
total_connections_received:3616
total_commands_processed:127741023
expired_keys:0
evicted_keys:0
keyspace_hits:18817574
keyspace_misses:8285349
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1619791
vm_enabled:0
role:slave
master_host:***BLOCKED***
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
db0:keys=372995,expires=372995
db6:keys=68399,expires=68399
The problem started when we updated our (.net) client code from BookSleeve 1.1.0.4 to ServiceStack v3.9.71 to prepare for an upgrade to Redis 2.8. But a lot of other stuff was updated too. And our session state store (also redis, but with harbour client) does not show the same symptoms.
Where is all that Redis memory going? How can I troubleshoot its usage?
Edit: I just restarted this instance and memory returned to 350M and is now climbing again. The top 10 largest objects are still the same size, ranging from 100K to 25M for nr 1. The number of keys has dropped to 270K (330K earlier).
Here are some sources of "hidden" memory consumption in Redis:
Marc already mentioned the buffers maintained by the master to feed the slave. If a slave is lagging behind its master (because it runs on a slower box for instance), then some memory will be consumed on the master.
When long running commands are detected, Redis logs them in the SLOWLOG area, which takes some memory. You may want to use the SLOWLOG LEN command to check the number of records you have here.
Communication buffers can also take memory. As far as I remember, with old versions of Redis (and 2.4 is quite old - you should really upgrade), they were unbounded, meaning that if you transfer a big object at some point, the communication buffer associated with this client connection will grow and never shrink. If there are many clients occasionally dealing with large objects, this could be a possible explanation. If you use commands retrieving very large data from Redis (in one shot), it can be an explanation as well. For instance, a simple KEYS * command applied on a Redis server storing millions of keys will consume a significant amount of memory.
You mentioned that you have objects as big as 25 MB. You have 404 client connections; if each of them needs to access such an object at some point in time, it will consume about 10 GB of memory.
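The worst-case figure above is plain multiplication. A small sketch, using the 404 connections and 25 MB object size from the question and assuming every connection ends up holding one buffer of that size (`worstCaseBufferMB` is a hypothetical name):

```javascript
// Worst case: every client connection keeps a buffer as large as the biggest object.
function worstCaseBufferMB(clients, largestObjectMB) {
  return clients * largestObjectMB;
}

const totalMB = worstCaseBufferMB(404, 25); // figures from the question
console.log(`${totalMB} MB, about ${(totalMB / 1000).toFixed(1)} GB`);
// prints "10100 MB, about 10.1 GB"
```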

Give reads priority over writes in Elasticsearch

I have an EC2 server running Elasticsearch 0.9 with an nginx server for read/write access. My index has about 750k small-medium documents. I have a pretty continuous stream of minimal writes (mainly updates) to the content. The speed and consistency I get with search are fine with me, but I have some sporadic timeout issues with multi-get (/_mget).
On some pages in my app, our server will request a multi-get of a dozen to a few thousand documents (this usually takes less than 1-2 seconds). The requests that fail, fail with a 30,000 millisecond timeout from the nginx server. I am assuming this happens because the index was temporarily locked for writing/optimizing purposes. Does anyone have any ideas on what I can do here?
A temporary solution would be to lower the timeout and return a user friendly message saying documents couldn't be retrieved (however they still would have to wait ~10 seconds to see an error message).
Some of my other thoughts were to give read priority over writes. Anytime someone is trying to read a part of the index, don't allow any writes/locks to that section. I don't think this would be scalable and it may not even be possible?
Finally, I was thinking I could have a read-only alias and a write-only alias. I can figure out how to set this up through the documentation, but I am not sure if it will actually work like I expect it to (and I'm not sure how I can reliably test it in a local environment). If I set up aliases like this, would the read-only alias still have moments where the index was locked due to information being written through the write-only alias?
I'm sure someone else has come across this before. What is the typical solution to make sure a user can always read data from the index with higher priority than writes? I would consider increasing our server power, if required. Currently we have 2 m2.xlarge EC2 instances: one is the primary and the other is the replica, each with 4 shards.
An example dump of cURL info from a failed request (with an error of Operation timed out after 30000 milliseconds with 0 bytes received):
{
"url":"127.0.0.1:9200\/_mget",
"content_type":null,
"http_code":100,
"header_size":25,
"request_size":221,
"filetime":-1,
"ssl_verify_result":0,
"redirect_count":0,
"total_time":30.391506,
"namelookup_time":7.5e-5,
"connect_time":0.0593,
"pretransfer_time":0.059303,
"size_upload":167002,
"size_download":0,
"speed_download":0,
"speed_upload":5495,
"download_content_length":-1,
"upload_content_length":167002,
"starttransfer_time":0.119166,
"redirect_time":0,
"certinfo":[
],
"primary_ip":"127.0.0.1",
"redirect_url":""
}
After more monitoring using the Paramedic plugin, I noticed that I would get timeouts when my CPU would hit ~80-98% (no obvious spikes in indexing/searching traffic). I finally stumbled across a helpful thread on the Elasticsearch forum. It seems this happens when the index is doing a refresh and large merges are occurring.
Merges can be throttled at a cluster or index level, and I've lowered indices.store.throttle.max_bytes_per_sec from the default 20mb to 5mb. This can be done at runtime with the cluster update settings API.
PUT /_cluster/settings HTTP/1.1
Host: 127.0.0.1:9200
{
"persistent" : {
"indices.store.throttle.max_bytes_per_sec" : "5mb"
}
}
So far Paramedic is showing a decrease in CPU usage, from an average of ~5-25% down to an average of ~1-5%. Hopefully this helps me avoid the 90%+ spikes that were locking up my queries before. I'll report back by selecting this answer if I don't have any more problems.
As a side note, I guess I could have opted for more balanced EC2 instances (rather than memory-optimized). I think I'm happy with my current choice, but my next purchase will also take more CPU into account.

Scaling socket.io broadcast

I want to broadcast a 1KB message with socket.io (node.js framework) every 3 seconds to a large number of users. What is the best way to scale it (1 user = 1 'listener' with socket.on('periodicMessage',callback))?
There is no other CPU usage (one read of an external database which is filled by another external module every 3 seconds), so I am trying to find out whether a simple Heroku server can broadcast a message to 10 000, 100 000, 1 million or more users.
We have easily scaled to tens of thousands of 'listeners' on a single node.js process. I am not sure how many you actually can scale to, given that each socket is a file descriptor, and the plain vanilla kernel can have 65K fd's for each process, no more.
CPU would not be a problem. If anything, upload bandwidth would be: 1KB * 50K users / 3 sec = 50MB / 3 sec, or roughly 16.7MB/s upstream. I never measured Heroku, so I don't know if they sustain this. I suppose they do, but maybe they limit you, since they are paying Amazon for this, after all.
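The same bandwidth estimate can be repeated for the user counts mentioned in the question. This is only a back-of-the-envelope sketch: it assumes 1 KB = 1000 bytes, ignores protocol and framing overhead, and `broadcastMBPerSec` is a hypothetical helper name.

```javascript
// Upstream bandwidth in MB/s (1 MB = 1,000,000 bytes), payload only.
function broadcastMBPerSec(messageBytes, users, intervalSec) {
  return (messageBytes * users) / intervalSec / 1e6;
}

// 1 KB message every 3 seconds, for several audience sizes.
for (const users of [10000, 50000, 100000, 1000000]) {
  console.log(users, broadcastMBPerSec(1000, users, 3).toFixed(1), "MB/s");
}
```

At 50 000 users this gives about 16.7 MB/s, matching the estimate above, and it grows linearly with the number of listeners.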

Node.js struggling with lots of concurrent connections

I'm working on a somewhat unusual application where 10k clients are precisely timed to all try to submit data at once, every 3 mins or so. This 'ab' command fairly accurately simulates one barrage in the real world:
ab -c 10000 -n 10000 -r "http://example.com/submit?data=foo"
I'm using Node.js on Ubuntu 12.04 on a Rackspace Cloud VPS instance to collect these submissions. However, I'm seeing some very odd behavior from Node, even when I remove all my business logic and turn the HTTP request into a no-op.
When the test gets about 90% done, it hangs for a long period of time. Strangely, this happens consistently at 90% - for c=n=10k, at 9000; for c=n=5k, at 4500; for c=n=2k, at 1800. The test actually completes eventually, often with no errors. But both ab and node logs show continuous processing up till around 80-90% of the test run, then a long pause before completing.
When node is processing requests normally, CPU usage is typically around 50-70%. During the hang period, CPU goes up to 100%. Sometimes it stays near 0. Between the erratic CPU response and the fact that it seems unrelated to the actual number of connections (only the % complete), I do not suspect the garbage collector.
I've tried this running 'ab' on localhost and on a remote server - same effect.
I suspect something related to the TCP stack, possibly involving closing connections, but none of my configuration changes have helped. My changes:
ulimit -n 999999
When I listen(), I set the backlog to 10000
Sysctl changes are:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_orphans = 20000
net.ipv4.tcp_max_syn_backlog = 10000
net.core.somaxconn = 10000
net.core.netdev_max_backlog = 10000
I have also noticed that I tend to get this msg in the kernel logs:
TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
I'm puzzled by this msg since the TCP backlog queue should be deep enough to never overflow. If I disable syn cookies the "Sending cookies" goes to "Dropping connections".
I speculate that this is some sort of linux TCP stack tuning problem and I've read just about everything I could find on the net. Nothing I have tried seems to matter. Any advice?
Update: Tried with tcp_max_syn_backlog, somaxconn, netdev_max_backlog, and the listen() backlog param set to 50k with no change in behavior. Still produces the SYN flood warning, too.
Are you running ab on the same machine running node? If not do you have a 1G or 10G NIC? If you are, then aren't you really trying to process 20,000 open connections?
Also, if you are changing net.core.somaxconn to 10,000, do you have absolutely no other sockets open on that machine? If you do, then 10,000 is not high enough.
Have you tried using the nodejs cluster module to spread the open connections across several processes?
I think you might find this blog post, and also the previous ones, useful:
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/
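For the cluster suggestion above, a minimal sketch using Node's built-in cluster module might look like the following. The port, the backlog value and the function names are illustrative assumptions, and the handler is a no-op standing in for the real submission logic.

```javascript
const cluster = require("cluster");
const http = require("http");
const os = require("os");

// No-op handler standing in for the real /submit logic.
function handleRequest(req, res) {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("ok");
}

// Hypothetical entry point: forks one worker per CPU; all workers share the port.
function start(port = 8080, backlog = 10000, workers = os.cpus().length) {
  // isPrimary exists in newer Node versions; isMaster is the older name.
  const isPrimary = cluster.isPrimary ?? cluster.isMaster;
  if (isPrimary) {
    for (let i = 0; i < workers; i++) cluster.fork();
    // Replace a worker if it dies so capacity stays constant.
    cluster.on("exit", () => cluster.fork());
  } else {
    http.createServer(handleRequest).listen(port, "0.0.0.0", backlog);
  }
}
```

Calling `start()` from the main script would let incoming connections be distributed across the workers rather than handled by a single process. Whether this removes the stall at 90% is not guaranteed, but it is a cheap experiment.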

Ideal timeout period for dns lookup

In my Rails app I do a DNS lookup using the Ruby library resolv. If a site like dgdfgdfgdfg.com is entered, it takes too long to resolve, in some instances around 20 seconds (mostly for non-existent sites), and this slows the application down.
So I thought of introducing a timeout for the DNS lookup.
What would be the ideal timeout for the DNS lookup, so that resolution of real sites doesn't fail? Would something like 10 seconds be fine?
There's no IETF mandated value, although §6.1.3.3 of RFC 1123 suggests a value not less than 5 seconds.
Perl's Net::DNS and the command line dig utility do default to 5 seconds between retries. Some versions of the Microsoft resolver appear to default to 3 seconds.
You can run some tests with your users to find the right number, balancing responsiveness against performance.
You can also adjust the timeout dynamically depending on network traffic.
For example, for every successful lookup, you record how long it took to resolve. Then every hour (for example) you calculate the average and set the timeout to double that value (remember that "average" is, roughly speaking, "the middle"). This way, if your latency is high at some point, the timeout adjusts itself upward.
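The averaging scheme described above could be sketched as follows. `AdaptiveTimeout` is a hypothetical class; the 5 second starting value follows the RFC 1123 figure mentioned earlier, while the 1 second floor is an added assumption so that a run of very fast lookups cannot shrink the timeout too far.

```javascript
// Hypothetical helper: derives a DNS timeout from recent successful lookups.
class AdaptiveTimeout {
  constructor(initialMs = 5000, minMs = 1000) {
    this.timeoutMs = initialMs; // used until the first recalculation
    this.minMs = minMs; // assumed lower bound, not from the answer
    this.samples = [];
  }

  // Call after every successful lookup with its duration in milliseconds.
  record(durationMs) {
    this.samples.push(durationMs);
  }

  // Call periodically (for example hourly): timeout = 2 x average duration.
  recalculate() {
    if (this.samples.length === 0) return this.timeoutMs; // nothing new to learn from
    const sum = this.samples.reduce((a, b) => a + b, 0);
    const average = sum / this.samples.length;
    this.timeoutMs = Math.max(this.minMs, 2 * average);
    this.samples = []; // start a fresh window
    return this.timeoutMs;
  }
}
```

In practice `recalculate()` would be driven by a timer, and the resulting `timeoutMs` would be passed to whatever performs the lookup.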

Resources