We are running a Node server on GAE, and a few times a day, for no obvious reason, our server goes offline (sometimes it takes a few minutes to come back online).
Traffic is consistent throughout the day, and there is no exception in the logs that would explain a restart: no spike in requests and no unusual requests that could trigger it.
Log when it happens:
2020-04-18T23:48:51.881806Z GET /v1/util/example 304 35.262 ms - -
2020-04-18T23:50:17.119906Z [start] 2020/04/18 23:50:17.119185 Quitting on terminated signal
2020-04-18T23:50:17.175632Z [start] 2020/04/18 23:50:17.175267 Start program failed: user application failed with exit code -1 (refer to stdout/stderr logs for more detail): signal: terminated
2020-04-18T23:51:38.772388Z GET /v1/util/example 304 173 B 3.3 s Example-V2/3.1.13 (com.example.app; build:1; iOS 13.4.0) Alamofire/5.1.0
2020-04-18T23:51:38.786760Z GET /_ah/start 404 324 B 2.4 s Unknown
2020-04-18T23:51:39.529080Z [start] 2020/04/18 23:51:39.511828 No entrypoint specified, using default entrypoint: /serve
2020-04-18T23:51:39.529642Z [start] 2020/04/18 23:51:39.528742 Starting app
2020-04-18T23:51:39.529968Z [start] 2020/04/18 23:51:39.529100 Executing: /bin/sh -c exec /serve
2020-04-18T23:51:39.590085Z [start] 2020/04/18 23:51:39.589751 Waiting for network connection open. Subject:"app/invalid" Address:127.0.0.1:8080
2020-04-18T23:51:39.590571Z [start] 2020/04/18 23:51:39.590347 Waiting for network connection open. Subject:"app/valid" Address:127.0.0.1:8081
2020-04-18T23:51:39.764383Z [serve] 2020/04/18 23:51:39.763656 Serve started.
2020-04-18T23:51:39.764935Z [serve] 2020/04/18 23:51:39.764544 Args: {runtimeName:nodejs10 memoryMB:1024 positional:[]}
2020-04-18T23:51:39.766562Z [serve] 2020/04/18 23:51:39.765904 Running /bin/sh -c exec node server.js
2020-04-18T23:51:41.072621Z [start] 2020/04/18 23:51:41.071895 Wait successful. Subject:"app/valid" Address:127.0.0.1:8081 Attempts:296 Elapsed:1.481194491s
2020-04-18T23:51:41.072978Z Express server started on port: 8081
2020-04-18T23:51:41.073008Z [start] 2020/04/18 23:51:41.072411 Starting nginx
2020-04-18T23:51:41.085901Z [start] 2020/04/18 23:51:41.085451 Waiting for network connection open. Subject:"nginx" Address:127.0.0.1:8080
2020-04-18T23:51:41.132064Z [start] 2020/04/18 23:51:41.131572 Wait successful. Subject:"nginx" Address:127.0.0.1:8080 Attempts:9 Elapsed:45.911234ms
2020-04-18T23:51:41.170786Z GET /_ah/start 404 11.865 ms - 61
There is always more than 70% of memory free, so that should not be the issue. We only noticed very high CPU utilization when a restart occurs (roughly 10x higher than normal), and in our monitoring graphs you can clearly see when the restarts happen.
This is my app.yaml:
runtime: nodejs10
instance_class: B4
service: example-api
basic_scaling:
  max_instances: 1
  idle_timeout: 30m
handlers:
- url: .*
  secure: always
  script: auto
This is happening on our production server, so any help would be more than welcome.
Thanks!
Reading this document, it is mentioned that even though App Engine tries to keep basic and manual scaling instances running indefinitely, they are sometimes restarted for maintenance or may fail for other reasons. That is why keeping max_instances at 1 is not considered best practice: a single instance is exposed to every one of these failures. As mentioned in the other answer, I would also recommend increasing the number of instances, so the likelihood of several of them failing or being restarted at the same time is lower.
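For example, with the basic_scaling configuration from the question, this is a small change to app.yaml (3 is only an illustrative value; pick whatever fits your traffic and budget):
runtime: nodejs10
instance_class: B4
service: example-api
basic_scaling:
  max_instances: 3  # was 1; with several instances a single restart no longer takes the service offline
  idle_timeout: 30m
handlers:
- url: .*
  secure: always
  script: auto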
We had the same problem when we migrated our Ruby on Rails app to Google App Engine Standard a year ago. After emailing back and forth with Google Cloud Support, they suggested: "increasing the minimum number of instances will help because you will have more 'backup' instances."
At the time we had two instances, and since we upped it to three instances we have had no downtime related to unexpected server restarts.
We are still not sure why our servers are sometimes deemed unhealthy and restarted by App Engine, but having more instances can help you avoid downtime in the short run while you investigate the underlying issue.
I deployed my app to AWS EC2, so I need a way of always having this Node app running no matter what: if the server restarts or the app crashes, something always needs to restart the app.
The app needs to be running at all times. Right now I use npm start &; is that enough to restart my app?
I've tried to use systemd, but I got an error while starting the service I created:
sudo systemctl start nameX.service
Job for nameX.service failed because of unavailable resources or
another system error. See "systemctl status ...service" and
"journalctl -xeu ....service" for details.
sudo systemctl status nameX.service
nameX.service: Scheduled restart job, restart >
systemd[1]: Stopped My Node Server.
Failed to load environment file>
systemd[1]: ...service: Failed to run 'start' task: No >
systemd[1]: ...service: Failed with result 'resources'.
systemd[1]: Failed to start My Node Server.
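(The "Failed to load environment file" and "Failed with result 'resources'" lines usually mean the unit references a path that does not exist, for example an EnvironmentFile= or the ExecStart= binary. For comparison, a minimal unit for a Node app looks roughly like the sketch below; the service name, user, and paths are placeholders, not taken from the question.)
# /etc/systemd/system/nameX.service  (sketch only; adjust paths, user, and entry point)
[Unit]
Description=My Node Server
After=network.target

[Service]
Type=simple
User=ec2-user
WorkingDirectory=/home/ec2-user/myapp
# prefix the path with "-" so a missing env file is ignored instead of failing the unit
EnvironmentFile=-/home/ec2-user/myapp/.env
ExecStart=/usr/bin/node server.js
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
After editing the unit, sudo systemctl daemon-reload followed by sudo systemctl enable --now nameX.service starts it immediately, starts it on boot, and restarts it whenever it crashes.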
$ docker container ls --format "table {{.ID}}\t{{.Names}}\t{{.Ports}}" -a
CONTAINER ID NAMES PORTS
ae87d83af7d3 hopeful_engelbart
d13e260c4dec unruffled_bouman
db2c482de210 jenkinsci 0.0.0.0:8080->8080/tcp, 50000/tcp
cd201cbd413e xyz 0.0.0.0:5000->5000/tcp
c64c32ac68b8 pqr
$ docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ae87d83af7d3 442c97a73937 "/bin/bash" 11 minutes ago Exited (0) 9 minutes ago hopeful_engelbart
d13e260c4dec 442c97a73937 "/bin/bash" 27 minutes ago Exited (0) 24 minutes ago unruffled_bouman
db2c482de210 jenkins/jenkins:lts "/sbin/tini -- /usr/…" 3 days ago Up 41 minutes 0.0.0.0:8080->8080/tcp, 50000/tcp jenkinsci
cd201cbd413e 442c97a73937 "bash" 3 days ago Up 7 minutes 0.0.0.0:5000->5000/tcp xyz
c64c32ac68b8 442c97a73937 "bash" 3 days ago Exited (0) 2 days ago pqr
The outputs above show that port 5000 has been exposed (I hope).
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' xyz
172.17.0.3
Now, when I run this from the host machine:
wget -c 172.17.0.3:5000
--2019-12-30 16:26:44-- http://172.17.0.3:5000/
Connecting to 172.17.0.3:5000... failed: Connection refused.
What is the way to access that port since it is exposed and the container is running?
$ wget -c localhost:5000
--2019-12-30 16:41:57-- http://localhost:5000/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:5000... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.
--2019-12-30 16:41:58-- (try: 2) http://localhost:5000/
Connecting to localhost (localhost)|127.0.0.1|:5000... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.
First of all, check whether you have an application listening on that port inside your container. Just try to connect to it from inside the container:
docker exec xyz wget 127.0.0.1:5000
If that works, then you have a problem with port exposure; otherwise, there is no web server running inside your container.
And the error you get
Read error (Connection reset by peer) in headers.
seems to point to a problem in your web server rather than a connectivity issue.
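For a Node app like the ones elsewhere in this thread, a minimal sketch of a server that is reachable through a published container port might look like this (the port and route are assumptions, not taken from the question):
// server.js: bind explicitly to 0.0.0.0 so the app is not listening on 127.0.0.1 only
const express = require('express');
const app = express();

app.get('/', (req, res) => res.send('ok'));

// listen on all interfaces; with "docker run -p 5000:5000" this makes wget from the host succeed
app.listen(5000, '0.0.0.0', () => console.log('listening on 0.0.0.0:5000'));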
I had this problem running a Quarkus application with Docker.
I found a topic on the Docker forums describing a similar issue.
In that topic it is said:
I was using flask, by default it binds to localhost & 5000, so you have to specify:
app.run(host="0.0.0.0")
So, in my case, I guess the problem was in my application and not in the Docker network.
I added this property to the Java command that starts my application in my Dockerfile, and everything worked fine:
-Dquarkus.http.host=0.0.0.0
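For illustration, a rough sketch of where that flag goes in a Dockerfile; the base image and jar path are assumptions about a typical Quarkus fast-jar build, not the exact setup from this answer:
FROM eclipse-temurin:17-jre
COPY target/quarkus-app/ /deployments/
# bind the HTTP server to all interfaces so the published container port is reachable from the host
CMD ["java", "-Dquarkus.http.host=0.0.0.0", "-jar", "/deployments/quarkus-run.jar"]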
I tried to set up API Platform on my local machine to explore it.
I tried to perform all the operations according to API Platform's "Getting Started" page. So I downloaded the latest official distribution, which happens to be v2.4.2 (https://github.com/api-platform/api-platform/releases/tag/v2.4.2), and I started it using Docker.
However, I cannot access the administration backend at http://localhost:81; I only get "Unable to retrieve API documentation."
I searched for help at https://api-platform.com/docs/admin/getting-started/, but it describes steps that seem to be already done in the distribution.
How can I enable the admin component or debug what went wrong?
Edit (2019-04-14)
$ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
40a5d5213cfe quay.io/api-platform/nginx "nginx -g 'daemon of…" 45 hours ago Up 6 minutes 0.0.0.0:8080->80/tcp apiplatformdemo_api_1
d53711c0ba0c quay.io/api-platform/php "docker-entrypoint p…" 45 hours ago Up 6 minutes 9000/tcp apiplatformdemo_php_1
2d4eb8d09e3e quay.io/api-platform/client "/bin/sh -c 'yarn st…" 45 hours ago Up 6 minutes 0.0.0.0:80->3000/tcp apiplatformdemo_client_1
abe3e3b41810 quay.io/api-platform/admin "/bin/sh -c 'yarn st…" 45 hours ago Up 6 minutes 0.0.0.0:81->3000/tcp apiplatformdemo_admin_1
4596a7f81cd8 postgres:10-alpine "docker-entrypoint.s…" 45 hours ago Up 6 minutes 0.0.0.0:5432->5432/tcp apiplatformdemo_db_1
c805fc2f11c9 dunglas/mercure "./mercure" 45 hours ago Up 6 minutes 443/tcp, 0.0.0.0:1337->80/tcp apiplatformdemo_mercure_1
Edit 2 (2019-04-14)
It is worth mentioning that although the API component at http://localhost:8080 works, the HTTPS variant at https://localhost:8443 does not (connection refused if I try to telnet to it).
It also turned out that a message in the JS console had escaped my notice earlier: it reports a failed connection to https://localhost:8443. (It talks about CORS, but I think the real reason is that 8443 simply refuses the connection.) So although I opened the HTTP variant of the admin at http://localhost:81, it tried to access the API via HTTPS. What could be the reason HTTPS doesn't work?
Edit 3 (2019-04-15)
After looking into the docker-compose logs, I see that it is relevant that the Varnish container failed. h2-proxy depends on it, and it is h2-proxy that serves port 8443.
cache-proxy_1 | Error:
cache-proxy_1 | Message from VCC-compiler:
cache-proxy_1 | Expected return action name.
cache-proxy_1 | ('/usr/local/etc/varnish/default.vcl' Line 67 Pos 13)
cache-proxy_1 | return (miss);
cache-proxy_1 | ------------####--
cache-proxy_1 |
cache-proxy_1 | Running VCC-compiler failed, exited with 2
cache-proxy_1 | VCL compilation failed
apiplatform242_cache-proxy_1 exited with code 2
h2-proxy_1 | 2019/04/15 08:09:17 [emerg] 1#1: host not found in upstream "cache-proxy" in /etc/nginx/conf.d/default.conf:58
h2-proxy_1 | nginx: [emerg] host not found in upstream "cache-proxy" in /etc/nginx/conf.d/default.conf:58
apiplatform242_h2-proxy_1 exited with code 1
I solved this error by getting API Platform from a clone of the current master instead of downloading the tarball release version (2.4.2):
git clone https://github.com/api-platform/api-platform.git
docker-compose build
docker-compose up -d
Works like a charm!
I have set up a Node.js application and it's running on port 4999, but when I browse to the URL www.website.com:4999 I get a "This site can't be reached" error in Chrome and "Secure Connection Failed" in Firefox.
This is the output over SSH when starting the Node app:
[~/public_html/customer_portal]# gulp serv:prod
[13:48:50] Using gulpfile ~/public_html/customer_portal/gulpfile.js
[13:48:50] Starting 'ConcatScripts'...
[13:48:50] Starting 'ConcatCss'...
[13:48:50] Starting 'CopyAssets'...
[13:48:50] Finished 'ConcatCss' after 553 ms
[13:48:50] Starting 'UglyCss'...
[13:48:50] Finished 'CopyAssets' after 855 ms
[13:48:50] Finished 'UglyCss' after 322 ms
[13:48:50] Finished 'ConcatScripts' after 925 ms
[13:48:50] Starting 'UglyScripts'...
[13:49:08] Finished 'UglyScripts' after 18 s
[13:49:08] Starting 'Inject:PROD'...
[13:49:08] gulp-inject 1 files into index.build.ejs.
[13:49:08] gulp-inject 1 files into index.build.ejs.
[13:49:08] Finished 'Inject:PROD' after 218 ms
[13:49:08] Starting 'build:prod'...
[13:49:08] Finished 'build:prod' after 61 μs
[13:49:08] Starting 'serv:prod'...
[13:49:08] Finished 'serv:prod' after 48 ms
livereload[tiny-lr] listening on 35729 ...
Mon, 25 Jul 2016 03:49:09 GMT express-session deprecated undefined saveUninitialized option; provide saveUninitialized option at app.js:58:13
XXX service has been started at port: 4999 !!!
Just compiling the solution we derived from the comments on the OP's post.
So the OP has tested his Node.js application locally and now wants to expose it to the world wide web. The OP did not post the contents of his gulpfile, but I am guessing that he is using a development server spun up by gulp to serve his web page. Not impossible, but certainly not recommended.
A better replacement would be to use a real web server like nginx.
See:
https://nginx.org/en/docs/beginners_guide.html
Back to the original problem. The real reason the OP is hitting the "This site can't be reached" error is probably that his server does not have the required port open, in this case port 4999. A temporary workaround would be to update the gulpfile to host the application on port 80 instead.
However, I am still dubious about the error message, because I would have expected the OP to see something like "connection refused". Anyway, this is not important.
To sum up, the OP should consider fixing his problem by:
installing a real web server (such as nginx) on his machine, and
placing the Node application behind that web server as a reverse proxy (see the sketch after this list).
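A minimal reverse-proxy sketch, assuming nginx on the same machine and the Node app still listening on 127.0.0.1:4999 (the server name is a placeholder):
server {
    listen 80;
    server_name www.website.com;

    location / {
        # forward all traffic to the Node app running locally on port 4999
        proxy_pass http://127.0.0.1:4999;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
With this in place, visitors use plain http://www.website.com and only port 80 needs to be reachable from the outside.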
I have two problems running FreeSWITCH from systemd:
EDIT 2 - I have moved the slow start-up question here (Freeswitch pauses on check_ip at boot on centos 7.1), as although the two may be related it is probably better as a standalone question.
EDIT - I have noticed something else. Look at these lines, captured from the terminal output when running it from there. The gap is 4 minutes, but it has been around 10 minutes before. I noticed it because I was trying to find out why port 8021 was taking several minutes to accept the fs_cli connection. Why does this happen? It has never happened to me before, and I've installed loads of FS boxes. It does the same thing on both 1.7 and today's 1.6.
2015-10-23 12:57:35.280984 [DEBUG] switch_scheduler.c:249 Added task 1 heartbeat (core) to run at 1445601455
2015-10-23 12:57:35.281046 [DEBUG] switch_scheduler.c:249 Added task 2 check_ip (core) to run at 1445601455
2015-10-23 13:01:31.100892 [NOTICE] switch_core.c:1386 Created ip list rfc6598.auto default (deny)
I sometimes get double processes started. Here is my status output after such an occurrence:
# systemctl status freeswitch -l
freeswitch.service - freeswitch
Loaded: loaded (/etc/systemd/system/multi-user.target.wants/freeswitch.service)
Active: activating (start) since Fri 2015-10-23 01:31:53 BST; 18s ago
Main PID: 2571 (code=exited, status=0/SUCCESS); : 2742 (freeswitch)
CGroup: /system.slice/freeswitch.service
├─usr/bin/freeswitch -ncwait -core -db /dev/shm -log /usr/local/freeswitch/log -conf /usr/local/freeswitch/conf -run /usr/local/freeswitch/run
└─usr/bin/freeswitch -ncwait -core -db /dev/shm -log /usr/local/freeswitch/log -conf /usr/local/freeswitch/conf -run /usr/local/freeswitch/run
Oct 23 01:31:53 fswitch-1 systemd[1]: Starting freeswitch...
Oct 23 01:31:53 fswitch-1 freeswitch[2742]: 2743 Backgrounding.
and there are two processes running.
The PID file is sometimes not written fast enough for systemd to pick it up, but no matter how fast I check after seeing this message, the file is always there by the time I look:
Oct 23 02:00:26 arribacom-sbc-1 systemd[1]: PID file
/usr/local/freeswitch/run/freeswitch.pid not readable (yet?) after
start.
Now, in (2) everything seems to work ok, and I can shut down the freeswitch process using
systemctl stop freeswitch
without any issues, but in (1) it just doesn't seem to do anything.
I'm wondering if the two are related: FreeSWITCH may be reporting back to systemd that the program is running before it actually is, and systemd then either starts up another process or (sometimes) does not.
Can anyone offer any pointers? I have tried to email the FreeSWITCH users list, but despite being registered I simply cannot get any emails to appear on the list (but that's another problem).
* Update *
If I remove the -ncwait it seems to improve the double-process starting, but I still get the unreadable PID file warning, so I'm sure there's still an issue present, possibly around timing(?).
I'm on Centos 7.1, & my freeswitch version is
FreeSWITCH Version 1.7.0+git~20151021T165609Z~9fee9bc613~64bit (git
9fee9bc 2015-10-21 16:56:09Z 64bit)
and here's my freeswitch.service file (some things have been commented out until I understand what they do and what side effects they may have):
[Unit]
Description=freeswitch
After=syslog.target network.target
#
[Service]
Type=forking
PIDFile=/usr/local/freeswitch/run/freeswitch.pid
PermissionsStartOnly=true
ExecStart=/usr/bin/freeswitch -nc -core -db /dev/shm -log /usr/local/freeswitch/log -conf /u
ExecReload=/usr/bin/kill -HUP $MAINPID
#ExecStop=/usr/bin/freeswitch -stop
TimeoutSec=120s
#
WorkingDirectory=/usr/bin
User=freeswitch
Group=freeswitch
LimitCORE=infinity
LimitNOFILE=999999
LimitNPROC=60000
LimitSTACK=245760
LimitRTPRIO=infinity
LimitRTTIME=7000000
#IOSchedulingClass=realtime
#IOSchedulingPriority=2
#CPUSchedulingPolicy=rr
#CPUSchedulingPriority=89
#UMask=0007
#
[Install]
WantedBy=multi-user.target
In the current master branch, take the two files from the debian/ directory:
freeswitch-systemd.freeswitch.service -- should go as /lib/systemd/system/freeswitch.service
freeswitch-systemd.freeswitch.tmpfile -- should go as /usr/lib/tmpfiles.d/freeswitch.conf
You probably need to adapt the paths, or build FreeSWITCH to use standard Debian paths.
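As a rough sketch of the install steps (assuming a source checkout in ./freeswitch; the destination paths follow the file names above but may differ on your distribution):
cp freeswitch/debian/freeswitch-systemd.freeswitch.service /lib/systemd/system/freeswitch.service
cp freeswitch/debian/freeswitch-systemd.freeswitch.tmpfile /usr/lib/tmpfiles.d/freeswitch.conf
systemd-tmpfiles --create /usr/lib/tmpfiles.d/freeswitch.conf   # create the runtime directories now instead of waiting for a reboot
systemctl daemon-reload
systemctl enable --now freeswitch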