GridGain node startup failure - java.net.ConnectException during node startup

We have a compute grid prototype (GG 6.5.5) that works fine on a local machine (Win7), but when it is deployed on Windows Server 2008 R2 SP2 even a simple node startup fails.
The behavior on the server:
During node startup a Java socket exception (see below) is thrown several times.
After the connection attempts stop (and the exceptions with them), nothing seems to happen for 5-10 minutes.
After these 5-10 minutes the node sometimes does come up, joins the grid and is able to receive a task. We couldn't establish a pattern in this behavior.
At first we suspected that the issue might be caused by a blocked or already used port, so we changed the ports in the config file, but that didn't resolve the issue.
In the console output we get a notification from GG that it wasn't fully tested on "Windows Server 2008 R2 SP2". Does this mean that GridGain is not compatible with this OS?
In the future the grid will include Linux machines as well. Is there a list of supported and incompatible Linux versions, as well as other OSes?
It is important to mention that the server has no internet access. Since GG attempts to check on startup whether a new version is available, might that be the cause of the issue? No firewall software is installed.
Is it possible to disable this new version check (and possibly some other checks) in order to speed up the node startup process?
I hope there is a solution, many thanks in advance!
The exception:
2015-01-08 17:17:10,078 ERROR [main]: Exception on direct send: Connection refused: connect
java.net.ConnectException: Connection refused: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.openSocket(GridTcpDiscoverySpi.java:2098)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.sendMessageDirectly(GridTcpDiscoverySpi.jav
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.sendJoinRequestMessage(GridTcpDiscoverySpi.
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.joinTopology(GridTcpDiscoverySpi.java:1599)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.spiStart0(GridTcpDiscoverySpi.java:1084)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.spiStart(GridTcpDiscoverySpi.java:982)
at org.gridgain.grid.kernal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:220)
at org.gridgain.grid.kernal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:38
at org.gridgain.grid.kernal.GridKernal.startManager(GridKernal.java:1559)
at org.gridgain.grid.kernal.GridKernal.start(GridKernal.java:756)
at org.gridgain.grid.kernal.GridGainEx$GridNamedInstance.start0(GridGainEx.java:1949)
at org.gridgain.grid.kernal.GridGainEx$GridNamedInstance.start(GridGainEx.java:1289)
at org.gridgain.grid.kernal.GridGainEx.start0(GridGainEx.java:832)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:759)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:677)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:524)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:494)
at org.gridgain.grid.GridGain.start(GridGain.java:314)
at org.gridgain.grid.startup.cmdline.GridCommandLineStartup.main(GridCommandLineStartup.java:293)

I think the issue you are getting has nothing to do with the operating system, be that Windows or Linux. Most likely you have a firewall enabled somewhere, either locally on your operating system or remotely, and this firewall is blocking traffic in one direction.
Try disabling all software firewalls and see if the behavior improves. If it does, you can then try re-enabling the firewall and fixing its settings.
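One quick way to test the firewall theory is to probe the discovery port from the failing server. The following is a minimal Python sketch, not part of GridGain: the function name `probe` is made up for illustration, and any host and port you pass in are your own values (take them from your config file). A "refused" result means something answered and rejected the connection (nothing listening on that port, or an active reject), while "timeout" usually suggests packets are being dropped silently.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a plain TCP connect and classify the outcome.

    Hypothetical helper for illustration only; it knows nothing about
    the GridGain protocol, it just checks basic TCP reachability.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition as java.net.ConnectException: Connection refused
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout
        return "timeout"
    except OSError as exc:
        # Anything else (unreachable network, name resolution failure, ...)
        return f"error: {exc}"
```

For example, `probe("10.0.0.12", 47500)` would check a hypothetical peer node; both the address and the port number here are assumptions and should be replaced with the addresses and ports configured for discovery in your setup. Running it in both directions between two hosts helps spot a one-way block.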

Alex_V,
Can you please provide your configuration file?
Please also provide the full log of the starting node: run ggstart.bat -v ...
or add -DGRIDGAIN_QUIET=false to the JVM properties.
From the stack trace you provided I see that the exception happens on start. Are you able to start nodes on Win 2008 at all? How many hosts are there? Are they in one network, or is routing between them configured correctly?
When you see the node frozen, can you please take a thread dump and post it here?
The log message that GridGain is not fully tested with Windows Server does not mean incompatibility. Moreover, I would expect it to work. It is just not tested as thoroughly as other Windows systems for now.
A single topology may include Windows, Mac and Linux machines with no restrictions or performance impact. Almost all popular Linux distributions are supported.
You can skip the version check by adding -DGRIDGAIN_UPDATE_NOTIFIER=false to the JVM properties, but I don't think it causes any issues.

Related

Unexpected Disconnection with Code 1006 on Windows Server Hosted on Azure

My application does client authorization over a WebSocket connection using ws#7, but after several minutes it suddenly gets disconnected with error code 1006.
Interestingly, it works on AWS Windows Server instances but not on Azure instances or VMware VMs. I assume some WebSocket-related configuration has to be done before installing a Node-based application, but the main question is what I have to configure in order to move forward.
A 1006 error usually happens when there is a timeout. In the library you are using, the ws timeout is 30 seconds: https://github.com/websockets/ws/blob/4f293a8726092c75539287dd07358afaf151a2e5/lib/websocket.js
Check whether there is a gateway or something else in front of the VM with a timeout less than or equal to the ping interval that ws automatically uses from the client.
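The comparison suggested above can be written down as a small check. This is only an illustrative Python sketch: the function name, the hop names and the timeout values are all hypothetical, and you would have to look up the real idle timeouts of whatever sits between client and server.

```python
def hops_that_may_drop(ping_interval_s: float, idle_timeouts_s: dict) -> list:
    """Return the names of intermediaries whose idle timeout is less than
    or equal to the keepalive (ping) interval, i.e. the ones that could
    close the connection before the next ping arrives."""
    return sorted(
        name
        for name, idle_timeout in idle_timeouts_s.items()
        if idle_timeout <= ping_interval_s
    )


# Hypothetical values: a 30 s ping interval checked against three hops.
suspects = hops_that_may_drop(
    30,
    {"load_balancer": 240, "reverse_proxy": 30, "firewall": 10},
)
print(suspects)
```

Any hop reported here is a candidate for either a longer idle timeout or a shorter ping interval on the application side.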
You can usually see these automatically generated ping messages in Firefox with the F12 tools, in the network tab. They do not show up in Chrome or Edge, but they happen there as well.
I had a similar problem on my Windows machine when trying to connect to a server using Visual Studio Code. I reset the routes and rebooted the machine, and that solved the issue.
To reset use:
route -f

TortoiseSVN Error: Could not send request body: an existing connection was forcibly closed by the remote host

Let me preface this by saying I have basically 0 knowledge of web development. That being said, I'll still try to provide you with as much information as I possibly can. Our client is using IIS7 on a Windows Server 2008 R2 machine. The TortoiseSVN error they're getting is this:
Error: Could not send request body: an existing connection was forcibly closed by the remote host.
Using the powers of Google, it seems that there are two possible things that could be occurring here. As it is a 4GB file, I've seen people mention that it could be a configuration issue (the timeout could be too short, or I might need to enable a setting somewhere to allow committing larger files) or that it could be a network issue. It might be useful to note that they can commit smaller files.
I've already tried disabling the firewall, as well as the antivirus, on the server and having them retry, but that didn't work. They are trying to upload from a desktop to the server, and they are on the same network through a gigabit switch. I'm sure I'm missing useful information for you guys, but I'm a total noob to web dev, their setup, and actually understanding what they're trying to do. If you need any more information from me I'll be glad to provide it.
The problem could be overly strict timeout options configured in Apache2's reqtimeout module. I simply disabled it:
a2dismod reqtimeout
/etc/init.d/apache2 restart
Credit to: https://serverfault.com/questions/297562/svn-https-problem-could-not-read-status-line-connection-was-closed-by-ser

QSslSocket timeouts in Ubuntu Server, but not in Desktop

We have a problem with our Qt-based production server for our business application. As the total number of SSL connections increases over time, some clients do not manage to connect at all.
QSslSocket::waitForEncrypted() starts to fail with no QSslError, regardless of what timeout was set. There are more than ~100 active connections when this problem starts to kick in.
So there are ~170 connections, twice as many threads, and "lsof" reports a little more than 1000 open files (we had to increase the file "ulimit" for that).
It does not look like a client problem, since the IPs that fail and reconnect change over time (some get in successfully, but then others don't).
As mentioned, this happens on Ubuntu Server (Zentyal 10.04 and "vanilla" 9.10), but does NOT happen on Ubuntu Desktop 9.10.
Everything runs inside VMware ESX 4.1, and the systems were tested with the same resources attached. System load stays below 1.0. The daemon runs with root permissions.
It looks like it has something to do with "server"/"desktop" kernel or other configuration differences, but I couldn't tell what exactly would keep an SSL connection from completing its handshake in the "server editions".
We are using Qt 4.5.3 compiled by ourselves.
EDIT: after all, it's the same on any Linux I tried. It feels like some kind of socket limit per process, which is about 1016 - other_opened_files. I'll try to create a new question about that.
EDIT 2: It's a select() and FD_SETSIZE limit problem...
The problem is that Qt uses select(), which is limited by the FD_SETSIZE macro to a maximum number of selected sockets/files. I had to change the FD_SETSIZE value inside /usr/include/bits/typesizes.h before compiling libQtNetwork and libQtCore.
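The arithmetic behind that limit can be sketched in a few lines of Python. This is an illustration only, not Qt code: the function names are made up, and FD_SETSIZE is assumed to be 1024, which is the usual value on Linux with glibc. select() can only watch descriptors numbered below FD_SETSIZE, so every descriptor already used for log files, pipes, libraries and so on reduces the number of sockets that still fit.

```python
FD_SETSIZE = 1024  # assumed value; typical for Linux/glibc builds


def usable_with_select(fd: int, fd_setsize: int = FD_SETSIZE) -> bool:
    """A descriptor can be placed in an fd_set only if 0 <= fd < FD_SETSIZE."""
    return 0 <= fd < fd_setsize


def max_select_sockets(other_open_fds: int, fd_setsize: int = FD_SETSIZE) -> int:
    """Upper bound on sockets that select() can still watch once
    `other_open_fds` descriptors are taken by non-socket files."""
    return max(0, fd_setsize - other_open_fds)


# Hypothetical process with 8 descriptors in use before any client connects.
print(max_select_sockets(8))
print(usable_with_select(1023), usable_with_select(1024))
```

This would also explain why raising the "ulimit" alone did not help: the process may open more files, but descriptors numbered at or above FD_SETSIZE cannot be passed to select().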

Watir problem: ECONNABORTED

I happily use Watir (actually, FireWatir) on 3 computers.
On only one of them I frequently get this issue:
C:/Program Files/Ruby192/lib/ruby/gems/1.9.1/gems/firewatir-1.6.7/lib/firewatir/jssh_socket.rb:63:in `recv': An established connection was aborted by the software in your host machine. - recvfrom(2) (Errno::ECONNABORTED)
This happens at random moments during a script.
What is the cause, and what can I do to solve it?
I received this error when running an exceptionally long test using Mechanize as the browser. I believe it points to a memory error/overflow, or network overflow, but I have not been able to confirm it.

Is activemq reliable?

We have put ActiveMQ on a fresh server, configured it to use 'kahadb' (the preferred store, as we read) and set it to allow the file to expand to 2 GB.
Then, when we put load on the queue (+- 500/sec), ActiveMQ crashes within a few minutes.
When ActiveMQ tries to restart, it can't because the db is corrupt:
2010-11-29 13:00:50,359 | ERROR | Failed to start ActiveMQ JMS Message Broker. Reason:
java.io.EOFException | org.apache.activemq.broker.BrokerService | WrapperSimpleAppMain
java.io.EOFException
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at org.apache.kahadb.page.PageFile.readPage(PageFile.java:792)
at org.apache.kahadb.page.Transaction.load(Transaction.java:411)
Only by deleting the DB and letting it fix itself using the journal is it up again, only to crash again after a few minutes.
Anyone else having these reliability issues?
ActiveMQ (5.4.1) is installed on Win2003, with 64-bit Java (1.6.0_22)
The load is being done by 4 webservers running PHP using Stomp.
This is a known issue for 5.4.1. It's fixed and available in 5.4.2 release that should go out any day now. You can test release candidate from here: https://repository.apache.org/content/repositories/orgapacheactivemq-023/org/apache/activemq/apache-activemq/5.4.2/
I also noticed this issue (bad performance and a lot of crashing under high traffic from several machines). It's indeed fixed in the latest release, but I would suggest downgrading to 5.3.2 on production systems.
