Is ActiveMQ reliable? - windows-server-2003

We have put ActiveMQ on a fresh server, configured it to use KahaDB (the preferred store, from what we read) and set it to allow the store to grow to 2 GB.
When we put load on the queue (roughly 500 messages/sec), ActiveMQ crashes within a few minutes.
When ActiveMQ tries to restart, it can't, because the database is corrupt:
2010-11-29 13:00:50,359 | ERROR | Failed to start ActiveMQ JMS Message Broker. Reason:
java.io.EOFException | org.apache.activemq.broker.BrokerService | WrapperSimpleAppMain
java.io.EOFException
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at org.apache.kahadb.page.PageFile.readPage(PageFile.java:792)
at org.apache.kahadb.page.Transaction.load(Transaction.java:411)
Only by deleting the DB and letting it rebuild itself from the journal can we get it back up, only for it to crash again after a few minutes.
Is anyone else having these reliability issues?
ActiveMQ (5.4.1) is installed on Windows Server 2003, with 64-bit Java (1.6.0_22).
The load is generated by four web servers running PHP, using Stomp.
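For context, the KahaDB setup described above would look roughly like this in activemq.xml (a sketch only - the directory and the 2 GB limit are illustrative, not our exact file):

<broker xmlns="http://activemq.apache.org/schema/core" dataDirectory="${activemq.base}/data">
  <persistenceAdapter>
    <!-- KahaDB store; the journal and index live under this directory -->
    <kahaDB directory="${activemq.base}/data/kahadb"/>
  </persistenceAdapter>
  <systemUsage>
    <systemUsage>
      <storeUsage>
        <!-- cap the persistent store at 2 GB, as described above -->
        <storeUsage limit="2 gb"/>
      </storeUsage>
    </systemUsage>
  </systemUsage>
</broker>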

This is a known issue in 5.4.1. It's fixed and will be available in the 5.4.2 release, which should go out any day now. You can test the release candidate from here: https://repository.apache.org/content/repositories/orgapacheactivemq-023/org/apache/activemq/apache-activemq/5.4.2/

I also noticed this issue (poor performance and a lot of crashes under high traffic from several machines). It's indeed fixed in the latest release, but I would suggest downgrading to 5.3.2 on production systems.

Related

Inconsistency Errors in kombu using celery and redis with the key '_kombu.binding.reply.celery.pidbox'

I have two Django sites (archive and test-archive) on one machine. Each has its own virtual environment and its own Celery queues and daemons, using Python 3.6.9 on Ubuntu 18.04, Django 3.0.2, Redis 4.0.9, Celery 4.3, and Kombu 4.6.3. The server has 16 GB of RAM; under load there is at least 10 GB free and swap usage is minimal.
I keep getting this error in my logs:
kombu.exceptions.InconsistencyError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
I tried:
downgrading Kombu to 4.5 for both sites, per some Stack Overflow posts;
and setting maxmemory=2GB and maxmemory-policy=allkeys-lru in redis.conf, per the Celery docs (https://docs.celeryproject.org/en/stable/getting-started/backends-and-brokers/redis.html#broker-redis) - see the snippet below. Originally the settings were the defaults of unlimited memory and noeviction, and these errors were present with both versions of Kombu.
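For reference, the redis.conf change mentioned above amounts to these two directives (shown exactly as tried, not as a recommendation):

# redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru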
I still get those errors when one site is under load (i.e. doing something like uploading a set of images and processing them) and the other site is idle.
What is a little strange is that on some test runs with test-archive, test-archive shows no errors while archive does, even though the archive site is not doing anything; on other, identical runs with test-archive, it is test-archive that generates the errors and archive that does not.
I know this is a reported bug in Kombu/Celery, so I am wondering if anyone has a workaround that works more often than not for this configuration. Which versions of Celery, Kombu, Redis, etc. seem to be reliable together? I am happy to share my config files or log files, but there are so many that I thought it best to start this discussion with the problem statement and my setup and see what else is needed.
Thanks!

"Error: read ECONNRESET" on Node-RED when writing to InfluxDB

I have just started with Node-RED and InfluxDB, and I would like to apologise if this is a very silly question.
There was a network disconnection on my server earlier. After reconnecting the server to the network, the error "Error: read ECONNRESET" shows up frequently whenever an MQTT signal is received and written into InfluxDB.
A little background on my work: I am working on an Industrial IoT project in which each machine sends signals via MQTT to Node-RED, where they are processed and logged into InfluxDB. The flows had been running without issue before the network disconnection, and I have seen other posts stating that restarting Node-RED would solve the problem - but I cannot afford to restart it until I schedule a time with the factory, and until then more data will be lost.
"Error: read ECONNRESET"
This error is happening at many different InfluxDB nodes, not in one specific place. Is there any way to resolve this without having to restart Node-RED?
Thank you
Given that it's not storing any data at the moment, I would say take the hit and restart Node-RED as soon as possible.
The other option, if you are on a recent Node-RED release, is to just restart the flows. You can do this from the bottom of the drop-down menu on the Deploy button. This leaves Node-RED running and just stops all the nodes and restarts them, which is quicker than a full restart.
I assume you are using the node-red-contrib-influxdb node. It looks to be using the influx npm module under the covers. I can't see anything obvious in the docs about configuring it to reconnect after a failure with the database. I suggest you set up a test system and try to reproduce this by restarting the DB; if you can, then open an issue against node-red-contrib-influxdb on GitHub and see if they can work out how to get it to reconnect after a failure.
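For what it's worth, if you test against the database directly with the same underlying influx npm package (outside Node-RED), a failed write just surfaces as a rejected promise, so a crude retry can be wrapped around it. The sketch below only illustrates the library behaviour - the host, database, measurement and the exact error check are assumptions, and this is not how the contrib node manages its connections internally:

// Sketch only: retry an InfluxDB write after a connection reset.
// Uses the 'influx' npm package that node-red-contrib-influxdb builds on;
// host, database and measurement names are placeholders.
const Influx = require('influx');

const influx = new Influx.InfluxDB({
  host: 'localhost',
  database: 'factory_telemetry',
});

async function writeWithRetry(points, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await influx.writePoints(points);
      return;
    } catch (err) {
      // The error shape varies by version; ECONNRESET generally means the TCP
      // connection died, and retrying opens a fresh one.
      const reset = err.code === 'ECONNRESET' || /ECONNRESET/.test(String(err));
      if (!reset || attempt === attempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}

writeWithRetry([
  { measurement: 'machine_signal', tags: { machine: 'm1' }, fields: { value: 42 } },
]).catch((err) => console.error('write still failing after retries:', err));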
There was a power outage one day and we restarted the whole system. Now the database is working fine. It worked, and I don't know why. Hope this helps.

GridGain node start up failure - java.net.ConnectException during node start up

We have a compute grid prototype (GridGain 6.5.5) that works fine on a local machine (Win7), but when deployed on Windows Server 2008 R2 SP2 even a simple node start-up fails.
The behavior on the server:
During node start-up, a Java socket exception (see below) is thrown several times.
After the connection attempts stop (and with them the exceptions), nothing happens for 5-10 minutes.
After these 5-10 minutes, in some cases the node somehow does come up, joins the grid and is able to receive a task. We couldn't establish a pattern to this behavior.
At first we suspected the issue might be caused by a blocked or already-used port, so we changed the ports used in the config file, but that didn't resolve the issue.
In the console output we get a notification from GridGain that it wasn't fully tested on "Windows Server 2008 R2 SP2". Does that mean GridGain is not compatible with this OS?
In the future the grid will include Linux machines as well; is there a list of supported and incompatible Linux versions, as well as other operating systems?
It is important to mention that the server has no internet access. Since GridGain attempts to check whether a new version is available on start-up, might that be the cause of the issue? No firewall software is installed.
Is it possible to disable this new-version check (and possibly other checks) in order to speed up node start-up?
I hope there is a solution, many thanks in advance!
The exception:
2015-01-08 17:17:10,078 ERROR [main]: Exception on direct send: Connection refused: connect
java.net.ConnectException: Connection refused: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.openSocket(GridTcpDiscoverySpi.java:2098)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.sendMessageDirectly(GridTcpDiscoverySpi.jav
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.sendJoinRequestMessage(GridTcpDiscoverySpi.
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.joinTopology(GridTcpDiscoverySpi.java:1599)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.spiStart0(GridTcpDiscoverySpi.java:1084)
at org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi.spiStart(GridTcpDiscoverySpi.java:982)
at org.gridgain.grid.kernal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:220)
at org.gridgain.grid.kernal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:38
at org.gridgain.grid.kernal.GridKernal.startManager(GridKernal.java:1559)
at org.gridgain.grid.kernal.GridKernal.start(GridKernal.java:756)
at org.gridgain.grid.kernal.GridGainEx$GridNamedInstance.start0(GridGainEx.java:1949)
at org.gridgain.grid.kernal.GridGainEx$GridNamedInstance.start(GridGainEx.java:1289)
at org.gridgain.grid.kernal.GridGainEx.start0(GridGainEx.java:832)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:759)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:677)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:524)
at org.gridgain.grid.kernal.GridGainEx.start(GridGainEx.java:494)
at org.gridgain.grid.GridGain.start(GridGain.java:314)
at org.gridgain.grid.startup.cmdline.GridCommandLineStartup.main(GridCommandLineStartup.java:293)
I think the issue you are seeing has nothing to do with the operating system, be it Windows or Linux. Most likely you have a firewall enabled somewhere, either locally on your operating system or remotely, and it is blocking traffic in one direction.
Try disabling all software firewalls and see if the behavior improves. If it does, you can then re-enable the firewall and fix its settings.
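If a firewall or port clash does turn out to be involved, it can also help to pin the discovery SPI to a fixed port and list the grid hosts explicitly, so the firewall rules only need to cover a known range. A rough Spring XML sketch follows - the bean and property names are written from memory for the 6.x TCP discovery SPI, so treat them as assumptions and double-check them against your version:

<bean class="org.gridgain.grid.GridConfiguration">
  <property name="discoverySpi">
    <bean class="org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi">
      <!-- fixed local discovery port instead of an ephemeral one -->
      <property name="localPort" value="47500"/>
      <property name="ipFinder">
        <bean class="org.gridgain.grid.spi.discovery.tcp.ipfinder.vm.GridTcpDiscoveryVmIpFinder">
          <property name="addresses">
            <list>
              <!-- placeholder addresses of the grid hosts -->
              <value>10.0.0.1:47500</value>
              <value>10.0.0.2:47500</value>
            </list>
          </property>
        </bean>
      </property>
    </bean>
  </property>
</bean>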
Alex_V,
Can you please provide your configuration file?
Please provide the full log of the starting node (ggstart.bat -v ...)
or add -DGRIDGAIN_QUIET=false to the JVM properties.
From the stack trace you provided I see that the exception happens on start. Are you able to start any nodes on Win 2008 at all? How many hosts are there? Are they on one network, or is routing between them configured correctly?
When you see the node frozen, can you please take a thread dump and post it here?
The log message saying that GridGain is not fully tested with Windows Server does not mean incompatibility. Moreover, I would expect it to work; it is just not tested as thoroughly as other Windows systems at the moment.
A single topology may include Windows, Mac and Linux machines with no restrictions or performance impact. Almost all popular Linux distributions are supported.
You can skip the version check by adding -DGRIDGAIN_UPDATE_NOTIFIER=false to the JVM properties, but I don't think it is causing any issues.

Collectd server not writing down received client data

I have a pretty strange problem with collectd. I'm not new to collectd - I used it for a long time on CentOS-based boxes - but now we have Ubuntu 12.04 LTS boxes, and I'm seeing a really strange issue.
I'm using version 5.2 on Ubuntu 12.04 LTS, on two boxes residing on Rackspace (maybe important, but I'm not sure). The network plugin is configured with two local IPs, without any firewall in between and without any security (just to set up a simple client-server scenario).
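For reference, the network plugin setup described above is roughly the following (the IP and port are placeholders, not the real addresses):

# client collectd.conf - send values to the server
LoadPlugin network
<Plugin network>
  Server "10.0.0.2" "25826"
</Plugin>

# server collectd.conf - receive values from clients
LoadPlugin network
<Plugin network>
  Listen "10.0.0.2" "25826"
</Plugin>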
On both machines collectd writes to its configured folders as it should, but on the server machine it doesn't write the data received from the client.
I troubleshot with tcpdump, and I can clearly see UDP traffic and collectd data - including the hostname and plugin names from my client machine - arriving on the server, but it is never flushed to the appropriate folder (as configured in collectd). I'm also running everything as root, to rule out permission problems.
Does anyone have any idea, or similar experience with this? Or some idea of what I could do to troubleshoot it, besides crawling the internet (I think I clicked every sensible link Google gave me in the last two days) and checking the network layer (which looks fine)?
And just a small note: exactly the same thing happened with the official 4.10.2 version from Ubuntu's repo. After trying to troubleshoot that for hours I moved on and upgraded to version 5.
I'd suggest trying out the quite generic troubleshooting procedure based on the csv and logfile plugins, as described in this answer. As everything seems to be fine locally, follow this procedure on the server, activating only the network plugin (in addition to logfile, csv and possibly rrdtool).
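In practice, that means a minimal server-side config along these lines (paths and the listen address are placeholders), so you can see whether the received values at least reach the write stage:

LoadPlugin logfile
LoadPlugin csv
LoadPlugin network

<Plugin logfile>
  LogLevel debug
  File "/var/log/collectd.log"
  Timestamp true
</Plugin>

<Plugin csv>
  DataDir "/var/lib/collectd/csv"
</Plugin>

<Plugin network>
  Listen "10.0.0.2" "25826"
</Plugin>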
Having found no way to fix this, I eventually upgraded Ubuntu to 12.04.2 LTS (3.2.0-24-virtual) and it just started working, without any other intervention.

What does glibc detected ...httpd: double free or corruption mean?

I have asked this question on serverfault.com which is as suggested a more appropriate place for it - https://serverfault.com/questions/169829/what-does-glibc-detected-httpd-double-free-or-corruption-mean
I have an EC2 server running that I use to process image uploads. I have a Flash SWF that handles uploading to the server from my local disk. While uploading about 130 images (a total of about 650 MB), I got the following error in my server log file after about the 45th image:
*** glibc detected *** /usr/sbin/httpd: double free or corruption (!prev): 0x85a6b990 ***
What does this error mean?
The server has stopped responding so I will restart it. Where should I begin to find the cause of this problem?
thanks
Some info:
Apache/2.2.9 (Unix) DAV/2 PHP/5.2.6 mod_ssl/2.2.9 OpenSSL/0.9.8b configured
Fedora 8
This means your web server has crashed. Barring bad hardware, this is a bug inside either the Apache server or a module you have loaded (such as mod_php).
Try upgrading to a newer version, if one is available. If that fails, open a bug report with the Apache maintainers.
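If you do file a report, the maintainers will almost certainly ask for a backtrace. One general way to get one (a sketch, not specific to this setup - the directory and user are examples) is to let httpd write a core dump and open it with gdb:

# allow core dumps before starting httpd
ulimit -c unlimited
mkdir -p /tmp/apache-cores && chown apache:apache /tmp/apache-cores

# in httpd.conf, point core dumps at that directory:
#   CoreDumpDirectory /tmp/apache-cores

# after the next crash, load the core file and capture the backtrace
gdb /usr/sbin/httpd /tmp/apache-cores/core
# (gdb) bt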
