I am using transport client for ES. I have a search request for 1 million records.
I am getting "No node available" exception after getting 8k records itself.
It is giving org.elasticsearch.transport.ReceiveTimeoutTransportException also sometimes.
What can be done to overcome this issue?
You have to set the timeout of ES to a higher value.
See related question here: timeout-es
Related
So I am getting an error in Azure Data Factory that I haven't been able to find any information about. I am running a data flow and eventually (after an hour or so) get this error
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to
reason: The service has encountered an error processing your request.
Please try again. Error code 1204.","Details":"The service has
encountered an error processing your request. Please try again. Error
code 1204."}
Troubleshooting I have already done :
I have successfully ran the data flow using the sample option. Did this with 1 million rows.
I am processing 3 years of data and I have successfully processed all the data by filter the data by year and running the data flow once for each year.
So I think I have shown the data isn't the problem, because I have processed all of it by breaking it down into 3 runs.
I haven't found a pattern in the time the pipeline runs for before the error occurs that would indicate I am hitting any timeout value.
The source and sink for this data flow are both an Azure SQL Server database.
Does anyone have any thoughts? Any suggestions for getting a more verbose error out of data factory (I already have the pipeline set with verbose logging).
We are glad to hear that you has found the cause:
"I opened a Microsoft support ticket and they are saying it is a
database transient caused failure."
I think the error will be resolved automatically. I post this as answer and this can be beneficial to other community members. Thank you.
Update:
The most important thing is that you have resolved it by increase the vCorces in the end.
"The only thing they gave me was their BS article on handling
transient errors. Maybe I’m just old but a database that cannot
maintain connections to it is not very useful. What I’ve done to
workaround this is increase my vCores. This sql database was a
serverless one. While performance didn’t look bad my guess is the
database must be doing some sort of resize in the background to
handle the hour long data builds I need it to do. I had already tried
setting the min/max vCores to be the same. The connection errors
disappeared when I increased the vCores count to 6."
Looking in CloudWatch logs, I see these errors repeated continuously, approximately twice a second:
2019-03-10 00:56:12 2b1be9674700[AWS_OSCAR_ERROR]:Failed to read persisted replica status info. Error code = 11
2019-03-10 00:56:12 2b1be9674700[AWS_OSCAR_ERROR]:Failed to update replication status table (stage 2)
Does anyone know what these are? I've tried Googling the exact and partial errors and was amazed to see that there's nothing out there (well Google, anyway, has never seen them!)
I am concerned because --even if they are nothing to worry about-- they are generating millions of log entries and thus costing me lots in AWS fees.
Thanks in advance.
This looks like an aurora internal error message (that explains why google didn't have much to offer when you searched for it). I think the right way ahead is to contact AWS Support.
I'm pushing bunch of data in Cosmos DB using MongoDB NodeJS drivers. I'm running into the following error
{"Errors":["Request rate is large"]}
I realize this error occurs when the RU/s have exceeded the set threshold. I have partitioned my collection and set 2000 RU/s. I also have set an exponential backoff delay if that error occurs. It was working fine earlier i.e. it retried after a while if it ran into that error but succeeded after waiting. But currently it's consistently throwing the 'Request rate is large' even though I have a mechanism to try again after some delay. Any idea what might be going on and how can I resolve the issue?
Along with that is there a way to extract the raw response headers sent from cosmos DB using MongoDB NodeJS driver as I would like to extract x-ms-retry-after-ms and x-ms-request-charge headers.
We use vnodes on our cluster.
I noticed that when the token space of a node changes (automatically on vnodes, during a repair or a cleanup after adding new nodes), the datastax nodejs driver gets a lot of "Operation timed out - received only X responses" for a few minutes.
I tried using ONE and LOCAL_QUORUM consistencies.
I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue.
What do you guys suggest we should do to avoid this ? Having a custom retry policy ? Caching ? Changing the consistency ?
Example of behavior
when we see this:
4/7/2016, 10:43am Info Host 172.31.34.155 moved from '8185241953623605265' to '-1108852503760494577'
We see a spike of those:
{
"message":"Operation timed out - received only 0 responses.",
"info":"Represents an error message from the server",
"code":4608,
"consistencies":1,
"received":0,
"blockFor":1,
"isDataPresent":0,
"coordinator":"172.31.34.155:9042",
"query":"SELECT foo FROM foo_bar LIMIT 10"
}
I suppose this is due to the coordinator not hitting the right node just after the move. This seems to be a logical behavior (data was moved) but we really want to address this particular issue.
In fact, when adding new node, there will be token range movement but Cassandra can still serve read requests using the old token ranges until the scale out has finished completely. So the behavior you're facing is very suspicious.
If you can reproduce this error, please activate query tracing to narrow down the issue.
The error can also be related to a node under heavy load and not replying fast enough
I have a 3 node Cassandra cluster setup (replication set to 2) with Solr installed, each node having RHEL, 32 GB Ram, 1 TB HDD and DSE 4.8.3. There are lots of writes happening on my nodes and also my web application reads from my nodes.
I have observed that all the nodes go down after every 3-4 days. I have to do a restart of every node and then they function quite well till the next 3-4 days and again the same problem repeats. I checked the server logs but they do not show any error even when the server goes down. I am unable to figure out why is this happening.
In my application, sometimes when I connect to the nodes through the C# Cassandra driver, I get the following error
Cassandra.NoHostAvailableException: None of the hosts tried for query are available (tried: 'node-ip':9042) at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout) at Cassandra.Tasks.TaskHelper.WaitToComplete[T](Task``1 task, Int32 timeout) at Cassandra.ControlConnection.Init() at Cassandra.Cluster.Init()`
But when I check the OpsCenter, none of the nodes are down. All nodes status show perfectly fine. Could this be a problem with the driver? Earlier I was using Cassandra C# driver version 2.5.0 installed from nuget, but now I updated even that to version 3.0.3 still this errors persists.
Any help on this would be appreciated. Thanks in advance.
If you haven't done so already, you may want to look at setting your logging levels to default by running: nodetool -h 192.168.XXX.XXX setlogginglevel org.apache.cassandra DEBUG on all your nodes
Your first issue is most likely an OutOfMemory Exception.
For your second issue, the problem is most likely that you have really long GC pauses. Tailing /var/log/cassandra/debug.log or /var/log/cassandra/system.log may give you a hint but typically doesn't reveal the problem unless you are meticulously looking at the timestamps. The best way to troubleshoot this is to ensure you have GC logging enabled in your jvm.options config and then tail your gc logs taking note of the pause times:
grep 'Total time for which application threads were stopped:' /var/log/cassandra/gc.log.1 | less
The Unexpected exception during request; channel = [....] java.io.IOException: Error while read (....): Connection reset by peer error is typically inter-node timeouts. i.e. The coordinator times out waiting for a response from another node and sends a TCP RST packet to close the connection.