When calling a WCF channel from multiple threads some threads might get stuck for a long time - multithreading

I have encountered a weird problem in one of my projects. I am creating one WCF channel and trying to consume it from multiple threads. The service I am targeting is shut down, so I expect to get an exception after the "Open timeout" (30 seconds in my case) at most. But what I have seen is that the first two calls to the channel finish (with an exception) really quickly, while all the other calls finish only after 20 minutes (my receive timeout).
I am using the same channel because I don't want to wait for the channel to open for each request (Can take a few seconds in case of security and high latency). I have read that a channel is thread safe so I didn't think it should be a problem.
I am using .NET 4.
Code sample:
EndpointAddress address = new EndpointAddress("net.tcp://localhost:9000/SomeService");
var netTcpBinding = new NetTcpBinding();
var channelFactory = new ChannelFactory<IService>(netTcpBinding, address);
IService channel = channelFactory.CreateChannel();
Parallel.For(0, 10, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
    try
    {
        channel.SomeOperation();
    }
    catch
    {
    }
});
I have tried to Close/Abort/Dispose the channel in the catch block but it didn't help.
Does anyone have any idea why this happens and how to fix it?

A Channel only has one connection, so even if it is thread-safe, you won't get the asynchronous benefits of using Parallel. Create a channel per loop and ensure that you close the channel after each request or you'll exhaust the connection pool on your machine from undisposed connections retained by the Channel.

I didn't find a standard solution, but what I did find is that when I use async calls the problem doesn't happen (tested it several times with a 100-iteration loop).
Parallel.For(0, 10, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
    try
    {
        // The Begin/End pair takes an AsyncCallback and a state object; both may be null.
        var result = channel.BeginSomeOperation(null, null);
        channel.EndSomeOperation(result);
    }
    catch
    {
    }
});

Try this instead.
var tasks = from i in Enumerable.Range(0, 10)
select Task.Factory.FromAsync(channel.BeginSomeOperation, channel.EndSomeOperation, null);
var results = from t in TaskEx.WhenAll(tasks)
select t.Result;
PS TaskEx is in the Async targeting pack.


Do not process next job until previous job is completed (BullJS/Redis)?

Basically, each client (each has an associated clientId) can push messages, and it is important that a second message from the same client isn't processed until the first one has finished processing (even though the client can send multiple messages in a row, and they are ordered, and multiple clients sending messages should ideally not interfere with each other). And, importantly, a job shouldn't be processed twice.
I thought that using Redis I might be able to fix this issue. I started with some quick prototyping using the bull library, but I am clearly not doing it well, so I was hoping someone would know how to proceed.
This is what I tried so far:
Create jobs and add them to the same queue name from one process, using the clientId as the job name.
Consume jobs in 2 separate processes, each waiting large random amounts of time.
I tried adding the default locking provided by the library that I am using (bull), but it locks on the jobId, which is unique for each job, not on the clientId.
What I would want to happen:
One of the consumers can't take the job from the same clientId until the previous one is finished processing it.
They should be able to, however, get items from different clientIds in parallel without problem (asynchronously). (I haven't gotten this far, I am right now simply dealing with only one clientId)
What I get:
Both consumers consume as many items as they can from the queue without waiting for the previous item for the clientId to be completed.
Is Redis even the right tool for this job?
Example code
// ./setup.ts
import Queue from 'bull';
import * as uuid from 'uuid';
// Check that when a message is taken from a place, no other message is taken
// TO do that test, have two processes that process messages and one that sets messages, and make the job take a long time
// queue for each room https://stackoverflow.com/questions/54178462/how-does-redis-pubsub-subscribe-mechanism-works/54243792#54243792
// https://groups.google.com/forum/#!topic/redis-db/R09u__3Jzfk
// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353
export async function sleep(ms: number): Promise<void> {
return new Promise((resolve) => {
setTimeout(resolve, ms);
});
}
export interface JobData {
id: string;
v: number;
}
export const queue = new Queue<JobData>('messages', 'redis://127.0.0.1:6379');
queue.on('error', (err) => {
console.error('Uncaught error on queue.', err);
process.exit(1);
});
export function clientId(): string {
return uuid.v4();
}
export function randomWait(minms: number, maxms: number): Promise<void> {
const ms = Math.random() * (maxms - minms) + minms;
return sleep(ms);
}
// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353
// eslint-disable-next-line @typescript-eslint/ban-ts-comment
// @ts-ignore
queue.LOCK_RENEW_TIME = 5 * 60 * 1000;
// ./create.ts
import { queue, randomWait } from './setup';
const MIN_WAIT = 300;
const MAX_WAIT = 1500;
async function createJobs(n = 10): Promise<void> {
await randomWait(MIN_WAIT, MAX_WAIT);
// always same Id
const clientId = Math.random() > 1 ? 'zero' : 'one';
for (let index = 0; index < n; index++) {
await randomWait(MIN_WAIT, MAX_WAIT);
const job = { id: clientId, v: index };
await queue.add(clientId, job).catch(console.error);
console.log('Added job', job);
}
}
export async function create(nIds = 10, nItems = 10): Promise<void> {
const jobs = [];
await randomWait(MIN_WAIT, MAX_WAIT);
for (let index = 0; index < nIds; index++) {
await randomWait(MIN_WAIT, MAX_WAIT);
jobs.push(createJobs(nItems));
await randomWait(MIN_WAIT, MAX_WAIT);
}
await randomWait(MIN_WAIT, MAX_WAIT);
await Promise.all(jobs)
process.exit();
}
(function mainCreate(): void {
create().catch((err) => {
console.error(err);
process.exit(1);
});
})();
// ./consume.ts
import { queue, randomWait, clientId } from './setup';
function startProcessor(minWait = 5000, maxWait = 10000): void {
queue
.process('*', 100, async (job) => {
console.log('LOCKING: ', job.lockKey());
await job.takeLock();
const name = job.name;
const processingId = clientId().split('-', 1)[0];
try {
console.log('START: ', processingId, '\tjobName:', name);
await randomWait(minWait, maxWait);
const data = job.data;
console.log('PROCESSING: ', processingId, '\tjobName:', name, '\tdata:', data);
await randomWait(minWait, maxWait);
console.log('PROCESSED: ', processingId, '\tjobName:', name, '\tdata:', data);
await randomWait(minWait, maxWait);
console.log('FINISHED: ', processingId, '\tjobName:', name, '\tdata:', data);
} catch (err) {
console.error(err);
} finally {
await job.releaseLock();
}
})
.catch(console.error); // Catches initialization
}
startProcessor();
This is run using 3 different processes, which you might start like this (although I use different tabs for a clearer view of what is happening):
npx ts-node consume.ts &
npx ts-node consume.ts &
npx ts-node create.ts &
I'm not familiar with Node.js, but for Redis I would try the following.
Let's say you have client_1 and client_2, both publishers of events.
You have three machines: consumer_1, consumer_2, and consumer_3.
Establish a list of tasks in Redis, e.g. JOB_LIST.
Clients put (LPUSH) jobs into this JOB_LIST in a specific form, like "CLIENT_1:[jobcontent]", "CLIENT_2:[jobcontent]".
Each consumer takes jobs out (the Redis RPOP command, or BRPOP for a blocking pop) and processes them.
For example, consumer_1 takes out a job whose content is CLIENT_1:[jobcontent]. It parses the content and recognizes that it is from CLIENT_1. It then needs to check whether some other consumer is already processing CLIENT_1; if not, it locks a key to indicate that it is processing CLIENT_1.
To do so, it sets a key "CLIENT_1_PROCESSING" with the value "consumer_1", using the Redis SETNX command (set if the key does not exist), with an appropriate timeout. SETNX itself takes no expiry, so either follow it with EXPIRE or use SET with the NX and EX options. For example, if the task normally takes one minute to finish, set a timeout of five minutes on the key, in case consumer_1 crashes and would otherwise hold the lock indefinitely.
If SETNX returns 0, it failed to acquire the lock for CLIENT_1 (someone is already processing a job of client_1). It then returns the job (the value "CLIENT_1:[jobcontent]") to the left side of JOB_LIST using the Redis LPUSH command. Then it might wait a bit (sleep a few seconds) and RPOP another task from the right side of the list. If this time SETNX returns 1, consumer_1 has acquired the lock. It goes on to process the job; after it finishes, it deletes the key "CLIENT_1_PROCESSING", releasing the lock. Then it goes on to RPOP another job, and so on.
Some things to consider:
The JOB_LIST is not fair, e.g. earlier jobs might be processed later.
The locking part is a bit rudimentary, but will suffice.
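The steps above can be sketched as follows. This is a minimal sketch that uses an in-memory stand-in instead of a real Redis server, so it runs on its own; the helper names (lpush, rpop, setIfAbsent, tryTakeJob, finishJob) are made up for illustration and are not a Redis client API. setIfAbsent only models the "set if not exists" behaviour and leaves out the expiry.

```javascript
// In-memory stand-in for Redis so the sketch runs without a server.
// jobList models JOB_LIST; locks models the "<CLIENT>_PROCESSING" keys.
const jobList = [];       // lpush adds at the front, rpop takes from the end
const locks = new Map();  // lock key -> name of the consumer holding it

const lpush = (job) => jobList.unshift(job);
const rpop = () => jobList.pop();

// Models "set if the key does not exist": true only when the lock was free.
const setIfAbsent = (key, owner) => {
  if (locks.has(key)) return false;
  locks.set(key, owner);
  return true;
};
const del = (key) => locks.delete(key);

// One consumer step: pop a job, try to lock its client, requeue on failure.
function tryTakeJob(consumerName) {
  const job = rpop();
  if (job === undefined) return null;
  const client = job.split(':', 1)[0];
  const lockKey = `${client}_PROCESSING`;
  if (!setIfAbsent(lockKey, consumerName)) {
    lpush(job);  // another consumer holds this client, so put the job back
    return null;
  }
  return { job, lockKey };
}

// Release the per-client lock once processing is done.
function finishJob(taken) {
  del(taken.lockKey);
}

lpush('CLIENT_1:a');
lpush('CLIENT_1:b');
lpush('CLIENT_2:c');

const first = tryTakeJob('consumer_1');  // takes CLIENT_1:a and locks CLIENT_1
const second = tryTakeJob('consumer_2'); // pops CLIENT_1:b, lock is held, requeues it
const third = tryTakeJob('consumer_2');  // takes CLIENT_2:c
```

Note that CLIENT_2:c is taken before CLIENT_1:b here, which is the unfairness mentioned above.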
----------update--------------
I've figured out another way to keep tasks in order.
For each client (producer), build a list, like "client_1_list", and push jobs into the left side of the list.
Save all the client names in a list "client_names_list", with values "client_1", "client_2", etc.
Each consumer (processor) iterates over "client_names_list". For example, consumer_1 gets "client_1" and checks whether the key of client_1 is locked (someone is already processing a task of client_1). If not, it right-pops a value (job) from client_1_list and locks client_1. If client_1 is locked, it (probably after sleeping one second) moves on to the next client, "client_2" for example, checks its key, and so on.
This way, each client's (task producer's) tasks are processed in the order in which they were added.
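A minimal sketch of this second approach, again with in-memory structures standing in for the Redis lists and lock keys. The function names (pushJob, takeNext, finish) are hypothetical and only illustrate the control flow.

```javascript
// In-memory stand-in for the Redis structures described above.
const clientLists = new Map();   // "client_1" -> that client's jobs, oldest first
const lockedClients = new Set(); // clients that are currently being processed

function pushJob(client, job) {
  if (!clientLists.has(client)) clientLists.set(client, []);
  clientLists.get(client).push(job);
}

// A consumer walks the client names and takes the oldest job of the first
// client that is not locked; returns null when nothing can be taken.
function takeNext() {
  for (const [client, jobs] of clientLists) {
    if (lockedClients.has(client) || jobs.length === 0) continue;
    lockedClients.add(client);
    return { client, job: jobs.shift() };
  }
  return null;
}

// Unlock the client so its next job becomes available.
function finish(taken) {
  lockedClients.delete(taken.client);
}

pushJob('client_1', 'a1');
pushJob('client_1', 'a2');
pushJob('client_2', 'b1');

const t1 = takeNext(); // client_1 / a1
const t2 = takeNext(); // client_1 is locked, so client_2 / b1
const t3 = takeNext(); // both clients locked, so null
finish(t1);
const t4 = takeNext(); // client_1 / a2, in the order it was added
```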
EDIT: I found the reason BullJS starts jobs in parallel on one processor: we are using named jobs and were defining many named process functions on one queue/processor. The default concurrency factor for a queue/processor is 1, so the queue should not process any jobs in parallel.
The problem with this setup is that if you define many (named) process handlers on one queue, the concurrency adds up with each process-handler function: if you define three named process handlers, you get a concurrency factor of 3 for that queue across all the defined named jobs.
So define just one named job per queue for queues where parallel processing should not happen and all jobs should run sequentially, one after the other.
That can be important, e.g. when pushing a high number of jobs onto the queue where the processing involves API calls that would give errors if handled in parallel.
The following text is my first attempt at answering the OP's question and describes just a workaround for the problem. So better to go with my edit :) and configure your queues the right way.
I found an easy solution to the OP's question.
In fact BullJS is processing many jobs in parallel on one worker instance:
Let's say you have one worker instance up and running and push 10 jobs onto the queue; then that worker may start all of them in parallel.
My research on BullJS queues indicated that this is not intended behavior: one worker (also called a processor by BullJS) should only start a new job from the queue when it is idle, i.e. not processing a previous job.
Nevertheless BullJS keeps starting jobs in parallel on one worker.
In our implementation that led to big problems during API calls, most likely caused by too many API calls at a time. Tests showed that when only starting one worker, the API calls finished just fine and gave status 200.
So how to just process one job after the other once the previous is finished if BullJS does not do that for us (just what the op asked)?
We first experimented with delays and other BullJS options, but that's a workaround and not the exact solution to the problem. At least we did not manage to stop BullJS from processing more than one job at a time.
So we did it ourselves and started one job after the other.
The solution was rather simple for our use case after looking into the BullJS API reference (BullJS API Ref).
We just used a for-loop to start the jobs one after another. The trick was to use BullJS's
job.finished
method, which returns a Promise that resolves once the job is finished. By using await inside the for-loop, the next job is started only after the job.finished Promise has resolved. That's the nice thing about for-loops: await works in them!
Here is a small code example of how to achieve the intended behavior:
for (let i = 0; i < theValues.length; i++) {
jobCounter++
const job = await this.processingQueue.add(
'update-values',
{
value: theValues[i],
},
{
// delay: i * 90000,
// lifo: true,
}
)
this.jobs[job.id] = {
jobType: 'socket',
jobSocketId: BackgroundJobTasks.UPDATE_VALUES,
data: {
value: theValues[i],
},
jobCount: theValues.length,
jobNumber: jobCounter,
cumulatedJobId
}
await job.finished()
.then((val) => {
console.log('job finished:: ', val)
})
}
The important part is really
await job.finished()
inside the for loop. theValues.length jobs get started, one after the other, as intended.
That way, horizontally scaling jobs across more than one worker is not possible anymore. Nevertheless, this workaround is okay for us at the moment.
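The effect of awaiting inside the loop can be seen without a queue at all. The following is a minimal sketch in which fakeJob is a hypothetical stand-in for a queued job (it is not part of BullJS); its finished() method resolves after a short timer, and the log records the order of events.

```javascript
// fakeJob stands in for a queued job; finished() resolves when its work is done.
const log = [];
function fakeJob(name, ms) {
  log.push(`start ${name}`);
  const done = new Promise((resolve) => setTimeout(() => {
    log.push(`end ${name}`);
    resolve(name);
  }, ms));
  return { finished: () => done };
}

async function runSequentially(names) {
  for (const name of names) {
    const job = fakeJob(name, 10);
    await job.finished(); // the next job is not created until this one resolves
  }
  return log;
}

// Resolves to: start a, end a, start b, end b, start c, end c
const sequentialRun = runSequentially(['a', 'b', 'c']);
```

Creating all jobs first and awaiting them together (for example with Promise.all) would instead log all three starts before any end.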
I will get in contact with OptimalBits, the maker of BullJS, to clear things up.

worker thread won't respond after first message?

I'm making a server script and, to make it easier for both hosts and clients to do what they want, I made a customizable server script that runs using NW.js (with a visual interface). Said script was made using web workers, since NW.js was having problems with support for worker threads.
Now that NW.js has fixed its problems with worker threads, I've been trying to move everything that was inside the web workers to worker threads, but there's a problem: when the main thread receives the answer from the second thread, the latter stops responding to any subsequent message.
For example, running the following code with either NW.js or Node.js itself will return "pong" only once:
const { Worker } = require('worker_threads');
const worker = new Worker('const { parentPort } = require("worker_threads");parentPort.once("message",message => parentPort.postMessage({ pong: message })); ', { eval: true });
worker.on('message', message => console.log(message));
worker.postMessage('ping');
worker.postMessage('ping');
How do I configure the worker so it will keep responding to whatever message it receives after the first one?
Because you use the EventEmitter.once() method. According to the documentation, this method does the following:
Adds a one-time listener function for the event named eventName. The
next time eventName is triggered, this listener is removed and then
invoked.
If you need your worker to process more than one event, then use EventEmitter.on():
const worker = new Worker('const { parentPort } = require("worker_threads");' +
'parentPort.on("message",message => parentPort.postMessage({ pong: message }));',
{ eval: true });
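The difference between the two registration methods can be seen in isolation with a plain EventEmitter. This is a small sketch in which port is only a stand-in for parentPort; no worker is started.

```javascript
const { EventEmitter } = require('events');

const port = new EventEmitter(); // stand-in for parentPort
const onceReplies = [];
const onReplies = [];

port.once('message', (m) => onceReplies.push(m)); // removed after the first event
port.on('message', (m) => onReplies.push(m));     // stays registered

port.emit('message', 'ping 1');
port.emit('message', 'ping 2');

console.log(onceReplies.length, onReplies.length); // prints: 1 2
```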

Nodejs Cluster Architecture reading from single REDIS instance

I'm using Nodejs cluster module to have multiple workers running.
I created a basic Architecture where there will be a single MASTER process which is basically an express server handling multiple requests and the main task of MASTER would be writing incoming data from requests into a REDIS instance. Other workers(numOfCPUs - 1) will be non-master i.e. they won't be handling any request as they are just the consumers. I have two features namely ABC and DEF. I distributed the non-master workers evenly across features via assigning them type.
For example, on an 8-core machine:
1 will be the MASTER instance handling requests via the express server.
The remaining (8 - 1 = 7) will be distributed evenly: 4 to feature:ABC and 3 to feature:DEF.
Non-master workers are basically consumers, i.e. they read from REDIS, to which only the MASTER worker can write data.
Here's the code for the same:
if (cluster.isMaster) {
// Fork workers.
for (let i = 0; i < numCPUs - 1; i++) {
ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
}
cluster.on('exit', function(worker) {
console.log(`Worker ${worker.process.pid}::type(${worker.type}) died`);
ClusteringUtil.removeWorkerFromList(worker.type);
ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
});
// Start consuming on server-start
ABCConsumer.start();
DEFConsumer.start();
console.log(`Master running with process-id: ${process.pid}`);
} else {
console.log('CLUSTER type', cluster.worker.process.env.type, 'running on', process.pid);
if (
cluster.worker.process.env &&
cluster.worker.process.env.type &&
cluster.worker.process.env.type === ServerTypeEnum.EXPRESS
) {
// worker for handling requests
app.use(express.json());
...
}
}
Everything works fine except consumers reading from REDIS.
Since there are multiple consumers of a particular feature, each one reads the same message and starts processing it individually, which is what I don't want. If there are 4 consumers and 1 is marked as busy and cannot consume until free, 3 are available. Once the message for that particular feature is written to REDIS by MASTER, the problem is that all 3 available consumers of that feature start consuming it. This means that a single message is processed as many times as there are available consumers.
const stringifedData = JSON.stringify(req.body);
const key = uuidv1();
const asyncHsetRes = await asyncHset(type, key, stringifedData);
if (asyncHsetRes) {
await asyncRpush(FeatureKeyEnum.REDIS.ABC_MESSAGE_QUEUE, key);
res.send({ status: 'success', message: 'Added to processing queue' });
} else {
res.send({ error: 'failure', message: 'Something went wrong in adding to queue' });
}
The consumer simply accepts messages and stops when it is busy:
module.exports.startHeartbeat = startHeartbeat = async function(config = {}) {
if (!config || !config.type || !config.listKey) {
return;
}
heartbeatIntervalObj[config.type] = setInterval(async () => {
await asyncLindex(config.listKey, -1).then(async res => {
if (res) {
await getFreeWorkerAndDoJob(res, config);
stopHeartbeat(config);
}
});
}, HEARTBEAT_INTERVAL);
};
Ideally, a message should be read by only one consumer of that particular feature. After consuming, the consumer is marked as busy so it won't consume further until free (I have handled this). The next message should then be processed by exactly one of the other available consumers.
Please help me in tackling this problem. Again, I want one message to be read by only one free consumer, and the rest of the free consumers should wait for a new message.
Thanks
I'm not sure I fully get your Redis consumer architecture, but I feel like it conflicts with the use case of Redis itself. What you're trying to achieve is essentially queue-based messaging with the ability to commit a message once it's done.
Redis has its own pub/sub feature, but it is built on a fire-and-forget principle. It doesn't distinguish between consumers - it just sends the data to all of them, assuming that it's their logic to handle the incoming data.
I recommend you use a queue server like RabbitMQ. You can achieve your goal with some features that AMQP 0-9-1 supports: message acknowledgment, consumer prefetch count and so on. You can set up your cluster with very flexible configs, such as: I want to have X consumers, each can handle 1 unique (!) message at a time, and they will receive new ones only after they let the server (RabbitMQ) know that they successfully finished processing the message. This is highly configurable and robust.
However, if you want to go serverless with a fully managed service, so that you don't have to provision virtual machines or anything else to run a message queue server of your choice, you can use AWS SQS. It has a broadly similar API and feature list.
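To illustrate the acknowledgment and prefetch behaviour described above, here is a toy in-memory model. It is not RabbitMQ, AMQP, or SQS code; ToyBroker and its methods are invented for this sketch and only show the idea that each message goes to exactly one idle consumer, and that a consumer gets its next message only after acknowledging the previous one.

```javascript
// Toy in-memory broker: a model of "prefetch = 1 with acks", not a real client.
class ToyBroker {
  constructor() {
    this.pending = [];    // messages not yet delivered
    this.consumers = [];  // { name, busy, handler }
    this.deliveries = []; // [consumerName, message] pairs, kept for inspection
  }

  subscribe(name, handler) {
    this.consumers.push({ name, busy: false, handler });
    this.dispatch();
  }

  publish(message) {
    this.pending.push(message);
    this.dispatch();
  }

  // Hand each pending message to exactly one idle consumer.
  dispatch() {
    for (const consumer of this.consumers) {
      if (this.pending.length === 0) return;
      if (consumer.busy) continue;
      consumer.busy = true; // prefetch of 1: nothing more until it acks
      const message = this.pending.shift();
      this.deliveries.push([consumer.name, message]);
      const ack = () => {
        consumer.busy = false;
        this.dispatch();
      };
      consumer.handler(message, ack);
    }
  }
}

const broker = new ToyBroker();
const acks = [];
// Each handler keeps its ack callback, so the message stays "in progress".
broker.subscribe('abc-1', (msg, ack) => acks.push(ack));
broker.subscribe('abc-2', (msg, ack) => acks.push(ack));

broker.publish('m1'); // delivered to abc-1 only
broker.publish('m2'); // abc-1 is busy, delivered to abc-2 only
broker.publish('m3'); // both busy, stays pending
acks[0]();            // abc-1 acks m1 and then receives m3
```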
Hope it helps!

How to do Async in Azure WebJob function

I have an async method that gets api data from a server. When I run this code on my local machine, in a console app, it performs at high speed, pushing through a few hundred http calls in the async function per minute. When I put the same code to be triggered from an Azure WebJob queue message however, it seems to operate synchronously and my numbers crawl - I'm sure I am missing something simple in my approach - any assistance appreciated.
(1) .. WebJob function that listens for a message on queue and kicks off the api get process on message received:
public class Functions
{
// This function will get triggered/executed when a new message is written
// on an Azure Queue called queue.
public static async Task ProcessQueueMessage ([QueueTrigger("myqueue")] string message, TextWriter log)
{
var getAPIData = new GetData();
getAPIData.DoIt(message).Wait();
log.WriteLine("*** done: " + message);
}
}
(2) the class that outside azure works in async mode at speed...
class GetData
{
// wrapper that is called by the message function trigger
public async Task DoIt(string MessageFile)
{
await CallAPI(MessageFile);
}
public async Task<string> CallAPI(string MessageFile)
{
/// create a list of sample APIs to call...
var apiCallList = new List<string>();
apiCallList.Add("localhost/?q=1");
apiCallList.Add("localhost/?q=2");
apiCallList.Add("localhost/?q=3");
apiCallList.Add("localhost/?q=4");
apiCallList.Add("localhost/?q=5");
// setup httpclient
HttpClient client =
new HttpClient() { MaxResponseContentBufferSize = 10000000 };
var timeout = new TimeSpan(0, 5, 0); // 5 min timeout
client.Timeout = timeout;
// create a list of http api get Task...
IEnumerable<Task<string>> allResults = apiCallList.Select(str => ProcessURLPageAsync(str, client));
// wait for them all to complete, then move on...
await Task.WhenAll(allResults);
return allResults.ToString();
}
async Task<string> ProcessURLPageAsync(string APIAddressString, HttpClient client)
{
string page = "";
HttpResponseMessage resX;
try
{
// set the address to call
Uri URL = new Uri(APIAddressString);
// execute the call
resX = await client.GetAsync(URL);
page = await resX.Content.ReadAsStringAsync();
string rslt = page;
// do something with the api response data
}
catch (Exception ex)
{
// log error
}
return page;
}
}
First, because your triggered function is async, you should use await rather than .Wait(). Wait will block the current thread.
public static async Task ProcessQueueMessage([QueueTrigger("myqueue")] string message, TextWriter log)
{
var getAPIData = new GetData();
await getAPIData.DoIt(message);
log.WriteLine("*** done: " + message);
}
Anyway, you'll be able to find useful information in the documentation:
Parallel execution
If you have multiple functions listening on different queues, the SDK will call them in parallel when messages are received simultaneously.
The same is true when multiple messages are received for a single queue. By default, the SDK gets a batch of 16 queue messages at a time and executes the function that processes them in parallel. The batch size is configurable. When the number being processed gets down to half of the batch size, the SDK gets another batch and starts processing those messages. Therefore the maximum number of concurrent messages being processed per function is one and a half times the batch size. This limit applies separately to each function that has a QueueTrigger attribute.
Here is sample code to configure the batch size:
var config = new JobHostConfiguration();
config.Queues.BatchSize = 50;
var host = new JobHost(config);
host.RunAndBlock();
However, having too many threads running at the same time is not always a good option and can lead to bad performance.
Another option is to scale out your webjob:
Multiple instances
If your web app runs on multiple instances, a continuous WebJob runs on each machine, and each machine will wait for triggers and attempt to run functions. The WebJobs SDK queue trigger automatically prevents a function from processing a queue message multiple times; functions do not have to be written to be idempotent. However, if you want to ensure that only one instance of a function runs even when there are multiple instances of the host web app, you can use the Singleton attribute.
Have a read of this WebJobs SDK documentation - the behaviour you should expect is that your process will run and process one message at a time, but will scale up if more instances (of your app service) are created. If you had multiple queues, they would trigger in parallel.
To improve performance, see the configuration settings section in the link I sent you, which covers the number of messages that can be triggered in a batch.
If you want to process multiple messages in parallel, though, and don't want to rely on instance scaling, then you need to use threading instead (async isn't about multi-threaded parallelism, but about making more efficient use of the thread you're using). So your queue trigger function should read the message from the queue, then create a thread, "fire and forget" that thread, and return from the trigger function. This will mark the message as processed and allow the next message on the queue to be processed, even though in theory you're still processing the earlier one. Note that you will need to include your own logic for error handling and for ensuring that the data won't get lost if your thread throws an exception or can't process the message (e.g. put it on a poison queue).
The other option is to not use the [QueueTrigger] attribute, and to use the Azure Storage queues SDK API functions directly to connect and process the messages per your requirements.

Amazon SQS with aws-sdk receiveMessage Stall

I'm using the aws-sdk node module with the (as far as I can tell) approved way to poll for messages.
Which basically sums up to:
sqs.receiveMessage({
QueueUrl: queueUrl,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20
}, function(err, data) {
if (err) {
logger.fatal('Error on Message Recieve');
logger.fatal(err);
} else {
// all good
if (undefined === data.Messages) {
logger.info('No Messages Object');
} else if (data.Messages.length > 0) {
logger.info('Messages Count: ' + data.Messages.length);
var delete_batch = new Array();
for (var x=0;x<data.Messages.length;x++) {
// process
receiveMessage(data.Messages[x]);
// flag to delete
// a plain object, since the entry is a map of named fields rather than a list
var pck = {
Id: data.Messages[x].MessageId,
ReceiptHandle: data.Messages[x].ReceiptHandle
};
delete_batch.push(pck);
}
if (delete_batch.length > 0) {
logger.info('Calling Delete');
sqs.deleteMessageBatch({
Entries: delete_batch,
QueueUrl: queueUrl
}, function(err, data) {
if (err) {
logger.fatal('Failed to delete messages');
logger.fatal(err);
} else {
logger.debug('Deleted recieved ok');
}
});
}
} else {
logger.info('No Messages Count');
}
}
});
receiveMessage is my "do stuff with collected messages if I have enough collected messages" function.
Occasionally, my script stalls because I don't get a response from Amazon at all. Say, for example, there are no messages in the queue to consume: instead of hitting the WaitTimeSeconds and sending a "no messages object", the callback isn't called.
(I'm writing this up to Amazon Weirdness)
What I'm asking is what the best way to detect and deal with this is, as I have some code in place to stop concurrent calls to receiveMessage.
The suggested answer here: Nodejs sqs queue processor also has code that prevents concurrent message request queries (granted it's only fetching one message a time)
I do have the whole thing wrapped in
var running = false;
runMonitorJob = setInterval(function() {
if (running) {
} else {
running = true;
// call SQS.receive
}
}, 500);
(With a running = false after the delete loop (not in its callback))
My solution would be
watchdogTimeout = setTimeout(function() {
running = false;
}, 30000);
But surely this would leave a pile of floating sqs.receive calls lurking about, and thus use a lot of memory over time?
(This job runs all the time, and I left it running on Friday, it stalled Saturday morning and hung till I manually restarted the job this morning)
Edit: I have seen cases where it hangs for ~5 minutes and then suddenly gets messages, BUT with a wait time of 20 seconds it should report "no messages" after 20 seconds. So a watchdog of ~10 minutes might be more practical (depending on the rest of one's business logic).
Edit: Yes Long Polling is already configured Queue Side.
Edit: This is under (latest) v2.3.9 of aws-sdk and NodeJS v4.4.4
I've been chasing this (or a similar) issue for a few days now and here's what I've noticed:
The receiveMessage call does eventually return, although only after 120 seconds.
Concurrent calls to receiveMessage are serialised by the AWS SDK library, so making multiple calls in parallel has no effect.
The receiveMessage callback does not error - in fact, after the 120 seconds have passed, it may contain messages.
What can be done about this? This sort of thing can happen for a number of reasons and some/many of these things can't necessarily be fixed. The answer is to run multiple services each calling receiveMessage and processing the messages as they come - SQS supports this. At any time, one of these services may hit this 120 second lag but the other services should be able to continue on as normal.
My particular problem is that I have some critical singleton services that can't afford 120 seconds of down time. For this I will look into either 1) use HTTP instead of SQS to push messages into my service or 2) spawn slave processes around each of the singletons to fetch the messages from SQS and push them into the service.
I also ran into this issue, but not when calling receiveMessage but sendMessage. I also saw hangups of exactly 120 seconds. I also saw it with a few other services, like Firehose.
That led me to this line in the AWS SDK:
SQS Constructor
httpOptions:
timeout [Integer] — Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000).
To implement a fix, I overrode the timeout for my SQS client that performs the sendMessage so that it times out after 10 seconds, and created another client with a 25-second timeout for receiving (where I long poll for 20 seconds):
var sendClient = new AWS.SQS({httpOptions:{timeout:10*1000}});
var receiveClient = new AWS.SQS({httpOptions:{timeout:25*1000}});
I've had this out in production for a week now and I've noticed that all of my SQS stalling issues have been eliminated.
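A caller-side watchdog, along the lines the question proposed, can complement the socket timeout. The following is a minimal sketch, assuming the poll is available as a Promise; withTimeout and the two example promises are hypothetical names for illustration and are not part of the aws-sdk. Note that it only stops waiting; it does not cancel the underlying request.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
// The underlying operation is not cancelled; only the caller stops waiting.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical stand-ins for a poll that answers and one that never does.
const answers = new Promise((resolve) => setTimeout(() => resolve('messages'), 10));
const stalls = new Promise(() => {});

const okResult = withTimeout(answers, 200);                               // resolves to 'messages'
const stalledResult = withTimeout(stalls, 50).catch((err) => err.message); // resolves to the timeout message
```

With this shape, the polling loop can reset its running flag in one place when the wrapped call either settles or times out.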
