Syncing clocks on multiple Azure VMs

I have a requirement to write a load test measuring message transmission latencies. In order to simulate a large number of simultaneous users without running into thread contention problems on one box, I'm spinning up multiple servers in Azure.
When I got my first results back, I was a little shocked to see that the results indicated the message was received before it was sent. I immediately realized that, while I had an implicit assumption that all the VMs would have their clocks synced to within milliseconds, that was clearly not the case.
I've spent several hours googling ways to resolve this, and I'm not getting anywhere. One thought was to have each VM query the time on a central server using NetRemoteTOD(), using a technique similar to this NetRemoteTOD sample, and then establish a per-machine correction factor to be added to the time measured from the local machine's clock. However, when I tried to run that method, I got error 2184, "The service has not been started." I have verified that both the RPC service and the Windows Time service are running on both the client and target machines, and I have not been successful in finding any information indicating what other service needs to be running (or even whether the error really means what it seems to mean). I also get the same error when running between my development desktop and a server on our corporate network. However, I can run it successfully against a PDC on the corporate network - but I can't find a PDC on Azure, since neither machine is part of a domain.
So, does anyone have any information on what service needs to be started to get NetRemoteTOD (or the Windows NET TIME command, which relies on NetRemoteTOD under the covers) working? Alternatively, does anyone have a suggestion for some other technique to get a consistent time reference across multiple VMs in Azure? (Note, I don't necessarily need their clocks synced; I just need a way to establish a consistent correction factor to reference the times to a common source. Note also, I need sub-second accuracy - probably about 100 msec will do.) Basically, I just need a Windows function or shell command that will get me the time to sub-second accuracy on a given remote server.
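To illustrate the correction-factor idea, here's a rough TypeScript sketch of what I'm after, using a hypothetical HTTP /time endpoint on a central server in place of NetRemoteTOD (Cristian-style: assume the server read its clock halfway through the round trip; the endpoint and URL are made up):

```typescript
import * as http from "http";

// Estimate the offset between this machine's clock and a central
// reference server: ask the server for its time and assume the reply
// was generated halfway through the round trip. The /time endpoint is
// hypothetical -- any service on the central box that returns its
// clock as milliseconds-since-epoch would do.
function estimateOffsetMs(host: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const t0 = Date.now();                    // local send time
    http.get({ host, path: "/time" }, (res) => {
      let body = "";
      res.on("data", (chunk) => (body += chunk));
      res.on("end", () => {
        const t1 = Date.now();                // local receive time
        const serverTime = Number(body);      // server's clock reading
        // The server's reading corresponds roughly to the midpoint
        // of the round trip, so:
        resolve(serverTime - (t0 + t1) / 2);
      });
    }).on("error", reject);
  });
}

// Average several samples to smooth out network jitter, then add the
// offset to local timestamps before computing latencies.
async function calibrate(host: string, samples = 10): Promise<number> {
  let sum = 0;
  for (let i = 0; i < samples; i++) {
    sum += await estimateOffsetMs(host);
  }
  return sum / samples;
}
```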
Thanks in advance.
PS. Azure servers are running Server 2008 R2 SP1

Related

Azure DevOps self-hosted agent error: connectivity issues

We are using Azure DevOps self-hosted agents to build and release our application. We often see the error below, and the agent recovers automatically. Does anyone know what this error is, how to tackle it, and where exactly to check logs about it?
We stopped hearing from agent <agent name>. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink?Linkid=846610
This seems to be a known issue with both self-hosted and Microsoft-hosted agents that many people have been reporting.
Quoting the reply from @zachariahcox from the Azure Pipelines Product Group:
To provide some context, the azure pipelines agent is composed of two
processes: agent.listener and agent.worker (one of these per
step in the job). The listener is responsible for reporting that
workers are still making progress. If the agent.listener is unable
to communicate with the server for 10 minutes (we attempt to
communicate every minute), we assume something has Gone Wrong and
abandon the job.
So, if you're running a private machine, anything that can interfere
with the listener's ability to communicate with our server is going to
be a problem.
Among the issues I've seen are anti-virus programs identifying it as a
threat, local proxies acting up in various ways, the physical machine
running out of memory or disk space (quite common), the machine
rebooting unexpectedly, someone ctrl+c'ing the whole listener process,
the work payload being run at a way higher priority than the listener
(thus "starving" the listener out), unit tests shutting down network
adapters (quite common), having too many agents at normal priority on
the same machine so they starve each other out, etc.
If you think you're seeing an issue that cannot be explained by any of
the above (and nothing jumps out at you from the _diag logs folder),
please file an issue at
https://azure.microsoft.com/en-us/support/devops/
If everything seems to be perfectly alright with your agent and none of the steps mentioned in the Pipeline troubleshooting guide help, please report it on Developer Community where the Azure DevOps Team and DevOps community are actively answering questions.
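For what it's worth, the heartbeat scheme described in that quote (ping every minute, abandon after 10 minutes of silence) boils down to a standard watchdog pattern. Here's a minimal TypeScript sketch of the idea - an illustration only, not the actual agent code; pingServer and abandonJob are hypothetical stand-ins:

```typescript
// Watchdog pattern like the one the listener uses: ping the server
// once a minute, and abandon the job if no ping has succeeded for
// 10 minutes straight.
const PING_INTERVAL_MS = 60_000;
const GIVE_UP_AFTER_MS = 10 * 60_000;

async function listenerLoop(
  pingServer: () => Promise<void>,
  abandonJob: () => void
): Promise<void> {
  let lastSuccess = Date.now();
  while (true) {
    try {
      await pingServer();          // report that workers are alive
      lastSuccess = Date.now();
    } catch {
      // Network blip, proxy trouble, starved CPU... keep trying.
    }
    if (Date.now() - lastSuccess >= GIVE_UP_AFTER_MS) {
      abandonJob();                // "we stopped hearing from agent"
      return;
    }
    await new Promise((r) => setTimeout(r, PING_INTERVAL_MS));
  }
}
```

Anything that stops those pings from getting through - for any 10-minute stretch - produces exactly the error quoted above.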

Azure VM outbound HTTP is unreliable

I have set up an Azure VM and installed a monitoring service that reaches out to various endpoints to verify a 200 response. The service cycles through about 8 URL endpoints every 5 minutes or so.
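(For reference, the service logic amounts to something like the following TypeScript sketch; the endpoint URLs here are placeholders:)

```typescript
import * as https from "https";

// Placeholder URLs -- the real monitored endpoints are omitted.
const endpoints = ["https://example.com/health", "https://example.org/ping"];

function check(url: string): Promise<number> {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      res.resume();                       // drain the response body
      resolve(res.statusCode ?? 0);
    }).on("error", reject);
  });
}

// Cycle through all endpoints every 5 minutes and flag anything
// that doesn't come back with a 200.
setInterval(async () => {
  for (const url of endpoints) {
    try {
      const status = await check(url);
      if (status !== 200) console.warn(`${url} returned ${status}`);
    } catch (err) {
      console.warn(`${url} failed: ${err}`);
    }
  }
}, 5 * 60_000);
```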
We have run this service from multiple other servers outside of Azure, including cheap, low-end virtual machine offerings.

While this machine is running on the lowest tier (A0), it isn't doing anything other than running this service and calling out to the various endpoints.

We are getting intermittent periods where one of the calls in the list fails, for stretches of 10-40 minutes, at random times several times a day.

The site or sites that fail are totally random, and there is no downtime reported from other monitoring locations. We are sure that the connection problem is between Azure and the endpoints outside of Azure; there is no problem from anywhere outside of Azure.
I'm trying to figure out what could be causing this issue. It concerns me because we will soon be adding more services to Azure that make outbound HTTP calls for credit card authorization and other APIs.

Is this a known issue where outbound calls just don't function reliably at times, or am I missing something in the setup or security settings?

Obviously, if the call makes it out and the response doesn't make it back, that is even worse, as credit card charges would end up being pushed while the application never registers the proper response.
Anyone with some experience or insight would be greatly appreciated.
Thanks!
I find that very disturbing and hard to believe since, among a lot of other stuff, I run a service like that too... In my case I reach out to several (today, about 70) external addresses over both IPv4 and IPv6. I don't run A0s, and most of my machines are A3s. I'll start an A0 to test it... if anything turns up, <terminator>I'll be back</terminator> to report...

I know that there are several limitations regarding network traffic, but I don't think you can hit them the way you're reporting...

My suggestion is to report that problem directly to MS via a support ticket... most likely the problem is on the other side...

I'm not sure how to correctly configure my server setup

This is kind of a multi-tiered question, where my end goal is to establish the best way to set up my server, which will host a website as well as a service (using Socket.io) for an iOS (and eventually an Android) app. Both the app service and the website are going to be written in Node.js, as I need high concurrency and scaling for the app server, and I figured that while I'm at it I may as well do the website in Node too, because it wouldn't be that much different in terms of performance than something like Apache (from my understanding).
Also, the website has a lower priority than the app service; the app service should receive significantly higher traffic than the website (but in the long run this may change). Money isn't my greatest priority here, but it is a limiting factor; I feel that having a service with 99.9% uptime (as 100% uptime appears to be virtually impossible in the long run) is more important than saving money at the compromise of having more downtime.
Firstly, I understand that having one Node process per CPU core is the best way to fully utilise a multi-core CPU, and I now understand after researching that running more than one per core is inefficient because the CPU has to context-switch between the multiple processes. How come, then, whenever I see code posted on how to use the built-in cluster module in Node.js, the master process creates a number of workers equal to the number of cores, which would mean you have 9 processes on an 8-core machine (1 master process and 8 worker processes)? Is this because the master process usually just restarts worker processes if they crash or end, and therefore does so little that it doesn't matter that it shares a CPU core with another Node process?
If this is the case, then I am planning to have the workers provide the app service and have the master process manage the workers, while also hosting a webpage providing statistical information on the server's state and all other relevant information (like number of clients connected, worker restart count, error logs, etc.). Is this a bad idea? Would it be better to have this webpage running on a separate worker and just leave the master process to manage the workers?
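For concreteness, here's roughly the layout I have in mind, sketched in TypeScript with Node's built-in cluster module (the ports and the status-page contents are placeholders, not a finished design):

```typescript
import cluster from "cluster";
import * as http from "http";
import * as os from "os";

if (cluster.isPrimary) {
  // One worker per core; the master mostly just supervises, so it's
  // cheap enough to share a core with a worker.
  let restarts = 0;
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();

  cluster.on("exit", () => {
    restarts++;
    cluster.fork();                   // restart crashed workers
  });

  // Lightweight status page served by the master itself.
  http.createServer((req, res) => {
    res.end(JSON.stringify({
      workers: Object.keys(cluster.workers ?? {}).length,
      restarts,
    }));
  }).listen(8080);                    // admin-only port (placeholder)
} else {
  // Each worker runs the actual app service.
  http.createServer((req, res) => {
    res.end("app service response");
  }).listen(3000);                    // workers share this port
}
```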
So overall I wanted to have the following elements: a service to handle the requests from the app (my main point of traffic); a website (fairly simple, a couple of pages and a registration form); an SQL database to store user information; a webpage (probably locally hosted on the server machine) which only I can access and which hosts information about the server (users connected, worker restarts, server logs, and other useful information); and apparently nginx would be a good idea where I'm handling multiple Node processes accepting connections from the app. After doing research I've also found that it would probably be best to host on a VPS initially.

I was thinking that at first, when the amount of traffic the app service receives will most likely be fairly low, I could run all of those elements on one VPS. Or would it be best to have them running on separate VPSs, except for the website and the server status webpage, which I could run on the same one? I guess this way, if there is a hardware failure and something goes down, not everything does, and I could run 2 instances of the app service on 2 different VPSs so that if one goes down the other is still functioning. Would this just be overkill? I doubt that for a while I would need multiple app service instances to support the traffic load, but it would help reduce the apparent downtime for users.
Maybe this all depends on what I value more and have the time to do? A more complex server setup that costs more and may be a little unnecessary but guarantees a consistent and reliable service, or a cheaper and simpler setup that may succumb to downtime due to coding errors and server hardware issues.

Also, it's worth noting that I've never had any real experience with production-level servers, so in some ways I've jumped in at the deep end a little with this. I feel like I've come a long way in the past half a year and am getting a fairly good grasp of what I need to do; I could just do with some advice from someone with experience who has an idea of what roadblocks I may come across along the way and whether I'm causing myself unnecessary problems with this kind of setup.
Any advice is greatly appreciated, thanks for taking the time to read my question.

JMeter never fails

I'm trying to stress test a server with JMeter. I followed the manual and successfully created the tests (the tests run OK and the responses are correct).

However, even if I keep increasing the number of threads, it never fails, yet I keep reading that there must be limits. So what am I doing wrong?

My CPU runs at about 5% when I'm not running JMeter. Running 3000 threads, I see the number of threads increase by 3000 and CPU usage go to about 15%. JMeter also never complains that something went wrong.
My JMeter configuration is:
Number of threads: 3000
Ramp-Up Period: 30
Loop Count: Forever (let it run for over an hour and still nothing goes wrong)
The bottleneck now is my internet connection, which simply can't handle this load and maxes out at 2.1 Mbps. Is this causing the problem? It is increasing my latency from 10 ms per thread to over 5000 ms per thread, but the threads are still running.
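As a rough sanity check on the bandwidth theory: 2.1 Mbps is about 260 KB/s. If each request/response pair is on the order of 500 bytes (an assumption for illustration; actual payloads may differ), 3000 threads firing together need roughly 1.5 MB per wave, which takes about 5-6 seconds to squeeze through the connection - which lines up suspiciously well with the 5000 ms latencies I'm seeing.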
Assuming you have confirmed that you definitely aren't getting back any errors (e.g. using a results table listener, or logging/displaying only errors using a results graph listener) and your internet connection is running at capacity then yes, it does sound like your internet connection is the bottleneck. It doesn't sound like your server is being stressed at all.
If you can easily make use of other machines (e.g. servers in the same location as the server you are testing), you could try using JMeter remote (distributed) testing to sidestep the limitations of your internet connection. See http://jmeter.apache.org/usermanual/remote-test.html for details.
Alternatively, if it's easy (e.g. if you're using VMs in a cloud and can easily spin one up with your software on), you could try using the least powerful server you can instead and stress testing that, to see if you can make it struggle even with your internet connection (just as a sanity check).
If this doesn't help, more details on your server (hardware specifications, web server software and thread pool settings, language) and the site/pages you are testing (mostly static or dynamic? large requests/responses?) would be useful. I've certainly managed to make lower-powered machines (e.g. EC2 m1.small) struggle using JMeter over a 2Mbps connection, but it depends on the site you're testing.

Massive test against Azure getting connection refused or service unavailable

We have a cloud service that gets requests from users, turns the data (two params) into table entities, and puts them into cloud tables (using batch table operations to InsertOrReplace rows). The method is that simple; we're trying to keep it light and fast (partition key and partition-key/row-key pair issues are controlled).

We need the cloud service to cope with about 10k to 15k "concurrent" requests. We first used queues to receive the users' data and a Worker Role to process the queue messages and put them into SQL. Although no errors arose and no data was lost, processing was too slow for our needs. Now we are trying cloud tables to see if we can process the data faster. With smaller numbers of requests the process is fast, but as we get more requests, errors occur and data is lost.

I've set up a few virtual machines for testing in the same virtual network the cloud service is on, to prevent the firewall from blocking requests. A JMeter test with 1000 threads and 5 loops gets 0% errors. The same test from 2 virtual machines is OK too. Adding a third machine causes the first errors (0.14% of requests get Service Unavailable 503 errors). Massive tests from 10 machines, with 1000 threads and 2 loops, get massive 503 and/or connection refused errors. We have tried scaling the cloud service up to 10 instances, but that makes little difference to the results.

I'm a bit stuck with this issue and don't know if I'm approaching the problem with the right tools. Any suggestion will be highly welcome.
The issue may be related to throttling at the storage level. Please look at the scalability targets specified by the Windows Azure Storage team here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. You may want to try doing the load test with these scalability targets in mind.
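One concrete mitigation, if throttling is indeed the cause: make sure throttled requests are retried with exponential backoff rather than surfacing as failures. Here's a TypeScript sketch using the legacy azure-storage Node package; the table name, entity shape, and partition scheme are made up for illustration, and the same backoff idea applies in the .NET storage SDK as well:

```typescript
import * as azure from "azure-storage"; // legacy SDK: npm install azure-storage

// Retry throttled requests with exponential backoff instead of
// failing outright; Table storage signals throttling with 503s.
const retryPolicy = new azure.ExponentialRetryPolicyFilter();
const tableService = azure
  .createTableService(process.env.AZURE_STORAGE_CONNECTION_STRING!)
  .withFilter(retryPolicy);

const entGen = azure.TableUtilities.entityGenerator;

// Batch up to 100 InsertOrReplace operations at a time -- note that a
// single batch may only span one partition key.
function insertBatch(partition: string, rows: string[]): Promise<void> {
  const batch = new azure.TableBatch();
  for (const [i, value] of rows.entries()) {
    batch.insertOrReplaceEntity({
      PartitionKey: entGen.String(partition),
      RowKey: entGen.String(String(i)),
      Value: entGen.String(value),
    });
  }
  return new Promise((resolve, reject) =>
    tableService.executeBatch("mytable", batch, (err) =>
      err ? reject(err) : resolve()
    )
  );
}
```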
