Linux vs Win runtime timings - linux

I have an application that was ported from Windows to Linux. The same code now compiles with both VS C++ and g++, but performance differs between Windows and Linux. The application is a caching node that sits between a server and a client. It caches client requests and server responses in a list, so that when another client makes a request the server has already processed, the node responds itself instead of forwarding the request to the server.
When the node runs on Windows, the client gets everything it needs in about 7 seconds. When the same node runs on Linux (Ubuntu 9.04), the client takes 35 seconds to start up. Every test starts from scratch. I'm trying to understand where this timing difference comes from. A weird case is when the node runs on Linux inside a virtual machine hosted by Windows: load time is then around 7 seconds, just as if it were running on Windows natively. So my impression is that there is a problem with networking.
The node uses UDP to send and receive network data, with boost::asio as the implementation. I tried changing all the supported socket flags and the buffer sizes, but nothing helped.
Does anyone know why this is happening, or which UDP-related network settings might influence performance?
Thanks.

If you suspect a network problem, take a network capture (Wireshark is great for this kind of problem) and look at the traffic.
Find out where the time is being spent, based either on the network capture or on the output of a profiler.
Once you know that, you're halfway to a solution.
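As a rough complement to a capture, a small round-trip timer can show whether individual UDP exchanges are slow on one platform. The sketch below is Python rather than the asker's C++/boost::asio code, and it runs against a local echo socket on loopback; the function names, payload size, and datagram count are all assumptions made for illustration.

```python
import socket
import threading
import time


def run_echo_server(sock, count):
    """Echo `count` datagrams back to their sender, then return."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def measure_udp_rtt(host, port, count=20, payload=b"x" * 64, timeout=2.0):
    """Send `count` datagrams one at a time; return round-trip times in seconds."""
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        for _ in range(count):
            start = time.perf_counter()
            client.sendto(payload, (host, port))
            client.recvfrom(2048)  # wait for the echo before sending the next one
            rtts.append(time.perf_counter() - start)
    return rtts


def demo(count=20):
    """Time `count` round trips against an echo socket on loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.settimeout(5.0)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    port = server.getsockname()[1]
    worker = threading.Thread(target=run_echo_server, args=(server, count))
    worker.start()
    try:
        return measure_udp_rtt("127.0.0.1", port, count=count)
    finally:
        worker.join()
        server.close()


if __name__ == "__main__":
    rtts = demo()
    print(f"min {min(rtts) * 1e3:.3f} ms, max {max(rtts) * 1e3:.3f} ms")
```

Pointing measure_udp_rtt at the real node (with a payload it actually answers) from both a Windows and a Linux host would show whether the extra time is per datagram or somewhere else.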

These timing differences can depend on many factors, but the first one coming to mind is that you are using a modern Windows version. XP already had features to keep recently used applications in memory, but in Vista this was much better optimized. For each application you load, a special load file is created that is equal to how it looks in memory. Next time you load your application, it should go a lot faster.
I don't know about Linux, but it is quite possible that it needs to load your app completely each time. You can compare the two systems much better if you measure performance while the application is already running. Leave your application open (if your design allows it) and compare again.
These differences in how the system optimizes memory are backed up by your scenario using the VM approach.
Basically, if you rule out other running applications and run your application in high-priority mode, the performance should be close to equal, but it depends on whether you use operating-system-specific code, how you access the file system, how you use the UDP protocol, and so on.

Related

NodeJS Monitoring Website (Worker Threads?/Multi Process?)

I am working on a small application that will monitor some servers.
It will be based on telnet port checks and ping, and it will also use libraries to connect directly to databases (MSSQL, Oracle, MySQL) to check their status.
I wonder what the most effective approach would be. Currently, with around 30 servers, it runs quite smoothly: about 2.5 seconds to check the status of all of them (running async). However, I am worried that it might get worse as more servers are added, so I am thinking about alternatives such as Worker Threads or multiple processes. Any ideas? Everything happens on an internal network, so I do not expect much latency.
Thank you in advance.
Have you tried PM2 cluster mode?
https://pm2.keymetrics.io/docs/usage/cluster-mode/
The telnet checks are TCP, which Node.js handles very well using OS-level networking events. The database connections can vary. In the case of Oracle, you'll likely be using node-oracledb. Those are SQL*Net connections that rely on the OCI libraries and the Node.js thread pool. The thread pool defaults to four threads, but you can grow it up to 128 per Node.js process. See this doc for info:
https://oracle.github.io/node-oracledb/doc/api.html#-143-connections-threads-and-parallelism
Having said all that, other than increasing the size of the thread pool, I wouldn't recommend you make any changes. Why fight fires before they're burning? No need to over-engineer things. You're getting acceptable performance given the current number of servers you have.
How many servers do you plan to add in, say, 5 years? What's the difference in timing if you run the status checks for half of the servers vs all of them? Perhaps you could use that kind of data to make an educated guess as to where things would go.
As you add new ones, keep track of the total time to check the status. Is it slipping? If so, look into where the time is being spent and write the solution that will help.
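To collect that kind of data, it helps to record both the per-server time and the total time for each run. Here is a minimal sketch of the idea in Python's asyncio rather than Node.js; check_port and check_all are hypothetical names, and a plain TCP connect stands in for the telnet-style port check.

```python
import asyncio
import time


async def check_port(host, port, timeout=1.0):
    """Try one TCP connect; return (host, port, ok, seconds)."""
    start = time.perf_counter()
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except (OSError, asyncio.TimeoutError):
        return host, port, False, time.perf_counter() - start
    elapsed = time.perf_counter() - start
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        pass  # the peer may already have closed; the port was reachable
    return host, port, True, elapsed


async def check_all(targets, timeout=1.0):
    """Check every (host, port) pair concurrently; return (results, total_seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(check_port(host, port, timeout) for host, port in targets)
    )
    return results, time.perf_counter() - start
```

Running check_all on half of the target list and then on the whole list gives the two totals to compare. If the total barely moves, the checks overlap well and the slowest single server dominates; if it roughly doubles, something is serializing the work.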

For a node web server, is it better to have more vCPUs or RAM

I am running a node app on a Digital Ocean cloud server, and the app merely services API requests. All client-side assets are served by a CDN, and the DB is accessed remotely, rather than stored on the server instance itself.
I have the choice of a greater number of vCPUs or RAM. I have no idea what that means in any way, so any feedback is a great help.
A single node.js process runs your JavaScript on only one CPU, so more CPUs will not make your JavaScript run any faster unless you cluster your app and run multiple node.js processes that share its load, or unless other processes on the same machine are doing work for your server.
Having more RAM (memory) will only improve things if you actually need more RAM. That depends entirely upon what the memory usage profile is of your app and how much RAM you already have available. Probably, you would already know if you were running out of RAM because you either get drastic slow-down when the OS starts page swapping or your process crashes when out of memory.
So, in order to know which would benefit you more, you really need more data on how your existing app is performing (whether it ever gets bogged down with CPU-intensive operations, and how much RAM it uses compared to how much you have available). It is quite possible that neither will actually matter to you; it depends entirely on the usage profile of your server process.
If you have no more data than this and have to make a choice, choose the vCPUs because there are some circumstances where it might help you (and gives you the option to go to clustering in the future if needed) whereas adding more RAM when you aren't even using what you already have won't help you at all.

JMeter never fails

I'm trying to stress test a server with JMeter. I followed the manual and successfully created the tests (the tests run OK and the responses are correct).
However, even if I keep increasing the number of threads, it never fails, yet I keep reading that there must be limits. So what am I doing wrong?
My CPU is running on +/-5% when I'm not running JMeter. Running 3000 threads I see the number of threads increase by 3000 and CPU usage goes to +/-15%. Also JMeter never complains something went wrong.
My JMeter configuration is:
Number of threads: 3000
Ramp-Up Period: 30
LoopCount: Forever (Let it run for over an hour and still nothing goes wrong)
The bottleneck now is my internet connection which simply can't handle this load and maxes out at 2.1Mbps. Is this causing the problem? It is increasing my latency from 10ms per thread to over 5000ms per thread, but threads are still running.
Assuming you have confirmed that you definitely aren't getting back any errors (e.g. using a results table listener, or logging/displaying only errors using a results graph listener) and your internet connection is running at capacity then yes, it does sound like your internet connection is the bottleneck. It doesn't sound like your server is being stressed at all.
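A quick back-of-the-envelope calculation shows why the link hides the server's limits. The 2.1 Mbps and 3000 threads come from the question; the 1000-byte size of one request/response exchange is purely an assumption for illustration, so substitute your real sizes.

```python
def per_thread_bps(link_bps, threads):
    """Average bandwidth each thread gets if the link is shared evenly."""
    return link_bps / threads


def max_requests_per_second(link_bps, exchange_bytes):
    """Upper bound on request rate when the link is the only limit."""
    return link_bps / (exchange_bytes * 8)


LINK_BPS = 2_100_000     # 2.1 Mbps, from the question
THREADS = 3000           # from the question
EXCHANGE_BYTES = 1000    # assumed size of one request plus its response

print(per_thread_bps(LINK_BPS, THREADS))                  # 700.0 bits/s per thread
print(max_requests_per_second(LINK_BPS, EXCHANGE_BYTES))  # 262.5 requests/s
```

Under that assumption the server never sees more than a few hundred requests per second no matter how many threads JMeter starts; the extra threads only queue behind the link, which would match the latency climbing while nothing fails.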
If you can easily make use of other machines (e.g. servers in the same location as the server you are testing), you could try using JMeter remote (distributed) testing to sidestep the limitations of your internet connection. See http://jmeter.apache.org/usermanual/remote-test.html for details.
Alternatively, if it's easy (e.g. if you're using VMs in a cloud and can easily spin one up with your software on it), you could try stress testing the least powerful server you can get, to see whether you can make it struggle even over your internet connection (just as a sanity check).
If this doesn't help, more details on your server (hardware specifications, web server software and thread pool settings, language) and the site/pages you are testing (mostly static or dynamic? large requests/responses?) would be useful. I've certainly managed to make lower-powered machines (e.g. EC2 m1.small) struggle using JMeter over a 2Mbps connection, but it depends on the site you're testing.

JBoss: 32 vs 64 Bit Performance differences?

I know that it is a pretty vague question but I was hoping to get some ideas about where to look as it is a little puzzling to me.
I have a web app that computes some value and returns it to the client (EJB remote calls). When I call my localhost from a main() test looping 10 times, it comes back within about 100 milliseconds. When I call the DEV machine following the same process, it is sometimes fast and sometimes really slow, like 4 seconds, which is a huge difference.
The weird thing is that my localhost is a 32-bit, 1GB JBoss config while my DEV machine is a 64-bit, 6GB JBoss config, so if anything I would expect my localhost to hang... not the DEV machine.
Where would you suggest starting the troubleshooting process?
If I understood right, both calls are made from the same computer? If that is the case, the network in between is a much more likely source of the response time differences than 32 vs. 64 bit.
If that is not the case, then monitor DEV and check what differs in context (other applications etc.) between the "fast" and the "4 seconds" cases. Either way, the difference in response times most likely has nothing to do with the difference between 32 bit and 64 bit.
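To line up the slow calls with what else is happening on DEV, it helps to record every sample rather than an average. A minimal sketch follows, in Python as a stand-in for the Java main() test loop; time_calls, find_outliers, and the factor of 5 are hypothetical choices, not anything from the original test.

```python
import statistics
import time


def time_calls(func, runs=10):
    """Call `func` `runs` times; return each call's duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def find_outliers(samples, factor=5.0):
    """Return (index, seconds) for samples slower than `factor` times the median."""
    median = statistics.median(samples)
    return [(i, s) for i, s in enumerate(samples) if s > factor * median]


# Nine calls around 100 ms and one 4-second call, as described in the question.
print(find_outliers([0.1] * 9 + [4.0]))  # [(9, 4.0)]
```

Logging a timestamp next to each outlier makes it possible to compare against server-side logs and resource monitoring for the same moment.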
Some time ago I worked on an application that was deployed on JBoss on two servers with exactly the same hardware configuration. The first server ran CentOS and the second FreeBSD. Exactly the same hardware, the same network, similar load. From what I observed, the application's responses were about 1.5 to 2 times faster when it was running on FreeBSD. At first sight this seemed strange to me, but after a week of tests the difference in response times was confirmed.
Since then I no longer consider hardware configuration as important as I thought before ;)
We resolved the issue after finding out that the install on the Linux machine actually had two different instances of JBoss running on the VM, which resulted in unpredictable behavior. The resources consumed were enormous, which did not make any sense given the app that was deployed...

CPU usage of Oracle installed Database machine

I am using Oracle 11g with an application written with the Spring framework. When I configure the database on a Sun Fire X4170 running Linux, the machine's CPU utilization is around 80-100%. However, when I move the same database to a Sun M3000 server running Unix (supposedly a more powerful machine), application performance goes down and CPU utilization remains at 90-100%. I can't figure out whether it is the application that causes this utilization or the database design.
I should add that the database is not relational; relationships are handled by the application.
Well you certainly can find some interesting opinions on the intertubes.
"Oracle does not have a true server architecture (others have it). Rather than performing classic server tasks, such as multi-threading, caching of data pages, parallel processing (split a query across many devices) etc. within itself, it uses the o/s to do all that. That means for each user process (PL/SQL connection) there is one unix process; 1000 users means 1000 unix processes, all competing for the same resources."
You might note that Oracle has had:
a connection pooling architecture (multi-threaded server) since version 7 (1992).
a cache for data pages (known helpfully as the buffer cache) since forever
parallel query (splitting a query across many processes) since version 7.1 (1993)
splitting queries across multiple servers since OPS (version 6) or across distributed databases (version 5)
It's also noteworthy that even if everything said there were correct rather than incorrect, it wouldn't actually help you determine the root cause.
"Especially noteworthy, because it uses file system files (not raw partitions), and the 'caching' is outside, it relies heavily on (and is very sensitive to) the file system cache that you have set up. Likewise, Oracle needs a massive amount of memory for these processes."
Oracle certainly can use raw partitions, again dating back to the last millennium. Moreover, if you wish to cache within the database (using the buffer cache that PerformanceDBA has forgotten about) and bypass the filesystem cache, this feature is available on all current filesystems. Oracle also supplies its own combined filesystem/volume manager in ASM, which you can use if you wish.
Oracle is also rather well instrumented (and if you have access to dtrace, so is Solaris). It can certainly tell you which sessions, processes, etc. are using the CPU and what the time the application spends in the database is consumed by (down to individual block read times if you care), so it is very amenable to profiling. I'd recommend that you check out Thinking Clearly about Performance, available at http://www.method-r.com/downloads/cat_view/38-papers-and-articles and written by one of the top Oracle performance experts in the world. If you have access to the Oracle Diagnostics Pack, then checking out ADDM reports first and AWR reports second would be profitable.
Trying to avoid a flame war here.
I should probably have separated the "how to find out" part of my response more clearly from my responses to PerformanceDBA's comments about server architecture. I share Stephanie's suspicions about the Spring framework, but without properly scoped measurement evidence there is no point in blaming any particular attribute of the environment; that would just be bias. Fortunately, the instrumentation built into the Oracle kernel allows you to trace and then profile the slow sessions to determine exactly where the issue lies. So I would do the following:
1) Enable tracing for a representative session (you can use the dbms_monitor package for that).
2) Also gather an execution plan for the statement(s) involved, using the gather_plan_statistics hint.
3) Profile the trace file by time using an appropriate profiler (tkprof, orasrp, Method-R Profiler).
Investigate the problem statements in order of their contribution to response time.
If you can't carry out the above, you can use ADDM and/or AWR if licensed, as I originally suggested, or Statspack if you are not licensed for the Diagnostics Pack. ADDM naturally concentrates on time consumers; if you are forced down the Statspack route, I suggest you do the same.
"The M3000 is certainly a more powerful machine, but it is more suitable for true servers. The X4170 with hyper-threads is more suited for file servers."
I'm not so certain about that. Have any data to support that claim?
An M3000 has one SPARC64 VII processor with 4 cores (tech specs), while an X4170 has 1 or 2 Intel 5500 "Nehalem-EP" processors, each with 4 cores (tech specs). I would expect much more from even a single-processor Nehalem-EP system than from the M3000. Obviously the data will vary slightly with the workload, but I know where I'd put my money.
