mpm configuration for httpd - linux

I run a website on 5 httpd servers (CentOS 7) in EC2; the instance type is m3.2xlarge.
The servers sit behind a load balancer.
Memory usage gradually climbs on all the instances.
For example:
Memory usage a few seconds after restarting the httpd service:
[centos#ip-10-0-1-77 ~]$ while sleep 1; do free -m; done
total used free shared buff/cache available
Mem: 29741 2700 26732 36 307 26728
Swap: 0 0 0
total used free shared buff/cache available
Mem: 29741 2781 26651 36 307 26647
Swap: 0 0 0
total used free shared buff/cache available
Mem: 29741 2820 26613 36 307 26609
Swap: 0 0 0
[centos#ip-10-0-1-77 ~]$
.
.
.
This is what I see after an hour:
[centos#ip-10-0-1-77 ~]$ free -m
total used free shared buff/cache available
Mem: 29741 29092 363 41 284 346
Swap: 0 0 0
It carries on like this and consumes all the memory (30 GB) within an hour.
To avoid this I switched to the worker MPM.
I added the following configuration at the bottom of /etc/httpd/httpd.conf.
<IfModule mpm_worker_module>
MaxRequestWorkers 2500
MaxSpareThreads 250
MinSpareThreads 75
ServerLimit 100
StartServers 3
ThreadsPerChild 25
</IfModule>
Can someone suggest the right configuration to use the RAM properly on all the instances?

A standard Apache process takes up about 12 MB of RAM. If you have 30 GB reserved for Apache, you will never reach that with a ServerLimit of 100 (100 * 12 MB = 1200 MB = 1.2 GB). So I assume that Apache isn't what is taking up all that memory.
Is there an application that is involved or a DB? Those can take up larger amounts of RAM.
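The back-of-the-envelope estimate above can be written down as a small Python sketch. The 12 MB per process figure is the rough assumption from this answer, not a measured value, and the function name is only illustrative; measure the real per-process footprint on your own servers before trusting the result.

```python
def worker_memory_estimate_mb(max_request_workers, threads_per_child, mb_per_process):
    """Rough upper bound on Apache worker-MPM memory use, in MB."""
    # At full load the worker MPM needs ceil(MaxRequestWorkers / ThreadsPerChild)
    # child processes; -(-a // b) is integer ceiling division.
    processes = -(-max_request_workers // threads_per_child)
    return processes, processes * mb_per_process


# Values from the question's config, with the assumed 12 MB per process.
processes, total_mb = worker_memory_estimate_mb(2500, 25, 12)
print(processes, total_mb)  # 100 1200
```

Even with generous per-process figures the total stays far below 30 GB, which is why something other than the Apache processes themselves is the more likely culprit.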
For your servertuning.conf (or httpd.conf since you put it there):
<IfModule mpm_worker_module>
#max number of requests one child process handles before it's forced to close, 2.5k seems almost a little low
MaxRequestsPerChild 2500
#maximum number of worker threads which are kept spare
#250 seems quite high, but really depends on the traffic you are experiencing, we normally don't use more than 50-75
MaxSpareThreads 250
#minimum number of worker threads which are kept spare
#really high, only useful if you often experience higher bursts of traffic
#otherwise you should be fine with 15-30, maybe 50 if you experience higher fluctuation -> bigger bursts of requests
MinSpareThreads 75
#upper limit on the configurable number of threads per child process
#you have to increase this if you want more than 64 ThreadsPerChild
ThreadLimit 64
#maximum number of simultaneous client connections
#that is really low! --> increase that, definitely! We run on 1000, so about 12GB max
MaxClients 100
#initial number of server processes to start
#3 is really low if you expect your server to be flooded with requests the second you start it
#maybe turn it up a little to around 20 or even 50 if you receive lots of traffic right after a restart
StartServers 3
#number of worker threads created by each child process
#25 threads per child is not too much, but at some point the administration of xx threads gets more expensive than creating new ones
#would suggest leaving it at 25 or turning it up to around 40
ThreadsPerChild 25
</IfModule>
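Notice that I changed ServerLimit to MaxClients and MaxRequestWorkers to MaxRequestsPerChild, because those are the names I know from mpm-worker. Be aware that these are not straight renames: in Apache 2.4, MaxClients is the older name for MaxRequestWorkers and MaxRequestsPerChild is the older name for MaxConnectionsPerChild (the old names are still accepted), while ServerLimit is a separate directive that caps the number of child processes.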
Additionally you can change following variables:
#KeepAlive: Whether or not to allow persistent connections (more than
#one request per connection). Set to "Off" to deactivate.
#if it's on, leave it there
KeepAlive On
#MaxKeepAliveRequests: The maximum number of requests to allow
#during a persistent connection. Set to 0 to allow an unlimited amount.
#We recommend you leave this number high, for maximum performance.
#default=100, but you can turn that up if your sites contain a lot of items (img, css, ...)
#we are using about 20*<average object-count per site> = 600
MaxKeepAliveRequests 600
#KeepAliveTimeout: Number of seconds to wait for the next request from the
#same client on the same connection.
#would recommend to decrease that, otherwise you could become a victim of slow-dos attacks
#default is 15, we are running just fine on 5
KeepAliveTimeout 5
To further prevent slow-dos or piling up of open sessions, you can use mod_reqtimeout:
<IfModule mod_reqtimeout.c>
# allow 10s timeout for the headers and allow 1s more until 20s upon receipt of 1000 bytes.
# almost the same with the body, except that it is tricky to
# limit the request timeout within the body at all - it may take time to generate the body.
# below are the default values
#RequestReadTimeout header=10-20,MinRate=1000 body=20,MinRate=1000
# and this is how I'd set them with today's internet speed
# deduct the according numbers from explanation above...
RequestReadTimeout header=2-5,MinRate=100000 body=5-10,MinRate=1000000
</IfModule>
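To make the "initial-max,MinRate" notation above easier to reason about, here is a small Python sketch of how the allowed read time grows as data arrives. It is a simplified model of the behaviour described in the comments (one extra second per MinRate bytes received, capped at the maximum), not Apache's actual code, and the function name is made up for illustration.

```python
def allowed_read_timeout(initial_s, max_s, min_rate, bytes_received):
    """Seconds allowed for a read stage under '<initial_s>-<max_s>,MinRate=<min_rate>'."""
    # One extra second is granted for every min_rate bytes received,
    # and the total never exceeds max_s.
    return min(max_s, initial_s + bytes_received // min_rate)


# header=10-20,MinRate=1000
print(allowed_read_timeout(10, 20, 1000, 0))       # 10
print(allowed_read_timeout(10, 20, 1000, 3000))    # 13
print(allowed_read_timeout(10, 20, 1000, 50000))   # 20

# header=2-5,MinRate=100000 (the stricter values suggested above)
print(allowed_read_timeout(2, 5, 100000, 200000))  # 4
```

With the stricter values a client has to deliver roughly 100 KB per extra second, so slow senders hit the cap quickly. That is the point of the setting, but it can also cut off legitimate clients on poor connections.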
If that's not enough to fix your RAM issues (if they are really caused by Apache), use your server OS's tools to find out what is taking up all the RAM.

Related

Configuring Snap for performance

I'm just playing with the Snap framework and wanted to see how it performs against other frameworks (under completely artificial circumstances).
What I have found is that my Snap application tops out at about 1500 requests/second (the app is simply snap init; snap build; ./dist/app/app, i.e. no code changes to the default app created by snap):
$ ab -n 20000 -c 500 http://127.0.0.1:8000/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests
Server Software: Snap/0.9.5.1
Server Hostname: 127.0.0.1
Server Port: 8000
Document Path: /
Document Length: 721 bytes
Concurrency Level: 500
Time taken for tests: 12.845 seconds
Complete requests: 20000
Failed requests: 0
Total transferred: 17140000 bytes
HTML transferred: 14420000 bytes
Requests per second: 1557.00 [#/sec] (mean)
Time per request: 321.131 [ms] (mean)
Time per request: 0.642 [ms] (mean, across all concurrent requests)
Transfer rate: 1303.07 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 44 287.6 0 3010
Processing: 6 274 153.6 317 1802
Waiting: 5 274 153.6 317 1802
Total: 20 318 346.2 317 3511
Percentage of the requests served within a certain time (ms)
50% 317
66% 325
75% 334
80% 341
90% 352
95% 372
98% 1252
99% 2770
100% 3511 (longest request)
I then fired up a Grails application, and it seems like Tomcat (once the JVM warms up) can take a bit more load:
$ ab -n 20000 -c 500 http://127.0.0.1:8080/test-0.1/book
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests
Server Software: Apache-Coyote/1.1
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /test-0.1/book
Document Length: 722 bytes
Concurrency Level: 500
Time taken for tests: 4.366 seconds
Complete requests: 20000
Failed requests: 0
Total transferred: 18700000 bytes
HTML transferred: 14440000 bytes
Requests per second: 4581.15 [#/sec] (mean)
Time per request: 109.143 [ms] (mean)
Time per request: 0.218 [ms] (mean, across all concurrent requests)
Transfer rate: 4182.99 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 67 347.4 0 3010
Processing: 1 30 31.4 21 374
Waiting: 0 26 24.4 20 346
Total: 1 97 352.5 21 3325
Percentage of the requests served within a certain time (ms)
50% 21
66% 28
75% 35
80% 42
90% 84
95% 230
98% 1043
99% 1258
100% 3325 (longest request)
I'm guessing that part of this could be the fact that Tomcat seems to reserve a lot of RAM and can keep/cache some methods. During this experiment Tomcat was using in excess of 700 MB of RAM while Snap barely approached 70 MB.
Questions I have:
Am I comparing apples and oranges here?
What steps would one take to optimise Snap for throughput/speed?
Further experiments:
Then, as suggested by mightybyte, I started experimenting with +RTS -A4M -N4 options. The app was able to serve just over 2000 requests per second (about 25% increase).
I also removed the nested templating and served a document (same size as before) from the top level tpl file. This increased the performance to just over 7000 requests a second. The memory usage went up to about 700MB.
I'm by no means an expert on the subject so I can only really answer your first question, and yes you are comparing apples and oranges (and also bananas without realizing it).
First off, it looks like you are attempting to benchmark different things, so naturally, your results will be inconsistent. One of these is the sample Snap application and the other is just "a Grails application". What exactly are each of these things doing? Are you serving pages? Handling requests? The difference in applications will explain the differences in performance.
Secondly, the difference in RAM usage also shows the difference in what these applications are doing. Haskell web frameworks are very good at handling large instances without much RAM where other frameworks, like Tomcat as you saw, will be limited in their performance with limited RAM. Try limiting both applications to 100mb and see what happens to your performance difference.
If you want to compare the different frameworks, you really need to run a standard application to do that. Snap did this with a Pong benchmark. The results of an old test (from 2011 and Snap 0.3) can be seen here. This paragraph is extremely relevant to your situation:
If you’re comparing this with our previous results you will notice that we left out Grails. We discovered that our previous results for Grails may have been too low because the JVM had not been given time to warm up. The problem is that after the JVM warms up for some reason httperf isn’t able to get any samples from which to generate a replies/sec measurement, so it outputs 0.0 replies/sec. There are also 1000 connreset errors, so we decided the Grails numbers were not reliable enough to use.
As a comparison, the Yesod blog has a Pong benchmark from around the same time that shows similar results. You can find that here. They also link to their benchmark code if you would like to try to run a more similar benchmark, it is available on Github.
The answer by jkeuhlen makes good observations relevant to your first question. As to your second question, there are definitely things you can play with to tune performance. If you look at Snap's old raw result data, you can see that we were running the application with +RTS -A4M -N4. The -N4 option tells the GHC runtime to use 4 threads. (Note that you have to build the application with -threaded to do this.) The -A4M option sets the size of the garbage collector's allocation area. Our experiments showed that these two seemed to have the biggest impact on performance. But that was done a long time ago and GHC has changed a lot since then, so you probably want to play around with them and find what works best for you. This page has in-depth information about other command line options available to control GHC's runtime if you wish to do more experimentation.
A little work was done last year on updating the benchmarks. If you're interested in that, look around the different branches in the snap-benchmarks repository. It would be great to get more help on a new set of benchmarks.

How to scale ejabberd Server machine on CentOS to handle 200 K connections?

I am working on a considerably good ejabberd instance with 40 core CPU machine and 160 GB RAM.
The issue is I am unable to scale up to 200 K parallel connections.
The sysctl config is as follows:
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
#http://linux-ip.net/html/ether-arp.html#ether-arp-flux
net.ipv4.conf.all.arp_filter = 1
kernel.exec-shield=1
kernel.randomize_va_space=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_source_route=0
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.ip_local_port_range = 12000 65535
fs.nr_open = 20000500
fs.file-max = 1000000
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 10000
The /etc/security/limits.conf file entries are as follows:
* soft core 900000
* hard rss 900000
* soft nofile 900000
* hard nofile 900000
* soft nproc 900000
* hard nproc 900000
The machine starts to lose connections when the server reaches around 112 K.
Things that happen around 112 K
The CPU usage goes up to 200 ~ 300 % (but it is the usual spike)
Background - When all things are normal the CPU usage shoots up to 80 % as seen below (only two CPUs are doing actual work)
I am unable to work on the machine. I am using the top and ss commands to see what is going on on the server. The machine just stops responding at this point and the connections begin to drop.
What is a saving grace is that the connections don't drop abruptly, but drop at the rate they are connected.
I am using TSUNG to generate the load. There are 4 load generator boxes hitting 4 different ips mapped to only one machine internally.
Any suggestions, opinions are very welcome.
As a first step you would need to establish what the bottleneck is in your case:
CPU
Memory
System limits (open sockets, open files)
Application architecture
If possible add a resource-tracking application to your node, e.g. recon. It will allow you to check the length of process queues, memory fragmentation, etc. In our production system the amount of memory consumed by Erlang VM was different when reported by the system than when reported by the Erlang VM itself due to Transparent Huge Pages (the system was virtualized). There may be other issues that may not be obvious when inspecting the node using system tools.
So I would propose:
Determine processes with the longest queue sizes - they will be responsible for slowing down the system because Erlang VM needs to scan the whole inbox of a process when it receives a message
Determine processes with the biggest amount of allocated memory
Determine how much memory Erlang itself thinks is allocated
Also, it would be good if you added parameters used to start the Erlang VM.
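For the "system limits" item, a quick sanity check is to compare the file-descriptor limit that actually applies to a process with the connection target. The Python sketch below only inspects the limits of the Python process that runs it, and the reserve of 1024 descriptors for non-socket files is an arbitrary assumption. For the running ejabberd process you would look at /proc/<pid>/limits instead.

```python
import resource


def fd_headroom(soft_limit, target_connections, reserve=1024):
    """Descriptors left over after target_connections sockets plus a reserve."""
    # A negative result means the limit is too low for the target.
    return soft_limit - reserve - target_connections


if __name__ == "__main__":
    # RLIMIT_NOFILE is the per-process open-file limit (what 'nofile' in limits.conf sets).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    if soft != resource.RLIM_INFINITY:
        print("headroom for 200K connections:", fd_headroom(soft, 200_000))
```

If the effective limit turns out to be much lower than the 900000 configured in limits.conf, the service is probably not picking up those settings.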
Addition
Forgot to mention that it may be worth looking at the tuning WhatsApp did to their Erlang nodes to handle hundreds of thousands of simultaneous connections:
The WhatsApp Architecture Facebook Bought For $19 Billion

Linux(Ubuntu) load average higher than total-true-utilization?

I have a Dell pd2950 (2x4 cores) server running Ubuntu Server 12.04 LTS, with a VLC encoder instance running on it. Recently I updated the script (VLM) for VLC to increase quality, which also increases CPU utilization. So I started tuning the script to avoid exceeding maximum utilization. I use top to monitor CPU utilization. I found that the load average is higher than 100% (I have 8 cores in total, so 8.00 is 100%) but 20-35% is still idle, like:
top - 21:41:19 up 2 days, 17:15, 1 user, load average: 9.20, 9.65, 8.80
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
Cpu(s): 32.8%us, 0.7%sy, 29.7%ni, 36.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1982680k total, 1735672k used, 247008k free, 126284k buffers
Swap: 0k total, 0k used, 0k free, 774228k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9715 wilson RT 0 2572m 649m 13m S 499 33.5 13914:44 vlc
11663 wilson 20 0 17344 1328 964 R 2 0.1 0:02.00 top
1 root 20 0 24332 2264 1332 S 0 0.1 0:01.06 init
2 root 20 0 0 0 0 S 0 0.0 0:00.09 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:27.05 ksoftirqd/0
4 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/0:0
5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H
To confirm that my CPUs don't have Hyper-Threading, I tried:
wilson#server:/$ nproc
8
And to reduce the sampling deviation caused by the refresh time, I also tried:
wilson#server:/$ top -d 0.1
I watched the %id number for a long time; it has never been lower than 14.
I also tried:
wilson#server:/$ uptime
21:57:20 up 2 days, 17:31, 1 user, load average: 9.03, 9.12, 9.35
The 1m load average often reaches 14-15. So I'm wondering what's wrong with my system. Has anyone ever had this problem?
More information:
I'm using VLC with the x264 codec to encode a live HTTP stream (application/octet-stream). It uses ffmpeg (libavc) to decode and outputs Apple HLS (.ts segments). I found this problem after I added these arguments for x264:
level=41,ref=5,b-adapt=2,direct=auto,me=umh,subq=8,rc-lookahead=60,analyse=all
This is almost equal to preset=slower. And as you can see, my VLC is running in real-time. The parameter is:
wilson#server:/$ chrt -p -f 99 vlc-wrapper
There does not appear to be anything wrong with your system. What is wrong seems to be your understanding of CPU accounting. In particular, load average has nearly nothing at all to do with CPU usage. Load average is based on the number of processes that are running or ready to run (not sleeping while waiting on network, keyboard input, etc.), whether or not there is an available CPU for them to be scheduled on. On Linux it also counts processes in uninterruptible sleep, which usually means waiting on disk I/O. While it's true that, given an 8 core system, if all 8 cores are 100% busy with a single CPU-bound thread each, your load average should be around 8.00, it is entirely possible to have a load average of 200.0 with near-0% CPU utilization. All that would indicate is that you have 200 processes that are ready to run, but as soon as they get scheduled, they do almost nothing before they go back to waiting for input of some sort.
Your top output shows that vlc seems to be using roughly the equivalent of 5 of your cores, but it doesn't indicate whether you have 5 cores at 100% each, or if all 8 cores are at 62.5% each. All of the other processes listed by top also contribute to your load average, as well as CPU usage. In particular, top running with a short delay like your example of 0.1 seconds, will probably increase your load average by almost 1 itself, even though, overall, it's not using a lot of CPU time.
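The two numbers being mixed up here can be put side by side with a few lines of Python. This is only arithmetic on the figures from the top output in the question (load average 9.20, vlc at 499 %CPU, 8 cores); the function names are made up for illustration.

```python
def load_per_core(load_average, cores):
    """Average number of runnable tasks per core."""
    return load_average / cores


def per_core_share(process_cpu_percent, cores):
    """Per-core utilisation if the process's CPU time were spread evenly over all cores."""
    return process_cpu_percent / cores


print(round(load_per_core(9.20, 8), 2))  # 1.15
print(per_core_share(499, 8))            # 62.375
```

A load of about 1.15 runnable tasks per core alongside roughly 62% average utilisation is not a contradiction: the first counts tasks waiting for or using a CPU, the second measures how busy the CPUs were.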
Read this:
Understanding load average vs. cpu usage
If the load average is at 7, with 4 hyper-threaded processors, shouldn't that mean that the CPU is working to about 7/8 capacity?
No it just means that you have 7 running processes in the job queue on average.
But I think that we can't use load average as a reference number to determine whether the system is overloaded. So I wonder whether there is a kernel-level CPU utilization statistics tool? (Kernel-level in order to reduce the performance cost.)

What do the mod_pagespeed statistics mean?

Here's a dump of the stats provided by mod_pagespeed from one of my sites.
resource_url_domain_rejections: 6105
rewrite_cached_output_missed_deadline: 4801
rewrite_cached_output_hits: 116004
rewrite_cached_output_misses: 934
resource_404_count: 0
slurp_404_count: 0
total_page_load_ms: 0
page_load_count: 0
resource_fetches_cached: 0
resource_fetch_construct_successes: 45
resource_fetch_construct_failures: 0
num_flushes: 947
total_fetch_count: 0
total_rewrite_count: 0
cache_time_us: 572878
cache_hits: 872
cache_misses: 1345
cache_expirations: 242
cache_inserts: 1795
cache_extensions: 50799
not_cacheable: 0
css_file_count_reduction: 0
css_elements: 0
domain_rewrites: 0
google_analytics_page_load_count: 0
google_analytics_rewritten_count: 0
image_inline: 7567
image_rewrite_saved_bytes: 208854
image_rewrites: 34128
image_ongoing_rewrites: 0
image_webp_rewrites: 0
image_rewrites_dropped_due_to_load: 0
image_file_count_reduction: 0
javascript_blocks_minified: 12438
javascript_bytes_saved: 1173778
javascript_minification_failures: 0
javascript_total_blocks: 12439
js_file_count_reduction: 0
converted_meta_tags: 902
url_trims: 54765
url_trim_saved_bytes: 1651244
css_filter_files_minified: 0
css_filter_minified_bytes_saved: 0
css_filter_parse_failures: 2
css_image_rewrites: 0
css_image_cache_extends: 0
css_image_no_rewrite: 0
css_imports_to_links: 0
serf_fetch_request_count: 1412
serf_fetch_bytes_count: 12809245
serf_fetch_time_duration_ms: 28706
serf_fetch_cancel_count: 0
serf_fetch_active_count: 0
serf_fetch_timeout_count: 0
serf_fetch_failure_count: 0
Can someone please explain what all of the stats mean?
There's a lot of stats here. I'm going to just describe a few of them, because this will get long. We should probably add detailed doc. I can follow-up with more answers later if these are useful.
resource_url_domain_rejections: 6105: this means that since your server restarted, mod_pagespeed has found 6105 resources that it is not going to rewrite because their domains are not authorized for rewriting with the ModPagespeedDomain directive. This is common and occurs any time someone refreshes a page with a Twitter, Facebook, or Google+ widget.
rewrite_cached_output_missed_deadline: 4801: when a resource (e.g. a JPEG image) is optimized, it happens in a background thread, and the result is cached so that subsequent page views referencing the same resource are fast. To avoid slowing down the time-to-first-byte of the first view, however, we use a 10 millisecond timer. This stat counts how many times that deadline was exceeded, in which case the resource is left unchanged for that view, but the optimization continues in the background and the result is still written to the cache.
rewrite_cached_output_hits: 116004: counts the number of times we served an optimized resource from the cache, thus avoiding the need to re-optimize it.
rewrite_cached_output_misses: 934: counts the number of times we looked up a resource in our cache and it wasn't there, forcing us to rewrite it. Note that we would also rewrite a resource that was in the cache, but whose origin cache expiration-time had expired. E.g. if your images had cache-control:max-age=600 then we would re-fetch them every 10 minutes to see if they've changed. If they have changed we must re-optimize them.
num_flushes: 947: this is the number of times the Apache resource-generator for the HTML (e.g. mod_php or Wordpress) called the Apache function ap_flush(), which causes partial HTML to be flushed all the way through to the user's browser. This is interesting for mod_pagespeed because it can limit the amount of optimization we can do (e.g. we can't combine CSS files whose elements are separated by a Flush).
cache_time_us: 572878 - the total amount of time, in microseconds, spent waiting for mod_pagespeed's HTTP Cache (file + memory) to respond to a lookup request, since the server was started.
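One way to read the hit and miss counters together is as a ratio. The Python sketch below parses "name: value" lines like the dump in the question and computes the share of rewrite lookups that were served from cache. The helper names are made up for illustration, and this ratio is just a convenient summary, not something mod_pagespeed reports itself.

```python
def parse_stats(text):
    """Parse 'name: value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rewrite_hit_ratio(stats):
    """Fraction of rewrite-cache lookups that found an optimized resource."""
    hits = stats["rewrite_cached_output_hits"]
    misses = stats["rewrite_cached_output_misses"]
    return hits / (hits + misses)


sample = """rewrite_cached_output_missed_deadline: 4801
rewrite_cached_output_hits: 116004
rewrite_cached_output_misses: 934"""

stats = parse_stats(sample)
print(f"{rewrite_hit_ratio(stats):.4f}")  # 0.9920
```

For the numbers in the question that is a hit ratio of about 99.2%, which suggests the rewrite cache is doing its job.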
I think that's enough for now. Are there specific other statistics you'd like to learn more about?
Most of these were created for us to monitor the health of mod_pagespeed as it's running, and to help diagnose users' issues. I have to admit we haven't used it much for that purpose, but we use them during development.

How do I stress test a web form file upload?

I need to test a web form that takes a file upload.
The filesize in each upload will be about 10 MB.
I want to test if the server can handle over 100 simultaneous uploads, and still remain
responsive for the rest of the site.
Repeated form submissions from our office will be limited by our local DSL line.
The server is offsite with higher bandwidth.
Answers based on experience would be great, but any suggestions are welcome.
Use the ab (ApacheBench) command-line tool that is bundled with Apache
(I have just discovered this great little tool). Unlike cURL or wget,
ApacheBench was designed for performing stress tests on web servers (any type of web server!).
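It generates plenty of statistics too. The following command will send an
HTTP POST request with the contents of test.jpg as its body to http://localhost/
100 times, with up to 4 concurrent requests. Note that -p sends the file as the
raw request body, so to imitate a form upload the file has to contain a complete
multipart/form-data body and the matching content type has to be set with -T.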
ab -n 100 -c 4 -p test.jpg http://localhost/
It produces output like this:
Server Software:
Server Hostname: localhost
Server Port: 80
Document Path: /
Document Length: 0 bytes
Concurrency Level: 4
Time taken for tests: 0.78125 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Non-2xx responses: 100
Total transferred: 2600 bytes
HTML transferred: 0 bytes
Requests per second: 1280.00 [#/sec] (mean)
Time per request: 3.125 [ms] (mean)
Time per request: 0.781 [ms] (mean, across all concurrent requests)
Transfer rate: 25.60 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 2.6 0 15
Processing: 0 2 5.5 0 15
Waiting: 0 1 4.8 0 15
Total: 0 2 6.0 0 15
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 15
95% 15
98% 15
99% 15
100% 15 (longest request)
Automate Selenium RC using your favorite language. Start 100 threads of Selenium, each typing the path of a file into the input and clicking submit.
You could generate 100 sequentially named files to make looping over them easy, or just use the same file over and over again.
I would perhaps guide you towards using cURL and submitting just random stuff (like, read 10MB out of /dev/urandom and encode it into base32), through a POST-request and manually fabricate the body to be a file upload (it's not rocket science).
Fork that script 100 times, perhaps over a few servers. Just make sure that sysadmins don't think you are doing a DDoS, or something :)
Unfortunately, this answer remains a bit vague, but hopefully it helps you by nudging you in the right track.
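To make the "manually fabricate the body" part a little less vague, here is a minimal Python sketch that builds a multipart/form-data body around 10 MB of random bytes and writes it to a file. The field name "upload" and the file names are placeholders, so use whatever your form actually expects. It also sends the raw bytes as application/octet-stream rather than base32-encoding them, since a multipart body can carry binary data as-is.

```python
import os
import uuid


def build_multipart_body(field_name, filename, payload, boundary):
    """Return the bytes of a multipart/form-data body carrying a single file."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("ascii")
    # The closing delimiter is the boundary followed by two extra dashes.
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


if __name__ == "__main__":
    boundary = uuid.uuid4().hex
    payload = os.urandom(10 * 1024 * 1024)  # 10 MB of random data
    body = build_multipart_body("upload", "random.bin", payload, boundary)
    with open("upload_body.bin", "wb") as fh:
        fh.write(body)
    # This header value has to accompany the request that sends the file.
    print(f"Content-Type: multipart/form-data; boundary={boundary}")
```

The resulting file can then be sent as a raw POST body, for example with curl's --data-binary option or ab's -p option, as long as the printed Content-Type is passed along (-H for curl, -T for ab).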
Continued as per Liam's comment:
If the server receiving the uploads is not on the same LAN as the clients connecting to it, it would be better to use nodes that are as remote as possible for stress testing, if only to simulate behavior that is as authentic as possible. But if you don't have access to computers outside the local LAN, the local LAN is always better than nothing.
Stress testing from the same hardware would not be a good idea, as you would put double load on the server: generating the random data, packing it, and sending it through the TCP/IP stack (although probably not over Ethernet), and only then can the server do its magic. If the sending part is moved elsewhere, the receiving end gets roughly double the performance (taken with an arbitrarily sized grain of salt).
