ec2 linux performance in bash scripts of using $PATH vs /FULL/NAME vs ALIAS - linux

I am considering the speed/memory cost of scaling a large number of small bash scripts. The permutations of scripts calling each other will be quite large, <=5000 scripts.
Is there a benchmark or way of profiling best performance for using a particular method to call each script, say
searching $PATH (could be many deep directories needing to be added to $PATH
using MYFILE=/full/directory/path/file.sh in a number of config files
Using a large number of alias commands to store the more heavily used scripts
The target system will be a scaling cluster of m8.xl type ec2 instances that are general compute optimized with standard ssd i/o.
The reason for using this method is the development of the scripts will be produced by several teams and the only common document for reference is a basic db with function=myfile -p param, return=x, etc. so if a developer needs a function or script they would call the script name and one of the methods above would point to the actual script.
UPDATE
To clarify, the speed of launching the script is not that important if it is only marginally faster to use a method. The concern is the weight of script names being housed in memory having a detrimental effect on the overall performance.
Several hundred or thousand scripts may be spawned in a burst over a few seconds, so the peak load is a concern factor.
Thx
Art

Related

Why are some Bash commands both built-in and external?

Some commands are internal built-in Bash commands while others are external (other programs). I see why certain commands need to be built-in. Some of the reasons are:
If a command needs to change the internal state of the shell process.
If a command performs a very basic operation in the shell.
If a command is called often and needs to be made fast. An external command is executed by loading an external program and hence is slower.
But why are some commands both built-in and external, for example echo and test? I understand echo is used a lot and thus is built-in (Reason 3). But then why also have it as an external command and have a binary for it in /bin/echo? The built-in version of echo will always take precedence over the external version and thus, the external version is hardly ever used. So, why then have an external version of it at all?
It's exactly your point 3. When a command does very little (echo is a good example), spawning a new process dominates the run time behavior. With growing disks and bandwidth and code bases you always reach a spot when you have so much data and so many files (our code base at work has 100k files!!) that one less spawn per file makes a difference of minutes.
That's also why the typical built-in is a drop-in replacement which takes (perhaps a superset of) the same arguments as the binary.
You also ask why the old binary is still retained even though Bash has it as a built-in — the answer is that a lot of programs rely on the existence of that /bin/echo. It's actually standardized.
Bash is only one of many user interfaces and offline command interpreters. They all have different sets of built-ins. Some shells are purposefully small and rely a lot on what you could call "legacy" binaries. One example is ash and its successor, Dash. Dash is now the default /bin/sh in Ubuntu and Debian due to its speed, and is popular for embedded systems due to its small size. (But even Dash has builtins for echo, test and dozens of other commands, and provides a command history for interactive use.)

Identify Resource Usage of a Process: CPU, Memory and I/O

I need to request some AWS resources and in order to do so, I need to identify what are my requirements for:
Number of CPU cores (maybe GPUs as well) (performs parallel processing)
Amount of Memory required
I/O & network read/write time (optional / good to know)
How can I profile my script so that I know if I am using the requested resources to the fullest?
How can I profile the whole system? Something along the following lines:
i. Request large number of compute resources (CPUs and RAM) on AWS
ii. Start the system profiler
iii. Run my program and wait for it to finish
iv. Stop the system profiler and identify the peak #CPUs and RAM used
Context: Unix / Linux
You could use time(1), but there are two variants of it.
the time builtin in most shells is usually not enough for your needs, so...
you need the /usr/bin/time program from the time package
Then you'll run /usr/bin/time -v yourscript
and inside your script you could use the times builtin (see this)
You also should consider using perf(1) and oprofile(1)
At last, you might eventually code something or use something querying the kernel thru /proc/ (see proc(5)). Utilities like xosview are using that.

How to monitor a process in Linux CPU, Memory and time

How can I benchmark a process in Linux? I need something like "top" and "time" put together for a particular process name (it is a multiprocess program so many PIDs will be given)?
Moreover I would like to have a plot over time of memory and cpu usage for these processes and not just final numbers.
Any ideas?
I typically throw together a simple script for this type of work.
Take a look at the kernel documentation for the proc filesystem (Google 'linux proc.txt').
The first line of /proc/stat (Section 1.8 in proc.txt) will give you cumulative cpu usage stats (i.e. user, nice, system, idle, ...). For each process, the file /proc/$PID/stat (Table 1-4 in proc.txt) will provide you with both process-specific cpu usage stats and memory usage stats (see rss).
If you google a bit you'll find plenty of detailed info on these files, and pointers to libraries / apps / code snippets that can help you obtain / derive the values you need. With that in mind, I'll focus on the high-level strategy.
For CPU stats, use your favorite scripting language to create an executable that takes a set of process ids for monitoring. At a fixed interval (ex: 1 second) poll / calculate the cumulative totals for each process and the system as a whole. During each poll interval, write all results on a single line to stdout.
For memory stats, write a similar script, but simply log the per-process memory usage. Memory is a bit easier as we directly obtain the instantaneous values.
Run these script for the duration of your test, passing the set of processes ids that you'd like to monitor and redirecting its output to a log file.
./logcpu $(pidof foo) $(pidof bar) > cpustats
./logmem $(pidof foo) $(pidof bar) > memstats
Import the contents of these files into a spreadsheet (for certain applications this is as easy as copy / paste). For CPU, you are after instantaneous values but have cumulative values, so you'll need to do some minor spreadsheet work to derive these values (it's just the delta 't(x + 1) - t(x)'). Of course you could have your cpu logger write the delta, but you'll be spending a bit more time up front on the script.
Finally, use your spreadsheet to generate a nice plot.
Following are the tools for monitoring a linux system
System commands like top, free -m, vmstat, iostat, iotop, sar, netstat, etc. Nothing comes near these linux utility when you are debugging a problem. These command give you a clear picture that is going inside your server
SeaLion: Agent executes all the commands mentioned in #1 (also user defined) and outputs of these commands can be accessed in a beautiful web interface. This tool comes handy when you are debugging across hundreds of servers as installation is clear simple. And its FREE
Nagios: It is the mother of all monitoring/alerting tools. It is very much customization but very much difficult to setup for beginners. There are sets of tools called nagios plugins that covers pretty much all important Linux metrics
Munin
Server Density: A cloudbased paid service that collects important Linux metrics and gives users ability to write own plugins.
New Relic: Another well know hosted monitoring service.
Zabbix

Can you load a tree structure in memory with Linux shell?

I want to create an application with a Linux shell script like this — but can it be done?
This application will create a tree containing data. The tree should be loaded in the memory. The tree (loaded in memory) could be readable from any other external Linux script.
Is it possible to do it with a Linux shell?
If yes, how can you do it?
And are there any simple examples for that?
There are a large number of misconceptions on display in the question.
Each process normally has its own memory; there's no trivial way to load 'the tree' into one process's memory and make it available to all other processes. You might devise a system of related programs that know about a shared memory segment (somehow — there's a problem right there) that contains the tree, but that's about it. They'd be special programs, not general shell scripts. That doesn't meet your 'any other external Linux script' requirement.
What you're seeking is simply not available in the Linux shell infrastructure. That answers your first question; the other two are moot given the answer to the first.
There is a related discussion here. They use shared memory device /dev/shm and, ostensibly, it works for multiple users. At least, it's worth a try:
http://www.linuxquestions.org/questions/linux-newbie-8/bash-is-it-possible-to-write-to-memory-rather-than-a-file-671891/
Edit: just tried it with two users on Ubuntu - looks like a normal directory and REALLY WORKS with the right chmod.
See also:
http://www.cyberciti.biz/tips/what-is-devshm-and-its-practical-usage.html
I don't think there is a way to do this as if you want to keep all the requirements of:
Building this as a shell script
In-memory
Usable across terminals / from external scripts
You would have to give up at least one requirement:
Give up shell script req - Build this in C to run as a Linux process. I only understand this up to the point to say that it would be non-trivial
Give up in-memory req - You can serialize the tree and keep the data in a temp file. This works as long as the file is small and performance bottleneck isn't around access to the tree. The good news is you can use the data across terminals / from external scripts
Give up usability from external scripts req - You can technically build a script and run it by sourcing it to add many (read: a mess of) variables representing the tree into your current shell session.
None of these alternatives are great, but if you had to go with one, number 2 is probably the least problematic.

Profiling very long running tasks

How can you profile a very long running script that is spawning lots of other processes?
We have a job that takes a long time to run - 11 or more hours, sometimes more than 17 - so it runs on a Amazon EC2 instance.
(It is doing cufflinks DNA alignment and stuff.)
The job is executing lots of processes, scripts and utilities and such.
How can we profile it and determine which component parts of the job take the longest time?
A simple CPU utilisation per process per second would probably suffice. How can we obtain it?
There are many solutions to your question :
munin is a great monitoring tool that can scan almost everything in your system and make nice graph about it :). It is very easy to install and use it.
atop could be a simple solution, it can scan cpu, memory, disk regulary and you can store all those informations into files (the -W option), then you'll have to anaylze those file to detect the bottleneck.
sar, that can scan more than everything on your system, but a little more hard to interpret (you'll have to make the graph yourself with RRDtool for example)

Resources