Passing a value to multiple commands in xargs - aws-cli

I'm trying to execute multiple commands within xargs. The piped value '%' is passed only to the first sub-command inside xargs, not to the second. I confirmed this by swapping the order of the commands: whichever command comes second never receives the value for '%'.
Command-1
aws ec2 describe-instances --query 'Reservations[].Instances[?(LaunchTime>=`2015-01-01` && LaunchTime<=`2015-02-28`)][].{id: InstanceId, launched: LaunchTime}' | jq --raw-output '.[] | .id' | xargs -n 1 -I % sh -c 'aws cloudwatch get-metric-statistics --metric-name NetworkPacketsIn --start-time 2018-01-01T00:00:00Z --end-time 2018-02-28T23:59:59Z --period 2592000 --namespace AWS/EC2 --statistics Maximum --dimensions Name=InstanceId,Value=%; echo instance: %;'
Output:
{
"Label": "NetworkPacketsIn",
"Datapoints": []
}
instance: %
{
"Label": "NetworkPacketsIn",
"Datapoints": []
}
instance: %
Command-2
aws ec2 describe-instances --query 'Reservations[].Instances[?(LaunchTime>=`2015-01-01` && LaunchTime<=`2015-02-28`)][].{id: InstanceId, launched: LaunchTime}' | jq --raw-output '.[] | .id' | xargs -n 1 -I % sh -c 'echo instance: %; aws cloudwatch get-metric-statistics --metric-name NetworkPacketsIn --start-time 2018-01-01T00:00:00Z --end-time 2018-02-28T23:59:59Z --period 86400 --namespace AWS/EC2 --statistics Maximum --dimensions Name=InstanceId,Value=%;'
Output
instance: i-3e4fab33
{
"Label": "NetworkPacketsIn",
"Datapoints": []
}
instance: i-c2abbac8
{
"Label": "NetworkPacketsIn",
"Datapoints": []
}

TL;DR: on a Mac, xargs arguments cannot grow beyond 255 bytes after replacement is done.
Shortening your argument and leaving the semicolon off the last command fixed the error:
aws ec2 describe-instances --query 'Reservations[].Instances[?(LaunchTime>=`2015-01-01` && LaunchTime<=`2015-02-28`)][].{id: InstanceId, launched: LaunchTime}' | jq --raw-output '.[] | .id' | xargs -I % sh -c 'aws cloudwatch get-metric-statistics --metric-name NetworkPacketsIn --start-time 2018-01-01T00:00:00Z --end-time 2018-02-28T23:59:59Z --period 86400 --namespace AWS/EC2 --statistics Maximum --dimensions Name=InstanceId,Value=%;echo id=%'
Here's the longer answer along with some tests to prove it.
From the xargs man page:
-I replstr
Execute utility for each input line, replacing one or more occurrences of replstr in up to replacements (or 5 if no -R flag is specified) arguments to utility with the entire line of input. The resulting arguments, after replacement is done, will not be allowed to grow beyond 255 bytes; this is implemented by concatenating as much of the argument containing replstr as possible, to the constructed arguments to utility, up to 255 bytes. The 255 byte limit does not apply to arguments to utility which do not contain replstr, and furthermore, no replacement will be done on utility itself. Implies -x.
The argument the OP is passing to xargs
'aws cloudwatch get-metric-statistics --metric-name NetworkPacketsIn --start-time 2018-01-01T00:00:00Z --end-time 2018-02-28T23:59:59Z --period 2592000 --namespace AWS/EC2 --statistics Maximum --dimensions Name=InstanceId,Value=%; echo instance: %;'
is 250 bytes. When each % is replaced by the instance ID, it grows past the 255-byte limit and xargs leaves the % unreplaced.
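A rough way to check an argument against the limit before running it, as a sketch only: count the bytes in the argument and the number of % placeholders, then add the extra bytes each replacement contributes. The argument and instance ID below are short illustrative values (assumptions), not the OP's full command, and the 255-byte figure is the one quoted from the BSD/macOS man page.

```shell
# Illustrative values (assumptions): a short argument with one placeholder.
arg='export blah=%; echo $blah;'
id='i-3e4fab33'

# $(( ... )) also strips the padding some wc implementations print.
arg_bytes=$(( $(printf '%s' "$arg" | wc -c) ))
id_bytes=$(( $(printf '%s' "$id" | wc -c) ))
# tr -cd deletes everything except %, so wc -c counts the placeholders.
placeholders=$(( $(printf '%s' "$arg" | tr -cd '%' | wc -c) ))

# Each replacement swaps a 1-byte % for the whole ID.
expanded=$(( arg_bytes + placeholders * (id_bytes - 1) ))

if [ "$expanded" -gt 255 ]; then
  echo "too long for -I: $expanded bytes"
else
  echo "fits: $expanded bytes"
fi
```

With these sample values the argument is 26 bytes and expands to 35, so it fits.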
If you want to test this yourself try the following, the argument has 254 bytes:
echo blah |xargs -I % sh -c 'export blah=%; echo $blah; echo $blah; echo $blah;\
echo $blah; echo $blah; echo $blah;echo $blah; echo $blah; echo $blah;echo $blah;\
echo $blah; echo $blah;echo $blah; echo $blah; echo $blah;echo $blah; echo $blah;\
echo $blah;echo $blah;echo $blah;'
This will pass the word blah to every echo statement correctly.
blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah
Add one more echo $blah; to the end, taking the byte total to 265 bytes and it blows up:
% % % % % % % % % % % % % % % % % % % %
To make a long post even longer, I passed the instance id to the describe-instances command with a --instance-ids switch and it worked as expected because the argument expansion was below the 255 limit.
aws ec2 describe-instances --query 'Reservations[].Instances[?(LaunchTime>=`2015-01-01` && LaunchTime<=`2015-02-28`)][].{id: InstanceId, launched: LaunchTime}' | jq --raw-output '.[] | .id' | xargs -I % sh -c 'echo instance: %; aws ec2 describe-instances --instance-ids=%; '
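Another way around the limit, sketched here on the assumption that -I is not otherwise needed: pass each input line to sh -c as a positional parameter. No replstr substitution happens, so the 255-byte cap on replaced arguments is not involved. The function name per_instance and the echo stand-ins for the aws call are hypothetical.

```shell
# Hypothetical helper: reads instance IDs on stdin, runs two commands per ID.
per_instance() {
  # The trailing _ fills $0, so each input item arrives as $1.
  xargs -n 1 sh -c '
    echo "instance: $1"
    echo "dimensions: Name=InstanceId,Value=$1"
  ' _
}

# Sample IDs taken from the output shown in the question.
printf '%s\n' i-3e4fab33 i-c2abbac8 | per_instance
```

In the real pipeline the second echo would be replaced by the aws cloudwatch call, with "$1" wherever % was used.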

Related

How to find gc content of a fasta file using bash script?

I am learning bioinformatics.
I want to find GC content from a fasta file using Bash script.
GC content is basically (number of (g + c)) / (number of (a + t + g + c)).
I am trying to use wc command. But I was not able to get an answer.
Edit 17th Feb 2023.
After going through documentation and videos, I came up with a solution.
for f in "$@"   # Loop over all filenames given as parameters
do
    echo " $f is being processed..."
    # Drop header lines (those starting with ">"), print each g or c match (case-insensitive) on its own line, and count the lines.
    gc=$(grep -v '^>' "$f" | grep -io '[gc]' | wc -l)
    # Remove all whitespace (spaces, tabs, carriage returns, newlines) from the sequence lines and count the remaining characters.
    total=$(grep -v '^>' "$f" | tr -d '[:space:]' | wc -c)
    # bc does the floating-point division; scale=2 keeps two decimal places.
    percent=$(echo "scale=2; 100*$gc/$total" | bc -l)
    echo " The GC content of $f is: $percent%"
    echo
done
Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, use EMBOSS infoseq utility. EMBOSS can be easily installed, for example using conda.
Example:
Install EMBOSS package (do once):
conda create --name emboss emboss --channel iuc
Activate the conda environment and use EMBOSS infoseq, here to print the sequence name, length and percent GC:
source activate emboss
cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
source deactivate
This prints into STDOUT something like this:
Name Length %GC
seq_foo 119 60.50
seq_bar 104 39.42
seq_baz 191 46.60
...
This should work:
#!/usr/bin/env sh
# Adapted from https://www.biostars.org/p/17680
# Fail on error
set -o errexit
# Disable undefined variable reference
set -o nounset
# ================
# CONFIGURATION
# ================
# Fasta file path
FASTA_FILE="file.fasta"
# Number of digits after decimal point
N_DIGITS=3
# ================
# LOGGER
# ================
# Fatal log message
fatal() {
printf '[FATAL] %s\n' "$@" >&2
exit 1
}
# Info log message
info() {
printf '[INFO ] %s\n' "$@"
}
# ================
# MAIN
# ================
{
# Check command 'bc' exist
command -v bc > /dev/null 2>&1 || fatal "Command 'bc' not found"
# Check file exist
[ -f "$FASTA_FILE" ] || fatal "File '$FASTA_FILE' not found"
# Count number of sequences
_n_sequences=$(grep --count '^>' "$FASTA_FILE")
info "Analyzing $_n_sequences sequences"
[ "$_n_sequences" -ne 0 ] || fatal "No sequences found"
# Remove sequence wrapping
_fasta_file_content=$(
sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
| tr --delete "\r\n" \
| sed 's/$/#/' \
| tr "#" "\n" \
| sed '/^$/d'
)
# Vars
_sequence=
_a_count_total=0
_c_count_total=0
_g_count_total=0
_t_count_total=0
# Read line by line
while IFS= read -r _line; do
# Check if header
if printf '%s\n' "$_line" | grep --quiet '^>'; then
# Save sequence and continue
_sequence=${_line#?}
continue
fi
# Count
_a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
_c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
_g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
_t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)
# Add current count to total
_a_count_total=$((_a_count_total + _a_count))
_c_count_total=$((_c_count_total + _c_count))
_g_count_total=$((_g_count_total + _g_count))
_t_count_total=$((_t_count_total + _t_count))
# Calculate GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "Sequence '$_sequence' GC content: $_gc"
done << EOF
$_fasta_file_content
EOF
# Total data
info "Adenine total count: $_a_count_total"
info "Cytosine total count: $_c_count_total"
info "Guanine total count: $_g_count_total"
info "Thymine total count: $_t_count_total"
# Calculate total GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "GC content: $_gc"
}
The "Count number of sequences" and "Remove sequence wrapping" codes are adapted from https://www.biostars.org/p/17680
The script uses only basic commands except for bc to do the precision calculation (See bc installation).
You can configure the script by modifying the variables in the CONFIGURATION section.
Because you haven't indicated which one you want, the script calculates the GC content both for each sequence and overall. Remove whichever part you don't need :)
I have no bioinformatics background, but the script successfully parses and analyzes a FASTA file.
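For comparison, here is a more compact single-pass sketch of the same formula, (g + c) / (a + t + g + c), using awk. The function name gc_content is hypothetical. It assumes a plain nucleotide FASTA file and ignores any character other than A, C, G and T, such as N or a carriage return.

```shell
# Hypothetical helper: print the overall GC percentage of one FASTA file.
gc_content() {
  awk '!/^>/ {
         seq = toupper($0)
         gc += gsub(/[GC]/, "", seq)   # gsub returns the number of matches
         at += gsub(/[AT]/, "", seq)
       }
       END { if (gc + at > 0) printf "%.2f\n", 100 * gc / (gc + at) }' "$1"
}

# Small sample file: 4 G/C bases and 4 A/T bases.
sample=$(mktemp)
printf '%s\n' '>seq_foo' 'GGCC' 'AATT' > "$sample"
gc_content "$sample"
rm -f "$sample"
```

For the sample file this prints 50.00.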

why does echo -n "100" | wc -c output 3?

I was playing around with a few Linux commands and found that echo -n "100" | wc -c outputs 3. I knew that 100 could be stored in a single byte as 1100100, so I could not understand why this happened. I guess it is because of some terminal encoding, is it? I also found that if I touch test.txt, run echo -n "100" > test.txt, and then execute wc ./test.txt -c, I get the same output. Here my guess is to blame the file encoding. Am I right?
100 is three characters long, hence wc giving you 3. If you left out the -n to echo it'd show 4, because echo would be printing out a newline too in that case.
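A small sketch that puts the three cases side by side: the text "100", the same text with a newline, and the single byte whose value is 100. It uses printf rather than echo -n because printf behaves the same across shells. The variable names are only illustrative.

```shell
# $(( ... )) strips the padding some wc implementations print.
text_bytes=$(( $(printf '100' | wc -c) ))     # three ASCII digits
line_bytes=$(( $(printf '100\n' | wc -c) ))   # digits plus the newline
raw_bytes=$(( $(printf '\144' | wc -c) ))     # octal 144 = decimal 100 = "d"

echo "$text_bytes $line_bytes $raw_bytes"
```

This prints 3 4 1.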
When you echo -n 100, you are showing a string with 3 characters.
When you want to show a character with ascii value 100, use
echo -n "d"
# Check
echo -n "d" | xxd -b
I found value "d" with man ascii. When you don't want to use the man page, use
printf "\\$(printf "%o" 100)"
# Check
printf "\\$(printf "%o" 100)" | xxd -b
# wc returns 1 here
printf "\\$(printf "%o" 100)" | wc -c
This is expected: wc -c counts bytes, and echo -n omits the trailing newline.
$ wc --help
...
-c, --bytes print the byte counts
-m, --chars print the character counts
...
$ man echo
...
-n do not output the trailing newline
...
$ echo -n 'abc' | wc -c
3
$ echo -n 'абс' | wc -c # russian symbols
6

Create Automatic EBS snapshot

I want to set up automatic EBS snapshots at a fixed interval (say, once a week). After some searching I found that this can be done with a shell script, and I found one at this link.
Here is whole script, I am using :
#!/bin/bash
# Volume list file will have volume-id:Volume-name format
VOLUMES_LIST = /var/log/volumes-list
SNAPSHOT_INFO = /var/log/snapshot_info
DATE = `date +%Y-%m-%d`
REGION = "ap-south-1a"
# Snapshots Retention Period for each volume snapshot
RETENTION=6
SNAP_CREATION = /var/log/snap_creation
SNAP_DELETION = /var/log/snap_deletion
EMAIL_LIST = shishupal.shakya#itsmysun.com
echo "List of Snapshots Creation Status" > $SNAP_CREATION
echo "List of Snapshots Deletion Status" > $SNAP_DELETION
# Check whether the volumes list file is available or not?
if [ -f $VOLUMES_LIST ]; then
# Creating Snapshot for each volume using for loop
for VOL_INFO in `cat $VOLUMES_LIST`
do
# Getting the Volume ID and Volume Name into the Separate Variables.
VOL_ID = `echo $VOL_INFO | awk -F":" '{print $1}'`
VOL_NAME = `echo $VOL_INFO | awk -F":" '{print $2}'`
# Creating the Snapshot of the Volumes with Proper Description.
DESCRIPTION = "${VOL_NAME}_${DATE}"
/usr/local/bin/aws ec2 create-snapshot --volume-id $VOL_ID --description "$DESCRIPTION" --region $REGION &>> $SNAP_CREATION
done
else
echo "Volumes list file is not available : $VOLUMES_LIST Exiting." | mail -s "Snapshots Creation Status" $EMAIL_LIST
exit 1
fi
echo >> $SNAP_CREATION
echo >> $SNAP_CREATION
# Deleting the Snapshots which are 10 days old.
for VOL_INFO in `cat $VOLUMES_LIST`
do
# Getting the Volume ID and Volume Name into the Separate Variables.
VOL_ID = `echo $VOL_INFO | awk -F":" '{print $1}'`
VOL_NAME = `echo $VOL_INFO | awk -F":" '{print $2}'`
# Getting the Snapshot details of each volume.
/usr/local/bin/aws ec2 describe-snapshots --query Snapshots[*].[SnapshotId,VolumeId,Description,StartTime] --output text --filters "Name=status,Values=completed" "Name=volume-id,Values=$VOL_ID" | grep -v "CreateImage" > $SNAPSHOT_INFO
# Snapshots Retention Period Checking and if it crosses delete them.
while read SNAP_INFO
do
SNAP_ID=`echo $SNAP_INFO | awk '{print $1}'`
echo $SNAP_ID
SNAP_DATE=`echo $SNAP_INFO | awk '{print $4}' | awk -F"T" '{print $1}'`
echo $SNAP_DATE
# Getting the no.of days difference between a snapshot and present day.
RETENTION_DIFF = `echo $(($(($(date -d "$DATE" "+%s") - $(date -d "$SNAP_DATE" "+%s"))) / 86400))`
echo $RETENTION_DIFF
# Deleting the Snapshots which are older than the Retention Period
if [ $RETENTION -lt $RETENTION_DIFF ];
then
/usr/local/bin/aws ec2 delete-snapshot --snapshot-id $SNAP_ID --region $REGION --output text> /tmp/snap_del
echo DELETING $SNAP_INFO >> $SNAP_DELETION
fi
done < $SNAPSHOT_INFO
done
echo >> $SNAP_DELETION
# Merging the Snap Creation and Deletion Data
cat $SNAP_CREATION $SNAP_DELETION > /var/log/mail_report
# Sending the mail Update
cat /var/log/mail_report | mail -s "Volume Snapshots Status" $EMAIL_LIST
But when I run it in the terminal, it shows the following errors.
Since I am new to this kind of work, I am having trouble resolving them.
Please suggest a fix; I have been stuck on this for the last few days.
There should not be spaces around the equals (=) signs.
FOO = 1
-bash: FOO: command not found
The correct syntax is:
FOO=1
Go through the script and remove all the spaces in the statements that assign values to variables.
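As a sketch of what the corrected assignments look like, here are a few of the script's variables with the spaces removed. The VOL_INFO value is a hypothetical entry in the volume-id:Volume-name format. The parameter expansions are a swapped-in alternative to the script's echo | awk calls, not what the original script uses.

```shell
# Assignments with no spaces around "=".
RETENTION=6
DATE=$(date +%Y-%m-%d)

# Hypothetical volumes-list entry in volume-id:Volume-name format.
VOL_INFO="vol-0123456789abcdef0:data"
VOL_ID=${VOL_INFO%%:*}     # text before the first colon
VOL_NAME=${VOL_INFO#*:}    # text after the first colon
DESCRIPTION="${VOL_NAME}_${DATE}"

echo "$VOL_ID $DESCRIPTION"
```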
But there is another error "expecting do" -- this makes me think the script is not being run with the correct shell. Instead of 'sh ec2.sh', try running it with bash explicitly: bash ec2.sh
It is easier to do this through the AWS console.
In CloudWatch, create a new event rule. Set the event source to "scheduled". For the target, pick "EC2 CreateSnapshot API call". Enter the volume ID (you can find it in the EC2 instances console). Let AWS create a new role for this specific resource.
That's it!

How to get pipe string length?

This is a code that shows my all user names.
-q user | grep -A 0 -B 2 -e uid:\ 5'[0-9][0-9]' | grep ^name | cut -d " " -f2-
For example, the output is like...
usernameone
hello
whoami
Then I want to get the length of each user name.
Like this output...
11 //usernameone
5 //hello
6 //whoami
How can I get the length of each line of the pipeline output?
Given some command cmd that produces the list of users, you can do this pretty easily with xargs:
$ cat x
usernameone
hello
whoami
$ cat x | xargs -L 1 sh -c 'printf "%s //%s\n" "$(echo -n "$1" | wc -c)" "$1"' '{}'
11 //usernameone
5 //hello
6 //whoami
It may not be possible to do this with the pipe alone, so here's a one-liner that splits the output into lines and uses a while loop:
-q user | grep -A 0 -B 2 -e uid:\ 5'[0-9][0-9]' | grep ^name | cut -d " " -f2- | tr " " "\n" | while read -r user; do echo "${#user} //$user"; done
This should give you an output in the desired format. I used user as a file hence the cat
i=0;for token in $(cat user); do echo -n "${#token} //$token";echo;i=$((i+1));done;echo;
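Another option, sketched with awk: length($0) gives the length of each input line, so the whole job fits in one filter at the end of the pipe. The function name line_lengths is hypothetical, and the sample names stand in for the output of the real user-listing command.

```shell
# Hypothetical filter: prefix every input line with its length.
line_lengths() {
  awk '{ print length($0) " //" $0 }'
}

printf '%s\n' usernameone hello whoami | line_lengths
```

This prints 11 //usernameone, 5 //hello and 6 //whoami, one per line. Note that awk counts characters or bytes depending on the implementation and locale, which only matters for non-ASCII names.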

producing a bash command with awk giving runaway string constant - awk

I'm using awk to parse /etc/hosts and produce a command which will format MapR for me. It's being done in a bash utility in Chef:
egrep '^[0-9]' /etc/hosts | grep -v 127.0.0.1 \
| awk 'NR==1{ips=$1}
NR>1{ips=ips ", " $1}
$2=="namenode"{nn=$1}
END{ printf "/opt/mapr/server/configure.sh -C %s -Z %s -N mycluster --create-user -D /dev/xvdb\n", ips, nn}' \
| bash
sleep 60
The command above should execute the following command:
/opt/mapr/server/configure.sh -C 10.32.237.251 -Z 10.32.237.251 -N mycluster --create-user -D /dev/xvdb
However, looking into my chef output I see:
==> namenode: Executing awk utility
==> namenode: awk: line 1: runaway string constant "/opt/mapr/ ...
The command never got executed in the MapR node... However when i execute it directly on the terminal it works nicely in the way it's supposed to be. What am I doing wrong?
I'm updating the question to show the complete bash script that executes that utility:
DISK_CONFIG=/home/ubuntu/disk_config
if [ -f $DISK_CONFIG ];
then
echo "File already exists"
else
echo "Executing awk utility\n"
touch $DISK_CONFIG
egrep '^[0-9]' /etc/hosts | grep -v 127.0.0.1 \
| awk 'NR==1{ips=$1}
NR>1{ips=ips ", " $1}
$2=="namenode"{nn=$1}
END{ printf "/opt/mapr/server/configure.sh -C %s -Z %s -N mycluster --create-user -D /dev/xvdb\n", ips, nn}' \
| bash
sleep 60
fi
Assuming you're using HEREDOC syntax in your bash resource:
bash "whatever" do
code <<-EOH
DISK_CONFIG=/tmp/disk_config
if [ -f $DISK_CONFIG ];
then
echo "File already exists"
else
echo "Executing awk utility\n"
touch $DISK_CONFIG
egrep '^[0-9]' /etc/hosts | grep -v 127.0.0.1 \
| awk 'NR==1{ips=$1}
NR>1{ips=ips ", " $1}
$2=="namenode"{nn=$1}
END{ printf "/opt/mapr/server/configure.sh -C %s -Z %s -N mycluster --create-user -D /dev/xvdb\n", ips, nn}' \
| bash
fi
EOH
end
this one leads to your error:
Executing awk utility
awk: line 4: runaway string constant "/opt/mapr/ ...
This is due to the \n sequences in your code; the one inside the awk printf string is what triggers the awk error.
This resource should work (note that I replaced the DISK_CONFIG path for my tests):
bash "whatever" do
code <<-EOH
DISK_CONFIG=/tmp/disk_config
if [ -f $DISK_CONFIG ];
then
echo "File already exists"
else
echo "Executing awk utility"
touch $DISK_CONFIG
egrep '^[0-9]' /etc/hosts | grep -v 127.0.0.1 \
| awk 'NR==1{ips=$1}
NR>1{ips=ips ", " $1}
$2=="namenode"{nn=$1}
END{ printf "/opt/mapr/server/configure.sh -C %s -Z %s -N mycluster --create-user -D /dev/xvdb", ips, nn}' \
| bash
sleep 60
fi
EOH
end
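The reason is that Chef already interprets the \n in the code block, so awk sees a literal line break inside the string, which therefore never terminates on its line (a runaway string).
Since you pipe to bash, you can omit the \n: bash still runs the last line when its input ends.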
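Both points can be tried in a plain shell, outside Chef. This is a sketch: the echo command stands in for configure.sh, and the second half assumes that your awk rejects a string split across lines (the exact message differs between awk implementations).

```shell
# awk prints a command with no trailing newline; bash still runs it at end of input.
out=$(awk 'BEGIN { printf "echo %s %s", "10.32.237.251", "mycluster" }' | bash)
echo "$out"

# A literal line break inside an awk string, which is what the interpreted \n
# turns into, is expected to make awk fail.
nl='
'
broken=no
awk "BEGIN { printf \"hello${nl}\" }" >/dev/null 2>&1 || broken=yes
echo "awk rejected the split string: $broken"
```

The first echo prints 10.32.237.251 mycluster, showing that the missing newline does not stop bash from executing the generated command.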

Resources