Create a file with the sample, gene, and line count

I am trying to create a file called depths that has the name of the sample, the gene, and then the number of times that gene is in the sample. The below code is what I have currently, but the output just has the file names. Ex. file name=ERR034597.MTCYB.sam
I want the file to have ERR034597 MTCYB 327, for example.
for i in genes/${i}.sam
filename=$(basename $i)
n_rows=$(cat $i | wc -l)
echo $filename $n_rows > depths

Here
for i in genes/${i}.sam
you're accessing the variable i before it has been assigned, so genes/${i}.sam expands to genes/.sam. This can't work. What you probably want to do is
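for i in genes/*.sam; do
filename=$(basename "$i")
n_rows=$(wc -l < "$i") # read from stdin so wc prints only the count
echo "$filename $n_rows" >> depths # append; > would overwrite depths on every iteration
done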
One more note: it's good practice to avoid unnecessary calls to cat and to always quote variables that hold filenames.
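To illustrate why dropping cat still needs some care: when wc is given a file name it prints the name after the count, whereas when it reads standard input it prints only the count. A minimal sketch, using a throwaway file made up for the demonstration:

```shell
#!/bin/sh
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp)
printf 'read1\nread2\nread3\n' > "$tmp"

with_name=$(wc -l "$tmp")     # count followed by the file name
count_only=$(wc -l < "$tmp")  # count only (some systems pad it with spaces)

echo "with name:  $with_name"
echo "count only: $((count_only))"  # arithmetic expansion strips any padding

rm -f "$tmp"
```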

If I understand what you are attempting, then you need a few more steps to isolate the first part of the filename (e.g. ERR034597) and the gene (e.g. MTCYB) before writing the information to depths. You also need to consider whether you are replacing the contents of depths on each iteration (using >) or appending to depths (using >>).
Since your tag is [Linux], all we can presume is you have a POSIX shell and not an advanced shell like bash. To remove the .sam extension from filename and then separate into the first part and the gene before obtaining the line count, you can do something similar to the following:
#!/bin/sh
:> depths # truncate depths (optional - if required)
for i in genes/*.sam; do # loop over all .sam files
filename="$(basename "$i")" # remove path from name
filename="${filename%.sam}" # trim .sam extension from name
gene="${filename##*.}" # trim to last '.' save as gene
filename="${filename%.$gene}" # remove gene from end of name
n_rows=$(wc -l < "$i") # get number of lines in file
echo "$filename $gene $n_rows" >> depths # append values to depths
done
Which would result in depths containing lines similar to:
ERR034597 MTCYB 92
(where the test file contained 92 lines)
Look things over and let me know if you have further questions.
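A variation on the script above, offered as a sketch rather than a drop-in replacement: wrapping the loop in a function and redirecting its output once removes the need for both the truncate step and >>. The name write_depths and the sample data are hypothetical, and file names are assumed to have the form sample.gene.sam.

```shell
#!/bin/sh
# write_depths DIR: print "sample gene line_count" for every DIR/*.sam file.
# Hypothetical helper; assumes names of the form sample.gene.sam.
write_depths() {
    for i in "$1"/*.sam; do
        [ -e "$i" ] || continue          # the glob matched nothing
        name=${i##*/}                    # strip the directory
        name=${name%.sam}                # strip the extension
        gene=${name##*.}                 # text after the last dot
        sample=${name%".$gene"}          # text before the gene
        echo "$sample $gene $(( $(wc -l < "$i") ))"
    done
}

# Demonstration with made-up data in a scratch directory.
work=$(mktemp -d)
printf 'a\nb\nc\n' > "$work/ERR034597.MTCYB.sam"
write_depths "$work" > "$work/depths"    # one redirect for the whole loop
cat "$work/depths"                       # → ERR034597 MTCYB 3
rm -rf "$work"
```

In the original layout this would be called as write_depths genes > depths.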

Related

How to add sequential numbers say 1,2,3 etc. to each file name and also for each line of the file content in a directory?

I want to add sequential number for each file and its contents in a directory. The sequential number should be prefixed with the filename and for each line of its contents should have the same number prefixed. In this manner, the sequential numbers should be generated for all the files(for names and its contents) in the sub-folders of the directory.
I have tried using maxdepth, rename, print function as a part. but it throws error saying that "-maxdepth" - not a valid option.
I have already a part of code(to print the names and contents of text files in a directory) and this logic should be appended with it.
#!bin/bash
cd home/TESTING
for file in home/TESTING;
do
find home/TESTING/ -type f -name *.txt -exec basename {} ';' -exec cat {} \;
done
P.s - print, rename, maxdepth are not working
If the name of the first file is File1.txt and its contents is mentioned as "Louis" then the output for the filename should be 1File1.txt and the content should be as "1Louis".The same should be replaced with 2 for second file. In this manner, it has to traverse through all the subfolders in the directory and print accordingly. I have already a part of code and this logic should be appended with it.
A script that runs cd should have a fail-safe: if the cd fails and you do not check for it, the commands that follow run in the wrong directory.
In your attempt, the output would be the same even without the for loop, because for file in home/TESTING passes only the single word home/TESTING to for, so the body runs exactly once. In the case of
for file in home/TESTING/* the body would instead run once per directory entry.
I used find without -maxdepth, so it also looks into every subdirectory for *.txt files. If you want only the current directory, $(find /home/TESTING/* -type f -name "*.txt") could be replaced with $(ls *.txt); as long as you have no directory whose name ends in .txt, there will be no problem.
#!/bin/bash
# try cd to directory, do things upon success.
if cd /home/TESTING ;then
# set sequence number
let "x = 1"
# loop over every file that find matches; subdirectories are included because there is no -maxdepth.
# (the $(...) result is word-split, so this assumes file names contain no whitespace)
for file in $(find /home/TESTING/* -type f -name "*.txt") ; do
# print sequence number, and base file name, processed by variable substitution.
# basename can be used as well but this is bash built in.
echo "${x}${file##*/}"
# print file content, and put sequence number before each line with stream editor.
sed 's#^#'"${x}"'#g' "${file}"
# increase sequence number with one.
let "x++"
done
# unset sequence number
unset 'x'
else
# print error on stderr
echo 'cd to /home/TESTING directory failed' >&2
fi
Variable Substitution:
There are more; I picked only these four for now because they are similar.
${var#pattern} - Use value of var after removing text that match pattern from the left
${var##pattern} - Same as above but remove the longest matching piece instead the shortest
${var%pattern} - Use value of var after removing text that match pattern from the right
${var%%pattern} - Same as above but remove the longest matching piece instead the shortest
So ${file##*/} takes the value of $file and drops every character (*) up to and including the last slash (/); the doubled ## is what selects the longest match. The value of $file itself is not modified by this, so it still contains the path and filename.
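A short sketch of the four forms side by side; the path is a made-up example, and the comments show what each expansion leaves behind:

```shell
#!/bin/sh
file=/home/TESTING/sub.dir/File1.tar.txt   # hypothetical path

echo "${file#*/}"    # shortest match from the left:  home/TESTING/sub.dir/File1.tar.txt
echo "${file##*/}"   # longest match from the left:   File1.tar.txt
echo "${file%.*}"    # shortest match from the right: /home/TESTING/sub.dir/File1.tar
echo "${file%%.*}"   # longest match from the right:  /home/TESTING/sub
echo "$file"         # the variable itself is unchanged
```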
sed 's#^#'"${x}"'#g' ${file}: sed is a stream editor, and whole books have been written about its usage. The script is usually placed in single quotes, so 's#^#1#g' adds 1 to the beginning of every line in a file. s is the substitute command, ^ is the beginning of the line, 1 is the replacement text, and g means global: without the g, only the first match on each line is replaced.
# is the separator; another character, such as /, works as well. I close the single quote so that the variable can be expanded inside double quotes, then reopen the single quote.
If you would like to replace text, for example .txt with .php, you can use sed 's#\.txt#\.php#g' file. The . has a special meaning in the pattern: it matches any single character, so it needs to be escaped with \ to be treated as literal text. Otherwise not only file.txt would be matched but file1txt as well.
sed can also read from a pipe, in which case you do not need to specify a file name; otherwise you have to provide at least one file name, which in our case was the ${file} variable. As I mentioned, variable substitution does not modify the variable's value, so it still contains the filename with its path.
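The points above can be tried on piped input without touching any files. A minimal sketch, with made-up input lines:

```shell
#!/bin/sh
x=1

# Prefix every input line with the value of x; ^ anchors at the start of each line.
printf 'Louis\nMarie\n' | sed 's#^#'"$x"'#'
# → 1Louis
#   1Marie

# An unescaped dot matches any character; an escaped dot matches only a literal dot.
echo 'file1txt file.txt' | sed 's#.txt#.php#g'    # → file.php file.php
echo 'file1txt file.txt' | sed 's#\.txt#.php#g'   # → file1txt file.php
```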

Linux copy file and rename to substring of filename

My files got this structure:
mynewfile-runtime-tested-1102-19.4-alpha.zip
mysdk-sdk-tested-1102-19.4-alpha.zip
sources-tested-1102-19.4-alpha.zip
I am looking for a way to dynamically detect and drop a suffix such as tested-1102-19.4-alpha and copy the files under new names, so that they look like:
mynewfile-runtime.zip
mysdk-sdk.zip
sources.zip
The suffix should be detected dynamically by the delimiter ('-'): another batch of my files has a suffix like nottested-404-11.2.34-beta, and yet another has final-01-1-release. The only thing that remains constant is the '-' delimiter.
for file in *.zip; do
mv "$file" "${file%-*-*-*-*.zip}.zip"
done
This is fully portable POSIX shell, without forks to sed or other programs.
The ${file%pattern} bit says to remove the shortest matching string.
You can also remove the longest match with %%, or from the left with # and ##, respectively.
To only move the files that match the pattern you can do this:
#!/bin/sh
suffix='-*-*-*-*.zip'
for file in *$suffix
do
trimmed=${file%$suffix}
echo mv "$file" "$trimmed".zip
done
Remove the echo when you are confident with the result.
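Since the question asks for copies rather than renames, the same expansion can be combined with cp. This is a sketch under the assumption that every suffix consists of exactly four '-'-separated fields, as in the examples; copy_trimmed is a hypothetical helper name.

```shell
#!/bin/sh
# copy_trimmed DIR: for each DIR/*-*-*-*-*.zip, copy it to the same name
# with the last four '-'-separated fields removed. Hypothetical helper.
copy_trimmed() {
    for file in "$1"/*-*-*-*-*.zip; do
        [ -e "$file" ] || continue       # the glob matched nothing
        cp -- "$file" "${file%-*-*-*-*.zip}.zip"
    done
}

# Demonstration with empty made-up archives in a scratch directory.
work=$(mktemp -d)
touch "$work/mysdk-sdk-tested-1102-19.4-alpha.zip" \
      "$work/sources-final-01-1-release.zip"
copy_trimmed "$work"
ls "$work"   # lists the originals plus mysdk-sdk.zip and sources.zip
rm -rf "$work"
```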

Is there an option to "ls" that limits filename characters?

Syntax question: if I have a number of subdirectories within a target dir, and I want to output the names of the subs to a text file, I can easily run:
ls > filelist.txt
on the target. But say all of my subs are named with a 7 character prefix like:
JR-5426_mydir
JR-5487_mydir2
JR-5517_mydir3
...
and I just want the prefixes. Is there an option to "ls" that will only output n characters per line?
Don't use ls in any programmatic context; it should be used strictly for presentation to humans -- ParsingLs gives details on why.
On bash 4.0 or later, the below will provide a deduplicated list of filename prefixes:
declare -A prefixes_seen=( ) # create an associative array -- aka "hash" or "map"
for file in *; do # iterate over all non-hidden directory entries
prefixes_seen[${file:0:2}]=1 # add the first two chars of each as a key in the map
done
printf '%s\n' "${!prefixes_seen[@]}" # print all keys in the map separated by newlines
That said, if instead of wanting a 2-character prefix you want everything before the first -, you can write something cleaner:
declare -A prefixes_seen=( )
for file in *-*; do
prefixes_seen[${file%%-*}]=1 # "${file%%-*}" cuts off "$file" at the first dash
done
printf '%s\n' "${!prefixes_seen[@]}"
...and if you don't care about deduplication:
for file in *-*; do
printf '%s\n' "${file%%-*}"
done
...or, sticking with the two-character rule:
for file in *; do
printf '%s\n' "${file:0:2}"
done
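The question asks for seven-character prefixes, and the ${file:0:N} substring form is not part of POSIX sh. As a sketch for a plain POSIX shell, the prefix can be obtained by first computing everything after the first seven characters and then stripping that from the end; first7 is a hypothetical helper name.

```shell
#!/bin/sh
# first7 NAME: print the first seven characters of NAME (all of NAME if shorter).
first7() {
    case $1 in
        ???????*)
            rest=${1#???????}        # NAME minus its first seven characters
            prefix=${1%"$rest"}      # NAME minus that remainder
            ;;
        *)
            prefix=$1                # fewer than seven characters: keep as is
            ;;
    esac
    printf '%s\n' "$prefix"
}

first7 JR-5426_mydir    # → JR-5426
```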
That said -- if you're trying to Do It Right, you shouldn't be using newlines to separate lists of filename characters either, because newlines are valid inside filenames on POSIX filesystems. Think about a file named f$'\n'oobar -- that is, with a literal newline in the second character; code written carelessly would see f as one prefix and oo as a second one, from this single name. Iterating over associative-array prefixes, as done for the deduplicating answers, is safer in this case, because it doesn't rely on any delimiter character.
To demonstrate the difference -- if instead of writing
printf '%s\n' "${!prefixes_seen[@]}"
you wrote
printf '%q\n' "${!prefixes_seen[@]}"
it would emit the prefix of the hypothetical file f$'\n'oobar as
$'f\n'
instead of
f
...with an extra newline below it.
If you want to pass lists of filenames (or, as here, filename prefixes) between programs, the safe way to do it is to NUL-delimit the elements -- as NULs are the single character which can't possibly exist in a valid UNIX path. (A filename also can't contain /, but a path obviously can).
A NUL-delimited list can be written like so:
printf '%s\0' "${!prefixes_seen[@]}"
...and read back into an identical data structure on the receiving end (should the receiving code be written in bash) like so:
declare -A prefixes_seen=( )
while IFS= read -r -d '' prefix; do
prefixes_seen[$prefix]=1
done
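The difference between the two delimiters can be seen by counting items. In this sketch the two names are made up, one of them containing a newline, and tr and wc are used only to count delimiters:

```shell
#!/bin/sh
# Two hypothetical prefixes; the second contains a literal newline.
nl='
'
newline_items=$(printf '%s\n' "JR" "f${nl}oo" | wc -l)
nul_items=$(printf '%s\0' "JR" "f${nl}oo" | tr -cd '\0' | wc -c)

echo "newline-delimited: $((newline_items)) items"   # miscounts: 3
echo "NUL-delimited:     $((nul_items)) items"       # correct: 2
```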
No, you use the cut command:
ls | cut -c1-7

How do I insert the results of several commands on a file as part of my sed stream?

I use DJing software on linux (xwax) which uses a 'scanning' script (visible here) that compiles all the music files available to the software and outputs a string which contains a path to the filename and then the title of the mp3. For example, if it scans path-to-mp3/Artist - Test.mp3, it will spit out a string like so:
path-to-mp3/Artist - Test.mp3[tab]Artist - Test
I have tagged all my mp3s with BPM information via the id3v2 tool and have a commandline method for extracting that information as follows:
id3v2 -l name-of-mp3.mp3 | grep TBPM | cut -d: -f2
That spits out JUST the numerical BPM to me. What I'd like to do is prepend the BPM number from the above command as part of the xwax scanning script, but I'm not sure how to insert that command in the midst of the script. What I'd want it to generate is:
path-to-mp3/Artist - Test.mp3[tab][bpm]Artist - Test
Any ideas?
It's not clear to me where in that script you want to insert the BPM number, but the idea is this:
To embed the output of one command into the arguments of another, you can use the "command substitution" notation `...` or $(...). For example, this:
rm $(echo abcd)
runs the command echo abcd and substitutes its output (abcd) into the overall command; so that's equivalent to just rm abcd. It will remove the file named abcd.
The above doesn't work inside single-quotes. If you want, you can just put it outside quotes, as I did in the above example; but it's generally safer to put it inside double-quotes (so as to prevent some unwanted postprocessing). Either of these:
rm "$(echo abcd)"
rm "a$(echo bc)d"
will remove the file named abcd.
In your case, you need to embed the command substitution into the middle of an argument that's mostly single-quoted. You can do that by simply putting the single-quoted strings and double-quoted strings right next to each other with no space in between, so that Bash will combine them into a single argument. (This also works with unquoted strings.) For example, either of these:
rm a"$(echo bc)"d
rm 'a'"$(echo bc)"'d'
will remove the file named abcd.
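To see the effect of the quoting without deleting anything, one can count the arguments a command receives. A small sketch; count_args is a hypothetical helper that just prints its argument count:

```shell
#!/bin/sh
count_args() { echo "$#"; }

count_args $(echo "a b")          # unquoted: split into 2 arguments
count_args "$(echo "a b")"        # double-quoted: 1 argument
count_args 'x'"$(echo "a b")"'y'  # adjacent quoted pieces join into 1 argument
count_args '$(echo "a b")'        # single-quoted: 1 literal argument, nothing is run
```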
Edited to add: O.K., I think I understand what you're trying to do. You have a command that either (1) outputs all the files in a specified directory (and any subdirectories and so on), one per line, or (2) outputs the contents of a file, where the contents of that file is a list of files, one per line. So in either case, it's outputting a list of files, one per line. And you're piping that list into this command:
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\t\2:pi
}
'
which runs a sed script over that list. What you want is for all of the replacement-strings to change from \0\t... to \0\tBPM\t..., where BPM is the BPM number computed from your command. Right? And you need to compute that BPM number separately for each file, so instead of relying on sed's implicit line-by-line looping, you need to handle the looping yourself, and process one line at a time. Right?
So, you should change the above command to this:
while IFS= read -r LINE ; do # loop over the lines, saving each one as "$LINE"
BPM=$(id3v2 -l "$LINE" | grep TBPM | cut -d: -f2) # save BPM as "$BPM"
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\t\2:pi
}
' <<<"$LINE" # take $LINE as input, rather than reading more lines
done
(where the only change to the sed script itself was to insert '"$BPM"'\t in a few places to switch from single-quoting to double-quoting, then insert the BPM, then switch back to single-quoting and add a tab).

Extracting sub-strings in Unix

I'm using cygwin on Windows 7. I want to loop through a folder consisting of about 10,000 files and perform a signal processing tool's operation on each file. The problem is that the files names have some excess characters that are not compatible with the operation. Hence, I need to extract just a certain part of the file names.
For example if the file name is abc123456_justlike.txt.rna I need to use abc123456_justlike.txt. How should I write a loop to go through each file and perform the operation on the shortened file names?
I tried the cut -b1-10 command but that doesn't let my tool perform the necessary operation. I'd appreciate help with this problem.
Try some shell scripting, using the ${NAME%TAIL} parameter substitution: the contents of variable NAME are expanded, but any suffix material which matches the TAIL glob pattern is chopped off.
$ NAME=abc12345.txt.rna
$ echo "${NAME%.rna}" # strip the .rna suffix
abc12345.txt
# process all files in the directory, taking off their .rna suffix
$ for x in *; do signal_processing_tool "${x%.rna}" ; done
If there are variations among the file names, you can classify them with a case:
for x in * ; do
case $x in
*.rna )
# do something with .rna files
;;
*.txt )
# do something else with .txt files
;;
* )
# default catch-all-else case
;;
esac
done
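As a sketch of how the case branches and the suffix expansion fit together, here is a runnable stand-in; describe is a hypothetical function, and its echo lines take the place of the real processing:

```shell
#!/bin/sh
# describe NAME: hypothetical stand-in for the per-type processing.
describe() {
    case $1 in
        *.rna) echo "rna: ${1%.rna}" ;;   # strip the suffix before use
        *.txt) echo "txt: $1" ;;
        *)     echo "other: $1" ;;
    esac
}

describe abc123456_justlike.txt.rna   # → rna: abc123456_justlike.txt
describe notes.txt                    # → txt: notes.txt
describe README                       # → other: README
```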
Try sed:
echo a.b.c | sed 's/\.[^.]*$//'
The s command in sed performs a search-and-replace operation, in this case it replaces the regular expression \.[^.]*$ (meaning: a dot, followed by any number of non-dots, at the end of the string) with the empty string.
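For comparison, a sketch that applies this sed expression to the file name from the question, next to the shell expansion ${name%.*}, which removes the same final extension without starting another process:

```shell
#!/bin/sh
name=abc123456_justlike.txt.rna   # example name from the question

via_sed=$(echo "$name" | sed 's/\.[^.]*$//')   # regex: last dot and what follows
via_shell=${name%.*}                           # glob: shortest suffix matching .*

echo "$via_sed"     # → abc123456_justlike.txt
echo "$via_shell"   # → abc123456_justlike.txt
```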
If you are not yet familiar with regular expressions, this is a good point to learn them. I find manipulating string using regular expressions much more straightforward than using tools like cut (or their equivalents).
If you are trying to extract the list of filenames from a directory use the below command.
ls -ltr | awk -F " " '{print $9}' | cut -c1-10

Resources