Get numeric value from file name - linux

I am new to Linux and I have a question:
I have a bunch of files in a directory, like:
abc-188_1.out
abc-188_2.out
abc-188_3.out
How can I get the number 188 from those names?

Assuming (since you are on Linux and are working with files) that you will use a shell/bash script... (If you use something different, say Python, the solution will of course be different.)
... this will work
for file in *; do out=$(echo "${file//[!0-9]/ }" | xargs | cut -d' ' -f1); echo "$out"; done
Explanation
The basic problem is to extract a number from a string in bash script (search stackoverflow for this, you will find dozens of different solutions).
This is done in the command above as (the string from which numbers are to be extracted being saved in the variable file):
${file//[!0-9]/ }
or, with the non-digit characters deleted outright rather than replaced by spaces:
${file//[!0-9]/}
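For example, with one of the sample names (a minimal demonstration; the comments show what each expansion produces):
file="abc-188_1.out"
echo "${file//[!0-9]/ }"   # "    188 1    " -- each non-digit becomes a space
echo "${file//[!0-9]/}"    # "1881" -- the digit runs merge together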
It is complicated here by two things:
First: do this for every file in a directory. This is done here with a bash for loop (note that the variable file takes as its value the name of each file in the current working directory, one after another):
for file in *; do (commands you want done for every file in the CWD, separated by ";"); done
Second: there are multiple numbers in the filenames, and you just want the first one.
Therefore, we leave the spaces in and pipe the result (only the numbers and spaces from the current file name) into two other commands: xargs (removes leading and trailing whitespace) and cut -d' ' -f1 (returns only the part of the string before the first remaining space, i.e. the first number in our filename).
We save the resulting string in a variable out and print it with echo "$out":
out=$(echo "${file//[!0-9]/ }" | xargs | cut -d' ' -f1); echo "$out"
Note that the number is still a string. You can convert it to an integer if you want by using arithmetic expansion, i.e. double parentheses preceded by $: out_int=$((out))
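Run in a directory containing just the three sample files, the loop should print (a quick demonstration of the expected output):
$ for file in *; do out=$(echo "${file//[!0-9]/ }" | xargs | cut -d' ' -f1); echo "$out"; done
188
188
188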

Related

Grep for specific numbers within a text file and output per number text file

I have a text file chunk_names.txt that looks like this:
chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567
This is an example but all chromosomes are represented (1...22, X and Y). All entries follow the same format: chr{1..22, X or Y}_*string of numbers*_*string of numbers*.
I would like to split these into per chromosome files e.g. all of the chunks starting chr10 to be put into a file called chr10.txt:
In Linux I have tried:
for i in {1..22}
do
grep chr$i chunk_names.txt > chr$i.txt
done
However, the chr1.txt output file now contains all the chromosome chunks with 1 in them (1,10,11,12, etc).
How would I modify this script to separate out the chromosomes?
I also haven't tackled how to include chromosome X or Y within the same script and am currently running that separately.
Things I have tried:
grep -o gives me just "chr$i" as an output
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem
Many thanks for your time.
Your 'for' loop will mean parsing your file N times (where N is the number of chromosomes/contigs in your list). Here's an approach using awk that parses the file just once:
awk -F '_' '{ print > ($1 ".txt") }' chunk_names.txt
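If your awk enforces a small limit on the number of simultaneously open files (mawk, for example), a hedged variant that appends and then closes each output file avoids "too many open files" errors:
awk -F '_' '{ print >> ($1 ".txt"); close($1 ".txt") }' chunk_names.txt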
If you include the _ following the number, you can distinguish between chr1_ and e.g. chr10_. To include X and Y, simply include them in the loop:
for i in {1..22} X Y
do
grep "chr${i}_" chunk_names.txt > chr$i.txt
done
To search at the beginning of the line only you can add a leading ^ to the pattern
grep "^chr${i}_" chunk_names.txt > chr$i.txt
Explanation about your attempts:
grep chr$i searches for the pattern anywhere in the line. The shell replaces $i with the value of the variable i, so you get chr1, chr2 etc.
If you enclose the pattern in double quotes as grep "chr$i" the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.
If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i (instead of chr1 etc.) which does not occur in your file.
Explanation about quotes:
The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern would contain spaces or characters that are special to the shell, the quoting will make a difference.
Example:
If your file contained chr1* instead of chr1_, the pattern chr${i}* would be replaced by the list of matching files.
If you have already created your output files chr1.txt etc., try these commands:
$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*
In the first case, the grep command
grep chr${i}* chunk_names.txt
would be expanded as
grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt
which would search for the pattern chr10.txt in files chr11.txt ... chr1.txt and chunk_names.txt.

Is there an option to "ls" that limits filename characters?

Syntax question: if I have a number of subdirectories within a target dir and I want to output the names of the subs to a text file, I can easily run:
ls > filelist.txt
on the target. But say all of my subs are named with a 7 character prefix like:
JR-5426_mydir
JR-5487_mydir2
JR-5517_mydir3
...
and I just want the prefixes. Is there an option to "ls" that will only output n characters per line?
Don't use ls in any programmatic context; it should be used strictly for presentation to humans -- ParsingLs gives details on why.
On bash 4.0 or later, the below will provide a deduplicated list of filename prefixes:
declare -A prefixes_seen=( ) # create an associative array -- aka "hash" or "map"
for file in *; do # iterate over all non-hidden directory entries
prefixes_seen[${file:0:2}]=1 # add the first two chars of each as a key in the map
done
printf '%s\n' "${!prefixes_seen[@]}" # print all keys in the map separated by newlines
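With the three sample directory names above, all of them share the first two characters, so this would print the single line JR.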
That said, if instead of wanting a 2-character prefix you want everything before the first -, you can write something cleaner:
declare -A prefixes_seen=( )
for file in *-*; do
prefixes_seen[${file%%-*}]=1 # "${file%%-*}" cuts off "$file" at the first dash
done
printf '%s\n' "${!prefixes_seen[@]}"
...and if you don't care about deduplication:
for file in *-*; do
printf '%s\n' "${file%%-*}"
done
...or, sticking with the two-character rule:
for file in *; do
printf '%s\n' "${file:0:2}"
done
That said -- if you're trying to Do It Right, you shouldn't be using newlines to separate lists of filenames either, because newlines are valid inside filenames on POSIX filesystems. Think about a file named f$'\n'oobar -- that is, with a literal newline as its second character; code written carelessly would see f as one prefix and oo as a second one, from this single name. Building up the prefixes as keys of an associative array, as the deduplicating answers above do, is safer here because it doesn't rely on any delimiter character.
To demonstrate the difference -- if instead of writing
printf '%s\n' "${!prefixes_seen[@]}"
you wrote
printf '%q\n' "${!prefixes_seen[@]}"
it would emit the prefix of the hypothetical file f$'\n'oobar as
$'f\n'
instead of
f
...with an extra newline below it.
If you want to pass lists of filenames (or, as here, filename prefixes) between programs, the safe way to do it is to NUL-delimit the elements -- as NULs are the single character which can't possibly exist in a valid UNIX path. (A filename also can't contain /, but a path obviously can).
A NUL-delimited list can be written like so:
printf '%s\0' "${!prefixes_seen[@]}"
...and read back into an identical data structure on the receiving end (should the receiving code be written in bash) like so:
declare -A prefixes_seen=( )
while IFS= read -r -d '' prefix; do
prefixes_seen[$prefix]=1
done
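For example, a minimal sketch connecting the two ends with a pipe (this assumes GNU sort, whose -z flag handles NUL-delimited records):
printf '%s\0' "${!prefixes_seen[@]}" | sort -z |
while IFS= read -r -d '' prefix; do
printf 'saw prefix: %q\n' "$prefix"
done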
No; instead, pipe the output to the cut command:
ls | cut -c1-7
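With the three sample directory names above, that prints exactly the prefixes you asked for:
$ ls | cut -c1-7
JR-5426
JR-5487
JR-5517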

How do I insert the results of several commands on a file as part of my sed stream?

I use DJing software on Linux (xwax) which uses a 'scanning' script (visible here) that compiles all the music files available to the software and outputs a string containing the path to the file and then the title of the mp3. For example, if it scans path-to-mp3/Artist - Test.mp3, it will spit out a string like so:
path-to-mp3/Artist - Test.mp3[tab]Artist - Test
I have tagged all my mp3s with BPM information via the id3v2 tool and have a commandline method for extracting that information as follows:
id3v2 -l name-of-mp3.mp3 | grep TBPM | cut -d: -f2
That spits out JUST the numerical BPM to me. What I'd like to do is prepend the BPM number from the above command as part of the xwax scanning script, but I'm not sure how to insert that command in the midst of the script. What I'd want it to generate is:
path-to-mp3/Artist - Test.mp3[tab][bpm]Artist - Test
Any ideas?
It's not clear to me where in that script you want to insert the BPM number, but the idea is this:
To embed the output of one command into the arguments of another, you can use the "command substitution" notation `...` or $(...). For example, this:
rm $(echo abcd)
runs the command echo abcd and substitutes its output (abcd) into the overall command; so that's equivalent to just rm abcd. It will remove the file named abcd.
The above doesn't work inside single-quotes. If you want, you can just put it outside quotes, as I did in the above example; but it's generally safer to put it inside double-quotes (so as to prevent some unwanted postprocessing). Either of these:
rm "$(echo abcd)"
rm "a$(echo bc)d"
will remove the file named abcd.
In your case, you need to embed the command substitution into the middle of an argument that's mostly single-quoted. You can do that by simply putting the single-quoted strings and double-quoted strings right next to each other with no space in between, so that Bash will combine them into a single argument. (This also works with unquoted strings.) For example, either of these:
rm a"$(echo bc)"d
rm 'a'"$(echo bc)"'d'
will remove the file named abcd.
Edited to add: O.K., I think I understand what you're trying to do. You have a command that either (1) outputs all the files in a specified directory (and any subdirectories and so on), one per line, or (2) outputs the contents of a file, where that file contains a list of files, one per line. So in either case, it's outputting a list of files, one per line. And you're piping that list into this command:
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\t\2:pi
}
'
which runs a sed script over that list. What you want is for all of the replacement-strings to change from \0\t... to \0\tBPM\t..., where BPM is the BPM number computed from your command. Right? And you need to compute that BPM number separately for each file, so instead of relying on sed's implicit line-by-line looping, you need to handle the looping yourself and process one line at a time. Right?
So, you should change the above command to this:
while read -r LINE ; do # loop over the lines, saving each one as "$LINE"
BPM=$(id3v2 -l "$LINE" | grep TBPM | cut -d: -f2) # save BPM as "$BPM"
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\t\2:pi
}
' <<<"$LINE" # take $LINE as input, rather than reading more lines
done
(where the only change to the sed script itself was to insert '"$BPM"'\t in a few places to switch from single-quoting to double-quoting, then insert the BPM, then switch back to single-quoting and add a tab).
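(As before, the while loop reads the list of files from standard input, so you pipe the same list into it that you previously piped into sed.)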

Extracting sub-strings in Unix

I'm using cygwin on Windows 7. I want to loop through a folder consisting of about 10,000 files and perform a signal processing tool's operation on each file. The problem is that the files names have some excess characters that are not compatible with the operation. Hence, I need to extract just a certain part of the file names.
For example if the file name is abc123456_justlike.txt.rna I need to use abc123456_justlike.txt. How should I write a loop to go through each file and perform the operation on the shortened file names?
I tried the cut -b1-10 command but that doesn't let my tool perform the necessary operation. I'd appreciate help with this problem.
Try some shell scripting, using the ${NAME%TAIL} parameter substitution: the contents of variable NAME are expanded, but any suffix material which matches the TAIL glob pattern is chopped off.
$ NAME=abc12345.txt.rna
$ echo ${NAME%.rna}
abc12345.txt
# process all files in the directory, taking off their .rna suffix
$ for x in *; do signal_processing_tool "${x%.rna}"; done
If there are variations among the file names, you can classify them with a case:
for x in * ; do
case $x in
*.rna )
# do something with .rna files
;;
*.txt )
# do something else with .txt files
;;
* )
# default catch-all-else case
;;
esac
done
Try sed:
$ echo a.b.c | sed 's/\.[^.]*$//'
a.b
The s command in sed performs a search-and-replace operation, in this case it replaces the regular expression \.[^.]*$ (meaning: a dot, followed by any number of non-dots, at the end of the string) with the empty string.
If you are not yet familiar with regular expressions, this is a good time to learn them. I find manipulating strings using regular expressions much more straightforward than using tools like cut (or their equivalents).
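Applied to your case, a minimal sketch (signal_processing_tool is a stand-in name for whatever operation you run; the sed call strips the final extension, whatever it is):
for x in *.rna; do
signal_processing_tool "$(echo "$x" | sed 's/\.[^.]*$//')"
done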
If you are trying to extract the list of filenames from a directory, you can use the command below.
ls -ltr | awk -F " " '{print $9}' | cut -c1-10

Perl line runs 30 times quicker with single quotes than with double quotes

We have a task to change some strings in binary files to lowercase (from mixed/upper case). The relevant strings are references to other files (it's part of an upgrade in which we are also moving from Windows to Linux as the server environment, so case suddenly matters). We have written a script which uses a Perl loop to do this. We have a directory containing around 300 files (the total size of the directory is around 150M), so it's a fair amount of data but not huge.
The following version takes about 6 minutes to do the job:
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe "s{(?i)$file_ref}{$file_ref}g" $forms6_convert_dir/*
done
while the following version -- identical except that the Perl substitution is single-quoted -- takes over 3 hours!
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe 's{(?i)$file_ref}{$file_ref}g' $forms6_convert_dir/*
done
Can anyone explain why? Is it that $file_ref is left as the literal string $file_ref in the single-quoted version instead of being substituted with its value? In that case, what is being replaced? What we want is to replace all occurrences of any filename with itself in lowercase. If we run strings on the files before and after and search for the filenames, both versions appear to have made the same changes. However, if we run diff on the files produced by the two loops (diff firstloop/file1 secondloop/file1), it reports that they differ.
This is running from within a bash script on linux.
The shell doesn't do variable substitution for single quoted strings. So, the second one is a different program.
As the other answers said, the shell doesn't substitute variables inside single quotes, so the second version is executing the literal Perl statement s{(?i)$file_ref}{$file_ref}g for every line in every file.
As you said in a comment, if $ is the end-of-line metacharacter, $file_ref could never match anything. $ matches before the newline at end-of-line, so the next character would have to be a newline. Therefore, Perl doesn't interpret $ as the metacharacter; it interprets it as the beginning of a variable interpolation.
In Perl, the variable $file_ref is undef, which is treated as the empty string when interpolated. So you're really executing s{(?i)}{}g, which says to replace the empty string with the empty string, and do that for all occurrences in a case-insensitive manner. Well, there's an empty string between every pair of characters, plus one at the beginning and end of each line. Perl is finding each one and replacing it with the empty string. This is a no-op, but it's an expensive one, hence the 3-hour run time.
You must be mistaken about both versions making the same changes. As I just explained, the single-quoted version is just an expensive no-op; it doesn't make any changes at all to the file contents (it just makes a fresh copy of each file). The files you ran it on must have already been converted to lower case.
With double quotes you are using the shell variable, with single quotes Perl is trying to use a variable of that name.
You might wish to consider writing the whole lot in either Perl or Bash to speed things up. Both languages can read files and do pattern matching. In Perl you can change to lower-case using the lc built-in function, and in Bash 4 you can use ${file,,}.
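For example, a minimal sketch of the Bash 4 lowercasing (the sample string is made up):
file_ref="ABC-188_Mixed"
echo "${file_ref,,}"   # prints: abc-188_mixed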

Resources