How to extract a substring from a text file in bash?

I have lots of strings in a text file, like this:
"/home/mossen/Desktop/jeff's project/Results/FCCY.png"
"/tmp/accept/FLWS14UU.png"
"/home/tten/Desktop/.wordi/STSMLC.png"
I want to get only the file names from the strings as I read the text file line by line, using a bash shell script. The file name will always end in .png and will always have a "/" in front of it. I can get each string into a variable, but what is the best way to extract the filenames (FCCY.png, FLWS14UU.png, etc.) into variables? I can't count on the user having Perl, Python, etc., just the standard Unix utils such as awk and sed.
Thanks,
mossen

You want basename:
$ basename /tmp/accept/FLWS14UU.png
FLWS14UU.png

basename works on one file/string at a time. If you have many strings, you will be iterating over the file and calling an external command many times. Use awk instead:
$ awk -F'[/"]' '{print $(NF-1)}' file
FCCY.png
FLWS14UU.png
STSMLC.png
or use the shell:
while read -r line
do
line=${line##*/}
echo "${line%\"}"
done <"file"

newlist=$(for file in ${list} ;do basename ${file}; done)
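For example, assuming list holds whitespace-separated paths (note that the unquoted ${list} word-splits, so this breaks on paths with embedded spaces like "jeff's project" above):
$ list="/tmp/accept/FLWS14UU.png /home/tten/Desktop/.wordi/STSMLC.png"
$ newlist=$(for file in ${list}; do basename "${file}"; done)
$ echo "$newlist"
FLWS14UU.png
STSMLC.png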

$ var="/home/mossen/Desktop/jeff's project/Results/FCCY.png"
$ file="${var##*/}"

Using basename iteratively has a huge performance hit. It's small and unnoticeable when you're doing it on a file or two, but it adds up over hundreds of them. Let me do some timing tests for you to exemplify why using basename (or any system-utility call-out) is bad when an internal feature can do the job -- Dennis and ghostdog74 gave you the more experienced BASH answers.
Sample input files.txt (list of my pics with full path): 3749 entries
external.sh
while read -r line
do
line=`basename "${line}"`
echo "${line%\"}"
done < "files.txt"
internal.sh
while read -r line
do
line=${line##*/}
echo "${line%\"}"
done < "files.txt"
Timed results, redirecting output to /dev/null to get rid of any video lag:
$ time sh external.sh 1>/dev/null
real 0m4.135s
user 0m1.142s
sys 0m2.308s
$ time sh internal.sh 1>/dev/null
real 0m0.413s
user 0m0.357s
sys 0m0.021s
The output of both is identical:
$ sh external.sh | sort > result1.txt
$ sh internal.sh | sort > result2.txt
$ diff -uN result1.txt result2.txt
As you can see from the timing tests, you really want to avoid external calls to system utilities when you can write the same feature in native BASH, especially when it's going to be called many times over.

Related

How to specify more inputs as a single input in Linux command-line?

I searched online but didn't find anything that could answer my question.
I'm using a Java tool on Ubuntu Linux, calling it from bash; this tool takes two paths for two different input files:
java -Xmx8G -jar picard.jar FastqToSam \
FASTQ=6484_snippet_1.fastq \
FASTQ2=6484_snippet_2.fastq \
[...]
(FASTQ is the first read file of the pair; FASTQ2 is the second.)
What I'd like to do is, instead of specifying the path of a single file for FASTQ, specify the paths of two different files. So rather than running cat file1 file2 > File and using File as the FASTQ input, I'd like the concatenation to be executed on the fly, without saving the combined File to the file system.
I hope I've been clear in explaining my question; if not, just ask and I'll try to explain better.
Most well-written shell commands which accept a file name argument also accept a list of file name arguments, like cat file or cat file1 file2.
If the program you are trying to use doesn't support this and cannot easily be fixed, perhaps your OS or shell makes /dev/stdin available as a pseudo-file:
cat file1 file2 | java -mumble -crash -burn FASTQ=/dev/stdin
Some shells also have process substitutions, which (typically) look to the calling program like a single file containing whatever the process substitution produces on standard output.
java -mumble -crash -burn FASTQ=<(cat file1 file2) FASTQ2=<(cat file3 file4)
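To see what the calling program receives, you can echo a process substitution; in bash it expands to a /dev/fd path (the exact descriptor number varies):
$ echo <(true)
/dev/fd/63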
If neither of these work, a simple shell script which uses temporary files and deletes them when it's done is a tried and true solution.
#!/bin/sh
: ${4?Need four file name arguments, will process them pairwise}
t=$(mktemp -d -t fastqtwoness.XXXXXXX) || exit
trap 'rm -rf $t' EXIT HUP INT TERM # remove in case of failure or when done
cat "$1" "$2" >$t/1.fastq
cat "$3" "$4" >$t/2.fastq
exec java -mumble -crash -burn FASTQ=$t/1.fastq FASTQ2=$t/2.fastq
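For example, if the script above is saved as fastq_pairs.sh (a name I've made up) and made executable, a run on four hypothetical input files would look like this:
$ ./fastq_pairs.sh read1a.fastq read1b.fastq read2a.fastq read2b.fastq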

Copy a txt file twice to a different file using bash

I am trying to cat file.txt, loop through its whole content twice, and copy it to a new file, file_new.txt. The bash command I am using is as follows:
for i in {1..3}; do cat file.txt > file_new.txt; done
The above command is just giving me the same file contents as file.txt. Hence file_new.txt is also of the same size (1 GB).
Basically, if file.txt is a 1GB file, then I want file_new.txt to be a 2GB file, double the contents of file.txt. Please, can someone help here? Thank you.
Simply apply the redirection to the for loop as a whole:
for i in {1..3}; do cat file.txt; done > file_new.txt
The advantage of this over using >> (aside from not having to open and close the file multiple times) is that you needn't ensure that a preexisting output file is truncated first.
Note that the generalization of this approach is to use a group command ({ ...; ...; }) to apply redirections to multiple commands; e.g.:
$ { echo hi; echo there; } > out.txt; cat out.txt
hi
there
Given that whole files are being output, the cost of invoking cat for each repetition will probably not matter that much, but here's a robust way to invoke cat only once:[1]
# Create an array of repetitions of filename 'file' as needed.
files=(); for ((i=0; i<3; ++i)); do files[i]='file'; done
# Pass all repetitions *at once* as arguments to `cat`.
cat "${files[#]}" > file_new.txt
[1] Note that, hypothetically, you could run into your platform's command-line length limit, as reported by getconf ARG_MAX; given that on Linux that limit is 2,097,152 bytes (2 MB), that's not likely, though.
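You can check your platform's limit like so (the value shown is from a typical Linux system):
$ getconf ARG_MAX
2097152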
You could use the append operator, >>, instead of >. Then adjust your loop count as needed to get the output size desired.
You should adjust your code so it is as follows:
for i in {1..3}; do cat file.txt >> file_new.txt; done
The >> operator appends data to a file rather than overwriting it, as > does.
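A quick demonstration of the difference (f is a throwaway file):
$ echo one > f; echo two > f; cat f
two
$ echo one > f; echo two >> f; cat f
one
two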
If file.txt is a 1GB file, then:
cat file.txt > file_new.txt
cat file.txt >> file_new.txt
The > operator will create file_new.txt (1GB); the >> operator will then append to it (2GB).
for i in {1..3}; do cat file.txt >> file_new.txt; done
This command will make file_new.txt 3GB, because for i in {1..3} runs three times.
As others have mentioned, you can use >> to append. But, you could also just invoke cat once and have it read the file 3 times. For instance:
n=3; cat $( yes file.txt | sed ${n}q ) > file_new.txt
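To see what the command substitution expands to (yes repeats the name forever, sed ${n}q quits after n lines, and the unquoted expansion splits into n separate arguments):
$ n=3; echo $( yes file.txt | sed ${n}q )
file.txt file.txt file.txt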
Note that this solution exhibits a common anti-pattern and fails to properly quote the arguments, which will cause issues if the filename contains whitespace. See mklement's solution for a more robust solution.

referencing stdout in a command that has been piped into

I want to make a simple dmenu command that reads a file of commands and names. Then takes the names and displays them using dmenu then takes dmenu's output and runs the associated command using the file again.
I got to the point where dmenu displays the names, but I don't really know where to go from there. Learning bash is a really daunting task to me and I don't really know where to start with this seemingly simple script/command.
here is the file:
Pushbullet
google-chrome-stable --app=https://www.pushbullet.com
Steam
steam
Chrome
google-chrome-stable
Libre Office
libreoffice
Transmission
transmission-qt
Audio Control Panel
sudo pavucontrol & bluberry
and here is what I have so far for my command:
awk 'NR % 2 != 0' /home/rocco/programlist | dmenu | ??(grep -l "stdout" /home/rocco/programlist....)
My thinking was that I could somehow pipe the name of the application into grep or awk, get its line number, add one, and pipe that line into sh.
Thanks
I have no experience with dmenu, but if I understand correctly how it works, this should do what you want. Wrapping a command in $(…) captures its output in a variable, which we can pass on to another command.
#!/bin/bash
plist="/home/rocco/programlist"
# pipe every second line to dmenu
selected=$(awk 'NR % 2 != 0' "$plist" | dmenu)
# search for the selected item, get the command after it
cmd=$(grep -A1 "$selected" "$plist" | tail -n 1)
# run the command
$cmd
Worth mentioning a mistake in your question: dmenu writes to stdout (standard output), but the next program in the pipeline reads stdin (standard input). In any case, grep doesn't take its search pattern from standard input (only the input to search), which is why I've saved the selection to a variable instead of trying to pipe it somewhere.
Assuming you have programlist.txt in the working directory you can use:
awk 'NR % 2 != 0' programlist.txt | dmenu | awk '{system("grep --no-group-separator -A 1 '"'"'"$0"'"'"' programlist.txt");}' | awk '{if(NR==2){system($0);}}'
Note the quoting of the $0 in the first awk invocation. This is necessary to handle names with spaces in them, like "Libre Office".
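If the quoting gets too hairy, here's a sketch of the same idea that passes the selection to awk as a variable instead of splicing $0 into a shell string (the variable name sel is mine):
awk 'NR % 2 != 0' programlist.txt | dmenu | {
IFS= read -r sel
# exact-match the selected name, then run the command on the following line
awk -v sel="$sel" '$0 == sel { getline; system($0) }' programlist.txt
}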

Improve performance of Bash loop that removes windows line endings

Editor's note: This question was always about loop performance, but the original title led some answerers - and voters - to believe it was about how to remove Windows line endings.
The bash loop below just removes the Windows line endings and converts them to Unix ones. It appears to be running, but it is slow. The input files are small (4 files ranging from 167 bytes to 1 KB), and all have the same structure (a list of names); the only thing that varies is the length (i.e., some files are 10 names, others are 50). Is it supposed to take over 15 minutes to complete this task on a Xeon processor? Thank you :)
for f in /home/cmccabe/Desktop/files/*.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
sed 's/\r//' $f - $f > /home/cmccabe/Desktop/files/${pref}_unix.txt
done
Input .txt files
AP3B1
BRCA2
BRIP1
CBL
CTC1
EDIT
This is not a duplicate, as I was asking why my bash loop that uses sed to remove Windows line endings was running so slowly. I did not mean to ask how to remove them; I was asking for ideas that might speed up the loop, and I got many. Thank you :). I hope this helps.
Use the utilities dos2unix and unix2dos to convert between unix and windows style line endings.
Your sed command looks wrong: the trailing $f - $f should simply be $f. The stray - makes sed wait for input on standard input, which is why running your script as written hangs for a very long time on my system; making this change causes it to complete almost instantly.
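With that one change (and with the expansions quoted, since the paths may contain spaces) the original loop becomes:
for f in /home/cmccabe/Desktop/files/*.txt ; do
bname=$(basename "$f")
pref=${bname%%.txt}
sed 's/\r//' "$f" > "/home/cmccabe/Desktop/files/${pref}_unix.txt"
done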
Of course, the best answer is to use dos2unix, which was designed to handle this exact thing:
cd /home/cmccabe/Desktop/files
for f in *.txt ; do
pref=$(basename -s '.txt' "$f")
dos2unix -q -n "$f" "${pref}_unix.txt"
done
This always works for me:
perl -pe 's/\r\n/\n/' inputfile.txt > outputfile.txt
You can use dos2unix as stated before, or use this small sed command:
sed 's/\r//' file
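If your sed is GNU sed, you can also convert the file in place:
sed -i 's/\r//' file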
The key to performance in Bash is to avoid loops in general, and in particular those that call one or more external utilities in each iteration.
Here is a solution that uses a single GNU awk command:
awk -v RS='\r\n' '
BEGINFILE { outFile=gensub("\\.txt$", "_unix&", 1, FILENAME) }
{ print > outFile }
' /home/cmccabe/Desktop/files/*.txt
-v RS='\r\n' sets CRLF as the input record separator; by leaving ORS, the output record separator, at its default of \n, simply printing each input line will terminate it with \n.
the BEGINFILE block is executed every time processing of a new input file starts; in it, gensub() is used to insert _unix before the .txt suffix of the input file at hand to form the output filename.
{print > outFile} simply prints the \n-terminated lines to the output file at hand.
Note that use of a multi-char. RS value, the BEGINFILE block, and the gensub() function are GNU extensions to the POSIX standard.
Switching from the OP's sed solution to a GNU awk-based one was necessary in order to provide a single-command solution that is both simpler and faster.
Alternatively, here's a solution that relies on dos2unix for conversion of Windows line endings (for instance, you can install dos2unix with sudo apt-get install dos2unix on Debian-based systems); except for requiring dos2unix, it should work on most platforms (no GNU utilities required):
It uses a loop only to construct the array of filename arguments to pass to dos2unix - this should be fast, given that no call to basename is involved; Bash-native parameter expansion is used instead.
then uses a single invocation of dos2unix to process all files.
# cd to the target folder, so that the operations below do not need to handle
# path components.
cd '/home/cmccabe/Desktop/files'
# Collect all *.txt filenames in an array.
inFiles=( *.txt )
# Derive output filenames from it, using Bash parameter expansion:
# '%.txt' matches '.txt' at the end of each array element, and replaces it
# with '_unix.txt', effectively inserting '_unix' before the suffix.
outFiles=( "${inFiles[@]/%.txt/_unix.txt}" )
# Create an interleaved array of *input-output filename pairs* to be passed
# to dos2unix later.
# To inspect the resulting array, run `printf '%s\n' "${fileArgs[@]}"`
# You'll see pairs like these:
# file1.txt
# file1_unix.txt
# ...
fileArgs=(); i=0
for inFile in "${inFiles[@]}"; do
fileArgs+=( "$inFile" "${outFiles[i++]}" )
done
# Now, use a *single* invocation of dos2unix, passing all input-output
# filename pairs at once.
dos2unix -q -n "${fileArgs[@]}"

Alternative to ls in shell-script compatible with nohup

I have a shell-script which lists all the file names in a directory and store them in a new file.
The problem is that when I execute this script with the nohup command, it lists the first name four times instead of listing the correct names.
When I discussed the problem with other programmers, they thought the problem might be the ls command.
Part of my code is the following:
for i in $( ls -1 ./Datasets/); do
awk '{print $1}' ./genes.txt | head -$num_lineas | tail -1 >> ./aux
let num_lineas=$num_lineas-1
done
Do you know an alternative to ls that works well with nohup?
Thanks.
Don't use ls to feed the loop, use:
for i in ./Datasets/*; do
or if subdirectories are of interest
for i in ./Datasets/*/*; do
Lastly, and more correctly, use find if you need the entire tree below Datasets:
find ./Datasets -type f | while IFS= read -r file; do
(do stuff with $file)
done
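If filenames might contain newlines, the null-delimited variant is more robust:
find ./Datasets -type f -print0 | while IFS= read -r -d '' file; do
(do stuff with "$file")
done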
Others frown on it because word splitting will mangle filenames containing whitespace, but if your filenames are well-behaved there is nothing wrong with also using find like this:
for file in $(find ./Datasets -type f); do
(do stuff with $file)
done
Just choose the syntax that most closely meets your needs.
First of all, don't parse ls! A simple glob will suffice. Secondly, your awk | head | tail chain can be simplified by only printing the first column of the line that you're interested in using awk. Thirdly, you can redirect the output of your loop to a file, rather than using >>.
Incorporating all of those changes into your script:
for i in Datasets/*; do
awk -v n="$(( num_lineas-- ))" 'NR==n{print $1}' genes.txt
done > aux
Every time the loop goes round, the value of $num_lineas will decrease by 1.
In terms of your problem with nohup, I would recommend looking into using something like screen, which is known to be a better solution for maintaining a session between logins.
