concatenate two strings and one variable using bash - linux

I need to generate a filename from three parts: two strings and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the below files:
_1.fastq.gze_11
_1.fastq.gze_12
The string after the variable overwrites the string before it.
I appreciate any help
Regards

By the way, your idiom for f in `cat files.csv` should be avoided. Refer: Dangerous Backticks
while read f
do
echo "fastq/${f}/_1.fastq.gze"
done < files.csv

You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command with as many arguments as it can fit onto the command line (splitting the work into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX limit in your kernel).
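For the two sample names above (and assuming files.csv has plain Unix line endings), a quick check of what this prints:
$ xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze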

Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gze
See this answer for some details about the general concept, as well.
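A quick illustration of why the braces matter: without them, the shell tries to expand a variable named f_1 (underscores are valid in variable names), which is unset here, so the prefix silently disappears:
$ f=Sample_11
$ echo fastq/$f_1.fastq.gze
fastq/.fastq.gze
$ echo fastq/${f}_1.fastq.gze
fastq/Sample_11_1.fastq.gze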
Edit: Looking at the now-provided output, an additional thought: this may not be a coding problem at all, but rather a conflict between line endings and the terminal/console program.
Specifically, if the lines in the CSV file end with a carriage return (ASCII/Unicode 13), the carriage return at the end of Sample_11 "rewinds" the cursor to the start of the line, and the text printed after it overwrites what was already there.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...
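If you do end up rewriting with while, here is a minimal sketch that combines the safer read loop with carriage-return stripping (assumes bash, for the $'\r' quoting):
while IFS= read -r f; do
    f=${f%$'\r'}    # drop a trailing carriage return, if any
    printf 'fastq/%s_1.fastq.gze\n' "$f"
done < files.csv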

Related

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without them being formatted for printing. For example, to show:
\n0.032,170\n0.34,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single specific character. So utilities like cat do in fact output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself if you open the file in a hex editor. There you will see the byte 0A (decimal 10), which is a line feed character. You will not see the pair of characters \ and n anywhere in that file.
Many programming languages, and also shell environments, use escape sequences like \n in string literals to denote control characters that would otherwise not be typeable. Maybe that is where your impression comes from that your file should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
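As a concrete check, here is that command next to a raw byte dump of a two-line sample file (od -c output formatting can vary slightly between implementations):
$ printf '0.032,170\n0.34,290\n' > example.csv
$ awk 1 ORS='\\n' example.csv
0.032,170\n0.34,290\n
$ od -c example.csv
0000000   0   .   0   3   2   ,   1   7   0  \n   0   .   3   4   ,   2
0000020   9   0  \n
0000023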

Why does a part of this variable get replaced when combining it with a string?

I have the following Bash script which loops through the lines of a file:
INFO_FILE=playlist-info-test.txt
line_count=$(wc -l $INFO_FILE | awk '{print $1}')
for ((i=1; i<=$line_count; i++))
do
current_line=$(sed "${i}q;d" $INFO_FILE)
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo $input_file
done
This is a sample of the playlist-info-test.txt file:
Playlist 1
Playlist2
Playlist 3
The output of the script should be as follows:
Playlist 1.mp3
Playlist2.mp3
Playlist 3.mp3
However, I am getting the following output:
.mp3list 1
.mp3list2
.mp3list 3
I have spent a few hours on this and can't understand why the ".mp3" part is being moved to the front of the string. I initially thought it was because of the space in the lines of the input file, but removing the space doesn't make a difference. I also tried using a while loop with read line and the input file redirected into it, but that does not make any difference either.
I copied the playlist-info-test.txt contents and the script, and get the output you expected. Most likely there are non-printable characters in your playlist-info-test.txt or script which are messing up the processing. Check the binary contents of both files using, for example, xxd -g 1, and look for non-printing characters other than newline (0a).
Did the file come from Windows? DOS and Windows end their lines with carriage return (hex 0d, sometimes represented as \r) followed by linefeed (hex 0a, sometimes represented as \n). Unix just uses linefeed, and so tends to treat the carriage return as part of the content of the line. In your case, it winds up at the end of the current_line variable, so input_file winds up something like "Playlist 1\r.mp3". When you print this to the terminal, the carriage return makes it go back to the beginning of the line (that's what carriage return means), so it prints as:
Playlist 1
.mp3
...with the ".mp3" printed over the "Play" part, rather than on the next line like I have it above.
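You can reproduce the effect directly in a terminal (a quick demonstration):
$ printf 'Playlist 1\r.mp3\n'
.mp3list 1
The ".mp3" is printed over the "Play", which is exactly the garbled output shown in the question.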
Solution: either fix the file (there's a fairly standard dos2unix program that does precisely this), or change your script to strip carriage returns as it reads the file. Actually, I'd recommend a rewrite anyway, since your current use of sed to pick out lines is rather weird and inefficient. In a shell script, the standard way to read through a file line-by-line is to use a loop like while read -r current_line; do [commands here]; done <"$INFO_FILE". There's a possible problem that if any commands inside the loop read from standard input, they'll wind up inhaling part of that file; you can fix that by passing the file over file descriptor 3 rather than standard input. With that fix and a trick to trim carriage returns, here's what it looks like:
INFO_FILE=playlist-info-test.txt
while IFS=$' \t\n\r' read -r current_line <&3; do
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo "$input_file"
done 3<"$INFO_FILE"
(The carriage return trim is done by read -- it always auto-trims leading and trailing whitespace, and setting IFS to $' \t\n\r' tells it to treat spaces, tabs, linefeeds, and carriage returns as whitespace. And since that assignment is a prefix to the read command, it applies only to that one command and you don't have to set IFS back to normal afterward.)
A couple of other recommendations while I'm here: double-quote all variable references (as I did with echo "$input_file" above), and avoid all-caps variable names (there are a bunch with special meanings, and if you accidentally use one of those it can have weird effects). Oh, and try passing your scripts to shellcheck.net -- it's good at spotting common mistakes.
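As a quick way to confirm the diagnosis before changing anything (the exact wording from file varies by version, and dos2unix rewrites the file in place):
$ file playlist-info-test.txt
playlist-info-test.txt: ASCII text, with CRLF line terminators
$ dos2unix playlist-info-test.txt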

expr bash for sed a line in log does not work

My goal is to extract the 100th line with sed and treat it as a string, then split the sentence into separate words.
#!/bin/bash
fid=log.txt;
sentence=`expr sed -n '100p' ${fid}`;
for word in $sentence
do
echo $word
done
but apparently this has failed.
expr: syntax error
Would somebody please let me know what I have done wrong? Previously it worked for numbers.
The expr does not seem to serve a useful purpose here, and if it did, a sed command would certainly not be a valid or useful thing to pass to it, under most circumstances. You should probably just take it out.
However, the following loop is also problematic. Unquoted variables in shell script are very frequently an error. In this case, you can't quote the thing you pass to the for loop (that would cause the loop to only run once, with the loop variable set to the quoted string) but you also cannot prevent the shell from performing wildcard expansion on the unquoted string. So if the string happened to contain *, the shell will expand that to a list of files in the current directory, for example.
Fortunately, this can all be done in an only slightly more complicated sed script.
sed '100!d;s/[ \t]\+/\n/g;q' "$fid"
That is, if the line number is not 100, delete this line and start over with the next line. Otherwise, we are at line 100; replace runs of whitespace with newlines, (print) and quit.
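As an illustration, assuming GNU sed and a log.txt whose 100th line happens to read "alpha beta gamma" (a made-up example):
$ sed '100!d;s/[ \t]\+/\n/g;q' log.txt
alpha
beta
gamma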
(The backslash escape codes \t and \n are not universally portable; and \+ for repetition is also an optional extension. I believe there are also sed variants which dislike semicolon as a command separator. Consult your sed manual page, experiment, and if everything else fails, maybe switch to Awk or Perl. Just in case, here is a version which works even on Mac OSX:
sed '100!d
s/[ ][ ]*/\
/g;q' log.txt
The stuff inside the square brackets are a space and a literal tab; in Bash, with default keybindings, type ctrl-V, tab to produce a literal tab.)
Incidentally, this also gets rid of the variable capture antipattern. There are good reasons to capture output to a variable, but if it can be avoided, you often end up with a simpler, more robust and efficient, as well as more idiomatic and elegant script. (I see no reason to put the log file name in a variable, either, in this isolated case; but in a larger script, it might make sense.)
I don't think you need the expr command in this case.
expr is used to evaluate expressions, mostly arithmetic. Something like:
expr 1 + 1
Just this one is fine:
sentence=`sed -n '100p' ${fid}`;
#!/bin/bash
fid=log.txt;
sentence=$(sed -n '100p' ${fid});
for word in $sentence
do
echo $word
done
Putting a dollar sign and parentheses around the command, i.e. using $(...) instead of backticks with expr, solves the problem.

find string and replace

Hi, I have a file like this:
L_00001_mRNA_interferase_MazF
ATGGATTATCCAAAACAAAAGGATATTGTCTGGATTGATTTTGACCCTTCTAAAGGCAAA
GAGATAAGAAAGCGGAGACCTGCGTTAGTAGTTAGTAAAGATGAATTTAATGAACGTACA
GGTTTCTGTTTAGTTTGCCCCATCACATCTACTAAAAGGAACTTTGCAACGTATATTGAA
ATAACAGACCCACAGAAAGTAGAAGGGGACGTAGTTACCCATCAATTGCGAGCGGTTGAT
TACACCACAAGAAATATCGAAAAAATTGAACAATGTGATATGTTGACGTGGATTGATGTA
GTAGAAGTAATCGGAATGTTTATTTAA
L_00002_hypothetical_protein
ATGGAAACGGTAGTTAGAAAGATAGGGAATTCAGTAGGAACTATTTTTCCGAAAAGTATT
TCACCACAAGTTGGAGAAAAGTTCACTATTCTTAAAGTTGGGGAAGCGTATATATTGAAA
CCTAAGAGAGAAGATATTTTTAAAAATGCTGAAGATTGGGTAGGGTTTAGAGAAGCTTTG
ACTAATGAAGATAAAGAATGGGACGAGATGAAACTTGAGGGAGGAGAACGCTAG
L_00003_hypothetical_protein
ATGACAACGTTTGGAGAAATTCATAGCAATGCAGAAGGTTATAAAAACGATTTTAATGAG
TTGAATAAATTAGTATTACGTGTAGCTGAAGAAAAAGCAAAAGGAGAGCCATTAGTAACG
TGGTTTCGGTTGCGGAATCGTAGGATTGCACAAGTATTAGACCCAATGAAAGAAGAAGTA
GAAAGTAAATCAAAGTACGAAAAAAGAAGAGTAGCAGCAATTAGTAAAAGCTTTTTTCTA
CTTAAAAAAGCTTTTAACTTTATTGAAGCAGAACAATTTGAAAAAGCAGAAAAATTAATT
I would like to substitute the header of each sequence with a string.
I have a conversion file like
L_00001_mRNA_interferase_MazF galM,GALM,aldose1-epimerase[EC:5.1.3.3]
L_00002_hypothetical_protein E3.2.1.85,lacG,6-phospho-beta-galactosidase[EC:3.2.1.85]
L_00003_hypothetical_protein PTS-Lac-EIIB,lacE,PTSsystem,lactose-specificIIBcomponent[EC:2.7.1.69]
Your question is unclear as to what platform you're on (Windows, Linux, Mac, ...), what languages you're constrained to, and the exact details of your input files.
On the assumption that you're on Linux, or otherwise have sed and awk available and a command shell, it could be as simple as (where $ indicates a Bourne-like shell prompt):
$ awk '{print "s/^" $1 "/" $2 "/"}' conversions.txt > conversions.sed
$ sed -f conversions.sed sequences.txt > relabeled.txt
This assumes that your first file (with the headings you want changed) is called sequences.txt and your second file (the “conversion file”) is called conversions.txt. It is further assumed that the “conversion file” contains one record per line with exactly two fields — the original and substitute headers — separated by whitespace (i.e. neither the original header nor the new header contains any spaces) and no blank lines.
In this solution, the first (awk) line converts the conversions.txt file into a sed script, conversions.sed; the second (sed) line then runs this script on the sequences.txt file, producing the relabeled.txt file, which may (or may not) be what you're looking for.
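For the sample conversion file above, the generated conversions.sed would contain (assuming the two fields really are whitespace-separated, as described):
s/^L_00001_mRNA_interferase_MazF/galM,GALM,aldose1-epimerase[EC:5.1.3.3]/
s/^L_00002_hypothetical_protein/E3.2.1.85,lacG,6-phospho-beta-galactosidase[EC:3.2.1.85]/
s/^L_00003_hypothetical_protein/PTS-Lac-EIIB,lacE,PTSsystem,lactose-specificIIBcomponent[EC:2.7.1.69]/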
Depending on the exact nature of your input files, which isn't clear from your question, this may need a bit of tweaking.
Hope this helps.

How do I grep for entire, possibly wrapped, lines of code?

When searching code for strings, I constantly run into the problem that I get meaningless, context-less results. For example, if a function call is split across 3 lines, and I search for the name of a parameter, I get the parameter on a line by itself and not the name of the function.
For example, in a file containing
...
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
grepping for MY_CONSTANT would return a line that looked like this:
MY_CONSTANT,
Likewise, in a comment block:
/////////////////////////////////////////
// FIXMESOON, do..while is the wrong choice here, because
// it makes the wrong thing happen
/////////////////////////////////////////
Grepping for FIXMESOON gives the very frustrating answer:
// FIXMESOON, do..while is the wrong choice here, because
When there are thousands of hits, single line results are a little meaningless. What I would like to do is have grep be aware of the start and stop points of source code lines, something as simple as having it consider ";" as the line separator would be a good start.
Bonus points if you can make it return the entire comment block if the hit is inside a comment.
I know you can't do this with grep alone. I am also aware of the option to have grep return a certain number of lines of context. Any suggestions on how to accomplish this under Linux? FYI my preferred languages are C and Perl.
I'm sure I could write something, but I know that somebody must have already done this.
Thanks!
You can use pcregrep with the -M option (multiline matching; pcregrep is grep with Perl-compatible regular expressions). Something like:
pcregrep -M ";*\R*.*thingtosearchfor*\R*.*;.*"
Here's an example using awk.
$ cat file
blah1
blah2
function1 ("test",
MY_CONSTANT,
(some *really) - long / expression);
function2( one , two )
blah3
blah4
$ awk -vRS=")" '/function1/{gsub(".*function1","function1");print $0RT}' file
function1 ("test",
MY_CONSTANT,
(some *really)
The concept behind it: RS is awk's record separator. By setting it to ")", every record in your file is separated by ")" instead of a newline. This makes it easy to find "function1", since you can then "grep" for it within a whole record. (RT, which holds the text that actually matched the record separator, requires GNU awk.) If you don't use awk, the same concept can be applied by "splitting" on ")".
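To sketch that shell-only splitting idea, here using the asker's suggestion of ";" as the separator rather than ")" (a rough one-liner, not a parser; output spacing is approximate):
$ tr '\n' ' ' < file | tr ';' '\n' | grep MY_CONSTANT
blah1 blah2 function1 ("test",   MY_CONSTANT,   (some *really) - long / expression)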
You could build a command line using grep with the options that give you the line number and the filename, pipe those results through xargs into awk to parse out the columns, and then use a little script of your own to display the N lines surrounding each match? :)
If this isn't an academic endeavour you could just use cscope (for C code only though). If you are willing to drop the requirement to search in comments ctags should be enough (and it also supports Perl).
I had a situation in which I had an xml file full of the names of zip files in an xml-style format, that is, with angle brackets around the names of the files, say example.zip<\stuff>
I used awk to change all the angle brackets into newlines, then used grep :)
