How to echo/print actual file contents on a unix system - linux

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.34,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?

This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single control character. So utilities like cat do in fact output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would output them literally. I just checked that, to be sure.
You can easily verify this yourself by opening the file in a hex editor. There you will see the byte 0A (decimal 10), which is the line feed character. You will not see the pair of characters \ and n anywhere in that file.
Many programming languages, and also shell environments, use escape sequences like \n in string definitions to denote control characters that could not be typed otherwise. That is probably where your impression comes from that your file should contain those two characters.
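You can also do the hex-editor check from the shell. A minimal sketch, assuming a file named example.csv with the contents from the question:

```shell
# Create a sample file, then dump its bytes: od -c renders the line
# feed as \n, and there is no literal backslash-n pair anywhere.
printf '0.032,170\n0.34,290\n' > example.csv
od -c example.csv
```

The \n shown by od -c is od's rendering of the single byte 0A, not two characters in the file.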

To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.

Why does a part of this variable get replaced when combining it with a string?

I have the following Bash script which loops through the lines of a file:
INFO_FILE=playlist-info-test.txt
line_count=$(wc -l $INFO_FILE | awk '{print $1}')
for ((i=1; i<=$line_count; i++))
do
current_line=$(sed "${i}q;d" $INFO_FILE)
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo $input_file
done
This is a sample of the playlist-info-test.txt file:
Playlist 1
Playlist2
Playlist 3
The output of the script should be as follows:
Playlist 1.mp3
Playlist2.mp3
Playlist 3.mp3
However, I am getting the following output:
.mp3list 1
.mp3list2
.mp3list 3
I have spent a few hours on this and can't understand why the ".mp3" part is being moved to the front of the string. I initially thought it was because of the space in the lines of the input file, but removing the space doesn't make a difference. I also tried using a while loop with read line and the input file redirected into it, but that does not make any difference either.
I copied the playlist-info-test.txt contents and the script, and I get the output you expected. Most likely there are non-printable characters in your playlist-info-test.txt or in the script itself that are messing up the processing. Check the binary contents of both files with, for example, xxd -g 1, and look for non-printing characters other than newline (0a).
Did the file come from Windows? DOS and Windows end their lines with carriage return (hex 0d, sometimes represented as \r) followed by linefeed (hex 0a, sometimes represented as \n). Unix just uses linefeed, and so tends to treat the carriage return as part of the content of the line. In your case, it winds up at the end of the current_line variable, so input_file winds up something like "Playlist 1\r.mp3". When you print this to the terminal, the carriage return makes it go back to the beginning of the line (that's what carriage return means), so it prints as:
Playlist 1
.mp3
...with the ".mp3" printed over the "Play" part, rather than on the next line like I have it above.
Solution: either fix the file (there's a fairly standard dos2unix program that does precisely this), or change your script to strip carriage returns as it reads the file. Actually, I'd recommend a rewrite anyway, since your current use of sed to pick out lines is rather weird and inefficient. In a shell script, the standard way to read through a file line by line is a loop like while read -r current_line; do [commands here]; done <"$INFO_FILE". There's a possible problem: if any commands inside the loop read from standard input, they'll wind up inhaling part of that file; you can fix that by passing the file over file descriptor 3 rather than standard input. With that fix and a trick to trim carriage returns, here's what it looks like:
INFO_FILE=playlist-info-test.txt
while IFS=$' \t\n\r' read -r current_line <&3; do
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo "$input_file"
done 3<"$INFO_FILE"
(The carriage return trim is done by read -- it always auto-trims leading and trailing whitespace, and setting IFS to $' \t\n\r' tells it to treat spaces, tabs, linefeeds, and carriage returns as whitespace. And since that assignment is a prefix to the read command, it applies only to that one command and you don't have to set IFS back to normal afterward.)
A couple of other recommendations while I'm here: double-quote all variable references (as I did with echo "$input_file" above), and avoid all-caps variable names (there are a bunch with special meanings, and if you accidentally use one of those it can have weird effects). Oh, and try passing your scripts to shellcheck.net -- it's good at spotting common mistakes.

concatenate two strings and one variable using bash

I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that instead I get the following output:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided (refer: Dangerous Backticks). Use a while read loop instead:
while read f
do
echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command with as many arguments as it can fit onto the command line (splitting the work into multiple invocations if the input is too large to fit onto a single command line, subject to the ARG_MAX constant in your kernel).
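printf's recycling of its format string is easy to see directly, using the sample names from the question:

```shell
# printf reuses the format string for each remaining argument,
# which is exactly what xargs feeds it from the input file.
printf 'fastq/%s_1.fastq.gze\n' Sample_11 Sample_12
```

This prints fastq/Sample_11_1.fastq.gze and fastq/Sample_12_1.fastq.gze, one per line.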
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo "fastq/${f}_1.fastq.gze"
See this answer for some details about the general concept, as well.
Edit: an additional thought: looking at the now-provided output makes me think this isn't a coding problem at all, but rather a conflict between line endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
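A quick check of what that tr filter does to a CR-terminated line, with a made-up sample:

```shell
# \015 (carriage return) is outside the kept set, so it is deleted;
# \012 (line feed) is inside the set and survives.
printf 'Sample_11\r\n' | tr -cd '\011\012\040-\176'
```

The output is Sample_11 followed by a plain line feed, with the carriage return gone.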
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...

Ignore spaces, tabs and new line in SED

I tried to replace a string in a file that contains tabs and line breaks.
the command in the shell file looked something like this:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -i 's/'"$STRING_OLD"'/'"$STRING_NEW"'/' $FILE
If I manually remove the line breaks and the tabs and leave only the spaces, then the replacement succeeds. But if I leave the line breaks in, sed is unable to locate $STRING_OLD and so is unable to replace it with the new string.
Thanks in advance
Kobi
sed reads lines one at a time, and usually lines are also processed one at a time, as they are read. However, sed does have facilities for reading additional lines and operating on the combined result. There are several ways that could be applied to your problem, such as:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}" "$FILE"
That does more or less what you describe doing manually: it concatenates all the lines of the file (but keeps the newlines), and then performs the substitution on the overall buffer, all at once. It does assume, however, either that the file is short (POSIX does not require sed to work if the overall pattern-space length exceeds 8192 bytes) or that you are using a sed that does not have buffer-size limitations, such as GNU sed. Since you tagged Linux, I'm supposing that GNU sed can be assumed.
In detail:
the -n option turns off line echoing, because we save everything up and print the modified text in one chunk at the end.
there are multiple sed commands, separated by semicolons, and with literal $ characters escaped (for the shell):
1h: when processing the first line of input, replace the "hold space" with the contents of the pattern space (i.e. the first line, excluding newline)
2,\$H: when processing any line from the second through the last, append a newline to the hold space, then the contents of the pattern space
\${g;s/$STRING_OLD/$STRING_NEW/g;p}: when processing the last line, perform this group of commands: copy the hold space into the pattern space; perform the substitution, globally; print the resulting contents of the pattern space.
That's one of the simpler approaches, but if you need to accommodate seds that are not as capable as GNU's with regard to buffer capacity then there are other ways to go about it. Those start to get ugly, though.
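The hold-space mechanics can be seen on a toy input; here the substitution simply joins the two lines across the embedded newline (GNU sed assumed, and single quotes used since no shell variables are involved):

```shell
# 1h seeds the hold space with line 1; 2,$H appends the remaining
# lines; on the last line, g copies it all back and s// matches
# across the embedded newline (\n) before p prints the result.
printf 'line 1\nline 2\n' | sed -n '1h;2,$H;${g;s/ 1\n/ 1 /;p}'
```

This prints "line 1 line 2" on a single line, which shows the substitution really did see both lines at once.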

find string and replace

Hi, I have a file like this:
L_00001_mRNA_interferase_MazF
ATGGATTATCCAAAACAAAAGGATATTGTCTGGATTGATTTTGACCCTTCTAAAGGCAAA
GAGATAAGAAAGCGGAGACCTGCGTTAGTAGTTAGTAAAGATGAATTTAATGAACGTACA
GGTTTCTGTTTAGTTTGCCCCATCACATCTACTAAAAGGAACTTTGCAACGTATATTGAA
ATAACAGACCCACAGAAAGTAGAAGGGGACGTAGTTACCCATCAATTGCGAGCGGTTGAT
TACACCACAAGAAATATCGAAAAAATTGAACAATGTGATATGTTGACGTGGATTGATGTA
GTAGAAGTAATCGGAATGTTTATTTAA
L_00002_hypothetical_protein
ATGGAAACGGTAGTTAGAAAGATAGGGAATTCAGTAGGAACTATTTTTCCGAAAAGTATT
TCACCACAAGTTGGAGAAAAGTTCACTATTCTTAAAGTTGGGGAAGCGTATATATTGAAA
CCTAAGAGAGAAGATATTTTTAAAAATGCTGAAGATTGGGTAGGGTTTAGAGAAGCTTTG
ACTAATGAAGATAAAGAATGGGACGAGATGAAACTTGAGGGAGGAGAACGCTAG
L_00003_hypothetical_protein
ATGACAACGTTTGGAGAAATTCATAGCAATGCAGAAGGTTATAAAAACGATTTTAATGAG
TTGAATAAATTAGTATTACGTGTAGCTGAAGAAAAAGCAAAAGGAGAGCCATTAGTAACG
TGGTTTCGGTTGCGGAATCGTAGGATTGCACAAGTATTAGACCCAATGAAAGAAGAAGTA
GAAAGTAAATCAAAGTACGAAAAAAGAAGAGTAGCAGCAATTAGTAAAAGCTTTTTTCTA
CTTAAAAAAGCTTTTAACTTTATTGAAGCAGAACAATTTGAAAAAGCAGAAAAATTAATT
I would like to substitute the header of each sequence with a string.
I have a conversion file like
L_00001_mRNA_interferase_MazF galM,GALM,aldose1-epimerase[EC:5.1.3.3]
L_00002_hypothetical_protein E3.2.1.85,lacG,6-phospho-beta-galactosidase[EC:3.2.1.85]
L_00003_hypothetical_protein PTS-Lac-EIIB,lacE,PTSsystem,lactose-specificIIBcomponent[EC:2.7.1.69]
Your question is unclear as to what platform you're on (Windows, Linux, Mac, ...), what languages you're constrained to, and the exact details of your input files.
On the assumption that you're on Linux, or otherwise have sed and awk available and a command shell, it could be as simple as (where $ indicates a Bourne-like shell prompt):
$ awk '{print "s/^" $1 "/" $2 "/"}' conversions.txt > conversions.sed
$ sed -f conversions.sed sequences.txt > relabeled.txt
This assumes that your first file (with the headings you want changed) is called sequences.txt and your second file (the “conversion file”) is called conversions.txt. It is further assumed that the “conversion file” contains one record per line with exactly two fields — the original and substitute headers — separated by whitespace (i.e. neither the original header nor the new header contain any spaces) and no blank lines.
In this solution, the first (awk) line converts the conversions.txt file into a sed script, conversions.sed; the second (sed) line then runs this script on the sequences.txt file, producing the relabeled.txt file, which may (or may not) be what you're looking for.
Depending on the exact nature of your input files, which isn't clear from your question, this may need a bit of tweaking.
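A self-contained run with two tiny stand-in files (the names and contents here are invented for illustration):

```shell
cd "$(mktemp -d)"
# Two-column conversion table: original header, then replacement.
printf 'L_00001_mRNA_interferase_MazF galM\n' > conversions.txt
# A sequence file whose header line should be rewritten.
printf 'L_00001_mRNA_interferase_MazF\nATGGATTAT\n' > sequences.txt
# Turn the table into sed substitutions, then apply them.
awk '{print "s/^" $1 "/" $2 "/"}' conversions.txt > conversions.sed
sed -f conversions.sed sequences.txt > relabeled.txt
cat relabeled.txt
```

The header line comes out as galM while the sequence line below it is untouched.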
Hope this helps.

Bash - process backspace control character when redirecting output to file

I have to run a third-party program in background and capture its output to file. I'm doing this simply using the_program > output.txt. However, the coders of said program decided to be flashy and show processed lines in real-time, using \b characters to erase the previous value. So, one of the lines in output.txt ends up like Lines: 1(b)2(b)3(b)4(b)5, (b) being an unprintable character with ASCII code 08. I want that line to end up as Lines: 5.
I'm aware that I can write it as-is and post-process the file using AWK, but I wonder if it's possible to somehow process the control characters in-place, by using some kind of shell option or by piping some commands together, so that line would become Lines: 5 without having to run any additional commands after the program is done?
Edit:
Just a clarification: what I wrote here is a simplified version, actual line count processed by the program is a hundred thousands, so that string ends up quite long.
Thanks for your comments! I ended up piping the output of that program to the AWK script I linked in the question. I get a well-formed file in the end.
the_program | ./awk_crush.sh > output.txt
The only downside is that I get the output only once the program itself has finished, even though the initial output exceeds 5M and should be passed along in smaller chunks. I don't know the exact reason; perhaps the AWK script waits for EOF on stdin. Either way, on a more modern system I would use
stdbuf -oL the_program | ./awk_crush.sh > output.txt
to process the output line by line. I'm stuck on RHEL4 with expired support though, so I can use neither stdbuf nor unbuffer. I'll leave it as-is; it's fine too.
The contents of awk_crush.sh are based on this answer, except with the ^H sequences (which are supposed to be literal ASCII 08 characters entered via Vim commands) replaced with the escape sequence \b:
#!/usr/bin/awk -f
function crushify(data) {
while (data ~ /[^\b]\b/) {
gsub(/[^\b]\b/, "", data)
}
print data
}
{ crushify($0) }
Basically, it replaces character before \b and \b itself with empty string, and repeats it while there are \b in the string - just what I needed. It doesn't care for other escape sequences though, but if it's necessary, there's a more complete SED solution by Thomas Dickey.
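The same crush logic works as a one-liner for a quick sanity check; the sample string mirrors the one in the question:

```shell
# Repeatedly delete any character followed by a backspace (\b),
# leaving only the final value that survived the overwrites.
printf 'Lines: 1\b2\b3\b4\b5\n' |
    awk '{ while ($0 ~ /[^\b]\b/) gsub(/[^\b]\b/, "", $0); print }'
```

This prints Lines: 5, since each digit-backspace pair is eaten until only the last digit remains.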
Pipe it to col -b, from util-linux:
the_program | col -b
Or, if the input is a file, not a program:
col -b < input > output
Mentioned in Unix & Linux: Evaluate large file with ^H and ^M characters.
