ksh "." operator is doing string replacement instead of concatenation - linux

I was debugging a script in which I found the following weird behavior. The script simply sets some variables by sourcing another file; the values of these variables are then used to run the main script command.
The first file has the following line:
export PROJECT=ABCD1234
The script then sources this file through the following line:
. file_path
Later in the script, the script is using the $PROJECT variable in the following statement:
cd $PROJECT.proj #expecting to do string concatenation
The problem here is that $PROJECT.proj doesn't expand to "ABCD1234.proj"; it appears to do string replacement instead of string concatenation, so $PROJECT.proj comes out as .proj234!
I suspected that there might be some special hidden characters in the first file causing this behavior, so I rewrote the file using gvim instead of nedit, and it worked.
Does anybody have any idea how this happened?
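For reference, the symptom can be reproduced in any POSIX shell by embedding a carriage return in the variable (a sketch; the printf below stands in for sourcing the DOS-format file):

```shell
# Hypothetical repro: PROJECT ends in a stray carriage return,
# as it would after sourcing a file with DOS (CRLF) line endings.
PROJECT=$(printf 'ABCD1234\r')

# The concatenation itself is fine -- the string really is "ABCD1234<CR>.proj".
# The CR just moves the terminal cursor back to column 1, so ".proj"
# overprints "ABCD" and the line *displays* as ".proj234".
printf '%s\n' "$PROJECT.proj" | cat -v   # prints: ABCD1234^M.proj
```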

Any time you create files on Windows and then move or use them in a Unix/Linux-like environment, be sure to convert the files so they work properly on unix/linux.
Use the dos2unix utility for this, e.g.
dos2unix file [file1 file2 file3 .... myFile*]
passing as many files as will fit on the command line.
Output with disappearing characters, where something like
ABCD1234.proj
comes out partially mangled, like
proj234
is often the result of Windows line-ending characters conflicting with the Unix/Linux line-ending character. Windows uses ^M^J (\r\n), whereas unix/linux uses just ^J (\n).
In table form:
Ctrl   oct   hex   dec   abbrev
^J     012   0a    10    nl
^M     015   0d    13    cr
cr = Carriage Return, nl = New Line
Think of the old typewriters: it was a two-step process.
The lever both moved the platen back to the left margin AND advanced the paper so the next line could be typed. CR returns the carriage to the left margin, while new-line advances the printing to the next line.
Unix assumes an implied CR with every NL, so a stray literal CR confuses things and makes it easy for output to overwrite itself on the display.
IHTH
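If dos2unix isn't available, tr can do the same CR-stripping (a sketch; the file name is a placeholder):

```shell
# Strip all carriage returns, writing to a temp file first so the
# original isn't truncated while it's still being read.
tr -d '\r' < myfile > myfile.tmp && mv myfile.tmp myfile
```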

Why do Linux tools display the CR character as `^M`?

I'm new to Linux, so sorry if my question sounds dumb.
We know that Linux and Mac OS X use \n (0xa), which is the ASCII line feed (LF) character. MS Windows and Internet protocols such as HTTP use the sequence \r\n (0xd 0xa). If you create a file foo.txt in Windows and then view it in a Linux text editor, you’ll see an annoying ^M at the end of each line, which is how Linux tools display the CR character.
But why do Linux tools display the CR character as ^M? As I understand it, \r (carriage return) moves the cursor to the beginning of the current line, so the sensible way to display it would be this: when you open the file, the cursor jumps to the beginning of each line that contains \r, and ^M is never shown at all.
PS: some people post answers about how to remove ^M, but I want to know why ^M is displayed in the first place, rather than the cursor moving to the beginning of the line, which is the definition of carriage return.
The ASCII control characters like TAB, CR, NL and others are intended to control the printing position of a teletypewriter-like display device.
A text editor isn't such a device. It is not appropriate for a text editor to treat a CR character literally as meaning "go to the first column"; it would make a confusing gibberish out of the editing experience.
A text editor works by parsing a text file's representation, to create an internal representation which is presented to the user. On Unix-like operating systems, a file is represented by zero or more lines, which are terminated by the ASCII NL character. Any CR characters that occur just look like part of the data, and not part of the line separation.
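You can see those CR bytes sitting in the data with od (a sketch; dos.txt is a made-up file name):

```shell
# Create a one-line file with a DOS (CRLF) ending and dump its bytes.
printf 'hello\r\n' > dos.txt
od -c dos.txt    # shows:  h   e   l   l   o  \r  \n
```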
Not all editors behave the same way. For instance, the Vim editor will detect that a file uses CR-LF line endings, and load it properly using that representation. A flag is set for that buffer which indicates that it's a "DOS" file, so that when you save it, the same representation is reproduced.
That said, there is a feature in the actual Linux kernel for representing control characters like CR using the ^M notation. The TTY line discipline for any given TTY device can be configured to print characters in this notation, but only when echoing back the characters received.
Demo:
$ stty echoctl # turn on notational echo of control characters
$ cat # run some non-interactive program with rudimentary line input
^F^F^F^F^F^F
^C
$
Above, the Ctrl-F that I entered was echoed back as ^F. So, in fact there is a "Linux editor" which uses this notation: the rudimentary line editor of the "canonical input mode" line discipline.

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.034,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.034,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single specific character. So utilities like cat do in fact output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself if you open the file in a hex editor. There you will see the byte 0A (decimal 10), which is the line feed character. You will not see the pair of characters \ and n anywhere in that file.
Many programming languages and also shell environments use escape sequences like \n in string definitions and identify those as control characters which would not be typable otherwise. So maybe that is where your impression comes from that your files should contain those two characters.
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
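sed can also show the invisible characters, as a sketch of another option: its "l" (lowercase L) command prints each line unambiguously, with control characters as escapes and a $ marking each end of line.

```shell
# sed -n suppresses normal output; the "l" command prints each line
# with escapes for control characters and a trailing $ at end of line.
printf '0.032,170\n0.034,290\n' | sed -n l
# -> 0.032,170$
#    0.034,290$
```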

Why does a part of this variable get replaced when combining it with a string?

I have the following Bash script which loops through the lines of a file:
INFO_FILE=playlist-info-test.txt
line_count=$(wc -l $INFO_FILE | awk '{print $1}')
for ((i=1; i<=$line_count; i++))
do
current_line=$(sed "${i}q;d" $INFO_FILE)
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo $input_file
done
This is a sample of the playlist-info-test.txt file:
Playlist 1
Playlist2
Playlist 3
The output of the script should be as follows:
Playlist 1.mp3
Playlist2.mp3
Playlist 3.mp3
However, I am getting the following output:
.mp3list 1
.mp3list2
.mp3list 3
I have spent a few hours on this and can't understand why the ".mp3" part is being moved to the front of the string. I initially thought it was because of the space in the lines of the input file, but removing the space doesn't make a difference. I also tried using a while loop with read line and the input file redirected into it, but that does not make any difference either.
I copied the playlist-info-test.txt contents and the script, and got the output you expected. Most likely there are non-printable characters in your playlist-info-test.txt or script which are messing up the processing. Check the binary contents of both files using, for example, xxd -g 1, and look for non-printing characters other than newline (0a).
Did the file come from Windows? DOS and Windows end their lines with carriage return (hex 0d, sometimes represented as \r) followed by linefeed (hex 0a, sometimes represented as \n). Unix just uses linefeed, and so tends to treat the carriage return as part of the content of the line. In your case, it winds up at the end of the current_line variable, so input_file winds up something like "Playlist 1\r.mp3". When you print this to the terminal, the carriage return makes it go back to the beginning of the line (that's what carriage return means), so it prints as:
Playlist 1
.mp3
...with the ".mp3" printed over the "Play" part, rather than on the next line like I have it above.
Solution: either fix the file (the fairly standard dos2unix program does precisely this), or change your script to strip carriage returns as it reads the file. Actually, I'd recommend a rewrite anyway, since the current use of sed to pick out lines is rather weird and inefficient. In a shell script, the standard way to read through a file line by line is a loop like while read -r current_line; do [commands here]; done <"$INFO_FILE". One possible problem: if any commands inside the loop read from standard input, they'll wind up inhaling part of that file; you can fix that by passing the file over file descriptor 3 rather than standard input. With that fix, and a trick to trim carriage returns, here's what it looks like:
INFO_FILE=playlist-info-test.txt
while IFS=$' \t\n\r' read -r current_line <&3; do
CURRENT_PLAYLIST_ORIG="$current_line"
input_file="$CURRENT_PLAYLIST_ORIG.mp3"
echo "$input_file"
done 3<"$INFO_FILE"
(The carriage return trim is done by read -- it always auto-trims leading and trailing whitespace, and setting IFS to $' \t\n\r' tells it to treat spaces, tabs, linefeeds, and carriage returns as whitespace. And since that assignment is a prefix to the read command, it applies only to that one command and you don't have to set IFS back to normal afterward.)
A couple of other recommendations while I'm here: double-quote all variable references (as I did with echo "$input_file" above), and avoid all-caps variable names (there are a bunch with special meanings, and if you accidentally use one of those it can have weird effects). Oh, and try passing your scripts to shellcheck.net -- it's good at spotting common mistakes.
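As an alternative sketch to the fd-3 loop above, you can strip the CRs up front with tr and keep a plain read loop (file name as in the question; note the loop runs in a pipeline subshell, which is fine here since it only echoes):

```shell
INFO_FILE=playlist-info-test.txt

# Delete carriage returns before the read loop ever sees them.
tr -d '\r' < "$INFO_FILE" | while IFS= read -r current_line; do
    echo "$current_line.mp3"
done
```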

Why is vim stripping the carriage return when I copy a line to another file?

I sorted a file a.csv into b.csv.
I noticed that the sizes of the files differed, and after noticing that b.csv was exactly n bytes smaller (where n is the number of lines in a.csv), I immediately suspected that a.csv contained those pesky \r.
The .py script for sorting contained the line line.strip() which removed the carriage returns and then afile.write(line2 + '\n') which wrote newlines but not carriage returns.
Ok. Makes sense.
The strange bit is that when I vim'd a.csv, I didn't see the ^M like I usually do (maybe the reason lies in a configuration file), so I only found out about the \r from opening the file in a hex editor.
The more interesting bit is that I would take a small subset of a.csv (3yy) and paste it into a test file (p).
Sorting the test file resulted in a file of the exact same size as the original.
Running xxd on it, I see that there is no \r in the new test file.
When I yank a line that contains a carriage return and paste it into another file, the pasted line does not contain the carriage return. Why?
I tested this on Windows (Cygwin), and it does appear to copy the \r. But on the Linux machine I'm using, it doesn't.
How come?
Edit:
I tried reproducing the issue on another linux machine, but I couldn't. It appears to be a configuration thing - some file somewhere telling vim to do that.
Vim's model of a loaded file is a sequence of lines, each consisting of a sequence of characters. In this model, newlines aren't themselves characters. So when you're copying lines of text, you're not copying the CRs or LFs. Vim also stores a number of other pieces of information which are used to write the file back out again, principally:
fileformat can be unix, dos or mac. This determines what end-of-line character will be written at the end of each line.
endofline can be on or off. This determines if the last line of the file has an end-of-line character.
bomb can be on or off. This determines if a byte order mark is written at the start of the first line.
fileencoding specifies what character encoding will be used to store the file, such as utf-8.
Normally these are all auto-detected when the file is loaded. In particular, fileformat is auto-detected according to the 'fileformats' option, which may be configured differently on different platforms. However, sometimes things can go wrong. The most common problem is a file with mixed line endings, and that's when you'll start seeing ^M floating around. In this case, Vim has loaded the file as if it were in unix format: it treated the LFs as the line separators and the CRs as just normal characters. You can see which mode Vim has opened the file in by entering :set fileformat? (or :set ff? for short).
Vim detects the newline style (Windows CR-LF vs. Unix LF) when opening the file (according to the 'fileformats' option), and uses the detected 'fileformat' value for all subsequent saves. So the newline style is a property of the Vim buffer / opened file. When you yank lines from one buffer and paste them into another, the newline style isn't kept; instead, the newline style of the target buffer is used, as this makes much more sense.

Bash - process backspace control character when redirecting output to file

I have to run a third-party program in background and capture its output to file. I'm doing this simply using the_program > output.txt. However, the coders of said program decided to be flashy and show processed lines in real-time, using \b characters to erase the previous value. So, one of the lines in output.txt ends up like Lines: 1(b)2(b)3(b)4(b)5, (b) being an unprintable character with ASCII code 08. I want that line to end up as Lines: 5.
I'm aware that I can write it as-is and post-process the file using AWK, but I wonder if it's possible to somehow process the control characters in-place, by using some kind of shell option or by piping some commands together, so that line would become Lines: 5 without having to run any additional commands after the program is done?
Edit:
Just a clarification: what I wrote here is a simplified version; the actual line count processed by the program is in the hundreds of thousands, so that string ends up quite long.
Thanks for your comments! I ended up piping the output of that program to the AWK script I linked in the question. I get a well-formed file in the end.
the_program | ./awk_crush.sh > output.txt
The only downside is that I get the output only once the program itself has finished, even though the output exceeds 5M and should be passed along in smaller chunks. I don't know the exact reason; perhaps the AWK script waits for EOF on stdin. Either way, on a more modern system I would use
stdbuf -oL the_program | ./awk_crush.sh > output.txt
to process the output line by line. I'm stuck on RHEL4 with expired support though, so I can't use either stdbuf or unbuffer. I'll leave it as-is; it's fine too.
The contents of awk_crush.sh are based on this answer, except with the ^H sequences (literal ASCII 08 characters entered via Vim commands in the original) replaced with the escape sequence \b:
#!/usr/bin/awk -f
function crushify(data) {
    # Repeatedly delete "character followed by backspace" pairs until none remain.
    while (data ~ /[^\b]\b/) {
        gsub(/[^\b]\b/, "", data)
    }
    print data
}
{ crushify($0) }
Basically, it replaces the character before each \b, and the \b itself, with the empty string, and repeats while there are \b characters left in the string - just what I needed. It doesn't handle other control sequences, though; if that's necessary, there's a more complete sed solution by Thomas Dickey.
Pipe it to col -b, from util-linux:
the_program | col -b
Or, if the input is a file, not a program:
col -b < input > output
Mentioned in Unix & Linux: Evaluate large file with ^H and ^M characters.
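A quick check of col -b on the example line from the question (a sketch):

```shell
# col -b keeps only the last character written to each column position,
# so the backspace-overprinted counter collapses to its final value.
printf 'Lines: 1\b2\b3\b4\b5\n' | col -b   # prints: Lines: 5
```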
