what are the numbers shown by od -c -N 16 <filename.png> - linux

I'm using linux and when I typed
od -c -N 16 <filename.png>
I got:
0000000 211   P   N   G  \r  \n 032  \n  \0  \0  \0  \r   I   H   D   R
0000020
I thought this command tells me the type of the file, but I'm curious about what the numbers 0000000 and 211 mean. Can anybody please help?

od means "octal dump" (analogous to the hexdumper hd). It dumps bytes of a file in octal notation.
211 octal is 2 * 8^2 + 1 * 8^1 + 1 * 8^0 = 137, so you have a byte of value 137 there.
The 0000000 at the beginning of the line and the 0000020 at the beginning of the next are positions in the file, also in octal. If you remove -N 16 from the call, you'll see a column of monotonically ascending octal numbers on the left side of the dump; their purpose is to make it instantly visible which part of the dump you're currently reading.
The parameter
-N 16
means to read only the first 16 bytes of filename.png, and
-c
is a format option that tells od
to print printable characters as characters themselves rather than the octal code, and
to print unprintable characters that have a backslash escape sequence (such as \r or \n) as that escape sequence rather than an octal number.
It is the reason that not all bytes are dumped in octal.
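If you prefer hexadecimal byte values and decimal offsets, od can show the same 16 bytes that way too (a small illustration; the values shown are simply the well-known PNG signature followed by the start of the IHDR chunk, i.e. the same bytes as in the octal dump above):
$ od -A d -t x1 -N 16 filename.png
0000000 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52
0000016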
If you want to know the file type of a file, use the file utility:
file filename.png
Side note: You may be interested in the man command, which shows the manual page of (among other things) command line tools. In this particular case,
man od
could have been enlightening.

Inserting ',' into certain position of a text containing full-width characters

Inserting a "," in a particular position of a text
From the question above, I got errors because the text contained some full-width characters.
I deal with some Japanese text data on a RHEL server. The question above was a perfect solution for UTF-8 text, but the UNIX command won't work for Japanese text in SJIS format.
The difference between these two is that UTF-8 counts every character as 1 byte while SJIS counts alphabetic characters and numbers as 1 byte and other Japanese characters, such as あ, as 2 bytes. So the sed command only works for UTF-8 when inserting ',' at some positions.
My input would be like
aaaああ123あ
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes so my desired outcome is
aaa,ああ,123,あ
It is not necessarily sed command if it works on UNIX system.
Is there any way to insert ',' after some number of bytes while counting full-width characters as 2 bytes and others as 1 byte?
あ is 3 bytes in UTF-8
Depending on the locale, GNU sed supports Unicode. So reset the locale (LC_ALL=C) before running the sed command, and it will work on bytes.
And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes
Just use a backreference to remember the bytes.
LC_ALL=C sed 's/^\(...\)\(....\)\(...\)/\1,\2,\3,/'
or you could specify numbers:
LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/'
And cleaner with extended regex extension:
LC_ALL=C sed -E 's/^(.{3})(.{4})(.{3})/\1,\2,\3,/'
The following seems to work in my terminal:
$ <<<'aaaああ123あ' iconv -f UTF-8 -t SHIFT-JIS | LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' | iconv -f SHIFT-JIS -t UTF-8
aaa,ああ,123,あ
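For an actual SJIS-encoded file (rather than the UTF-8 terminal round trip above), the same idea applies directly. A minimal sketch, where input.sjis and output.sjis are hypothetical file names:
LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' input.sjis > output.sjis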

Replace string and remain the file format [duplicate]

The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one without repeating the same answers ad nauseam.
NOTE: This is NOT a duplicate of any existing question. The intent of this Q&A is not just to provide a "run this tool" answer but also to explain the issue such that we can just point anyone with a related question here and they will find a clear explanation of why they were pointed here as well as the tool to run to solve their problem. I spent hours reading all of the existing Q&A and they are all lacking in the explanation of the issue, alternative tools that can be used to solve it, and/or the pros/cons/caveats of the possible solutions. Also some of them have accepted answers that are just plain dangerous and should never be used.
Now back to the typical question that would result in a referral here:
I have a file containing 1 line:
what isgoingon
and when I print it using this awk script to reverse the order of the fields:
awk '{print $2, $1}' file
instead of seeing the output I expect:
isgoingon what
I get the field that should be at the end of the line appearing at the start of the line, overwriting some text at the start of the line:
whatngon
or I get the output split onto 2 lines:
isgoingon
what
What could the problem be and how do I fix it?
The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF and you are running a UNIX tool on it so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file while LF is \n and appears as $ with cat -vE.
So your input file wasn't really just:
what isgoingon
it was actually:
what isgoingon\r\n
as you can see with cat -v:
$ cat -vE file
what isgoingon^M$
and od -c:
$ od -c file
0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
0000020
so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:
<what> <isgoingon\r>
Note the \r at the end of the second field. \r means Carriage Return which is literally an instruction to return the cursor to the start of the line so when you do:
print $2, $1
awk will print isgoingon and then will return the cursor to the start of the line before printing what which is why the what appears to overwrite the start of isgoingon.
To fix the problem, do either of these:
dos2unix file
sed 's/\r$//' file
awk '{sub(/\r$/,"")}1' file
perl -pe 's/\r$//' file
Apparently dos2unix is aka frodos in some UNIX variants (e.g. Ubuntu).
Be careful if you decide to use tr -d '\r' as is often suggested as that will delete all \rs in your file, not just those at the end of each line.
Note that GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:
gawk -v RS='\r\n' '...' file
but other awks will not allow that as POSIX only requires awks to support a single character RS and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs though as the underlying C primitives will strip them on some platforms, e.g. cygwin.
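For instance, with the CRLF sample file from above, GNU awk then sees clean fields (a small illustration):
$ printf 'what isgoingon\r\n' | gawk -v RS='\r\n' '{print $2, $1}'
isgoingon what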
One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:
"field1","field2.1
field2.2","field3"
is really:
"field1","field2.1\nfield2.2","field3"\r\n
so if you just convert \r\ns to \ns then you can no longer tell linefeeds within fields from linefeeds as line endings so if you want to do that I recommend converting all of the intra-field linefeeds to something else first, e.g. this would convert all intra-field LFs to tabs and convert all line ending CRLFs to LFs:
gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file
Doing something similar without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they're read; see the sketch below.
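For reference, here is one minimal sketch of that approach in POSIX awk (an illustration only, assuming no field contains a tab and every real record ends in CRLF, just like the gawk one-liner above):
awk '{
    # keep reading physical lines until the record ends in CR, i.e. a real CRLF line ending
    while ($0 !~ /\r$/ && (getline nxt) > 0) $0 = $0 "\t" nxt
    sub(/\r$/, "")    # strip the line-ending CR
    print
}' file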
Also note that though CR is part of the [[:space:]] POSIX character class, it is not one of the whitespace characters included as separating fields when the default FS of " " is used, whose whitespace characters are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before CRLF:
$ printf 'x y \n'
x y
$ printf 'x y \n' | awk '{print $NF}'
y
$
$ printf 'x y \r\n'
x y
$ printf 'x y \r\n' | awk '{print $NF}'
$
That's because trailing field separator white space is ignored at the beginning/end of a line that has LF line endings, but \r is the final field on a line with CRLF line endings if the character before it was whitespace:
$ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
^M$
You can use the \R shorthand character class in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms. \R is a recommended character class from the Unicode consortium to represent all forms of a generic newline.
So if you have an 'extra' \r, you can find and remove it: the regex s/\R$/\n/ will normalize any combination of line endings into \n. Alternatively, you can use s/\R/\n/g to capture any notion of 'line ending' and standardize it into a \n character.
Given:
$ printf "what\risgoingon\r\n" > file
$ od -c file
0000000 w h a t \r i s g o i n g o n \r \n
0000020
Perl and Ruby and most flavors of PCRE implement \R combined with the end of string assertion $ (end of line in multi-line mode):
$ perl -pe 's/\R$/\n/' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
$ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
(Note the \r between the two words is correctly left alone)
If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.
With straight POSIX tools, your best bet is likely awk like so:
$ awk '{sub(/\r$/,"")} 1' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
Things that kinda work (but know your limitations):
tr deletes all \r even if used in another context (granted the use of \r is rare, and XML processing requires that \r be deleted, so tr is a great solution):
$ tr -d "\r" < file | od -c
0000000 w h a t i s g o i n g o n \n
0000016
GNU sed works, but not POSIX sed, since the \r and \x0D escapes are not required by POSIX.
GNU sed only:
$ sed 's/\x0D$//' file | od -c # also sed 's/\r$//'
0000000 w h a t \r i s g o i n g o n \n
0000017
The Unicode Regular Expression Guide is probably the best bet for a definitive treatment of what a "newline" is.
Run dos2unix. While you can manipulate the line endings with code you wrote yourself, there are utilities which exist in the Linux / Unix world which already do this for you.
On a Fedora system, dnf install dos2unix will put the dos2unix tool in place (should it not be installed).
There is a similar dos2unix deb package available for Debian based systems.
From a programming point of view, the conversion is simple. Search all the characters in a file for the sequence \r\n and replace it with \n.
This means there are dozens of ways to convert from DOS to Unix using nearly every tool imaginable. One simple way is to use the command tr where you simply replace \r with nothing!
tr -d '\r' < infile > outfile

In a bash script, what would $'\0' evaluate to and why?

In various bash scripts I have come across the following: $'\0'
An example with some context:
while read -r -d $'\0' line; do
echo "${line}"
done <<< "${some_variable}"
What does $'\0' return as its value? Or, stated slightly differently, what does $'\0' evaluate to and why?
It is possible that this has been answered elsewhere. I did search prior to posting but the limited number of characters or meaningful words in dollar-quote-slash-zero-quote makes it very hard to get results from stackoverflow search or google. So, if there are other duplicate questions, please allow some grace and link them from this question.
In bash, $'\0' is precisely the same as '': an empty string. There is absolutely no point in using the special Bash syntax in this case.
Bash strings are always NUL-terminated, so if you manage to insert a NUL into the middle of a string, it will terminate the string. In this case, the C-escape \0 is converted to a NUL character, which then acts as a string terminator.
The -d option of the read builtin (which defines a line-end character for the input) expects a single character in its argument. It does not check if that character is the NUL character, so it will be equally happy using the NUL terminator of '' or the explicit NUL in $'\0' (which is also a NUL terminator, so it is probably no different). The effect, in either case, will be to read NUL-terminated data, as produced (for example) by find's -print0 option.
In the specific case of read -d '' line <<< "$var", it is impossible for $var to have an internal NUL character (for the reasons described above), so line will be set to the entire value of $var with leading and trailing whitespace removed. (As @mklement notes, this will not be apparent in the suggested code snippet, because read will have a non-zero exit status even though the variable will have been set; read only returns success if the delimiter is actually found, and NUL cannot be part of a here-string.)
Note that there is a big difference between
read -d '' line
and
read -d'' line
The first one is correct. In the second one, the word passed to read is just -d, which means that the delimiter will be taken from the next argument (in this case, line). read -d$'\0' line has the same problem, since the $'\0' contributes nothing to the word; in either case, the space is necessary. (So, again, there is no need for the C-escape syntax.)
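A small demonstration of that difference (the variable name v and the sample input are arbitrary):
$ unset v; read -r -d '' v <<< 'hello'; echo "v=[$v]"
v=[hello]
$ unset v; read -r -d'' v <<< 'hello'; echo "v=[$v]"   # "v" is consumed as -d's argument, not used as a variable
v=[]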
To complement rici's helpful answer:
Note that this answer is about bash. ksh and zsh also support $'...' strings, but their behavior differs:
* zsh does create and preserve NUL (null bytes) with $'\0'.
* ksh, by contrast, has the same limitations as bash, and additionally interprets the first NUL in a command substitution's output as the string terminator (cuts off at the first NUL, whereas bash strips such NULs).
$'\0' is an ANSI C-quoted string that technically creates a NUL (0x0 byte), but effectively results in the empty (null) string (same as ''), because any NUL is interpreted as the (C-style) string terminator by Bash in the context of arguments and here-docs/here-strings.
As such, it is somewhat misleading to use $'\0' because it suggests that you can create a NUL this way, when you actually cannot:
You cannot create NULs as part of a command argument or here-doc / here-string, and you cannot store NULs in a variable:
echo $'a\0b' | cat -v # -> 'a' - string terminated after 'a'
cat -v <<<$'a\0b' # -> 'a' - ditto
In the context of command substitutions, by contrast, NULs are stripped:
echo "$(printf 'a\0b')" | cat -v # -> 'ab' - NUL is stripped
However, you can pass NUL bytes via files and pipes.
printf 'a\0b' | cat -v # -> 'a^#b' - NUL is preserved, via stdout and pipe
Note that it is printf that is generating the NUL via its single-quoted argument whose escape sequences printf then interprets and writes to stdout. By contrast, if you used printf $'a\0b', bash would again interpret the NUL as the string terminator up front and pass only 'a' to printf.
If we examine the sample code, whose intent is to read the entire input at once, across lines (I've therefore changed line to content):
while read -r -d $'\0' content; do # same as: `while read -r -d '' ...`
echo "${content}"
done <<< "${some_variable}"
This will never enter the while loop body, because stdin input is provided by a here-string, which, as explained, cannot contain NULs.
Note that read actually does look for NULs with -d $'\0', even though $'\0' is effectively ''. In other words: read by convention interprets the empty (null) string to mean NUL as -d's option-argument, because NUL itself cannot be specified for technical reasons.
In the absence of an actual NUL in the input, read's exit code indicates failure, so the loop is never entered.
However, even in the absence of the delimiter, the value is read, so to make this code work with a here-string or here-doc, it must be modified as follows:
while read -r -d $'\0' content || [[ -n $content ]]; do
echo "${content}"
done <<< "${some_variable}"
However, as @rici notes in a comment, with a single (multi-line) input string, there is no need to use while at all:
read -r -d $'\0' content <<< "${some_variable}"
This reads the entire content of $some_variable, while trimming leading and trailing whitespace (which is what read does with $IFS at its default value, $' \t\n').
@rici also points out that if such trimming weren't desired, a simple content=$some_variable would do.
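A tiny demonstration of that trimming (some_variable here is just sample data):
$ some_variable=$'  two\nlines  '
$ read -r -d $'\0' content <<< "${some_variable}"; printf '[%s]\n' "$content"
[two
lines]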
Contrast this with input that actually contains NULs, in which case while is needed to process each NUL-separated token (but without the || [[ -n $<var> ]] clause); for instance, find -print0 outputs filenames that are each terminated by a NUL:
while IFS= read -r -d $'\0' file; do
echo "${file}"
done < <(find . -print0)
Note the use of IFS= read ... to suppress trimming of leading and trailing whitespace, which is undesired in this case, because input filenames must be preserved as-is.
It is technically true that the expansion of $'\0' will always become the empty string '' (a.k.a. the null string) to the shell (except in zsh). Or, worded the other way around, $'\0' will never expand to an ASCII NUL (a byte with zero value) (again, except in zsh). It should be noted that it is confusing that the two names are quite similar: NUL and null.
However, there is an additional (quite confusing) twist when we talk about read -d ''.
What read sees is the value '' (the null string) as the delimiter.
What read does is split the input from stdin on the character $'\0' (yes, an actual 0x00).
Expanded answer.
The question in the title is:
In a bash script, what would $'\0' evaluate to and why?
That means that we need to explain what $'\0' is expanded to.
What $'\0' is expanded to is very easy: it expands to the null string '' (in most shells, not in zsh).
But the example of use is:
read -r -d $'\0'
That transforms the question to: what delimiter character does $'\0' expand to?
This holds a very confusing twist. To address it correctly, we need to take a full-circle tour of when and how a NUL (a byte with zero value, or 0x00) is used in shells.
Stream.
We need some NUL to work with. It is possible to generate NUL bytes from shell:
$ echo -e 'ab\0cd' | od -An -vtx1
61 62 00 63 64 0a ### That works in bash.
$ printf 'ab\0cd' | od -An -vtx1
61 62 00 63 64 ### That works in all shells tested.
Variable.
A variable in shell will not store a NUL.
$ printf -v a 'ab\0cd'; printf '%s' "$a" | od -An -vtx1
61 62
The example is meant to be executed in bash as only bash printf has the -v option.
But the example clearly shows that a string that contains a NUL will be cut at the NUL.
Simple variables will cut the string at the zero byte, as is reasonable to expect if the string is a C string, which must end in a NUL \0.
As soon as a NUL is found, the string must end.
Command substitution.
A NUL will work differently when used in a command substitution.
This code should assign a value to the variable $a and then print it:
$ a=$(printf 'ab\0cd'); printf '%s' "$a" | od -An -vtx1
And it does, but with different results in different shells:
### several shells just ignore (remove)
### a NUL in the value of the expanded command.
/bin/dash : 61 62 63 64
/bin/sh : 61 62 63 64
/bin/b43sh : 61 62 63 64
/bin/bash : 61 62 63 64
/bin/lksh : 61 62 63 64
/bin/mksh : 61 62 63 64
### ksh trims the value.
/bin/ksh : 61 62
/bin/ksh93 : 61 62
### zsh sets the var to actually contain the NUL value.
/bin/zsh : 61 62 00 63 64
/bin/zsh4 : 61 62 00 63 64
It is worth special mention that bash (version 4.4) warns about this fact:
/bin/b44sh : warning: command substitution: ignored null byte in input
61 62 63 64
In command substitution the zero byte is silently ignored by the shell.
It is very important to understand that that does not happen in zsh.
Now that we have all the pieces about NUL, we may look at what read does.
What read does on a NUL delimiter.
That brings us back to the command read -d $'\0':
while read -r -d $'\0' line; do
The $'\0' should have been expanded to a byte of value 0x00, but the shell cuts it and it actually becomes ''.
That means that both $'\0' and '' are received by read as the same value.
Having said that, it may seem reasonable to write the equivalent construct:
while read -r -d '' line; do
And it is technically correct.
What a delimiter of '' actually does.
There are two sides to this point. One is what word actually follows the -d option of read; the other, which is addressed here, is: what character will read actually use when given a delimiter of -d $'\0'?
The first side has been answered in detail above.
The second side is a very confusing twist, as the command read will actually read up to the next byte of value 0x00 (which is what $'\0' represents).
To actually show that that is the case:
#!/bin/bash
# create a test file with some zero bytes.
printf 'ab\0cd\0ef\ngh\n' > tfile
while true ; do
    read -r -d '' line; a=$?
    echo "exit $a"
    if [[ $a == 1 ]]; then
        printf 'last %s\n' "$line"
        break
    else
        printf 'normal %s\n' "$line"
    fi
done <tfile
when executed, the output will be:
$ ./script.sh
exit 0
normal ab
exit 0
normal cd
exit 1
last ef
gh
The first two exit 0 are successful reads done up to the next "zero byte", and both contain the correct values of ab and cd. The next read is the last one (as there are no more zero bytes) and contains the value $'ef\ngh' (yes, it also contains a newline).
All this goes to show (and prove) that read -d '' actually reads up to the next "zero byte", which is also known by the ascii name NUL and should have been the result of a $'\0' expansion.
In short: we can safely state that read -d '' reads up to the next 0x00 (NUL).
Conclusion:
We must state that read -d $'\0' will end up using a delimiter of 0x00.
Using $'\0' is a better way to transmit to the reader this correct meaning.
As a code style thing: I write $'\0' to make my intentions clear.
One, and only one, character used as a delimiter: the byte value of 0x00
(even if in bash it happens to be cut)
Note: Any of these commands will print the hex values of the stream.
$ printf 'ab\0cd' | od -An -vtx1
$ printf 'ab\0cd' | xxd -p
$ printf 'ab\0cd' | hexdump -v -e '/1 "%02X "'
61 62 00 63 64
$'\0' expands the contained escape sequence \0 to the character it represents, which is a NUL, effectively an empty string in the shell.
This is BASH syntax. As per man BASH:
Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Known backslash escape sequences are also decoded.
Similarly $'\n' expands to a newline and $'\r' will expand to a carriage return.
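A quick way to inspect what a given $'...' word expands to is to pipe it through od (a small illustration):
$ printf '%s' $'A\tB\n' | od -c
0000000   A  \t   B  \n
0000004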

Special Character: "^@" before EOF

I piped a program's output on the command line into a file and opened it in vim. At the very end of the file is the character "^@". What does this mean?
CTRL-@ (shown by Vim as ^@) is a NUL character, code point zero in the ASCII table.
You can enter it into Vim while in insert mode with CTRL-v CTRL-@, or by using a tool capable of producing a NUL output:
$ printf "\0" >tempfile
and then check it with any hex dump program:
$ od -xcb tempfile
0000000   0000
           \0
          000
0000001
So, obviously, your program is outputting NUL at the end for some reason.
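If you just want to strip such NUL bytes before working with the file, a simple filter like tr can do it (a minimal sketch; note that this removes every NUL in the stream, not only a trailing one):
$ tr -d '\000' < output-with-nul > output-clean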

Convert UTF8 to UTF16 using iconv

When I use iconv to convert from UTF-16 to UTF-8 everything is fine, but vice versa it does not work.
I have these files:
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The text looks OK in an editor. When I run this:
iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings
Then I get this result:
b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The file utility does not show the expected file format, and the text does not look good in an editor either. Could it be that iconv does not create a proper BOM? I run it on the Mac command line.
Why is the b-16 file not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?
More elaboration is below.
$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings: Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings: UTF-8 Unicode c program text, with very long lines
b-16be.strings: Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings: data
$ od -c a-16.strings | head
0000000 377 376 / \0 * \0 \0 \f 001 E \0 S \0 K \0
$ od -c a-8.strings | head
0000000 / * * * Č ** E S K Y ( J V O
$ od -c b-16be.strings | head
0000000 376 377 \0 / \0 * \0 * \0 * \0 001 \f \0 E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
It is clear the BOM is missing whenever I run conversion to UTF-16LE.
Any help on this?
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
Try running od -c on the files to see their actual contents.
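One quick way to see the BOM difference is to dump just the first bytes of each conversion (a small check; a-8.strings stands in for any UTF-8 input). The UTF-16 output should begin with a byte order mark (fe ff or ff fe, depending on the chosen byte order), while the UTF-16LE output starts with the first character's bytes directly:
$ iconv -f UTF-8 -t UTF-16 a-8.strings | od -A x -t x1 -N 4
$ iconv -f UTF-8 -t UTF-16LE a-8.strings | od -A x -t x1 -N 4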
UPDATE :
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE
The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
(Can anyone suggest a more elegant solution?)
Another workaround, if you know the endianness of the output produced by -t utf-16:
iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
I first convert to UTF-16, which will prepend a byte-order mark, if necessary as Keith Thompson mentions. Then since UTF-16 doesn't define endianness, we must use file to determine whether it's UTF-16BE or UTF-16LE. Finally, we can convert to UTF-16LE.
iconv -f utf-8 -t utf-16 UTF-8-FILE > UTF-16-UNKNOWN-ENDIANNESS-FILE
FILE_ENCODING="$( file --brief --mime-encoding UTF-16-UNKNOWN-ENDIANNESS-FILE )"
iconv -f "$FILE_ENCODING" -t UTF-16LE UTF-16-UNKNOWN-ENDIANNESS-FILE > UTF-16-FILE
This may not be an elegant solution but I found a manual way to ensure correct conversion for my problem which I believe is similar to the subject of this thread.
The Problem:
I got a text datafile from a user and I was going to process it on Linux (specifically, Ubuntu) using shell script (tokenization, splitting, etc.). Let's call the file myfile.txt. The first indication that I got that something was amiss was that the tokenization was not working. So I was not surprised when I ran the file command on myfile.txt and got the following
$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
If the file were compliant, here is what the output of file should have been:
$ file myfile.txt
myfile.txt: ASCII text, with very long lines
The Solution:
To make the datafile compliant, below are the 3 manual steps that I found to work after some trial and error with other steps.
1. First convert to Big Endian at the same encoding via vi (or vim): vi myfile.txt. In vi do :set fileencoding=UTF-16BE then write out the file. You may have to force it with :wq!.
2. Open the file again with vi myfile.txt (it should now be in UTF-16BE). In vi do :set fileencoding=ASCII then write out the file. Again, you may have to force the write with :wq!.
3. Run the dos2unix converter: d2u myfile.txt. If you now run file myfile.txt you should see output that is more familiar and reassuring, like:
myfile.txt: ASCII text, with very long lines
That's it. That's what worked for me, and I was then able to run my processing bash shell script on myfile.txt. I found that I cannot skip Step 2; that is, in this case I cannot skip directly to Step 3. Hopefully you find this info useful; hopefully someone can automate it, perhaps via sed or the like (see the sketch below). Cheers.
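As a possible automation of those manual steps (a sketch under assumptions: the input really is BOM-prefixed UTF-16 with CRLF line endings, and its text survives conversion to plain UTF-8/ASCII):
$ iconv -f UTF-16 -t UTF-8 myfile.txt | dos2unix > myfile-clean.txt
$ file myfile-clean.txt
If the converted content is pure ASCII, file should report it much like the output shown in Step 3.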
