BASH - Convert textfile containing binary numbers into a binary file - linux

I have a long text file that looks something like this:
00000000
00001110
00010001
00010000
00001110
00000001
00010001
00001110
...and so on...
I'd like to take this data, which is represented in ASCII, and write it to a binary file. That is, I do NOT want to convert ASCII to binary, but rather take the actual 1s and 0s and put them in a binary file.
The purpose of this is so that my EPROM programmer can read the file.
I've heard that od and hexdump are useful in this case but I never really understood how they work.
If it's to any help I also have the data in hex form:
00 0E 11 10 0E 01 11 0E
How do I do this using a shell script?

Something like perl -ne 'chomp; print pack "B8", $_' input should get you most of the way there. (The chomp matters: without it the trailing newline gets packed as a ninth bit and shifts every following byte.)


Is a text file with 16 leading binary bytes a "normal" file format?

I've encountered files with leading byte values as follows when viewed in a hex editor:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
I've seen this in 2 cases:
*.csproj.CopyComplete files. My Windows .NET Xunit projects contain these files. They consist of only 16 bytes with the sequential byte signature as shown above.
Macintosh text files (as generated from Excel file save in "Text (Macintosh) (*.txt)"). In this case the first 16 bytes follow the signature above, followed by the expected document text.
My understanding is that text files may have a leading binary byte signature if their encoding is not UTF-8.
Can anyone provide more information about this byte signature?
Right you are! My bad. The *.CopyComplete files are 0-length, so the hex editor display misled me.

nroff/groff does not properly convert utf-8 encoded file

I have a UTF-8 encoded roff file that I want to convert to a manpage with
$ nroff -mandoc inittab.5
However, characters such as [äöüÄÖÜ] are not displayed properly, as it seems that nroff assumes ISO 8859-1 encoding (I am getting [äöüÃÃÃ] instead). Calling nroff with the -Tutf8 flag does not change the behaviour, and the locale environment variables are (I assume properly) set to
LANG=de_DE.utf8
LC_CTYPE="de_DE.utf8"
LC_NUMERIC="de_DE.utf8"
LC_TIME="de_DE.utf8"
LC_COLLATE="de_DE.utf8"
LC_MONETARY="de_DE.utf8"
LC_MESSAGES="de_DE.utf8"
LC_PAPER="de_DE.utf8"
LC_NAME="de_DE.utf8"
LC_ADDRESS="de_DE.utf8"
LC_TELEPHONE="de_DE.utf8"
LC_MEASUREMENT="de_DE.utf8"
LC_IDENTIFICATION="de_DE.utf8"
LC_ALL=
Since nroff is only a wrapper script that eventually calls groff, I checked the call to the latter, which is:
$ groff -Tutf8 -mandoc inittab.5
Comparing the byte-encodings of characters in the src file and the output file I am getting the following conversions:
character src file output file
--------- -------- -----------
ä C3 A4 C3 83 C2 A4
ö C3 B6 C3 83 C2 B6
ü C3 BC C3 83 C2 BC
Ä C3 84 C3 83
Ö C3 96 C3 83
Ü C3 9C C3 83
ß C3 9F C3 83
This behaviour seems very weird to me (why am I getting an additional C3 83, and why is the original byte sequence truncated altogether for the capital umlauts and ß?)
Why is this and how can I make nroff/groff properly convert my utf-8 encoded file?
EDIT: I am using GNU nroff (groff) version 1.22.2
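The byte table above is the classic signature of double encoding: each UTF-8 byte of the source is read as a single Latin-1 character and re-encoded as UTF-8. (The capital umlauts and ß apparently lose their second byte because 0x84, 0x96, 0x9C and 0x9F fall in the C1 control range, which groff seems to discard.) The effect can be reproduced with iconv:

```shell
# ä is C3 A4 in UTF-8; reinterpreting those two bytes as Latin-1
# gives the two characters Ã and ¤, which re-encode to C3 83 C2 A4 --
# the "extra C3 83" from the table above.
printf '\303\244' | iconv -f latin1 -t utf-8 | od -An -tx1
# -> c3 83 c2 a4
```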
Unlike other troff implementations (namely Plan 9 and Heirloom troff), groff does not support UTF-8 input in documents. However, UTF-8 output can be achieved using the preconv(1) preprocessor, which converts UTF-8 characters in a file to groff's native escape sequences.
Take for example this groff_ms(7) document:
.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the café down the street
äöüÄÖÜ
Using groff normally, we get:
StackOverflow Test Document
ToasterKing
I like going to the café down the street
äöüÃÃÃ
But when using preconv | groff or groff -k, we get:
StackOverflow Test Document
ToasterKing
I like going to the café down the street
äöüÄÖÜ
Looking at the output of preconv, you can see how it transforms characters into escape sequences:
.lf 1 so.ms
.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the caf\[u00E9] down the street
\[u00E4]\[u00F6]\[u00FC]\[u00C4]\[u00D6]\[u00DC]

Put space every two characters in text string

Given the following string:
6e000000b0040000044250534bb4f6fd02d6dc5bc0790c2fde3166a14146009c8684a4624
Which is a representation of a Byte array, every two characters represent a Byte.
I would like to put a space between each Byte using Sublime Text, something like:
6e 00 00 00 b0 04 00 00 04 42 50
Does Sublime Text help me on that issue?
As a bonus I would like to split into lines and add 0x before each Byte.
I found a similar question, but it's not related to Sublime Text: Split character string multiple times every two characters.
Go to Find->Replace... and enable regular expressions.
Find: (.{2})
Replace: $1 followed by a single space.
To split it onto separate lines and add 0x before each byte do this:
Find (.{2})
Replace with: 0x\1\n
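The same two substitutions also work outside Sublime; a sketch with sed (GNU sed is assumed, since BSD sed does not expand \n in the replacement):

```shell
hex='6e000000b0040000'
# a space after every two characters (leaves one trailing space)
printf '%s' "$hex" | sed -E 's/(..)/\1 /g'
# one byte per line, 0x-prefixed
printf '%s\n' "$hex" | sed -E 's/(..)/0x\1\n/g'
```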

Remove BOM from string with Perl

I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").
The file looks like this in a hex viewer:
EF BB BF 43 6F 6E 66 65 72 65 6E 63 65
This translates to "Conference" when printed. I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it messes up a string comparison that I undertake later).
So I tried to remove it using the following code, but I fail miserably:
$line =~ s/^\xEF\xBB\xBF//;
Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?
Thanks!
EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:
s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//; # Convenient alias
See also: File::Bom.
I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it
You're getting the "wide character" warning because you forgot to add an :encoding layer to your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT and STDERR, and makes it the default for open():
use open ':std', ':encoding(UTF-8)';
To defuse the BOM, you have to know that once decoded it is not 3 characters but 1 (U+FEFF):
s/^\x{FEFF}//;
If you open the file using File::BOM, it will remove the BOM for you.
use File::BOM;
open_bom(my $fh, $path, ':utf8');
Ideally, your filehandle should be doing this for you automatically. But if you're not in an ideal situation, this worked for me:
use Encode;
my $value = decode('UTF-8', $originalvalue);
$value =~ s/\N{U+FEFF}//;

What is std::wifstream::getline doing to my wchar_t array? It's treated like a byte array after getline returns

I want to read lines of Unicode text (UTF-16 LE, line feed delimited) from a file. I'm using Visual Studio 2012 and targeting a 32-bit console application.
I was not able to find a ReadLine function within WinAPI so I turned to Google. It is clear I am not the first to seek such a function. The most commonly recommended solution involves using std::wifstream.
I wrote code similar to the following:
wchar_t buffer[1024];
std::wifstream input(L"input.txt");
while (input.good())
{
    input.getline(buffer, 1024);
    // ... do stuff ...
}
input.close();
For the sake of explanation, assume that input.txt contains two UTF-16 LE lines which are less than 200 wchar_t chars in length.
Prior to calling getline the first time, Visual Studio correctly identifies that buffer is an array of wchar_t. You can mouse over the variable in the debugger and see that the array is comprised of 16-bit values. However, after the call to getline returns, the debugger now displays buffer as if it were a byte array.
After the first call to getline, the contents of buffer are correct (aside from buffer being treated like a byte array). If the first line of input.txt contains the UTF-16 string L"123", this is correctly stored in buffer as (hex) "31 00 32 00 33 00"
My first thought was to reinterpret_cast<wchar_t *>(buffer) which does produce the desired result (buffer is now treated like a wchar_t array) and it contains the values I expect.
However, after the second call to getline, (the second line of input.txt contains the string L"456") buffer contains (hex) "00 34 00 35 00 36 00". Note that this is incorrect (it should be [hex] 34 00 35 00 36 00)
The fact that the byte ordering gets messed up prevents me from using reinterpret_cast as a workaround. More importantly, why is std::wifstream::getline converting my wchar_t buffer into a char buffer at all? I was under the impression that if one wants to use chars one uses ifstream, and if one wants wchar_t one uses wifstream...
I am terrible at making sense of the STL headers, but it almost looks as if wifstream is intentionally converting my wchar_t to char... why?
I would appreciate any insights and explanations for understanding these problems.
wifstream reads bytes from the file and converts them to wide characters using the codecvt facet installed in the stream's locale. The default facet assumes the system default code page and calls mbstowcs on those bytes.
To treat your file as UTF-16, you need to use codecvt_utf16. Like this:
std::wifstream fin("text.txt", std::ios::binary);
// apply facet
fin.imbue(std::locale(fin.getloc(),
new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
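Wrapped up as a small helper (the function name and use of std::wstring instead of the question's fixed 1024-char buffer are my own; note std::codecvt_utf16 is deprecated since C++17 but still ships with the major standard libraries):

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Read every line of a UTF-16 LE file into wide strings.
// The facet converts the 2-byte code units to wchar_t; std::ios::binary
// stops newline translation from corrupting the units on Windows.
std::vector<std::wstring> read_utf16le_lines(const char* path) {
    std::wifstream fin(path, std::ios::binary);
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
    std::vector<std::wstring> lines;
    for (std::wstring line; std::getline(fin, line); )
        lines.push_back(line);
    return lines;
}
```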
