I'm used to using fwrite to write the contents of a buffer to a file in gcc. However, for some reason, fwrite doesn't seem to work correctly in Visual C++ 2005.
What I'm doing now is to convert a text file into binary. The program worked fine for the first 61 lines, but at the 62nd line, it inserted a 0x0d into the output binary file. Basically, it turned the original
12 0a 00
to
12 0d 0a 00
I checked the buffer, the content was correct, i.e.
buffer[18] = 0x12, buffer[19] = 0x0a, buffer[20] = 0x00
And I tried to write this buffer to file by
fwrite(buffer, 1, length, fout)
where length is the correct value of the size of the buffer content.
This happened to me once, and I had to change my code from fwrite to WriteFile for it to work correctly. Is there a reason a 0x0d is inserted into my output? Can I fix this, or do I have to try using WriteFile instead?
The problem is that the file has been opened in text mode, so the runtime converts each newline character (0x0a) it writes into a carriage return + line feed sequence (0x0d 0x0a).
When you open your file, specify binary mode by using the b qualifier on the file mode:
FILE *pFile = fopen(szFilename, "wb");
In VC++, if t or b is not given in the file mode, the default translation mode is defined by the global variable _fmode. This may explain the difference between compilers.
You'll need to do the same when reading a binary file too. In text mode, carriage return–linefeed combinations are translated into single linefeeds on input, and linefeed characters are translated to carriage return–linefeed combinations on output.
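To illustrate with a minimal sketch (hypothetical file name and data, not from the question), writing the same bytes with the "b" flag keeps the 0x0a byte untouched:

#include <stdio.h>

int main(void)
{
    /* Hypothetical buffer containing a 0x0a byte. */
    unsigned char buffer[] = { 0x12, 0x0a, 0x00 };

    /* "wb" opens in binary mode, so the CRT performs no newline translation;
       with "w" (text mode) the 0x0a would come out as 0x0d 0x0a on Windows. */
    FILE *fout = fopen("out.bin", "wb");
    if (fout == NULL)
        return 1;

    fwrite(buffer, 1, sizeof buffer, fout);
    fclose(fout);
    return 0;
}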
Related
I would like to use vim with binary files. I run vim with -b and I have set isprint= and display+=uhex. I am using the following statusline:
%<%f\ %h%m%r%=%o\ (0x%06O)\ \ %3.b\ <%02B>\ %7P
so I get output containing some useful information like byte offset in the file and the current character in hex etc. But I'm having trouble with random pieced of data interpreted as multibyte characters which prevent me from accessing the inner bytes, combine with surroundings (including vim's decoration) or display as �.
Of course I have tried opening the files with ++enc=latin1. However, my system's encoding is UTF-8, so what vim supposedly does is convert the file from Latin-1 to UTF-8 internally and display that. This has two problems:
The sequence <c3><ac> displays as Ã¬ rather than ì, but the characters count as two bytes each, so it breaks my %o and reports offsets wrong. This is 2 bytes in the file but apparently 4 bytes in vim's buffer.
I don't know why my isprint is ignored. Neither of these characters is between 32 and 126, so they should display in hex.
I found the following workaround: I set encoding to latin1, but termencoding to utf-8. This achieves what I want, but breaks other things, like when vim needs to display status messages ("new file", "changed", etc.) in my language, because it wants to use that encoding for them too and they don't fit. I guess I could run vim with LC_ALL=C, but it feels like I'm resorting to too many dirty tricks already. Is there a better way, i.e., without having to mess around with encoding?
I'm reading a large (10 GB) bzipped file in Python 3, which is UTF-8-encoded JSON. I only want a few of the lines though, those that start with a certain set of bytes, so to save having to decode all the lines into Unicode, I'm reading the file in 'rb' mode, like this:
with bz2.open(filename, 'rb') as file:
    for line in file:
        if line.startswith(b'Hello'):
            # decode line here, then do stuff
But I suddenly thought, what if one of the Unicode characters contains the same byte as a newline character? By doing for line in file, will I risk getting truncated lines? Or does the linewise iterator over a binary file still work by magic?
Line-wise iteration will work for UTF-8 encoded data.
Not by magic, but by design:
UTF-8 was created to be backwards-compatible to ASCII.
ASCII only uses the byte values 0 through 127, leaving the upper half of possible values for extensions of any kind.
UTF-8 takes advantage of this, in that any Unicode codepoint outside ASCII is encoded using bytes in the range 128..255.
For example, the letter "Ċ" (capital letter C with dot above) has the Unicode codepoint value U+010A.
In UTF-8, this is encoded with the byte sequence C4 8A, thus without using the byte 0A, which is the ASCII newline.
In contrast, UTF-16 encodes the same character as 0A 01 or 01 0A (depending on the endianness).
So I guess UTF-16 is not safe to do line-wise iteration over.
It's not that common as file encoding though.
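If it helps to see the byte values directly, here is a small C++ sketch (illustrative only, not part of the original answer; the question's code is Python, but the byte values are language-independent) that prints the UTF-8 bytes of "Ċ" followed by an ASCII newline:

#include <cstdio>
#include <string>

int main()
{
    // "\xC4\x8A" is the UTF-8 encoding of U+010A (Ċ); both bytes are >= 0x80,
    // so the ASCII newline byte 0x0A can never appear inside the sequence.
    std::string s = "\xC4\x8A\n";
    for (unsigned char c : s)
        std::printf("%02X ", c);   // prints: C4 8A 0A
    std::printf("\n");
    return 0;
}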
I've asked this before, but the results were not fruitful. I don't know whether I should've bumped it, so I just made a new one.
My code for opening the text file and converting it to a stringstream:
OpenFileDialog^ failas = gcnew OpenFileDialog();
failas->Filter = "Text Files|*.txt";
if( failas->ShowDialog() != System::Windows::Forms::DialogResult::OK )
{
return;
}
MessageBox::Show( failas->FileName );
String^ str = failas->FileName;
StreamReader ^strm = gcnew StreamReader(str);
String ^ST1=strm->ReadToEnd();
strm->Close();
string st1 = marshal_as<string,String ^>(ST1);
stringstream SS(st1);
If I were to output SS or st1, instead of getting something like:
a
a
a
I get:
a

a

a
And now the thing is, if I open the file in Notepad, it looks as intended (no blank lines between lines), but if I load it anywhere else, it still has the blank lines.
I understand this has something to do with the way Windows saves text files, but I have no idea how to remove the additional \n when I use ReadToEnd.
Any ideas?
You're reading the input file using .Net methods, converting it to an unmanaged C++ stringstream, and then presumably writing the output file using unmanaged C++ methods.
In C++, many methods will automatically handle Windows vs. Unix line endings: fprintf(outfile, "Some text\n"); will actually write the bytes "Some text\r\n" to disk, if the file was opened in text mode.
You didn't say how you're writing the output file, but I think what's happening is that you're using fopen or similar in text mode. When you read from the input file, it contained CR-LF (\r\n) character sequences, and those character sequences were copied to the String^ ST1. They were still there when you copied the characters to the stringstream.
When you wrote the "\r\n" using fwrite or similar, it converted the \n to \r\n, resulting in \r\r\n sequences. This is not a standard line-ending on any platform, so that's why different editors are displaying it differently. You can confirm this by looking at the output file in a binary editor (rename to *.bin and open in Visual Studio): I expect you'll see bytes 0d 0d 0a at the end of each line.
To fix this, there are a couple of things you can do:
You could read the file using unmanaged methods. Since you apparently want to manipulate and write using unmanaged, you can stay in unmanaged-land for the whole operation. Let the unmanaged APIs convert the \r\n on disk to \n in memory, and convert back to \r\n when written back to disk.
You could remove the \r characters from the string after reading in .Net, before writing in unmanaged C++. This would be a simple call to String::Replace (see the sketch after this list).
You could open the file for writing in unmanaged C++ as a binary file, rather than text. This will turn off the line-ending conversion, and output exactly the characters you have in your string. Just be sure to use \r\n line endings if you manipulate the data before writing it.
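A sketch of the second option, assuming ST1 still holds the text from ReadToEnd as in the question's code (and that <msclr/marshal_cppstd.h> is included for marshal_as):

// Strip the carriage returns after reading; the text-mode write will then
// put exactly one \r back in front of each \n.
String^ cleaned = ST1->Replace("\r", "");
string st1 = marshal_as<std::string>(cleaned);
stringstream SS(st1);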
Unfortunately, I can't post any pictures due to my lack of reputation, but it looks like "^@".
For some context, I have a script that goes through a list of names to generate a configuration file. I run an executable with that configuration, and if it doesn't run, the script proceeds to the next name and erases the content of the previous configuration. However, if the executable does run, the script moves on to the next name and appends onto the existing configuration. The problem is that when the first iteration is erased, it leaves behind a symbol that conflicts with all subsequent iterations. Any idea what this symbol means? Much appreciated.
It doesn't just look like "^@", it is "^@". The ^ denotes a control character; for example ^X is Control-X. The null character can be entered on most keyboards by typing Control-@.
Look at a table of ASCII codes. The Control key, in many cases, modifies a character by subtracting 64 from its ASCII value; thus Control-G is character (71 - 64) or 7, the ASCII BEL character.
As special cases, the ASCII DEL character, 127, is represented as "^?", and the NUL character can be entered (on most keyboards) by typing Control-Space. (Vim doesn't use "^ " to represent the NUL character because it would be difficult to read.)
It's how vim displays ASCII nul, i.e. a zero byte.
A simple way to find out the numeric value of a character is to pipe the file through a hex tool such as xxd, and you will see the ^@ character has the value 00.
You can create an empty file, then (in insert mode) type Ctrl-V Ctrl-@ to enter the ^@ character, then filter it with :%!xxd and you will get:
0000000: 000a ..
This shows there are two characters, with values 00 and 0a (which is a newline).
I want to read lines of Unicode text (UTF-16 LE, line feed delimited) from a file. I'm using Visual Studio 2012 and targeting a 32-bit console application.
I was not able to find a ReadLine function within WinAPI so I turned to Google. It is clear I am not the first to seek such a function. The most commonly recommended solution involves using std::wifstream.
I wrote code similar to the following:
wchar_t buffer[1024];
std::wifstream input(L"input.txt");
while (input.good())
{
    input.getline(buffer, 1024);
    // ... do stuff...
}
input.close();
For the sake of explanation, assume that input.txt contains two UTF-16 LE lines which are less than 200 wchar_t chars in length.
Prior to calling getline the first time, Visual Studio correctly identifies that buffer is an array of wchar_t. You can mouse over the variable in the debugger and see that the array is comprised of 16-bit values. However, after the call to getline returns, the debugger now displays buffer as if it is a byte array.
After the first call to getline, the contents of buffer are correct (aside from buffer being treated like a byte array). If the first line of input.txt contains the UTF-16 string L"123", this is correctly stored in buffer as (hex) "31 00 32 00 33 00"
My first thought was to reinterpret_cast<wchar_t *>(buffer) which does produce the desired result (buffer is now treated like a wchar_t array) and it contains the values I expect.
However, after the second call to getline, (the second line of input.txt contains the string L"456") buffer contains (hex) "00 34 00 35 00 36 00". Note that this is incorrect (it should be [hex] 34 00 35 00 36 00)
The fact that the byte ordering gets messed up prevents me from using reinterpret_cast as a workaround. More importantly, why is std::wifstream::getline even converting my wchar_t buffer into a char buffer anyway? I was under the impression that if one wanted to use char they would use ifstream, and if they wanted to use wchar_t they would use wifstream...
I am terrible at making sense of the STL headers, but it almost looks as if wifstream is intentionally converting my wchar_t to char... why?
I would appreciate any insights and explanations for understanding these problems.
wifstream reads bytes from the file and converts them to wide chars using the codecvt facet installed in the stream's locale. The default facet assumes the system-default code page and calls mbstowcs on those bytes.
To treat your file as UTF-16, you need to use codecvt_utf16. Like this:
std::wifstream fin("text.txt", std::ios::binary);
// apply facet
fin.imbue(std::locale(fin.getloc(),
new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
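A usage sketch building on that (the file name and the BOM handling are assumptions, not part of the original answer):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // Binary mode so the CRT leaves the raw 0x0d/0x0a bytes alone.
    std::wifstream fin("text.txt", std::ios::binary);
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

    std::wstring line;
    while (std::getline(fin, line))
    {
        // If the file starts with a BOM, it appears as U+FEFF at the
        // start of the first line; strip it before processing.
        if (!line.empty() && line.front() == L'\xFEFF')
            line.erase(0, 1);
        // ... do stuff with line ...
    }
    return 0;
}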