How to find condition that start char in UTF-8 file is read, using FileStream and StreamReader? - c#-4.0

In C# .NET 4.0 (really 4.5.2), my code reads a UTF-8 file.
FileStream fstream = new FileStream(path, FileMode.Open);
BufferedStream stream = new BufferedStream(fstream);
using (StreamReader reader = new StreamReader(stream, new UTF8Encoding())) {
int i;
while((i = reader.Read()) > -1) {
//a guess at a condition that is true I.F.F. reader has read character 1 of the file
if (stream.Position == (0 + sizeof(char)) || stream.Position == (0 + sizeof(int)) ) {
//while loop has reader read through all characters,
//but within this block, the reader has surely read character 1?
char c = (char)i;
}
}
reader.Close();
return 0;
}
I.F.F. we reach the condition that StreamReader reads the start character of the UTF-8 file, then run some function on the first character read.
With a FileStream and StreamReader used in reading a UTF-8 file, how do you know whether the aforementioned condition is met?
I am looking for an answer, please, that uses a property or method that already exists in the C# .NET 4.0 System.IO namespace. I thought use of the Stream.Position (BufferedStream.Position) property is the obvious way to find out where (i.e. at what character) in the file the reader is, but in trying a UTF-8 file that starts with some character in '0' to '9' (48 to 57), the loop with reader.Read() reads that char, and stream.Position = 43 . I don't know why 43 of all integral values is the value of stream.Position after the 1st character is read, or what the 43 means.
Update: as the loop iterates and the reader reads more characters, the stream.Position value remains at 43, so I don't see how the Position property is useful here.

bool first = true;
while ((i = reader.Read()) > -1)
{
    if (first)
    {
        first = false;
        // Do first character things
    }
}
Note that the concept of "first character" is complex: what happens if the first glyph is è, which occupies two bytes in the file? The stream position will be at least 2 :-)
In general, you can check the Position of StreamReader.BaseStream, but that Position is nearly useless. There can be multiple levels of buffering: StreamReader fills its own internal buffer from the stream in blocks, which is presumably why Position jumps straight to 43 (likely the whole file) and then stays there. Also, to read a single character the StreamReader may have consumed 1-4 bytes (a is one byte, è is two, and some Unicode characters take four). And UTF-8 files can start with a BOM (a 3-byte initial header), which StreamReader normally skips as well.
Still, if you want, you can subclass the entire StreamReader class, overriding all the Read*, and keeping an internal flag SomethingHasBeenRead. It isn't difficult (everything is virtual in StreamReader)... It is only a little long to do.
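To make the byte arithmetic above concrete, here is a minimal sketch in C of how many bytes lie before the end of the first character of a UTF-8 buffer, BOM included. The function names are made up for this illustration, and this is not how StreamReader is implemented; it only shows why a byte position cannot be compared against sizeof(char).

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of a UTF-8 byte order mark at the start of buf (3 or 0). */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}

/* Number of bytes occupied by the character whose first byte is lead;
 * 0 if lead cannot start a character. */
size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;           /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* continuation or invalid byte */
}

/* Byte offset just past the first character, counting a leading BOM.
 * Returns 0 if the buffer holds no complete first character. */
size_t utf8_offset_after_first_char(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t bom = utf8_bom_length(buf, len);
    if (bom == len)
        return 0;                        /* empty, or nothing after the BOM */
    size_t n = utf8_sequence_length(buf[bom]);
    if (n == 0 || bom + n > len)
        return 0;
    return bom + n;
}
```

For a file starting with the digit 7 this gives 1, for a file starting with è it gives 2, and for a BOM followed by è it gives 5, none of which match sizeof(char) or sizeof(int) in C#.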

Related

How to add pointer char datas (created using malloc) to a char array in C?

In my MPI code in C, I'm receiving a word from each of my slave processes. I want to add all these words to a char array on the master side (part of the code is below). I can print these words, but I can't collect them into a single char array.
(I consider max word length as 10, and number of slave's as slavenumber)
char* word = (char*)malloc(sizeof(char)*10);
char words[slavenumber*10];
for (int p = 0; p<slavenumber; p++){
MPI_Recv(word, 10, MPI_CHAR, p, 0,MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Word: %s\n", word); //it works fine
words[p*10] = *word; //This does not work, i think there is a problem here.
}
printf(words); //This does not work correctly, it gives something like: ��>;&�>W�
Can anybody help me on this?
Let's break it down line by line
// allocate a buffer large enough to hold 10 elements of type `char`
char* word = (char*)malloc(sizeof(char)*10);
// define a variable-length-array large enough to
// hold 10*slavenumber elements of `char`
char words[slavenumber*10];
for (int p = 0; p<slavenumber; p++){
// dereference `word` which is exactly the same as writing
// `word[0]` assigning it to `words[p*10]`
words[p*10] = *word;
// words[p*10+1] to words[p*10+9] are unchanged,
// i.e. uninitialized
}
// printing from an array. For this to work properly all
// accessed elements must be initialized and the buffer
// terminated by a null byte. You have neither
printf(words);
Because you left elements uninitialized and didn't null terminate, you're invoking undefined behavior. Be happy that you didn't get demons crawl out of your nose.
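In all seriousness though, in C you can't copy strings by mere assignment. Your use case calls for strncpy, applied to each word right after it is received:
for (int p = 0; p<slavenumber; p++){
    MPI_Recv(word, 10, MPI_CHAR, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // copies up to 10 chars into slot p; a 10-char word leaves the slot unterminated
    strncpy(&words[p*10], word, 10);
}
Printing still needs care: printf(words) stops at the first null byte (and treats the data as a format string), so print each slot separately, for example with printf("%.10s\n", &words[p*10]).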
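The fixed-width slot idea can be tried without MPI. The following is a minimal sketch under the question's assumption of 10-character slots; store_word, load_word and WORD_LEN are names invented for this illustration.

```c
#include <assert.h>
#include <string.h>

#define WORD_LEN 10  /* slot width, matching the 10 used in the question */

/* Copy word into slot number `slot` of the flat buffer `words`.
 * strncpy pads the rest of the slot with '\0' when word is shorter than
 * WORD_LEN, but adds no terminator when word fills the slot completely. */
void store_word(char *words, int slot, const char *word)
{
    assert(words != NULL && word != NULL && slot >= 0);
    strncpy(&words[slot * WORD_LEN], word, WORD_LEN);
}

/* Copy the word held in `slot` into out (at least WORD_LEN + 1 bytes),
 * always null-terminating it so it is safe to print with %s. */
void load_word(const char *words, int slot, char *out)
{
    assert(words != NULL && out != NULL && slot >= 0);
    memcpy(out, &words[slot * WORD_LEN], WORD_LEN);
    out[WORD_LEN] = '\0';
}
```

In the question's loop, store_word(words, p, word) would replace the single-character assignment, and load_word gives back a printable copy of any slot, including one that holds a full 10-character word.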

How can I detect whether a WAV file has a 44 or 46-byte header?

I've discovered it is dangerous to assume that all PCM wav audio files have 44 bytes of header data before the samples begin. Though this is common, many applications (ffmpeg for example), will generate wavs with a 46-byte header and ignoring this fact while processing will result in a corrupt and unreadable file. But how can you detect how long the header actually is?
Obviously there is a way to do this, but I searched and found little discussion about this. A LOT of audio projects out there assume 44 (or conversely, 46) depending on the authors own context.
You should be checking all of the header data to see what the actual sizes are. Broadcast Wave Format files will contain an even larger extension subchunk. WAV and AIFF files from Pro Tools have even more extension chunks that are undocumented as well as data after the audio. If you want to be sure where the sample data begins and ends you need to actually look for the data chunk ('data' for WAV files and 'SSND' for AIFF).
As a review, all WAV subchunks conform to the following format:
Subchunk Descriptor (4 bytes)
Subchunk Size (4 byte integer, little endian)
Subchunk Data (size is Subchunk Size)
This is very easy to process. All you need to do is read the descriptor, if it's not the one you are looking for, read the data size and skip ahead to the next. A simple Java routine to do that would look like this:
//
// Quick note for people who don't know Java well:
// 'in.read(...)' returns -1 when the stream reaches
// the end of the file, so 'if (in.read(...) < 0)'
// is checking for the end of file.
//
public static void printWaveDescriptors(File file)
throws IOException {
try (FileInputStream in = new FileInputStream(file)) {
byte[] bytes = new byte[4];
// Read first 4 bytes.
// (Should be RIFF descriptor.)
if (in.read(bytes) < 0) {
return;
}
printDescriptor(bytes);
// First subchunk will always be at byte 12.
// (There is no other dependable constant.)
in.skip(8);
for (;;) {
// Read each chunk descriptor.
if (in.read(bytes) < 0) {
break;
}
printDescriptor(bytes);
// Read chunk length.
if (in.read(bytes) < 0) {
break;
}
// Skip the length of this chunk.
// Next bytes should be another descriptor or EOF.
int length = (
Byte.toUnsignedInt(bytes[0])
| Byte.toUnsignedInt(bytes[1]) << 8
| Byte.toUnsignedInt(bytes[2]) << 16
| Byte.toUnsignedInt(bytes[3]) << 24
);
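// Chunks are word-aligned: an odd-sized chunk is followed by one pad byte.
long size = Integer.toUnsignedLong(length);
in.skip(size + (size & 1));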
}
System.out.println("End of file.");
}
}
private static void printDescriptor(byte[] bytes)
throws IOException {
String desc = new String(bytes, "US-ASCII");
System.out.println("Found '" + desc + "' descriptor.");
}
For example here is a random WAV file I had:
Found 'RIFF' descriptor.
Found 'bext' descriptor.
Found 'fmt ' descriptor.
Found 'minf' descriptor.
Found 'elm1' descriptor.
Found 'data' descriptor.
Found 'regn' descriptor.
Found 'ovwf' descriptor.
Found 'umid' descriptor.
End of file.
Notably, here both 'fmt ' and 'data' legitimately appear in between other chunks because Microsoft's RIFF specification says that subchunks can appear in any order. Even some major audio systems that I know of get this wrong and don't account for that.
So if you want to find a certain chunk, loop through the file checking each descriptor until you find the one you're looking for.
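The same walk can be sketched in C over a file image that has already been read into memory. This is only an illustration under that assumption: find_chunk and read_le32 are names made up here, and the payload length is reported without checking that the whole payload is present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 4-byte little-endian unsigned integer. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Search a RIFF/WAVE image for the chunk whose 4-byte descriptor is id
 * (for example "data"). On success, returns the offset of the chunk's
 * payload and stores the payload size in *size. Returns 0 if the chunk is
 * missing or the buffer does not start with a RIFF/WAVE header. */
size_t find_chunk(const unsigned char *buf, size_t len,
                  const char *id, uint32_t *size)
{
    assert(buf != NULL && id != NULL && size != NULL);
    if (len < 12 || memcmp(buf, "RIFF", 4) != 0
                 || memcmp(buf + 8, "WAVE", 4) != 0)
        return 0;

    size_t pos = 12;                         /* first subchunk header */
    while (len - pos >= 8) {
        uint32_t chunk_size = read_le32(buf + pos + 4);
        if (memcmp(buf + pos, id, 4) == 0) {
            *size = chunk_size;
            return pos + 8;
        }
        /* Skip header, payload, and the pad byte after odd-sized chunks. */
        uint64_t step = 8u + (uint64_t)chunk_size + (chunk_size & 1u);
        if (step > len - pos)
            break;                           /* truncated file */
        pos += (size_t)step;
    }
    return 0;
}
```

Calling find_chunk(image, image_len, "data", &size) then yields the true start of the samples, whether the header happens to be 44 bytes, 46 bytes, or something much larger.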
The trick is to look at the "Subchunk1Size", which is a 4-byte little-endian integer beginning at byte 16 of the header (this assumes the fmt chunk comes first, directly after the 12-byte RIFF header). In a normal 44-byte WAV header, this integer will be 16 (bytes 10 00 00 00 in hex). If it's a 46-byte header, this integer will be 18 (bytes 12 00 00 00), or maybe even higher if there is extra extensible metadata (rare?).
The extra data itself (if present), begins in byte 36.
So a simple C# program to detect the header length would look like this:
static void Main(string[] args)
{
byte[] bytes = new byte[4];
FileStream fileStream = new FileStream(args[0], FileMode.Open, FileAccess.Read);
fileStream.Seek(16, SeekOrigin.Begin);
fileStream.Read(bytes, 0, 4);
fileStream.Close();
int Subchunk1Size = BitConverter.ToInt32(bytes, 0);
if (Subchunk1Size < 16)
Console.WriteLine("This is not a valid wav file");
else
switch (Subchunk1Size)
{
case 16:
Console.WriteLine("44-byte header");
break;
case 18:
Console.WriteLine("46-byte header");
break;
default:
Console.WriteLine("Header contains extra data and is larger than 46 bytes");
break;
}
}
In addition to Radiodef's excellent reply, I'd like to add 3 things that aren't obvious.
The only rule for WAV files is the FMT chunk comes before the DATA chunk. Apart from that, you will find chunks you don't know about at the beginning, before the DATA chunk and after it. You must read the header for each chunk to skip forward to find the next chunk.
The FMT chunk is commonly found in 16 byte and 18 byte variations, but the spec actually allows more than 18 bytes as well.
If the FMT chunk's size field says greater than 16, bytes 17 and 18 specify how many extra bytes follow, so if they are both zero, you end up with an 18-byte FMT chunk identical to the 16-byte one.
It is safe to read in just the first 16 bytes of the FMT chunk and parse those, ignoring any more.
Why does this matter? - not much any more, but Windows XP's Media Player was able to play 16 bit WAV files, but 24 bit WAV files only if the FMT chunk was the Extended (18+ byte) version. There used to be a lot of complaints that "Windows doesn't play my 24 bit WAV files", but if it had an 18 byte FMT chunk, it would... Microsoft fixed that sometime during the early days of Windows 7, so 24 bit with 16 byte FMT files work fine now.
(Newly added) Chunk sizes with odd sizes occur quite often. Mostly seen when a 24 bit mono file is made. It is unclear from the spec, but the chunk size specifies the actual data length (the odd value) and a pad byte (zero) is added after the chunk and before the start of the next chunk. So chunks always start on even boundaries, but the chunk size itself is stored as the actual odd value.

Read a String with spaces till a new line in C

I am in a pickle right now. I'm having trouble taking in an input of example
1994 The Shawshank Redemption
1994 Pulp Fiction
2008 The Dark Knight
1957 12 Angry Men
I first take in the number into an integer, then I need to take in the name of the Movie into a string using a character array, however i have not been able to get this done.
here is the code atm
while(scanf("%d", &myear) != EOF)
{
i = 0;
while(scanf("%[^\n]", &ch))
{
title[i] = ch;
i++;
}
addNode(makeData(title,myear));
}
The title array is arbitrarily large and the function is to add the data as a node to a linked list. right now the output I keep getting for each node is as follows
" hank Redemption"
" ion"
" Knight"
" Men"
Yes, it oddly prints a space in front of the cut-off title. I checked the variables and it adds the space in the data. (I am not printing the year as that is taken in correctly)
How can I fix this?
You are using the wrong type of argument passed to scanf() -- instead of scanning a character, try scanning to the string buffer immediately. %[^\n] scans an entire string up to (but not including) the newline. It does not scan only one character.
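(Marginal secondary problem: comparing the return value of scanf() against EOF is not a reliable loop condition. scanf() returns EOF only when input ends or fails before the first conversion; when the input simply doesn't match, it returns 0 and leaves the offending characters unread, so the loop never terminates. Compare the result against the number of conversions you expect instead, here scanf("%d", &myear) == 1.)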
I hope you see now: scanf() is hard to get right. It's evil. Why not input the whole line at once then parse it using sane functions?
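char buf[LINE_MAX]; // LINE_MAX comes from <limits.h> on POSIX systems
while (fgets(buf, sizeof buf, stdin) != NULL) {
    buf[strcspn(buf, "\n")] = '\0'; // drop the newline that fgets() keeps
    int year = strtol(buf, NULL, 10);
    const char *p = strchr(buf, ' ');
    if (p != NULL) {
        char name[LINE_MAX];
        strcpy(name, p + 1); // safe because the title is shorter than buf and name is as large as buf
        // ... use year and name here ...
    }
}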
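If you would rather keep a scanf-style format, one option is to apply sscanf to a line that has already been read, with a width limit on the string conversion. This is a minimal sketch; parse_movie_line and TITLE_MAX are names assumed for the example.

```c
#include <assert.h>
#include <stdio.h>

#define TITLE_MAX 256  /* assumed upper bound on the title, terminator included */

/* Split a line such as "1994 The Shawshank Redemption\n" into year and
 * title. Returns 1 on success, 0 if the line does not have that shape. */
int parse_movie_line(const char *line, int *year, char title[TITLE_MAX])
{
    assert(line != NULL && year != NULL && title != NULL);
    /* %d reads the year, the space skips whitespace, and %255[^\n] reads
     * up to 255 characters of title, stopping before the newline. */
    return sscanf(line, "%d %255[^\n]", year, title) == 2;
}
```

Because the return value is compared against the number of expected conversions, a line with a missing year or a missing title is rejected rather than silently half-parsed.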

Binary file I/O

How to read and write to binary files in D language? In C would be:
FILE *fp = fopen("/home/peu/Desktop/bla.bin", "wb");
char x[4] = "RIFF";
fwrite(x, sizeof(char), 4, fp);
I found rawWrite at D docs, but I don't know the usage, nor if does what I think. fread is from C:
T[] rawRead(T)(T[] buffer);
If the file is not opened, throws an exception. Otherwise, calls fread for the file handle and throws on error.
rawRead always reads in binary mode on Windows.
rawRead and rawWrite should behave exactly like fread, fwrite, only they are templates to take care of argument sizes and lengths.
e.g.
auto stream = File("filename","r+");
auto outstring = "abcd";
stream.rawWrite(outstring);
stream.rewind();
auto inbytes = new char[4];
stream.rawRead(inbytes);
assert(inbytes[3] == outstring[3]);
rawRead is implemented in terms of fread as
T[] rawRead(T)(T[] buffer)
{
enforce(buffer.length, "rawRead must take a non-empty buffer");
immutable result =
.fread(buffer.ptr, T.sizeof, buffer.length, p.handle);
errnoEnforce(!error);
return result ? buffer[0 .. result] : null;
}
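For comparison, the C round trip that rawWrite and rawRead wrap can be sketched like this. It uses a temporary file from tmpfile() instead of a named path, and roundtrip_bytes is a name made up for the illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Write n bytes to a temporary binary file, rewind, and read them back
 * into out. Returns the number of bytes read back, or 0 on failure. */
size_t roundtrip_bytes(const void *data, size_t n, void *out)
{
    assert(data != NULL && out != NULL);
    FILE *fp = tmpfile();            /* opened for binary update ("wb+") */
    if (fp == NULL)
        return 0;
    size_t got = 0;
    if (fwrite(data, 1, n, fp) == n) {
        rewind(fp);
        got = fread(out, 1, n, fp);
    }
    fclose(fp);                      /* the temporary file is removed here */
    return got;
}
```

The D versions differ mainly in that the element size and count come from the slice type and length rather than from separate arguments.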
If you just want to read in a big buffer of values (say, ints), you can simply do:
int[] ints = cast(int[]) std.file.read("ints.bin", numInts * int.sizeof);
and
std.file.write("ints.bin", ints);
Of course, if you have more structured data then Scott Wales' answer is more appropriate.

Making a WCHAR null terminated

I've got this
WCHAR fileName[1];
as a returned value from a function (it's a sys 32 function so I am not able to change the returned type). I need to make fileName to be null terminated so I am trying to append '\0' to it, but nothing seems to work.
Once I get a null terminated WCHAR I will need to pass it to another sys 32 function so I need it to stay as WCHAR.
Could anyone give me any suggestion please?
================================================
Thanks a lot for all your help. Looks like my problem has to do with more than missing a null terminated string.
//This works:
WCHAR szPath1[50] = L"\\Invalid2.txt.txt";
dwResult = FbwfCommitFile(szDrive, pPath1); //Successful
//This does not:
std::wstring l_fn(L"\\");
//Because Cache_detail->fileName is \Invalid2.txt.txt and I need two
l_fn.append(Cache_detail->fileName);
l_fn += L""; //To ensure null terminated
fprintf(output, "l_fn.c_str: %ls\n", l_fn.c_str()); //Prints "\\Invalid2.txt.txt"
iCommitErr = FbwfCommitFile(L"C:", (WCHAR*)l_fn.c_str()); //Unsuccessful
//Then when I do a comparison on these two they are unequal.
int iCompareResult = l_fn.compare(pPath1); // returns -1
So I need to figure out how these two ended up to be different.
Thanks a lot!
Since you mentioned FbwfFindFirst/FbwfFindNext in a comment, you're talking about the file name returned in FbwfCacheDetail. The fileNameLength field gives the length of fileName in bytes, so the length of fileName in WCHARs is fileNameLength/sizeof(WCHAR). The simple answer is that you can set
fileName[fileNameLength/sizeof(WCHAR)] = L'\0';
This is important: the buffer you pass as the cacheDetail parameter of FbwfFindFirst/FbwfFindNext must be sizeof(WCHAR) bytes larger than the API needs, otherwise the snippet above writes outside the bounds of your array. So for the size parameter of FbwfFindFirst/FbwfFindNext, pass in the buffer size minus sizeof(WCHAR).
For example this:
// *** Caution: This example has no error checking, nor has it been compiled ***
ULONG error;
ULONG size;
FbwfCacheDetail *cacheDetail;
// Make an initial call to find how big of a buffer we need
size = 0;
error = FbwfFindFirst(volume, NULL, &size);
if (error == ERROR_MORE_DATA) {
// Allocate more than we need
cacheDetail = (FbwfCacheDetail*)malloc(size + sizeof(WCHAR));
// Don't tell this call about the bytes we allocated for the null
error = FbwfFindFirst(volume, cacheDetail, &size);
// The terminator goes right after the last character, in the extra space
cacheDetail->fileName[cacheDetail->fileNameLength/sizeof(WCHAR)] = L'\0';
// ... Use fileName as a null terminated string ...
// Have to free what we allocate
free(cacheDetail);
}
Of course you'll have to change a good bit to fit in with your code (plus you'll have to call fbwffindnext as well)
If you are interested in why the FbwfCacheDetail struct ends with a WCHAR[1] field, see this blog post. It's a pretty common pattern in the Windows API.
Use L'\0', not '\0'.
As each character of a WCHAR is 16-bit in size, you should perhaps append \0\0 to it, but I'm not sure if this works. By the way, WCHAR fileName[1]; is creating a WCHAR of length 1, perhaps you want something like WCHAR fileName[1024]; instead.
WCHAR fileName[1]; is an array of 1 character, so if null terminated it will contain only the null terminator L'\0'.
Which API function are you calling?
Edited
The fileName member in FbwfCacheDetail is only 1 character long, which is a common technique when the length of the array is unknown and the member is the last member in a structure. As you have likely already noticed, if your allocated buffer is only sizeof (FbwfCacheDetail) long then FbwfFindFirst returns ERROR_NOT_ENOUGH_MEMORY.
So if I understand correctly, what you want to do is output the non-NULL-terminated filename using fprintf. This can be done as follows
fprintf (outputfile, "%.*ls", (int)(cacheDetail.fileNameLength / sizeof (WCHAR)), cacheDetail.fileName);
This will print only the first fileNameLength / sizeof (WCHAR) characters of fileName (fileNameLength is a byte count, and fprintf takes a narrow format string).
An alternative approach would be to append a NULL terminator to the end of fileName. First you'll need to ensure that the buffer is long enough which can be done by subtracting sizeof (WCHAR) from the size argument you pass to FbwfFindFirst. So if you allocate a buffer of 1000 bytes, you'll pass 998 to FbwfFindFirst, reserving the last two bytes in the buffer for your own use. Then to add the NULL terminator and output the file name use
cacheDetail.fileName[cacheDetail.fileNameLength / sizeof (WCHAR)] = L'\0';
fprintf (outputfile, "%ls", cacheDetail.fileName);
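The pattern can be tried without the FBWF API. The sketch below uses a hypothetical struct name_record as a stand-in for a structure that ends in WCHAR fileName[1], with standard wchar_t in place of WCHAR; none of these names come from the Windows headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for a struct ending in a one-element name array. */
struct name_record {
    unsigned long name_length;  /* length of name in BYTES, no terminator */
    wchar_t name[1];            /* really name_length / sizeof(wchar_t) chars */
};

/* Allocate a record holding src without a terminator, leaving room for
 * one extra character the way the answers reserve sizeof(WCHAR) bytes.
 * The caller frees the result. */
struct name_record *make_record(const wchar_t *src)
{
    assert(src != NULL);
    size_t chars = wcslen(src);
    /* sizeof already covers one character, so this holds chars + 1. */
    struct name_record *rec =
        malloc(sizeof(struct name_record) + chars * sizeof(wchar_t));
    if (rec == NULL)
        return NULL;
    rec->name_length = (unsigned long)(chars * sizeof(wchar_t));
    memcpy(rec->name, src, chars * sizeof(wchar_t)); /* deliberately unterminated */
    return rec;
}

/* Write the terminator right after the last character of the name. */
void terminate_name(struct name_record *rec)
{
    assert(rec != NULL);
    rec->name[rec->name_length / sizeof(wchar_t)] = L'\0';
}
```

The index is the byte length divided by the character size, with nothing added: the characters occupy indices 0 to n - 1, so the terminator belongs at index n.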
