FileStream behaves differently on Linux

(.NET Core 2.1) I am using FileStream to write bytes to a file, developing on Windows and targeting Yocto Linux on a 32-bit ARM processor.
When running on Windows I get the correct resulting file. The same code running on Linux produces an incorrect file. Putting both files on the Linux system and checking them with the "file" command yields different results:
$ file pc.iso # <-- GOOD
pc.iso: data
$ file linux.iso # <-- BAD
linux.iso: DOS/MBR boot sector, code offset 0x58+2, OEM-ID "mkfs.fat",
sectors/cluster 64, reserved sectors 64, Media descriptor 0xf8, sectors/track
63, heads 255, sectors 100000000 (volumes > 32 MB), FAT (32 bit), sectors/FAT
12224, reserved 0x1, serial number 0x53ab8ba2, unlabeled
Sample of the code:
using (var stream = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.None, 4096, FileOptions.WriteThrough))
{
    stream.Seek(start, SeekOrigin.Begin);
    for (int i = 0; i < amount; i++)
    {
        var bytes = GetBytes(i);
        stream.Write(bytes, 0, bytes.Length);
    }
}
Why would something like this produce different results on Linux? And/or how can I prevent it from somehow producing a FAT32 image?

Related

Determining the optimal buffer size for file read in Linux

I am writing a C program which reads from stdin and writes to stdout. But it buffers the data so that a write is performed only after it reads a specific number of bytes (= SIZE).
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* read(), write() */

#define SIZE 100

int main()
{
    char buf[SIZE];
    int n = 0;

    //printf("Block size = %d\n", BUFSIZ);
    while ((n = read(0, buf, sizeof(buf))) > 0)
        write(1, buf, n);

    exit(0);
}
I am running this program on Ubuntu 18.04 hosted on Oracle VirtualBox (4 GB RAM, 2 cores), and testing the program for different values of the buffer size. I have redirected standard input to come from a file (which contains random numbers, created dynamically) and standard output to go to /dev/null. Here is the shell script used to run the test:
#!/bin/bash
# $1 - step size (bytes)
# $2 - start size (bytes)
# $3 - stop size (bytes)
echo "Changing buffer size from $2 to $3 in steps of $1, and measuring time for copying."
buff_size=$2
echo "Test Data" >testData
echo "Step Size:(doubles from previous size) Start Size:$2 Stop Size:$3" >>testData
while [ $buff_size -le $3 ]
do
    echo "" >>testData
    echo -n "$buff_size," >>testData
    gcc -DSIZE=$buff_size copy.c                       # Compile the copy program with the new buffer size
    dd bs=1000 count=1000000 </dev/urandom >testFile   # Create testFile with 1 GB of random data
    (/usr/bin/time -f "\t%U, \t%S," ./a.out <testFile 1>/dev/null) 2>>testData
    buff_size=$(($buff_size * 2))
    rm -f a.out
    rm -f testFile
done
I am measuring the time taken to execute the program and tabulating it (columns: buffer size in bytes, user time, and system time in seconds). A test run produces the following data:
Test Data
Step Size:(doubles from previous size) Start Size:1 Stop Size:524288
1, 5.94, 17.81,
2, 5.53, 18.37,
4, 5.35, 18.37,
8, 5.58, 18.78,
16, 5.45, 18.96,
32, 5.96, 19.81,
64, 5.60, 18.64,
128, 5.62, 17.94,
256, 5.37, 18.33,
512, 5.70, 18.45,
1024, 5.43, 17.45,
2048, 5.22, 17.95,
4096, 5.57, 18.14,
8192, 5.88, 17.39,
16384, 5.39, 18.64,
32768, 5.27, 17.78,
65536, 5.22, 17.77,
131072, 5.52, 17.70,
262144, 5.60, 17.40,
524288, 5.96, 17.99,
I don't see any significant variation in user+system time as the block size changes. But theoretically, as the block size becomes smaller, many more system calls are issued for the same file size, and execution should take longer: copying the 1 GB test file with SIZE=1 means roughly a billion read/write pairs, versus only about 244,000 with SIZE=4096. I have seen test results in the book 'Advanced Programming in the UNIX Environment' by W. Richard Stevens for a similar test, which show that user+system time drops significantly when the buffer size used for the copy approaches the filesystem block size (4096 bytes on my ext4 partition).
Why am I not able to reproduce these results? Am I missing some factors in these tests?
You did not disable the line #define SIZE 100 in your source code, so the definition given on the command line (-DSIZE=...) only has influence above this #define; the in-file definition wins, and every test was effectively run with a 100-byte buffer. On my compiler I get a warning for this at compile time (<command-line>:0:0: note: this is the location of the previous definition).
If you comment out the #define, you should be able to fix this error.
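Alternatively, a minimal sketch of keeping a default in the file while letting the command-line option take precedence (only the define is guarded, nothing else changes):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Use the -DSIZE=... value when one was given; fall back to 100 otherwise. */
#ifndef SIZE
#define SIZE 100
#endif

int main()
{
    char buf[SIZE];
    int n;

    while ((n = read(0, buf, sizeof(buf))) > 0)
        write(1, buf, n);

    exit(0);
}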
Another aspect which comes to mind: if you create a file on a machine and read it right away after that, it will be in the OS's disk cache (which is large enough to store all of this file), so the actual disk block size won't have much influence here. (On Linux you can drop the page cache between runs with echo 3 > /proc/sys/vm/drop_caches as root.)
Stevens's book was written in 1992, when RAM was way more expensive than today, so maybe some information in there is outdated. I also doubt that newer editions of the book have taken things like this out, because in general they are still true.

Logical block number to address (Linux filesystem)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIBMAP (requires root) */

int main(void)
{
    struct stat statBuf;
    int fd = open("file.txt", O_RDONLY);
    fstat(fd, &statBuf);

    /* Round up so a final partial block is counted. */
    int block_count = (statBuf.st_size + statBuf.st_blksize - 1) / statBuf.st_blksize;

    int i, add = 0;
    for (i = 0; i < block_count; i++) {
        int block = i;                     /* in: logical block number */
        if (ioctl(fd, FIBMAP, &block))     /* out: physical block number */
            perror("FIBMAP ioctl failed");
        printf("%3d %10d\n", i, block);
        add = block;
    }

    char buffer[255];
    int fd2 = open("/dev/sda1", O_RDONLY);
    lseek(fd2, add, SEEK_SET);             /* note: seeks to the block NUMBER, not a byte offset */
    read(fd2, buffer, 20);
    printf("%s%s\n", "ss ", buffer);
    return 0;
}
Output:
0 5038060
1 5038061
2 5038062
3 5038063
4 5038064
5 5038065
ss
I am using the above code to get the logical block numbers of a file. Let's suppose I want to read the contents of the last block number; how would I do that?
Is there a way to get the byte address of a block from its logical block number?
PS: I am using Linux and the filesystem is ext4.
The above code is actually getting you the physical block number. The FIBMAP ioctl takes as input the logical block number and returns the physical block number. You can then multiply that by the block size (which is 4k for most ext4 file systems, but you can get it by using the BLKBSZGET ioctl) to get the byte offset, if you want to read that disk block using lseek and read.
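For instance, here is a sketch of reading the last block found above by its byte offset. The block number 5038065 and the 4096-byte block size are assumptions taken from the question's output, and reading the raw device requires root:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int add = 5038065;     /* physical block number returned by FIBMAP above */
    long blksize = 4096;   /* filesystem block size (statBuf.st_blksize)     */
    off_t byte_offset = (off_t)add * blksize;   /* block number -> byte offset */

    char buffer[255] = { 0 };
    int fd2 = open("/dev/sda1", O_RDONLY);      /* raw device access needs root */

    lseek(fd2, byte_offset, SEEK_SET);
    read(fd2, buffer, 20);
    printf("ss %s\n", buffer);

    close(fd2);
    return 0;
}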
Note that the more modern interface that people tend to use today is the FIEMAP interface. This doesn't require root, and returns the physical byte offset, plus a lot more data. For more information, please see:
https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt
Or you can look at the source code for the filefrag command, which is part of e2fsprogs:
https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/misc/filefrag.c
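As a rough illustration of the FIEMAP interface (a sketch without the full error handling that filefrag does; it asks for just the first extent of the whole file):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

int main(void)
{
    int fd = open("file.txt", O_RDONLY);

    /* Room for one extent; real code sizes this from fm_mapped_extents. */
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm->fm_extent_count = 1;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        printf("first extent starts at physical byte %llu\n",
               (unsigned long long)fm->fm_extents[0].fe_physical);

    free(fm);
    close(fd);
    return 0;
}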

How can I detect whether a WAV file has a 44 or 46-byte header?

I've discovered it is dangerous to assume that all PCM wav audio files have 44 bytes of header data before the samples begin. Though this is common, many applications (ffmpeg, for example) will generate wavs with a 46-byte header, and ignoring this fact while processing will result in a corrupt and unreadable file. But how can you detect how long the header actually is?
Obviously there is a way to do this, but I searched and found little discussion about it. A LOT of audio projects out there assume 44 (or, conversely, 46) depending on the author's own context.
You should be checking all of the header data to see what the actual sizes are. Broadcast Wave Format files will contain an even larger extension subchunk. WAV and AIFF files from Pro Tools have even more extension chunks that are undocumented as well as data after the audio. If you want to be sure where the sample data begins and ends you need to actually look for the data chunk ('data' for WAV files and 'SSND' for AIFF).
As a review, all WAV subchunks conform to the following format:
Subchunk Descriptor (4 bytes)
Subchunk Size (4 byte integer, little endian)
Subchunk Data (size is Subchunk Size)
This is very easy to process. All you need to do is read the descriptor, if it's not the one you are looking for, read the data size and skip ahead to the next. A simple Java routine to do that would look like this:
//
// Quick note for people who don't know Java well:
// 'in.read(...)' returns -1 when the stream reaches
// the end of the file, so 'if (in.read(...) < 0)'
// is checking for the end of file.
//
public static void printWaveDescriptors(File file)
        throws IOException {
    try (FileInputStream in = new FileInputStream(file)) {
        byte[] bytes = new byte[4];

        // Read first 4 bytes.
        // (Should be RIFF descriptor.)
        if (in.read(bytes) < 0) {
            return;
        }
        printDescriptor(bytes);

        // First subchunk will always be at byte 12.
        // (There is no other dependable constant.)
        in.skip(8);

        for (;;) {
            // Read each chunk descriptor.
            if (in.read(bytes) < 0) {
                break;
            }
            printDescriptor(bytes);

            // Read chunk length.
            if (in.read(bytes) < 0) {
                break;
            }

            // Skip the length of this chunk.
            // Next bytes should be another descriptor or EOF.
            int length = (
                  Byte.toUnsignedInt(bytes[0])
                | Byte.toUnsignedInt(bytes[1]) << 8
                | Byte.toUnsignedInt(bytes[2]) << 16
                | Byte.toUnsignedInt(bytes[3]) << 24
            );
            in.skip(Integer.toUnsignedLong(length));
        }

        System.out.println("End of file.");
    }
}

private static void printDescriptor(byte[] bytes)
        throws IOException {
    String desc = new String(bytes, "US-ASCII");
    System.out.println("Found '" + desc + "' descriptor.");
}
For example, here is a random WAV file I had:
Found 'RIFF' descriptor.
Found 'bext' descriptor.
Found 'fmt ' descriptor.
Found 'minf' descriptor.
Found 'elm1' descriptor.
Found 'data' descriptor.
Found 'regn' descriptor.
Found 'ovwf' descriptor.
Found 'umid' descriptor.
End of file.
Notably, here both 'fmt ' and 'data' legitimately appear in between other chunks because Microsoft's RIFF specification says that subchunks can appear in any order. Even some major audio systems that I know of get this wrong and don't account for that.
So if you want to find a certain chunk, loop through the file checking each descriptor until you find the one you're looking for.
The trick is to look at the Subchunk1Size field, a 4-byte little-endian integer beginning at byte 16 of the header. In a normal 44-byte wav, this integer will be 16 (bytes 10 00 00 00 in hex): 12 bytes of RIFF/WAVE header + 8 bytes of fmt chunk header + 16 bytes of fmt data + 8 bytes of data chunk header = 44. If it's a 46-byte header, this integer will be 18 (12 00 00 00), or even higher if there is extra extensible metadata (rare?).
The extra data itself (if present) begins at byte 36.
So a simple C# program to detect the header length would look like this:
static void Main(string[] args)
{
    byte[] bytes = new byte[4];

    FileStream fileStream = new FileStream(args[0], FileMode.Open, FileAccess.Read);
    fileStream.Seek(16, SeekOrigin.Begin);
    fileStream.Read(bytes, 0, 4);
    fileStream.Close();

    int Subchunk1Size = BitConverter.ToInt32(bytes, 0);

    if (Subchunk1Size < 16)
        Console.WriteLine("This is not a valid wav file");
    else
        switch (Subchunk1Size)
        {
            case 16:
                Console.WriteLine("44-byte header");
                break;
            case 18:
                Console.WriteLine("46-byte header");
                break;
            default:
                Console.WriteLine("Header contains extra data and is larger than 46 bytes");
                break;
        }
}
In addition to Radiodef's excellent reply, I'd like to add 3 things that aren't obvious.
The only rule for WAV files is the FMT chunk comes before the DATA chunk. Apart from that, you will find chunks you don't know about at the beginning, before the DATA chunk and after it. You must read the header for each chunk to skip forward to find the next chunk.
The FMT chunk is commonly found in 16 byte and 18 byte variations, but the spec actually allows more than 18 bytes as well.
If the FMT chunk's header size field says greater than 16, bytes 17 and 18 also specify how many extra bytes there are, so if they are both zero, you end up with an 18-byte FMT chunk identical to the 16-byte one.
It is safe to read in just the first 16 bytes of the FMT chunk and parse those, ignoring any more.
Why does this matter? Not much any more, but Windows XP's Media Player could play 16-bit WAV files, yet it would play 24-bit WAV files only if the FMT chunk was the extended (18+ byte) version. There used to be a lot of complaints that "Windows doesn't play my 24-bit WAV files", but if the file had an 18-byte FMT chunk, it would... Microsoft fixed that sometime during the early days of Windows 7, so 24-bit files with 16-byte FMT chunks work fine now.
(Newly added) Chunks with odd sizes occur quite often, mostly when a 24-bit mono file is made. It is unclear from the spec, but the chunk size field stores the actual data length (the odd value), and a zero pad byte is added after the chunk, before the start of the next one. So chunks always start on even boundaries, but the chunk size itself is stored as the actual odd value; a chunk walker must account for this, as sketched below.
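A one-line adjustment covers it (a sketch in C; the same change applies to the skip in the Java routine above):

#include <stdint.h>

/* Bytes to skip to reach the next chunk: the stored (possibly odd)
 * data length, plus one pad byte when that length is odd. */
static uint32_t chunk_skip(uint32_t chunk_size)
{
    return chunk_size + (chunk_size & 1);
}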

Efficiently transfer large file (up to 2GB) to CUDA GPU?

I'm working on a GPU-accelerated program that requires reading an entire file of variable size. My question: what is the optimal number of bytes to read from the file and transfer to the coprocessor (CUDA device) at a time?
These files could be as large as 2GiB, so creating a buffer of that size doesn't seem like the best idea.
You can cudaMalloc a buffer of the largest size you can allocate on your device. After that, copy chunks of your input data of this size from host to device, process each one, copy back the results, and continue.
// Your input data on host
int hostBufNum = 5600000;
int* hostBuf = ...;

// Assume this is the largest device buffer you can allocate
int devBufNum = 1000000;
int* devBuf;
cudaMalloc(&devBuf, sizeof(int) * devBufNum);

int* hostChunk = hostBuf;
int hostLeft = hostBufNum;

do
{
    // Recompute each pass so the final (short) chunk doesn't overrun.
    int chunkNum = (hostLeft < devBufNum) ? hostLeft : devBufNum;

    cudaMemcpy(devBuf, hostChunk, chunkNum * sizeof(int), cudaMemcpyHostToDevice);
    doSomethingKernel<<< >>>(devBuf, chunkNum);   // launch configuration elided

    hostChunk += chunkNum;
    hostLeft  -= chunkNum;
} while (hostLeft > 0);
If you can split your function up so you can work on chunks on the card, you should look into using streams (cudaStream_t).
If you schedule loads and kernel executions in several streams, you can have one stream load data while another executes a kernel on the card, thereby hiding some of the transfer time of your data in the execution of a kernel.
You need to declare a buffer of whatever your chunk size is, times however many streams you use (up to 16, for compute capability 1.x, as far as I know).
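A rough double-buffering sketch of that idea, reusing the names from the snippet above (doSomethingKernel, devBufNum, and hostBufNum are assumed placeholders; asynchronous copies also require page-locked host memory):

cudaStream_t stream[2];
int* devBufs[2];
for (int s = 0; s < 2; s++)
{
    cudaStreamCreate(&stream[s]);
    cudaMalloc(&devBufs[s], devBufNum * sizeof(int));
}

// cudaMemcpyAsync only overlaps with kernels when the host side is pinned.
int* pinned;
cudaMallocHost(&pinned, hostBufNum * sizeof(int));
// ... read the file into 'pinned' ...

for (int off = 0, c = 0; off < hostBufNum; off += devBufNum, c++)
{
    int s = c % 2;   // alternate streams: one copies while the other computes
    int n = (hostBufNum - off < devBufNum) ? (hostBufNum - off) : devBufNum;

    cudaMemcpyAsync(devBufs[s], pinned + off, n * sizeof(int),
                    cudaMemcpyHostToDevice, stream[s]);
    doSomethingKernel<<<(n + 255) / 256, 256, 0, stream[s]>>>(devBufs[s], n);
}
cudaDeviceSynchronize();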

FUSE fseek unexpected behaviour with direct_io

I'm trying to write a FUSE filesystem that presents streamable music as mp3 files. I don't want to start streaming the audio when only the ID3v1.1 tag is read, so I mount the filesystem with direct_io and max_readahead=0.
But when I run the following (which is also what libid3tag does), I get reads of 2752 bytes at an offset 2880 bytes from the end:
char tmp[255];
FILE* f = fopen("foo.mp3", "r");
fseek(f, -128, SEEK_END);   /* the ID3v1.1 tag is the last 128 bytes */
fread(tmp, 1, 10, f);
Why is this? I expected a single call to read with an offset exactly 128 bytes from the end and a size of 10.
The number of bytes read seems to vary somewhat.
I've had a similar issue and filed an issue with s3fs. Check out the issue: http://code.google.com/p/s3fs/issues/detail?can=2&q=&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&groupby=&sort=&id=241
Additionally, check out line 1611 in s3fs.cpp:
http://code.google.com/p/s3fs/source/browse/trunk/src/s3fs.cpp?r=316
// error check this
// fseek (pSourceFile , 0 , SEEK_END);
