I'm looking through the node Buffer documentation in detail, and I can't get my head around the explanation for buffer.write().
Specifically, I don't get what the behaviour is when a write attempt is performed with string larger than the buffer's capacity. The following passage seems to contradict itself:
If buffer did not contain enough space to fit the entire string, it will write a partial amount of the string. length defaults to buffer.length - offset. The method will not write partial characters.
The first sentence claims it will write what it can, while the last one says it's an all-or-nothing operation.
Am I missing something?
In certain encodings (like UTF-8) a single character can be represented by multiple bytes.
When the documentation says "The method will not write partial characters" I think they mean that if a characters needs 3 bytes but there are only 2 bytes left on the buffer the character won't be written at all (as opposed to only writing the first 2 bytes)
http://en.wikipedia.org/wiki/UTF-8
Related
I have a text file of 3GB size (a FASTA file with DNA sequences). It contains about 50 million lines of differing
length, though the most lines are 70 characters wide. I want to extract a string from this file, given two character indices. The difficult
part is, that newlines shall not be counted as character.
For good speed, I want to use seek() to reach the beginning of the string and start reading, but I need the offset in bytes for that.
My current approach is to write a new file, with all the newlines removed, but that takes another 3GB on disk. I want to find a solution which requires less disk space.
Using a dictionary mapping each character count to a file offset is not practicable either, because there would be one key for every byte, therefore using at least 16bytes*3 billion characters = 48GB.
I think I need a data structure which allows to retrieve the number of newline characters that come before a character of certain index, then I can add their number and the character index to obtain the file offset in bytes.
The SamTools fai index was designed just for this purpose. Which makes a very small compact index file with enough information to quickly seek to any point in the fasta file for any record inside as long as the file is properly formatted
You can create a SamTools index using samtools faidx command.
You can then use other programs in the SamTools package to pull out subsequences or alignments very quickly using the index.
see http://www.htslib.org/doc/samtools.html for usage.
I've heard this so many times, that I have taken it for granted. But thinking back on it, can someone help me realize why string manipulation, say comparison etc, is more expensive than say an integer, or some other primitive?
8bit example:
1 bit can be 1 or 0. With 2 bits you can represent 0, 1, 2, and 3. And so on.
With a byte you have 2^8 possibilities, from 0 to 255.
In a string a single letter is stored in a byte, so "Hello world" is 11 bytes.
If I want to do 100 + 100, 100 is stored in 1 byte of memory, I need only two bytes to sum two numbers. The result will need again 1 byte.
Now let's try with strings, "100" + "100", this is 3 bytes plus 3 bytes and the result, "100100" needs 6 bytes to be stored.
This is over-simplified, but more or less it works in this way.
The int data type in C# was carefully selected to be a good match with processor design. Which can store an int in a cpu register, a storage location that's an easy factor of 3 faster than memory. And a single cpu instruction to compare values of type int. The CMP instruction runs in less than a single cpu cycle, a fraction of a nano-second.
That doesn't work nearly as well for a string, it is a variable length data type and every single char in the string must be compared to test for equality. So it is automatically proportionally slower by the size of the string. Furthermore, string comparison is afflicted by culture dependent comparison rules. The kind that make "ss" and "ß" equal in German and "Aa" and "Å" equal in Danish. Nothing subtle to deal with, taken care of by highly optimized table-driven code inside the CLR. It can't beat CMP.
I've always thought it was because of the immutability of strings. That is, every time you make a change to the string, it requires allocating memory for a whole new string (rather than modifying the original in place).
Probably a woefully naive understanding but perhaps someone else can expound further.
There are several things to consider when looking at the "cost" of manipulating strings.
There is the cost in terms of memory usage, there is the cost in terms of CPU cycles used, and there is a cost associated with the complexity of the code involved.
Integer manipulation (Add, Subtract, Multipy, Divide, Compare) is most often done by the CPU at the hardware level, in few (or even 1) instruction. When the manipulation is done, the answer fits back in the same size chunk of memory.
Strings are stored in blocks of memory, which have to be manipulated a byte or word at a time. Comparing two 100 character long strings may require 100 separate comparison operations.
Any manipulation that makes a string longer will require, either moving the string to a bigger block of memory, or moving other stuff around in memory to allow growing the existing block.
Any manipulation that leaves the string the same, or smaller, could be done in place, if the language allows for it. If not, then again, a new block of memory has to be allocated and contents moved.
From the documentation:
store() should return the number of bytes used from the buffer. If the
entire buffer has been used, just return the count argument.
What does it do with this value? What's the difference if from a buffer of size FOO I read 4 and not 6 bytes?
You must realize that by implementing a sysfs file, you are trying to behave like a file.
Let's see this from the other side first. From the man page of fwrite(3):
RETURN VALUE
fread() and fwrite() return the number of items successfully read or written (i.e., not the number of characters). If an error occurs, or the end-of-file is
reached, the return value is a short item count (or zero).
And even better, from the man page of write(2):
The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource
limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).)
What this means is that store(), which is implementing the other end of the write(2) function for your particular file should return the number of bytes written (i.e. read by you), in the very least so that write(2) can return that value to the user.
In most cases, if there is no error in the input, you would just want to return count to acknowledge that you have read everything and all is ok.
While defining a dataset to be created, one of the JCL parameters, DCB has a positional sub-parameter RECFM, has possible values of F,FB,V,VB etc.. What're the advantages/disadvantages of RECFM=FB over RECFM=F or RECFM=VB over RECFM=V? And which case prefers to use what RECFM format?
RECFM is short for record format.
F represents fixed length records, unblocked. FB represents fixed length records, blocked. Blocking stores multiple records in a disk block, while the unblocked format stores one record in a disk block. At one time, disk drives were so slow that the unblocked format provided relative speed, while the blocked format provided better disk usage. Today, with modern disk drives, there's no advantage to using the unblocked format.
V represents variable length records, unblocked. VB represents variable length records, blocked. You would use these formats if you have variable length records, rather than fixed length records. You need to add 4 to the maximum record length in the LRECL to account for the record length field.
There's an additional attribute character, A. Used with fixed blocked (FBA) or variable blocked (VBA), this tells the system that the first byte of your record is a printer control character.
I am creating a protocol to have two applications talk over a TCP/IP stream and am figuring out how to design a header for my messages. Using the TCP header as an initial guide, I am wondering if I will need padding. I understand that when we're dealing with a cache, we want to make sure that data being stored fits in a row of cache so that when it is retrieved it is done so efficiently. However, I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
For example: I want to send over a message header consisting of a 3 byte field followed by a 1 byte padding field for 32 bit alignment. Then I will send over the message data.
In this case, the receiver will just take 3 bytes from the stream and throw away the padding byte. And then start reading message data. As I see it, he will not be storing the 3 bytes and the message data the way he wants. The whole point of byte alignment is so that it will be retrieved in an efficient manner. But if the retriever doesn't care about the padding how will it be retrieved efficiently?
Without the padding, the retriever just takes the 3 header bytes from the stream and then takes the data bytes. Since the retriever stores these bytes however he wants, how does it matter whether or not the padding is done?
Maybe I'm missing the point of padding.
It's slightly hard to extract a question from this post, but with what I've said you guys can probably point out my misconceptions.
Please let me know what you guys think.
Thanks,
jbu
If word alignment of the message body is of some use, then by all means, pad the message to avoid other contortions. The padding will be of benefit if most of the message is processed as machine words with decent intensity.
If the message is a stream of bytes, for instance xml, then padding won't do you a whole heck of a lot of good.
As far as actually designing a wire protocol, you should probably consider using a plain text protocol with compression (including the header), which will probably use less bandwidth than any hand-designed binary protocol you could possibly invent.
I do not understand how it makes sense to pad a header considering that an application will parse a stream of bytes and store it how it sees fit.
If I'm a receiver, I might pass a buffer (i.e. an array of bytes) to the protocol driver (i.e. the TCP stack) and say, "give this back to me when there's data in it".
What I (the application) get back, then, is an array of bytes which contains the data. Using C-style tricks like "casting" and so on I can treat portions of this array as if it were words and double-words (not just bytes) ... provided that they're suitably aligned (which is where padding may be required).
Here's an example of a statement which reads a DWORD from an offset in a byte buffer:
DWORD getDword(const byte* buffer)
{
//we want the DWORD which starts at byte-offset 8
buffer += 8;
//dereference as if it were pointing to a DWORD
//(this would fail on some machines if the pointer
//weren't pointing to a DWORD-aligned boundary)
return *((DWORD*)buffer);
}
Here's the corresponding function in Intel assembly; note that it's a single opcode i.e. quite an efficient way to access the data, more efficient that reading and accumulating separate bytes:
mov eax,DWORD PTR [esi+8]
Oner reason to consider padding is if you plan to extend your protocol over time. Some of the padding can be intentionally set aside for future assignment.
Another reason to consider padding is to save a couple of bits on length fields. I.e. always a multiple of 4, or 8 saves 2 or 3 bits off the length field.
One other good reason that TCP has padding (which probably does not apply to you) is it allows dedicated network processing hardware to easily separate the data from the header. As the data always starts on a 32 bit boundary, it's easier to separate the header from the data when the packet gets routed.
If you have a 3 byte header and align it to 4 bytes, then designate the unused byte as 'reserved for future use' and require the bits to be zero (rejecting messages where they are not as malformed). That leaves you some extensibility. Or you might decide to use the byte as a version number - initially zero, and then incrementing it if (when) you make incompatible changes to the protocol. Don't let the value be 'undefined' and "don't care"; you'll never be able to use it if you start out that way.