How many bytes/KB per tweet?

I have been reading a lot of SO posts and external blogs, and one thing keeps confusing me. Without going too deep into the internals of UTF-8: does each character take 2 bytes or 4 bytes? I have seen both figures mentioned. Given the current Twitter limit of 280 characters, is this the right calculation?
For A-Z, a-z, and 0-9 we use 1 byte each:
280 * 1 = 280 bytes
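For what it's worth, neither "2 bytes" nor "4 bytes" is fixed: UTF-8 is a variable-width encoding, so each character takes 1 to 4 bytes. A minimal sketch that checks this with the standard TextEncoder (the sample characters are just illustrative):

    // UTF-8 byte counts for a few sample characters:
    // ASCII is 1 byte, Latin accents 2, many symbols 3, emoji 4.
    const enc = new TextEncoder();
    for (const ch of ["A", "é", "€", "😀"]) {
      console.log(ch, enc.encode(ch).length); // 1, 2, 3, 4
    }

So 280 ASCII characters are indeed 280 bytes, but the UTF-8 worst case is 280 * 4 = 1120 bytes.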

Node.js readUIntBE arbitrary size restriction?

Background
I am reading buffers using the Node.js buffer native API. This API has two functions called readUIntBE and readUIntLE for Big Endian and Little Endian respectively.
https://nodejs.org/api/buffer.html#buffer_buf_readuintbe_offset_bytelength_noassert
Problem
By reading the docs, I stumbled upon the following lines:
byteLength Number of bytes to read. Must satisfy: 0 < byteLength <= 6.
If I understand correctly, this means that I can only read 6 bytes at a time using this function, which makes it useless for my use case, as I need to read a timestamp comprised of 8 bytes.
Questions
Is this a documentation typo?
If not, what is the reason for such an arbitrary limitation?
How do I read 8 bytes in a row (or, more generally, sequences longer than 6 bytes)?
Answer
After asking in the official Node.js repo, I got the following response from one of the members:
No, it is not a typo.
The byteLength corresponds to e.g. 8-bit, 16-bit, 24-bit, 32-bit, 40-bit and 48-bit reads. More is not possible since JS numbers are only safe up to Number.MAX_SAFE_INTEGER.
If you want to read 8 bytes, you can read multiple entries by adding the offset.
Source: https://github.com/nodejs/node/issues/20249#issuecomment-383899009
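That limit makes sense once you count bits: 6 bytes (48 bits) is the largest whole-byte width guaranteed to stay below Number.MAX_SAFE_INTEGER (2^53 - 1), while 7 bytes would already be 56 bits. For an 8-byte timestamp, one option is to follow the suggestion above and combine two 32-bit reads; a sketch, assuming a big-endian field at offset 0:

    // Hypothetical buffer holding an 8-byte big-endian timestamp.
    const buf = Buffer.from([0x00, 0x00, 0x01, 0x86, 0xa0, 0x12, 0x34, 0x56]);

    // Read the two 32-bit halves separately and combine them as BigInt,
    // so the full 64-bit value stays exact beyond Number.MAX_SAFE_INTEGER.
    const hi = buf.readUInt32BE(0);
    const lo = buf.readUInt32BE(4);
    const timestamp = (BigInt(hi) << 32n) | BigInt(lo);

Newer Node.js versions (12 and later) also provide buf.readBigUInt64BE(offset), which does this in a single call.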

CoAP, how to understand "Options"

Formatting a CoAP packet
RFC 7252, The Constrained Application Protocol (CoAP): https://tools.ietf.org/html/rfc7252
In RFC 7252 Section 3, Figure 7, the third row (bytes 9 ... 16 or maybe more) is the Options field. I am unable to find anything that specifies how long the Options field is. I understand that it can change, but unlike the Token field, whose length is specified by the TKL field, I cannot see where the length of the Options is specified.
Yes, I see Sections 3.1 and 3.2, but I am not able to understand what they are telling me. The document states to reference the previous options. OK, but what do you do for the first message, where there is no previous packet and no previous option?
When my code needs to send a CoAP message, how do I determine what options can be sent? What values must be loaded into the packet to send, for example, no options?
If you look at Figure 8 in Section 3.1 of the RFC, bits 4-7 denote the length of the option value.
0 1 2 3 4 5 6 7
+---------------+---------------+
| Option Delta | Option Length | 1 byte
+---------------+---------------+
Bits 0-3 tell you which option it is. This nibble gives you only the delta compared to the previous option encoded in this message; for the first option in the message there is no previous option, so bits 0-3 give you the Option number directly.
Let's consider an example where you need to encode two options in a CoAP message: Uri-Port with value 7000 and Uri-Path with value "temp". Options are always encoded in increasing order of their Option numbers, so you first encode Uri-Port, which has Option number 7, and then Uri-Path, which has Option number 11.
Uri-Port
As this is the first option in the message, the Option delta is the same as the Option number, so Option delta = 0x7. The port value 7000 takes 2 bytes (0x1B58), so Option length = 0x2. This option therefore gets encoded as 72 1b 58.
Uri-Path
This is not the first option in this message. The Option delta here is this option's number minus the previous option's number, i.e. 11 - 7 = 4. Encoding "temp" takes 4 bytes, so Option length = 4. This option therefore gets encoded as 44 74 65 6d 70.
Note that this was the simplified case, where the Option delta and Option length do not exceed 12. When either value is larger, you encode it using the extended option delta/length forms specified in the RFC.
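Putting the two examples together, here is a minimal sketch of that encoding step (the encodeOption helper is mine, and it assumes deltas and lengths of at most 12, i.e. no extended forms):

    // Encode one CoAP option: high nibble = delta, low nibble = length,
    // followed by the raw option value bytes.
    function encodeOption(delta: number, value: Uint8Array): Uint8Array {
      const out = new Uint8Array(1 + value.length);
      out[0] = (delta << 4) | value.length;
      out.set(value, 1);
      return out;
    }

    const uriPort = encodeOption(7, Uint8Array.from([0x1b, 0x58]));         // 72 1b 58
    const uriPath = encodeOption(11 - 7, new TextEncoder().encode("temp")); // 44 74 65 6d 70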

Get bytes from /dev/urandom within range while keeping fair distribution

I want to generate random numbers in assembly (NASM, Linux) and I don't want to use libc (for didactic reasons), so I'm planning on reading /dev/urandom.
The thing is, I would like the numbers to be in a specific range.
For instance, let's say I want a number from 0 to 99.
When I read a byte from /dev/urandom it will come in the range 0x00 to 0xff (255).
One thing I could do is apply mod 100, which would guarantee the correct range.
But the problem with this approach is that some numbers have a greater chance of coming out than others.
The number 51 would come out from 3 different results:
51 % 100 = 51
151 % 100 = 51
251 % 100 = 51
The number 99 would come only from 2 different results:
99 % 100 = 99
199 % 100 = 99
(there will be no 299, since the range of a byte ends at 255).
The only solution I came up with involves discarding the random number when it is in the range 200-255 and reading another one.
Is there a more clever way to read a random byte and make sure it is in a certain range while staying "fair"?
What if I'm planning to read lots of bytes within a range? Is there a way to be fair without discarding lots of urandom reads?
I heard about the getrandom(2) Linux syscall, but it's not yet in a stable kernel (3.16.3 as of this time). Is there an alternative?
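Discarding is, in fact, the standard technique: it is called rejection sampling, and for a byte tested against the limit 200 only 56/256 (about 22%) of the draws are wasted. A sketch of the idea, written here in TypeScript on Node.js for readability (the same loop translates directly to a read(2) of /dev/urandom in assembly):

    import { openSync, readSync, closeSync } from "fs";

    function randomBelow(range: number): number {
      const limit = 256 - (256 % range); // 200 when range is 100
      const fd = openSync("/dev/urandom", "r");
      const buf = Buffer.alloc(64); // batch the reads to save syscalls
      try {
        for (;;) {
          const n = readSync(fd, buf, 0, buf.length, null);
          for (let i = 0; i < n; i++) {
            // Keep only bytes below the largest multiple of `range` that
            // fits in a byte, so every residue mod `range` is equally likely.
            if (buf[i] < limit) return buf[i] % range;
          }
        }
      } finally {
        closeSync(fd);
      }
    }

Batching the reads, as above, also answers the "lots of bytes" concern: the expected waste stays at about 22% of the drawn bytes no matter how many values you need.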

bitshift large strings for encoding QR Codes

As an example, suppose a QR Code data stream contains 55 data words (each one byte in length) and 15 error correction words (again one byte each). The data stream begins with a 12-bit header and ends with four 0 bits. So, 12 + 4 bits of header/footer (2 bytes) and 15 bytes of error correction leave me 53 bytes to hold 53 alphanumeric characters. The 53 bytes of data and 15 bytes of ec are supplied in a string of length 68 (str68). The problem seems simple enough: concatenate 2 bytes of (right-shifted) header data with str68, then left-shift the entire 70 bytes by 4 bits.
This is the first time in many years of programming that I have ever needed to do something like this. I am a C and bit-shifting noob, so please be gentle... I have done a little investigation and so far have not been able to figure out how to bit-shift 70 bytes of data; any help would be greatly appreciated.
Larger QR codes can hold 2000 bytes of data...
You need to look at this 4 bits at a time.
The first 4 bits you need to worry about are the lower bits of the first byte. Fortunately this is an easy case because they need to end up in the upper bits of the first byte.
The next 4 bits you need to worry about are the upper bits of the second byte. These need to end up as the lower bits of the first byte.
The next 4 bits you need to worry about are the lower bits of the second byte. But fortunately you already know how to do this because you already did it for the first byte.
You continue in this vein until you have dealt with the low nibble of the 70th byte.
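A minimal sketch of that loop (the shiftLeft4 helper and the byte-array representation are mine, not from the original post):

    // Shift an entire byte array left by 4 bits. The low nibble of each
    // byte moves up, and the high nibble of the next byte moves in.
    function shiftLeft4(bytes: Uint8Array): Uint8Array {
      const out = new Uint8Array(bytes.length);
      for (let i = 0; i < bytes.length; i++) {
        const next = i + 1 < bytes.length ? bytes[i + 1] : 0;
        out[i] = ((bytes[i] << 4) | (next >> 4)) & 0xff;
      }
      return out;
    }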

How many bytes of memory is a tweet?

140 characters. How much memory would that take up?
I'm trying to calculate how many tweets my EC2 Large instance MongoDB can hold.
Twitter uses UTF-8 encoded messages.
UTF-8 encodes a code point in up to four octets, making the maximum message size 140 x 4 = 560 8-bit bytes.
This is, of course, just for the raw messages, excluding storage overhead, indexing and other storage-related padding.
Edit: Twitter successfully let me post the message:
™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™
Yes, that's 140 trademark symbols, which are three octets each in UTF-8
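That arithmetic is easy to check; a quick sketch using the standard TextEncoder:

    // 140 trademark signs (U+2122), each three octets in UTF-8.
    const tweet = "\u2122".repeat(140);
    console.log(new TextEncoder().encode(tweet).length); // 420 = 140 * 3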
Back in September, an engineer at Twitter gave a presentation that suggested it's about 200 bytes per tweet.
Of course you still have to account for overhead for your own metadata and the database itself, but 200 bytes/record is probably a good place to start.
Typically it's two bytes per character if you're storing the text as UTF-16 (not UTF-8, which uses one to four bytes per character), so that would mean 280 bytes max per tweet.
Probably 284 bytes in memory (a 4-byte length prefix + length * 2 for a UTF-16 string). Inside the DB I cannot say, but probably around 280 if the DB stores two bytes per character, plus some bytes of overhead for metadata etc.
Potentially of interest:
http://mehack.com/map-of-a-twitter-status-object
Anatomy of a Twitter Status Object
Also more about twitter character encoding:
http://dev.twitter.com/pages/counting_characters
It's technically stored as UTF-8, and the slide deck from a Twitter engineer at http://www.slideshare.net/raffikrikorian/twitter-by-the-numbers gives the real stat:
140 characters, ~200 bytes
