Can someone please explain how this code works:
[...Buffer('abc')]
the result is:
[ 97, 98, 99 ]
First of all, consider this piece of code:
console.log([...[1, 2, 3]]); //[1, 2, 3]
The spread syntax takes an iterable (such as an array) and expands it into a list of elements.
Node.js's Buffer objects are essentially arrays of bytes, a way to represent characters and deal with binary data at the same time. You can read more about them at https://nodejs.org/api/buffer.html.
Now, since 'abc' consists of three ASCII characters, each character takes up exactly one byte, and that byte corresponds to its ASCII code.
You can get this with myString.charCodeAt(pos); in your case, 'abc'.charCodeAt(0) returns 97.
So,
[...Buffer('abc')]
will return an array containing the ASCII code of each character of 'abc', that is, [97, 98, 99].
Since a Buffer's default encoding is UTF-8, things get more exciting when you are dealing with Unicode:
console.log([...Buffer('漢字')]); //[230, 188, 162, 229, 173, 151]
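For comparison, the same byte values fall out of any UTF-8 encoder; here is a minimal Python sketch (purely for illustration, not part of the original Node.js answer):

# UTF-8 encoding of each string, matching the Buffer output above.
print(list('abc'.encode('utf-8')))   # [97, 98, 99]
print(list('漢字'.encode('utf-8')))  # [230, 188, 162, 229, 173, 151]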
Sorry for potential typos and hope this helps.
Related
I'm using URL-safe Base64 encoding to encode my randomly generated byte arrays, but I have a problem on decoding. When I decode two different strings (all but the last characters are identical), it produces the same byte array. For example, for both "dGVzdCBzdHJpbmr" and "dGVzdCBzdHJpbmq" the result is the same:
Array(116, 101, 115, 116, 32, 115, 116, 114, 105, 110, 106)
For encoding/decoding I use java.util.Base64 in that way:
// encoding...
Base64.getUrlEncoder().withoutPadding().encodeToString(myString.getBytes())
// decoding...
Base64.getUrlDecoder().decode(base64String)
What is the reason for this collision? Is it also possible with characters other than the last one? And how can I fix this so that decoding returns a different byte array for each different string?
The issue you are seeing is caused by the fact that the number of bytes in the "result" (11 bytes) doesn't completely "fill" the last char of the Base64-encoded string.
Remember that Base64 encodes each 8-bit entity into 6-bit chars. The resulting string then needs exactly 11 * 8 / 6 = 14 2/3 chars. But you can't write partial characters, so the encoder emits 15; only the first 4 bits (2/3 of the last char) are significant, and the last two bits are not decoded. Thus all of:
dGVzdCBzdHJpbmo
dGVzdCBzdHJpbmp
dGVzdCBzdHJpbmq
dGVzdCBzdHJpbmr
All decode to the same 11 bytes (116, 101, 115, 116, 32, 115, 116, 114, 105, 110, 106).
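Here is a minimal sketch demonstrating the collision (using Python's base64 module for illustration rather than java.util.Base64; the '=' padding is restored because Python's decoder requires it):

import base64

# All four strings differ only in the last char, whose low 2 bits are discarded,
# so they all decode to the same 11 bytes.
for s in ('dGVzdCBzdHJpbmo', 'dGVzdCBzdHJpbmp', 'dGVzdCBzdHJpbmq', 'dGVzdCBzdHJpbmr'):
    print(s, list(base64.urlsafe_b64decode(s + '=')))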
PS: Without padding, some decoders will try to decode the partial "last" byte as well, and you'll get a 12-byte result (with a different last byte). This is the reason for my comment (asking whether the withoutPadding() option is a good idea). But your decoder seems to handle this correctly.
Maybe this is just how Base64 encodes and decodes; see if this helps. If two strings differ only at the very end, the difference can fall into bits that the decoder discards, so the collision shows up in exactly that place.
The array you showed is the ASCII representation of "test strinj" (see http://www.unit-conversion.info/texttools/ascii/) and doesn't seem to be a Base64 representation of anything.
It seems like you are analysing the wrong 'result' array.
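A quick way to confirm this (a Python one-liner, purely for illustration):

print(bytes([116, 101, 115, 116, 32, 115, 116, 114, 105, 110, 106]).decode('ascii'))
# prints: test strinj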
I'm a little confused.
// default charset is UTF-8
val bytes = byteArrayOf(78, 23, 41, 51, -32, 42)
val str = String(bytes)
// here I get [78, 23, 41, 51, -17, -65, -67, 42]
val weird = str.toByteArray()
I put random values into the byte array for no particular reason. Why is the round trip inconsistent?
The issue here is that your bytes aren't a valid UTF-8 sequence.
Any sequence of bytes can be interpreted as valid ISO Latin-1, for example. (There may be issues with bytes having values 0–31, but those generally don't stop the characters being stored and processed.) Similar applies to most other 8-bit character sets.
But the same isn't true of UTF-8. While all sequences of bytes in the range 1–127 are valid UTF-8 (and interpreted the same as they are in ASCII and most 8-bit encodings), bytes in the range 128–255 can only appear in certain well-defined combinations. (This has several very useful properties: it lets you identify UTF-8 with a very high probability; it also avoids issues with synchronisation, searching, sorting, &c.)
In this case, the sequence in the question (which is 4E 17 29 33 E0 2A in unsigned hex) isn't valid UTF-8.
So when you try to convert it to a string using the default encoding (UTF-8), the JVM substitutes the replacement character (value U+FFFD, which looks like this: �) in place of each invalid character.
Then, when you convert that back to UTF-8, you get the UTF-8 encoding of the replacement character, which is EF BF BD. And if you interpret that as signed bytes, you get -17 -65 -67, as in the question.
So Kotlin/JVM is handling the invalid input as best it can.
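To see the same mechanics outside the JVM, here is a minimal Python sketch (just an illustration, not the original Kotlin; note that Python prints bytes unsigned, so 239 191 189 is the same as the signed -17 -65 -67):

# 0xE0 opens a 3-byte UTF-8 sequence, but 0x2A is not a continuation byte,
# so the decoder substitutes U+FFFD for the invalid byte.
data = bytes([78, 23, 41, 51, 0xE0, 42])
text = data.decode('utf-8', errors='replace')
print(list(text.encode('utf-8')))  # [78, 23, 41, 51, 239, 191, 189, 42]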
So I am trying to XOR two strings together, but am unsure if I am doing it correctly when the strings are different lengths.
The method I am using is as follows.
def xor_two_str(a, b):
    xored = []
    for i in range(max(len(a), len(b))):
        xored_value = ord(a[i % len(a)]) ^ ord(b[i % len(b)])
        xored.append(hex(xored_value)[2:])
    return ''.join(xored)
I get output like so.
abc XOR abc: 000
abc XOR ab: 002
ab XOR abc: 5a
space XOR space: 0
I know something is wrong, and since I will eventually want to convert the hex values back to ASCII, I'm worried the foundation is wrong. Any help would be greatly appreciated.
Your code looks mostly correct (assuming the goal is to reuse the shorter input by cycling back to the beginning), but your output has a minor problem: it's not fixed-width per character, so you could get the same output from two pairs of characters with a small (< 16) difference as from a single pair of characters with a large difference.
Assuming you're only working with "bytes-like" strings (all inputs have ordinal values below 256), you'll want to zero-pad your hex output to a fixed width of two, changing:
xored.append(hex(xored_value)[2:])
to:
xored.append('{:02x}'.format(xored_value))
which saves a temporary string (hex plus slicing builds the longer string and then slices off the 0x prefix, whereas the format string produces the result directly without the prefix) and zero-pads to a width of two.
There are other improvements possible for more Pythonic/performant code, but that should be enough to make your code produce usable results.
Side-note: When running your original code, xor_two_str('abc', 'ab') and xor_two_str('ab', 'abc') both produced the same output, 002 (Try it online!), which is what you'd expect (since xor-ing is commutative, and you cycle the shorter input, reversing the arguments to any call should produce the same results). Not sure why you think it produced 5a. My fixed code (Try it online!) just makes the outputs 000000, 000002, 000002, and 00; padded properly, but otherwise unchanged from your results.
As far as other improvements to make, manually converting character by character, and manually cycling the shorter input via remainder-and-indexing is a surprisingly costly part of this code, relative to the actual work performed. You can do a few things to reduce this overhead, including:
Convert from str to bytes once, up-front, in bulk (runs in roughly one seventh the time of the fastest character by character conversion)
Determine up front which string is shortest, and use itertools.cycle to extend it as needed, and zip to directly iterate over paired byte values rather than indexing at all
Together, this gets you:
from itertools import cycle

def xor_two_str(a, b):
    # Convert to bytes so we iterate by ordinal; determine which is longer
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = []
    for x, y in zip(long, cycle(short)):
        xored_value = x ^ y
        xored.append('{:02x}'.format(xored_value))
    return ''.join(xored)
Or, to make it even more concise and fast, we can build a bytes object without converting to hex as we go (and, just for fun, use map plus operator.xor to avoid Python-level loops entirely, pushing all the work to the C layer in the CPython reference interpreter), then convert to a hex str in bulk with the bytes.hex method (new in 3.5):
from itertools import cycle
from operator import xor

def xor_two_str(a, b):
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = bytes(map(xor, long, cycle(short)))
    return xored.hex()
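For example (these outputs assume the question's test inputs, now zero-padded):

print(xor_two_str('abc', 'abc'))  # 000000
print(xor_two_str('abc', 'ab'))   # 000002
print(xor_two_str(' ', ' '))      # 00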
I would like to read in a JPEG header and analyze it.
According to Wikipedia, the header consists of a sequence of markers. Each marker starts with FF xx, where xx is a specific marker ID.
So my idea was simply to read the image in binary format and search for the corresponding byte combinations in the stream. That should let me split the header into its marker fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return nice chunks of 2 bytes (in hex format)?
I would expect something like this:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, and * in the returned string? Those are not characters of the hex representation I expect from a byte string (only 0-9 and a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (except a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does Python not return nice chunks of 2 bytes (in hex format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want two-byte chunks, transform the data after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over the data and convert each pair of bytes, or use struct.unpack (there are actually several ways):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and unsigned short H in unpack, because these are the conventional ways to represent JPEG 2-byte codes (JPEG stores multi-byte values big-endian). Check the documentation if you want to deviate from this.
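If the end goal is really to walk the marker structure, a minimal sketch along these lines might be a starting point (simplified assumptions: a well-formed file and only length-bearing segments before the scan; list_markers is a made-up helper name):

from struct import unpack

def list_markers(path, limit=10):
    # Walk JPEG header segments up to the start-of-scan marker (FF DA).
    with open(path, 'rb') as f:
        data = f.read()
    pos = 2  # skip the initial SOI marker FF D8, which has no length field
    markers = []
    while pos + 4 <= len(data) and len(markers) < limit:
        if data[pos] != 0xFF:
            break  # lost sync; a real parser would resynchronise or abort
        marker = data[pos + 1]
        (length,) = unpack('>H', data[pos + 2:pos + 4])  # length includes its own 2 bytes
        markers.append((hex(marker), length))
        if marker == 0xDA:  # start of scan: entropy-coded data follows
            break
        pos += 2 + length
    return markers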
I am looking at a MIME segment of an email, which claims to be encoded as base64. The entire value is
ICA=
The conversion Convert.FromBase64String( "ICA=" ) returns a two-byte array, and both values are 32, which looks to me like two spaces. There is no error.
I have read about Base64, but I haven't grasped why ICA= becomes spaces.
Use the Base64 alphabet to convert "ICA=" to six-digit binary numbers. "I" is 8, which is "001000" in binary. "C" is 2, which is "000010". "A" is 0, which is "000000". Ignore the placeholder "=". Concatenate the binary numbers and take 8-bit bytes out of the result, ignoring the leftover bits. So you have "001000000010000000" (18 bits), which yields "00100000" and "00100000" with two bits left over; both bytes are 32, the decimal code for a space character.
Base64 encoding alphabet and explanation can be found here: https://www.rfc-editor.org/rfc/rfc4648
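A quick check (in Python for illustration; the question uses C#'s Convert.FromBase64String):

import base64
decoded = base64.b64decode("ICA=")
print(list(decoded))  # [32, 32]
print(decoded)        # b'  ' (two spaces)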