I have two bytes in:
b'T'
and
b'\x40' (only bit #6 is set)
I need to perform a check on the first byte to see if bit #6 is set. For example, on [A-Za-z] it would be set, but on some other characters it would not be set.
if (b'T' & b'\x40') != 0:
    print("set");
does not work ...
Byte values, when indexed, give integer values. Use that to your advantage:
value = b'T'
if value[0] & 0x40:
    print('set')
You cannot use the & operator on bytes, but it works just fine on integers.
See the documentation on the bytes type:
While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256[.]
…
Since bytes objects are sequences of integers (akin to a tuple), for a bytes object b, b[0] will be an integer[.]
Note that non-zero numbers always test as true in a boolean context, so there is no need to explicitly test for != 0 here.
You are looking for the ord built-in function, which converts single-character strings (byte or unicode) to the corresponding numeric codepoint.
if ord(b'T') & 0x40:
    print("set")
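As a minimal sketch of both approaches above (the helper name bit_is_set and the constant BIT_6 are made up for illustration), the check can be wrapped in a small function. If you really want to AND two bytes objects together, you can do it element by element:

```python
BIT_6 = 0x40  # hypothetical constant name; same mask as b'\x40'

def bit_is_set(data: bytes, mask: int, index: int = 0) -> bool:
    """Return True if any bit of `mask` is set in the byte at `index`."""
    # Indexing a bytes object yields an int, so & works directly.
    return bool(data[index] & mask)

letter_has_bit = bit_is_set(b'T', BIT_6)   # 0x54 & 0x40 is non-zero
digit_has_bit = bit_is_set(b'5', BIT_6)    # 0x35 & 0x40 is zero

# Element-wise AND of two equal-length bytes objects, if that is what you need.
anded = bytes(x & y for x, y in zip(b'T', b'\x40'))
```

For a single byte, data[0] and ord(data) give the same integer, so either answer works.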
I am trying to create a binary message to send over a socket, but I'm having trouble with the way TCL treats all variables as strings. I need to calculate the length of a string and know its value in binary.
set length [string length $message]
set binaryMessagePart [binary format s* { $length 0 }]
However, when I run this I get the error 'expected integer but got "$length"'. How do I get this to work and return the value for the integer 5 and not the char 5?
To calculate the length of a string, use string length. To calculate the length of a string in a particular encoding, convert the string to that encoding and use string length:
set enc "utf-8"; # Or whatever; you need to know this ahead of time for sanity's sake
set encoded [encoding convertto $enc $message]
set length [string length $encoded]
Note that with the encoded length, this will be in bytes whereas the length prior to encoding is in characters. For some messages and some encodings, the difference can be substantial.
To compose a binary message with the length and the body of the message (a fairly common binary format), use binary format like this:
# Assumes the length is big-endian; for little-endian, use i instead of I
set binPart [binary format "Ia*" $length $encoded]
What went wrong was that s* consumes a list of integers and produces a sequence of little-endian short integer binary values in the output string, but you were feeding it a list that was literally $length 0. Braces prevent variable substitution, and the string $length is not an integer, as integers don't start with $. We could instead have used [list $length 0] to produce the argument to s*, and that would have worked, but that doesn't seem quite right for the context of the question.
In binary format, these are the common formats (there are many more):
a is for string data (mnemonically “ASCII”); this is binary string data, and you need to encode it first.
i and I are for 32-bit numbers (mnemonically “int” like in many programming languages, but especially C). Upper case is big-endian, lower case is little-endian.
s and S are for 16-bit numbers (mnemonically “short”).
c is for 8-bit numbers (mnemonically “char” from C).
w and W are for 64-bit numbers (mnemonically “wide integers”).
f and d are for IEEE binary floating point numbers (mnemonically “float” and “double” respectively, so 4 and 8 bytes).
All can be followed by an optional length, either a number or a *. For the numeric formats, instead of inserting a single number they then insert a list of them (and so consume a list); a number gives a fixed length, and * does “all the list”. For the string format indicator, a number uses a fixed number of bytes in the message (truncating or padding with zero bytes as necessary) and * does “all the string” (never truncating or padding).
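For comparison only, here is a rough Python sketch of the same length-prefixed layout, using the standard struct module. The function name frame_message and the sample text are made up for illustration. It assumes a 32-bit big-endian length followed by the encoded body, as in the Ia* example above:

```python
import struct

def frame_message(message: str, encoding: str = "utf-8") -> bytes:
    """Prefix the encoded message with its length in bytes (32-bit, big-endian)."""
    encoded = message.encode(encoding)
    # ">I" packs an unsigned 32-bit integer, most significant byte first.
    return struct.pack(">I", len(encoded)) + encoded

text = "héllo"                 # 5 characters, but 6 bytes in UTF-8
framed = frame_message(text)
header, body = framed[:4], framed[4:]
```

Note that the length in the header is the encoded byte count, not the character count, which is the distinction made earlier in this answer.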
After several years of writing code for my own use, I'm trying to understand what it really means.
a = "Foo"
b = ""
c = 5
d = True
a - string variable. "Foo" (with quotes) - string literal, i.e. an entity of the string data type.
b - string variable. "" - empty string.
c - integer variable. 5 - integer literal, i.e. an entity of the integral data type.
d - Boolean variable. True - Boolean value, i.e. an entity of the Boolean data type.
Questions:
Is my understanding correct?
It seems that 5 is an integer literal, which is an entity of the integral data type. "Integer" and "integral": why do we use different words here?
What are "string" and "integer"?
As I understand from Wikipedia, "string" and "integer" are not the same thing as string/integer literals or data types. In other words, there are 3 pairs of terms:
string literal, integer literal
string data type, integer data type
string, integer
Firstly, a literal value is any value which appears literally in code, e.g. "hello" is a string literal, 123 is an integer literal, etc. In contrast, for example:
int a = 5;
int b = 2;
int c = a + b;
a and b have literal values assigned to them, but c does not; it has a computed value assigned to it.
We describe any literal value by its data type (as in the first sentence), e.g. "string literal" or "integer literal".
Now a data type refers to how the computer, or the software running on the computer, interprets the binary value of some data. For most kinds of data, the interpretation of the bytes is typically defined in a standard. UTF-8, for example, is one way to interpret the bytes of a string's internal (binary) value. Interestingly, the actual bytes of a string are treated as unsigned 8-bit integers. In UTF-8, the values of those integers are combined in various ways to determine which glyph, or character, should appear on the screen when those values are encountered in the data. UTF-8 is a variable-length encoding which uses between 1 and 4 bytes per character (8 to 32 bits).
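To make the variable-length point concrete, here is a small Python sketch (the sample characters are chosen arbitrarily for illustration) that prints the UTF-8 bytes of a few characters as unsigned 8-bit integers:

```python
samples = ["A", "é", "€", "😀"]  # arbitrary examples needing 1, 2, 3 and 4 bytes

for ch in samples:
    encoded = ch.encode("utf-8")
    # list() on a bytes object shows each byte as an integer in range(256)
    print(repr(ch), len(encoded), list(encoded))

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```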
For numbers, particularly integers, implementations can vary. A common representation uses four bytes. When the most significant byte comes first, this is referred to as big-endian ordering of the bytes in a multi-byte integer; the first bit of the first byte is then the sign bit for signed integers, or simply the most significant bit for unsigned integers. There is also little-endian ordering, where the least significant byte comes first. Integers can in principle use any number of bytes, but the most commonly implemented sizes are 1, 2, 4 and sometimes 8 bytes, that is 8, 16, 32 or 64 bits, respectively. Integer sizes other than these typically require a custom implementation.
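As a quick illustration of the two byte orders, here is a sketch in Python using int.to_bytes (the value 0x12345678 is just an arbitrary example):

```python
value = 0x12345678

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

# A negative number in a signed 4-byte (two's complement) representation:
negative = (-1).to_bytes(4, byteorder="big", signed=True)

# In big-endian order, the top bit of the first byte is the sign bit.
sign_bit_positive = big[0] >> 7
sign_bit_negative = negative[0] >> 7
```

The same four bytes appear in both cases, only in reverse order, so both sides of a conversation need to agree on which ordering is in use.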
For floating point numbers it gets a bit more tricky. There is a common standard for floating point numbers called IEEE-754 which describes how floats are encoded. As with integers, there are different sizes and variations, but primarily we use 16, 32 and 64 bits, and sometimes 24 bits in some mobile device graphics implementations. There are also extended precision floats which use 40 or 80 bits.
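For the two most common sizes, here is a short Python sketch using the standard struct module that shows the IEEE-754 bytes of the value 1.0 (the value is picked only because its bit pattern is easy to read):

```python
import struct

single = struct.pack(">f", 1.0)  # 32-bit float, big-endian
double = struct.pack(">d", 1.0)  # 64-bit float, big-endian

print(single.hex())  # sign, exponent and fraction bits packed into 4 bytes
print(double.hex())  # same value, wider exponent and fraction in 8 bytes
```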
What is the difference between string and character class in MATLAB?
a = 'AX'; % This is a character.
b = string(a) % This is a string.
The documentation suggests:
There are two ways to represent text in MATLAB®. You can store text in character arrays. A typical use is to store short pieces of text as character vectors. And starting in Release 2016b, you can also store multiple pieces of text in string arrays. String arrays provide a set of functions for working with text as data.
This is how the two representations differ:
Element access. To represent char vectors of different length, one had to use cell arrays, e.g. ch = {'a', 'ab', 'abc'}. With strings, they can be created in actual arrays: str = [string('a'), string('ab'), string('abc')].
However, to index characters in a string array directly, the curly bracket notation has to be used:
str{3}(2) % == 'b'
Memory use. Chars use exactly two bytes per character. Strings have overhead:
a = 'abc'
b = string('abc')
whos a b
returns
  Name      Size      Bytes  Class     Attributes
  a         1x3           6  char
  b         1x1         132  string
The best place to start for understanding the difference is the documentation. The key difference, as stated there:
A character array is a sequence of characters, just as a numeric array is a sequence of numbers. A typical use is to store short pieces of text as character vectors, such as c = 'Hello World';.
A string array is a container for pieces of text. String arrays provide a set of functions for working with text as data. To convert text to string arrays, use the string function.
Here are a few more key points about their differences:
They are different classes (i.e. types): char versus string. As such they will have different sets of methods defined for each. Think about what sort of operations you want to do on your text, then choose the one that best supports those.
Since a string is a container class, be mindful of how its size differs from an equivalent character array representation. Using your example:
>> a = 'AX'; % This is a character.
>> b = string(a) % This is a string.
>> whos
  Name      Size      Bytes  Class     Attributes
  a         1x2           4  char
  b         1x1         134  string
Notice that the string container lists its size as 1x1 (and takes up more bytes in memory) while the character array is, as its name implies, a 1x2 array of characters.
They can't always be used interchangeably, and you may need to convert between the two for certain operations. For example, string objects can't be used as dynamic field names for structure indexing:
>> s = struct('a', 1);
>> name = string('a');
>> s.(name)
Argument to dynamic structure reference must evaluate to a valid field name.
>> s.(char(name))
ans =
1
Strings do have a bit of overhead, but their size still increases by 2 bytes per character. The size of the variable grows in steps, after every 8 characters. The red line is y=2x+127.
The figure is created using:
v=[];N=100;
for ct = 1:N
    s=char(randi([0 255],[1,ct]));
    s=string(s);
    a=whos('s');v(ct)=a.bytes;
end
figure(1);clf
plot(v)
xlabel('# characters')
ylabel('# bytes')
p=polyfit(1:N,v,1);
hold on
plot([0,N],[127,2*N+127],'r')
hold off
One important practical thing to note is that strings and chars behave differently when interacting with square brackets. This can be especially confusing when coming from Python. Consider the following example:
>> ['asdf' '123']
ans =
'asdf123'
>> ["asdf" "123"]
ans =
1×2 string array
"asdf" "123"
I needed to add some padding to a bytes string. This is what I came up with:
if padding_len > 0:
    data += bytes.fromhex('00') * padding_len
Is there a nicer way of representing a null byte in Python 3 than bytes.fromhex('00')?
>>> bytes.fromhex('00') == b'\x00'
True
>>> b'\x00' * 10
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
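A couple of other spellings give the same padding. This is only a sketch; data and padding_len here are placeholder values, not the asker's real ones:

```python
data = b'abc'      # placeholder payload
padding_len = 5    # placeholder padding size

padded_literal = data + b'\x00' * padding_len
# bytes(n) with an integer argument creates n zero bytes.
padded_ctor = data + bytes(padding_len)
# bytes.ljust pads on the right up to a total width with the given fill byte.
padded_ljust = data.ljust(len(data) + padding_len, b'\x00')
```

Which one reads best is a matter of taste; b'\x00' * padding_len is probably the most explicit.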
EDIT:
As hobbs points out in a comment below, it is certainly possible to use b'\0' instead of b'\x00' and as j-f-sebastian points out, it is not confusing in this instance.
But I wouldn't do it, anyway. It certainly works in this context (and it saves you two characters, if that's important). It will even work in the most common other context, where you are building strings for C and putting a null byte at the end.
But in the most general case, it can lead to confusion, because the compiler's interpretation of b'\0' is highly data dependent. In other words, it changes according to what comes after that zero.
>>> b'\0ABC' == b'\00ABC'
True
>>> b'\0ABC' == b'\000ABC'
True
>>> b'\0ABC' == b'\0000ABC'
False
If you are debugging late at night when not all your brain cells are functioning, it is highly frustrating to have the length of a string change because you replaced a character in the string. All you have to do to avoid this is to always use two extra characters. It doesn't matter whether you use \x00 (hexadecimal) or \000 (octal): both of those will work properly no matter the value of the following character.
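To see the length change described above, here is a small sketch (the variable names are made up) in which the byte after the null happens to be an octal digit:

```python
followed_by_letter = b'\0A'   # null byte, then 'A'
followed_by_digit = b'\01'    # parsed as the single octal escape \01
explicit_hex = b'\x001'       # \x00 always takes exactly two hex digits, then '1'
explicit_octal = b'\0001'     # \000 uses all three octal digits, then '1'

lengths = [len(followed_by_letter), len(followed_by_digit),
           len(explicit_hex), len(explicit_octal)]
```

Only the short \0 form silently swallows the following digit.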
I want to concatenate the first byte of a bytes string to the end of the string:
a = b'\x14\xf6'
a += a[0]
I get an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to int
When I type bytes(a[0]) I get:
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
And bytes({a[0]}) gives the correct b'\x14'.
Why do I need {} ?
If you want to change your byte sequence, you should use a bytearray. It is mutable and has the .append method:
>>> a = bytearray(b'\x14\xf6')
>>> a.append(a[0])
>>> a
bytearray(b'\x14\xf6\x14')
What happens in your approach: when you do
a += a[0]
you are trying to add an integer to a bytes object. That doesn't make sense, since you are trying to add different types.
If you do
bytes(a[0])
you get a bytes object of length 20, as the documentation describes:
If [the argument] is an integer, the array will have that size and will be initialized with null bytes.
If you use curly braces, you are creating a set, and a different option in the constructor is chosen:
If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.
Bytes don't work quite like strings. When you index with a single value (rather than a slice), you get an integer, rather than a length-one bytes instance. In your case, a[0] is 20 (hex 0x14).
A similar issue happens with the bytes constructor. If you pass a single integer in as the argument (rather than an iterable), you get a bytes instance that consists of that many null bytes ("\x00"). This explains why bytes(a[0]) gives you twenty null bytes. The version with the curly brackets works because it creates a set (which is iterable).
To do what you want, I suggest slicing a[0:1] rather than indexing with a single value. This will give you a bytes instance that you can concatenate onto your existing value.
a += a[0:1]
bytes is a sequence type. Its individual elements are integers. You can't do a + a[0] for the same reason you can't do a + a[0] if a is a list. You can only concatenate a sequence with another sequence.
bytes(a[0]) gives you that because a[0] is an integer, and as documented doing bytes(someInteger) gives you a sequence of that many zero bytes (e.g., bytes(3) gives you 3 zero bytes).
{a[0]} is a set. When you do bytes({a[0]}) you convert the contents of that set into a bytes object. This is not a great way to do it in general, because sets are unordered, so if you try to do it with more than one byte in there you may not get what you expect.
The easiest way to do what you want is a + a[:1]. You could also do a + bytes([a[0]]). There is no shortcut for creating a single-element bytes object; you have to either use a slice or make a length-one sequence of that byte.
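Putting the options from these answers side by side, as a sketch using the value from the question (the variable names are made up):

```python
a = b'\x14\xf6'

by_slice = a + a[:1]          # slicing keeps the bytes type
by_list = a + bytes([a[0]])   # wrap the int in a one-element list

buf = bytearray(a)            # mutable alternative
buf.append(a[0])              # append takes an int in range(256)

zeros = bytes(a[0])           # integer argument: that many null bytes
```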
Try this
values = [0x49, 0x7A]
concat = (values[0] << 8) + values[1]
print(hex(concat))
You should get 0x497a (hex() prints lowercase digits).
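An alternative sketch that avoids the manual shift is int.from_bytes, assuming the first value is the most significant byte, as in the shift version above:

```python
values = [0x49, 0x7A]

shifted = (values[0] << 8) | values[1]
# bytes(values) builds b'\x49\x7a'; "big" means most significant byte first.
combined = int.from_bytes(bytes(values), byteorder="big")

print(hex(combined))
```

This also scales to more than two bytes without adding more shifts.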