I am new to python and doing embedded-related work (most of my programming experience has been using C).
I am reading a four-byte float into a bytearray from a serial port but instead of the normal little-endian order DCBA, it is encoded as CDAB. Or it may be encoded as BADC. (where A is most significant byte and D is LSB). What is the correct way of swapping bytes around in a bytearray?
For example, I have
tmp=bytearray(pack("f",3.14))
I want to be able to arbitrarily arrange the bytes in tmp, and then unpack() it back into a float.
Things like this seem essential when doing anything related to embedded systems, but either I am googling it wrong or no clear answer is out there (yet!).
edit:
sure, I can do this:
from struct import *
def modswap(num):
tmp=bytearray(pack("f",num))
res=bytearray()
res.append(tmp[2])
res.append(tmp[3])
res.append(tmp[0])
res.append(tmp[1])
return unpack('f',res)
def main():
print(modswap(3.14))
but there has to be a better way...
Ideally, I would like to be able to slice and re-concatenate as I please, or even replace a slice at a time if possible.
You can swizzle in one step:
from struct import pack,unpack
def modswap(num):
tmp=bytearray(pack("f",num))
tmp[0],tmp[1],tmp[2],tmp[3] = tmp[2],tmp[3],tmp[0],tmp[1]
return unpack('f',tmp)
You can modify a slice of a byte array:
>>> data = bytearray(b'0123456789')
>>> data[3:7] = data[5],data[6],data[3],data[4]
>>> data
bytearray(b'0125634789')
I came across the same problem and the closest answer is in this thread.
In Python3 the solution is:
b''.join((tmp[2:4],tmp[0:2]))
Related
I've started taking a look at Nim for hobby game modding purposes.
Intro
Yet, I found it difficult to work with Nim compared to C when it comes to machine-specific low-level memory layout and would like to know if Nim actually has better support here.
I need to control byte order and be able to de/serialize arbitrary Plain-Old-Datatype objects to binary custom file formats. I didn't directly find a Nim library which allows flexible storage options like representing enum and pointers with Big-Endian 32-bit. Or maybe I just don't know how to use the feature.
std/marshal : just JSON, i.e. no efficient, flexible nor binary format but cross-compatible
nim-serialization : seems like being made for human readable formats
nesm : flexible cross-compatibility? (It has some options and has a good interface)
flatty : no flexible cross-compatibility, no byte order?
msgpack4nim : no flexible cross-compatibility, byte order?
bingo : ?
Flexible cross-compatibility means, it must be able to de/serialize fields independently of Nim's ABI but with customization options.
Maybe "Kaitai Struct" is more what I look for, a file parser with experimental Nim support.
TL;DR
As a workaround for a serialization library I tried myself at a recursive "member fields reverser" that makes use of std/endians which is almost sufficient.
But I didn't succeed with implementing byte reversal of arbitrarily long objects in Nim. Not practically relevant but I still wonder if Nim has a solution.
I found reverse() and reversed() from std/algorithm but I need a byte array to reverse it and turn it back into the original object type. In C++ there would be reinterprete_cast, in C there is void*-cast, in D there is a void[] cast (D allows defining array slices from pointers) but I couldn't get it working with Nim.
I tried cast[ptr array[value.sizeof, byte]](unsafeAddr value)[] but I can't assign it to a new variable. Maybe there was a different problem.
How to "byte reverse" arbitrary long Plain-Old-Datatype objects?
How to serialize to binary files with byte order, member field size, pointer as file "offset - start offset"? Are there bitfield options in Nim?
It is indeed possible to use algorithm.reverse and the appropriate cast invocation to reverse bytes in-place:
import std/[algorithm,strutils,strformat]
type
LittleEnd{.packed.} = object
a: int8
b: int16
c: int32
BigEnd{.packed.} = object
c: int32
b: int16
a: int8
## just so we can see what's going on:
proc `$`(b: LittleEnd):string = &"(a:0x{b.a.toHex}, b:0x{b.b.toHex}, c:0x{b.c.toHex})"
proc `$`(l:BigEnd):string = &"(c:0x{l.c.toHex}, b:0x{l.b.toHex}, a:0x{l.a.toHex})"
var lit = LittleEnd(a: 0x12, b:0x3456, c: 0x789a_bcde)
echo lit # (a:0x12, b:0x3456, c:0x789ABCDE)
var big:BigEnd
copyMem(big.addr,lit.addr,sizeof(lit))
# here's the reinterpret_cast you were looking for:
cast[var array[sizeof(big),byte]](big.addr).reverse
echo big # (c:0xDEBC9A78, b:0x5634, a:0x12)
for C-style bitfields there is also the {.bitsize.} pragma
but using it causes Nim to lose sizeof information, and of course bitfields wont be reversed within bytes
import std/[algorithm,strutils,strformat]
type
LittleNib{.packed.} = object
a{.bitsize: 4}: int8
b{.bitsize: 12}: int16
c{.bitsize: 20}: int32
d{.bitsize: 28}: int32
BigNib{.packed.} = object
d{.bitsize: 28}: int32
c{.bitsize: 20}: int32
b{.bitsize: 12}: int16
a{.bitsize: 4}: int8
const nibsize = 8
proc `$`(b: LittleNib):string = &"(a:0x{b.a.toHex(1)}, b:0x{b.b.toHex(3)}, c:0x{b.c.toHex(5)}, d:0x{b.d.toHex(7)})"
proc `$`(l:BigNib):string = &"(d:0x{l.d.toHex(7)}, c:0x{l.c.toHex(5)}, b:0x{l.b.toHex(3)}, a:0x{l.a.toHex(1)})"
var lit = LitNib(a: 0x1,b:0x234, c:0x56789, d: 0x0abcdef)
echo lit # (a:0x1, b:0x234, c:0x56789, d:0x0ABCDEF)
var big:BigNib
copyMem(big.addr,lit.addr,nibsize)
cast[var array[nibsize,byte]](big.addr).reverse
echo big # (d:0x5DEBC0A, c:0x8967F, b:0x123, a:0x4)
It's less than optimal to copy the bytes over, then rearrange them with reverse, anyway, so you might just want to copy the bytes over in a loop. Here's a proc that can swap the endianness of any object, (including ones for which sizeof is not known at compiletime):
template asBytes[T](x:var T):ptr UncheckedArray[byte] =
cast[ptr UncheckedArray[byte]](x.addr)
proc swapEndian[T,U](src:var T,dst:var U) =
assert sizeof(src) == sizeof(dst)
let len = sizeof(src)
for i in 0..<len:
dst.asBytes[len - i - 1] = src.asBytes[i]
Bit fields are supported in Nim as a set of enums:
type
MyFlag* {.size: sizeof(cint).} = enum
A
B
C
D
MyFlags = set[MyFlag]
proc toNum(f: MyFlags): int = cast[cint](f)
proc toFlags(v: int): MyFlags = cast[MyFlags](v)
assert toNum({}) == 0
assert toNum({A}) == 1
assert toNum({D}) == 8
assert toNum({A, C}) == 5
assert toFlags(0) == {}
assert toFlags(7) == {A, B, C}
For arbitrary bit operations you have the bitops module, and for endianness conversions you have the endians module. But you already know about the endians module, so it's not clear what problem you are trying to solve with the so called byte reversal. Usually you have an integer, so you first convert the integer to byte endian format, for instance, then save that. And when you read back, convert from byte endian format and you have the int. The endianness procs should be dealing with reversal or not of bytes, so why do you need to do one yourself? In any case, you can follow the source hyperlink of the documentation and see how the endian procs are implemented. This can give you an idea of how to cast values in case you need to do some yourself.
Since you know C maybe the last resort would be to write a few serialization functions and call them from Nim, or directly embed them using the emit pragma. However this looks like the least cross platform and pain free option.
Can't answer anything about generic data structure serialization libraries. I stray away from them because they tend to require hand holding imposing certain limitations on your code and depending on the feature set, a simple refactoring (changing field order in your POD) may destroy the binary compatibility of the generated output without you noticing it until runtime. So you end up spending additional time writing unit tests to verify that the black box you brought in to save you some time behaves as you want (and keeps doing so across refactorings and version upgrades!).
I know that python has a len() function that is used to determine the size of a string, but I was wondering why it's not a method of the string object?
Strings do have a length method: __len__()
The protocol in Python is to implement this method on objects which have a length and use the built-in len() function, which calls it for you, similar to the way you would implement __iter__() and use the built-in iter() function (or have the method called behind the scenes for you) on objects which are iterable.
See Emulating container types for more information.
Here's a good read on the subject of protocols in Python: Python and the Principle of Least Astonishment
Jim's answer to this question may help; I copy it here. Quoting Guido van Rossum:
First of all, I chose len(x) over x.len() for HCI reasons (def __len__() came much later). There are two intertwined reasons actually, both HCI:
(a) For some operations, prefix notation just reads better than postfix — prefix (and infix!) operations have a long tradition in mathematics which likes notations where the visuals help the mathematician thinking about a problem. Compare the easy with which we rewrite a formula like x*(a+b) into x*a + x*b to the clumsiness of doing the same thing using a raw OO notation.
(b) When I read code that says len(x) I know that it is asking for the length of something. This tells me two things: the result is an integer, and the argument is some kind of container. To the contrary, when I read x.len(), I have to already know that x is some kind of container implementing an interface or inheriting from a class that has a standard len(). Witness the confusion we occasionally have when a class that is not implementing a mapping has a get() or keys() method, or something that isn’t a file has a write() method.
Saying the same thing in another way, I see ‘len‘ as a built-in operation. I’d hate to lose that. /…/
Python is a pragmatic programming language, and the reasons for len() being a function and not a method of str, list, dict etc. are pragmatic.
The len() built-in function deals directly with built-in types: the CPython implementation of len() actually returns the value of the ob_size field in the PyVarObject C struct that represents any variable-sized built-in object in memory. This is much faster than calling a method -- no attribute lookup needs to happen. Getting the number of items in a collection is a common operation and must work efficiently for such basic and diverse types as str, list, array.array etc.
However, to promote consistency, when applying len(o) to a user-defined type, Python calls o.__len__() as a fallback. __len__, __abs__ and all the other special methods documented in the Python Data Model make it easy to create objects that behave like the built-ins, enabling the expressive and highly consistent APIs we call "Pythonic".
By implementing special methods your objects can support iteration, overload infix operators, manage contexts in with blocks etc. You can think of the Data Model as a way of using the Python language itself as a framework where the objects you create can be integrated seamlessly.
A second reason, supported by quotes from Guido van Rossum like this one, is that it is easier to read and write len(s) than s.len().
The notation len(s) is consistent with unary operators with prefix notation, like abs(n). len() is used way more often than abs(), and it deserves to be as easy to write.
There may also be a historical reason: in the ABC language which preceded Python (and was very influential in its design), there was a unary operator written as #s which meant len(s).
There is a len method:
>>> a = 'a string of some length'
>>> a.__len__()
23
>>> a.__len__
<method-wrapper '__len__' of str object at 0x02005650>
met% python -c 'import this' | grep 'only one'
There should be one-- and preferably only one --obvious way to do it.
There are some great answers here, and so before I give my own I'd like to highlight a few of the gems (no ruby pun intended) I've read here.
Python is not a pure OOP language -- it's a general purpose, multi-paradigm language that allows the programmer to use the paradigm they are most comfortable with and/or the paradigm that is best suited for their solution.
Python has first-class functions, so len is actually an object. Ruby, on the other hand, doesn't have first class functions. So the len function object has it's own methods that you can inspect by running dir(len).
If you don't like the way this works in your own code, it's trivial for you to re-implement the containers using your preferred method (see example below).
>>> class List(list):
... def len(self):
... return len(self)
...
>>> class Dict(dict):
... def len(self):
... return len(self)
...
>>> class Tuple(tuple):
... def len(self):
... return len(self)
...
>>> class Set(set):
... def len(self):
... return len(self)
...
>>> my_list = List([1,2,3,4,5,6,7,8,9,'A','B','C','D','E','F'])
>>> my_dict = Dict({'key': 'value', 'site': 'stackoverflow'})
>>> my_set = Set({1,2,3,4,5,6,7,8,9,'A','B','C','D','E','F'})
>>> my_tuple = Tuple((1,2,3,4,5,6,7,8,9,'A','B','C','D','E','F'))
>>> my_containers = Tuple((my_list, my_dict, my_set, my_tuple))
>>>
>>> for container in my_containers:
... print container.len()
...
15
2
15
15
Something missing from the rest of the answers here: the len function checks that the __len__ method returns a non-negative int. The fact that len is a function means that classes cannot override this behaviour to avoid the check. As such, len(obj) gives a level of safety that obj.len() cannot.
Example:
>>> class A:
... def __len__(self):
... return 'foo'
...
>>> len(A())
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
len(A())
TypeError: 'str' object cannot be interpreted as an integer
>>> class B:
... def __len__(self):
... return -1
...
>>> len(B())
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
len(B())
ValueError: __len__() should return >= 0
Of course, it is possible to "override" the len function by reassigning it as a global variable, but code which does this is much more obviously suspicious than code which overrides a method in a class.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
Python has a built in function sum, which is effectively equivalent to:
def sum2(iterable, start=0):
return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since the sum was added as a built-in function (rev. 32347)
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..
>>> class ZeroObject(object):
... def __add__(self, other):
... return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
/* reject string values for 'start' parameter */
if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
PyErr_SetString(PyExc_TypeError,
"sum() can't sum strings [use ''.join(seq) instead]");
Py_DECREF(iter);
return NULL;
}
Py_INCREF(result);
}
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a
sequence of strings is by calling
''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
The Reason Why
#dan04 has an excellent explanation for the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, its a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similar to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.
A dedicated method like join, however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.
I dont know why, but this works!
import operator
def sum_of_strings(list_of_strings):
return reduce(operator.add, list_of_strings)
From what I understand, strings in Nim are basically a mutable sequence of bytes and that they are copied on assignment.
Given that, I assumed that sizeof would tell me (like len) the number of bytes, but instead it always gives 8 on my 64-bit machine, so it seems to be holding a pointer.
Given that, I have the following questions...
What was the motivation behind copy on assignment? Is it because they're mutable?
Is there ever a time when it isn't copied when assigned? (I assume non-var function parameters don't copy. Anything else?)
Are they optimized such that they only actually get copied if/when they're mutated?
Is there any significant difference between a string and a sequence, or can the answers to the above questions be equally applied to all sequences?
Anything else in general worth noting?
Thank you!
The definition of strings actually is in system.nim, just under another name:
type
TGenericSeq {.compilerproc, pure, inheritable.} = object
len, reserved: int
PGenericSeq {.exportc.} = ptr TGenericSeq
UncheckedCharArray {.unchecked.} = array[0..ArrayDummySize, char]
# len and space without counting the terminating zero:
NimStringDesc {.compilerproc, final.} = object of TGenericSeq
data: UncheckedCharArray
NimString = ptr NimStringDesc
So a string is a raw pointer to an object with a len, reserved and data field. The procs for strings are defined in sysstr.nim.
The semantics of string assignments have been chosen to be the same as for all value types (not ref or ptr) in Nim by default, so you can assume that assignments create a copy. When a copy is unneccessary, the compiler can leave it out, but I'm not sure how much that is happening so far. Passing strings into a proc doesn't copy them. There is no optimization that prevents string copies until they are mutated. Sequences behave in the same way.
You can change the default assignment behaviour of strings and seqs by marking them as shallow, then no copy is done on assignment:
var s = "foo"
shallow s
The challenge is like that:
Array Buffers are the backend for Typed Arrays. Whereas Buffer in node
is both the raw bytes as well as the encoding/view, Array Buffers are
only raw bytes and you have to create a Typed Array on top of an Array
Buffer in order to access the data.
When you create a new Typed Array and don't give it an Array Buffer to
be a view on top of it will create it's own new Array Buffer instead.
For this challenge, take the integer from process.argv[2] and write it
as the first element in a single element Uint32Array. Then create a
Uint16Array from the Array Buffer of the Uint32Array and log out to
the console the JSON stringified version of the Uint16Array.
Bonus: try to explain the relevance of the integer from
process.argv[2], or explain why the Uint16Array has the particular
values that it does.
The solution given by the author is like that:
var num = +process.argv[2]
var ui32 = new Uint32Array(1)
ui32[0] = num
var ui16 = new Uint16Array(ui32.buffer)
console.log(JSON.stringify(ui16))
I don't understand what does the plus sign in the first line means? And I can't understand the logic of this block of code either.
Thank you a lot if you can solve my puzzle.
Typed arrays are often used in context of asm.js, a strongly typed subset of JavaScript that is highly optimisable. "strongly typed" and "subset of JavaScript" are contradictory requirements, since JavaScript does not natively distinguish integers and floats, and has no way to declare them. Thus, asm.js adopts the convention that the following no-ops on integers and floats respectively serve as declarations and casts:
n|0 is n for every integer
+n is n for every float
Thus,
var num = +process.argv[2]
would be the equivalent of the following line in C or Java:
float num = process.argv[2]
declaring a floating point variable num.
It is puzzling though, I would have expected the following, given the requirement for integers:
var num = (process.argv[2])|0
Even in normal JavaScript though they can have uses, because they will also convert string representations of integers or floats respectively into numbers.