Why is the size of npy bigger than csv?

I converted a CSV file to an npy file. The CSV file is 5GB, but the resulting npy file is 13GB.
I thought an npy file would be more efficient than a CSV.
Am I misunderstanding something? Why is the npy file bigger than the CSV?
I just used this code:
full = pd.read_csv('data/RGB.csv', header=None).values
np.save('data/RGB.npy', full, allow_pickle=False, fix_imports=False)
and the data is structured like this:
R, G, B, is_skin
2, 5, 1, 0
10, 52, 242, 1
52, 240, 42, 0
... (420,711,257 rows)

In your case an element is an integer between 0 and 255, inclusive. That means that, saved as ASCII, it needs at most
3 chars for the number
1 char for the ,
1 char for the whitespace
which results in at most 5 bytes (somewhat less on average) per element on disk.
Pandas reads/interprets this as an int64 array by default (see full.dtype), which means it needs 8 bytes per element. That leads to a bigger npy file (most of whose bytes are zeros!).
To save an integer between 0 and 255 we need only one byte, so the size of the npy file could be reduced by a factor of 8 without losing any information. Just tell pandas to interpret the data as unsigned 8-bit integers:
import numpy as np
import pandas as pd
full = pd.read_csv(r'e:\data.csv', dtype=np.uint8).values
# or to get rid of pandas-dependency:
# full = np.genfromtxt(r'e:\data.csv', delimiter=',', dtype=np.uint8, skip_header=1)
np.save(r'e:/RGB.npy', full, allow_pickle=False, fix_imports=False)
# an 8 times smaller npy-file
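To see the factor of 8 without a 5GB file, here is a minimal sketch on made-up data. The random array, the file names and the temporary directory are assumptions for illustration only; they are not taken from the question.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in data: 1000 rows of R, G, B, is_skin values in 0..255.
rng = np.random.default_rng(0)
data_int64 = rng.integers(0, 256, size=(1000, 4), dtype=np.int64)
data_uint8 = data_int64.astype(np.uint8)  # lossless, every value fits in one byte

with tempfile.TemporaryDirectory() as tmp:
    path64 = os.path.join(tmp, "rgb_int64.npy")
    path8 = os.path.join(tmp, "rgb_uint8.npy")
    np.save(path64, data_int64, allow_pickle=False)
    np.save(path8, data_uint8, allow_pickle=False)
    size64 = os.path.getsize(path64)
    size8 = os.path.getsize(path8)

# In-memory payload: 8 bytes per element versus 1 byte per element.
print(data_int64.nbytes, data_uint8.nbytes)  # 32000 4000
# On-disk sizes are the payload plus a small npy header.
print(size64, size8)
```

The on-disk sizes differ from nbytes only by the small npy header, so for a file with hundreds of millions of rows the ratio is practically 8.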
Most of the time the npy format needs less space; however, there are situations in which the ASCII format results in smaller files.
For example, if the data consists mostly of very small one-digit numbers plus a few very big numbers that really do need 8 bytes:
in ASCII format you pay on average 2 bytes per element (there is no need to write whitespace; , alone is good enough as a delimiter).
in numpy format you pay 8 bytes per element.
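A small sketch of that trade-off, using ten made-up values (nine single digits and one number that needs the full int64 range). The values are an assumption chosen to illustrate the point.

```python
import numpy as np

# Hypothetical data: mostly one-digit numbers plus one very large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 2**62], dtype=np.int64)

# ASCII: digits plus a single comma between neighbours, no whitespace.
ascii_text = ",".join(str(v) for v in values)
ascii_bytes = len(ascii_text.encode("ascii"))

# Binary: the big value forces int64 for every element.
binary_bytes = values.nbytes

print(ascii_bytes, binary_bytes)  # 37 80
```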

Related

How to parse CAN bus data

I have a CAN bus charger which broadcasts battery charge level, but I am unable to parse the incoming data.
I use Node.js to capture the data and I receive (for example) the following hex values from the charger:
09e6000000
f5e5000000
The device datasheet states that this message is 40 bits long and contains:
Start bit 0, length 20, factor 0.001 - battery voltage
Start bit 20, length 18, factor 0.001 - charging current
Start bit 38, length 1, factor 1 - fault
Start bit 39, length 1, factor 1 - charging
I understand that I should convert the hex values to binary. Using an online calculator, I converted the first one as follows:
09e6000000 -> 100111100110000000000000000000000000
Now if I extract the first 20 bits, 10011110011000000000, and again use an online calculator to convert them to decimal, I get 648704.
If I take the second hex value and follow the same process I get
f5e5000000 -> 1111010111100101000000000000000000000000 -> 11110101111001010000 -> 1007184
Something is terribly wrong, since these two hex values should have been around 59.000. What am I doing wrong?
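One possible explanation, offered as a sketch and not as a confirmed answer: the signals may be packed in little-endian (Intel) bit order, so the first byte on the wire is the least significant one. That byte order is an assumption; the datasheet excerpt above does not state it. It does, however, produce values near 59 for both frames. The function name extract_signal is hypothetical, and Python is used here instead of Node.js only to keep the example short.

```python
def extract_signal(payload_hex, start_bit, length, factor=1.0):
    """Extract one signal, ASSUMING little-endian (Intel) bit numbering."""
    raw = int.from_bytes(bytes.fromhex(payload_hex), byteorder="little")
    mask = (1 << length) - 1
    return ((raw >> start_bit) & mask) * factor

for frame in ("09e6000000", "f5e5000000"):
    voltage = extract_signal(frame, 0, 20, 0.001)
    current = extract_signal(frame, 20, 18, 0.001)
    fault = extract_signal(frame, 38, 1)
    charging = extract_signal(frame, 39, 1)
    print(frame, round(voltage, 3), round(current, 3), fault, charging)
```

With this reading the first 20 bits are 0xe609 = 58889 and 0xe5f5 = 58869, which scale to 58.889 and 58.869. The conversion in the question also dropped the leading zero bits of 09, which shifts every later bit position.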

One hot encoding for ages categorical data

When trying to encode the categories below with a one-hot encoder, I got a "could not convert string to float" error.
['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']

I made something real quick that should work. You will see that I had a really nasty looking one-liner for preconditioning your limits; however, it will be much easier if you just convert the limits directly to the proper format.
Essentially, this just iterates through a list of limits and makes comparisons to the limits. If the sample of data is less than the limit, we make that index a 1 and break.
import random
# str_limits = ['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']
#
# one-line conditioning for the limit string format
# limits = sorted(list(filter(lambda x: not x.endswith("+"), map(lambda v: v.split("-")[-1], str_limits))))
# limits.append('1000')
# do this instead
limits = sorted([17, 35, 50, 55, 45, 25, 1000])
# sample 100 random datapoints between 0 and 65 for testing
samples = [random.choice(list(range(65))) for i in range(100)]
onehot = []  # this is where we will store our one-hot encodings
for sample in samples:
    row = [0]*len(limits)  # preallocating a list
    for i, limit in enumerate(limits):
        if sample <= limit:
            row[i] = 1
            break
    # storing that sample's onehot into a onehot list of lists
    onehot.append(row)
for i in range(10):
    print("{}: {}".format(onehot[i], samples[i]))
I am not sure about the specifics of your implementation, but you are probably forgetting to convert from a string to an integer at some point.
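If the data already contains the range labels as strings, a minimal alternative sketch is to map each label to a column index directly, so no string-to-number conversion is needed at all. The names categories, index_of and one_hot are hypothetical helpers, and the column order is an arbitrary choice made here.

```python
# The seven labels from the question, put into ascending age order.
categories = ['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']
index_of = {label: i for i, label in enumerate(categories)}

def one_hot(label):
    """Return a one-hot list for a label; raises KeyError for unknown labels."""
    row = [0] * len(categories)
    row[index_of[label]] = 1
    return row

print(one_hot('26-35'))  # [0, 0, 1, 0, 0, 0, 0]
print(one_hot('55+'))    # [0, 0, 0, 0, 0, 0, 1]
```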

is there a way to get the size of TFRecord file and the size of one Example in it?

I want to get the number of Examples in a TFRecord file. The method I currently use is
len([x for x in tf.python_io.tf_record_iterator(tf_record_file)])
but it is slow.
All Examples in my TFRecord file have exactly the same length, so I wonder whether there is a way to get the size (the number of bytes) of the whole TFRecord file (xxx.tfrecord) and the size (the number of bytes) of one Example in it. Then I think I could just use
number_of_Examples = (length of TFRecord file) / (length of the first Example) = (bytes of all Examples in xxx.tfrecord) / (bytes of one Example)
to get the number of Examples more quickly.

A TFRecord file is essentially an array of Examples, and it does not include the number of examples as metadata. Thus, one must iterate over it to count the number of examples. Another option is to save the count as metadata at creation time (in some separate file).
Edit:
The approach you propose won't work as long as 2 examples may be of different sizes, which is sometimes the case even if the number of features is identical.
If it is guaranteed that all examples have exactly the same number of bytes you could do the following:
import os
import tensorflow as tf
def getSize(filename):
    # size of the file on disk, in bytes
    st = os.stat(filename)
    return st.st_size
file = "..."
example_size = 0
example = tf.train.Example()
# parse only the first record to measure the size of one Example
for x in tf.python_io.tf_record_iterator(file):
    example.ParseFromString(x)
    example_size = example.ByteSize()
    break
file_size = getSize(file)
# each record carries 16 bytes of framing: 8 (length) + 4 (length CRC) + 4 (data CRC)
n = file_size // (example_size + 16)
print("file size in bytes:{}".format(file_size))
print("example size in bytes:{}".format(example_size))
print("N:{}".format(n))
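If the records may differ in size, a sketch of a middle road is to walk the record framing yourself and skip over each payload without parsing it. This relies on the TFRecord layout of an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload, which is also where the 16 bytes of overhead come from. The helper names count_tfrecords and fake_record are hypothetical, the CRCs are not validated, and the in-memory test data uses dummy CRC bytes.

```python
import io
import struct

def count_tfrecords(stream):
    """Count records by reading each length header and seeking past the rest."""
    count = 0
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break  # end of file (or a truncated trailing record)
        (length,) = struct.unpack("<Q", header)
        # skip: 4-byte length CRC + payload + 4-byte payload CRC
        stream.seek(4 + length + 4, io.SEEK_CUR)
        count += 1
    return count

def fake_record(payload):
    """Build a record with the same framing but dummy (zero) CRC fields."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4

buffer = io.BytesIO(b"".join(fake_record(b"x" * 100) for _ in range(5)))
total_bytes = len(buffer.getvalue())

print(count_tfrecords(buffer))   # 5
print(total_bytes // (100 + 16)) # 5, the fixed-size shortcut agrees
```

For a real file, open it with open(path, "rb") and pass the file object in place of the BytesIO buffer.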

Lua: Working with Bit32 Library to Change States of I/O's

I am trying to understand exactly how programming in Lua can change the state of I/O's with a Modbus I/O module. I have read the modbus protocol and understand the registers, coils, and how a read/write string should look. But right now, I am trying to grasp how I can manipulate the read/write bit(s) and how functions can perform these actions. I know I may be very vague right now, but hopefully the following functions, along with some questions throughout them, will help me better convey where I am having the disconnect. It has been a very long time since I've first learned about bit/byte manipulation.
local funcCodes = { --[[I understand this part]]
    readCoil = 1,
    readInput = 2,
    readHoldingReg = 3,
    readInputReg = 4,
    writeCoil = 5,
    presetSingleReg = 6,
    writeMultipleCoils = 15,
    presetMultipleReg = 16
}
local function toTwoByte(value)
    return string.char(value / 255, value % 255) --[[why do both of these to the same value??]]
end
local function readInputs(s)
    local s = mperia.net.connect(host, port)
    s:set_timeout(0.1)
    local req = string.char(0,0,0,0,0,6,unitId,2,0,0,0,6)
    local req = toTwoByte(0) .. toTwoByte(0) .. toTwoByte(6) ..
        string.char(unitId, funcCodes.readInput)..toTwoByte(0) ..toTwoByte(8)
    s:write(req)
    local res = s:read(10)
    s:close()
    if res:byte(10) then
        local out = {}
        for i = 1,8 do
            local statusBit = bit32.rshift(res:byte(10), i - 1) --[[What is bit32.rshift actually doing to the string? and the same is true for the next line with bit32.band.]]
            out[#out + 1] = bit32.band(statusBit, 1)
        end
        for i = 1,5 do
            tDT.value["return_low"] = tostring(out[1])
            tDT.value["return_high"] = tostring(out[2])
            tDT.value["sensor1_on"] = tostring(out[3])
            tDT.value["sensor2_on"] = tostring(out[4])
            tDT.value["sensor3_on"] = tostring(out[5])
            tDT.value["sensor4_on"] = tostring(out[6])
            tDT.value["sensor5_on"] = tostring(out[7])
            tDT.value[""] = tostring(out[8])
        end
    end
    return tDT
end
If I need to be more specific with my questions, I'll certainly try. But right now I'm having a hard time connecting the dots on what the bit/byte manipulation here is actually doing. I've read both the books on the bit32 library and sources online, but I still don't know what these functions are really doing. I hope that with these examples, I can get some clarification.
Cheers!

--[[why do both of these to the same value??]]
There are two different values here: value / 255 and value % 255. The "/" operator represents division, and the "%" operator represents (basically) taking the remainder of division.
Before proceeding, I'm going to point out that 255 here should almost certainly be 256, so let's make that correction first. The reason for this correction should become clear soon.
Let's look at an example.
value = 1000
print(value / 256) -- 3.90625
print(value % 256) -- 232
Whoops! There was another problem. string.char wants integers (in the range of 0 to 255 -- which has 256 distinct values counting 0), and we may be giving it a non-integer. Let's fix that problem:
value = 1000
print(math.floor(value / 256)) -- 3
-- in Lua 5.3, you could also use value // 256 to mean the same thing
print(value % 256) -- 232
What have we done here? Let's look at 1000 in binary. Since we are working with two-byte values, and each byte is 8 bits, I'll include 16 bits: 0b0000001111101000. (0b is a prefix that is sometimes used to indicate that the following number should be interpreted as binary.) If we split this into the first 8 bits and the second 8 bits, we get: 0b00000011 and 0b11101000. What are these numbers?
print(tonumber("00000011",2)) -- 3
print(tonumber("11101000",2)) -- 232
So what we have done is split a 2-byte number into two 1-byte numbers. So why does this work? Let's go back to base 10 for a moment. Suppose we have a four-digit number, say 1234, and we want to split it into two two-digit numbers. Well, the quotient 1234 / 100 is 12, and the remainder of that division is 34. In Lua, that's:
print(math.floor(1234 / 100)) -- 12
print(1234 % 100) -- 34
Hopefully, you can understand what's happening in base 10 pretty well. (More math here is outside the scope of this answer.) Well, what about 256? 256 is 2 to the power of 8. And there are 8 bits in a byte. In binary, 256 is 0b100000000 -- it's a 1 followed by a bunch of zeros. That means it has the same ability to split binary numbers apart that 100 has in base 10.
Another thing to note here is the concept of endianness. Which should come first, the 3 or the 232? It turns out that different computers (and different protocols) have different answers for this question. I don't know what is correct in your case; you'll have to refer to your documentation. The way you are currently set up is called "big endian" because the big part of the number comes first.
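For readers who want to experiment, here is the same split as a short sketch in Python (not Lua, purely as an illustration). It shows the quotient/remainder split, both byte orders, and why dividing by 255 instead of 256 does not round-trip.

```python
value = 1000

# Split into high and low byte: quotient and remainder of division by 256.
high, low = divmod(value, 256)
print(high, low)  # 3 232

# The same two bytes in both byte orders.
print(list(value.to_bytes(2, byteorder="big")))     # [3, 232]
print(list(value.to_bytes(2, byteorder="little")))  # [232, 3]

# Recombining the bytes gives back the original number.
print(high * 256 + low)  # 1000

# With 255 as the divisor the split breaks: 255 becomes (1, 0),
# which a receiver would read back as 1 * 256 + 0 = 256.
wrong_high, wrong_low = divmod(255, 255)
print(wrong_high * 256 + wrong_low)  # 256
```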
--[[What is bit32.rshift actually doing to the string? and the same is true for the next line with bit32.band.]]
Let's look at this whole loop:
local out = {}
for i = 1,8 do
local statusBit = bit32.rshift(res:byte(10), i - 1)
out[#out + 1] = bit32.band(statusBit, 1)
end
And let's pick a concrete number for the sake of example, say, 0b01100111. First let's look at the band (which is short for "bitwise and"). What does this mean? It means line up the two numbers and see where two 1's occur in the same place.
     01100111
band 00000001
-------------
     00000001
Notice first that I've put a bunch of 0's in front of the one. Preceding zeros don't change the value of the number, but I want all 8 bits for both numbers so that I can check each digit (bit) of the first number with each digit of the second number. In each place where both numbers had a 1 (the top number had a 1 "and" the bottom number had a 1), I put a 1 for the result, otherwise I put 0. That's bitwise and.
When we bitwise and with 0b00000001 as we did here, you should be able to see that we will only get a 1 (0b00000001) or a 0 (0b00000000) as the result. Which we get depends on the last bit of the other number. We have basically separated out the last bit of that number from the rest (which is often called "masking") and stored it in our out array.
Now what about the rshift ("right shift")? To shift right by one, we discard the rightmost digit, and move everything else over one space to the right. (At the left, we usually add a 0 so we still have 8 bits ... as usual, adding a bit in front of a number doesn't change it.)
right shift 01100111
            \\\\\\\\
             0110011 ... 1 <-- discarded
(Forgive my horrible ASCII art.) So shifting right by 1 changes our 0b01100111 to 0b00110011. (You can also think of this as chopping off the last bit.)
Now what does it mean to shift right by a different number? Well, to shift by zero does not change the number. To shift by more than one, we just repeat this operation however many times we are shifting by. (To shift by two, shift by one twice, etc.) (If you prefer to think in terms of chopping, right shift by x is chopping off the last x bits.)
So on the first iteration through the loop, the number will not be shifted, and we will store the rightmost bit.
On the second iteration through the loop, the number will be shifted by 1, and the new rightmost bit will be what was previously the second from the right, so the bitwise and will mask out that bit and we will store it.
On the next iteration, we will shift by 2, so the rightmost bit will be the one that was originally third from the right, so the bitwise and will mask out that bit and store it.
On each iteration, we store the next bit.
Since we are working with a byte, there are only 8 bits, so after 8 iterations through the loop, we will have stored the value of each bit into our table. This is what the table should look like in our example:
out = {1,1,1,0,0,1,1,0}
Notice that the bits are reversed from how we wrote them in 0b01100111, because we started looking from the right side of the binary number, but things are added to the table starting on the left.
In your case, it looks like each bit has a distinct meaning. For example, a 1 in the third bit could mean that sensor1 was on and a 0 in the third bit could mean that sensor1 was off. Eight different pieces of information like this were packed together to make it more efficient to transmit them over some channel. The loop separates them again into a form that is easy for you to use.
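The whole loop can be sketched in a few lines of Python (again as an illustration, not a replacement for the Lua code). The field names are copied from the question's table assignments; which bit means what on the real device is an assumption that only the device documentation can confirm.

```python
status = 0b01100111  # the example byte used above

# Shift right by i, then mask with 1 to keep only the lowest bit.
bits = [(status >> i) & 1 for i in range(8)]
print(bits)  # [1, 1, 1, 0, 0, 1, 1, 0]

# Names taken from the question; the eighth bit has no name there.
names = ["return_low", "return_high", "sensor1_on", "sensor2_on",
         "sensor3_on", "sensor4_on", "sensor5_on"]
state = dict(zip(names, bits))
print(state["sensor1_on"], state["sensor2_on"])  # 1 0
```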

Base64: What is the worst possible increase in space usage?

Suppose a server receives a base64 string and wants to check its length before converting, because it only ever permits the final byte array to be 16KB. How big could a 16KB byte array possibly become when converted to a Base64 string (assuming one byte per character)?

Base64 encodes each set of three bytes into four bytes. In addition the output is padded to always be a multiple of four.
This means that the size of the base-64 representation of a string of size n is:
ceil(n / 3) * 4
So, for a 16kB array, the base-64 representation will be ceil(16*1024/3)*4 = 21848 bytes long ~= 21.8kB.
A rough approximation would be that the size of the data is increased to 4/3 of the original.
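The formula is easy to check against a real encoder. Below is a minimal sketch using Python's standard base64 module; the function name base64_length is a hypothetical helper, and integer arithmetic stands in for ceil(n / 3).

```python
import base64

def base64_length(n_bytes):
    """Length of the padded Base64 encoding of n_bytes bytes of input."""
    return 4 * ((n_bytes + 2) // 3)  # same as ceil(n / 3) * 4

payload = bytes(16 * 1024)  # 16 KB of zero bytes; content does not affect length
encoded = base64.b64encode(payload)

print(base64_length(len(payload)), len(encoded))  # 21848 21848
```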

From Wikipedia:
Note that given an input of n bytes, the output will be (n + 2 - ((n + 2) % 3)) / 3 * 4 bytes long, so that the number of output bytes per input byte converges to 4 / 3 or 1.33333 for large n.
So 16 KB * 4 / 3 gives slightly more than 21.3 KB, or 21,848 bytes to be exact.
Hope this helps

16kb is 131,072 bits. Base64 packs 24-bit buffers into four 6-bit characters apiece, so you would have 5,462 * 4 = 21,848 bytes.

Since the question was about the worst possible increase, I must add that there are usually line breaks about every 80 characters. This means that if you are saving base64 encoded data into a text file, each line adds 2 bytes on Windows or 1 byte on Linux.
The increase from the actual encoding has been described above.

This is a future reference for myself. Since the question is about the worst case, we should take line breaks into account. While RFC 1421 defines the maximum line length as 64 chars, RFC 2045 (MIME) states there should be at most 76 chars in one line.
The latter is what the C# library implements. So in a Windows environment, where a line break is 2 chars (\r\n), we get this: Length = Floor(Ceiling(N/3) * 4 * 78 / 76)
Note: The flooring is there because in my test with C#, no line break follows the last line, even if it ends at exactly 76 chars.
I can prove it by running the following code:
byte[] bytes = new byte[16 * 1024];
Console.WriteLine(Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks).Length);
The answer for 16 kBytes encoded to base64 with 76-char lines: 22422 chars
I assume that on Linux it would be Length = Floor(Ceiling(N/3) * 4 * 77 / 76), but I haven't gotten around to testing it on .NET Core yet.
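The line-break arithmetic can be sketched in Python without any .NET runtime. The sketch assumes the behaviour reported above, that a break is inserted after every full 76-char line except the last one; it does not call or verify the C# library. The helper name wrapped_base64_length is hypothetical.

```python
import base64

def wrapped_base64_length(n_bytes, line_length=76, newline_length=2):
    """Base64 length with a break after every full line except the last."""
    encoded = 4 * ((n_bytes + 2) // 3)
    breaks = (encoded - 1) // line_length if encoded else 0
    return encoded + breaks * newline_length

# Cross-check by actually wrapping the encoded text with "\r\n".
text = base64.b64encode(bytes(16 * 1024)).decode("ascii")
lines = [text[i:i + 76] for i in range(0, len(text), 76)]
wrapped = "\r\n".join(lines)

print(wrapped_base64_length(16 * 1024), len(wrapped))      # 22422 22422
print(wrapped_base64_length(16 * 1024, newline_length=1))  # 22135
```

Counting the breaks exactly gives the same results as the two Floor formulas for 16 KB, and it also covers the case where the encoded text is an exact multiple of 76 chars, where those formulas come out slightly high.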

It also depends on the character encoding of the string itself: if the base64 text is stored as a UTF-32 string, each base64 character consumes 3 additional bytes (4 bytes per char).
