I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b
using PySpark.
import array
from io import StringIO
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()
def byte_mapper(bytes):
    return str(bytes)
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just product_id is selected from the rdd using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The output for product_id is
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on S3.
Each record in the file has the product_id in the first 10 bytes, followed by 4096 bytes of image_features.
I'm able to extract all 4096 image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.
EDIT:
It turns out the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10, because each of the 4096 features is a 4-byte float. Changing it to:
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
should work.
You can actually see this in the sample code provided on the website you downloaded the binary file from:
for i in range(4096):
    feature.append(struct.unpack('f', f.read(4))) # <-- so 4096 * 4
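Putting the corrected record length together, here is a minimal sketch of a per-record parser. The names parse_record, ID_BYTES, N_FEATURES and RECORD_LENGTH are made up for illustration, and the little-endian float layout and ASCII product IDs are assumptions that are not confirmed above.

```python
import struct

ID_BYTES = 10                               # product_id length, as described in the question
N_FEATURES = 4096                           # number of floats per record
RECORD_LENGTH = ID_BYTES + 4 * N_FEATURES   # 10 + 4096*4 = 16394

def parse_record(record):
    """Split one fixed-length record into (product_id, features)."""
    if len(record) != RECORD_LENGTH:
        raise ValueError("expected %d bytes, got %d" % (RECORD_LENGTH, len(record)))
    # Assumption: the ID is plain ASCII text.
    product_id = record[:ID_BYTES].decode("ascii")
    # Assumption: the floats are 4-byte little-endian values.
    features = list(struct.unpack("<%df" % N_FEATURES, record[ID_BYTES:]))
    return product_id, features

# With Spark, this would presumably be used as:
#   img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", RECORD_LENGTH)
#   decoded_embeddings = img_embedding_file.map(parse_record)
```

Checking the record length up front makes a wrong recordLength fail loudly instead of silently producing shifted IDs like the ones in the question.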
Old answer:
I think the issue comes from your byte_mapper function.
That's not the correct way to convert bytes to string. You should be using decode:
bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"
print(bytes.decode("utf-8"))
# output: '1582480311'
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
That means the product_id bytes are not valid UTF-8. If you don't know the input encoding, it's difficult to convert them into strings.
However, you may want to ignore the undecodable bytes by passing the "ignore" error handler to decode:
bytes.decode("utf-8", "ignore")
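As a small illustration of how the error handlers differ, here is a sketch using a made-up byte string (raw is not data from the file):

```python
raw = b"B00\x88ABC"   # hypothetical ID with one invalid byte (0x88)

try:
    raw.decode("utf-8")            # the default "strict" handler raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")     # drops the invalid byte
replaced = raw.decode("utf-8", "replace")   # substitutes U+FFFD for it

print(strict_failed, repr(ignored), repr(replaced))
```

Note that "ignore" silently discards data, so it hides problems such as a wrong record length rather than fixing them.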
I'm trying to allocate a set of image buffers in shared memory using multiprocessing.RawArray. It works fine for smaller numbers of images. However, when I get to a certain number of buffers, I get an OSError indicating that I've run out of memory.
Obvious question, am I actually out of memory? By my count, the buffers I'm trying to allocate should be about 1 GB of memory, and according to the Windows Task Manager, I have about 20 GB free. I don't see how I could actually be out of memory!
Am I hitting some kind of artificial memory consumption limit that I can increase? If not, why is this happening, and how can I get around this?
I'm using Windows 10, Python 3.7, 64 bit architecture, 32 GB RAM total.
Here's a minimal reproducible example:
import multiprocessing as mp
import ctypes
imageDataType = ctypes.c_uint8
imageDataSize = 1024*1280*3 # 3,932,160 bytes
maxBufferSize = 300
buffers = []
for k in range(maxBufferSize):
    print("Creating buffer #", k)
    buffers.append(mp.RawArray(imageDataType, imageDataSize))
Output:
Creating buffer # 0
Creating buffer # 1
Creating buffer # 2
Creating buffer # 3
Creating buffer # 4
Creating buffer # 5
...etc...
Creating buffer # 278
Creating buffer # 279
Creating buffer # 280
Traceback (most recent call last):
  File ".\Cruft\memoryErrorTest.py", line 10, in <module>
    buffers.append(mp.RawArray(imageDataType, imageDataSize))
  File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\context.py", line 129, in RawArray
    return RawArray(typecode_or_type, size_or_initializer)
  File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\sharedctypes.py", line 61, in RawArray
    obj = _new_value(type_)
  File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\sharedctypes.py", line 41, in _new_value
    wrapper = heap.BufferWrapper(size)
  File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 263, in __init__
    block = BufferWrapper._heap.malloc(size)
  File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 242, in malloc
    (arena, start, stop) = self._malloc(size)
  File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 134, in _malloc
    arena = Arena(length)
  File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 38, in __init__
    buf = mmap.mmap(-1, size, tagname=name)
OSError: [WinError 8] Not enough memory resources are available to process this command
OK, the folks over at the Python bug tracker figured this out for me. For posterity:
I was using 32-bit Python, which is limited to a memory address space of at most 4 GB, much less than my total available system memory. Apparently enough of that space was taken up by other stuff that the interpreter couldn't find a large enough contiguous block for all my RawArrays.
The error does not occur when using 64-bit Python, so that seems to be the easiest solution.
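For anyone hitting the same error, a quick way to check which interpreter is actually running is sketched below. The variable names are made up for illustration; the buffer sizes are the ones from the example in the question. The Python37-32 directory in the traceback paths also appears to point at a 32-bit install.

```python
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build.
pointer_bits = struct.calcsize("P") * 8
is_64bit = sys.maxsize > 2**32

# Total memory requested by the example: 300 buffers of 1024*1280*3 bytes each.
image_bytes = 1024 * 1280 * 3
total_bytes = 300 * image_bytes

print("Interpreter is %d-bit" % pointer_bits)
print("Requested %d bytes (about %.2f GiB)" % (total_bytes, total_bytes / 2**30))
```

Roughly 1.1 GiB of shared buffers is a large fraction of what a 32-bit process can address, regardless of how much RAM the machine has.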
I thought that fixing the number of decimal places for all numbers in an array of Decimals, and for the new arrays resulting from operations on them, could be achieved by simply doing:
from decimal import *
getcontext().prec = 5 # 4 decimal points
v = Decimal(0.005)
print(v)
0.005000000000000000104083408558608425664715468883514404296875
However, I get spurious results that I know are the consequence of the contribution of these extra decimals to the calculations. Therefore, as a workaround, I used the round() function like this:
C_subgrid= [Decimal('33.340'), Decimal('33.345'), Decimal('33.350'), Decimal('33.355'), Decimal('33.360'), Decimal('33.365'), Decimal('33.370'), Decimal('33.375'), Decimal('33.380'), Decimal('33.385'), Decimal('33.390'), Decimal('33.395'), Decimal('33.400'), Decimal('33.405'), Decimal('33.410'), Decimal('33.415'), Decimal('33.420'), Decimal('33.425'), Decimal('33.430'), Decimal('33.435'), Decimal('33.440'), Decimal('33.445'), Decimal('33.450'), Decimal('33.455'), Decimal('33.460'), Decimal('33.465'), Decimal('33.470'), Decimal('33.475'), Decimal('33.480'), Decimal('33.485'), Decimal('33.490'), Decimal('33.495'), Decimal('33.500'), Decimal('33.505'), Decimal('33.510'), Decimal('33.515'), Decimal('33.520'), Decimal('33.525'), Decimal('33.530'), Decimal('33.535'), Decimal('33.540'), Decimal('33.545'), Decimal('33.550'), Decimal('33.555'), Decimal('33.560'), Decimal('33.565'), Decimal('33.570'), Decimal('33.575'), Decimal('33.580'), Decimal('33.585'), Decimal('33.590'), Decimal('33.595'), Decimal('33.600'), Decimal('33.605'), Decimal('33.610'), Decimal('33.615'), Decimal('33.620'), Decimal('33.625'), Decimal('33.630'), Decimal('33.635'), Decimal('33.640'), Decimal('33.645'), Decimal('33.650'), Decimal('33.655'), Decimal('33.660'), Decimal('33.665'), Decimal('33.670'), Decimal('33.675'), Decimal('33.680'), Decimal('33.685'), Decimal('33.690'), Decimal('33.695'), Decimal('33.700'), Decimal('33.705'), Decimal('33.710'), Decimal('33.715'), Decimal('33.720'), Decimal('33.725'), Decimal('33.730'), Decimal('33.735'), Decimal('33.740'), Decimal('33.745'), Decimal('33.750'), Decimal('33.755'), Decimal('33.760'), Decimal('33.765'), Decimal('33.770'), Decimal('33.775'), Decimal('33.780'), Decimal('33.785'), Decimal('33.790'), Decimal('33.795'), Decimal('33.800'), Decimal('33.805'), Decimal('33.810'), Decimal('33.815'), Decimal('33.820'), Decimal('33.825'), Decimal('33.830'), Decimal('33.835'), Decimal('33.840'), Decimal('33.845'), Decimal('33.850'), Decimal('33.855'), 
Decimal('33.860'), Decimal('33.865'), Decimal('33.870'), Decimal('33.875'), Decimal('33.880'), Decimal('33.885'), Decimal('33.890'), Decimal('33.895'), Decimal('33.900'), Decimal('33.905'), Decimal('33.910'), Decimal('33.915'), Decimal('33.920'), Decimal('33.925'), Decimal('33.930'), Decimal('33.935'), Decimal('33.940'), Decimal('33.945'), Decimal('33.950'), Decimal('33.955'), Decimal('33.960'), Decimal('33.965'), Decimal('33.970'), Decimal('33.975'), Decimal('33.980'), Decimal('33.985'), Decimal('33.990'), Decimal('33.995'), Decimal('34.000'), Decimal('34.005'), Decimal('34.010'), Decimal('34.015'), Decimal('34.020'), Decimal('34.025'), Decimal('34.030'), Decimal('34.035'), Decimal('34.040'), Decimal('34.045'), Decimal('34.050'), Decimal('34.055'), Decimal('34.060'), Decimal('34.065'), Decimal('34.070'), Decimal('34.075'), Decimal('34.080'), Decimal('34.085'), Decimal('34.090'), Decimal('34.095'), Decimal('34.100'), Decimal('34.105'), Decimal('34.110'), Decimal('34.115'), Decimal('34.120'), Decimal('34.125'), Decimal('34.130'), Decimal('34.135'), Decimal('34.140')]
C_subgrid = [round(v, 4) for v in C_subgrid]
I got the values of the C_subgrid list by printing it out during execution of my code, and I pasted it here. Not sure where the single quotes come from. This code snippet worked fine in Python 2.7, but when I upgraded to Python 3.7 it started raising this error:
  File "/home2/thomas/Documents/4D-CHAINS_dev/lib/peak.py", line 301, in <listcomp>
    C_subgrid = [round(v, 4) for v in C_subgrid] # convert all values to fixed decimal length floats!
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]
Strangely, if I run it within ipython it works fine, only within my code it creates problems. Can anybody think of any possible reason?
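One possible explanation, assuming getcontext().prec = 5 is in effect where the list comprehension runs but not in the ipython session: prec counts significant digits rather than decimal places, and it applies to arithmetic results, not to the Decimal constructor. In Python 3, round(v, 4) on a Decimal goes through quantize, which signals InvalidOperation when the result needs more digits than prec allows (33.3400 has six). The sketch below reproduces this in a local context; names such as three_places are made up for illustration.

```python
from decimal import Decimal, InvalidOperation, localcontext

with localcontext() as ctx:
    ctx.prec = 5                 # five significant digits, as in the question

    v = Decimal(0.005)           # constructor keeps the exact float value; prec is ignored
    rounded_v = +v               # unary plus applies the context precision

    x = Decimal("33.340")
    try:
        round(x, 4)              # 33.3400 would need six digits, more than prec
        round_failed = False
    except InvalidOperation:
        round_failed = True

    # Three decimal places need only five digits here, so this fits within prec.
    three_places = x.quantize(Decimal("0.001"))

print(round_failed, rounded_v, three_places)
```

If this is indeed the cause, constructing the values from strings (Decimal('0.005')) and either raising prec or quantizing to a number of places that fits within it would avoid the error.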
I am reading large files in Fortran that contain mixed string/numeric data such as:
114 MIDSIDE 0 0 O0002 436 437 584 438
115 SURFACE M00002 0 0 359 561 560 356
412236 SOLID M00002 O00001 0 86157 82419 82418 79009
Currently, each line is read as a string and then post-processed to identify the proper terms. I was wondering if there is any way to read each line as an integer followed by four strings separated by space, and then some more integers; i.e. similar to '(I10,4(A6,X),4I10)' format, but without any information on the size of each string.
Does not work (charr is empty, iarr(2:5)=0):
      INTEGER IARR(5)
      CHARACTER*30 CHARR(4)
C open the file with ID=1
      READ(1,*)IARR(1),(CHARR(I),I=1,4),(IARR(I),I=2,5)
Works (only for the last line in the data example):
      INTEGER IARR(5)
      CHARACTER*30 CHARR(4)
C open the file with ID=1
      READ(1,'(I10,4(A7,X),4I10)')IARR(1),(CHARR(I),I=1,4),
     &     (IARR(I),I=2,5)
The issue is that I don't know a priori the size of each string.
I actually found out the f77rtl flag was used to compile the project, and when I removed the flag, the issue was resolved. So the list-directed input format works just fine.