Is there a size limit on strings fed to the .fromstring() method? - python-3.x

I'm working with multiple well-formed XML files whose sizes range from 100 MB to 4 GB. My goal is to read them as strings and then import them as ElementTree objects using the .fromstring() method (from the xml.etree.ElementTree module).
However, as the process went on and the string size increased, two exceptions occurred, both related to memory restrictions:
xml.etree.ElementTree.ParseError: out of memory: line 1, column 0
OverflowError: size does not fit in an int
It looks like the .fromstring() method enforces a size limit on the input string, around 1 GB...?
To debug this, I wrote a short script using a for loop:
from xml.etree import ElementTree as cElementTree  # alias kept from my code; xml.etree.cElementTree was removed in Python 3.9

xmlFiles_list = [path1, path2, ...]
for fp in xmlFiles_list:
    xml_fo = open(fp, mode='r', encoding="utf-8")
    xml_asStr = xml_fo.read()
    xml_fo.close()
    print(len(xml_asStr.encode("utf-8")) / 10**9)  # display string size in GB
    try:
        etree = cElementTree.fromstring(xml_asStr)
        print(".fromstring() success!\n")
    except Exception as e:
        print(f"Error :{type(e)} {str(e)}\n")
        continue
The output is as follows:
0.895206753
.fromstring() success!
1.220224531
Error :<class 'xml.etree.ElementTree.ParseError'> out of memory: line 1, column 0
1.328233473
Error :<class 'xml.etree.ElementTree.ParseError'> out of memory: line 1, column 0
2.567867904
Error :<class 'OverflowError'> size does not fit in an int
4.080672538
Error :<class 'OverflowError'> size does not fit in an int
I found multiple workarounds to avoid this issue: the .parse() method, or the lxml module for better performance. I just hope someone could shed some light on this:
Is there a specific string size limit in the xml.etree.ElementTree module and its .fromstring() method?
Why do I end up with two different exceptions as the string size increases? Are they related to the same memory-allocation restriction?
Python version/system: 3.9 (64-bit)
RAM: 32 GB
I hope my question is clear enough; I'm new to Stack Overflow.
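For reference, a minimal sketch of one streaming approach (ET.iterparse, a close relative of the .parse() workaround mentioned above) that avoids ever holding the whole document as one string; the file path is a placeholder:
import xml.etree.ElementTree as ET

# iterparse builds elements incrementally, so memory stays bounded
# as long as processed elements are cleared along the way.
for event, elem in ET.iterparse("big_file.xml", events=("end",)):
    # ... process elem here ...
    elem.clear()  # free the element once it has been handled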

Related

RuntimeError on running ALBERT for obtaining encoding vectors from text

I'm trying to get feature vectors from the encoder model using pre-trained ALBERT v2 weights. I have an NVIDIA 1650 Ti GPU (4 GB) and sufficient RAM (8 GB), but for some reason I'm getting a RuntimeError saying:
RuntimeError: [enforce fail at …\c10\core\CPUAllocator.cpp:75] data.
DefaultCPUAllocator: not enough memory: you tried to allocate
491520000 bytes. Buy new RAM!
I'm really new to PyTorch and deep learning in general. Can anyone please tell me what is wrong?
My entire code:
import torch
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# device and the tokenized arrays (tokenized_test_values, encoded_train_data,
# encoded_masks) are defined earlier and not shown here
encoded_test_data = tokenized_test_values['input_ids']
encoded_test_masks = tokenized_test_values['attention_mask']

encoded_train_data = torch.from_numpy(encoded_train_data).to(device)
encoded_masks = torch.from_numpy(encoded_masks).to(device)
encoded_test_data = torch.from_numpy(encoded_test_data).to(device)
encoded_test_masks = torch.from_numpy(encoded_test_masks).to(device)

config = EncoderDecoderConfig.from_encoder_decoder_configs(BertConfig(), BertConfig())
EnD_model = EncoderDecoderModel.from_pretrained('albert-base-v2', config=config)
feature_extractor = EnD_model.get_encoder()
feature_vector = feature_extractor.forward(input_ids=encoded_train_data, attention_mask=encoded_masks)
feature_test_vector = feature_extractor.forward(input_ids=encoded_test_data, attention_mask=encoded_test_masks)
Also, 491520000 bytes is about 490 MB, which should not be a problem.
I tried reducing the number of training examples and also the maximum padded input length. The OOM error still occurs even though the required space is now 153 MB, which should easily be manageable.
I have also maxed out PyCharm's heap limit at 2048 MB. I really don't know what to do now…
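A hedged suggestion, not from the original post: running the forward pass under torch.no_grad() and in small batches often avoids this kind of allocation spike, since neither the autograd graph nor the activations for the full dataset are held in memory at once. Names follow the snippet above; the batch size and the .last_hidden_state output attribute are assumptions:
import torch

feature_chunks = []
with torch.no_grad():  # no autograd bookkeeping during pure inference
    for start in range(0, encoded_train_data.size(0), 32):  # batch size 32 is arbitrary
        batch_ids = encoded_train_data[start:start + 32]
        batch_masks = encoded_masks[start:start + 32]
        out = feature_extractor(input_ids=batch_ids, attention_mask=batch_masks)
        feature_chunks.append(out.last_hidden_state.cpu())
feature_vector = torch.cat(feature_chunks, dim=0)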

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)  # sc is the SparkContext

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    return str(bytes)

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just the product_id is selected from the RDD using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The output for product_id is
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on S3.
Each record in the file has the first 10 bytes for product_id and the next 4096 bytes as image_features.
I'm able to extract all 4096 image features, but I'm facing an issue when reading the first 10 bytes and converting them into a properly readable format.
EDIT:
Finally, the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10. Changing it to:
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
should work.
Actually, you can find this in the code provided on the website you downloaded the binary file from:
for i in range(4096):
    feature.append(struct.unpack('f', f.read(4)))  # <-- so 4096 * 4
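To sanity-check the layout, here is a minimal sketch (mine, not part of the answer) that decodes a single 16394-byte record from a local copy of the file: 10 bytes of product_id followed by 4096 four-byte floats:
import array

RECORD_LEN = 10 + 4096 * 4  # 16394 bytes per record

with open("image_features.b", "rb") as f:  # local copy of the file
    record = f.read(RECORD_LEN)

product_id = record[:10].decode("utf-8", "ignore")  # first 10 bytes
features = array.array('f')
features.frombytes(record[10:])  # remaining 16384 bytes = 4096 floats
print(product_id, len(features))  # expect 4096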
Old answer:
I think the issue comes from your byte_mapper function.
That's not the correct way to convert bytes to string. You should be using decode:
bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"
print(bytes.decode("utf-8"))
# output: '1582480311'
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
That means the product_id string contains non-UTF-8 bytes. If you don't know the input encoding, it's difficult to convert it into a string.
However, you may want to ignore those bytes by passing "ignore" as the errors argument to decode:
bytes.decode("utf-8", "ignore")

Spurious out-of-memory error when allocating shared memory with multiprocessing

I'm trying to allocate a set of image buffers in shared memory using multiprocessing.RawArray. It works fine for smaller numbers of images. However, when I get to a certain number of buffers, I get an OSError indicating that I've run out of memory.
Obvious question, am I actually out of memory? By my count, the buffers I'm trying to allocate should be about 1 GB of memory, and according to the Windows Task Manager, I have about 20 GB free. I don't see how I could actually be out of memory!
Am I hitting some kind of artificial memory consumption limit that I can increase? If not, why is this happening, and how can I get around this?
I'm using Windows 10, Python 3.7, 64 bit architecture, 32 GB RAM total.
Here's a minimal reproducible example:
import multiprocessing as mp
import ctypes

imageDataType = ctypes.c_uint8
imageDataSize = 1024*1280*3  # 3,932,160 bytes
maxBufferSize = 300
buffers = []
for k in range(maxBufferSize):
    print("Creating buffer #", k)
    buffers.append(mp.RawArray(imageDataType, imageDataSize))
Output:
Creating buffer # 0
Creating buffer # 1
Creating buffer # 2
Creating buffer # 3
Creating buffer # 4
Creating buffer # 5
...etc...
Creating buffer # 278
Creating buffer # 279
Creating buffer # 280
Traceback (most recent call last):
File ".\Cruft\memoryErrorTest.py", line 10, in <module>
buffers.append(mp.RawArray(imageDataType, imageDataSize))
File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\context.py", line 129, in RawArray
return RawArray(typecode_or_type, size_or_initializer)
File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\sharedctypes.py", line 61, in RawArray
obj = _new_value(type_)
File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\sharedctypes.py", line 41, in _new_value
wrapper = heap.BufferWrapper(size)
File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 263, in __init__
block = BufferWrapper._heap.malloc(size)
File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 242, in malloc
(arena, start, stop) = self._malloc(size)
File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 134, in _malloc
arena = Arena(length)
File "C:\Users\Brian Kardon\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\heap.py", line 38, in __init__
buf = mmap.mmap(-1, size, tagname=name)
OSError: [WinError 8] Not enough memory resources are available to process this command
OK, the folks over at the Python bug tracker figured this out for me. For posterity:
I was using 32-bit Python, which is limited to a 4 GB memory address space, much less than my total available system memory. Apparently enough of that space was taken up by other things that the interpreter couldn't find a large enough contiguous block for all my RawArrays.
The error does not occur when using 64-bit Python, so that seems to be the easiest solution.
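A quick way to check which build you are running (a small sketch of mine, not from the bug tracker thread):
import struct
import sys

print(struct.calcsize("P") * 8)  # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)       # True on a 64-bit interpreter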

decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]

I thought that fixing the number of decimal places for all numbers in an array of Decimals, and for the new arrays resulting from operations on them, could be achieved simply by doing:
from decimal import *
getcontext().prec = 5 # 4 decimal points
v = Decimal(0.005)
print(v)
0.005000000000000000104083408558608425664715468883514404296875
However, I get spurious results that I know are the consequence of these extra decimals contributing to the calculations. Therefore, as a workaround, I used the round() function like this:
C_subgrid= [Decimal('33.340'), Decimal('33.345'), Decimal('33.350'), Decimal('33.355'), Decimal('33.360'), Decimal('33.365'), Decimal('33.370'), Decimal('33.375'), Decimal('33.380'), Decimal('33.385'), Decimal('33.390'), Decimal('33.395'), Decimal('33.400'), Decimal('33.405'), Decimal('33.410'), Decimal('33.415'), Decimal('33.420'), Decimal('33.425'), Decimal('33.430'), Decimal('33.435'), Decimal('33.440'), Decimal('33.445'), Decimal('33.450'), Decimal('33.455'), Decimal('33.460'), Decimal('33.465'), Decimal('33.470'), Decimal('33.475'), Decimal('33.480'), Decimal('33.485'), Decimal('33.490'), Decimal('33.495'), Decimal('33.500'), Decimal('33.505'), Decimal('33.510'), Decimal('33.515'), Decimal('33.520'), Decimal('33.525'), Decimal('33.530'), Decimal('33.535'), Decimal('33.540'), Decimal('33.545'), Decimal('33.550'), Decimal('33.555'), Decimal('33.560'), Decimal('33.565'), Decimal('33.570'), Decimal('33.575'), Decimal('33.580'), Decimal('33.585'), Decimal('33.590'), Decimal('33.595'), Decimal('33.600'), Decimal('33.605'), Decimal('33.610'), Decimal('33.615'), Decimal('33.620'), Decimal('33.625'), Decimal('33.630'), Decimal('33.635'), Decimal('33.640'), Decimal('33.645'), Decimal('33.650'), Decimal('33.655'), Decimal('33.660'), Decimal('33.665'), Decimal('33.670'), Decimal('33.675'), Decimal('33.680'), Decimal('33.685'), Decimal('33.690'), Decimal('33.695'), Decimal('33.700'), Decimal('33.705'), Decimal('33.710'), Decimal('33.715'), Decimal('33.720'), Decimal('33.725'), Decimal('33.730'), Decimal('33.735'), Decimal('33.740'), Decimal('33.745'), Decimal('33.750'), Decimal('33.755'), Decimal('33.760'), Decimal('33.765'), Decimal('33.770'), Decimal('33.775'), Decimal('33.780'), Decimal('33.785'), Decimal('33.790'), Decimal('33.795'), Decimal('33.800'), Decimal('33.805'), Decimal('33.810'), Decimal('33.815'), Decimal('33.820'), Decimal('33.825'), Decimal('33.830'), Decimal('33.835'), Decimal('33.840'), Decimal('33.845'), Decimal('33.850'), Decimal('33.855'), Decimal('33.860'), Decimal('33.865'), Decimal('33.870'), Decimal('33.875'), Decimal('33.880'), Decimal('33.885'), Decimal('33.890'), Decimal('33.895'), Decimal('33.900'), Decimal('33.905'), Decimal('33.910'), Decimal('33.915'), Decimal('33.920'), Decimal('33.925'), Decimal('33.930'), Decimal('33.935'), Decimal('33.940'), Decimal('33.945'), Decimal('33.950'), Decimal('33.955'), Decimal('33.960'), Decimal('33.965'), Decimal('33.970'), Decimal('33.975'), Decimal('33.980'), Decimal('33.985'), Decimal('33.990'), Decimal('33.995'), Decimal('34.000'), Decimal('34.005'), Decimal('34.010'), Decimal('34.015'), Decimal('34.020'), Decimal('34.025'), Decimal('34.030'), Decimal('34.035'), Decimal('34.040'), Decimal('34.045'), Decimal('34.050'), Decimal('34.055'), Decimal('34.060'), Decimal('34.065'), Decimal('34.070'), Decimal('34.075'), Decimal('34.080'), Decimal('34.085'), Decimal('34.090'), Decimal('34.095'), Decimal('34.100'), Decimal('34.105'), Decimal('34.110'), Decimal('34.115'), Decimal('34.120'), Decimal('34.125'), Decimal('34.130'), Decimal('34.135'), Decimal('34.140')]
C_subgrid = [round(v, 4) for v in C_subgrid]
I got the values of the C_subgrid list by printing it out during execution of my code and pasted it here; I'm not sure where the single quotes come from. This code snippet worked fine in Python 2.7, but when I upgraded to Python 3.7 it started raising this error:
File "/home2/thomas/Documents/4D-CHAINS_dev/lib/peak.py", line 301, in <listcomp>
C_subgrid = [round(v, 4) for v in C_subgrid] # convert all values to fixed decimal length floats!
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]
Strangely, it works fine if I run it within IPython; it only creates problems within my code. Can anybody think of any possible reason?
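For what it's worth, a minimal sketch (my reproduction, not from the original post) of how the context precision interacts with round(): with prec = 5, rounding to 4 decimal places can require 6 significant digits, which overflows the context and raises exactly this InvalidOperation. Quantizing under a local context with enough precision avoids it:
from decimal import Decimal, InvalidOperation, getcontext, localcontext

getcontext().prec = 5  # prec counts significant digits, not decimal places

try:
    round(Decimal('33.340'), 4)  # result 33.3400 needs 6 significant digits > prec
except InvalidOperation as e:
    print("raised:", e)

# Quantizing under a wider local context succeeds:
with localcontext() as ctx:
    ctx.prec = 28
    print(Decimal('33.340').quantize(Decimal('0.0001')))  # 33.3400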

Reading strings from file separated by space in Fortran

I am reading large files in Fortran that contain mixed string/numeric data such as:
114 MIDSIDE 0 0 O0002 436 437 584 438
115 SURFACE M00002 0 0 359 561 560 356
412236 SOLID M00002 O00001 0 86157 82419 82418 79009
Currently, each line is read as a string and then post-processed to identify the proper terms. I was wondering if there is any way to read each line as an integer followed by four strings separated by spaces, and then some more integers, i.e. similar to a '(I10,4(A6,X),4I10)' format, but without any information on the size of each string.
Does not work (charr is empty, iarr(2:5)=0):
INTEGER IARR(5)
CHARACTER*30 CHARR(4)
C open the file with ID=1
READ(1,*)IARR(1),(CHARR(I),I=1,4),(IARR(I),I=2,5)
Works (only for the last line in the data example):
INTEGER IARR(5)
CHARACTER*30 CHARR(4)
C open the file with ID=1
READ(1,'(I10,4(A7,X),4I10)')IARR(1),(CHARR(I),I=1,4),(IARR(I),I=2,5)
The issue is that I don't know a priori what the size of each string will be.
I eventually found out that the f77rtl flag was used to compile the project; when I removed the flag, the issue was resolved. So the list-directed input format works just fine.
