load .npy file from google cloud storage with tensorflow - python-3.x

I'm trying to load .npy files from my Google Cloud Storage bucket into my model. I followed the example in "Load numpy array in google-cloud-ml job",
but I get this error:
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
Can you help me, please?
Here is a sample of the code.
Here I read the file:
with file_io.FileIO(metadata_filename, 'r') as f:
    self._metadata = [line.strip().split('|') for line in f]
And here I start processing it:
if self._offset >= len(self._metadata):
    self._offset = 0
    random.shuffle(self._metadata)
meta = self._metadata[self._offset]
self._offset += 1
text = meta[3]
if self._cmudict and random.random() < _p_cmudict:
    text = ' '.join([self._maybe_get_arpabet(word) for word in text.split(' ')])
input_data = np.asarray(text_to_sequence(text, self._cleaner_names), dtype=np.int32)
f = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path',meta[0])))
linear_target = tf.Variable(initial_value=np.load(f), name='linear_target')
s = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path',meta[1])))
mel_target = tf.Variable(initial_value=np.load(s), name='mel_target')
return (input_data, mel_target, linear_target, len(linear_target))
And this is a sample of the data.

This is likely because your file doesn't contain UTF-8 encoded text. A .npy file is binary, and its first byte is 0x93 (the start of the \x93NUMPY magic string), which is exactly the byte the decoder rejects.
You may need to open the file_io.FileIO instance as a binary file using mode='rb', or set binary_mode=True in the call to read_file_to_string.
The data is then returned as a sequence of bytes rather than as a string.
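The difference between the two modes can be reproduced with the standard library alone. The sketch below is only an illustration under stated assumptions: the file name `fake.npy` and the helpers `read_text` and `read_binary` are hypothetical, and the file holds just the magic bytes rather than a real array.

```python
import os
import tempfile

NPY_MAGIC = b"\x93NUMPY"  # first bytes of every .npy file


def read_text(path):
    # Text mode: the bytes are decoded as UTF-8, which fails on 0x93.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def read_binary(path):
    # Binary mode: the bytes are returned untouched.
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a real .npy file (magic string plus version bytes).
    path = os.path.join(tmp, "fake.npy")
    with open(path, "wb") as f:
        f.write(NPY_MAGIC + b"\x01\x00")

    try:
        read_text(path)
        text_error = None
    except UnicodeDecodeError as exc:
        text_error = str(exc)

    raw = read_binary(path)

print(text_error)
print(raw[:6])
```

Applied to the question's code, the same idea would mean reading with `file_io.read_file_to_string(path, binary_mode=True)` and wrapping the resulting bytes in `io.BytesIO` instead of `StringIO` before passing them to `np.load`, since `np.load` expects a binary file-like object.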

Related

Choose encoding when converting to Sqlite database

I am converting Mbox files to an SQLite database, but I cannot manage to get the database file encoded in UTF-8.
The Python console displays the following message during the conversion:
Error binding parameter 1 - probably unsupported type.
When I view my data in DB Browser for SQLite, special characters don't appear and the � symbol shows up instead.
I first convert .text files to Mbox files with the following function:
def makeMBox(fIn,fOut):
    if not os.path.exists(fIn):
        return False
    if os.path.exists(fOut):
        return False
    out = open(fOut,"w")
    lineNum = 0
    # detect encoding
    readsource = open(fIn,'rt').__next__
    #fInCodec = tokenize.detect_encoding(readsource)[0]
    fInCodec = 'UTF-8'
    for line in open(fIn,'rt', encoding=fInCodec, errors="replace"):
        if line.find("From ") == 0:
            if lineNum != 0:
                out.write("\n")
            lineNum +=1
            line = line.replace(" at ", "#")
        out.write(line)
    out.close()
    return True
Then, I convert to sqlite db:
for k in dates:
    db = sqlite_utils.Database("Courriels_Sqlite/Echanges_Discussion.db")
    mbox = mailbox.mbox("Courriels_MBox/"+k+".mbox")
    def to_insert():
        for message in mbox.values():
            Dionyversite = dict(message.items())
            Dionyversite["payload"] = message.get_payload()
            yield Dionyversite
    try:
        db["Dionyversite"].upsert_all(to_insert(), alter = True, pk = "Message-ID")
    except sql.InterfaceError as e:
        print(e)
Thank you for your help.
I found how to fix it:
def to_insert():
    for message in mbox.values():
        Dionyversite = dict(message.items())
        Dionyversite["payload"] = message.get_payload(decode = True)
        yield Dionyversite
As you can see, I added `decode = True` to the `get_payload` call in the `to_insert` function.
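To see what `decode = True` changes, here is a small sketch using only the standard `email` package. The message, its addresses, and the body text are made up for illustration. For a non-multipart message, `get_payload()` returns the payload as stored (here still base64 text), while `get_payload(decode=True)` returns the decoded bytes according to the Content-Transfer-Encoding header.

```python
import base64
import email

# Hypothetical message body containing special characters.
body = "été"
encoded = base64.b64encode(body.encode("utf-8")).decode("ascii")

raw = (
    "From: someone@example.org\n"
    "Message-ID: <1@example.org>\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded + "\n"
)

msg = email.message_from_string(raw)

undecoded = msg.get_payload()            # str, still base64-encoded
decoded = msg.get_payload(decode=True)   # bytes, transfer encoding removed

# Turning the bytes into text still needs the charset declared by the message.
text = decoded.decode(msg.get_content_charset() or "utf-8")

print(repr(undecoded))
print(repr(decoded))
print(text)
```

Two caveats are worth keeping in mind: for a multipart message, `get_payload()` returns a list of message objects (a type a database driver is unlikely to be able to bind), and `get_payload(decode=True)` returns `None`, so multipart mails may need to be walked part by part.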

Python 3.7.3 inserting into bytearray = "object cannot be re-sized"

I'm working with a bytearray built from file data. I'm opening the file as 'r+b', so I can change it as binary.
In the Python 3.7 docs, it explains that a RegEx's finditer() can use m.start() and m.end() to identify the start and end of a match.
In the question Insert bytearray into bytearray Python, the answer says an insert can be made to a bytearray by using slicing. But when this is attempted, the following error is given: BufferError: Existing exports of data: object cannot be re-sized.
Here is an example:
pat = re.compile(rb'0.?\d* [nN]') # regex, binary "0[.*] n"
with open(file, mode='r+b') as f: # updateable, binary
    d = bytearray(f.read()) # read file data as d [as bytes]
    it = pat.finditer(d) # find pattern in data as iterable
    for match in it: # for each match,
        m = match.group() # bytes of the match string to binary m
        ...
        val = b'0123456789 n'
        ...
        d[match.start():match.end()] = bytearray(val)
In the file, the match is 0 n and I'm attempting to replace it with 0123456789 n so would be inserting 9 bytes. The file can be changed successfully with this code, just not increased in size. What am I doing wrong? Here is output showing all non-increasing-filesize operations working, but it failing on inserting digits:
*** Changing b'0.0032 n' to b'0.0640 n'
len(d): 10435, match.start(): 607, match.end(): 615, len(bytearray(val)): 8
*** Found: "0.0126 n"; set to [0.252] or custom:
*** Changing b'0.0126 n' to b'0.2520 n'
len(d): 10435, match.start(): 758, match.end(): 766, len(bytearray(val)): 8
*** Found: "0 n"; set to [0.1] or custom:
*** Changing b'0 n' to b'0.1 n'
len(d): 10435, match.start(): 806, match.end(): 809, len(bytearray(val)): 5
Traceback (most recent call last):
  File "fixV1.py", line 190, in <module>
    main(sys.argv)
  File "fixV1.py", line 136, in main
    nchanges += search(midfile) # perform search, returning count
  File "fixV1.py", line 71, in search
    d[match.start():match.end()] = bytearray(val)
BufferError: Existing exports of data: object cannot be re-sized
This is a simple case, much like modifying a container during iteration:
it = pat.finditer(d) creates a buffer from the bytearray object. This in turn "locks" the bytearray object against being changed in size.
d[match.start():match.end()] = bytearray(val) attempts to change the size of the "locked" bytearray object.
Just as changing the size of a dict while iterating over it raises an error, an attempt to change a bytearray's size while iterating over its buffer also fails.
You can give a copy of the object to finditer(), for example pat.finditer(bytes(d)). The match offsets then refer to the copy, so once a replacement changes the length of d, later offsets no longer line up unless you adjust them or apply the replacements from the last match to the first.
For more information about buffers and how Python works under the hood, see the Python docs.
Also, keep in mind that you're not actually modifying the file. You'll need to either write the data back to the file or use memory-mapped files. I suggest the latter if you're looking for efficiency.
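A minimal sketch of the copy-based approach, including writing the result back to the file, might look like the following. The helper names `replace_matches` and `pick`, the sample data, and the temporary file are assumptions made for illustration; the pattern is also a slightly stricter variant of the one in the question (the dot is escaped).

```python
import os
import re
import tempfile

# Variant of the question's pattern with the dot escaped (an assumption).
pat = re.compile(rb"0\.?\d* [nN]")


def replace_matches(data, choose_value):
    """Return a bytearray with every match replaced by choose_value(match)."""
    d = bytearray(data)
    snapshot = bytes(d)  # immutable copy: finditer never exports d's buffer
    # Work from the last match to the first so earlier offsets stay valid
    # even when a replacement is longer or shorter than the match.
    for match in reversed(list(pat.finditer(snapshot))):
        d[match.start():match.end()] = choose_value(match.group())
    return d


def pick(old):
    # Hypothetical rule: only the bare "0 n" value is changed.
    return b"0.1 n" if old == b"0 n" else old


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(b"a 0 n b 0.0032 n c")

    with open(path, "r+b") as f:
        new_data = replace_matches(f.read(), pick)
        f.seek(0)          # go back to the start before writing
        f.write(new_data)  # write the modified data back
        f.truncate()       # drop leftover bytes if the data got shorter

    with open(path, "rb") as f:
        result = f.read()

print(result)
```

Iterating in reverse is one design choice among several; building a new bytes object with `pat.sub` and a replacement function would avoid offset bookkeeping altogether.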

How to download a sentinel images from google earth engine using python API in tfrecord

When I try to download a Sentinel image for a specific location, a TIF file is generated in my Drive by default, but it is not readable by OpenCV or PIL.Image(). The code is below. If I set the file format to TFRecord, no images are downloaded to the Drive at all.
starting_time = '2018-12-15'
delta = 15
L = -96.98
B = 28.78
R = -97.02
T = 28.74
cordinates = [L,B,R,T]
my_scale = 30
fname = 'sinton_texas_30'
llx = cordinates[0]
lly = cordinates[1]
urx = cordinates[2]
ury = cordinates[3]
geometry = [[llx,lly], [llx,ury], [urx,ury], [urx,lly]]
tstart = datetime.datetime.strptime(starting_time, '%Y-%m-%d')
tend = tstart+datetime.timedelta(days=delta)
collSent = ee.ImageCollection('COPERNICUS/S2').filterDate(str(tstart).split(' ')[0], str(tend).split(' ')[0]).filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20)).map(mask2clouds)
medianSent = ee.Image(collSent.reduce(ee.Reducer.median()))
cropLand = ee.ImageCollection('USDA/NASS/CDL').filterDate('2017-01-01','2017-12-31').first()
task_config = {
    'scale': my_scale,
    'region': geometry,
    'fileFormat':'TFRecord'
}
f1 = medianSent.select(['B1_median','B2_median','B3_median'])
taskSent = ee.batch.Export.image(f1,fname+"_Sent",task_config)
taskSent.start()
I expect the output to be readable in Python so that I can convert it into a NumPy array. With the file format 'TFRecord', I expect the file to be downloaded to my Drive.
I think you should think about the following things:
File format
If you want to open your file with PIL or OpenCV, and not with TensorFlow, you would rather use GeoTIFF. Try with this format and see if things are improved.
Saving to drive
Normally saving to your Drive is the default behavior. However, you can try to force writing to your drive:
ee.batch.Export.image.toDrive(image=f1, ...)
You can also try to set up a folder that the images should be sent to:
ee.batch.Export.image.toDrive(image=f1, folder='foo', ...)
In addition, the Export data help page and this tutorial are good starting points for further research.
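As a rough sketch of how the pieces could fit together, the helpers below build the date strings and the keyword arguments for `ee.batch.Export.image.toDrive` without needing an Earth Engine session. The helper names `date_range` and `drive_export_kwargs` are hypothetical, the default of GeoTIFF reflects the suggestion above, and the actual export call is left as a comment because it requires an authenticated `ee` client.

```python
import datetime


def date_range(start, days):
    """Return (start, end) as 'YYYY-MM-DD' strings, end being start + days."""
    tstart = datetime.datetime.strptime(start, "%Y-%m-%d")
    tend = tstart + datetime.timedelta(days=days)
    return tstart.strftime("%Y-%m-%d"), tend.strftime("%Y-%m-%d")


def drive_export_kwargs(image, name, region, scale,
                        folder=None, file_format="GeoTIFF"):
    """Collect keyword arguments for an export to Drive (illustrative helper)."""
    kwargs = {
        "image": image,
        "description": name,
        "region": region,
        "scale": scale,
        "fileFormat": file_format,
    }
    if folder is not None:
        kwargs["folder"] = folder
    return kwargs


start, end = date_range("2018-12-15", 15)
config = drive_export_kwargs(
    image=None,  # placeholder; would be the ee.Image selected in the question
    name="sinton_texas_30_Sent",
    region=[[-96.98, 28.78], [-96.98, 28.74], [-97.02, 28.74], [-97.02, 28.78]],
    scale=30,
    folder="foo",
)

# With an initialized Earth Engine client, the export would then be started as:
#   task = ee.batch.Export.image.toDrive(**config)
#   task.start()
print(start, end, config["fileFormat"])
```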

Python3 decode removes white spaces when should be kept

I'm reading a binary file that contains firmware for an STM32. I deliberately placed two const strings in the code, which allow me to read the SW version and description from a given file.
When you open the binary file with a hex editor, or even in Python 3, you can see the correct form. But when I run text = data.decode('utf-8', errors='ignore'), it removes the zero bytes from the file! I don't want this, as I rely on those end-of-string characters to properly split and extract the strings that interest me.
(preview of the end of the data variable)
Svc\x00..\Src\adc.c\x00..\Src\can.c\x00defaultTask\x00Task_CANbus_receive\x00Task_LED_Controller\x00Task_LED1_TX\x00Task_LED2_RX\x00Task_PWM_Controller\x00**SW_VER:GN_1.01\x00\x00\x00\x00\x00\x00MODULE_DESC:generic_module\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00**Task_SuperVisor_Controller\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x06\x07\x08\t\x00\x00\x00\x00\x01\x02\x03\x04..\Src\tim.c\x005!\x00\x08\x11!\x00\x08\x01\x00\x00\x00\xaa\xaa\xaa\xaa\x01\x01\nd\x00\x02\x04\nd\x00\x00\x00\x00\xa2J\x04'
(preview of text, i.e. what I receive after decode)
r g # IDLE TmrQ Tmr Svc ..\Src\adc.c ..\Src\can.c
defaultTask Task_CANbus_receive Task_LED_Controller Task_LED1_TX
Task_LED2_RX Task_PWM_Controller SW_VER:GN_1.01
MODULE_DESC:generic_module
Task_SuperVisor_Controller ..\Src\tim.c 5! !
d d J
with open(path_to_file, "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
text = data.decode('utf-8', errors='ignore')
# get index of "SW_VER:" string in the file
sw_ver_index = text.rfind("SW_VER:")
# SW_VER found
if sw_ver_index != -1:
    # retrieve the value, e.g. "SW_VER:WB_2.01" has to start from position 7 and finish at 14
    sw_ver_value = text[sw_ver_index + 7:sw_ver_index + 14]
    module.append(tuple(('DESC:', sw_ver_value)))
else:
    # SW_VER not found
    module.append(tuple(('DESC:', 'N/A')))
# get index of "MODULE_DESC:" string in the file
module_desc_index = text.rfind("MODULE_DESC:")
# MODULE_DESC found
if module_desc_index != -1:
    module_desc_substring = text[module_desc_index + 12:]
    module_desc_value = module_desc_substring.split()
    module.append(tuple(('DESC:', module_desc_value[0])))
    print(module_desc_value[0])
As you can see, my whitespace characters are gone, while they should be present.
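One thing that can be checked directly: decoding with errors='ignore' only drops bytes that are not valid UTF-8 (such as \xaa). The \x00 bytes decode to NUL characters, which stay in the string but are usually invisible when printed, and str.split() without arguments does not treat them as whitespace. A possible way around this, sketched below with a hypothetical helper `read_tagged_string` and made-up sample data, is to search the raw bytes and cut each value at the first NUL terminator.

```python
def read_tagged_string(data, tag, default="N/A"):
    """Return the NUL-terminated ASCII string that follows the last `tag`."""
    start = data.rfind(tag)
    if start == -1:
        return default
    start += len(tag)
    end = data.find(b"\x00", start)  # C strings end at the first NUL byte
    if end == -1:
        end = len(data)
    return data[start:end].decode("ascii", errors="replace")


# Shortened, made-up sample in the spirit of the dump in the question.
data = (
    b"Task_PWM_Controller\x00SW_VER:GN_1.01\x00\x00\x00"
    b"MODULE_DESC:generic_module\x00\x00\x00\xaa\xaa\x01"
)

# The NUL bytes survive decoding; only the invalid \xaa bytes are dropped.
text = data.decode("utf-8", errors="ignore")
nul_count_bytes = data.count(b"\x00")
nul_count_text = text.count("\x00")

sw_ver = read_tagged_string(data, b"SW_VER:")
module_desc = read_tagged_string(data, b"MODULE_DESC:")
missing = read_tagged_string(data, b"NOT_THERE:")

print(sw_ver, module_desc, missing)
```

Working on bytes avoids any dependence on how the rest of the firmware image happens to decode, and it does not rely on the version string having a fixed length.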

Selective download and extraction of data (CAB)

I have a specific need to download and extract a CAB file, but each CAB file is huge (> 200 MB). I want to selectively download files from the CAB, as the rest of the data is useless to me.
What I have done so far:
Request 1% of the file from the server. Get the headers and parse them.
Get the file list and the file offsets according to This CAB Link.
Send a GET request to the server with the Range header set to the file offset and the offset + size.
I am able to get the response, but it is "unreadable" because it is compressed (LZX:21, according to 7-Zip).
I am unable to decompress it using zlib; it throws an invalid header error.
Also, I did not quite understand, nor could I trace, the CFFOLDER or CFDATA structures as shown in the example, because the example is uncompressed.
import os
import struct
import requests

totalByteArray =b''
eofiles =0
def GetCabMetaData(stream):
    global eofiles
    cabMetaData={}
    try:
        cabMetaData["CabFormat"] = stream[0:4].decode('ANSI')
        cabMetaData["CabSize"] = struct.unpack("<L",stream[8:12])[0]
        cabMetaData["FilesOffset"] = struct.unpack("<L",stream[16:20])[0]
        cabMetaData["NoOfFolders"] = struct.unpack("<H",stream[26:28])[0]
        cabMetaData["NoOfFiles"] = struct.unpack("<H",stream[28:30])[0]
        # skip 30,32,34,35
        cabMetaData["Files"]= {}
        cabMetaData["Folders"]= {}
        baseOffset = cabMetaData["FilesOffset"]
        internalOffset = 0
        for i in range(0,cabMetaData["NoOfFiles"]):
            fileDetails = {}
            fileDetails["Size"] = struct.unpack("<L",stream[baseOffset+internalOffset:][:4])[0]
            fileDetails["UnpackedStartOffset"] = struct.unpack("<L",stream[baseOffset+internalOffset+4:][:4])[0]
            fileDetails["FolderIndex"] = struct.unpack("<H",stream[baseOffset+internalOffset+8:][:2])[0]
            fileDetails["Date"] = struct.unpack("<H",stream[baseOffset+internalOffset+10:][:2])[0]
            fileDetails["Time"] = struct.unpack("<H",stream[baseOffset+internalOffset+12:][:2])[0]
            fileDetails["Attrib"] = struct.unpack("<H",stream[baseOffset+internalOffset+14:][:2])[0]
            # the file name is a NUL-terminated string after the 16-byte fixed part
            fileName =''
            for j in range(0,len(stream)):
                if(chr(stream[baseOffset+internalOffset+16 +j])!='\x00'):
                    fileName +=chr(stream[baseOffset+internalOffset+16 +j])
                else:
                    break
            internalOffset += 16+j+1
            cabMetaData["Files"][fileName] = (fileDetails.copy())
        eofiles = baseOffset + internalOffset
    except Exception as e:
        print(e)
        pass
    print(cabMetaData["CabSize"])
    return cabMetaData
def GetFileSize(url):
    resp = requests.head(url)
    return int(resp.headers["Content-Length"])
def GetCABHeader(url):
    global totalByteArray
    size = GetFileSize(url)
    newSize ="bytes=0-"+ str(int(0.01*size))
    totalByteArray = b''
    cabHeader= requests.get(url,headers={"Range":newSize},stream=True)
    for chunk in cabHeader.iter_content(chunk_size=1024):
        totalByteArray += chunk
def DownloadInfFile(baseUrl,InfFileData,InfFileName):
    global totalByteArray,eofiles
    if(not os.path.exists("infs")):
        os.mkdir("infs")
    baseCabName = baseUrl[baseUrl.rfind("/"):]
    baseCabName = baseCabName.replace(".","_")
    if(not os.path.exists("infs\\" + baseCabName)):
        os.mkdir("infs\\"+baseCabName)
    fileBytes = b''
    newRange = "bytes=" + str(eofiles+InfFileData["UnpackedStartOffset"] ) + "-" + str(eofiles+InfFileData["UnpackedStartOffset"]+InfFileData["Size"] )
    data = requests.get(baseUrl,headers={"Range":newRange},stream=True)
    with open("infs\\"+baseCabName +"\\" + InfFileName ,"wb") as f:
        for chunk in data.iter_content(chunk_size=1024):
            fileBytes +=chunk
        f.write(fileBytes)
        f.flush()
    print("Saved File " + InfFileName)
    pass
def main(url):
    GetCABHeader(url)
    cabMetaData = GetCabMetaData(totalByteArray)
    for fileName,data in cabMetaData["Files"].items():
        if(fileName.endswith(".txt")):
            DownloadInfFile(url,data,fileName)
main("http://path-to-some-cabinet.cab")
All the file details are correct. I have verified them.
Any guidance will be appreciated. Am I doing it wrong? Is there another way, perhaps?
P.S.: I have already looked into This Post.
First, the data in the CAB is raw deflate, not zlib-wrapped deflate. So you need to ask zlib's inflate() to decode raw deflate with a negative windowBits value on initialization.
Second, the CAB format does not exactly use standard deflate, in that the 32K sliding window dictionary carries from one block to the next. You'd need to use inflateSetDictionary() to set the dictionary at the start of each block using the last 32K decompressed from the last block.
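In Python, the two points above could be sketched with `zlib.decompressobj`, which accepts a negative `wbits` for raw deflate and a `zdict` preset dictionary. This is only an illustration under assumptions: it applies to cabinets whose folders use MSZIP (deflate) compression, where each data block starts with the two-byte signature `CK`, and it does not cover LZX, which is a different algorithm that zlib cannot decode. The function name `inflate_mszip_blocks` is hypothetical, and the input is assumed to be the compressed payloads of consecutive CFDATA blocks of one folder, already stripped of their CFDATA headers.

```python
import zlib

WINDOW = 32768  # size of the deflate sliding window carried between blocks


def inflate_mszip_blocks(blocks):
    """Inflate consecutive MSZIP blocks, carrying the dictionary across them."""
    out = bytearray()
    history = b""
    for block in blocks:
        if block[:2] != b"CK":
            raise ValueError("block does not start with the MSZIP 'CK' signature")
        if history:
            # Raw deflate (negative wbits) primed with the previous output.
            inflater = zlib.decompressobj(wbits=-15, zdict=history)
        else:
            inflater = zlib.decompressobj(wbits=-15)
        out += inflater.decompress(block[2:])
        history = bytes(out[-WINDOW:])  # last 32K of decompressed data
    return bytes(out)


def _make_block(chunk, history):
    # Test helper (not part of any CAB tooling): build a block the same way,
    # as raw deflate that may refer back to the preceding data.
    if history:
        deflater = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=history)
    else:
        deflater = zlib.compressobj(9, zlib.DEFLATED, -15)
    return b"CK" + deflater.compress(chunk) + deflater.flush()


first = b"[Version]\r\nSignature=$WINDOWS NT$\r\n" * 20
second = b"[Version]\r\nSignature=$WINDOWS NT$\r\nClass=Net\r\n" * 20
blocks = [_make_block(first, b""), _make_block(second, first[-WINDOW:])]

restored = inflate_mszip_blocks(blocks)
print(len(restored))
```

Because each block depends on the output of the blocks before it, a single file cannot in general be inflated from its own byte range alone; decoding has to start at the first block of the folder that contains it.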
