So I have a specific need to download and extract a CAB file, but each CAB file is huge (> 200 MB). I want to selectively download files from the CAB, since the rest of the data is useless to me.
What I have done so far:
Request 1% of the file from the server. Get the headers and parse them.
Get the list of files and their offsets according to This CAB Link.
Send a GET request to the server with the Range header set to the file's offset through offset + size.
I am able to get the response, but it is "unreadable" because it is compressed (LZX:21, according to 7-Zip).
I am unable to decompress it using zlib; it throws an invalid header error.
Also, I could not quite understand or trace the CFFOLDER or CFDATA structures as shown in the example, because that example is uncompressed.
import os
import struct
import requests

totalByteArray = b''
eofiles = 0

def GetCabMetaData(stream):
    global eofiles
    cabMetaData = {}
    try:
        cabMetaData["CabFormat"] = stream[0:4].decode('ANSI')
        cabMetaData["CabSize"] = struct.unpack("<L", stream[8:12])[0]
        cabMetaData["FilesOffset"] = struct.unpack("<L", stream[16:20])[0]
        cabMetaData["NoOfFolders"] = struct.unpack("<H", stream[26:28])[0]
        cabMetaData["NoOfFiles"] = struct.unpack("<H", stream[28:30])[0]
        # skip 30,32,34,35
        cabMetaData["Files"] = {}
        cabMetaData["Folders"] = {}
        baseOffset = cabMetaData["FilesOffset"]
        internalOffset = 0
        for i in range(0, cabMetaData["NoOfFiles"]):
            fileDetails = {}
            fileDetails["Size"] = struct.unpack("<L", stream[baseOffset+internalOffset:][:4])[0]
            fileDetails["UnpackedStartOffset"] = struct.unpack("<L", stream[baseOffset+internalOffset+4:][:4])[0]
            fileDetails["FolderIndex"] = struct.unpack("<H", stream[baseOffset+internalOffset+8:][:2])[0]
            fileDetails["Date"] = struct.unpack("<H", stream[baseOffset+internalOffset+10:][:2])[0]
            fileDetails["Time"] = struct.unpack("<H", stream[baseOffset+internalOffset+12:][:2])[0]
            fileDetails["Attrib"] = struct.unpack("<H", stream[baseOffset+internalOffset+14:][:2])[0]
            fileName = ''
            for j in range(0, len(stream)):
                if chr(stream[baseOffset+internalOffset+16+j]) != '\x00':
                    fileName += chr(stream[baseOffset+internalOffset+16+j])
                else:
                    break
            internalOffset += 16 + j + 1
            cabMetaData["Files"][fileName] = fileDetails.copy()
        eofiles = baseOffset + internalOffset
    except Exception as e:
        print(e)
    print(cabMetaData["CabSize"])
    return cabMetaData

def GetFileSize(url):
    resp = requests.head(url)
    return int(resp.headers["Content-Length"])

def GetCABHeader(url):
    global totalByteArray
    size = GetFileSize(url)
    newSize = "bytes=0-" + str(int(0.01*size))
    totalByteArray = b''
    cabHeader = requests.get(url, headers={"Range": newSize}, stream=True)
    for chunk in cabHeader.iter_content(chunk_size=1024):
        totalByteArray += chunk

def DownloadInfFile(baseUrl, InfFileData, InfFileName):
    global totalByteArray, eofiles
    if not os.path.exists("infs"):
        os.mkdir("infs")
    baseCabName = baseUrl[baseUrl.rfind("/"):]
    baseCabName = baseCabName.replace(".", "_")
    if not os.path.exists("infs\\" + baseCabName):
        os.mkdir("infs\\" + baseCabName)
    fileBytes = b''
    newRange = "bytes=" + str(eofiles + InfFileData["UnpackedStartOffset"]) + "-" + str(eofiles + InfFileData["UnpackedStartOffset"] + InfFileData["Size"])
    data = requests.get(baseUrl, headers={"Range": newRange}, stream=True)
    with open("infs\\" + baseCabName + "\\" + InfFileName, "wb") as f:
        for chunk in data.iter_content(chunk_size=1024):
            fileBytes += chunk
        f.write(fileBytes)
        f.flush()
    print("Saved File " + InfFileName)

def main(url):
    GetCABHeader(url)
    cabMetaData = GetCabMetaData(totalByteArray)
    for fileName, data in cabMetaData["Files"].items():
        if fileName.endswith(".txt"):
            DownloadInfFile(url, data, fileName)

main("http://path-to-some-cabinet.cab")
All the file details are correct; I have verified them.
Any guidance would be appreciated. Am I doing this wrong? Is there another way, perhaps?
P.S.: I have already looked into This Post.
First, the data in the CAB is raw deflate, not zlib-wrapped deflate. So you need to ask zlib's inflate() to decode raw deflate with a negative windowBits value on initialization.
Second, the CAB format does not exactly use standard deflate, in that the 32K sliding window dictionary carries from one block to the next. You'd need to use inflateSetDictionary() to set the dictionary at the start of each block, using the last 32K decompressed from the previous block.
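For what it's worth, here is a minimal Python sketch of that approach for an MSZIP-compressed folder (the decompress_mszip helper and its input, a list of CFDATA payloads you have already sliced out of the download, are hypothetical); note it will not handle LZX folders like the one 7-Zip reported, which need an LZX decompressor instead:

import zlib

def decompress_mszip(cfdata_payloads):
    # Sketch: decompress the MSZIP CFDATA payloads of one CFFOLDER, in order.
    output = b''
    for payload in cfdata_payloads:
        # Every MSZIP block starts with the 2-byte 'CK' signature,
        # followed by a raw deflate stream (hence the negative wbits).
        assert payload[:2] == b'CK'
        history = output[-32768:]  # 32K window carried over from previous blocks
        if history:
            d = zlib.decompressobj(-zlib.MAX_WBITS, zdict=history)
        else:
            d = zlib.decompressobj(-zlib.MAX_WBITS)
        output += d.decompress(payload[2:]) + d.flush()
    return output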
Problem Background:
I have created an Azure FaceList and I am using my webcam to capture a live feed. I am:
sending the stream to Azure Face Detect,
getting the face rectangle returned by Face Detect, and
using the returned face rectangle to add the face detected from the live video stream to my FaceList.
(I need to create the FaceList in order to solve the problem I explained in my other question, which was answered by Nicolas and which I am following.)
Problem Details:
According to the Azure FaceList documentation at https://learn.microsoft.com/en-us/rest/api/cognitiveservices/face/facelist/addfacefromstream, if there are multiple faces in the image, we need to specify the target face to add to the Azure FaceList.
The problem is: what if we need to add all the detected faces (multiple faces) to the FaceList? Suppose there are two or more faces in a single frame of video; how can I add both of those faces to the FaceList?
I have tried adding the face rectangles returned by Azure Face Detect to a Python list and then iterating over the list indexes, so that each face rectangle can be passed to the FaceList one by one, but to no avail.
Still getting the error:
There are more than one faces in the image
My Code:
import base64
import cv2
# ENDPOINT, KEY, attributes, face_client and utils come from the rest of my project.

face_list_id = "newtest-face-list"
vid = cv2.VideoCapture(0)
count = 0
face_ids_live_Detected = []  # This list will store faceIds from detected faces
list_of_face_rectangles = []
face_rect_counter = 0

while True:
    ret, frame = vid.read()
    check, buffer = cv2.imencode('.jpg', frame)
    img = cv2.imencode('.jpg', frame)[1].tobytes()
    base64_encoded = base64.b64encode(buffer).decode()
    print(type(img))
    detected_faces = utils.detect_face_stream(endpoint=ENDPOINT, key=KEY, image=img, face_attributes=attributes, recognition_model='recognition_03')
    print('Image num {} face detected {}'.format(count, detected_faces))
    count += 1
    color = (255, 0, 0)
    thickness = 2
    for face in detected_faces:
        detected_face_id = face['faceId']
        face_ids_live_Detected.append(detected_face_id)
        detected_face_rectangle = face['faceRectangle']
        list_of_face_rectangles.append(detected_face_rectangle)
        print("detected rectangle =", detected_face_rectangle)
        face_rect_for_facelist = list_of_face_rectangles[face_rect_counter]
        face_rect_counter += 1
        frame = cv2.rectangle(frame, *utils.get_rectangle(face), color, thickness)
    cv2.imshow('frame', frame)
    for face_id_live in face_ids_live_Detected:
        similar_faces = face_client.face.find_similar(face_id=face_id_live, face_list_id=face_list_id)
        if not similar_faces:
            print('No similar faces found !')
            print('Adding Unknown Face to FaceList...')
            facelist_result = utils.facelist_add(endpoint=ENDPOINT, key=KEY, face_list_id=face_list_id, data=img, params=face_rect_for_facelist)
            persisted_face_id = facelist_result['persistedFaceId']
        else:
            print('Similar Face Found!')
            for similar_face in similar_faces:
                face_id_similar = similar_face.face_id
                print("Confidence: " + str(similar_face.confidence))
From my utils file, the code for the facelist_add function is as follows:
def facelist_add(endpoint, key, face_list_id, data=None, json=None, headers=None, params=None, targetFace=None):
    # pylint: disable=too-many-arguments
    """Universal interface for request."""
    method = 'POST'
    url = endpoint + '/face/v1.0/facelists/' + face_list_id + '/persistedfaces'
    # Make it possible to call only with short name (without BaseUrl).
    if not url.startswith('https://'):
        url = BaseUrl.get() + url
    params = {}
    # Setup the headers with default Content-Type and Subscription Key.
    headers = headers or {}
    if 'Content-Type' not in headers:
        headers['Content-Type'] = 'application/octet-stream'
    headers['Ocp-Apim-Subscription-Key'] = key
    params['detectionModel'] = 'detection_03'
    response = requests.request(
        method,
        url,
        params=params,
        data=data,
        json=json,
        headers=headers)
    if response.text:
        result = response.json()
    else:
        result = {}
    return result
When you have several faces in a picture, you have to provide a 'targetFace' in your call to AddFace:
A face rectangle to specify the target face to be added into the face list, in the format of "targetFace=left,top,width,height". E.g. "targetFace=10,10,100,100". If there is more than one face in the image, targetFace is required to specify which face to add. No targetFace means there is only one face detected in the entire image.
See API documentation for this method: https://westeurope.dev.cognitive.microsoft.com/docs/services/563879b61984550e40cbbe8d/operations/563879b61984550f30395250
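For illustration, here is a minimal sketch of such a call with requests (the rectangle values are placeholders, and ENDPOINT, KEY, face_list_id and img are assumed to be the same variables used in the question's code):

import requests

# Hypothetical rectangle for one face, as returned by Face Detect.
rect = {'left': 10, 'top': 10, 'width': 100, 'height': 100}
target = '{left},{top},{width},{height}'.format(**rect)  # "left,top,width,height"

resp = requests.post(
    ENDPOINT + '/face/v1.0/facelists/' + face_list_id + '/persistedfaces',
    params={'targetFace': target, 'detectionModel': 'detection_03'},
    headers={'Ocp-Apim-Subscription-Key': KEY,
             'Content-Type': 'application/octet-stream'},
    data=img)                 # img: raw JPEG bytes of the frame
print(resp.json())            # should contain 'persistedFaceId' on success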
Thanks to everyone who helped, and especially Nicolas R. I just found a mistake and corrected it; now the program is working like a charm.
Azure Face Detect actually returns the face rectangle in top, left, width, height order, which I was feeding directly to the FaceList's targetFace parameter.
I just swapped the first two values of the face rectangle so that it becomes left, top, width, height, which is what the documentation specifies, and it works fine now.
Solution:
I added a new function that takes the faceRectangle dictionary and swaps its first two values.
list_of_faceRect_valuesonly = []

def face_rect_values(faceRect_dict):
    temp_list = []
    for key, value in faceRect_dict.items():
        temp_list.append(value)
    temp_list[0], temp_list[1] = temp_list[1], temp_list[0]
    list_of_faceRect_valuesonly.append(temp_list)
In order to extract values from the list, I did the following:
face_rect_counter=0
face_rect_for_facelist = list_of_faceRect_valuesonly[face_rect_counter]
face_rect_counter +=1
Request to facelist_add function:
facelist_result = utils.facelist_add(endpoint=ENDPOINT, key=KEY, face_list_id=face_list_id,targetFace=face_rect_for_facelist,data=img)
I have also changed my facelist_add function a little bit:
def facelist_add(endpoint, key, face_list_id, targetFace=[], data=None, jsondata=None, headers=None):
    # pylint: disable=too-many-arguments
    """Universal interface for request."""
    method = 'POST'
    url = endpoint + '/face/v1.0/facelists/' + face_list_id + '/persistedfaces'
    # Make it possible to call only with short name (without BaseUrl).
    if not url.startswith('https://'):
        url = BaseUrl.get() + url
    params = {}
    # Setup the headers with default Content-Type and Subscription Key.
    headers = headers or {}
    if 'Content-Type' not in headers:
        headers['Content-Type'] = 'application/octet-stream'
    headers['Ocp-Apim-Subscription-Key'] = key
    list_of_targetfaces = []
    list_of_targetfaces.append(targetFace)
    params = {'targetFace': json.dumps(targetFace)}
    params = {'targetFace': ','.join(map(str, targetFace))}
    print("Printing TargetFaces(facelist_add function) ...", params['targetFace'])
    params['detectionModel'] = 'detection_03'
    url = url + "?"
    response = requests.post(url, params=params, data=data, headers=headers)
    print("Request URL: ", response.url)
    result = None
    # Prevent `response.json()` complains about empty response.
    if response.text:
        result = response.json()
    else:
        result = {}
    return result
I have a piece of Python 3 code that fetches a webpage every 10 seconds; the page returns some JSON information:
import time
import requests

# currenturl is defined elsewhere
s = requests.Session()
while True:
    r = s.get(currenturl)
    data = r.json()
    datetime = data['Timestamp']['DateTime']
    value = data['PV']
    print(str(datetime) + ": " + str(value) + "W")
    time.sleep(10)
The output of this code is:
2020-10-13T13:26:53: 888W
2020-10-13T13:26:53: 888W
2020-10-13T13:26:53: 888W
2020-10-13T13:26:53: 888W
As you can see, the DateTime does not change with every iteration. When I refresh the page manually in my browser, it does get updated every time.
I have tried adding
Cache-Control: max-age=0
to the headers of my request, but that does not resolve the issue.
Even when explicitly setting everything to None at the end of the loop body, the same issue remains:
while True:
    r = s.get(currenturl, headers={'Cache-Control': 'no-cache'})
    data = r.json()
    datetime = data['Timestamp']['DateTime']
    value = data['PV']
    print(str(datetime) + ": " + str(value) + "W")
    time.sleep(10)
    counter += 1
    r = None
    data = None
    datetime = None
    value = None
How can I "force" a refresh of the page with requests.get()?
It turns out this particular website doesn't continuously refresh on its own unless the request comes from its parent URL, so I had to include the original parent URL as the Referer header:
r = s.get(currenturl, headers={'Referer' : 'https://originalurl.com/example'})
Now it works as expected:
2020-10-13T15:32:27: 889W
2020-10-13T15:32:37: 889W
2020-10-13T15:32:47: 884W
2020-10-13T15:32:57: 884W
2020-10-13T15:33:07: 894W
I am trying to download a huge zip file (~9 GB zipped, ~130 GB unzipped) from an FTP server with Python using the ftplib library. Unfortunately, when using the retrbinary method, it does create the file in my local directory, but it does not write anything into the file, and after the code runs for a while I get a timeout error. It used to work fine before, but it stopped working when I tried to go deeper into the use of sockets with this code. Indeed, as the files I am trying to download are huge, I want more control over the connection to prevent timeout errors while downloading. I am not very familiar with sockets, so I may have misused them. I have been searching online but did not find any problems like this. (I also tried with smaller files for testing, but I still have the same issues.)
Here are the functions that I tried; both have problems (method 1 does not write to the file, method 2 downloads the file but I can't unzip it).
import time
import socket
import ftplib
import threading
from zipfile import ZipFile

# To complete
filename = ''
local_folder = ''
ftp_folder = ''
host = ''
user = ''
mp = ''

# timeout error in method 1
def downloadFile_method_1(filename, local_folder, ftp_folder, host, user, mp):
    try:
        ftp = ftplib.FTP(host, user, mp, timeout=1600)
        ftp.set_debuglevel(2)
    except ftplib.error_perm as error:
        print(error)
    with open(local_folder + '/' + filename, "wb") as f:
        ftp.retrbinary("RETR" + ftp_folder + '/' + filename, f.write)

# method 2 works to download zip file, but header error when unzipping it
def downloadFile_method_2(filename, local_folder, ftp_folder, host, user, mp):
    try:
        ftp = ftplib.FTP(host, user, mp, timeout=1600)
        ftp.set_debuglevel(2)
        sock = ftp.transfercmd('RETR ' + ftp_folder + '/' + filename)
    except ftplib.error_perm as error:
        print(error)

    def background():
        f = open(local_folder + '/' + filename, 'wb')
        while True:
            block = sock.recv(1024*1024)
            if not block:
                break
            f.write(block)
        sock.close()

    t = threading.Thread(target=background)
    t.start()
    while t.is_alive():
        t.join(60)
        ftp.voidcmd('NOOP')

def unzip_file(filename, local_folder):
    local_filename = local_folder + '/' + filename
    with ZipFile(local_filename, 'r') as zipObj:
        zipObj.extractall(local_folder)
And the error I get for method 1:
ftplib.error_temp: 421 Timeout - try typing a little faster next time
And the error I get when I try to unzip after using method 2:
zipfile.BadZipFile: Bad magic number for file header
Also, regarding this code, if anyone could explain what these setsockopt calls do, that would be helpful:
ftp.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
ftp.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 75)
ftp.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
Thanks for your help.
I am trying to upload a blob (PDF) file from my laptop to a container in an Azure storage account. I found it to be working, but with one glitch.
I am calculating the file size using:
f_info = os.stat(file_path)
file_size = (f_info.st_size) # returns - 19337
Then I insert this value into the canonicalized headers string below:
ch = "PUT\n\n\n"+str(file_size)+"\n\napplication/pdf\n\n\n\n\n\n\nx-ms-blob-type:BlockBlob" + "\nx-ms-date:" + date + "\nx-ms-version:" + version + "\n"
and send the PUT request to the Put Blob API; however, it returns an error saying, "Authentication failed because the server used below string to calculate the signature":
\'PUT\n\n\n19497\n\napplication/pdf\n\n\n\n\n\n\nx-ms-blob-type:BlockBlob\nx-ms-date:[date]\nx-ms-version:[API version]
Looking at this string, it is obvious that authentication failed because the file size Azure calculated is a different value! I don't understand how it is calculating this file size.
FYI: if I replace 19337 with 19497 in the canonicalized string and re-run, it works!
Any suggestions on where I am making a mistake?
Below is the code:
import os
import base64
import datetime
import hashlib
import hmac
import requests

storage_AccountName = '<storage account name>'
storage_ContainerName = "<container_name>"
storageKey = '<key>'
fd = "C:\\<path>\\<to>\\<file_to_upload>.pdf"
URI = 'https://' + storage_AccountName + '.blob.core.windows.net/<storage_ContainerName>/<blob_file_name.pdf>'
version = '2017-07-29'
date = datetime.datetime.utcnow().strftime("%a, %d %b %Y %H:%M:%S GMT")

if os.path.isfile(fd):
    file_info = os.stat(fd)
    file_size = file_info.st_size
    ch = "PUT\n\n\n" + str(file_size) + "\n\napplication/pdf\n\n\n\n\n\n\nx-ms-blob-type:BlockBlob" + "\nx-ms-date:" + date + "\nx-ms-version:" + version + "\n"
    cr = "/<storage_AccountName>/<storage_Containername>/<blob_file_name.pdf>"
    canonicalizedString = ch + cr
    storage_account_key = base64.b64decode(storageKey)
    byte_canonicalizedString = canonicalizedString.encode('utf-8')
    signature = base64.b64encode(hmac.new(key=storage_account_key, msg=byte_canonicalizedString, digestmod=hashlib.sha256).digest())
    header = {
        'x-ms-blob-type': "BlockBlob",
        'x-ms-date': date,
        'x-ms-version': version,
        'Authorization': 'SharedKey ' + storage_AccountName + ':' + signature.decode('utf-8'),
        # 'Content-Length': str(19497),  # works
        'Content-Length': str(file_size),  # doesn't work
        'Content-Type': "application/pdf"}
    files = {'file': open(fd, 'rb')}
    result = requests.put(url=URI, headers=header, files=files)
    print(result.content)
As mentioned in the comments, the reason you're getting the mismatched content length is that instead of uploading just the file contents, you're uploading a multipart form object that wraps the file contents, and that is what causes the content length to increase.
Please change the following lines of code:
files = {'file': open(fd, 'rb')}
result = requests.put(url = URI, headers = header, files = files)
to something like:
with open(fd, 'rb') as stream:
    result = requests.put(url=URI, headers=header, data=stream)
And now you're only uploading the file contents.
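As a quick illustration of why the two sizes differ (hypothetical path and URL; not needed for the fix): preparing the same request with files= shows that the multipart/form-data body, with its boundary lines and part headers, is larger than the file on disk, which matches the 19497 vs 19337 discrepancy in the question.

import os
import requests

fd = r"C:\path\to\file.pdf"  # placeholder path
uri = 'https://example.blob.core.windows.net/container/file.pdf'  # placeholder URL
req = requests.Request('PUT', uri, files={'file': open(fd, 'rb')}).prepare()
print(len(req.body), os.path.getsize(fd))  # multipart body is larger than the raw file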
I'm trying to load .npy files from my Google Cloud Storage into my model. I followed the example here: Load numpy array in google-cloud-ml job.
But I get this error:
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
Can you help me, please?
Here is a sample of the code.
Here I read the file:
with file_io.FileIO(metadata_filename, 'r') as f:
    self._metadata = [line.strip().split('|') for line in f]
and here I start processing it:
if self._offset >= len(self._metadata):
    self._offset = 0
    random.shuffle(self._metadata)
meta = self._metadata[self._offset]
self._offset += 1

text = meta[3]
if self._cmudict and random.random() < _p_cmudict:
    text = ' '.join([self._maybe_get_arpabet(word) for word in text.split(' ')])

input_data = np.asarray(text_to_sequence(text, self._cleaner_names), dtype=np.int32)
f = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[0])))
linear_target = tf.Variable(initial_value=np.load(f), name='linear_target')

s = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[1])))
mel_target = tf.Variable(initial_value=np.load(s), name='mel_target')
return (input_data, mel_target, linear_target, len(linear_target))
And this is a sample from the data:
This is likely because your file doesn't contain utf-8 encoded text.
It's possible that you need to initialize the file_io.FileIO instance as a binary file using mode='rb', or set binary_mode=True in the call to read_file_to_string.
This will cause data that is read to be returned as a sequence of bytes, rather than a string.
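For instance, a minimal sketch of both suggestions (the gs:// path is a placeholder):

from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io

path = 'gs://path/to/array.npy'  # placeholder

# Option 1: read the whole object as bytes, then load the array from memory.
raw = file_io.read_file_to_string(path, binary_mode=True)
arr = np.load(BytesIO(raw))

# Option 2: open the file in binary mode and let np.load read from it directly.
with file_io.FileIO(path, mode='rb') as f:
    arr2 = np.load(f)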