How to list files inside tar in AWS S3 without downloading it? - python-3.x

While looking around for ideas I found https://stackoverflow.com/a/54222447/264822 for zip files which I think is a very clever solution. But it relies on zip files having a Central Directory - tar files don't.
I thought I could follow the same general principle and expose the S3 file to tarfile through the fileobj parameter:
import boto3
import io
import tarfile
class S3File(io.BytesIO):
    def __init__(self, bucket_name, key_name, s3client):
        super().__init__()
        self.bucket_name = bucket_name
        self.key_name = key_name
        self.s3client = s3client
        self.offset = 0

    def close(self):
        return

    def read(self, size):
        print('read: offset = {}, size = {}'.format(self.offset, size))
        start = self.offset
        end = self.offset + size - 1
        try:
            s3_object = self.s3client.get_object(Bucket=self.bucket_name, Key=self.key_name, Range="bytes=%d-%d" % (start, end))
        except:
            return bytearray()
        self.offset = self.offset + size
        result = s3_object['Body'].read()
        return result

    def seek(self, offset, whence=0):
        if whence == 0:
            print('seek: offset {} -> {}'.format(self.offset, offset))
            self.offset = offset

    def tell(self):
        return self.offset
s3file = S3File(bucket_name, file_name, s3client)
tarf = tarfile.open(fileobj=s3file)
names = tarf.getnames()
for name in names:
    print(name)
This works fine except the output looks like:
read: offset = 0, size = 2
read: offset = 2, size = 8
read: offset = 10, size = 8192
read: offset = 8202, size = 1235
read: offset = 9437, size = 1563
read: offset = 11000, size = 3286
read: offset = 14286, size = 519
read: offset = 14805, size = 625
read: offset = 15430, size = 1128
read: offset = 16558, size = 519
read: offset = 17077, size = 573
read: offset = 17650, size = 620
(continued)
tarfile is just reading the whole file anyway, so I haven't gained anything. Is there any way of making tarfile only read the parts of the file it needs? The only alternative I can think of is re-implementing the tar file parsing so that it:
Reads the 512-byte header and writes it into a BytesIO buffer.
Gets the size of the file that follows and writes zeroes into the BytesIO buffer.
Skips over the file to the next header.
But this seems overly complicated (a rough sketch of the idea is included below).
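For what it's worth, here is a rough, untested sketch of that idea, assuming an uncompressed tar and the S3File class above: read each 512-byte header with a ranged GET, pull the name and size out of it, and jump straight to the next header instead of buffering the file contents.
import tarfile

def list_names(s3file):
    names = []
    offset = 0
    while True:
        s3file.seek(offset)
        header = s3file.read(tarfile.BLOCKSIZE)  # 512-byte header block
        if len(header) < tarfile.BLOCKSIZE or header == b"\0" * tarfile.BLOCKSIZE:
            break  # end of archive (or read past the end)
        name = header[0:100].rstrip(b"\0").decode("utf-8")
        size = int(header[124:136].rstrip(b" \0") or b"0", 8)  # size field is octal
        names.append(name)
        # the payload is padded up to the next 512-byte boundary
        blocks = (size + tarfile.BLOCKSIZE - 1) // tarfile.BLOCKSIZE
        offset += tarfile.BLOCKSIZE * (1 + blocks)
    return names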

My mistake. I'm actually dealing with tar.gz files, but I assumed that zip and tar.gz are similar. They're not - a tar is an archive file which is then compressed with gzip, so to read the tar you have to decompress it first. My idea of pulling bits out of the tar file won't work.
What does work is:
s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
wholefile = s3_object['Body'].read()
fileobj = io.BytesIO(wholefile)
tarf = tarfile.open(fileobj=fileobj)
names = tarf.getnames()
for name in names:
    print(name)
I suspect the original code will work for a tar file but I don't have any to try it on.
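If holding the whole file in memory is a concern, a minimal sketch (untested) would be to stream the object through tarfile's non-seekable mode instead; it still transfers every compressed byte, but decompresses on the fly without buffering the whole archive:
import boto3
import tarfile

s3client = boto3.client("s3")
body = s3client.get_object(Bucket=bucket_name, Key=file_name)['Body']
# mode "r|gz" treats the source as a forward-only stream, so tarfile only
# needs read() and decompresses as it goes
with tarfile.open(fileobj=body, mode="r|gz") as tarf:
    for member in tarf:
        print(member.name)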

I just tested your original code on a tar file and it works quite well.
Here is my sample output (truncated). I made some minor changes to display the total downloaded bytes and the seek step size in kB (published at this gist). This is for a 1 GB tar file containing 321 files (average size per file is 3 MB):
read: offset = 0, size = 2, total download = 2
seek: offset 2 -> 0 (diff = -1 kB)
read: offset = 0, size = 8192, total download = 8194
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 8192, total download = 16386
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 512, total download = 16898
<TarInfo 'yt.txt' at 0x7fbbed639ef0>
seek: offset 512 -> 7167 (diff = 6 kB)
read: offset = 7167, size = 1, total download = 16899
read: offset = 7168, size = 512, total download = 17411
<TarInfo 'yt_cache/youtube-sigfuncs' at 0x7fbbed639e20>
read: offset = 7680, size = 512, total download = 17923
...
<TarInfo 'yt_vids/whistle_dolphins-SZTC_zT9ijg.m4a' at 0x7fbbecc697a0>
seek: offset 1004473856 -> 1005401599 (diff = 927 kB)
read: offset = 1005401599, size = 1, total download = 211778
read: offset = 1005401600, size = 512, total download = 212290
None
322
So this downloads 212 kB for a 1GB tar file in order to get a list of 321 filenames in about 2 minutes on colab and 1.5 minutes on ec2 in the same region as the bucket.
In comparison, it takes 17 seconds to download the full file on colab and 1 second to list the files in it with tar -tf file.tar. So if I'm optimizing for execution time, I'd rather just download the full file and work on it locally. Otherwise, there might be some optimization that could be done in your original code? IDK.
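One possible optimization, purely as an untested sketch: over-fetch a larger range per request and serve tarfile's many small reads from a local buffer, which should cut down the number of GET round trips (the 256 kB chunk size is an arbitrary assumption):
class BufferedS3File(S3File):
    CHUNK = 256 * 1024  # minimum bytes to fetch per GET (assumption)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._buf_start = 0
        self._buf = b""

    def read(self, size):
        start = self.offset
        end = start + size
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= start and end <= buf_end):
            fetch = max(size, self.CHUNK)
            rng = "bytes={}-{}".format(start, start + fetch - 1)
            try:
                obj = self.s3client.get_object(
                    Bucket=self.bucket_name, Key=self.key_name, Range=rng)
            except Exception:
                return b""
            self._buf_start = start
            self._buf = obj["Body"].read()
        lo = start - self._buf_start
        data = self._buf[lo:lo + size]
        self.offset += len(data)
        return data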
OTOH, fetching a single file is more efficient than the above 2 minutes if it's at the beginning of the tar, but as slow as getting all file names if it's at the end. But I couldn't do that with the getmember() function because it seems that it internally calls getmembers(), which has to go through the full file. Instead, I rolled my own while loop to find the file and abort the search once found:
bucket_name, file_name = "bucket", "file.tar"
import boto3
s3client = boto3.client("s3")
s3file = S3File(bucket_name, file_name, s3client)
import tarfile
with tarfile.open(mode="r", fileobj=s3file) as tarf:
    tarinfo = tarf.next()
    while tarinfo is not None:
        if tarinfo.name == name_search:  # name_search: the member you are looking for
            break
        tarinfo = tarf.next()
I think a future direction for this would be to have the tarfile.open(...) call cache the offsets of each file, so that a subsequent tarfile.open(...) doesn't go through the full file again. Once that's done, a first pass through the tar file will allow downloading individual files from the tar in S3 without going through the full file again and again for each file. A rough sketch of that idea follows.
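A rough sketch of that offset-caching idea (untested), assuming an uncompressed tar and the S3File class from the question: scan the archive once with tarf.next() to record each member's offset_data and size, then fetch any single member later with a plain ranged GET:
import tarfile

def build_index(s3file):
    # one pass over the archive headers, recording where each payload starts
    index = {}
    with tarfile.open(mode="r", fileobj=s3file) as tarf:
        member = tarf.next()
        while member is not None:
            index[member.name] = (member.offset_data, member.size)
            member = tarf.next()
    return index

def fetch_member(s3client, bucket_name, key_name, index, name):
    # download just one member's bytes with a ranged GET
    offset, size = index[name]
    rng = "bytes={}-{}".format(offset, offset + size - 1)
    obj = s3client.get_object(Bucket=bucket_name, Key=key_name, Range=rng)
    return obj["Body"].read()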
Side note, couldn't you have just run gunzip on the tar.gz to get the tar to test on?

Related

Byte Compression from file to image

I have been trying to compress files based on the byte content in them, then store this compression as an image (PNG) and reverse the process to get the original file back.
I have tested all the compression methods I can and none seem to work too well for the test files I am using.
The big test file I am using is a PNG of 7,187 KB (7,358,681 bytes), and the best compression I can get out of it is with blosc, which results in 7,191 KB (7,362,976 bytes).
This is using binary compression only. When I use PIL to open the image and extract the pixels, I can shrink it by almost 70%, down to 2,676 KB (2,740,053 bytes).
code:
from PIL import Image
from PIL.PngImagePlugin import PngInfo
import io, math, os
import zlib, bz2, pylzma, lzma, blosc  # all tested

def __file_to_bytes(self, fname):
    with open(fname, 'rb+') as fp:
        bts = fp.read()
    return bts

def __image_to_bytes(self, fname):
    with Image.open(fname) as fp:
        w, h = fp.size  # used to re-create the image afterwards
        bts = fp.tobytes()
    return bts

def compresser(self, algo, fname, outname):
    bts = self.__file_to_bytes(fname)    # file to bytes = better compatibility and no compression
    #bts = self.__image_to_bytes(fname)  # png to bytes = better compression
    compressed_bytes = algo.compress(bts)
    compressed_image_output = self.__to_image(outname, bts, fname)

def __bytes_to_rgb(self, bts):
    padding_len = 0  # adds black pixels to make the resulting data a perfect square
    p = []
    for b in bts:
        p.append(int(b))
    img = list(self.divide_chunks(p, 3))
    while len(img[-1]) % 3 != 0:
        img[-1].append(0)
        padding_len += 1
    for i in range(len(img)):
        img[i] = tuple(img[i])
    size = math.ceil(math.sqrt(len(img)))
    for i in range(size**2 - len(img)):
        img.append((0, 0, 0))
        padding_len += 3
    p = (255, math.floor(padding_len / 255), padding_len % 255)
    img[-1] = p  # encodes the last pixel as the padding length (255*y + z)
    return img, size

def __rgb_to_image(self, arr, fname, size, ofname):
    metadata = PngInfo()
    metadata.add_text("filename", ofname)
    output = Image.new('RGB', (size, size))
    output.putdata(arr)
    output.save(fname, pnginfo=metadata)
    return output

def __to_image(self, output_fname, bts, ofname):
    img, size = self.__bytes_to_rgb(bts)
    output_img = self.__rgb_to_image(img, output_fname, size, ofname)
    return output_img
Although I am using a PNG image as the input file, this can easily be changed to any type of file, so the compression needs to be lossless. However, there is no time criticality here, so it doesn't matter how slow the method is.
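A minimal sketch of the comparison described above (zlib only; test.png is a placeholder file name): compressing the raw PNG bytes gains almost nothing because PNG data is already deflate-compressed, while compressing the decoded pixel data can shrink considerably:
import zlib
from PIL import Image

fname = "test.png"  # placeholder test file
file_bytes = open(fname, "rb").read()          # raw, already-compressed PNG bytes
pixel_bytes = Image.open(fname).tobytes()      # decoded pixel data

print("file bytes compressed:", len(zlib.compress(file_bytes, 9)))
print("pixel bytes compressed:", len(zlib.compress(pixel_bytes, 9)))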

Why is SFTP parallel upload of a file giving me recursively appended data?

I am trying to split a file into chunks and use threads to write to a file in the SFTP server.
The file name: hi.txt
The data:
hi....
file upload check.
The code to upload:
threads_count = 4
data_ap = {}
size = os.path.getsize(local_path)
part_size = int(size / threads_count)
lock = threading.Lock()

def open_ssh():
    # ssh connection codes
    return ssh

def upload_part(num, offset, part_size, remote_path_part):
    #print(f"Running thread {num}")
    try:
        ssh = open_ssh()
        sftp = ssh.open_sftp()
        with open(local_path, "rb") as fl:
            fl.seek(offset)
            with lock:
                with sftp.open(remote_path_part, "ab") as fr:
                    fr.set_pipelined(True)
                    size = 0
                    while size < part_size:
                        s = 32768
                        if size + s > part_size:
                            s = part_size - size
                        data = fl.read(s)
                        data_ap[num] = data
                        print(data)
                        # print({offset : data})
                        fr.write(str(data_ap))
                        #fr.write(data)
                        size += len(data)
                        if len(data) == 0:
                            break
    except (paramiko.ssh_exception.SSHException) as x:
        print(f"Thread {num} failed: {x}")
    #print(f"Thread {num} done")
    #ssh.close()

#start_time = time.time()
print("Starting")
offset = 0
threads = []
part_filenames = []
for num in range(threads_count):
    if num == threads_count - 1:
        part_size = size - offset
    #remote_path_part = f"{remote_path}.{num}"
    args = (num, offset, part_size, remote_path)
    #print(f"Starting thread {num} offset {offset} size {part_size} " + f"part name {remote_path}")
    thread = threading.Thread(target=upload_part, args=args)
    threads.append(thread)
    part_filenames.append(remote_path)
    thread.start()
    #print(f"Started thread {num}")
    offset += part_size

for num in range(len(threads)):
    #print(f"Waiting for thread {num}")
    threads[num].join()
print("All thread done")
Now I have two problems.
First:
I am not getting the data in the correct order: because the file is divided into chunks across threads, the chunks end up in a different order in the uploaded file.
The uploaded data:
upload check. code to uploadhi ... the
Second:
To solve the above issue, I thought of using a dictionary where the key is the thread number and the value is the data, so that during download I can reconstruct the file by key order. But the data is getting appended recursively, like this:
The uploaded data:
{0: b'file uploa'}{0: b'file uploa', 1: b'd check.\nc'}{0: b'file uploa', 1: b'd check.\nc', 3: b' if appends'}{0: b'file uploa', 1: b'd check.\nc', 3: b' if appends', 2: b'heck again'}
How to fix this?
It would be preferable to fix the first problem, so that I do not have to use any dictionary to rearrange the data after uploading.
Reference for the upload code
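For reference, a minimal sketch of one way to avoid both issues (assuming Paramiko, and that the remote file has been created once before the threads start, e.g. with sftp.open(remote_path, "w")): have each thread seek to its own offset in the remote file and write the raw bytes, so no dictionary or reordering is needed:
def upload_part(num, offset, part_size, remote_path):
    ssh = open_ssh()
    sftp = ssh.open_sftp()
    with open(local_path, "rb") as fl:
        fl.seek(offset)
        data = fl.read(part_size)
    # "r+" requires the remote file to already exist
    with sftp.open(remote_path, "r+") as fr:
        fr.seek(offset)   # write this chunk at its own position
        fr.write(data)    # raw bytes, not str(dict)
    ssh.close()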

File Size not showing accurate in os.stat(y).st_size

I am trying to get file names and their sizes. The code runs fine but the sizes are coming up short; for example, a 3 KB file shows up as 2699 bytes.
import os

created = []
fname = []

def list_files_and_sizes(*argv):
    path, sizes = argv
    size = []
    dirs = os.listdir(path)
    for i in dirs:
        fname.append(os.path.join(path, str(i)))
    for y in fname:
        file_stats = os.stat(y).st_size
        if file_stats >= sizes:
            print(y, file_stats)
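As a side note (a minimal sketch, not part of the question's code): os.stat().st_size reports an exact byte count, while file managers usually show a rounded KB figure, so 2699 bytes is displayed as roughly 3 KB:
import os

size_bytes = os.stat("example.txt").st_size  # "example.txt" is a placeholder
print(size_bytes, "bytes =", round(size_bytes / 1024, 2), "KB")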

Downloaded GZ files is showing 0 byte

I am using the OCI Python SDK, and when I try to download an object in GZ format from an OCI bucket, it gets downloaded but the file size is zero bytes. Attaching the code.
Any help is much appreciated.
import os
import oci
import io
import sys

reporting_namespace = 'xygabcdef'
prefix_file = "abc/xyz"

# Update these values
destination_path = 'downloaded_reports'

# Make a directory to receive reports
if not os.path.exists(destination_path):
    os.mkdir(destination_path)

# Get the list of reports
config = oci.config.from_file(oci.config.DEFAULT_LOCATION, oci.config.DEFAULT_PROFILE)
reporting_bucket = sys.argv[1]
object_storage = oci.object_storage.ObjectStorageClient(config)
report_bucket_objects = object_storage.list_objects(reporting_namespace, reporting_bucket, prefix=prefix_file)

#def download_audit():
for o in report_bucket_objects.data.objects:
    print('Found file ' + o.name)
    object_details = object_storage.get_object(reporting_namespace, reporting_bucket, o.name)
    print(object_details)
    filename = o.name.rsplit('/', 1)[-1]
    with open(destination_path + '/' + filename, 'wb') as f:
        for chunk in object_details.data.raw.stream(1024 * 1024, decode_content=False):
            f.write(chunk)
Please see the example here. Does this work for you? Namely:
get_obj = object_storage.get_object(namespace, bucket_name, example_file_object_name)
with open('example_file_retrieved', 'wb') as f:
    for chunk in get_obj.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
In your example destintation_path seems to be undefined, and seems to have a typo (destintation -> destination). Could this be the problem?
Lastly, what does object_details report the file size / content-length as? It could be that the file size of the object in Object Storage is itself 0 bytes.
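For example (a small sketch, assuming the OCI Response object exposes the HTTP headers), you could print the reported size before writing anything:
object_details = object_storage.get_object(reporting_namespace, reporting_bucket, o.name)
print(object_details.headers.get("Content-Length"))  # size reported by Object Storage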
The .content of the .data returned by get_object should give you the file data (binary or text/json/...), so here is a modified version of your code:
import os
import sys
import oci

reporting_namespace = 'xygabcdef'
prefix_file = "abc/xyz"

# Update these values
destination_path = 'downloaded_reports'

# Get the list of reports
config = oci.config.from_file(oci.config.DEFAULT_LOCATION, oci.config.DEFAULT_PROFILE)
reporting_bucket = sys.argv[1]
object_storage = oci.object_storage.ObjectStorageClient(config)
objects = object_storage.list_objects(reporting_namespace, reporting_bucket, prefix=prefix_file).data.objects

# def download_audit():
for obj in objects:
    print('Found file ' + obj.name)
    object_response = object_storage.get_object(reporting_namespace, reporting_bucket, obj.name).data
    print(object_response)
    file_path = os.path.join(destination_path, obj.name)
    # Make sure parent dirs up to the file level are created
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, 'wb') as file:
        file.write(object_response.content)

Split a zip archive into multiple chunks

I'm trying to create a zip archive of a possibly huge folder.
For this purpose I'm using the python zipfile module, but as far as I can see there is no option to split the created archive into multiple chunks with a max size.
The zipped archive is supposed to be sent via Telegram, which has a size limit of 1.5 GB per file, so I need to split the resulting zip archive.
I would really like to not use a subprocess and shell commands for creating this archive.
My current code looks like this:
import os
from zipfile import ZipFile, ZIP_LZMA

def create_zip(archive_name, directory):
    """Create a zip file from given dir path."""
    with ZipFile(archive_name, "w", ZIP_LZMA) as target_zip_file:
        for root, _, files in os.walk(directory):
            for file_to_zip in files:
                absolute_path = os.path.join(root, file_to_zip)
                zip_file_name = absolute_path[len(directory) + len(os.sep):]
                target_zip_file.write(absolute_path, zip_file_name)
    return target_zip_file
Thanks in Advance
Here is what I use to send a file to a Telegram channel via a Telegram bot.
The file size limit is 50 MB when uploading through a Telegram bot.
The file size limit is 1500 MB when uploading through a Telegram client, but you may add some text or other info, so 1495 is safer.
#! /usr/bin/python3
# -*- coding:utf-8 -*-
# apt-get install p7zip-full
import subprocess
import os
import math
import logzero

logger = logzero.logger

MAX_SPLIT_SIZE = 1495

def file_split_7z(file_path, split_size=MAX_SPLIT_SIZE):
    file_path_7z_list = []
    # if the origin file is a 7z file, rename it
    origin_file_path = ""
    if os.path.splitext(file_path)[1] == ".7z":
        origin_file_path = file_path
        file_path = os.path.splitext(origin_file_path)[0] + ".7zo"
        os.rename(origin_file_path, file_path)
    # do 7z compress
    fz = os.path.getsize(file_path) / 1024 / 1024
    pa = math.ceil(fz / split_size)
    head, ext = os.path.splitext(os.path.abspath(file_path))
    archive_head = "".join((head, ext.replace(".", "_"))) + ".7z"
    for i in range(pa):
        check_file_name = "{}.{:03d}".format(archive_head, i + 1)
        if os.path.isfile(check_file_name):
            logger.debug("remove exists file | {}".format(check_file_name))
            os.remove(check_file_name)
    cmd_7z = ["7z", "a", "-v{}m".format(split_size), "-y", "-mx0", archive_head, file_path]
    proc = subprocess.Popen(cmd_7z, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if b"Everything is Ok" not in out:
        logger.error("7z output | {}".format(out.decode("utf-8")))
        logger.error("7z error | {}".format(err.decode("utf-8")))
        return file_path_7z_list
    for i in range(pa):
        file_path_7z_list.append("{}.{:03d}".format(archive_head, i + 1))
    # if the origin file is a 7z file, rename it back
    if origin_file_path:
        os.rename(file_path, origin_file_path)
    return file_path_7z_list

def do_file_split(file_path, split_size=MAX_SPLIT_SIZE):
    """Calculate the split size.
    Example: if the max split size is 1495 and the file size is 2000,
    then the number of split parts should be int(2000 / 1495 + 0.5) = 2,
    so the split size should be 1000 + 1000 rather than 1495 + 505.
    As the size of each part increases, the upload risk increases too.
    """
    file_size = os.path.getsize(file_path) / 2 ** 20
    split_part = math.ceil(file_size / split_size)
    new_split_size = math.ceil(file_size / split_part)
    logger.info("file size | {} | split num | {} | split size | {}".format(file_size, split_part, new_split_size))
    file_path_7z_list = file_split_7z(file_path, split_size=new_split_size)
    return file_path_7z_list
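A quick usage example of the above (the path is a placeholder; the part names follow 7z's volume naming):
parts = do_file_split("/path/to/backup.tar")  # placeholder path
for p in parts:
    print(p)  # e.g. /path/to/backup_tar.7z.001, /path/to/backup_tar.7z.002, ...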
In case you don't find a better, native way with zipfile, you could still write the file splitting algorithm yourself. Something like this:
outfile = archive_name
packet_size = int(1.5 * 1024**3)  # bytes

with open(outfile, "rb") as output:
    filecount = 0
    while True:
        data = output.read(packet_size)
        print(len(data))
        if not data:
            break  # we're done
        with open("{}{:03}".format(outfile, filecount), "wb") as packet:
            packet.write(data)
        filecount += 1
And do something similar to put it back together on the receiver's side.
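For instance, a minimal sketch of the receiving side (assuming the parts were written as outfile followed by a three-digit counter, as above): concatenate the numbered parts back into the original archive:
import glob

parts = sorted(glob.glob("{}[0-9][0-9][0-9]".format(outfile)))
with open("reassembled.zip", "wb") as whole:  # hypothetical output name
    for part in parts:
        with open(part, "rb") as chunk:
            whole.write(chunk.read())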
