Split a zip archive into multiple chunks - python-3.x

I'm trying to create a zip archive of a possibly huge folder.
For this purpose I'm using the python zipfile module, but as far as I can see there is no option to split the created archive into multiple chunks with a max size.
The zipped archive is supposed to be sent via Telegram, which has a size limitation of 1.5 GB per file. Therefore I need to split the resulting zip archive.
I would really prefer not to use a subprocess and shell commands to create this archive.
My current code looks like this:
import os
from zipfile import ZipFile, ZIP_LZMA

def create_zip(archive_name, directory):
    """Create a zip file from given dir path."""
    with ZipFile(archive_name, "w", ZIP_LZMA) as target_zip_file:
        for root, _, files in os.walk(directory):
            for file_to_zip in files:
                absolute_path = os.path.join(root, file_to_zip)
                zip_file_name = absolute_path[len(directory) + len(os.sep):]
                target_zip_file.write(absolute_path, zip_file_name)
    return target_zip_file
Thanks in advance.

Here is what I use to send files to a Telegram channel via a Telegram bot.
The file size limit is 50 MB when uploading through a Telegram bot.
The file size limit is 1500 MB when uploading through a Telegram client, but you may add some text or other info, so 1495 is safer.
#! /usr/bin/python3
# -*- coding:utf-8 -*-
# apt-get install p7zip-full
import subprocess
import os
import math

import logzero

logger = logzero.logger

MAX_SPLIT_SIZE = 1495


def file_split_7z(file_path, split_size=MAX_SPLIT_SIZE):
    file_path_7z_list = []

    # if the origin file is a 7z file, rename it first
    origin_file_path = ""
    if os.path.splitext(file_path)[1] == ".7z":
        origin_file_path = file_path
        file_path = os.path.splitext(origin_file_path)[0] + ".7zo"
        os.rename(origin_file_path, file_path)

    # do the 7z compression
    fz = os.path.getsize(file_path) / 1024 / 1024
    pa = math.ceil(fz / split_size)
    head, ext = os.path.splitext(os.path.abspath(file_path))
    archive_head = "".join((head, ext.replace(".", "_"))) + ".7z"
    for i in range(pa):
        check_file_name = "{}.{:03d}".format(archive_head, i + 1)
        if os.path.isfile(check_file_name):
            logger.debug("remove existing file | {}".format(check_file_name))
            os.remove(check_file_name)
    cmd_7z = ["7z", "a", "-v{}m".format(split_size), "-y", "-mx0", archive_head, file_path]
    proc = subprocess.Popen(cmd_7z, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if b"Everything is Ok" not in out:
        logger.error("7z output | {}".format(out.decode("utf-8")))
        logger.error("7z error | {}".format(err.decode("utf-8")))
        return file_path_7z_list
    for i in range(pa):
        file_path_7z_list.append("{}.{:03d}".format(archive_head, i + 1))

    # if the origin file was a 7z file, rename it back
    if origin_file_path:
        os.rename(file_path, origin_file_path)
    return file_path_7z_list


def do_file_split(file_path, split_size=MAX_SPLIT_SIZE):
    """Calculate an even split size.

    Example: with a max split size of 1495 and a file size of 2000,
    the number of parts is math.ceil(2000 / 1495) = 2,
    so the split should be 1000 + 1000 rather than 1495 + 505;
    as the size of each part grows, the upload risk grows too.
    """
    file_size = os.path.getsize(file_path) / 2 ** 20
    split_part = math.ceil(file_size / split_size)
    new_split_size = math.ceil(file_size / split_part)
    logger.info("file size | {} | split num | {} | split size | {}".format(file_size, split_part, new_split_size))
    file_path_7z_list = file_split_7z(file_path, split_size=new_split_size)
    return file_path_7z_list
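For reference, a minimal usage sketch of the helper above; the archive path is hypothetical, and the actual upload is left to whichever Telegram bot or client library you use:

parts = do_file_split("/tmp/big_backup.zip")  # hypothetical archive path
for part in parts:
    # Each part is at most roughly MAX_SPLIT_SIZE MB, so it can be uploaded
    # to Telegram one piece at a time by your bot/client code.
    print("would upload:", part)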

In case you don't find a better, native way with zipfile, you could still write the file splitting algorithm yourself. Something like this:
outfile = archive_name
packet_size = int(1.5 * 1024**3)  # bytes

with open(outfile, "rb") as output:
    filecount = 0
    while True:
        data = output.read(packet_size)
        print(len(data))
        if not data:
            break  # we're done
        with open("{}{:03}".format(outfile, filecount), "wb") as packet:
            packet.write(data)
        filecount += 1
And similarly to put it back together on the receiver's side.
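A minimal sketch of that reassembly step, assuming the parts were produced by the loop above (archive_name000, archive_name001, ...):

import glob

outfile = archive_name
# pick up archive_name000, archive_name001, ... in order
parts = sorted(glob.glob(outfile + "[0-9][0-9][0-9]"))
with open(outfile, "wb") as joined:
    for part in parts:
        with open(part, "rb") as packet:
            joined.write(packet.read())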

Related

Unable to properly increment variable and convert .wav to .mp3

I am trying to create a new recording file every time this program runs, and also convert those .wav files to .mp3. When I run this, it only creates an output.wav and output0.mp3 file, and when I run it again, no further files are created. Also, the output0.mp3 that was converted is 0 KB and cannot be played.
I do not get an error, but it seems it's not grabbing the output.wav that was originally created. I am running Python 3.7.
import os
import sounddevice as sd
from scipy.io.wavfile import write
from pydub import AudioSegment  # for converting WAV to MP3

fs = 44100       # Sample rate
seconds = 3      # Duration of recording

myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=2)
sd.wait()  # Wait until recording is finished
write('output.wav', fs, myrecording)  # Save as WAV file

# Increments file name by 1 so it writes a new file every time it runs
i = 0
while os.path.exists("output%s.wav" % i):
    i += 1

# files for converting WAV to MP3
src = "output%s.wav" % i
dst = "output%s.mp3" % i

# convert wav to mp3
sound = AudioSegment.from_mp3(src)
sound.export(dst, format="wav")
writefile = open("output%s.mp3" % i, "w")
EDIT:
Updated while loop to:
# Increments file name by 1 so it writes a new file every time it runs
i = 0
while os.path.exists("output%s.wav" % i):
    # files for converting WAV to MP3
    src = "output%s.wav" % i
    dst = "output%s.mp3" % i
    # convert wav to mp3
    sound = AudioSegment.from_mp3(src)
    sound.export(dst, format="wav")
    write("output%s.mp3" % i, "w")
    i += 1
"create a new file recording every time this program runs " - To what I understand you just need to check for existing files and get a counter to reach +1 then the last file. Once you get that just create/convert file based on that.
I am not familiar with working of sound module, but in general below should be the code structure.
## This will create new recording file called output.wav
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=2)
sd.wait()  # Wait until recording is finished
write('output.wav', fs, myrecording)  # Save as WAV file

# Get the counter to reach the last file that has been created.
# For e.g. if last file generated was output5.wav then below loop will run 5 times
# and should change the value of i = 6.
i = 0
while os.path.exists("output%s.wav" % i):
    i += 1

# Code for creating new file using value of 'i'
# Below code is outside of while loop and will run only once,
# as 1 file needs to be created per run of the program.
src = "output.wav"
dst = "output%s.mp3" % i

# convert wav to mp3
sound = AudioSegment.from_mp3(src)
sound.export(dst, format="wav")

# Not sure if this is needed.
# Check working of sound module and what does sound.export do
writefile = open("output%s.mp3" % i, "w")
SOLUTION: Updated my while loop and changed the conversion method
import os
import time
from pathlib import Path

import pydub
import sounddevice as sd
from scipy.io.wavfile import write

i = 0
while not os.path.exists("output.wav"):
    i += 1

fs = 44100       # Sample rate
seconds = 3      # Duration of recording

myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=2)
sd.wait()  # Wait until recording is finished
write('output{0}.wav'.format(i), fs, myrecording)  # Save as WAV file
print("recording has finished")

datadir = str(Path(r"FilePathtoFolderWhereAudioFileIs"))
filetopen = datadir + "/" + 'output{0}.wav'.format(i)
sound = pydub.AudioSegment.from_wav(r"FilePathtoFolderWhereAudioFileIs" + "\\output{0}.wav".format(i))
sound.export(r"FilePathtoFolderWhereAudioFileIs" + "\\output{0}.mp3".format(i), format="mp3")
print("Converted wav to mp3")
time.sleep(3)

Downloaded GZ files are showing 0 bytes

I am using the OCI Python SDK, and when I try to download an object (from an OCI bucket) in GZ format, it gets downloaded but the file size is zero bytes. The code is attached.
Any help is much appreciated.
import os
import oci
import io
import sys

reporting_namespace = 'xygabcdef'
prefix_file = "abc/xyz"

# Update these values
destination_path = 'downloaded_reports'

# Make a directory to receive reports
if not os.path.exists(destination_path):
    os.mkdir(destination_path)

# Get the list of reports
config = oci.config.from_file(oci.config.DEFAULT_LOCATION, oci.config.DEFAULT_PROFILE)
reporting_bucket = sys.argv[1]
object_storage = oci.object_storage.ObjectStorageClient(config)
report_bucket_objects = object_storage.list_objects(reporting_namespace, reporting_bucket, prefix=prefix_file)

# def download_audit():
for o in report_bucket_objects.data.objects:
    print('Found file ' + o.name)
    object_details = object_storage.get_object(reporting_namespace, reporting_bucket, o.name)
    print(object_details)
    filename = o.name.rsplit('/', 1)[-1]
    with open(destination_path + '/' + filename, 'wb') as f:
        for chunk in object_details.data.raw.stream(1024 * 1024, decode_content=False):
            f.write(chunk)
Please see the example here. Does this work for you? Namely:
get_obj = object_storage.get_object(namespace, bucket_name, example_file_object_name)
with open('example_file_retrieved', 'wb') as f:
    for chunk in get_obj.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
In your example destintation_path seems to be undefined, and seems to have a typo (destintation -> destination). Could this be the problem?
Lastly, what does object_details report the file size / content-length as? It could be that the file size of the object in Object Storage is itself 0 bytes.
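For what it's worth, a quick way to inspect that is the response headers on object_details (assuming the SDK exposes them as a dict-like headers attribute on the response, which is how the example above uses it):

# Print what Object Storage reports for the object itself;
# a Content-Length of 0 would mean the stored object is empty.
print(object_details.headers.get('Content-Length'))
print(object_details.headers.get('Content-Type'))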
The .content from the .data of get_object should give you the file data (binary or text/json/...), so here is a modified version of your code:
import os
import sys

import oci

reporting_namespace = 'xygabcdef'
prefix_file = "abc/xyz"

# Update these values
destination_path = 'downloaded_reports'

# Get the list of reports
config = oci.config.from_file(oci.config.DEFAULT_LOCATION, oci.config.DEFAULT_PROFILE)
reporting_bucket = sys.argv[1]
object_storage = oci.object_storage.ObjectStorageClient(config)
objects = object_storage.list_objects(reporting_namespace, reporting_bucket, prefix=prefix_file).data

# def download_audit():
# .objects holds the list of ObjectSummary entries on the ListObjects response
for obj in objects.objects:
    print('Found file ' + obj.name)
    object_response = object_storage.get_object(reporting_namespace, reporting_bucket, obj.name).data
    print(object_response)

    file_path = os.path.join(destination_path, obj.name)
    # Make sure parent dirs up to the file level are created
    os.makedirs(os.path.dirname(file_path), exist_ok=True)

    with open(file_path, 'wb') as file:
        file.write(object_response.content)

I want to split a zip file with Python and then join the split files together. I found this code, but I cannot join the split files.

Thank you to @Jeronimo
Split a zip archive into multiple chunks
outfile = archive_name
packet_size = int(1.5 * 1024**3)  # bytes

with open(outfile, "rb") as output:
    filecount = 0
    while True:
        data = output.read(packet_size)
        print(len(data))
        if not data:
            break  # we're done
        with open("{}{:03}".format(outfile, filecount), "wb") as packet:
            packet.write(data)
        filecount += 1
After splitting them I cannot join them together.
Fortunately, I solved this problem myself:
outfile = "archive_name"
packet_size = int(1024*1024*100) # bytes
filenumbers=9 #number of files you want to join
for i in range(filenumbers):
with open("{}.zip{:03}".format(outfile, i), "rb") as packet:
col=packet.read(packet_size)
with open("{}02.zip".format(outfile), "ab+") as mainpackage:
mainpackage.write(col)
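If you'd rather not hard-code the number of parts or rely on each part fitting in a single read, a slightly more defensive sketch of the same idea (same .zip000/.zip001 naming as above; the output file name is just an example):

import glob

outfile = "archive_name"
# discover archive_name.zip000, archive_name.zip001, ... automatically
parts = sorted(glob.glob(outfile + ".zip[0-9][0-9][0-9]"))
with open(outfile + "_joined.zip", "wb") as mainpackage:
    for part in parts:
        with open(part, "rb") as packet:
            while True:
                chunk = packet.read(1024 * 1024)
                if not chunk:
                    break
                mainpackage.write(chunk)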

How to list files inside tar in AWS S3 without downloading it?

While looking around for ideas I found https://stackoverflow.com/a/54222447/264822 for zip files which I think is a very clever solution. But it relies on zip files having a Central Directory - tar files don't.
I thought I could follow the same general principle and expose the S3 file to tarfile through the fileobj parameter:
import boto3
import io
import tarfile

class S3File(io.BytesIO):
    def __init__(self, bucket_name, key_name, s3client):
        super().__init__()
        self.bucket_name = bucket_name
        self.key_name = key_name
        self.s3client = s3client
        self.offset = 0

    def close(self):
        return

    def read(self, size):
        print('read: offset = {}, size = {}'.format(self.offset, size))
        start = self.offset
        end = self.offset + size - 1
        try:
            s3_object = self.s3client.get_object(Bucket=self.bucket_name, Key=self.key_name, Range="bytes=%d-%d" % (start, end))
        except:
            return bytearray()
        self.offset = self.offset + size
        result = s3_object['Body'].read()
        return result

    def seek(self, offset, whence=0):
        if whence == 0:
            print('seek: offset {} -> {}'.format(self.offset, offset))
            self.offset = offset

    def tell(self):
        return self.offset

s3file = S3File(bucket_name, file_name, s3client)
tarf = tarfile.open(fileobj=s3file)
names = tarf.getnames()
for name in names:
    print(name)
This works fine except the output looks like:
read: offset = 0, size = 2
read: offset = 2, size = 8
read: offset = 10, size = 8192
read: offset = 8202, size = 1235
read: offset = 9437, size = 1563
read: offset = 11000, size = 3286
read: offset = 14286, size = 519
read: offset = 14805, size = 625
read: offset = 15430, size = 1128
read: offset = 16558, size = 519
read: offset = 17077, size = 573
read: offset = 17650, size = 620
(continued)
tarfile is just reading the whole file anyway, so I haven't gained anything. Is there any way of making tarfile only read the parts of the file it needs? The only alternative I can think of is re-implementing the tar file parsing so that it:
Reads the 512 bytes header and writes this into a BytesIO buffer.
Gets the size of the file following and writes zeroes into the BytesIO buffer.
Skips over the file to the next header.
But this seems overly complicated.
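For reference, a rough sketch of that header-walking idea for a plain (uncompressed) tar; read_range(start, size) is a hypothetical helper that issues an S3 ranged GET (like the read() method above), and pax/GNU long-name extension records are ignored:

def list_tar_names_via_ranges(read_range):
    """Walk the 512-byte tar headers only, skipping over the file data."""
    names = []
    offset = 0
    while True:
        header = read_range(offset, 512)
        if len(header) < 512 or header == b"\0" * 512:
            break  # end-of-archive marker or truncated read
        name = header[0:100].rstrip(b"\0").decode("utf-8", "replace")
        size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)  # size field is octal
        names.append(name)
        # jump past the member data, rounded up to the next 512-byte block
        offset += 512 + ((size + 511) // 512) * 512
    return names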
My mistake. I'm actually dealing with tar.gz files but I assumed that zip and tar.gz are similar. They're not - tar is an archive file which is then compressed as gzip, so to read the tar you have to decompress it first. My idea of pulling bits out of the tar file won't work.
What does work is:
s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
wholefile = s3_object['Body'].read()
fileobj = io.BytesIO(wholefile)
tarf = tarfile.open(fileobj=fileobj)
names = tarf.getnames()
for name in names:
    print(name)
I suspect the original code will work for a tar file but I don't have any to try it on.
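As an aside, if holding the whole archive in memory is a concern, tarfile's streaming mode can read straight from the boto3 body, since it only needs a forward-only read(); the object is still downloaded in full, just not buffered all at once. A minimal sketch:

import tarfile

s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
# "r|gz" treats the body as a non-seekable stream and decompresses on the fly
with tarfile.open(fileobj=s3_object['Body'], mode='r|gz') as tarf:
    for name in tarf.getnames():
        print(name)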
I just tested your original code on a tar file and it works quite well.
Here is my sample output (truncated). I made some minor changes to display the total downloaded bytes and the seek step size in kB (published at this gist). This is for a 1 GB tar file containing 321 files (average size per file is 3 MB):
read: offset = 0, size = 2, total download = 2
seek: offset 2 -> 0 (diff = -1 kB)
read: offset = 0, size = 8192, total download = 8194
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 8192, total download = 16386
seek: offset 8192 -> 0 (diff = -9 kB)
read: offset = 0, size = 512, total download = 16898
<TarInfo 'yt.txt' at 0x7fbbed639ef0>
seek: offset 512 -> 7167 (diff = 6 kB)
read: offset = 7167, size = 1, total download = 16899
read: offset = 7168, size = 512, total download = 17411
<TarInfo 'yt_cache/youtube-sigfuncs' at 0x7fbbed639e20>
read: offset = 7680, size = 512, total download = 17923
...
<TarInfo 'yt_vids/whistle_dolphins-SZTC_zT9ijg.m4a' at 0x7fbbecc697a0>
seek: offset 1004473856 -> 1005401599 (diff = 927 kB)
read: offset = 1005401599, size = 1, total download = 211778
read: offset = 1005401600, size = 512, total download = 212290
None
322
So this downloads 212 kB for a 1GB tar file in order to get a list of 321 filenames in about 2 minutes on colab and 1.5 minutes on ec2 in the same region as the bucket.
In comparison, it takes 17 seconds to download the full file on colab and 1 second to list the files in it with tar -tf file.tar. So if I'm optimizing on execution time, I'd rather just download the full file and work on it locally. Otherwise, there might be some optimization that could be done in your original code? IDK.
OTOH, fetching a single file is more efficient than the above 2 minutes if it's at the beginning of the tar, but as slow as getting all file names if it's at the end. But I couldn't do that with the getmember() function because it seems that it internally calls getmembers(), which has to go through the full file. Instead, I rolled my own while loop to find the file and abort the search once found:
bucket_name, file_name = "bucket", "file.tar"

import boto3
s3client = boto3.client("s3")
s3file = S3File(bucket_name, file_name, s3client)

import tarfile
with tarfile.open(mode="r", fileobj=s3file) as tarf:
    tarinfo = 1  # dummy
    while tarinfo is not None:
        tarinfo = tarf.next()
        # next() returns None at end of archive, so guard before comparing names
        if tarinfo is not None and tarinfo.name == name_search:
            break
I think a future direction for this would be to have the tarfile.open(...) cache the offsets of each file so that a subsequent open doesn't go through the full file again. Once that's done, a first pass through the tar file will allow downloading individual files from the tar in s3 without going through the full file again and again for each file.
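For a plain (uncompressed) tar, that cache can already be approximated by hand: collect the TarInfo objects once, then fetch any single member with one ranged GET using its offset_data and size attributes (the fields tarfile tracks for where a member's bytes start). A sketch, where members is assumed to come from a first pass such as tarf.getmembers(), and fetch_member is a hypothetical helper:

# Map each member name to (start of its data, its size) inside the tar.
offsets = {info.name: (info.offset_data, info.size) for info in members}

def fetch_member(s3client, bucket_name, file_name, member_name):
    start, size = offsets[member_name]
    resp = s3client.get_object(
        Bucket=bucket_name,
        Key=file_name,
        Range="bytes={}-{}".format(start, start + size - 1),
    )
    return resp['Body'].read()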
Side note, couldn't you have just run gunzip on the tar.gz to get the tar to test on?

python 3.X concatenate zipped csv files to one non-zipped csv file

here is my python 3 code:
import zipfile
import os
import time
from timeit import default_timer as timer
import re
import glob
import pandas as pd

# local variables
# pc version
# the_dir = r'c:\ImpExpData'
# linux version
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95'


def main():
    """
    this is the function that controls the processing
    """
    start_time = timer()
    for root, dirs, files in os.walk(the_dir):
        for file in files:
            if file.endswith(".zip"):
                print("working dir is ...", the_dir)
                zipPath = os.path.join(root, file)
                z = zipfile.ZipFile(zipPath, "r")
                for filename in z.namelist():
                    if filename.endswith(".csv"):
                        # print filename
                        if re.match(r'^Trade-Geo.*\.csv$', filename):
                            pass  # do something with geo file
                            # print " Geo data: " , filename
                        elif re.match(r'^Trade-Metadata.*\.csv$', filename):
                            pass  # do something with metadata file
                            # print "Metadata: ", filename
                        else:
                            try:
                                with zipfile.ZipFile(zipPath) as z:
                                    with z.open(filename) as f:
                                        # print("send to test def...", filename)
                                        # print(zipPath)
                                        with zipfile.ZipFile(zipPath) as z:
                                            with z.open(filename) as f:
                                                frame = pd.DataFrame()
                                                # EmptyDataError: No columns to parse from file -- how to deal with this error
                                                train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252")
                                                # train_df = pd.read_csv(f, header=0, skiprows=1, delimiter=",", encoding="cp1252")
                                                list_ = []
                                                list_.append(train_df)
                                                # print(list_)
                                                frame = pd.concat(list_, ignore_index=True)
                                                frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252')  # works
                            except:  # catches EmptyDataError: No columns to parse from file
                                print("EmptyDataError....", filename, "...", zipPath)
    # GetSubDirList(the_dir)
    end_time = timer()
    print("Elapsed time was %g seconds" % (end_time - start_time))


if __name__ == '__main__':
    main()
It mostly works -- only it does not concatenate all the zipped csv files into one. There is one empty file, and all csv files have the same field structure, with the csv files varying in number of rows.
Here is what Spyder reports when I run it:
runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb')
working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95
EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip
/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
execfile(filename, namespace)
Elapsed time was 104.857 seconds
The final csv file is the last zipped csv file processed; the csv file changes in size as it processes the files.
There are 99 csv files in the zipped file that I wish to concat into one non-zipped csv file.
The field or column names are:
colmNames = ["hs_code", "uom", "country", "state", "prov", "value", "quatity", "year", "month"]
The csv files are labeled chp01.csv, chp02.csv, etc. up to chp99.csv, with the "uom" (unit of measure) being either empty, an integer, or a string depending on the hs_code.
Question: how do I get the zipped csv files concatenated into one large (estimated 100 MB uncompressed) csv file?
Added details:
I am trying not to unzip the csv files, as I would then have to go and delete them. I need to concat the files because I have additional processing to do. Extracting the zipped csv files is a viable option; I was just hoping to avoid it.
Is there any reason you don't want to do this with your shell?
Assuming the order in which you concatenate is irrelevant:
cd "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95"
unzip "Trade-Exports-Yr1992-1995.zip" -d unzipped && cd unzipped
for f in Trade-Exports-Chp*.csv; do tail --lines=+2 "$f" >> concat.csv; done
This removes the first line (column names) from each csv file before appending to concat.csv.
If you just did:
tail --lines=+2 "Trade-Exports-Chp*.csv" > concat.csv
You'd end up with:
==> Trade-Exports-Chp-1.csv <==
...
==> Trade-Exports-Chp-10.csv <==
...
==> Trade-Exports-Chp-2.csv <==
...
etc.
If you care about the order, change Trade-Exports-Chp-1.csv .. Trade-Exports-Chp-9.csv to Trade-Exports-Chp-01.csv .. Trade-Exports-Chp-09.csv.
Although it's doable in Python I don't think it's the right tool for the job in this case.
If you want to do the job in place without actually extracting the zip file:
for i in {1..99}; do
    unzip -p "Trade-Exports-Yr1992-1995.zip" "Trade-Exports-Chp$i.csv" | tail --lines=+2 >> concat.csv
done
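If you do want to stay in Python and avoid extracting to disk, a minimal sketch of the same idea is below. It reads each chapter csv straight out of the zip and appends it to one output file, keeping only the first header line (the paths and the Trade-Exports-Chp name pattern are taken from the question; if each member really has an extra title row, as skiprows=1 in the question suggests, drop one more readline() per member):

import re
import zipfile

zip_path = "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip"
out_path = "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv"

with zipfile.ZipFile(zip_path) as z, open(out_path, "wb") as out:
    wrote_header = False
    for name in sorted(z.namelist()):
        if not re.match(r'^Trade-Exports-Chp.*\.csv$', name):
            continue  # skip the geo and metadata files
        with z.open(name) as f:
            header = f.readline()      # first line of this member
            if not wrote_header:
                out.write(header)      # keep the header only once
                wrote_header = True
            out.write(f.read())        # append the remaining rows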
