Python: Access a zipped XL file without extracting it - python-3.x

Is there a way to open and process an Excel file inside a zip file without first extracting it? I am not interested in modifying it.
from zipfile import ZipFile
from openpyxl import load_workbook
procFile ="C:\\Temp2\\XLFile-Demo-PW123.zip"
xl_file = "XLFile-Demo.xlsx"
myzip = ZipFile(procFile)
myzip.setpassword(bytes('123', 'utf-8'))
# line below returns an error
with load_workbook(myzip.open(xl_file)) as wb_obj:
    print(wb_obj.sheetnames)
Most of the examples that perform this only directly open text files.
I would like to simulate the behaviour of archiving programs such as WinRar and 7zip.
Thanks
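One possible approach, offered as a sketch under assumptions rather than a confirmed answer: read the archive member into memory with zipfile and wrap the bytes in a BytesIO, which behaves like a seekable file. The helper name read_member_to_buffer is hypothetical, and the demo builds a small unencrypted archive in memory so that it runs without the files from the question. Note that zipfile can only decrypt the legacy ZIP encryption, not AES-encrypted archives.

```python
import io
from zipfile import ZipFile


def read_member_to_buffer(zip_source, member, password=None):
    """Return one archive member as a seekable in-memory file.

    zip_source may be a path or a file-like object; password is bytes or None.
    Nothing is written to disk.
    """
    with ZipFile(zip_source) as archive:
        data = archive.read(member, pwd=password)
    return io.BytesIO(data)


# Demo with a throwaway in-memory archive (stand-in for the real zip file).
demo_zip = io.BytesIO()
with ZipFile(demo_zip, "w") as archive:
    archive.writestr("XLFile-Demo.xlsx", b"placeholder bytes, not a real workbook")

demo_zip.seek(0)
buffer = read_member_to_buffer(demo_zip, "XLFile-Demo.xlsx")
print(len(buffer.getvalue()))
```

With openpyxl installed, the buffer could then be passed along as load_workbook(buffer, read_only=True). As far as I know, load_workbook returns a Workbook that is not a context manager, so it would be called as a plain assignment rather than inside a with statement, which is a likely cause of the error in the question.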

Related

How to work with XlsxWriter without saving the file in the disk

I want to create an Excel file from some data I have. Once it is ready, I want to send it using a Python Telegram bot and get rid of the file.
Ideally, the file would be created from scratch and held in a variable, then sent with the python telegram bot module, without ever being saved to disk.
import xlsxwriter as xs
workbook = xs.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write('A1', 'Hello')
workbook.close()
After the write command I don't see any file created in the folder, but I don't know whether the file simply does not exist yet or is waiting to be written when the workbook is closed.
How can I, without saving it, do
bot.send_file(my_xlsx,chat_id=1111111)
One way to avoid saving the file to disk is to write it to a BytesIO object. BytesIO is an in-memory buffer, so the Excel data is stored in memory:
import xlsxwriter as xs
from io import BytesIO
excel_io = BytesIO()
workbook = xs.Workbook(excel_io)
worksheet = workbook.add_worksheet()
worksheet.write('A1', 'Hello')
workbook.close()
You can get the stored data with the getvalue() method. More info about BytesIO (and the io library in general) here:
https://docs.python.org/3/library/io.html
https://www.journaldev.com/19178/python-io-bytesio-stringio
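A small standard-library sketch of how the buffer behaves once the workbook has been closed. The bytes written here are only a stand-in for what xlsxwriter would produce, and whether the bot's send function wants raw bytes or a file-like object is an assumption to check against that library's documentation.

```python
from io import BytesIO

excel_io = BytesIO()
# Stand-in for workbook.close(), which is what actually writes into the buffer.
excel_io.write(b"PK\x03\x04 fake xlsx payload")

# getvalue() returns the whole contents regardless of the current position.
payload = excel_io.getvalue()

# After writing, the position sits at the end, so read() returns nothing...
at_end = excel_io.read()

# ...until the buffer is rewound. Do this before handing the buffer to an
# API that expects a file-like object.
excel_io.seek(0)
rewound = excel_io.read()

print(len(payload), len(at_end), len(rewound))
```

If the receiving API takes bytes, getvalue() is enough; if it takes a file-like object, rewinding with seek(0) first is usually what is needed.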

Download multiple Dropbox zip files from csv file

I have a .csv file that contains ~100 links to Dropbox files. My current method downloads the files, but the saved names are missing the ?dl=0 suffix, which seems to be critical.
#import packages
import pandas as pd
import wget
#read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    wget.download(filename)
Output:
https://www.dropbox.com/s/xjtu071g7o6gimg/metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0
https://www.dropbox.com/s/9oc9j8zhd4mn113/metal_roi_volume_dec12_2018_pheno2.txt.zip?dl=0
https://www.dropbox.com/s/0jkdrb76i7rixa5/metal_roi_volume_dec12_2018_pheno3.txt.zip?dl=0
https://www.dropbox.com/s/gu5p46bakgvozs5/metal_roi_volume_dec12_2018_pheno4.txt.zip?dl=0
https://www.dropbox.com/s/8zfpfscp8kdwu3h/metal_roi_volume_dec12_2018_pheno5.txt.zip?dl=0
These look like the correct links, but the downloaded files are named metal_roi_volume_dec12_2018_pheno1.txt.zip instead of metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0, so I cannot unzip them. Any ideas how to download the actual Dropbox files?
By default (without extra URL parameters, or with dl=0 like in your example), Dropbox shared links point to an HTML preview page for the linked file, not the file data itself. Your code as-is will download the HTML, not the actual zip file data.
You can modify these links for direct file access though, as documented in this Dropbox help center article.
So, you should modify the link, e.g., to use raw=1 instead of dl=0, before calling wget.download on it.
Quick fix would be something like:
#import packages
import pandas as pd
import wget
import os
from urllib.parse import urlparse
#read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    parsed = urlparse(filename)
    fname = os.path.basename(parsed.path)
    wget.download(filename, fname)
Basically, you extract filename from the URL and then use that filename as the output param in the wget.download fn.
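The link rewrite described earlier (raw=1 instead of dl=0) can be combined with the file name extraction. The following is a minimal sketch using only urllib.parse; the helper names to_direct_link and local_name are hypothetical, and no download is performed here.

```python
import os
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def to_direct_link(shared_url, param="raw"):
    """Rewrite a Dropbox shared link so it points at the file data.

    Drops any existing dl/raw parameter and appends param=1.
    """
    parts = urlparse(shared_url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key not in ("dl", "raw")]
    query.append((param, "1"))
    return urlunparse(parts._replace(query=urlencode(query)))


def local_name(shared_url):
    """File name taken from the URL path, without the query string."""
    return os.path.basename(urlparse(shared_url).path)


url = "https://www.dropbox.com/s/xjtu071g7o6gimg/metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0"
print(to_direct_link(url))
print(local_name(url))
```

Inside the loop, the call would then look something like wget.download(to_direct_link(row['Links']), local_name(row['Links'])).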

read_csv one file from several files in a gzip?

I have several files in a tar.gz archive. I want to read only one of them into a pandas data frame. Is there any way to do that?
Pandas can read a file inside a gz, but there seems to be no way to tell it to read one specific file when the archive contains several.
Would appreciate any thoughts.
Babak
To read a specific file from a compressed archive, we just need to give its name or position. For example, to read a specific csv file from a zip archive, we can open that member and read its content:
from zipfile import ZipFile
import pandas as pd
# opening the zip file in READ mode
with ZipFile("results.zip") as z:
    read = pd.read_csv(z.open(z.infolist()[2].filename))
print(read)
Here the contents of results.zip look like this, and I want to read test.csv (the third entry, index 2):
$ data_description.txt sample_submission.csv test.csv train.csv
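Since the question is about a tar.gz archive rather than a zip, here is a comparable sketch using the standard-library tarfile module. The archive contents and member names are made up for the demo. extractfile returns a binary file object, which pandas.read_csv should accept directly; the demo parses it with the csv module only to stay dependency-free.

```python
import csv
import io
import tarfile

# Build a small tar.gz in memory (stand-in for the real archive).
archive_bytes = io.BytesIO()
with tarfile.open(fileobj=archive_bytes, mode="w:gz") as tar:
    for name, text in [("train.csv", "a,b\n1,2\n"), ("test.csv", "a,b\n3,4\n")]:
        data = text.encode("utf-8")
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

archive_bytes.seek(0)
with tarfile.open(fileobj=archive_bytes, mode="r:gz") as tar:
    names = tar.getnames()
    # Select one member by name; the result is a binary file object.
    member = tar.extractfile("test.csv")
    text_stream = io.TextIOWrapper(member, encoding="utf-8", newline="")
    rows = list(csv.reader(text_stream))

print(names)
print(rows)
```

With pandas, the csv.reader step would be replaced by something like pd.read_csv(tar.extractfile("test.csv")), called while the archive is still open.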
If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.

Camelot-py does not work in loops but works for an individual file

I am currently working on an automation project for a company, and one of the tasks requires that I loop through a directory and convert each pdf file into a CSV file. I am using the camelot-py library (which has been better than the others I have tried). When I apply the code below to a single file, it works just fine; however, I want it to loop through all pdf files in the directory. I get the following error with the code below:
"OSError: [Errno 22] Invalid argument"
import camelot
import csv
import pandas as pd
import os
directoryPath = r'Z:\testDirectory'
os.chdir(directoryPath)
print(os.listdir())
folderList = os.listdir(directoryPath)
for folders, sub_folders, file in os.walk(directoryPath):
    for name in file:
        if name.endswith(".pdf"):
            filename = os.path.join(folders,name)
            print(filename)
            print(name)
            tables = camelot.read_pdf(filename, flavor = 'stream', columns= ['72,73,150,327,442,520,566,606,683'])
            tables = tables[0].df
            print(tables[0].parsing_report)
            tables.to_csv('foo2.csv')
I expect all files to be converted to '.csv' files but I get the error 'OSError: [Errno 22] Invalid argument'. My error appears to be from line 16.
I don't know if you have the same problem, but in my case I had made the simple mistake of not putting the files in the correct directory. I was getting the same error, and once I fixed that, the script worked within a regular for loop.
Instead of the to_* methods (such as to_csv), I am using the bulk export to write the results to SQL, but that should not make a difference.
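Independent of the cause of the OSError, the loop in the question writes every result to the same foo2.csv, so later files would overwrite earlier ones. Below is a minimal sketch of walking a directory and deriving one output path per PDF. The helper pdf_to_csv_paths is hypothetical, the Camelot call is left as a comment, and the demo runs on a temporary directory with empty placeholder files.

```python
import os
import tempfile


def pdf_to_csv_paths(root):
    """Yield (pdf_path, csv_path) pairs for every .pdf file below root."""
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                pdf_path = os.path.join(folder, name)
                csv_path = os.path.splitext(pdf_path)[0] + ".csv"
                yield pdf_path, csv_path


# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.pdf", "b.PDF", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    pairs = [(os.path.basename(pdf), os.path.basename(out))
             for pdf, out in pdf_to_csv_paths(root)]

print(pairs)
# In the real loop, each pair would be used roughly as:
#     tables = camelot.read_pdf(pdf_path, flavor='stream', ...)
#     tables[0].df.to_csv(csv_path)
```

Building full paths with os.path.join also means the script does not depend on os.chdir, which makes a wrong working directory easier to spot.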

Watson Data Platform how to unzip the zip file in the data assets

How to unzip the zip file in the data assets of the Watson Data Platform?
from io import BytesIO
import zipfile
zip_ref = zipfile.ZipFile(BytesIO(streaming_body_1.read()), 'r')
zip_ref.extractall(WHICH DIRECTORY FOR THE DATA ASSETS)
zip_ref.close()
streaming_body_1 is the zip file streaming body object in the DATA ASSETS section. I uploaded the zip file to the DATA ASSETS.
How can I unzip the zip file in the Data Assets? I don't know the exact key path of the DATA ASSETS section.
I am trying to do this in the jupyter notebook of the project.
Thank you!
When you upload a file to your project, it is stored in the project's assigned cloud storage, which should now be Cloud Object Storage by default. (Check your project settings.) To work with uploaded files (which are just one type of data asset; there are others) in a notebook, you first have to download them from the cloud storage so that they are accessible in the kernel's file system, and then perform the desired file operation (e.g. read, extract, ...).
Assuming you've uploaded your ZIP file you should be able to generate code that reads the ZIP file using the tooling:
click the 1010 (Data icon) on the upper right hand side
select "Insert to code" > "Insert StreamingBody object"
consume the StreamingBody as desired
I ran a quick test and it worked like a charm:
...
# "Insert StreamingBody object" generated code
...
from io import BytesIO
import zipfile
zip_ref = zipfile.ZipFile(BytesIO(streaming_body_1.read()), 'r')
print(zip_ref.namelist())
zip_ref.close()
Edit 1: If your archive is a compressed tar file use the following code instead:
...
# "Insert StreamingBody object" generated code
...
import tarfile
from io import BytesIO
tf = tarfile.open(fileobj=BytesIO(streaming_body_1.read()), mode="r:gz")
tf.getnames()
Edit 2: To avoid the read timeout you'll have to change the generated code from
config=Config(signature_version='oauth'),
to
config=Config(signature_version='oauth',connect_timeout=50, read_timeout=70),
With those changes in place I was able to download and extract training_data.tar.gz from the repo you've mentioned.
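The original question also asked which directory extractall can write to. A hedged sketch for the ZIP case: the notebook kernel has a local file system, so a temporary directory created with tempfile is one option. The StreamingBody is replaced here by an in-memory stand-in with the same read() method; the variable name mirrors the generated code, but the archive contents are made up.

```python
import io
import os
import tempfile
import zipfile

# Stand-in for the generated StreamingBody: any object with .read() works here.
_raw = io.BytesIO()
with zipfile.ZipFile(_raw, "w") as zf:
    zf.writestr("data/sample.csv", "x,y\n1,2\n")
streaming_body_1 = io.BytesIO(_raw.getvalue())

# Extract into a directory on the kernel's local file system.
target_dir = tempfile.mkdtemp(prefix="assets_")
with zipfile.ZipFile(io.BytesIO(streaming_body_1.read()), "r") as zip_ref:
    zip_ref.extractall(target_dir)

# List what was extracted, relative to the target directory.
extracted = sorted(
    os.path.relpath(os.path.join(folder, name), target_dir)
    for folder, _dirs, files in os.walk(target_dir)
    for name in files
)
print(target_dir)
print(extracted)
```

Files extracted this way live in the kernel's file system, not in the project's data assets, so they would presumably need to be uploaded again if they should persist beyond the session.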

Resources