Extracting a particular file from a zipfile using Python - python-3.x

I have a list of 3 million html files in a zipfile. I would like to extract ~4000 html files from the entire list of files. Is there a way to extract a specific file without unzipping the entire zipfile using Python?
Any leads would be appreciated! Thanks in advance.
Edit: My bad, I should have elaborated on the question. I have a list of all the html filenames that need to be extracted, but they are spread out over 12 zipfiles. How do I iterate through each zipfile, extract the matching html files, and get the final list of extracted html files?

Let's say you wish to extract all the html files, then you can try this out. If you have the list of all the file names to be extracted, it only requires the slight modification shown below.
from zipfile import ZipFile

listOfZipFiles = ['sample1.zip', 'sample2.zip', 'sample3.zip', ..., 'sample12.zip']
fileNamesToBeExtracted = ['file1.html', 'file2.html', ..., 'filen.html']

# Open each zip archive in turn
for zipFileName in listOfZipFiles:
    with ZipFile(zipFileName, 'r') as zipObj:
        # Get a list of all archived file names from the zip
        listOfFileNames = zipObj.namelist()
        # Iterate over the file names
        for fileName in listOfFileNames:
            # Check if this file is one of the files to be extracted
            if fileName in fileNamesToBeExtracted:
                # Extract a single file from the zip without unpacking the rest
                zipObj.extract(fileName)
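With millions of archive entries and only ~4000 wanted files, membership checks against a list get slow; converting the wanted names to a set is much faster. A minimal sketch of that variant, assuming the same listOfZipFiles and fileNamesToBeExtracted as above and a hypothetical extracted_html output folder:

from zipfile import ZipFile

wantedFiles = set(fileNamesToBeExtracted)  # set membership checks are O(1) per name

for zipFileName in listOfZipFiles:
    with ZipFile(zipFileName, 'r') as zipObj:
        # intersect the archive's member names with the wanted names, then extract only those
        for fileName in set(zipObj.namelist()) & wantedFiles:
            zipObj.extract(fileName, path='extracted_html')  # 'extracted_html' is an assumed output folder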

Related

Read and Write xlsx file, from pandas dataframe to specific directory

I have a function in Python that basically merges three txt files into one file, in xlsx format.
For that I use the pandas package.
I am running the Python function in a certain directory. The function takes a specific path as input, lists the files of that directory, and filters the files that are needed. Since I only want to read the txt files, I filter for those. However, when I try to convert these txt files into a pandas dataframe, the dataframe is None.
Also, I want to write a final xlsx to the directory where the initial files are.
Here is my function:
def concat_files(path):
    summary = ''
    files_separate = []
    arr2 = os.listdir(mypath)
    for i, items_list in enumerate(arr2):
        if len(items_list) > 50:
            files_separate.append(items_list)
    files_separate
    chunks = [files_separate[x:x+3] for x in range(0, len(files_separate), 3)]
    while chunks:
        focus = chunks.pop(0)
        for items_1 in focus:
            if items_1.endswith('.Cox1.fastq.fasta.usearch_cluster_fast.fasta.reps.fasta.blastn.report.txt.all_together.txt'):
                pandas_dataframe = pd.Dataframe(example)
                pandas_dataframe.to_excel('destiny_path/' + str(header_file) + '.final.xlsx')
You need to create the folders before exporting the xlsx files.
So, assuming you already have the folders created, change this line
pandas_dataframe.to_excel('destiny_path/' + str(header_file) + '.final.xlsx')
to
pandas_dataframe.to_excel(os.path.join('destiny_path', str(header_file) + '.final.xlsx'))
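If the destination folder might not exist yet, it can be created first; a minimal sketch, assuming the same 'destiny_path' output directory used above:

import os

os.makedirs('destiny_path', exist_ok=True)  # create the output folder if it is missing
pandas_dataframe.to_excel(os.path.join('destiny_path', str(header_file) + '.final.xlsx'))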

Is there a way to give multiple file locations for renaming files?

I want a program to rename subtitle files to the same name as the movie file they belong to, where the movies are located in different folders and sub-folders.
What I have done:
Imported the os, re, and shutil modules.
Made a for loop to iterate through a directory and return the files/folders inside of a parent folder.
for foldername, subfoldername, filename in os.walk('E:\Movies'):
This loop will iterate through the E:\Movies folder and return a list of sub-folders and files.
To check whether a file is a subtitle file, inside of the for loop:
if filename.endswith('(.srt|.idx|.sub)'):
How do I give multiple paths and new names in the single second argument?
os.rename(filename,'')
Why do you want to give multiple paths and new names in the second argument?
The code you shared does the following:
Loop through the directory tree.
For each filename in the directory:
    If the file is a subtitle file:
        Rename the file to the movie file present in some directory
In the last step you are not renaming all the files at once; you are doing them one at a time.
os.rename(src, dest)
accepts two arguments only: the src filename and the dest filename.
So for your case you will have to loop again through all the files in the directory, match the name of the subtitle file with the movie file, and then rename the subtitle file.
Try something like:
for foldername, subfoldernames, filenames in os.walk(r'E:\Movies'):
    for filename in filenames:
        if filename.endswith(('.srt', '.idx', '.sub')):
            for folder2, subfolders2, movienames in os.walk(r'E:\Movies'):
                for moviename in movienames:
                    # We don't want to match the file with itself
                    if moviename != filename:
                        # You would have to think of your matching logic here.
                        # How would you know that this subtitle is of that particular movie?
                        # e.g. if the subtitle is of the form 'a_good_movie.srt' you can split on '_'
                        # and check whether all the words are present in the movie name.
                        pass  # your matching check and os.rename call go here
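As a concrete illustration of the matching comment above, here is a minimal, hypothetical helper (the name could_match and the split-on-underscore heuristic are only an example, not part of the original answer):

import os

def could_match(subtitle_name, movie_name):
    # Heuristic: every word of the subtitle's base name must appear in the movie name
    base = os.path.splitext(subtitle_name)[0]   # 'a_good_movie.srt' -> 'a_good_movie'
    words = base.lower().split('_')             # ['a', 'good', 'movie']
    return all(word in movie_name.lower() for word in words)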
Edit
After the clarifications in the comments, it seems you want to implement the following:
Loop through all folders in the directory:
    For each folder, rename all subtitle files to the folder name
You can do this in Python 3 like this :
for folder in next(os.walk(directory))[1]:
    for filename in next(os.walk(os.path.join(directory, folder)))[2]:
        if filename.endswith(('.srt', '.idx', '.sub')):
            extension = os.path.splitext(filename)[1]
            os.rename(os.path.join(directory, folder, filename),
                      os.path.join(directory, folder, folder + extension))
os.walk() returns a generator. You can access its values in Python 3 like this:
next(os.walk(r'C:\startdir'))[0] # returns 'C:\startdir'
next(os.walk(r'C:\startdir'))[1] # returns the list of directories in 'C:\startdir'
next(os.walk(r'C:\startdir'))[2] # returns the list of files in 'C:\startdir'
For Python 2 you can call os.walk(r'C:\startdir').next() and index it the same way, with the same return values.

read_csv one file from several files in a gzip?

I have several files in my tar.gz archive. I want to read only one of them into a pandas data frame. Is there any way to do that?
Pandas can read a file inside a gz, but there seems to be no way to tell it to read one specific file when there are several files inside the archive.
Would appreciate any thoughts.
Babak
To read a specific file in a compressed archive we just need to give its name or position. For example, to read a specific csv file in a zipped folder we can simply open that file and read its contents.
from zipfile import ZipFile
import pandas as pd

# opening the zip file in READ mode
with ZipFile("results.zip") as z:
    read = pd.read_csv(z.open(z.infolist()[2].filename))
    print(read)
Here is what the contents of results.zip look like; I want to read test.csv:
$ data_description.txt sample_submission.csv test.csv train.csv
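Instead of the position z.infolist()[2], the member can also be opened by name, e.g. z.open('test.csv'). And since the original question mentions a tar.gz rather than a zip, the same idea works with the tarfile module; a minimal sketch, assuming a hypothetical archive.tar.gz that contains test.csv:

import tarfile
import pandas as pd

# open the tar.gz archive and read just one member into a dataframe
with tarfile.open("archive.tar.gz", "r:gz") as tar:
    member = tar.extractfile("test.csv")  # file-like object for that member only
    df = pd.read_csv(member)
print(df.head())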
If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.

Could not append multiple image files in a PDF

I know there are answers regarding this question, but hear me out.
I am currently trying to make a PDF out of .jpg files using img2pdf in Python, but instead of appending the files to the PDF it overwrites the already existing pages of the PDF.
Here's the code
import os, img2pdf

os.chdir("/home/aditya/Desktop")  # images are inside Desktop
# files contains the list of names of all the .jpg files which I want to convert into a PDF
root, dir, files = list(os.walk(os.getcwd()))[0]

with open("pdf_file.pdf", "ab") as f:  # PDF file is set to append
    for img_file in files:
        with open(img_file, "rb") as im_file:  # read bytes from the image files
            # this line overwrites the existing pages in the pdf
            # despite the fact that I have set it to append
            f.write(img2pdf.convert(im_file))
Any reason for this? Is there a special attribute I need to pass?
Any help is appreciated. Thanks
img2pdf.convert does not append image by image; it converts either a single image or all of them at once:
img2pdf.convert(list_of_images)

with open("pdf_file.pdf", "wb") as f:  # write the pdf in one go ("ab" would only append raw bytes)
    f.write(img2pdf.convert(files))    # convert all the images at once and write the bytes
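The likely reason the append approach does not work is that concatenating independent PDF byte streams does not produce one valid multi-page document, so viewers typically show only one of them; converting the whole list in a single call avoids that. A minimal sketch, assuming the Desktop location from the question and filtering/sorting the .jpg files so the page order is predictable:

import os
import img2pdf

os.chdir("/home/aditya/Desktop")
# keep only the .jpg files and sort them so the pages come out in a stable order
jpg_files = sorted(f for f in os.listdir(".") if f.lower().endswith(".jpg"))

with open("pdf_file.pdf", "wb") as f:
    f.write(img2pdf.convert(jpg_files))  # one call, one multi-page PDF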

How to add the contents of a directory to a zip file?

How would I add the contents of an entire directory to an already existing zip file using Python? The directory to be added will also include other folders, and there will be duplicates in the zip file that will need to be overwritten. Any help would be appreciated. Thanks in advance!
P.S. If it is possible to zip the directory then combine both files that would also work.
Python's zipfile module allows you to manipulate ZIP compressed archives. The ZipFile.namelist() method returns a list of files in an archive, and the ZipFile.write() method lets you add files to the archive.
import os
import zipfile

z = zipfile.ZipFile('myfile.zip', 'a')  # 'a' lets you add files to an existing archive
The os.walk method allows you to iterate over all the files contained in a directory tree.
for root, dirs, files in os.walk('mydir'):
    for filename in files:
        z.write(os.path.join(root, filename))
Replacing a file in an archive appears to be tricky; you can remove items by creating a temporary archive and then replacing the original when you're done, as described in this question.
It might be easier just to call the zip command instead, but put these together and you should be able to get to where you want.
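For the overwrite-duplicates part, one possible sketch of that temporary-archive approach (the replace_in_zip helper below is an assumption for illustration, not code from the linked question; it stores paths relative to the added directory):

import os
import shutil
import zipfile

def replace_in_zip(zip_path, directory):
    """Add every file under `directory` to `zip_path`, overwriting duplicate entries."""
    # collect the files to add and their archive names (zip entries use forward slashes)
    new_names = set()
    to_add = []
    for root, _, names in os.walk(directory):
        for name in names:
            full = os.path.join(root, name)
            arcname = os.path.relpath(full, directory).replace(os.sep, '/')
            new_names.add(arcname)
            to_add.append((full, arcname))

    tmp_path = zip_path + '.tmp'
    with zipfile.ZipFile(zip_path, 'r') as old, zipfile.ZipFile(tmp_path, 'w') as new:
        # keep every old entry that is not being replaced
        for item in old.infolist():
            if item.filename not in new_names:
                new.writestr(item, old.read(item.filename))
        # then add the directory's contents
        for full, arcname in to_add:
            new.write(full, arcname)
    shutil.move(tmp_path, zip_path)  # swap the temporary archive in place of the original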
