Extract tar.gz{some integer} in python - python-3.x

I am trying to extract a file whose name has this format: filename.tar.gz10
I have tried multiple ways, but for all of them I get an error saying the format is unknown. It works fine for files ending with tar.gz00. I tried changing the name, but it still does not work.
Here is what I have tried:
import tarfile
file = tarfile.open('filename.tar.gz10')
file.extractall('./extracted_path')
file.close()
Another way I tried is:
import shutil
shutil.unpack_archive('./filename.tar.gz10', './extracted_path', 'tar.gz17')
Thanks for your help in advance.

This could be because the archive was split into smaller chunks. On Linux you could do so using the split -b command, so one big file is actually multiple smaller ones now, and they are named like
file.tar.gz01
file.tar.gz02
file.tar.gz03
file.tar.gz04
etc...
You won't be able to decompress these files individually, so you have to concatenate them into one file first and then decompress.
To verify whether it was split or not, run file {filename}; if it does not recognize a gzip compressed archive, then the file was probably split (this is why you get the unknown format error).
You can try to do the following:
from glob import glob
import os

path = '/path/to/' # location of your files
list_of_files = sorted(glob(path + '*.tar.gz*')) # list all chunks, in order
# build a bash command that concatenates the chunks back into one archive
bash_command = 'cat ' + ' '.join(list_of_files) + ' > ' + path + 'filename.tar.gz'
os.system(bash_command)
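If you would rather stay entirely in Python, here is a minimal sketch of the same idea, assuming the chunks sit next to each other and are named filename.tar.gz00, filename.tar.gz01, and so on (the paths are placeholders): concatenate them with shutil.copyfileobj, then extract the reassembled archive with tarfile:
import glob
import shutil
import tarfile

chunks = sorted(glob.glob('/path/to/filename.tar.gz*')) # the chunk files, in order
combined = '/path/to/filename_combined.tar.gz' # write the reassembled archive under a new name
with open(combined, 'wb') as out:
    for chunk in chunks:
        with open(chunk, 'rb') as part:
            shutil.copyfileobj(part, out) # append each chunk to the combined archive

with tarfile.open(combined) as tar: # tarfile detects the gzip compression itself
    tar.extractall('./extracted_path')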

Related

How to create a file in python having name = ".gitignore"

I was trying to create a file without a name in Python (only a file extension).
I tried this:
open(".gitignore","w+").close()
But it does not work.
Edit: it does work; the real issue is in getting the file through glob.glob.
import glob

classify_folder_name = '/path/to/folder' # path of the folder which contains the .gitignore file
rel_paths = glob.glob(classify_folder_name + '/**', recursive=True)
for local_file in rel_paths:
    print(local_file)
It does not print the .gitignore file.
Any help will be appreciated.
Note: I don't want to use os.listdir().
There are a few things that you might check:
Files with a dot at the beginning are hidden, so whatever OS you are using, make sure you have hidden-file visibility enabled.
The file might have been saved in a different directory.
open(".gitignore","w+").close()
It would be better to create the file like this:
with open('.gitignore', 'w') as fp:
    pass
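On the glob side of the edit: by default, glob patterns such as '/**' do not match names that start with a dot, which would explain why the loop never prints .gitignore even though the file exists. A small sketch of one way around that, with the folder path as a placeholder (Python 3.11+ also accepts include_hidden=True):
import glob

classify_folder_name = '/path/to/folder' # placeholder path
# '**' skips dotfiles, so add an explicit pattern for hidden names as well
rel_paths = glob.glob(classify_folder_name + '/**', recursive=True)
rel_paths += glob.glob(classify_folder_name + '/**/.*', recursive=True)
for local_file in rel_paths:
    print(local_file)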

How do I write a python script to read through files in a Linux directory and perform certain actions?

I need to write a Python script that reads through the files in a directory and retrieves the header record (which contains a date). I need to compare the date in the header record of each file with the current date, and if the difference is greater than 30 days, I need to delete the file.
I managed to come up with the code below, but I'm not sure how to proceed since I am new to Python.
Example:
Sample file in the directory (/tmp/ah): abcdedfgh1234.123456
Header record : FILE-edidc40: 20200602-123539 46082 /tmp/ah/srcfile
I have the code below for listing the files in the current directory. I need the Python equivalent of the following actions on the Unix files:
head -1 file | cut -c 15-22
Output: 20200206 (to be compared with the current date; if it is older than 30 days, delete the file, as rm would).
import os

def files(path):
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file

for file in files("."): # prints the list of files
    print(file)
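A minimal sketch of the whole task, assuming the header layout from the example above (the date sits in characters 15-22 of the first line, exactly what head -1 file | cut -c 15-22 extracts):
import os
from datetime import datetime, timedelta

path = "/tmp/ah" # directory to scan, as in the example

for name in os.listdir(path):
    full = os.path.join(path, name)
    if not os.path.isfile(full):
        continue
    with open(full) as fh:
        header = fh.readline() # the header record is the first line
    date_str = header[14:22] # characters 15-22, same slice as `cut -c 15-22`
    try:
        file_date = datetime.strptime(date_str, "%Y%m%d")
    except ValueError:
        continue # no date in the expected position, leave the file alone
    if datetime.now() - file_date > timedelta(days=30):
        os.remove(full) # Python equivalent of `rm`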

How to define the condition of a corrupted audio file in Python

I am using Python 3.6, Jupyter notebook by connecting to a remote machine. I have a large dataset of mp3 files. I use FFmpeg (version is 2.8.14-0ubuntu0.16.04.1.) to convert mp3 files to wav format.
My code below goes over the file path list; if a file is mp3, it converts it to wav format and deletes the mp3 file. The code works, but for a few files it stops and gives an error. I opened those files and saw that they have no duration, and each of them has size 600 according to the terminal's folder size column, but that might be a coincidence. The error is "file not found" for 'temp_name.wav'.
I can see that these corrupted files cannot be converted to wav. When I delete them manually and run the code again, it works. But I have large datasets and cannot know which files are corrupted beforehand. Is there a way to make the code (before converting the file to wav) delete a corrupted file and continue to the next one? I just don't know how to define the condition of a corrupted file, or of a file that cannot be converted to wav.
# npaths is the list of full file paths
for fpath in npaths:
    if (fpath.endswith(".mp3")):
        cdir = os.path.dirname(fpath) # extract the directory of the file
        os.chdir(cdir) # change the directory to cdir
        filename = os.path.basename(fpath) # extract the filename from the path
        os.system("ffmpeg -i {0} temp_name.wav".format(filename))
        ofnamepath = os.path.splitext(fpath)[0] # filename without extension
        temp_name = os.path.join(cdir, "temp_name.wav")
        new_name = os.path.join(ofnamepath + '.wav')
        os.rename(temp_name, new_name) # use original filename with wav ext
        old_file = os.path.join(ofnamepath + '.mp3') # find and delete the mp3
        os.remove(old_file)
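One way to define the "corrupted" condition is to let ffmpeg tell you: it exits with a non-zero return code when it cannot decode the input. A sketch using subprocess instead of os.system (the helper name and the subprocess switch are my own choices, not from the question):
import os
import subprocess

def convert_or_discard(npaths):
    """Convert each mp3 in npaths to wav; delete mp3s that ffmpeg cannot decode."""
    for fpath in npaths:
        if not fpath.endswith(".mp3"):
            continue
        wav_path = os.path.splitext(fpath)[0] + ".wav"
        result = subprocess.run(
            ["ffmpeg", "-y", "-i", fpath, wav_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0 or not os.path.exists(wav_path):
            # conversion failed: treat the mp3 as corrupted
            if os.path.exists(wav_path):
                os.remove(wav_path) # drop any partial output
            os.remove(fpath)
            continue
        os.remove(fpath) # conversion succeeded, delete the original mp3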

Python: "FileNotFoundError" Despite being able to print such files

I'm working on a Python3 script where the code walks through directories and sub-directories to pull out all gzipped warc files.
I'd also like to add that the files are not in my home directory:
file_path = os.path.join('/nappa7/pip73/Service')
walk_file(parallel_bulk, file_path)
Perhaps Python is not looking where I think it's looking; nevertheless, here is my walk_file function:
import os
import warcat.model

def walk_file(bulk, file_path):
    warc = warcat.model.WARC()
    try:
        for (file_path, dirs, files) in os.walk(file_path):
            for filenames in files:
                if filenames.endswith('.warc.gz'):
                    warc.load(filenames)
    except ValueError:
        pass
When I replace the warc.load(filenames) with a print statement like so:
if filenames.endswith('.warc.gz'):
    print(filenames)
The filenames are printed to the console as expected, which leads me to believe that Python was able to successfully locate all the warc.gz files. However, when I try warc.load(filenames), I get:
FileNotFoundError: [Errno 2] No such file or directory: 'Sample.warc.gz'
I can certainly use some guidance.
Thank you.
So for anyone else who has a similar issue:
changing the code to this worked:
warc.load(os.path.join(file_path, filenames))
You need to use os.path.join(file_path, filenames) instead of just filenames.
Otherwise the operating system will look for the file in the current directory instead of file_path.
(And why is filenames plural when it refers to a single filename?)
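For completeness, the whole function with that join applied might look roughly like this (a sketch keeping the original names, with the loop variables renamed so the directory yielded by os.walk is not confused with the bare filename):
import os
import warcat.model

def walk_file(bulk, file_path):
    warc = warcat.model.WARC()
    for dirpath, dirs, files in os.walk(file_path):
        for filename in files:
            if filename.endswith('.warc.gz'):
                # join the directory reported by os.walk with the bare filename
                warc.load(os.path.join(dirpath, filename))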

Is there a way to undo a batch-rename of file extensions?

Ok so I kinda dropped the ball. I was trying to understand how things work. I had a few html files on my computer that I was trying to rename as txt files. This was strictly a learning exercise. Following the instructions I found here using this code:
for file in *.html
do
mv "$file" "${file%.html}.txt"
done
produced this error:
mv: rename *.html to *.txt: No such file or directory
Long story short, I ended up going rogue and renamed the html files, as well as a lot of other non-html files, as txt files. So now I have files labeled like
my_movie.mp4.txt
my_song.mp3.txt
my_file.txt.txt
This may be a really dumb question, but is there a way to check whether a file has two extensions and, if so, remove the last one? Or any other way to undo this mess?
EDIT
Running find . -name "*.*.txt" -exec echo {} \; | cat -b seems to tell me what was changed and where it is located. The cat -b part is not necessary, but I like it. This still doesn't fix what I broke, though.
I'm not sure if the terminal can check for extensions "twice", but you can check for . in every name, and if there is more than one occurrence of ., then your file has more than one extension. You can then cut the last extension off by finding the first occurrence of . when going backwards through the string (or the last one if scanning the string the normal way).
I have a faster option for you if you can use Python. You can strip the extension with:
for file in list_of_files:
    os.rename(file, os.path.splitext(file)[0])
which turns your file.txt.txt into file.txt.
Example:
You wrote that your command tells you what has changed, so just take those changed files and dump them into a file (one path per line). Then you can easily run this:
import os

with open('<path to list>') as f:
    list_of_files = f.readlines()

for file in list_of_files:
    os.rename(file.strip('\n'), os.path.splitext(file.strip('\n'))[0])
If not, then you'd need to get the list from Python:
import os

results = []
for root, folder, filenames in os.walk('<your path to folder>'):
    for filename in filenames:
        if filename.endswith('.txt.txt'):
            results.append(os.path.join(root, filename))
With this you get a list of files ending with .txt.txt, where each entry looks like <your folder>/<path_to_file>.
Take the path to the directory you used in os.walk(), without the folder's own name (that part is already in the list entries). For example, os.walk('/home/me/directory') -> path = '/home/me/', and each res in the list already starts with directory/...
path = '' # set the path here (the parent of the directory passed to os.walk())
for res in results:
    file = os.path.join(path, res)
    os.rename(file, os.path.splitext(file)[0])
Depending on which files you want to find, change .txt.txt in filename.endswith('...') to whatever you like; os.path.splitext() gives the file's name without its last extension, which in your case means it strips the additional extension you don't want to have.
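The two steps above can also be collapsed into a single pass, roughly like this (same os.walk / os.path.splitext / os.rename technique, just checking for a doubled extension on the fly; the folder path is a placeholder):
import os

for root, folders, filenames in os.walk('<your path to folder>'):
    for filename in filenames:
        stem, ext = os.path.splitext(filename)
        # two extensions and the last one is .txt, e.g. my_movie.mp4.txt
        if ext == '.txt' and os.path.splitext(stem)[1]:
            full = os.path.join(root, filename)
            os.rename(full, os.path.join(root, stem)) # drop the trailing .txt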
