Problems reading .bz2 or .tar.bz2 files as hdf5 in R

I downloaded some files with the extension .tar.bz2. I was able to untar these into folders containing .bz2 files. These should unzip as hdf5 files (the metadata says they are hdf5), but they unzip into files with no extension. I have tried the following, but it didn't work:
untar("File.tar.bz2")
#Read lines of one of the files from the unzipped file
readLines(bzfile("File1.bz2"))
[1] "‰HDF" "\032"
library (rhdf5)
#Explore just as a bzip2 file
bzfile("File1.bz2")
description "File1.bz2"
class "bzfile"
mode "rb"
text "text"
opened "closed"
can read "yes"
can write "yes"
#Try to read as hdf5 using rhdf5 library
h5ls(bzfile("File1.bz2"))
Error in h5checktypeOrOpenLoc(). Argument neither of class H5IdComponent nor a character.
Is there some sort of encoding I need to do? What am I missing? What should I do?
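One possible way forward, sketched here in Python rather than R: the "‰HDF" in the readLines output suggests the decompressed bytes already are HDF5, and the h5ls error says it wants a character argument (a file path) rather than a connection. So it may be enough to decompress the file to disk and open it by path. The function name and the output file name are placeholders, not part of the original post.

```python
import bz2
import shutil

# HDF5 format signature. It normally sits at offset 0; files that carry a
# user block place it at a later offset, which this sketch does not handle.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def decompress_bz2(src_path, dst_path):
    """Decompress the bzip2 file src_path to dst_path.

    Returns True if the decompressed file starts with the HDF5 signature.
    """
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files are not held in memory
    with open(dst_path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

Under these assumptions, decompress_bz2("File1.bz2", "File1.h5") would write File1.h5, and that path (a plain character string) is what could then be passed to h5ls.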

Related

Extract tar.gz{some integer} in python

I am trying to extract a file whose name has this format: filename.tar.gz10
I have tried multiple ways, but all of them fail with an unknown format error. It works fine for files ending with tar.gz00. I tried changing the name, but it still does not work.
Here is what I have tried:
import tarfile
file = tarfile.open('filename.tar.gz10')
file.extractall('./extracted_path')
file.close()
Another way is,
shutil.unpack_archive('./filename.tar.gz10', './extracted_path', 'tar.gz17')
Thanks for your help in advance.

This could be because the archive was split into smaller chunks. On Linux you can do this with the split -b command, so one big file is actually multiple smaller ones, named like
file.tar.gz01
file.tar.gz02
file.tar.gz03
file.tar.gz04
etc...
You won't be able to decompress these files individually, so you have to concatenate them into one file first and then decompress it.
To verify whether it was split, run file {filename}; if it does not recognize the file as a gzip compressed archive, then it is probably split (this is why you get the unknown format error).
You can try the following:
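from glob import glob
import os
path = '/path/to/' # location of your files
list_of_files = sorted(glob(path + '*.tar.gz*')) # list all parts, in order
# concatenate the parts into one archive (named .tgz so the glob above never matches it)
bash_command = 'cat ' + ' '.join(list_of_files) + ' > ' + path + 'combined.tgz'
os.system(bash_command)
os.system('tar -xzf ' + path + 'combined.tgz -C ' + path) # extract the combined archive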
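The file check mentioned above can be approximated in Python by looking at the first two bytes of each part. This is only a rough stand-in for file, and the function name is made up for illustration. In a split archive only the first part carries the gzip header, so later parts such as filename.tar.gz10 would normally fail this test.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the two bytes every gzip stream starts with

def looks_like_gzip(path):
    """Return True if the file at path begins with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

A part for which looks_like_gzip returns False is most likely a continuation chunk that needs to be concatenated with the others before extraction.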

How to define the condition of a corrupted file for audio file in Python

I am using Python 3.6 in a Jupyter notebook connected to a remote machine. I have a large dataset of mp3 files. I use FFmpeg (version 2.8.14-0ubuntu0.16.04.1) to convert mp3 files to wav format.
My code below goes over the file path list and, if a file is mp3, converts it to wav format and deletes the mp3 file. The code works, but for a few files it stops with an error. I opened those files and saw that they have no duration, and each of them has size 600 according to the size column of the terminal's folder listing, though that might be a coincidence. The error is file not found for 'temp_name.wav'.
I can see that these corrupted files cannot be converted to wav. When I delete them manually and run the code again, it works. But I have large datasets and cannot know beforehand which files are corrupted. Is there a way to make the code check, before converting a file to wav, whether it is corrupted, and if so delete it and continue to the next file? I just don't know how to define the condition of a corrupted file, or of a file that cannot be converted to wav.
# npaths is the list of full file paths
for fpath in npaths:
    if (fpath.endswith(".mp3")):
        cdir=os.path.dirname(fpath) # extract the directory of file
        os.chdir(cdir) # change the directory to cdir
        filename=os.path.basename(fpath) # extract the filename from the path
        os.system("ffmpeg -i {0} temp_name.wav".format(filename))
        ofnamepath=os.path.splitext(fpath)[0] # filename without extension
        temp_name=os.path.join(cdir, "temp_name.wav")
        new_name = os.path.join(ofnamepath+'.wav')
        os.rename(temp_name,new_name) # use original filename with wav ext
        old_file = os.path.join(ofnamepath+'.mp3') # find and delete the mp3
        os.remove(old_file)
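One way to define "cannot be converted" without inspecting the audio itself is to treat the conversion as the test: run the command, then check its exit status and whether it produced a non-empty output file. The sketch below is an assumption-laden illustration, not a tested fix for this dataset. The helper name is hypothetical, and it relies on the converter returning a non-zero exit status on failure.

```python
import os
import subprocess

def run_conversion(cmd, out_path):
    """Run cmd (a list of program arguments).

    Returns True only if the command exited with status 0 and left a
    non-empty file at out_path; otherwise returns False.
    """
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return (
        result.returncode == 0
        and os.path.isfile(out_path)
        and os.path.getsize(out_path) > 0
    )
```

In the loop above this could be called as run_conversion(["ffmpeg", "-y", "-i", fpath, new_name], new_name), renaming nothing and removing the mp3 only when it returns True, and skipping (or deleting) the file when it returns False. Passing the arguments as a list also avoids shell quoting problems with file names that contain spaces.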

Automatically rename files with the correct file extension in bulk

I have a folder with multiple types of files (mp3, mp4, jpg, wma, etc.), and these files either have no extension or have messed-up extensions such as mp3.mp3 or mp3.jpg. I read that exiftool or even python-magic can be used to assign the correct file extension by detecting the file type. I am looking for an exiftool-based solution that renames these files with the correct file extension.
eg
filename (this is mp3 file)
filename1.jpg ( this is again mp3 file, with jpg as file extension)
filename.mp3.mp3.mp3 (repetition of extension)

At the simplest, try this (change double quotes to single quotes if on Mac/Linux):
exiftool -ext "*" "-filename<$filename.$filetype" TargetDir
or, to preview the result first (TestName only prints the proposed names and renames nothing):
exiftool -ext "*" "-testname<%f.$filetype" TargetDir
That will simply add the extension to all the files in TargetDir. To recurse, add -r. If there was already an extension, this will add the proper extension after the false one, e.g. a JPEG file named filename.mp3 would become filename.mp3.jpeg.
For a more complex version which strips away some of the previous, false extensions, you could try something like this:
exiftool -ext "*" "-filename<${filename;s/(\.(mp3|mp4|jpe?g|png|wma|mov))*($)//i}%-c.$filetype" TargetDir
This strips away the extensions listed in the inner parentheses of the regex. The %-c adds a number if the resulting name would be a duplicate, e.g. filename.jpeg, filename-1.jpeg, … filename-n.jpeg.
Edit: added -ext option to deal with files without an extension.
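To see what the substitution in the more complex command does to a name, the same pattern can be tried in isolation. This is a Python re-creation of the regex only, with a hypothetical function name. It does not detect file types or rename anything.

```python
import re

# Same alternation as in the exiftool expression above; extend it as needed.
FALSE_EXTENSIONS = re.compile(r"(\.(mp3|mp4|jpe?g|png|wma|mov))*$", re.IGNORECASE)

def strip_false_extensions(name):
    """Remove a trailing run of the listed media extensions from name."""
    return FALSE_EXTENSIONS.sub("", name, count=1)
```

With the question's examples, strip_false_extensions("filename.mp3.mp3.mp3") gives "filename", strip_false_extensions("filename1.jpg") gives "filename1", and a name with no listed extension is returned unchanged.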

ParaView get file path

I am opening some VTU files from directory X, and there are other output files in that directory (for example log.txt) that I want to open via a plugin. If I call os.getcwd(), I end up in ParaView's installation directory. What I want is the directory of the VTU files I loaded BEFORE applying the plugin, so basically the starting point of the pipeline.

You could do something like this to get the reader
myreader = FindSource('MyReader')
then get the file name via the FileName attribute
myreader.FileName
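From the FileName value, the directory asked about in the question can be derived with os.path. The helper below is a sketch with a hypothetical name. It assumes FileName may come back either as a single path string or as a sequence of paths (as a reader for a file series may report), and it uses the first entry.

```python
import os

def source_directory(file_name):
    """Return the directory containing a reader's input file.

    Assumption: file_name is one path string or a sequence of paths;
    for a sequence, the first path is used.
    """
    if isinstance(file_name, str):
        first = file_name
    else:
        first = list(file_name)[0]
    return os.path.dirname(os.path.abspath(first))
```

Inside ParaView's Python environment this might be used as os.path.join(source_directory(myreader.FileName), "log.txt") to locate the log file next to the loaded VTU files.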

What is the file extension of a file called "foo.tar.bz2"?

Considering the file named:
foo.tar.bz2
What is the file extension? Is it .tar.bz2 or .bz2? Is it well defined?
Edit: The question here is one of the definition of a "file extension", or where the separation is between the file's name and its extension: is it "foo|.tar.bz2" or "foo.tar|.bz2"

The standard file extension would be .tar.bz2, but .tbz2 should suffice as a shortened extension.

tar means it is an archive file.
bz2 means it is compressed with bzip2.
To unarchive and get all the files, type this on a Unix command line:
tar jxf foo.tar.bz2
After that, the files will be decompressed and extracted.
The extension is the last one, .bz2.
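As one data point on where tools draw the line between name and extension, Python's standard library treats only the last dotted component as the extension, while also exposing the full chain of suffixes. This illustrates one library's convention and is not a general definition.

```python
import os
from pathlib import PurePath

name = "foo.tar.bz2"

last_suffix = PurePath(name).suffix      # ".bz2"
all_suffixes = PurePath(name).suffixes   # [".tar", ".bz2"]
stem, ext = os.path.splitext(name)       # ("foo.tar", ".bz2")
```

So under this convention the split is "foo.tar|.bz2", and code that wants ".tar.bz2" has to join the suffixes itself.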
