Make and read a custom webdataset - pytorch

Using the Python package webdataset, how do I make a tar archive from a dataset that consists of directories, where each directory is a class and contains files, say JPEG images? And how do I read such a webdataset back later with an iterator over the data loader?
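A minimal sketch of one way to do both, assuming the data lives under root/<class_name>/<image>.jpg; the make_tar helper and the key/label scheme are illustrative, not something webdataset prescribes:

```python
# A minimal sketch, assuming a layout like root/<class_name>/<image>.jpg.
# make_tar() and the key/label scheme are illustrative, not part of webdataset.
import os
import webdataset as wds

def make_tar(root, out="dataset.tar"):
    classes = sorted(os.listdir(root))
    with wds.TarWriter(out) as sink:
        for label, cls in enumerate(classes):
            cls_dir = os.path.join(root, cls)
            for i, fname in enumerate(sorted(os.listdir(cls_dir))):
                with open(os.path.join(cls_dir, fname), "rb") as f:
                    sink.write({
                        "__key__": f"{cls}_{i:06d}",  # unique key per sample
                        "jpg": f.read(),              # raw JPEG bytes
                        "cls": label,                 # integer class index
                    })

make_tar("my_dataset_root")

# Reading it back and iterating with a PyTorch DataLoader
import torch

dataset = (wds.WebDataset("dataset.tar")
           .decode("pil")            # decode .jpg entries to PIL images
           .to_tuple("jpg", "cls"))  # yield (image, label) pairs
loader = torch.utils.data.DataLoader(dataset, batch_size=None)
for image, label in loader:
    pass  # train / inspect here
```

For datasets too large for a single tar, webdataset also provides a ShardWriter that splits the output across numbered tar shards, which WebDataset can read back with a brace-expansion URL pattern.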

Related

Reading ".rec" file in python

I have time series sensor data (temperature, vibration) in .rec files and I am looking for a way to import them using Python. What I read while searching for a solution is that .rec files usually contain images (medical images); in my case it is time series data. There are tools such as MXNet and MNE, but they use .idx files along with the .rec files to iterate, and the examples are limited to images.
Currently, what I am doing is changing the extension to .csv and importing with pandas (delimiter '\t'). It works as intended, but I am looking for a clean solution that allows me to read, and ideally iterate over, .rec files directly.
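For reference, a minimal sketch of the workaround described in the question; the column names are assumptions, since the question does not list them:

```python
# Read the tab-separated .rec file directly with pandas.
# pandas does not care about the file extension, so renaming is unnecessary.
import pandas as pd

df = pd.read_csv("sensor_data.rec", delimiter="\t", header=None,
                 names=["timestamp", "temperature", "vibration"])  # assumed columns
for row in df.itertuples(index=False):  # iterate over the records
    print(row)
```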

How can I combine many files into single file without compression, keeping the same behavior across platforms?

I have a folder which includes a lot of subfolders and files. I want to combine all those files into one single large file. That file should be able to get expanded rendering back the original folder and files.
Another requirement is that the method should produce exactly the same output (single large file) across different platforms (Node.js, Android, iOS). I've tried the ZIP utility's store mode; it does produce one file combining all input files without compressing them, which is good. However, when I try it with Node.js and the Windows 7-Zip software (ZIP format, Store mode), the outputs are not exactly the same: the two large files' sizes differ slightly and of course their MD5 hashes differ. Though both can be expanded back into identical files, the single files don't meet my requirement.
Another option I tried is the tar file format. Node.js and 7-Zip produce different output there as well.
Is there anything I am missing with ZIP store mode or tar, e.g. specific versions or a customized ZIP utility?
Or could you suggest another method to accomplish this?
I need a method to combine files that behaves exactly the same across the Node.js, Android, and iOS platforms.
Thank you.
The problem is your requirement. You should only require that the files and directory structure be exactly reconstructed after extraction, not that the archive itself be exactly the same. Instead of running your MD5 on the archive, run it on the reconstructed files.
There is no way to assure the same zip result using different compressors, or different versions of the same compressor, or the same version of the same code with different settings. If you do not have complete control of the code creating and compressing the data, e.g., by virtue of having written it yourself and assuring portability across platforms, then you cannot guarantee that the archive files will be the same.
More importantly, there is no need to have that guarantee. If you want to assure the integrity of the transfer, check the result of extraction, not the intermediate archive file. Then your check is even better than checking the archive, since you are then also verifying that there were no bugs in the construction and extraction processes.
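As a sketch of that idea in Python: hash the extracted tree (relative paths plus file contents, in sorted order) instead of the archive itself. The directory name is illustrative.

```python
import hashlib
import os

def tree_md5(root):
    h = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # walk in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            h.update(rel.encode("utf-8"))    # include the path in the digest
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    return h.hexdigest()

# Identical trees produce identical digests, regardless of which tool
# (or platform) created the archive they were extracted from.
print(tree_md5("extracted_folder"))
```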

How to load JPG and PDF files into HBase using Spark?

I have image files in HDFS and I need to load them into HBase. Can I use Spark to get this done instead of MapReduce? If so, how? Please suggest; I am new to the Hadoop ecosystem.
I have created an HBase table with a MOB column type and a threshold of 10 MB.
I am stuck on how to load the data from the shell command line.
After some research I found a couple of recommendations to use MapReduce, but they were not informative.
You can use Apache Tika along with sc.binaryFiles(filesPath). Tika supports many formats, out of which you need the following two:
Image formats: The ImageParser class uses the standard javax.imageio feature to extract simple metadata from image formats supported by the Java platform. More complex image metadata is available through the JpegParser and TiffParser classes, which use the metadata-extractor library to support Exif metadata extraction from JPEG and TIFF images.
and
Portable Document Format: The PDFParser class parses Portable Document Format (PDF) documents using the Apache PDFBox library.
See my other answers for example code with Spark and for loading the data into HBase.
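Since the linked examples are not reproduced here, a hedged sketch of the overall flow in PySpark might look like the following; happybase (an HBase Thrift client), the host, table name, and "mob" column family are assumptions rather than part of the original answer:

```python
# Read binary files from HDFS and write each one into an HBase MOB column.
import happybase
from pyspark import SparkContext

sc = SparkContext(appName="load-binaries-to-hbase")
files = sc.binaryFiles("hdfs:///data/docs/")  # RDD of (path, raw bytes)

def store_partition(records):
    conn = happybase.Connection("hbase-thrift-host")  # hypothetical host
    table = conn.table("documents")                   # pre-created MOB table
    for path, payload in records:
        row_key = path.rsplit("/", 1)[-1]             # file name as row key
        table.put(row_key.encode("utf-8"), {b"mob:data": payload})
    conn.close()

files.foreachPartition(store_partition)
```

Tika (or any other parser) could be applied to each (path, bytes) pair in the same per-partition loop if metadata extraction is needed before writing.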

How to find the compression level of a zip file?

I would like to know how to find the compression level of a zip file. Zip files made by 7z and WinZip have different scales for their levels, so I would like to map a few of them to their corresponding level in the other tool.
Store mode (level 0) should be the same for all, but how do we check?
Or, to be specific:
How can we find the compression level of a zip file from the file data,
or
by comparing it with other zip files whose level we know?
Compression algorithm in question: DEFLATE.
The only way is to recompress the zip file with different levels until you find the one that matches the lengths. You could just recompress one of the entries to find the level, on the assumption that the entire zip file used the same level.
Even that only works if you know which tool, and which version of the tool, was used, e.g. 7z, WinZip, or Info-ZIP.
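A sketch of that brute-force approach, using Python's zlib; this only approximates tools that are themselves zlib-based, and the archive name is illustrative:

```python
import zipfile
import zlib

def guess_level(zip_path):
    with zipfile.ZipFile(zip_path) as zf:
        info = zf.infolist()[0]            # try a single entry
        raw = zf.read(info.filename)       # its decompressed data
        for level in range(10):            # levels 0-9
            # zip entries hold a raw DEFLATE stream, hence wbits=-15
            comp = zlib.compressobj(level, zlib.DEFLATED, -15)
            size = len(comp.compress(raw) + comp.flush())
            if size == info.compress_size:
                return level
    return None

print(guess_level("example.zip"))
```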

automatically partition audio files into small parts

I am looking for a way to automatically extract parts from audio files, something like ImageMagick for audio files.
I only need to extract random parts of a fixed length from a large set of complete Ogg Vorbis files. I know how to automatically interpret the output from a program, so I would be able to write a small script if I had programs to do the following:
Get the length of the file
Extract a part given an offset in seconds and a length
Is there any program that allows me to do this under Linux? The files I am using are Ogg Vorbis files.
If there is a Python library that can do this, that would work as well.
You can use SoX (Sound eXchange) to do both.
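For example, a small Python wrapper around the sox/soxi command-line tools might look like this; the file names and the 10-second clip length are illustrative:

```python
import random
import subprocess

def duration_seconds(path):
    # `soxi -D file` prints the duration in seconds
    return float(subprocess.check_output(["soxi", "-D", path]))

def extract_random_part(src, dst, part_len=10.0):
    total = duration_seconds(src)
    offset = random.uniform(0, max(total - part_len, 0.0))
    # `sox in out trim <offset> <length>` cuts the requested slice
    subprocess.run(["sox", src, dst, "trim", str(offset), str(part_len)],
                   check=True)

extract_random_part("input.ogg", "part.ogg")
```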
