How to load JPG, PDF files to HBase using Spark? - apache-spark

I have image files in HDFS and I need to load them into HBase. Can I use Spark to do this instead of MapReduce? If so, how? Please suggest an approach; I am new to the Hadoop ecosystem.
I have created an HBase table with a MOB column family with a threshold of 10 MB.
I am stuck on how to load the data from the shell/command line.
After some research I found a couple of recommendations to use MapReduce, but they were not very informative.

You can use Apache Tika along with sc.binaryFiles(filesPath). Among the formats supported by Tika, the ones you need are:
Image formats: The ImageParser class uses the standard javax.imageio feature to extract simple metadata from image formats supported by the Java platform. More complex image metadata is available through the JpegParser and TiffParser classes, which use the metadata-extractor library to support Exif metadata extraction from Jpeg and Tiff images.
and
Portable Document Format: The PDFParser class parses Portable Document Format (PDF) documents using the Apache PDFBox library.
Example code with Spark can be found in my other answer, and another answer of mine gives example code for loading into HBase.
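The linked example answers are not reproduced here, but as a rough illustration of the approach, here is a minimal PySpark sketch: read the raw files with sc.binaryFiles, optionally run them through Tika via the tika-python package, and write the bytes into an HBase MOB column through happybase/Thrift. The host, table name and column family are placeholders, and the sketch assumes tika-python and happybase are installed on the executors.

```python
# Minimal sketch: HDFS binaries -> Tika metadata -> HBase MOB column.
# "hbase-thrift-host", table "images" and family "data" are placeholders.
from pyspark import SparkContext
from tika import parser as tika_parser  # tika-python, talks to a Tika server
import happybase
import os

sc = SparkContext(appName="load-binaries-to-hbase")

def store_partition(records):
    # One HBase connection per partition, so it is not serialized with the closure.
    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("images")
    for path, content in records:
        # Optional: let Tika extract metadata from the raw bytes (JPG, PDF, ...).
        parsed = tika_parser.from_buffer(content)
        metadata = str(parsed.get("metadata", {}))
        row_key = os.path.basename(path)
        table.put(row_key, {
            b"data:raw": bytes(content),          # MOB column holding the file bytes
            b"data:metadata": metadata.encode(),  # Tika metadata as a string
        })
    connection.close()

# binaryFiles yields (path, bytes) pairs for each file under the directory.
binaries = sc.binaryFiles("hdfs:///user/me/images/")
binaries.foreachPartition(store_partition)
```

For very large volumes you may prefer HBase's bulk-load path over per-row puts, but the sketch shows the basic flow.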

Related

Reading ".rec" file in Python

I have timeseries sensor data (temperature, vibration) in .rec files. I am looking for a way to import them using Python. While searching for a solution, what I read is that a .rec file usually contains images (medical images); in my case it is time series data. There are tools such as MXNet and MNE, but they use .idx files along with .rec files to iterate, and the examples are limited to images.
Currently, what I am doing is changing the extension to .csv and importing with pandas (delimiter '\t'). It works as intended, but I am looking for a clean solution that directly allows me to read and, in the best case, iterate over .rec files.
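If the .rec files really are just tab-separated text, pandas can read them directly without renaming them; a minimal sketch (the file and directory names are assumptions):

```python
import pandas as pd
from pathlib import Path

# pandas does not care about the extension; sep="\t" matches the tab-delimited layout.
df = pd.read_csv("sensor_data.rec", sep="\t")

# Iterating over many .rec files in a directory, one DataFrame per file.
for rec_file in Path("recordings").glob("*.rec"):
    df = pd.read_csv(rec_file, sep="\t")
    # ... process each timeseries here ...
```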

Is the feather file format still relevant, or is the community leaning towards other file formats for large file storage?

I'm exploring file storage format options for Python and stumbled on feather. I noticed the last release was back in 2017 and was concerned about its long term existence.
Web searches are pulling back posts that all seem to stop around 2017.
The feather format is still relevant, and support for more data types, especially on the R side, has improved a lot recently. A notable change is that it is no longer released as a separate package but ships as part of Apache Arrow (https://arrow.apache.org/), where it is actively developed.
The other format the community is leaning towards is Apache Parquet. There are some differences between Feather and Parquet that may lead you to choose one over the other, e.g. Feather writes the data as-is while Parquet encodes and compresses it to achieve much smaller files. Additionally, Parquet is also available in the Java world, which might come in handy. Feather and Parquet are both available in R through the arrow library and in Python as part of pyarrow.
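For reference, a small Python sketch of writing and reading both formats, assuming pandas and pyarrow are installed (the DataFrame contents and file names are just placeholders):

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Feather: written as-is, fast to read and write, no heavy encoding.
feather.write_feather(df, "data.feather")
df_feather = feather.read_feather("data.feather")

# Parquet: encoded and compressed, usually much smaller on disk.
df.to_parquet("data.parquet")          # uses pyarrow under the hood
df_parquet = pd.read_parquet("data.parquet")
```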

What is the most lightweight method to load different image file formats in nodejs and read pixels?

Is there a unified and lightweight method for loading multiple common image file formats in NodeJS which provides read access to individual pixels?
It should support gif, jpeg, and png.
Preferably it would either support other image formats too or provide a way to add more. (webp, etc.)
It does not need to be able to save the file again after modifying pixels, provide metadata access, or anything else.
It doesn't need to be able to load images from URLs.
So far, the libraries I have found that support multiple image formats are heavyweight, e.g. they provide full canvas support or full image processing support.
Is there a lightweight way to do this that I'm not finding?
I don't know why I couldn't find this one before posting here:
get-pixels
Given a URL/path, grab all the pixels in an image and return the result as an ndarray. Written in 100% JavaScript, works both in browserify and in node.js and has no external native dependencies.
Currently the following file formats are supported:
PNG
JPEG
GIF
It hasn't had any updates for two years but seems the most lightweight. I'm guessing people mostly use Jimp these days; Jimp doesn't seem to have external dependencies and is actively developed, but it includes a lot of image processing functionality I don't need.

How to convert an Enhanced Windows Metafile (emf) to a JPG or PNG without losing quality?

I am using Tableau to do some data representations, and the only good-quality image export Tableau allows is *.emf.
Unfortunately, the online tool I use to put the report together (Canva) does not support the emf format.
When I convert the file to jpg or png, the quality is drastically reduced :(
How can I overcome this? I have tried many things, such as opening the emf in Illustrator and saving it back with CMYK colors and 300 dpi, but nothing seems to keep the crisp quality of the original emf file.
User-friendly solution:
Inkscape opens Enhanced Windows Metafiles and many other vector graphics file formats.
It exports to PNG with a choice of output resolution.
It is open source and available for Linux, Windows and Mac OS X.
Tableau's image export feature does not provide many options. In general, when I need high-quality images, I use one of the methods below, depending on the quality I need and the tools available to me at the time:
Screenshot method: If you have a large screen, taking a screenshot directly from Tableau yields better images than the exported ones. If my viz is published to the web, I sometimes enlarge the graphic in my web browser and then take the screenshot.
Converting from PDF: Since PDF can contain vector objects, Tableau's PDF exports are high quality most of the time. If you cannot use these PDF files directly, you may try converting them to PNG or JPG using online or desktop tools. Here is an online tool you may use for this purpose, but be careful about your confidential files when using such online services :)
There are more ways to convert from PDF, but they are usually more complicated since they involve some Photoshop steps. I am not sure whether these methods are easy to apply to a lot of files, but you may still want to check one of them: https://community.tableau.com/thread/120134
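As a scriptable variant of the PDF route (not from the original answer, just a suggestion), the Python pdf2image package, which wraps poppler, can rasterize the exported PDF at a chosen DPI:

```python
# Hedged sketch using the pdf2image package (requires poppler to be installed).
# The file names and the 300 dpi value are example choices, not requirements.
from pdf2image import convert_from_path

pages = convert_from_path("tableau_export.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save(f"report_page_{i}.png", "PNG")
```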

Efficient image compression for PDF embedding with Linux

I would like to compress scanned text (monochrome or few colours) and store it in PDF (maybe DjVu) files. I remember that I got very good results with Windows/Acrobat and "ZRLE"-compressed monochrome TIFF embedded into PDF; the algorithm was lossless as far as I remember. Now I am searching for a way to obtain good results on Linux. It should save storage and avoid loss (I do not mind losing colours, but I do not want e.g. JPEG compression, which would create noisy results for text scans). I need it for batch conversion, so I was thinking of the ImageMagick convert command. But which output format should I use to get good results and to be able to embed it into PDF files (for example using pdflatex)? Or is it generally better to use DjVu files?
jbig2enc, an encoder for images using JBIG2 compression, was originally written for Google Books by Adam Langley:
https://github.com/agl/jbig2enc
I forked it to include the latest improvements by Rubypdf and others:
https://github.com/DingoDog/jbig2enc
I also built several binaries of jbig2enc for Puppy Linux (they may also work on other distributions):
http://dokupuppylinux.info/programs:encoders
DjVu is not a bad choice, but if you want to stay with PDF for better compatibility, you may want to look into lossless JBIG2 compression.
Quote from Wikipedia:
Overall, the algorithm used by JBIG2 to compress text is very similar
to the JB2 compression scheme used in the DjVu file format for coding
binary images.
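As a sketch of how such a batch conversion might be scripted, here is a Python wrapper around the jbig2 binary and its pdf.py helper from the repository above. The flags and output naming follow the upstream examples, so treat them as assumptions and verify them against your build:

```python
import glob
import subprocess

# Scanned pages to compress; adjust the pattern to your input files.
pages = sorted(glob.glob("scans/*.tif"))

# -s: symbol/text mode (good for text scans), -p: produce PDF-ready output
# files (output.sym / output.NNNN) in the current directory.
subprocess.run(["jbig2", "-s", "-p"] + pages, check=True)

# The pdf.py helper shipped with jbig2enc assembles those files into a PDF
# and writes it to stdout, so redirect it into the target file.
with open("compressed.pdf", "wb") as out:
    subprocess.run(["python", "pdf.py", "output"], stdout=out, check=True)
```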
