I'm trying to use a compression technique in Spark:
files1.saveAsSequenceFile("/sequ1", Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
Now I want to read from the compressed file, so I need to decompress it.
How can I do that? Can anyone help me with this?
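For reference, a SequenceFile records the codec it was written with in its own header, so Hadoop decompresses it transparently on read; you don't pass a codec at all. A minimal sketch, assuming the pairs were written with String-compatible keys and values (the types are an assumption):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("read-seq"))

// The codec is read from the file header; no explicit decompression step is needed.
// (String, String) is an assumption -- use the types the file was written with.
val files1 = sc.sequenceFile[String, String]("/sequ1")
files1.take(5).foreach(println)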
I'm trying to follow the code in this link. However, the code is in Scala. I want to know if there is an equivalent of StreamSinkProvider in PySpark, or if there is another way to build a custom stream sink in PySpark, because I am not good at Scala.
Thank you in advance
I am trying to parse a very large gzip-compressed (10+ GB) file in Python 3. Instead of creating the parse tree, I used embedded actions, based on the suggestions in this answer.
However, looking at the FileStream code, it wants to read the entire file and then parse it. This will not work for big files.
So, this is a two-part question.
Can ANTLR4 use a file stream, probably a custom one, that allows it to read chunks of the file at a time? What should the class interface look like?
Assuming the answer to the above is "yes", would that class need to handle seek operations? That would be a problem if the underlying file is gzip compressed.
Short answer: no, not possible.
Long(er) answer: ANTLR4 can potentially use unlimited lookahead, so it relies on the stream being able to seek to any position with no delay, or parsing speed will drop to nearly a halt. For that reason, all runtimes use a normal file stream that reads in the entire file at once.
There were discussions/attempts in the past to create a stream that buffers only part of the input, but I haven't heard of anything that actually works.
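The question is about the Python runtime, but the constraint is the same everywhere: the whole input has to be in memory. If the decompressed text fits in RAM, one workaround is to gunzip it fully and hand the string to ANTLR. A minimal Scala/JVM sketch of that idea (the file name and generated-lexer name are hypothetical):

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import org.antlr.v4.runtime.CharStreams

// Decompress the whole gzip into memory -- only viable if it fits in RAM.
val gz = new GZIPInputStream(new FileInputStream("input.txt.gz"))
val text = scala.io.Source.fromInputStream(gz, "UTF-8").mkString
gz.close()

// CharStreams.fromString keeps the entire decompressed input in memory,
// which is exactly the whole-input requirement described above.
val charStream = CharStreams.fromString(text)
// val lexer = new MyLexer(charStream)  // MyLexer: your generated lexer (hypothetical)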
I have image files in HDFS and I need to load them into HBase. Can I use Spark to get this done instead of MapReduce? If so, how? Please suggest; I am new to the Hadoop ecosystem.
I have created an HBase table of MOB type with a threshold of 10 MB.
I am stuck on how to load the data using the shell command line.
After some research there were a couple of recommendations to use MapReduce, but they were not informative.
You can use Apache Tika along with sc.binaryFiles(filesPath). Tika supports many formats, out of which you need:
Image formats: The ImageParser class uses the standard javax.imageio feature to extract simple metadata from image formats supported by the Java platform. More complex image metadata is available through the JpegParser and TiffParser classes, which use the metadata-extractor library to support Exif metadata extraction from JPEG and TIFF images.
and
Portable Document Format: The PDFParser class parses Portable Document Format (PDF) documents using the Apache PDFBox library.
For example code with Spark, see my answer.
Another example answer of mine, showing how to load into HBase, is given here.
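Since both linked answers are elided here, a minimal sketch of the idea under stated assumptions: read the images with sc.binaryFiles, extract metadata with Tika's ImageParser, and write the raw bytes into the MOB table with the standard HBase client. The table name "images" and the column families "cf"/"meta" are assumptions:

import java.io.ByteArrayInputStream
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.image.ImageParser
import org.apache.tika.sax.BodyContentHandler

val sc = new SparkContext(new SparkConf().setAppName("images-to-hbase"))

// Each element is (hdfsPath, PortableDataStream); bytes are read lazily.
val images = sc.binaryFiles("/images")

images.foreachPartition { part =>
  // One HBase connection per partition, not per record.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("images"))  // assumed MOB-enabled table

  part.foreach { case (path, stream) =>
    val bytes = stream.toArray()

    // Extract simple image metadata with Tika's ImageParser.
    val metadata = new Metadata()
    new ImageParser().parse(
      new ByteArrayInputStream(bytes), new BodyContentHandler(), metadata, new ParseContext())

    val put = new Put(Bytes.toBytes(path))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("data"), bytes)  // raw image into the MOB column
    metadata.names().foreach { n =>
      put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes(n), Bytes.toBytes(metadata.get(n)))
    }
    table.put(put)
  }
  table.close()
  conn.close()
}

One connection per partition keeps HBase round-trips cheap; opening a connection per record would dominate the runtime.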
I need to work with TIFF images online. TIFF images are not supported by browsers, so I thought maybe I could convert them on the fly and stream them to the browser as PNGs.
I found many Haskell image-processing libraries; JuicyPixels looks simple enough and supports reading TIFF and saving to many other formats, including PNG.
The simplest approach is to just save to a PNG file and then serve it with sendFile.
But I think involving the hard drive in the process is going to add too much overhead and substantially slow down the response. So my question is: how do I stream an image converted with JuicyPixels from TIFF to PNG directly, without saving it to a file first?
Does JuicyPixels have any streaming interfaces? Or maybe there's a simple enough way to get at the data representation in a specific format and then pass it to a streaming library like conduit?
As a side question, has anyone streamed images from Yesod?
I don't have any experience with JuicyPixels, but it looks like it encodes to lazy ByteStrings. If that's the case, then you just need to return that lazy ByteString wrapped up in a DontFullyEvaluate.
I'm really kind of surprised I couldn't find an answer to this on Google, especially since XML files lend themselves to being zipped, as they are so verbose.
I'm using the SAX reader from the MSXML library in my VB6 program to read large, multi-gigabyte XML files from a zip file. Unzipping these files to the hard drive and then reading them is not the way to go, since unzipping to disk is unnecessary and slow. This is where my problem comes in.
I can use zlib to read chunks of data from the zip file and process those chunks, but I don't see any way in the SAXXMLReader to process chunks. I've read that the parse method accepts an IStream, but I haven't been able to find any way, via Google, to get an IStream from a zip file.
Can anyone here please provide me an answer to this problem or a pointer in the right direction? Thank you so much for your time.
The idea of getting a stream from the zip file is certainly how I'd deal with this in Java.
I'm not a .Net developer, so it's hard for me to certify this, but it sort of looks like SharpZipLib may have what you're looking for.
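To illustrate the "stream straight out of the zip" idea on the Java side (the VB6/MSXML case would still need an IStream wrapper around such a stream), here is a minimal JVM sketch in Scala using java.util.zip; the archive and entry names are hypothetical:

import java.util.zip.ZipFile
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// getInputStream inflates the entry on the fly -- the XML never touches disk.
val zip = new ZipFile("big.zip")
val in  = zip.getInputStream(zip.getEntry("huge.xml"))

// A SAX parser consumes the stream chunk by chunk, emitting events as it goes.
val handler = new DefaultHandler {
  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit =
    println(s"element: $qName")  // handle events without materializing the file
}
SAXParserFactory.newInstance().newSAXParser().parse(in, handler)

in.close()
zip.close()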