custom stream sink in pyspark - apache-spark

I'm trying to follow the code in this link. However, the code is in Scala. I want to know if there is an equivalent of StreamSinkProvider in PySpark, or if there is another way to build a custom stream sink in PySpark, because I am not good at Scala.
Thank you in advance
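
As far as I know there is no direct PySpark equivalent of StreamSinkProvider (that interface is Scala/Java only), but since Spark 2.4 the foreachBatch hook on writeStream lets you implement custom sink logic in plain Python. A minimal sketch, assuming Spark 2.4+; the rate source, output path and checkpoint location are just placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("custom-sink-sketch").getOrCreate()

# Built-in test source that emits a few rows per second
stream_df = (spark.readStream
             .format("rate")
             .option("rowsPerSecond", 5)
             .load())

def write_to_my_store(batch_df, batch_id):
    # batch_df is a regular (non-streaming) DataFrame, so any batch writer
    # or arbitrary Python logic can be used here; this is the "custom sink" part.
    batch_df.write.mode("append").parquet("/tmp/custom_sink_output")  # placeholder path

query = (stream_df.writeStream
         .foreachBatch(write_to_my_store)
         .outputMode("append")
         .option("checkpointLocation", "/tmp/custom_sink_checkpoint")  # placeholder path
         .start())
foreachBatch trades the low-level Sink interface for a per-micro-batch callback, which is usually enough to build a custom sink without writing any Scala.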

Related

How to get offsets for ElectraTokenizer

I am trying to use the ELECTRA model from the HuggingFace library. However, I need to get the offsets from ElectraTokenizer, which, according to the docs, can be done in a straightforward way. Does anyone know how I can get them? Any help is appreciated.
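
If you can use the fast (Rust-backed) tokenizer, the offsets come from return_offsets_mapping=True; the slow pure-Python tokenizers do not support it. A minimal sketch, assuming ElectraTokenizerFast and the google/electra-small-discriminator checkpoint:
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")

text = "Offsets map each token back to character positions."
encoding = tokenizer(text, return_offsets_mapping=True)

# Special tokens such as [CLS] and [SEP] get the empty span (0, 0)
for token, (start, end) in zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"]),
                               encoding["offset_mapping"]):
    print(token, (start, end), repr(text[start:end]))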

How to load JPG, PDF files into HBase using Spark?

I have image files in HDFS and I need to load them into HBase. Can I use Spark to get this done instead of MapReduce? If so, how? Please suggest; I am new to the Hadoop ecosystem.
I have created an HBase table with a MOB column type and a threshold of 10 MB.
I am stuck on how to load the data using the shell command line.
After some research I found a couple of recommendations to use MapReduce, but they were not informative.
You can use Apache Tika along with sc.binaryFiles(filesPath). Tika supports many formats, out of which you need:
Image formats: The ImageParser class uses the standard javax.imageio
feature to extract simple metadata from image formats supported by the
Java platform. More complex image metadata is available through the
JpegParser and TiffParser classes, which use the metadata-extractor
library to support Exif metadata extraction from JPEG and TIFF
images.
and
Portable Document Format: The PDFParser class parses Portable Document
Format (PDF) documents using the Apache PDFBox library.
For example code with Spark, see my answer; another example answer given here by me shows how to load into HBase.
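
As a rough illustration (not the code from the linked answers), here is how sc.binaryFiles could feed an HBase MOB table from PySpark via the happybase client; the Thrift gateway host, the table name "images" and the column family "d" are assumptions:
import os
import happybase
from pyspark import SparkContext

sc = SparkContext(appName="binary-files-to-hbase")

# (file_path, file_bytes) pairs for every file under the directory
files = sc.binaryFiles("hdfs:///data/images")

def write_partition(records):
    # One HBase connection per partition, reused for all rows in it
    conn = happybase.Connection("hbase-thrift-host")  # assumed Thrift gateway
    table = conn.table("images")
    for path, content in records:
        row_key = os.path.basename(path).encode("utf-8")
        table.put(row_key, {b"d:content": bytes(content)})
    conn.close()

files.foreachPartition(write_partition)
Tika would only be needed if you also want to extract metadata (Exif, PDF properties) before storing the raw bytes.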

Spark decompression technique using Scala

I'm trying to use a compression technique in Spark.
files1.saveAsSequenceFile("/sequ1", Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
Now I want to read from the compressed file, so I need to decompress it.
How can I do that? Can anyone help me with this?
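
The codec is recorded in the sequence file's header, so Hadoop decompresses it transparently when the file is read back; there is no separate decompression step. A minimal sketch, shown in PySpark (the Scala sc.sequenceFile call is analogous) and assuming the key/value Writables can be converted automatically:
from pyspark import SparkContext

sc = SparkContext(appName="read-snappy-sequence-file")

# Same path used in saveAsSequenceFile above; Snappy is handled by the reader
pairs = sc.sequenceFile("/sequ1")
print(pairs.take(5))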

How to build a test environment (Linux, Spark, JupyterHub)

I am working on my thesis and have the opportunity to set up a working environment to test the functionality and see how it works.
The following points should be covered:
JupyterHub (within a private cloud)
pandas, numpy, SQL, nbconvert, nbviewer
get data into a DataFrame (CSV), analyze the data, store the data (RDD? HDF5? HDFS?) (see the sketch after this list)
Spark for future analysis
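
For the "get data into a DataFrame (CSV), analyze, store" point, a minimal PySpark sketch; the input path and the choice of Parquet on HDFS are assumptions, not requirements:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("thesis-test-env").getOrCreate()

df = (spark.read
      .option("header", True)        # first line contains column names
      .option("inferSchema", True)   # let Spark guess column types
      .csv("hdfs:///data/ekpo.csv")) # hypothetical path

df.printSchema()

# Columnar storage keeps the ~3 GB tables compact and fast to re-read for analysis
df.write.mode("overwrite").parquet("hdfs:///data/ekpo.parquet")

# For pandas/numpy work, pull a manageable subset to the driver
pdf = df.limit(100000).toPandas()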
The test scenario will consist of:
a multi-user environment with notebooks per user/topic
analyzing structured tables (RSEG, MSEG, EKPO) with several million rows in a 3-way match using pandas, numpy, Spark (Spark SQL), matplotlib, etc.; it is about 3 GB of data across those 3 tables
exporting notebooks with nbconvert/nbviewer to PDF, read-only notebooks and/or reveal.js
Can you please give me some hints or share your experience on how many nodes I should use for testing, and which Linux distribution is a good start?
I am sure there are many more questions; my problem is finding ways or information to evaluate the possible answers.
Thanks in advance!

Heatmap with Spark Streaming

I have just started using Spark Streaming and have done a few POCs. It is fairly easy to implement. I was thinking of presenting the data using some smart graphing and dashboarding tools, e.g. Graphite or Grafana, but they don't have heat maps. I also looked at Zeppelin, but was unable to find any heat-map functionality.
Could you please suggest any data visualization tools that support heat maps and work with Spark Streaming?
At Stratio we work all the time with heat maps that take their data from Spark. All you need is the combination of stratio-viewer http://www.stratio.com/datavis/kbase/article/map-widget/ and stratio-sparkta http://docs.stratio.com/modules/sparkta/development/about.html.
Disclaimer: I work for Stratio.
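
Independently of the specific tool, the streaming side usually only has to produce (x bucket, y bucket, count) cells that a heat-map widget can poll. A tool-agnostic sketch using Structured Streaming rather than the older DStream API, where the rate source and the hour/minute bucketing are stand-ins for real data:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-heatmap-cells").getOrCreate()

events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load())

# Bucket each event into a grid (hour of day vs. minute) and count events per cell
cells = (events
         .withColumn("hour", F.hour("timestamp"))
         .withColumn("minute", F.minute("timestamp"))
         .groupBy("hour", "minute")
         .count())

query = (cells.writeStream
         .outputMode("complete")
         .format("memory")            # queryable as a temp table the dashboard can poll
         .queryName("heatmap_cells")
         .start())

# A dashboard or notebook can then poll: spark.sql("SELECT * FROM heatmap_cells")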
