Heatmap with Spark Streaming - apache-spark

I have just started using Spark Streaming and have done a few POCs. It is fairly easy to implement. I was thinking of presenting the data using some smart graphing & dashboarding tools, e.g. Graphite or Grafana, but they don't have heat-maps. I also looked at Zeppelin, but was unable to find any heat-map functionality.
Could you please suggest any data visualization tools that support heat-maps and work with Spark Streaming?

At Stratio we work all the time with heatmaps that take their data from Spark. All you need is the combination of stratio-viewer http://www.stratio.com/datavis/kbase/article/map-widget/ and stratio-sparkta http://docs.stratio.com/modules/sparkta/development/about.html.
Disclaimer: I work for Stratio

Related

custom stream sink in pyspark

I'm trying to follow the code in this link. However, the code is in Scala. I want to know if there is an equivalent of StreamSinkProvider in PySpark, or if there is another way to build a custom stream sink in PySpark, because I am not comfortable with Scala.
Thank you in advance

How to load JPG, PDF files to HBase using Spark?

I have image files in HDFS and I need to load them into HBase. Can I use Spark to get this done instead of MapReduce? If so, how? Please suggest; I am new to the Hadoop ecosystem.
I have created an HBase table with MOB columns and a threshold of 10 MB.
I am stuck on how to load the data from the shell/command line.
After some research I found a couple of recommendations to use MapReduce, but they were not informative.
You can use Apache Tika along with sc.binaryFiles(filesPath). Tika supports many formats, out of which you need:
Image formats: The ImageParser class uses the standard javax.imageio feature to extract simple metadata from image formats supported by the Java platform. More complex image metadata is available through the JpegParser and TiffParser classes, which use the metadata-extractor library to support Exif metadata extraction from Jpeg and Tiff images.
and
Portable Document Format: The PDFParser class parses Portable Document Format (PDF) documents using the Apache PDFBox library.
For example code with Spark, see my answer; another example answer I have given here shows how to load into HBase.
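As a rough illustration of the Tika + sc.binaryFiles approach (not the linked answers), here is a minimal sketch that reads binary files, detects the MIME type with Tika, and writes the raw bytes into HBase. The table name "files", the column family "cf" and the qualifiers are assumptions; adapt them to your MOB-enabled table.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.tika.Tika

// Read every file under the directory as (path, binary content) pairs.
val files = sc.binaryFiles("hdfs:///data/raw-files")   // path is illustrative

files.foreachPartition { partition =>
  // One HBase connection and one Tika instance per partition, not per record.
  val hbaseConf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(hbaseConf)
  val table = connection.getTable(TableName.valueOf("files"))   // assumed table name
  val tika = new Tika()

  partition.foreach { case (path, stream) =>
    val bytes = stream.toArray()                  // whole file as a byte array
    val mimeType = tika.detect(bytes)             // e.g. image/jpeg, application/pdf
    val put = new Put(Bytes.toBytes(path))        // row key = file path
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("data"), bytes)
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("mime"), Bytes.toBytes(mimeType))
    table.put(put)
  }

  table.close()
  connection.close()
}

Buffering the Puts and writing them in batches would cut down round trips to HBase, but the per-record put keeps the sketch simple.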

Spark decompression technique using Scala

I'm trying to use a compression technique in Spark.
files1.saveAsSequenceFile("/sequ1", Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
Now I want to read from the compressed file, so I need to decompress it.
How can I do that? Can anyone help me with this?
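For what it's worth, reading a Snappy-compressed SequenceFile usually needs no explicit decompression step: Hadoop stores the codec in the file header and decompresses transparently when the file is read back. A minimal sketch, assuming files1 was a pair RDD of (String, String):

// Written as in the question: a Snappy-compressed SequenceFile.
files1.saveAsSequenceFile("/sequ1", Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))

// Read it back; the codec recorded in the SequenceFile header is applied automatically.
val restored = sc.sequenceFile[String, String]("/sequ1")
restored.take(5).foreach(println)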

What do 3G & 4G of Big Data mean, and what is the difference?

I've read a page about the comparison between Apache Spark and Apache Flink.
I don't know what the 3G & 4G of Big Data mean.
Please explain to me!
It means 3rd Generation and 4th Generation. There are many publications and websites that use these 3G or 4G terms to highlight or denigrate some technology by assigning it a certain "generation". Each tool has things for and against it according to the problem you are facing. From Hadoop to Flink (and there are many more: Samza, Spark, Storm ...), each has brought something new to the world of Big Data:
Calculation on huge volumes of data
Easy to use
Support for efficient iterative calculation
Unification of batch and streaming APIs
Support for CEP
Full streaming processing
Complete compatibility with the hadoop ecosystem
Exactly-once processing guarantees
...
What others have recommended is true: you should not use these 3G or 4G labels as criteria to select a technology. You must study your problem fully and know the technologies and tools available, or at least have them classified according to their philosophy and use case. Something old but illustrative is this book.
You will form an idea and classify each one according to your own criteria :)
One thing is true: each tool arrives sooner or later, and each stands out because it brings a different or more appropriate approach to certain problems.

how to build a test environment (Linux, Spark, JupyterHub)

I am working on my thesis, and I have the opportunity to set up a working environment to test the functionality and see how it works.
The following points should be covered:
jupyterhub (within a private cloud)
pandas, numpy, sql, nbconvert, nbviewer
get data into a DataFrame (CSV), analyze the data, store the data (RDD?, HDF5?, HDFS?); see the sketch at the end of this question
spark for future analysis
The test scenario will consist of:
multiple user environment with notebooks for Users/Topics
analyze structured tables (RSEG, MSEG, EKPO) with several million lines in a 3-way match with pandas, numpy, Spark (Spark SQL), matplotlib... it's about 3 GB of data in those 3 tables.
export notebooks with nbconvert/nbviewer to PDF, read-only notebooks and/or reveal.js
Can you please give me some hints or share your experience on how many nodes I should use for testing and which Linux distribution is a good start?
I am sure there are many more questions; I have trouble finding ways or information to evaluate the possible answers.
thanks in advance!
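Regarding the "get data into a DataFrame (CSV) and store it" point, below is a minimal Spark (Scala) sketch of one of the options mentioned; the paths, the EKPO file name, and the choice of Parquet on HDFS are illustrative assumptions, not part of the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("three-way-match-poc")   // assumed application name
  .getOrCreate()

// Load one exported table (e.g. EKPO) from CSV into a DataFrame.
val ekpo = spark.read
  .option("header", "true")         // first line holds the column names
  .option("inferSchema", "true")    // let Spark guess the column types
  .csv("hdfs:///thesis/raw/ekpo.csv")

// Register it for Spark SQL analysis and store it on HDFS as Parquet for reuse.
ekpo.createOrReplaceTempView("ekpo")
ekpo.write.mode("overwrite").parquet("hdfs:///thesis/parquet/ekpo")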
