I'm exploring file storage format options for Python and stumbled on Feather. I noticed the last release was back in 2017 and was concerned about its long-term viability.
Web searches pull up posts that all seem to stop around 2017.
The Feather format is still relevant, and support for more data types, especially on the R side, has improved a lot recently. A notable change is that it is no longer released as a separate package but ships as part of Apache Arrow (https://arrow.apache.org/), where it is actively developed.
The other format the community is leaning towards is Apache Parquet. There are some differences between Feather and Parquet that may lead you to choose one over the other: for example, Feather writes the data as-is, while Parquet encodes and compresses it to achieve much smaller files. Additionally, Parquet is also available in the Java world, which might come in handy. In R both formats are available through the arrow library, and in Python as part of pyarrow.
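For illustration, a minimal sketch of writing the same table in both formats with pyarrow (the data and file names here are arbitrary):

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as parquet

    # a small example table; contents are arbitrary
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    feather.write_feather(table, "data.feather")  # written (mostly) as-is
    parquet.write_table(table, "data.parquet")    # encoded and compressed

    round_tripped = feather.read_table("data.feather")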
I have image files in HDFS and I need to load them into HBase. Can I use Spark to get this done instead of MapReduce? If so, how? Please suggest an approach; I am new to the Hadoop ecosystem.
I have created an HBase table with MOB columns and a threshold of 10 MB.
I am stuck on how to load the data from the shell/command line.
After some research I found a couple of recommendations to use MapReduce, but they were not very informative.
You can use Apache Tika along with sc.binaryFiles(filesPath). Of the formats supported by Tika, the ones you need are:
Image formats: The ImageParser class uses the standard javax.imageio
feature to extract simple metadata from image formats supported by the
Java platform. More complex image metadata is available through the
JpegParser and TiffParser classes, which use the metadata-extractor
library to support Exif metadata extraction from JPEG and TIFF
images.
and
Portable Document Format: The PDFParser class parses Portable Document
Format (PDF) documents using the Apache PDFBox library.
Example code with Spark can be found in my other answer, and I have given another example answer here for loading into HBase.
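As a rough illustration (this is my own sketch, not the code from those answers), a minimal PySpark job that reads image bytes from HDFS and writes them into a MOB-enabled HBase table via the happybase client; the host, table, and column family names are assumptions:

    from pyspark import SparkContext

    sc = SparkContext(appName="ImagesToHBase")

    def write_partition(records):
        # one HBase connection per partition; host name is an assumption
        import happybase
        conn = happybase.Connection("hbase-host")
        table = conn.table("images")  # the MOB threshold is configured on the table itself
        for path, data in records:
            # row key = file path, raw image bytes stored under cf:data
            table.put(path, {b"cf:data": data})
        conn.close()

    # sc.binaryFiles yields (path, bytes) pairs for every file under the directory
    images = sc.binaryFiles("hdfs:///user/me/images")
    images.foreachPartition(write_partition)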
I have just started using Spark Streaming and have done a few POCs. It is fairly easy to implement. I was thinking of presenting the data using some smart graphing and dashboarding tools, e.g. Graphite or Grafana, but they don't have heat-maps. I also looked at Zeppelin, but was unable to find any heat-map functionality.
Could you please suggest any data visualization tools that offer heat-maps and work with Spark Streaming?
At Stratio we work all the time with heat-maps that take their data from Spark. All you need is the combination of stratio-viewer (http://www.stratio.com/datavis/kbase/article/map-widget/) and stratio-sparkta (http://docs.stratio.com/modules/sparkta/development/about.html).
Disclaimer: I work for Stratio
I'm working with an SQLite3 database on a Debian machine. The information stored in the database comes from several sensors recording temperature, humidity, etc., so it would be nice to present that data graphically (curves rather than bars). Web-based, of course, would be awesome.
I'm looking for the best way to do it. All the information I can find about drawing graphs from SQLite refers to Android, but I'm looking for a Linux solution.
Anybody?
Thanks in advance.
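A minimal sketch of one common approach: a small Flask app that queries SQLite and serves a matplotlib line chart as a PNG (the database file, table, and column names here are assumptions):

    import io
    import sqlite3

    from flask import Flask, send_file
    import matplotlib
    matplotlib.use("Agg")  # headless backend, suitable for a server
    import matplotlib.pyplot as plt

    app = Flask(__name__)

    @app.route("/temperature.png")
    def temperature_png():
        # table and column names are assumptions
        conn = sqlite3.connect("sensors.db")
        rows = conn.execute(
            "SELECT timestamp, temperature FROM readings ORDER BY timestamp"
        ).fetchall()
        conn.close()
        xs, ys = zip(*rows)

        fig, ax = plt.subplots()
        ax.plot(xs, ys)  # a curve, not bars
        ax.set_xlabel("time")
        ax.set_ylabel("temperature")

        buf = io.BytesIO()
        fig.savefig(buf, format="png")
        plt.close(fig)
        buf.seek(0)
        return send_file(buf, mimetype="image/png")

    if __name__ == "__main__":
        app.run()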
Sometimes it might be useful, but mostly it just looks cool or impressive, to visualize log files (anything from HTTP requests and bandwidth usage to cups of coffee drunk per day).
I know about Visitorville, which I think looks a bit silly, and then there's gltail.
How do you "visualize" your log files in realtime?
There is also the logstalgia tool, which visualizes Apache logs. See http://code.google.com/p/logstalgia/ for more details and a YouTube video.
You may take a look at Apache Chainsaw. This nifty tool accepts log input from nearly everywhere and has live filtering and coloring. If you have an already-written log, I'm not sure whether it can read it; it's been a while since I last used it (it was very useful during the prototyping phase of our JBoss server).
Google has released the Visualization API that is probably flexible enough to help you:
The Google Visualization API lets you access multiple sources of structured data that you can display, choosing from a large selection of visualizations. The Google Visualization API also provides a platform that can be used to create, share and reuse visualizations written by the developer community at large.
It requires some JavaScript knowledge and includes Google Docs and Spreadsheet integration. Check out the Gallery for some examples.
You could take a look at http://www.intalisys.com, a 3D real-time visualization app.
We use Awk and Perl scripts to parse the log files and create summary reports and "databases" (databases in the sense that each row corresponds to a unique event with many columns of data about that event, though not stored in a traditional database format; we're moving in that direction). I like Awk because you can very quickly search for specific strings in the log files with regexes, keep counters and gather data from the log entries, and do all kinds of calculations with that data. Then use your favorite plotting software. We use Excel, mainly because that's what was here before I started this job. I prefer MATLAB and its open-source cousin Octave, which is built on gnuplot.
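A rough Python equivalent of that Awk approach, as a sketch (the log path and the regex are illustrative assumptions):

    import re
    from collections import Counter

    counts = Counter()
    # capture the date-and-hour prefix of lines containing ERROR; pattern is an assumption
    pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}.*ERROR")

    with open("app.log") as fh:
        for line in fh:
            m = pattern.match(line)
            if m:
                counts[m.group(1)] += 1  # bucket events by hour

    # print a simple summary; feed this to your favorite plotting software
    for hour in sorted(counts):
        print(hour, counts[hour])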
I prefer Sawmill for visualizing data. You can throw basically any log file at it, and it will not only autodetect its structure but also decide how to analyze it. Even if you have a custom log file, you can still define what should be analyzed and visualized, and how.
I mainly use R to visualize data, but I've heard of Orange, too.
Not sure if it fits the question, but I just released this:
numStepCsvLogVis - analyze logfile data in CSV format
It uses Python's matplotlib and was motivated by the need to visualize syslog data while debugging kernel circular-buffer operation (and variables) in C; it visualizes by using the CSV file format as an intermediary to the logfile data (I can't explain it better briefly; take a look at the README for more detail).
It has a "step" player accessed from the terminal and can handle "live" stdin input, but unfortunately I cannot get a better response than 1 FPS when the plot renders, so I wouldn't really call it "realtime" per se; but you can use it to eventually generate sonified videos of plot animations.
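The underlying idea, as a minimal sketch (the column names are assumptions, not taken from the tool):

    import csv
    import matplotlib.pyplot as plt

    steps, values = [], []
    with open("log.csv") as fh:
        for row in csv.DictReader(fh):
            steps.append(int(row["step"]))      # assumed column name
            values.append(float(row["value"]))  # assumed column name

    plt.plot(steps, values)
    plt.xlabel("step")
    plt.ylabel("value")
    plt.show()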
A simple solution is to use Logstalgia alongside the lightweight local-web-server.
First install the above. Then, from the root folder of your site, visualise your logs in realtime with:
$ ws --log-format default | logstalgia -
Use SciTE, Notepad++, or another powerful text editor that has file-processing routines, so you can create a script that colorizes parts of the log or simply deletes the unimportant lines from it.
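In the same spirit, a stand-in sketch in Python: a stdin filter that drops uninteresting lines and colors the rest with ANSI escapes (the filter terms are assumptions):

    import sys

    SKIP = ("DEBUG", "heartbeat")  # lines to drop; terms are illustrative
    RED = "\033[31m"
    RESET = "\033[0m"

    for line in sys.stdin:
        if any(term in line for term in SKIP):
            continue  # delete non-important lines
        if "ERROR" in line:
            line = RED + line.rstrip("\n") + RESET + "\n"  # highlight errors
        sys.stdout.write(line)

Run it against a live log with something like: tail -f app.log | python filter.py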