I have a Spark Java application for log mining. Currently I read the Spark output files and display the results in an Excel sheet, but I want a better UI. Can somebody help me build a better UI so the Spark output is easier to analyze? It would be helpful if I could add graphs and table views.
One option is exposing Spark data via JDBC/ODBC as described in:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
That way you can write an application for a platform of your choice.
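For example, here is a minimal sketch, assuming the Thrift JDBC/ODBC server from that guide has been started (./sbin/start-thriftserver.sh) and your log-mining results are saved as a table; the host, the table name (log_summary), and the PyHive client are my assumptions, not part of your setup:

from pyhive import hive

# connect to the Spark Thrift server (hypothetical host; default port 10000)
conn = hive.connect(host="spark-thrift-host", port=10000)
cursor = conn.cursor()

# query the log-mining results and feed the rows into your charts/table views
cursor.execute("SELECT level, COUNT(*) AS hits FROM log_summary GROUP BY level")
for level, hits in cursor.fetchall():
    print(level, hits)

Any BI or charting front end that speaks JDBC/ODBC can connect the same way, which is usually easier than re-parsing the output files.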
My Environment:
Databricks 10.4
Pyspark
I'm looking into Spark performance, specifically at the memory/disk spills that are shown in the Spark UI's Stages section.
What I want to achieve is to get notified if my job had spills.
I have found something below but I'm not sure how it works:
https://spark.apache.org/docs/3.1.3/api/java/org/apache/spark/SpillListener.html
I want a smart way to find where the major spills are, rather than going through all the jobs/stages manually.
Ideally, I want to detect spills programmatically using PySpark.
You can use the SpillListener class as shown below:
spillListener = spark._jvm.org.apache.spark.SpillListener()
print(spillListener.numSpilledStages())
If you need more details, you have to extend that class and override its methods.
But I think we can't add custom listeners directly in PySpark; we have to do it via Scala. Refer to this.
You can refer to this page to see how a SpillListener can be implemented in Scala.
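That said, since the snippet above already reaches the JVM class through py4j, here is a minimal end-to-end sketch of the same idea; whether registering the listener this way works on Databricks 10.4 is my assumption, not something I have verified:

# instantiate Spark's built-in SpillListener on the JVM side via py4j
spill_listener = spark._jvm.org.apache.spark.SpillListener()

# register it so it actually receives stage/task events (assumption: this
# py4j route to addSparkListener is available in your environment)
spark.sparkContext._jsc.sc().addSparkListener(spill_listener)

# ... run the job you want to inspect ...

spilled = spill_listener.numSpilledStages()
if spilled > 0:
    print("Warning: {} stage(s) spilled to memory/disk".format(spilled))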
For Spark Structured Streaming we have written a custom source reader to read data from our "custom source". To write it, I followed the example of the Kafka custom source code in the Spark codebase itself.
We also have a requirement to write our output to a custom location. For that, I saw that Spark provides "foreach" and "foreachBatch" (foreachBatch from Spark 2.4.0 onwards). Using either of these looks like a very simple implementation, and at first glance I feel most of my custom sink requirements can be met.
But when I looked at the Kafka code in Spark, I saw that Kafka uses "StreamSinkProvider" (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-StreamSinkProvider.html) instead of "foreach" or "foreachBatch".
Now I want to find out the pros/cons of using either of these options (in terms of performance, flexibility, etc.). Is one better than the other? Does anyone have experience with either option and how it holds up in an actual use case?
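For reference, the foreachBatch version of what I have in mind looks roughly like this (a simplified sketch: the parquet format and the output path are placeholders for our actual custom writer, and df is assumed to be a streaming DataFrame):

def write_to_custom_location(batch_df, batch_id):
    # each micro-batch arrives as a regular DataFrame, so any batch writer works here
    (batch_df.write
        .format("parquet")                 # placeholder for our real custom writer
        .mode("append")
        .save("/tmp/custom-sink/output"))  # hypothetical output location

query = (df.writeStream
    .foreachBatch(write_to_custom_location)
    .start())

query.awaitTermination()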
I am trying to visualize Spark structured streams in Zeppelin. I am able to achieve this using the memory sink (see the spark.apache.org docs), but it is not a reliable solution for high data volumes. What would be a better solution?
An example implementation or demo would be helpful.
Thanks,
Rilwan
Thanks for asking the question!! Having 2+ years of experience developing Spark monitoring tools, I think I will be able to resolve your doubt!!
There are two ways of processing data that arrives in Spark as a stream.
Discretized Stream (DStream): In this mode, Spark gives you the data as RDDs and you have to write your own logic to handle each RDD.
Pros:
1. If you want to do some processing before saving the streaming data, an RDD is the best way to handle it compared to a DataFrame.
2. DStreams come with a nice Streaming UI that graphically shows how much data has been processed. Check this link - https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html#monitoring-applications
Cons:
1. Handling raw RDDs is not so convenient and easy.
Structured Streaming: In this mode, Spark gives you the data as a DataFrame; you only need to specify where to store/send the data.
Pros:
1. Structured Streaming comes with predefined sources and sinks for the most common cases; 95% of real-life scenarios can be handled by plugging these in. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Cons:
1. There is no Streaming UI available with Structured Streaming :( Although you can get the metrics and create your own UI. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
You can also store the metrics in a plain-text file, read the file in Zeppelin through spark.read.json, and plot your own graph.
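For example, a minimal sketch of that approach, assuming you already have a running StreamingQuery called query and using a hypothetical metrics file under /tmp:

import json

# query.lastProgress is a dict with metrics such as numInputRows,
# inputRowsPerSecond and processedRowsPerSecond for the latest micro-batch
# (it is None until the first batch completes); call this periodically to
# collect a history of progress records
with open("/tmp/stream-metrics.json", "a") as f:
    f.write(json.dumps(query.lastProgress) + "\n")

# in a Zeppelin paragraph: load the collected metrics and plot them
metrics = spark.read.json("/tmp/stream-metrics.json")
metrics.select("timestamp", "numInputRows", "processedRowsPerSecond").show()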
I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data into Cassandra.
My requirement is to read data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SSTables
3. Copy Command
4. Java Batch
5. Dataflow (Google Cloud product)
Are there any other options, and which one is best?
Thanks,
For flat files you have the two most effective options:
Use Spark - it will load data in parallel, but requires some coding (see the sketch after this list).
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON, and is very effective. DataStax's Academy blog just started a series of blog posts on DSBulk, and the first post will give you enough information to get started. Also, if you have big files, consider splitting them into smaller ones, as this will allow DSBulk to perform a parallel load using all available threads.
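To give an idea of the Spark route for flat files, here is a rough sketch; the input path, keyspace and table names are hypothetical, and it assumes the spark-cassandra-connector is on the classpath:

# read the flat files in parallel as a DataFrame
csv_df = (spark.read
    .option("header", "true")
    .csv("/data/flatfiles/*.csv"))           # hypothetical input path

# write to DSE/Cassandra through the spark-cassandra-connector
(csv_df.write
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "my_keyspace")       # hypothetical keyspace
    .option("table", "my_table")             # hypothetical table
    .mode("append")
    .save())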
For loading data from an RDBMS, it depends on what you want to do - load the data once, or keep updating it as it changes in the DB. For the first option you can use Spark with the JDBC source (but it has some limitations too), and then save the data into DSE. For the second, you may need something like Debezium, which supports streaming change data from some databases into Kafka. And then from Kafka you can use the DataStax Kafka Connector to submit the data into DSE.
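For the one-off RDBMS load via Spark's JDBC source, the read side could look like the sketch below (the connection details are hypothetical); the write to DSE is then the same as in the flat-file sketch above:

jdbc_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")  # hypothetical RDBMS
    .option("dbtable", "public.events")                    # hypothetical table
    .option("user", "loader")
    .option("password", "...")
    .load())

# then write jdbc_df to DSE exactly as in the flat-file sketch above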
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I wouldn't recommend using it.
And never use CQL BATCH for data loading unless you know how it works - it's very different from the RDBMS world, and if used incorrectly it will actually make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)
I am relatively new to Spark. However, I need to find out whether there is a way to see which DataFrame is being accessed at what time. Can this be achieved with native Spark logging?
If so, how do I implement this?
The DAG Visualization and Event Timeline are two very important built-in Spark tools, available from Spark 1.4, that you can use to see which DataFrame/RDD is used and in which steps. See more details here - Understanding your Spark application through visualization
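As a small add-on (my own suggestion, not something from that article): you can tag the jobs each DataFrame triggers so they are easy to pick out in the DAG Visualization and Event Timeline; df_logs and df_users below are hypothetical DataFrames from your application:

sc = spark.sparkContext

# every action triggered after this call is grouped under this id/description in the UI
sc.setJobGroup("logs", "aggregating df_logs")
df_logs.groupBy("level").count().collect()

sc.setJobGroup("join", "joining df_logs with df_users")
df_logs.join(df_users, "user_id").count()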