How do I log data frames in spark? - apache-spark

I am relatively new to Spark. I need to find out whether there is a way to see which DataFrame is being accessed at what time. Can this be achieved with native Spark logging?
If so, how do I implement it?

The DAG Visualization and the Event Timeline are two very useful built-in Spark tools, available since Spark 1.4, that show which DataFrame/RDD is used and in which steps. See more details here - Understanding your Spark application through visualization

Related

Zeppelin with Spark Structured Streaming Example

I am trying to visualize Spark Structured Streaming output in Zeppelin. I can do it with the memory sink (spark.apache), but that is not a reliable solution for high data volumes. What would be a better solution?
An example implementation or demo would be helpful.
Thanks,
Rilwan
Thanks for asking the question! Having spent 2+ years developing Spark monitoring tools, I think I can help.
There are two types of processing available when data arrives in Spark as a stream.
Discretized Stream or DStream: In this mode, Spark gives you the data in RDD format and you have to write your own logic to handle the RDD.
Pros:
1. If you want to do some processing before saving the streaming data, an RDD is easier to work with than a DataFrame.
2. DStream gives you a nice Streaming UI that shows graphically how much data has been processed. Check this link - https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html#monitoring-applications
Cons:
1. Handling raw RDDs is not very convenient or easy.
Structured Streaming: In this mode, Spark gives you the data in DataFrame format, and you need to specify where to store/send the data.
Pros:
1. Structured Streaming comes with some predefined sources and sinks that are very common, and 95% of real-life scenarios can be solved by plugging these in. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Cons:
1. Structured Streaming originally had no Streaming UI :( (a Structured Streaming tab was only added to the web UI in Spark 3.0). You can still collect the metrics and build your own UI. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
You can also store the metrics in a plain-text file, read the file in Zeppelin through spark.read.json, and plot your own graph.
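As a rough illustration of the metrics-file idea, here is a minimal sketch in plain Python. It assumes the file holds one query-progress JSON object per line and uses the batchId, timestamp, numInputRows and processedRowsPerSecond fields from the Structured Streaming progress report. The sample lines and the helper names parse_progress and summarize are hypothetical, and the standard json module stands in for spark.read.json so the sketch runs without a cluster.

```python
import json
from statistics import mean

# Hypothetical sample: one progress report per line (assumed file layout).
SAMPLE_LINES = [
    '{"batchId": 0, "timestamp": "2018-01-01T00:00:00.000Z", '
    '"numInputRows": 100, "processedRowsPerSecond": 50.0}',
    '{"batchId": 1, "timestamp": "2018-01-01T00:00:10.000Z", '
    '"numInputRows": 300, "processedRowsPerSecond": 150.0}',
    '',  # blank lines are skipped
]


def parse_progress(lines):
    """Parse JSON lines and keep only the fields worth plotting."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        records.append({
            "batchId": obj["batchId"],
            "timestamp": obj["timestamp"],
            "numInputRows": obj.get("numInputRows", 0),
            "processedRowsPerSecond": obj.get("processedRowsPerSecond", 0.0),
        })
    return records


def summarize(records):
    """Aggregate per-batch metrics into totals for a quick overview."""
    if not records:
        return {"batches": 0, "totalInputRows": 0,
                "avgProcessedRowsPerSecond": 0.0}
    return {
        "batches": len(records),
        "totalInputRows": sum(r["numInputRows"] for r in records),
        "avgProcessedRowsPerSecond": mean(
            r["processedRowsPerSecond"] for r in records),
    }


records = parse_progress(SAMPLE_LINES)
summary = summarize(records)
print(summary)
```

In Zeppelin the parsing step would be replaced by spark.read.json on the metrics file, and the per-batch records would feed the built-in charts instead of print.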

How to use Flink and Spark together, with Spark just for transformation?

Let's say there is a collection "goods" in MongoDB like this:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
In the past, I used Spark to flatten it and save it to Hive:
goodsDF.select($"name",explode($"attribute"))
But now we need to handle incremental data.
For example, a new item appears as the third line the next day:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
{name:"C",attr:["location":"uk"],"eventTime":"2018-02-01"}
Some of our team think Flink is better at streaming, because Flink supports event-driven applications, streaming pipelines and batch, whereas Spark is just micro-batch.
So we are switching to Flink, but a lot of code has already been written in Spark, for example the "explode" above. So my question is:
Is it possible to use Flink to fetch from the source and save to the sink, but use Spark in the middle to transform the dataset?
If that is not possible, how about saving to a temporary sink, say some JSON files, and then having Spark read the files, transform them and save to Hive? I am afraid this makes no sense, though, because for Spark it is also incremental data. Using Flink and then Spark is the same as using Spark Structured Streaming directly.
No. Apache Spark code cannot be used in Flink without changing the code. The two are different processing frameworks, and their APIs and syntax differ from each other. The choice of framework should really be driven by the use case and not by generic statements like "Flink is better than Spark". A framework may work great for one use case and perform poorly in another. By the way, Spark is not just micro-batch. It has batch, streaming, graph, ML and other things. Since the complete use case is not described in the question, it is hard to suggest which one is better for this scenario. But if your use case can afford sub-second latency, then I would not waste my time moving to another framework.
Also, if things are dynamic and you anticipate that the processing framework may change in the future, it would be better to use something like Apache Beam, which provides an abstraction over most of the processing engines. Using the Apache Beam processing APIs gives you the flexibility to change the underlying processing engine at any time. Here is the link to read more about Beam - https://beam.apache.org/.
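Whichever engine is chosen, the incremental part of the question comes down to flattening only the records that are newer than the last one already processed. The following is a minimal engine-independent sketch in plain Python, not Spark or Flink code. It assumes attr is a key/value map and that eventTime is an ISO date string, and the helper names explode_attrs and incremental_rows are hypothetical.

```python
def explode_attrs(record):
    """Return one (name, key, value) row per attribute,
    similar in spirit to explode on a map column."""
    return [(record["name"], key, value)
            for key, value in record["attr"].items()]


def incremental_rows(records, last_event_time):
    """Flatten only records newer than last_event_time.

    Returns the new rows and the updated high-water mark."""
    rows = []
    newest = last_event_time
    for rec in records:
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if rec["eventTime"] > last_event_time:
            rows.extend(explode_attrs(rec))
            newest = max(newest, rec["eventTime"])
    return rows, newest


# Sample data modelled on the question; attr is assumed to be a map.
goods = [
    {"name": "A", "attr": {"location": "us"}, "eventTime": "2018-01-01"},
    {"name": "B", "attr": {"brand": "nike"}, "eventTime": "2018-01-01"},
    {"name": "C", "attr": {"location": "uk"}, "eventTime": "2018-02-01"},
]

# The first two records were handled by the previous run.
rows, mark = incremental_rows(goods, "2018-01-01")
print(rows, mark)
```

The strict comparison skips late records that share the previous high-water mark, so a real pipeline would need a finer-grained timestamp or a unique key to avoid missing data.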

kafka streaming or spark streaming

I am currently using Kafka from Python.
I was wondering whether Spark with Kafka is needed, or whether we can just use Kafka through pyKafka.
My concern is that Spark (pyspark) adds overhead to the process, and that if we don't use any Spark functions, only Kafka streaming is required.
What are the drawbacks of using PySpark with Kafka?
It totally depends on the use case at hand, as mentioned in the comments. However, I was in the same situation a couple of months ago, so I will try to share what I learned and how I decided to move to kafka-streams instead of spark-streaming.
In my use case, we only used Spark for real-time streaming from Kafka and did not do any sort of map-reduce, windowing, filtering or aggregation.
Given the above, I did the comparison along three dimensions:
Technicality
DevOps
Cost
The image below shows the comparison table I made to convince my team to migrate to kafka-streams and drop Spark. Cost is not included in the image, as it depends entirely on your cluster size (HeadNode-WorkerNodes).
Very important note:
Again, this is based on your case. I just tried to give you a pointer on how to do the comparison. Spark itself has lots of benefits, but describing them is out of scope for this question.

how to display output of spark java application in UI

I have a Spark Java application for log mining. Currently I read the results from the Spark output files and display them in an Excel sheet, but I want a better UI. Can somebody help me build a better UI that makes it easier to analyze the Spark output? Graphs and table views would be helpful.
One option is to expose the Spark data via JDBC/ODBC, as described in:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
You can then write an application on a platform of your choice.

Storm and spark

I want to check whether it is a good idea to invoke Spark code from a Storm bolt. We have a stream-based system in Storm. For each message we would like to do some ML, and we are thinking of using Spark for that. So I wanted to check whether this is a good idea. Are there any runtime issues we might encounter?
Thanks
ap
If you already have a system in place with Storm, then why do you want to use Spark?
IMHO Spark and Storm are different beasts. You may want to run them in parallel for the same or different use cases, but do not tightly integrate them with each other.
What do you mean by ML per message? ML on a single message doesn't make much sense. Do you mean ML on a stream? Sure, you can do it with Spark, but then you need to either use Spark Streaming (and you have two streaming architectures...) or save the data somewhere and do batch ML with Spark.
Why not use trident-ml instead?
