From a single machine to parallel processing? - apache-spark

I am new to Spark Streaming and Big Data in general. I am trying to understand the structure of a project in Spark. I want to create a main class, let's say "Driver", with M machines, where each machine keeps an array of counters and their values. On a single machine, outside of Spark, I would create a class for the machines and a class for the counters and do the computations I want. But I am wondering whether the same applies in Spark. Would the same project, written in Spark, have the structure I am quoting below?
import scala.collection.mutable.Queue

class Driver {
  var num: Int = 100
  var machines: Array[Machine] = new Array[Machine](num)
  // split the incoming DStream and fill the machines' queues
}

class Machine {
  var counters = new Queue[(Int, Int)]() // e.g. a counter with id 1 and value 25 would be (1, 25)
  def fillCounters: Unit = { ... }       // fills the counters queue
}

In general, you can picture a Spark application as having a driver part, which runs all coordination tasks and constructs the graph of operations to apply to your data (you will find mentions of the directed acyclic graph, or DAG, in the theoretical parts of tutorials on Spark and distributed computation), and an executor part, which results in many copies of your code being sent to each node of the cluster to run over the data.
The main idea is that the driver extracts the part of your application's code that needs to run locally with the data on the nodes, serializes it, sends it over the network to each executor, then launches it, manages it and collects the results.
The Spark framework hides these details for ease of use, so the application you develop looks like a single-threaded program.
A developer can separate the contexts that run on the driver and on the executors, but this is not very common in tutorials (again, for simplicity).
So, to answer the actual question above:
you do not need to design your application in the way you demonstrated above, unless you specifically want to.
Just follow the official Spark tutorial to get a viable solution, and split it afterwards by execution context.
There is a good post summarizing a lot of Spark tutorials, videos and talks; you can find it here on SO.
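To make the driver/executor split concrete, here is a minimal Spark Streaming sketch (the socket source, port and "counterId,value" input format are illustrative assumptions, not taken from the question). Everything outside the closures runs on the driver and only builds the DAG; the bodies of map and reduceByKey are serialized and executed on the executors.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Driver {
  def main(args: Array[String]): Unit = {
    // Runs on the driver: builds the execution plan, never touches the data itself.
    val conf = new SparkConf().setAppName("CounterExample")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Illustrative source: lines of the form "counterId,value".
    val lines = ssc.socketTextStream("localhost", 9999)

    // These closures are shipped to the executors and run on each partition of each batch.
    val counters = lines
      .map { line => val Array(id, value) = line.split(","); (id.toInt, value.toInt) }
      .reduceByKey(_ + _)

    counters.print() // results of each batch are brought back to the driver for printing

    ssc.start()
    ssc.awaitTermination()
  }
}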

Related

How to collect custom metrics from Spark based application

I have a Spark (Java) based application which scans a few tables in Hive and then, based on some condition, returns a list of tables/partitions that satisfy the condition.
I want to know how I can collect information about this scan.
For example: how much time did the application take to scan the different tables? How much memory was being used while performing the scan? etc.
I know I can use a simple stopwatch for the time measurement and print it to the log, but I do not want to print to logs. I want to push the metrics to a custom Kafka producer that I have created.
1) I looked into using Spark listeners and extending them: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener
But these are based on Spark-specific events (onJobStart, onJobEnd, etc.). I want to trigger my listener on business-specific events, for example onTableRead or onSpecificMethodCall.
2) https://github.com/groupon/spark-metrics
I had a look at this library and it is quite simple to use. But the problem I am facing is that my Spark application is Java based.
In the spark-metrics documentation, they mention that the metrics should be initialized as lazy vals to get correct data.
public List<String> getFilteredPartList() {
    List<String> filteredPartitions = utils.getPartNames(getTable().getDbName(), getTable().getTableName(),
            getConfiguration().getEnvCode(), partNames, getTableProperties(), getDataAccessManager(),
            true, true, false, true).getPartNames();
    return filteredPartitions;
}
I want to know the time taken to execute the above method and push it to my custom Kafka producer.
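For what it's worth, here is a minimal sketch of the stopwatch-plus-Kafka idea mentioned above (in Scala, like the other examples on this page; the topic name, the payload format and the timed helper are assumptions, and the producer calls are the standard Kafka Java client):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object MetricTimer {
  // Hypothetical helper: measures how long a block takes and pushes the result to Kafka.
  def timed[T](producer: KafkaProducer[String, String], metricName: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000
    // The "app-metrics" topic and "elapsed ms" payload are illustrative choices.
    producer.send(new ProducerRecord[String, String]("app-metrics", metricName, s"$elapsedMs ms"))
    result
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // val parts = timed(producer, "getFilteredPartList") { getFilteredPartList() }
    producer.close()
  }
}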

Asynchronous object sharing in Spark

I have a very basic understanding of Spark and I am trying to find something that can help me achieve the following:
Have a pool of objects shared over all the nodes, asynchronously.
What I am currently thinking is: let's say there are ten nodes, numbered 1 to 10.
If I have a single object, I will have to make access to it synchronized in order for it to be usable from any node. I do not want that.
The second option is that I can have a pool of, say, 10 objects.
I want to write my code in such a way that node number 1 always uses object number 1, node number 2 always uses object number 2, and so on.
A sample approach would be: before performing a task, get the thread ID and use object number (threadID % 10). That would result in a lot of collisions and would not work.
Is there a way that I can somehow get a node ID or process ID, and make my code fetch an object according to that ID? Or some other way to have an asynchronous pool of objects on my cluster?
I apologize if this sounds trivial; I am just getting started and cannot find many resources pertaining to my doubt online.
PS: I am using a Spark Streaming + Kafka + YARN setup, if it matters.
Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.
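A minimal sketch of the broadcast approach described above (the lookup table contents are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastExample"))

    // An immutable object that every executor should be able to read.
    val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")

    // One read-only copy is shipped to each executor, instead of once per task.
    val lookupBc = sc.broadcast(lookup)

    val data = sc.parallelize(Seq(1, 2, 3, 2, 1))
    val resolved = data.map(id => lookupBc.value.getOrElse(id, "unknown"))

    resolved.collect().foreach(println)
    sc.stop()
  }
}

Note that any local change an executor made to lookupBc.value would stay local to that executor, which is exactly why this only fits immutable objects.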

Storm-like structure in Apache Spark

You know how in Apache Storm you can have a Spout streaming data to multiple Bolts. Is there a way to do something similar in Apache Spark?
I basically want one program to read data from a Kafka queue and output it to 2 different programs, which can then process it in their own, different ways.
Specifically, there would be a reader program that would read data from the Kafka queue and output it to 2 programs x and y. x would process the data to calculate metrics of one kind (in my case it would calculate the user activities) whereas y would calculate metrics of another kind (in my case this would be checking activities based on different devices).
Can someone help me understand how this is possible in Spark?
Why don't you simply create two topologies?
Both topologies have a spout reading from the Kafka topic (yes, you can have multiple topologies reading from the same topic; I have this running on production systems). Make sure that you use different spout configs, otherwise Kafka/ZooKeeper will see both topologies as being the same. Have a look at the documentation here.
Spoutconfig is an extension of KafkaConfig that supports additional fields with ZooKeeper connection info and for controlling behavior specific to KafkaSpout. The Zkroot will be used as root to store your consumer's offset. The id should uniquely identify your spout.
public SpoutConfig(BrokerHosts hosts, String topic, String zkRoot, String id);
Implement program x in topology x and program y in topology y.
Another option would be to have two graphs of bolts subscribing to the same spout, but IMHO this is not optimal: failed tuples (which are likely to fail in only one of the graphs) would be replayed to both graphs even if they failed in only one of them, and therefore some Kafka messages would end up being processed twice. Using two separate topologies you avoid this.
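A rough sketch of the two-topology setup, written in Scala like the rest of this page and assuming the Storm 1.x storm-kafka module; the host, topic, zkRoot and id strings are placeholders. The only essential detail is that each topology gets its own zkRoot/id pair, so their offsets are tracked separately.

import org.apache.storm.kafka.{KafkaSpout, SpoutConfig, ZkHosts}
import org.apache.storm.topology.TopologyBuilder

object TopologyX {
  def main(args: Array[String]): Unit = {
    val hosts = new ZkHosts("zookeeper-host:2181") // placeholder ZooKeeper address

    // Unique zkRoot and id per topology, so the consumer offsets are stored independently.
    val spoutConfig = new SpoutConfig(hosts, "events-topic", "/kafka-spout-x", "topology-x-spout")

    val builder = new TopologyBuilder()
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig))
    // builder.setBolt("x-bolt", new XBolt()).shuffleGrouping("kafka-spout") // hypothetical bolt for program x
    // Topology y is identical except for its own zkRoot/id pair ("/kafka-spout-y", "topology-y-spout") and bolts.
  }
}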

Spark: Importing Data

I currently have a Spark app that reads a couple of files, forms data frames out of them, and applies some logic to those data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double-check that, since all machines on the cluster can access the files (which is a requirement of Spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app, but since it only shows which actions were performed by which machines, and since "sc.textFile(filePath)" is not an action, I couldn't be sure which machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the data frame to get an array, and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute it at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic as I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the master. After that it's just a plain array. Indeed the idea is that you should not use collect if you want to perform distributed computations.
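A small sketch covering the first and third points (the file path and partition count are placeholders): the read and the transformations stay distributed, and only the final collect pulls everything back to the driver.

import org.apache.spark.{SparkConf, SparkContext}

object ImportExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ImportExample"))

    // The read is distributed: each partition is read by whichever executor runs its task.
    // minPartitions is only a hint for the minimum number of splits.
    val lines = sc.textFile("hdfs:///data/trees.csv", minPartitions = 8)

    // Transformations also run on the executors, partition by partition.
    val lengths = lines.map(_.length)

    // Only collect() brings the data back to the driver as a local Array[Int];
    // from this point on, nothing is distributed any more.
    val local: Array[Int] = lengths.collect()
    println(local.take(5).mkString(", "))

    sc.stop()
  }
}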

Distributed Web crawling using Apache Spark - Is it Possible?

An interesting question was asked of me when I attended an interview about web mining. The question was: is it possible to crawl websites using Apache Spark?
I guessed that it was possible, because Spark supports distributed processing. After the interview I searched for this, but couldn't find any interesting answer. Is it possible with Spark?
Spark adds essentially no value to this task.
Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs you could use YARN, Mesos, etc. directly with less overhead.
Sure, you could do this on Spark, just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.
How about this way:
Your application would get a set of website URLs as input for your crawler. If you were implementing it as a normal (non-Spark) app, you might do it as follows:
split all the web pages to be crawled into a list of separate sites, each small enough to fit well in a single thread;
for example, if you have to crawl www.example.com/news from 20150301 to 20150401, the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401];
assign each base URL (e.g. www.example.com/news/20150401) to a single thread; it is in the threads where the actual data fetching happens;
save the result of each thread to the file system.
When the application becomes a Spark one, the same procedure happens but is encapsulated in Spark notions: we can write a custom CrawlRDD that does the same stuff:
Split the sites: def getPartitions: Array[Partition] is a good place to do the splitting.
Crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] will be spread to all the executors of your application and run in parallel.
Save the RDD to HDFS.
The final program looks like:
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[CrawlPartition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray[Partition]
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      var nextURL: String = _
      override def hasNext: Boolean = {
        // logic to find the next URL; if there is one, fill in nextURL and return true,
        // otherwise return false
      }
      override def next(): X = {
        // logic to crawl the web page at nextURL and return its content as X
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
YES.
Check out the open source project: Sparkler (spark - crawler) https://github.com/USCDataScience/sparkler
Check out Sparkler Internals for a flow/pipeline diagram. (Apologies, it is an SVG image, so I couldn't post it here.)
This project wasn't available when the question was posted; however, as of December 2016 it is a very active project!
Is it possible to crawl the Websites using Apache Spark?
The following pieces may help you understand why someone would ask such a question and also help you to answer it.
The creators of the Spark framework wrote in the seminal paper [1] that RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
RDDs are key components in Spark. However, you can create traditional map-reduce applications (with little or no abuse of RDDs).
There is a widely popular distributed web crawler called Nutch [2]. Nutch is built with Hadoop MapReduce (in fact, Hadoop MapReduce was extracted out of the Nutch codebase).
If you can do some task in Hadoop MapReduce, you can also do it with Apache Spark.
[1] http://dl.acm.org/citation.cfm?id=2228301
[2] http://nutch.apache.org/
PS:
I am a co-creator of Sparkler and a Committer and PMC member for Apache Nutch.
When I designed Sparkler, I created an RDD which is a proxy to Solr/Lucene-based indexed storage. It enabled our crawler-database RDD to make asynchronous fine-grained updates to shared state, which is otherwise not possible natively.
There is a project called SpookyStuff, which is a
Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark
Hope it helps!
I think the accepted answer is incorrect in one fundamental way; real-life large-scale web extraction is a pull process.
This is because requesting HTTP content is often a far less laborious task than building the response. I have built a small program which is able to crawl 16 million pages a day with four CPU cores and 3 GB of RAM, and it was not even optimized very well. For a server, a similar load (~200 requests per second) is not trivial and usually requires many layers of optimization.
Real websites can, for example, break their cache system if you crawl them too fast (instead of having the most popular pages in the cache, it can get flooded with the long-tail content of the crawl). So in that sense, a good web scraper always respects robots.txt etc.
The real benefit of a distributed crawler doesn't come from splitting the workload of one domain, but from splitting the workload of many domains into a single distributed process, so that the one process can confidently track how many requests the system puts through.
Of course, in some cases you want to be the bad boy and ignore the rules; however, in my experience such products don't stay alive long, since website owners like to protect their assets from things that look like DoS attacks.
Golang is very good for building web scrapers, since it has channels as a native data type and they support pull queues very well. Because the HTTP protocol, and scraping in general, is slow, you can include the extraction pipelines as part of the process, which lowers the amount of data to be stored in the data warehouse system. You can crawl one TB spending less than $1 worth of resources, and do it fast, when using Golang and Google Cloud (probably also doable with AWS and Azure).
Spark gives you no additional value. Using wget as a client is clever, since it automatically respects robots.txt properly: a parallel, domain-specific pull queue feeding wget is the way to go if you are working professionally.