Distributed Web crawling using Apache Spark - Is it Possible?

An interesting question was put to me in an interview about web mining: is it possible to crawl websites using Apache Spark?
I guessed that it was possible, because Spark supports distributed processing. After the interview I searched for this, but couldn't find any convincing answer. Is it possible with Spark?

Spark adds essentially no value to this task.
Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs you could use YARN, Mesos, etc. directly with less overhead.
Sure, you could do this on Spark. Just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.

How about this way:
Your application would get a set of website URLs as input for the crawler. If you were implementing just a normal (non-Spark) app, you might do it as follows (a minimal sketch is given after these steps):
Split all the web pages to be crawled into a list of separate sites, each small enough to be handled well by a single thread.
For example: you have to crawl www.example.com/news from 20150301 to 20150401; the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
Assign each base URL (e.g. www.example.com/news/20150401) to a single thread; the threads are where the real data fetching happens.
Save the result of each thread into the file system.
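A minimal single-machine sketch of those three steps, assuming a hypothetical fetch(url) helper that downloads one page and an output directory under /tmp; error handling, retries and politeness are omitted:

import java.nio.file.{Files, Paths}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration.Duration

object SimpleThreadedCrawler {
  // hypothetical helper: downloads one page and returns its content
  def fetch(url: String): String = scala.io.Source.fromURL(url).mkString

  def main(args: Array[String]): Unit = {
    // 1. split the crawl into per-day base URLs
    val baseUrls = (1 to 31).map(d => f"http://www.example.com/news/201503$d%02d")

    // 2. assign each base URL to a thread from a fixed-size pool; fetching happens there
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
    val crawled = Future.traverse(baseUrls)(url => Future(url -> fetch(url)))

    // 3. save each thread's result to the file system
    Files.createDirectories(Paths.get("/tmp/crawl"))
    Await.result(crawled, Duration.Inf).foreach { case (url, content) =>
      val name = url.replaceAll("[^A-Za-z0-9]", "_")
      Files.write(Paths.get(s"/tmp/crawl/$name.html"), content.getBytes("UTF-8"))
    }
    ec.shutdown()
  }
}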
When the application becomes a Spark one, the same procedure applies but is expressed in Spark terms: we can write a custom CrawlRDD that does the same work:
Split the sites: def getPartitions: Array[Partition] is a good place to do the splitting.
Crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] is spread across all the executors of your application and runs in parallel.
Save the RDD into HDFS.
The final program looks like:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// The page content type (X in the description above) is String here.
class CrawlPartition(rddId: Int, override val index: Int, val baseURL: String) extends Partition

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[String] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[String] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL under baseUrl; if there is one, fill in nextURL
        // and return true, otherwise return false
        false
      }

      override def next(): String = {
        // logic to crawl the page at nextURL and return its content
        ""
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}

YES.
Check out the open source project Sparkler (Spark crawler): https://github.com/USCDataScience/sparkler
Check out the Sparkler Internals page for a flow/pipeline diagram. (Apologies, it is an SVG image and I couldn't post it here.)
This project wasn't available when the question was posted; however, as of December 2016 it is one of the very active projects!
Is it possible to crawl websites using Apache Spark?
The following pieces may help you understand why someone would ask such a question and also help you to answer it.
The creators of the Spark framework wrote in the seminal paper [1] that RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
RDDs are key components in Spark. However, you can create traditional map-reduce applications (with little or no abuse of RDDs).
There is a widely popular distributed web crawler called Nutch [2]. Nutch is built on Hadoop MapReduce (in fact, Hadoop MapReduce was extracted out of the Nutch codebase).
If you can do a task in Hadoop MapReduce, you can also do it with Apache Spark.
[1] http://dl.acm.org/citation.cfm?id=2228301
[2] http://nutch.apache.org/
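For illustration only, a minimal sketch of that MapReduce-style approach in Spark: distribute a seed list of URLs, fetch each one in a map, and write the results out. The fetch helper, seed URLs and output path are placeholders, not part of Nutch, Sparkler or the paper cited above.

import org.apache.spark.{SparkConf, SparkContext}

object MapReduceStyleCrawl {
  // placeholder fetch: downloads one page (no politeness, retries or robots.txt handling)
  def fetch(url: String): String = scala.io.Source.fromURL(url).mkString

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapReduceStyleCrawl"))
    val seeds = Seq(
      "http://www.example.com/news/20150301",
      "http://www.example.com/news/20150302"
    )
    sc.parallelize(seeds, numSlices = 2)      // one partition per fetch task
      .map(url => url + "\t" + fetch(url))    // the fetch runs on the executors
      .saveAsTextFile("hdfs:///crawl/output") // persist the fetched content
    sc.stop()
  }
}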
PS:
I am a co-creator of Sparkler and a committer and PMC member for Apache Nutch.
When I designed Sparkler, I created an RDD which is a proxy to Solr/Lucene-based indexed storage. It enabled our crawler-database RDD to make asynchronous fine-grained updates to shared state, which is otherwise not possible natively.

There is a project called SpookyStuff, which is a
Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark
Hope it helps!

I think the accepted answer is incorrect in one fundamental way: real-life large-scale web extraction is a pull process.
This is because requesting HTTP content is often a far less laborious task than building the response. I have built a small program which is able to crawl 16 million pages a day with four CPU cores and 3 GB of RAM, and it was not even particularly well optimized. For a similar server, such a load (~200 requests per second) is not trivial and usually requires many layers of optimization.
Real websites can, for example, break their cache system if you crawl them too fast (instead of having the most popular pages in the cache, it can get flooded with the long-tail content of the crawl). So in that sense, a good web scraper always respects robots.txt etc.
The real benefit of a distributed crawler doesn't come from splitting the workload of one domain, but from splitting the workload of many domains across a single distributed process, so that the one process can confidently track how many requests the system puts through.
Of course, in some cases you want to be the bad boy and break the rules; however, in my experience such products don't stay alive for long, since website owners like to protect their assets from things that look like DoS attacks.
Golang is very good for building web scrapers, since it has channels as a native data type and they support pull queues very well. Because the HTTP protocol and scraping in general are slow, you can include the extraction pipelines as part of the process, which lowers the amount of data to be stored in the data warehouse system. You can crawl one TB spending less than $1 worth of resources, and do it fast, when using Golang and Google Cloud (probably also doable with AWS and Azure).
Spark gives you no additional value. Using wget as a client is clever, since it automatically respects robots.txt properly: a parallel, domain-specific pull queue feeding wget is the way to go if you are working professionally.
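A minimal sketch of that per-domain pull-queue idea, written in Scala (rather than Go) to match the rest of this thread; the domains, queue contents, delay and output directory are all illustrative. Each domain gets its own queue and worker, so politeness can be enforced per domain while the process keeps a global view of the request rate.

import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}
import scala.sys.process._

object DomainPullQueues {
  def main(args: Array[String]): Unit = {
    // one pull queue per domain; the queue contents here are just examples
    val queues = Map(
      "example.com" -> new LinkedBlockingQueue[String](),
      "example.org" -> new LinkedBlockingQueue[String]()
    )
    queues("example.com").put("http://example.com/news/20150301")
    queues("example.org").put("http://example.org/blog/1")

    val pool = Executors.newFixedThreadPool(queues.size)
    queues.foreach { case (domain, queue) =>
      pool.submit(new Runnable {
        def run(): Unit = {
          // one worker per domain pulls from its own queue, so the per-domain
          // request rate stays bounded and observable
          var url = queue.poll(5, TimeUnit.SECONDS)
          while (url != null) {
            // wget is used as the client, as suggested above; note its robots.txt
            // handling only applies to recursive mode, so single-URL fetches like
            // this still need their own politeness logic
            Seq("wget", "--quiet", "-P", s"/tmp/crawl/$domain", url).!
            Thread.sleep(1000) // illustrative politeness delay: ~1 request/second/domain
            url = queue.poll(5, TimeUnit.SECONDS)
          }
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)
  }
}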

Related

Handling Out-Of-Order Event Windowing in Apache Beam from a Multitenant Kafka Topic

I’ve been mulling over how to solve a given problem in Beam and thought I’d reach out to a larger audience for some advice. At present things seem to be working only sparsely, and I was curious whether someone could provide a sounding board to see if this workflow makes sense.
The primary high-level goal is to read records from Kafka that may be out of order and need to be windowed in event time according to another property found on the records, eventually emitting the contents of those windows and writing them out to GCS.
The current pipeline looks roughly like the following:
val partitionedEvents = pipeline
    .apply("Read Events from Kafka",
        KafkaIO
            .read<String, Log>()
            .withBootstrapServers(options.brokerUrl)
            .withTopic(options.incomingEventsTopic)
            .withKeyDeserializer(StringDeserializer::class.java)
            .withValueDeserializerAndCoder(
                SpecificAvroDeserializer<Log>()::class.java,
                AvroCoder.of(Log::class.java)
            )
            .withReadCommitted()
            .commitOffsetsInFinalize()
            // Set the watermark to use a specific field for event time
            .withTimestampPolicyFactory { _, previousWatermark -> WatermarkPolicy(previousWatermark) }
            .withConsumerConfigUpdates(
                ImmutableMap.of<String, Any?>(
                    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest",
                    ConsumerConfig.GROUP_ID_CONFIG, "log-processor-pipeline",
                    "schema.registry.url", options.schemaRegistryUrl
                )
            ).withoutMetadata()
    )
    .apply("Logging Incoming Logs", ParDo.of(Events.log()))
    .apply("Rekey Logs by Tenant", ParDo.of(Events.key()))
    .apply("Partition Logs by Source",
        // This is a custom function that will partition incoming records by a specific
        // datasource field
        Partition.of(dataSources.size, Events.partition<KV<String, Log>>(dataSources))
    )

dataSources.forEach { dataSource ->
    // Store a reference to the data source name to avoid serialization issues
    val sourceName = dataSource.name
    val tempDirectory = Directories.resolveTemporaryDirectory(options.output)

    // Grab all of the events for this specific partition and apply the source-specific windowing
    // strategies
    partitionedEvents[dataSource.partition]
        .apply(
            "Building Windows for $sourceName",
            SourceSpecificWindow.of<KV<String, Log>>(dataSource)
        )
        .apply("Group Windowed Logs by Key for $sourceName", GroupByKey.create())
        .apply("Log Events After Windowing for $sourceName", ParDo.of(Events.logAfterWindowing()))
        .apply(
            "Writing Windowed Logs to Files for $sourceName",
            FileIO.writeDynamic<String, KV<String, MutableIterable<Log>>>()
                .withNumShards(1)
                .by { row -> "${row.key}/${sourceName}" }
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(SerializableFunction { logs -> Files.stringify(logs.value) }), TextIO.sink())
                .to(options.output)
                .withNaming { partition -> Files.name(partition) }
                .withTempDirectory(tempDirectory)
        )
}
In a simpler, bulleted form, it might look like this:
Read records from single Kafka topic
Key all records by their tenant
Partition the stream by another event property
Iterate through known partitions in previous step
Apply custom windowing rules for each partition (related to datasource, custom window rules)
Group windowed items by key (tenant)
Write tenant-key pair groupings to GCS via FileIO
The problem is that the incoming Kafka topic contains out-of-order data across multiple tenants (e.g. events for tenant1 might be streaming in now, but then a few minutes later you’ll get events for tenant2 in the same partition, etc.). This causes the watermark to bounce back and forth in time, since each incoming record is not guaranteed to continually increase, which sounds like it would be a problem, but I'm not certain. It certainly seems that while data is flowing through, some files are simply not being emitted at all.
The custom windowing function is extremely simple and is aimed at emitting a single window once the allowed lateness and windowing duration have elapsed:
object SourceSpecificWindow {
    fun <T> of(dataSource: DataSource): Window<T> {
        return Window.into<T>(FixedWindows.of(dataSource.windowDuration()))
            .triggering(Never.ever())
            .withAllowedLateness(dataSource.allowedLateness(), Window.ClosingBehavior.FIRE_ALWAYS)
            .discardingFiredPanes()
    }
}
However, it seemed inconsistent, since we'd see logging come out after the closing of the window, but not necessarily files being written out to GCS.
Does anything seem blatantly wrong or incorrect with this approach? The data can come in out of order within the source (i.e. right now, 2 hours ago, 5 minutes from now) and covers multiple tenants, but the aim is to ensure that one tenant that keeps up to date won't drown out tenants whose data arrives late.
Would we potentially need another Beam application or something to "split" this single stream of events into sub-streams that are each processed independently (so that each watermark progresses on its own)? Is that where a SplittableDoFn would come in? I'm running on the SparkRunner, which doesn't appear to support that - but it seems as though it'd be a valid use case.
Any advice would be greatly appreciated or even just another set of eyes. I'd be happy to provide any additional details that I could.
Environment
Currently running against SparkRunner
While this may not be the most helpful response, I'll be transparent about the end result. Eventually the logic required for this specific use case extended far beyond the built-in capabilities of Apache Beam, primarily in the area of windowing and the governance of time.
The solution we landed on was to switch the preferred streaming technology from Apache Beam to Apache Flink, which as you might imagine was quite a leap. The stateful-centric nature of Flink allowed us to more easily handle our use cases and define custom eviction criteria (and ordering) around windowing, at the cost of losing a layer of abstraction over it.

Apache Beam bundling issue

My problem is the following: I want to aggregate some data that is stored on S3. As the initial input to my pipeline I use a text file that contains the paths of all the S3 files that should be aggregated.
PCollection<String> readInputPipeline = p.apply("ReadLines", TextIO.read().from(options.getInputFile()));
readInputPipeline = readInputPipeline.apply(ParDo.of(new ReadFromS3Mapper()));
The input file has 346k lines. When I deploy this code to a Spark cluster, reading from S3 appears to happen in only 2 Spark tasks, even though many cores are available. Is there any way for me to increase the parallelism of this operation?
I am running this on EMR on Amazon with a master (m3.xlarge) and a core machine (R3.4xlarge) with the following options:
"spark-submit"
"--driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'",
"--master", "yarn",
"--executor-cores","16",
"--executor-memory","6g"
PS: maybe the solution is that I shouldn't do this kind of expensive IO operation in this context?
Spark decides how to split up an input; here it has decided to go through the entire file in one go, because it is so small.
I've done something similar in a distcp application; it uses Spark's ParallelCollectionRDD class to explicitly tell Spark to split the listing up one by one.
That class should be enough for you to do something similar - you may have to read the initial text file in locally to a list, then pass the list to the ParallelCollectionRDD constructor.
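A minimal sketch of that idea in plain Spark; note that sc.parallelize builds a ParallelCollectionRDD under the hood, so it gives the same effect without touching Spark internals. The partition count of 200, the argument handling and the fetchFromS3 helper are illustrative.

import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object ParallelListing {
  // illustrative helper: fetch and process one S3 object, returning something small
  def fetchFromS3(path: String): String = s"processed $path"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParallelListing"))

    // read the listing file locally on the driver
    val src = Source.fromFile(args(0))
    val listing = src.getLines().toVector
    src.close()

    // explicitly choose how many partitions (and hence tasks) the listing is split into
    sc.parallelize(listing, numSlices = 200)
      .map(fetchFromS3)
      .saveAsTextFile(args(1))

    sc.stop()
  }
}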
A bit of a late reply, but I looked into what Beam does in the 2.16.0 release.
You're getting 2 tasks after the initial TextIO.read() - I suspect that your initial list of files of 346k lines is being split into two partitions. This behaviour is controlled by the desiredBundleSize inside TextIO, which is hard-coded to 64 MB.
In Spark, your ReadFromS3Mapper transform will be "fused" with the arriving records, so you'll always stay at two partitions.
If you want to keep the same code, you can force a repartition between the two transformations:
PCollection<String> allContents = p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
    .apply("Repartition", Reshuffle.viaRandomKey())
    .apply(ParDo.of(new ReadFromS3Mapper()));
As an alternative, there are quite a few interesting patterns available in the TextIO and FileIO utilities. There's an example that matches yours almost exactly (implicitly including the reshuffle); a sketch of that pattern follows.
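What follows is only a sketch of that FileIO pattern under stated assumptions: it expands a PCollection of file names into their contents, it is written in Scala against Beam's Java SDK to match the other sketches in this thread, and the listing path is a placeholder. Whether it can replace your ReadFromS3Mapper depends on what that DoFn actually does with each file.

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.{FileIO, TextIO}
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.values.PCollection

object ReadListedFiles {
  def main(args: Array[String]): Unit = {
    val p = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    // the listing file contains one S3 path (or glob) per line; per the answer above,
    // this pattern implicitly includes a reshuffle, so the per-file reads are not
    // fused into the two initial partitions
    val contents: PCollection[String] =
      p.apply("ReadListing", TextIO.read().from("s3://bucket/listing.txt"))
        .apply("MatchFiles", FileIO.matchAll())
        .apply("ToReadableFiles", FileIO.readMatches())
        .apply("ReadFileContents", TextIO.readFiles())

    p.run().waitUntilFinish()
  }
}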

from single machine to parallel processing?

I am new to Spark Streaming and Big Data in general. I am trying to understand the structure of a project in Spark. I want to create a main class, let's say "Driver", with M machines, where each machine keeps an array of counters and their values. On a single machine, outside of Spark, I would create a class for the machines and a class for the counters and do the computations that I want. But I am wondering whether the same applies in Spark. Would the same project, but in Spark, have the structure I am quoting below?
class Driver {
  var num: Int = 100
  var machines: Array[Machine] = new Array[Machine](num)
  // split the incoming DStream and fill the machines' queues
}

class Machine {
  var counters = new Queue[(Int, Int)]() // e.g. counter with id 1 and value 25
  def fillCounters: Unit = { ... } // function to fill the counters queue
}
In general, you can think of a Spark application as having a driver part, which runs all the coordination tasks and constructs the graph of operations to run over your data (you will find mentions of the directed acyclic graph, or DAG, in the theoretical parts of tutorials on Spark and distributed computation), and an executor part, which results in many copies of your code being sent to each node of the cluster to run over the data.
The main idea is that the driver extracts the part of your application's code that needs to run locally with the data on the nodes, serializes it, sends it over the network to each executor, launches it, manages it and collects the results.
The Spark framework hides these details for simplicity of use, so an application can be developed as if it were a single-threaded application.
A developer can separate the contexts that run on the driver and on the executors, but this is not very common in tutorials (again, for simplicity).
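A minimal sketch of that split, with comments marking which parts run on the driver and which inside the executors; the counter logic is illustrative, not a full port of the Driver/Machine design above.

import org.apache.spark.{SparkConf, SparkContext}

object DriverExecutorSketch {
  def main(args: Array[String]): Unit = {
    // everything in main() runs on the driver
    val sc = new SparkContext(new SparkConf().setAppName("DriverExecutorSketch"))

    val events = sc.parallelize(Seq((1, 5), (2, 3), (1, 20), (3, 7))) // (counterId, value)

    // the functions passed to map/reduceByKey are serialized by the driver
    // and executed on the executors, one partition at a time
    val counters = events
      .map { case (id, value) => (id, value.toLong) }
      .reduceByKey(_ + _)

    // collect() brings the (small) result back to the driver
    counters.collect().foreach { case (id, total) =>
      println(s"counter $id -> $total")
    }
    sc.stop()
  }
}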
So, to answer the actual question above:
You do not need to design your application in the way you demonstrated above, unless you really want to.
Just follow the official Spark tutorial to get a viable solution, and split it afterwards by execution context.
There is a good post summarizing a lot of Spark tutorials, videos and talks - you can find it here at SO.

Spark Parquet Loader: Reduce number of jobs involved in listing a dataframe's files

I'm loading parquet data into a dataframe via
spark.read.parquet('hdfs:///path/goes/here/...')
There are around 50k files in that path due to parquet partitioning. When I run that command, Spark spawns off dozens of small jobs that as a whole take several minutes to complete.
In the Spark UI you can see that although each job has ~2100 tasks, they execute quickly, in about 2 seconds each. Starting so many 'mini jobs' is inefficient and leads this file-listing step to take about 10 minutes (during which the cluster's resources are mostly idle, and the cluster is mostly dealing with straggling tasks or the overhead of managing jobs/tasks).
How can I consolidate these tasks into fewer jobs, each with more tasks?
Bonus points for a solution that also works in pyspark.
I'm running spark 2.2.1 via pyspark on hadoop 2.8.3.
I believe you have encountered a bug for which a former colleague of mine filed a ticket and opened a pull request. You can check it out here. If it fits your issue, your best shot is probably voting the issue up and making some noise about it on the mailing list.
What you might want to do is tweak the spark.sql.sources.parallelPartitionDiscovery.threshold and spark.sql.sources.parallelPartitionDiscovery.parallelism configuration parameters (the former being cited in the linked ticket) in a way that suits your job.
You can have a look here and here to see how the configuration keys are used. I'll share the related snippets here for completeness.
spark.sql.sources.parallelPartitionDiscovery.threshold
// Short-circuits parallel listing when serial listing is likely to be faster.
if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
  return paths.map { path =>
    (path, listLeafFiles(path, hadoopConf, filter, Some(sparkSession)))
  }
}
spark.sql.sources.parallelPartitionDiscovery.parallelism
// Set the number of parallelism to prevent following file listing from generating many tasks
// in case of large #defaultParallelism.
val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
The default values for these configurations are 32 for the threshold and 10000 for the parallelism (related code here).
In your case, I'd say that what you probably want to do is set the threshold so that the listing runs without spawning parallel jobs.
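For example, a sketch of raising that threshold when building the session; the value 100000 is only illustrative and just needs to exceed the number of paths so the serial, single-job listing branch shown above is taken.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetLoad")
  // keep partition discovery serial (no extra listing jobs) for up to 100000 paths
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "100000")
  .getOrCreate()

val df = spark.read.parquet("hdfs:///path/goes/here/...")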
Note
The linked sources are from the latest available tagged release at the time of writing, 2.3.0.
Against an object store, even the listing and the calls to getFileStatus are pretty expensive, and since this is done during partitioning it can extend the job a lot.
Play with mapreduce.input.fileinputformat.list-status.num-threads to see if adding more threads speeds things up - say a value of 20-30.

Spark: Importing Data

I currently have a Spark app that reads a couple of files, forms a data frame out of them, and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double-check that, since all machines on the cluster can access the files (which is a requirement of Spark), the task of reading in data from these files is distributed and no one machine is burdened by it.
I was looking at the Spark UI for this app, but since it only shows what actions were performed by which machines, and since "sc.textFile(filePath)" is not an action, I couldn't be sure which machines perform this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array, and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute it at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic as I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the driver. After that it's just a plain array. Indeed, the idea is that you should not use collect if you want to perform distributed computations.
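A small sketch pulling those points together; the file path, partition count and word-count logic are placeholders for your own data and logic. textFile's minPartitions spreads the read across the cluster, while collect() pulls everything back to the driver, so the heavy work should happen before it.

import org.apache.spark.{SparkConf, SparkContext}

object ReadAndAggregate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadAndAggregate"))

    // distributed read: each of the (at least) 100 partitions is read by some executor
    val lines = sc.textFile("hdfs:///data/trees/*.csv", minPartitions = 100)

    // keep the heavy work as RDD transformations/actions so it stays distributed
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // collect() ships the (hopefully small) result to the driver as a plain array
    val top = wordCounts.collect().sortBy(-_._2).take(10)
    top.foreach(println)

    sc.stop()
  }
}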
