Why does changing className on body have the same performance impact as changing className on a leaf node? - performance-testing

I recently learned about web optimization. Several resources mention that changing the class of a root node is more costly than changing the class of a leaf node.
So I tried to prove that by recording a performance profile of the Google homepage while changing the className of the body node and of a leaf node (#hplogo).
But here is what I found:
(Sorry, I can't post any images; I don't have enough reputation to do that here.)
[Performance recording screenshot: changing className on body]
[Performance recording screenshot: changing className on #hplogo]
So is it true that changing the class of a root node is more costly than changing the class of a leaf node?

Related

Cassandra gossipinfo severity explained

I was unable to find good documentation or an explanation of what severity indicates in nodetool gossipinfo. I was looking for a detailed explanation but could not find a suitable one.
The severity is a value added to the latency in the dynamic snitch to determine which replica a coordinator will send the read's DATA and DIGEST requests to.
Its value depends on the IO used in compaction, and it also tries to read /proc/stat (the same as the iostat utility) to get actual disk statistics for its weight. In post-3.10 versions of Cassandra this was removed in https://issues.apache.org/jira/browse/CASSANDRA-11738. In previous versions you can disable it by setting -Dcassandra.ignore_dynamic_snitch_severity in the JVM options. The issue is that it weights IO use the same as latency, so if a node is GC thrashing and not doing much IO because of it, it could end up being treated as the target of most reads even though it's the worst possible node to send requests to.
You can still use JMX to set the value (to 1) if you want to exclude a node from being used for reads. An example use case is running nodetool disablebinary so applications won't query the node directly, then setting the severity to 1. That node would then only be queried by the cluster if there's a CL.ALL request or a read repair. It's a way to take a node "offline" for maintenance from a read perspective while still allowing it to receive mutations so it doesn't fall behind.
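For the JMX part, here is a minimal sketch in Scala (the language already used elsewhere on this page), assuming the default JMX port 7199 and that the dynamic snitch exposes its severity as the Severity attribute of the org.apache.cassandra.db:type=DynamicEndpointSnitch MBean; verify the object name in jconsole for your Cassandra version:

import javax.management.{Attribute, ObjectName}
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object SetSeverity {
  def main(args: Array[String]): Unit = {
    // Cassandra's default JMX port is 7199; adjust host/port (and credentials) for your node.
    val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    try {
      val mbsc = connector.getMBeanServerConnection
      // Assumed MBean name; check it against your Cassandra version.
      val snitch = new ObjectName("org.apache.cassandra.db:type=DynamicEndpointSnitch")
      mbsc.setAttribute(snitch, new Attribute("Severity", java.lang.Double.valueOf(1.0)))
      println("Severity is now " + mbsc.getAttribute(snitch, "Severity"))
    } finally {
      connector.close()
    }
  }
}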
Severity reports activity that happens on the particular node (compaction, etc.), and this information then is used to make a decision on what node could better handle the request. There is discussion in original JIRA about this functionality & how this information is used.
P.S. Please see Chris's answer about the changes in post-3.10 versions - I wasn't aware of these changes.

C# Suggestion for Data Structure

Looking for a data structure which has the characteristics below:
Navigation from one node to another in both directions.
There can be multiple parent (source) nodes for any node (destination node).
Search should be possible and efficient.
Memory usage and performance should be efficient.
This is needed for parsing a language and providing code completion. (A rough sketch of such a structure follows.)
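What is described is essentially a directed graph with back-links from each child to its parents, plus a name index for fast search. A minimal sketch, written in Scala only because that is the language already used on this page (the same shape maps directly onto C# classes and a Dictionary<string, SymbolNode>); SymbolNode and SymbolGraph are illustrative names, not an existing library:

import scala.collection.mutable

// Node with links in both directions; all names here are illustrative.
class SymbolNode(val name: String) {
  val parents  = mutable.ListBuffer.empty[SymbolNode] // navigate upward (multiple parents allowed)
  val children = mutable.ListBuffer.empty[SymbolNode] // navigate downward
}

class SymbolGraph {
  // Index for efficient search by name; a C# Dictionary<string, SymbolNode> plays the same role.
  private val index = mutable.HashMap.empty[String, SymbolNode]

  def node(name: String): SymbolNode =
    index.getOrElseUpdate(name, new SymbolNode(name))

  def link(parent: String, child: String): Unit = {
    val p = node(parent)
    val c = node(child)
    p.children += c
    c.parents += p
  }

  def find(name: String): Option[SymbolNode] = index.get(name)
}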

Distributed Web crawling using Apache Spark - Is it Possible?

An interesting question was asked of me when I attended an interview regarding web mining. The question was: is it possible to crawl websites using Apache Spark?
I guessed that it was possible, because of the distributed processing capacity of Spark. After the interview I searched for this, but couldn't find any interesting answer. Is that possible with Spark?
Spark adds essentially no value to this task.
Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs you could use YARN, Mesos etc. directly with less overhead.
Sure, you could do this on Spark. Just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.
How about this way:
Your application would get a set of website URLs as input for your crawler. If you were implementing just a normal (non-Spark) app, you might do it as follows (a sketch follows this list):
split all the web pages to be crawled into a list of separate sites, each small enough to fit well in a single thread:
for example: you have to crawl www.example.com/news from 20150301 to 20150401; the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
assign each base URL (www.example.com/news/20150401) to a single thread; it is in the threads where the real data fetching happens
save the result of each thread into the file system.
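A minimal non-Spark sketch of those steps, using Scala futures in place of raw threads; the www.example.com URLs are the hypothetical ones from above, the fetch is a bare Source.fromURL with no politeness or error handling, and results are only printed rather than written to the file system:

import java.time.LocalDate
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.io.Source

// Split a date range into per-day URLs, fetch each one concurrently, print a summary.
object PlainCrawler {
  def main(args: Array[String]): Unit = {
    val start = LocalDate.of(2015, 3, 1)
    val end   = LocalDate.of(2015, 4, 1)
    val days  = Iterator.iterate(start)(_.plusDays(1)).takeWhile(!_.isAfter(end)).toList
    val urls  = days.map(d => "http://www.example.com/news/" + d.toString.replace("-", ""))

    // One Future per split stands in for "one thread per base URL".
    val fetches = urls.map { url =>
      Future(url -> Source.fromURL(url).mkString)
    }
    val pages = Await.result(Future.sequence(fetches), Duration.Inf)

    // In a real app you would write each page to the file system here.
    pages.foreach { case (url, html) => println(url + " -> " + html.length + " bytes") }
  }
}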
When the application becomes a Spark application, the same procedure happens, but encapsulated in Spark notions: we can write a custom CrawlRDD that does the same stuff:
Split the sites: def getPartitions: Array[Partition] is a good place to do the split task.
Threads to crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] is spread to all the executors of your application and runs in parallel.
Save the RDD into HDFS.
The final program looks like:
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// X from the steps above is made concrete here as a (url, content) pair.
case class Page(url: String, content: String)

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[Page](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[Page] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[Page] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL under baseUrl; if there is one,
        // fill in nextURL and return true, else return false
        false
      }

      override def next(): Page = {
        // logic to crawl the web page at nextURL and return its content
        Page(nextURL, "")
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
YES.
Check out the open source project: Sparkler (spark - crawler) https://github.com/USCDataScience/sparkler
Check out Sparkler Internals for a flow/pipeline diagram. (Apologies, it is an SVG image; I couldn't post it here.)
This project wasn't available when the question was posted; however, as of December 2016 it is one of the most active projects!
Is it possible to crawl the Websites using Apache Spark?
The following pieces may help you understand why someone would ask such a question and also help you to answer it.
The creators of the Spark framework wrote in the seminal paper [1] that RDDs "would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler".
RDDs are key components in Spark. However, you can create traditional map-reduce applications (with little or no abuse of RDDs).
There is a widely popular distributed web crawler called Nutch [2]. Nutch is built with Hadoop Map-Reduce (in fact, Hadoop Map-Reduce was extracted from the Nutch codebase).
If you can do some task in Hadoop Map-Reduce, you can also do it with Apache Spark (a small sketch follows the references below).
[1] http://dl.acm.org/citation.cfm?id=2228301
[2] http://nutch.apache.org/
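As a small sketch of that last point: treat fetching as a plain map over a seed list, much like a Nutch fetch step on Hadoop. The seed URLs and the HDFS output path below are hypothetical, and the fetch is a bare Source.fromURL with no politeness, retries, or link extraction:

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

// Fetch a fixed seed list with a plain map, Nutch-style; no custom RDD needed.
object MapReduceStyleCrawl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fetch-only"))
    val seeds = sc.parallelize(Seq(
      "http://www.example.com/news/20150301",
      "http://www.example.com/news/20150302"))

    // Each output record becomes "url <TAB> page size".
    seeds.map(url => url + "\t" + Source.fromURL(url).mkString.length)
         .saveAsTextFile("hdfs://namenode/crawl/fetch-output")

    sc.stop()
  }
}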
PS:
I am a co-creator of Sparkler and a committer and PMC member for Apache Nutch.
When I designed Sparkler, I created an RDD which is a proxy to Solr/Lucene-based indexed storage. It enabled our crawler-database RDD to make asynchronous fine-grained updates to shared state, which is otherwise not possible natively.
There is a project called SpookyStuff, which is a
Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark
Hope it helps!
I think the accepted answer is incorrect in one fundamental way; real-life large-scale web extraction is a pull process.
This is because requesting HTTP content is often a far less laborious task than building the response. I have built a small program which is able to crawl 16 million pages a day with four CPU cores and 3 GB of RAM, and it was not even optimized very well. For a server, a similar load (~200 requests per second) is not trivial and usually requires many layers of optimization.
Real websites can, for example, break their cache system if you crawl them too fast (instead of having the most popular pages in the cache, it can get flooded with the long-tail content of the crawl). So in that sense, a good web scraper always respects robots.txt etc.
The real benefit of a distributed crawler doesn't come from splitting the workload of one domain, but from splitting the workload of many domains into a single distributed process, so that the one process can confidently track how many requests the system puts through.
Of course, in some cases you want to be the bad boy and break the rules; however, in my experience such products don't stay alive long, since website owners like to protect their assets from things that look like DoS attacks.
Golang is very good for building web scrapers, since it has channels as a native data type and they support pull queues very well. Because the HTTP protocol and scraping in general are slow, you can include the extraction pipelines as part of the process, which lowers the amount of data to be stored in the data warehouse system. You can crawl one TB while spending less than $1 worth of resources, and do it fast, when using Golang and Google Cloud (probably also doable with AWS and Azure).
Spark gives you no additional value. Using wget as a client is clever, since it automatically respects robots.txt properly: a parallel domain-specific pull queue feeding wget is the way to go if you are working professionally.
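To make the "pull queue" idea concrete, here is a minimal sketch in Scala (the language used elsewhere on this page) rather than the Go channels the author recommends; DomainQueue and its fields are illustrative names, not from any library:

import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// One queue per domain; workers *pull* URLs, and the queue enforces a minimum
// delay per domain so the process always knows how hard it is hitting each site.
class DomainQueue(val domain: String, minDelayMs: Long) {
  private val urls = new LinkedBlockingQueue[String]()
  private var lastFetch = 0L // assumes a single worker pulls from each domain queue

  def offer(url: String): Unit = urls.put(url)

  def pull(): String = {
    val url = urls.take() // blocks until a URL for this domain is available
    val wait = lastFetch + minDelayMs - System.currentTimeMillis()
    if (wait > 0) TimeUnit.MILLISECONDS.sleep(wait)
    lastFetch = System.currentTimeMillis()
    url
  }
}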

Can the same GG node serve both roles: IMDB and Compute?

I would like to do "some calculation" on each In-Memory Database (IMDB) GridGain (GG) node, which keeps receiving new data.
While looking at the GG examples, it seems a node must be started either as a data node or as a compute node.
Alternative GG architectural ideas would be appreciated.
Thanks
The GridGain Data Grid edition (which I think you are referring to) includes Compute functionality. If you start a GridGain node with any configuration, Compute functionality is included by default.
Alternatively, if you would like, for example, data grid and streaming functionality together, you may download the platform edition, which includes everything.
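As a rough illustration that a single node can serve both roles, here is a sketch against the Apache Ignite API (the open-sourced core that modern GridGain builds on); older GridGain releases use different class names, so treat the class and method names below as assumptions rather than the exact API from the question's era:

import org.apache.ignite.{Ignite, Ignition}
import org.apache.ignite.lang.IgniteRunnable

object DataAndComputeNode {
  def main(args: Array[String]): Unit = {
    // One node started with the default configuration: it stores cache data
    // and can execute compute closures at the same time.
    val ignite: Ignite = Ignition.start()

    val cache = ignite.getOrCreateCache[Int, String]("incoming")
    cache.put(1, "new data")

    // Broadcast a closure to every node in the cluster (here: just this one),
    // reading the cached data right where it lives.
    ignite.compute().broadcast(new IgniteRunnable {
      override def run(): Unit = {
        val local = Ignition.ignite().cache[Int, String]("incoming")
        println("computed next to the data: " + local.get(1))
      }
    })

    Ignition.stop(true)
  }
}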

Routing table creation at a node in a Pastry P2P network

This question is about the routing table creation at a node in a p2p network based on Pastry.
I'm trying to simulate this scheme of routing table creation in a single JVM. I can't seem to understand how these routing tables are created, starting from the point where the first node joins.
I have N independent nodes, each with a 160-bit nodeId generated as a SHA-1 hash, and a function to determine the proximity between these nodes. Let's say the 1st node starts the ring and joins it. The protocol says that this node should have had its routing tables set up at this time. But I do not have any other nodes in the ring at this point, so how does it even begin to create its routing tables?
When the 2nd node wishes to join the ring, it sends a Join message (containing its nodeID) to the 1st node, which passes it around in hops to the closest available neighbor for this 2nd node already existing in the ring. These hops contribute to the creation of routing table entries for this new 2nd node. Again, in the absence of a sufficient number of nodes, how do all these entries get created?
I'm just beginning to look at the FreePastry implementation to get these answers, but it doesn't seem very apparent at the moment. If anyone could provide some pointers here, that'd be of great help too.
My understanding of Pastry is not complete, by any stretch of the imagination, but it was enough to build a more-or-less working version of the algorithm. Which is to say, as far as I can tell, my implementation functions properly.
To answer your first question:
The protocol says that this [first] node should have had its routing tables set up at this time. But I do not have any other nodes in the ring at this point, so how does it even begin to create its routing tables?
I solved this problem by first creating the Node and its state/routing tables. The routing tables, when you think about it, are just information about the other nodes in the network. Because this is the only node in the network, the routing tables are empty. I assume you have some way of creating empty routing tables?
To answer your second question:
When the 2nd node wishes to join the ring, it sends a Join message (containing its nodeID) to the 1st node, which passes it around in hops to the closest available neighbor for this 2nd node already existing in the ring. These hops contribute to the creation of routing table entries for this new 2nd node. Again, in the absence of a sufficient number of nodes, how do all these entries get created?
You should take another look at the paper (PDF warning!) that describes Pastry; it does a rather good job of explaining the process for nodes joining and exiting the cluster.
If memory serves, the second node sends a message that not only contains its node ID, but actually uses its node ID as the message's key. The message is routed like any other message in the network, which ensures that it quickly winds up at the node whose ID is closest to the ID of the newly joined node. Every node that the message passes through sends its state tables to the newly joined node, which uses them to populate its own state tables. The paper explains some in-depth logic that takes the origin of the information into consideration when using it to populate the state tables, in a way that, I believe, is intended to reduce the computational cost; in my implementation I ignored that, as it would have been more expensive to implement, not less.
To answer your question specifically, however: the second node will send a Join message to the first node. The first node will send its (empty) state tables to the second node. The second node will add the sender of the state tables (the first node) to its own state tables, then add the appropriate nodes from the received state tables (no nodes, in this case). The first node would forward the message on to a node whose ID is closer to that of the second node, but no such node exists, so the message is considered "delivered", and both nodes are considered to be participating in the network at this time.
Should a third node join and route a Join message to the second node, the second node would send the third node its state tables. Then, assuming the third node's ID is closer to the first node's, the second node would forward the message to the first node, who would send the third node its state tables. The third node would build its state tables out of these received state tables, and at that point it is considered to be participating in the network.
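A heavily simplified sketch of that join flow, in Scala to match the rest of this page: node IDs are reduced to plain integers, the leaf set / routing table / neighborhood set are collapsed into one "known peers" set, and routing uses numeric distance instead of Pastry's prefix matching; Node, handleJoin and merge are illustrative names, not FreePastry API:

import scala.collection.mutable

case class NodeId(value: BigInt)

class Node(val id: NodeId) {
  // All state tables collapsed into a single set of known peers.
  val known = mutable.Set.empty[Node]

  private def distance(other: NodeId): BigInt = (id.value - other.value).abs

  // Route a Join keyed by the joiner's ID; every hop ships its state to the joiner.
  def handleJoin(joiner: Node): Unit = {
    joiner.merge(this) // this hop sends its "state tables" to the new node
    val closerHop = known.toSeq
      .filter(_.distance(joiner.id) < distance(joiner.id))
      .sortBy(_.distance(joiner.id))
      .headOption
    closerHop match {
      case Some(next) => next.handleJoin(joiner) // forward toward the closest ID
      case None       => known += joiner         // delivered: the joiner is now a peer
    }
  }

  private def merge(from: Node): Unit = {
    known += from
    known ++= from.known.filterNot(_ eq this)
  }
}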
Hope that helps.
