I have a bipartite graph of urban traffic and I want to implement SimRank for this graph. I have two RDDs, each containing the nodes and neighbors of one side of the graph. To be more specific, one RDD contains cars as nodes together with the cameras each car has passed, and the other RDD contains cameras as nodes together with the cars that passed each camera. I am using random walks: I start with a query (for example the id of a camera) and I want to find similar cameras. I look up that camera in the RDD containing cameras as nodes and choose one of its neighbors at random. Then I look up the randomly selected car id in the RDD containing cars as nodes, and this process keeps going until I run into the same camera a few times, at which point I conclude it is similar to the query. But the problem is that the code is very slow and takes 2 hours.
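Roughly, one step of my walk looks like this (a simplified sketch, not my exact code; the names and types are illustrative):

import scala.util.Random
import org.apache.spark.rdd.RDD

// camerasRdd: camera id -> ids of the cars that passed that camera
// carsRdd:    car id    -> ids of the cameras that car passed
def walkStep(cameraId: Long,
             camerasRdd: RDD[(Long, Array[Long])],
             carsRdd: RDD[(Long, Array[Long])]): Long = {
  val cars = camerasRdd.lookup(cameraId).head // each lookup launches a Spark job
  val carId = cars(Random.nextInt(cars.length))
  val cameras = carsRdd.lookup(carId).head    // and another job here
  cameras(Random.nextInt(cameras.length))
}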
Do you have any suggestions?
I have data that is grouped on three columns. Two of the three columns have very high cardinality (up to 500 unique values per column), but each group has at most 400 rows.
I need to perform some computation on the grouped data. The computation takes a couple of seconds for each group. Will using Spark be overkill here? Will the process of parallelizing and distributing the operation add more time than doing it on one machine (perhaps with multiprocessing)?
Also, will adding more levels of parallelisation (on the high-cardinality columns) using Spark increase the net time taken to process the data for the same cluster configuration?
Background / scenario:
I have two tables: a 1-2 million entry transaction table of the form
TRX-ID, PROCESS-ID, ACTOR-ID
Additionally, a participant-lookup table (participants are one of multiple categories of users of the system) of the form
USER-ID, PARTICIPANT-ID
The transaction table is, for historical reasons, a bit messy: the PROCESS-ID can be a participant-id while the ACTOR-ID is the user-id of a different kind of user, and in some situations the PROCESS-ID is something else and the ACTOR-ID is the user-id of the participant.
I need to join the transaction and the participant-lookup table in order to get the participant-id for all transactions. I tried this in two ways.
(I left out some code steps in the snippets and focused on the join parts. Assume that the df variables are data frames and that I select the right columns to support e.g. the unions.)
First approach:
joined_df = transactions_df.join(
    pt_lookup_df,
    (transactions_df['actor-id'] == pt_lookup_df['user-id'])
    | (transactions_df['process-id'] == pt_lookup_df['participant-id'])
)
The code with this join is extremely slow: the job runs 45 minutes on a 10-instance AWS Glue cluster with nearly 99% load on all executors.
Second approach:
I realised that some of the transactions already have the participant-id and I do not need to join for them. So I changed to:
transactions_df_1 = transactions_df_1.join(
    pt_lookup_df,
    transactions_df_1['actor-id'] == pt_lookup_df['user-id']
)
transactions_df_2 = transactions_df_2.withColumnRenamed('process-id', 'participant-id')
result_df = transactions_df_1.union(transactions_df_2)
This finished in 5 minutes!
Both approaches give the correct results.
Question
I do not understand why one approach is so slow and the other is not. The amount of data excluded in the second approach is minimal, so transactions_df_2 holds only a very small subset of the total data.
Looking at the plans, the difference is mainly a Cartesian product that is done in approach 1 but not in approach 2, so I assume this is the performance bottleneck. I still do not understand how it can lead to a 40-minute difference in compute time.
Can someone give explanations?
Is a Cartesian product in the DAG in general a warning sign in Spark?
Summary
It seems that a join whose condition ORs together multiple column equalities triggers an extremely slow Cartesian product operation. Should I have done a broadcast operation on the smaller data set to avoid this?
DAG approach 1
DAG approach 2
This is because a Cartesian product join and a regular join do not involve the same data shuffling process. Even though the amount of data is similar, the amount of shuffling is different.
This article explains where the extra shuffling comes from.
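Regarding the broadcast question in the summary: a minimal sketch in Scala (PySpark has the same broadcast function in pyspark.sql.functions), assuming pt_lookup_df is small enough to fit in memory on each executor:

import org.apache.spark.sql.functions.broadcast

// Broadcasting the small lookup table replaces the Cartesian product with a
// BroadcastNestedLoopJoin: each executor keeps a full copy of the lookup
// table, so the big table is never shuffled for the OR condition.
val joined = transactions_df.join(
  broadcast(pt_lookup_df),
  transactions_df("actor-id") === pt_lookup_df("user-id") ||
    transactions_df("process-id") === pt_lookup_df("participant-id")
)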
I have a graph database on ArangoDB with a node depth of around 100 levels and around 205k nodes for some users. For normal use the AQL used to traverse the graph works well, but there are some scenarios in which the traversal takes too much time:
Calculating the maximum depth for some user
Calculating the weight for a root node that has a node depth of 50+ and 200k+ child nodes. Here the weight is calculated by summing the weights of all child nodes.
Fetching all the nodes on a specific level for some node.
The solutions I have tried are as follows:
For #1 and #2, I calculate the maximum depth and weight in the background and keep them in a cache to avoid real-time processing. They are recalculated whenever the graph changes.
For #3, I have tried placing the graph on shards, which actually worsened the performance (as expected, because graph traversals cannot exploit the benefits of sharding).
I need suggestions for the following:
Is it a good idea to pre-calculate the user ids on each level and place them in a cache for each user?
Is there a (better) way to split the graph across shards so that my query can run in parallel and finish fetching the node data earlier?
Can tools like Elasticsearch or Spark be helpful in improving the performance of a graph query?
Thanks
I was reading about narrow vs. wide dependencies of an RDD split across multiple partitions.
My question: I do not understand why RDDs built with narrow dependencies do not require a shuffle over the network. Or is it that a shuffle DOES happen, but only a small number of times?
Please refer to the diagram below -
Let's say a child RDD is created with a narrow dependency from a parent RDD, as marked in the red rectangle below. The parent RDD had 3 partitions, say (P1, P2, P3), and the data in each respective partition got mapped into 3 other partitions, say (P1, P4, P5) respectively.
Since the data in parent partition P1 got mapped to itself, there is no shuffle over the network. But the data from parent partitions P2 & P3 got mapped to child partitions P4 & P5, which are different partitions, so that data naturally has to pass through the network for the corresponding values to be placed in P4 & P5. Why, then, do we say that there is no shuffle over the network?
The box in green is an even more complex case. The only case I can visualize where there is no shuffle over the network is when parent RDD partitions get mapped to themselves.
I am sure my reasoning is incorrect. Could someone provide some explanation?
Thanks
Narrow dependency doesn't imply that there is no network traffic.
The distinction between narrow and wide is more subtle:
With a wide dependency each child partition depends on each partition of its parents. It is a many-to-many relationship.
With a narrow dependency each child partition depends on at most one partition from each parent. It can be either a one-to-one or a many-to-one relationship.
Whether network traffic is required depends on factors other than the transformation alone. For example, co-partitioned RDDs can be joined without network traffic if the shuffle happened during the same action (in this case there is both co-partitioning and co-location), or with network traffic otherwise.
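A minimal sketch of that last point (assuming sc is a SparkContext):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)

// One shuffle each to establish a common partitioning.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner)
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner)

// Co-partitioned join: a narrow dependency (each output partition reads exactly
// one partition of each parent); whether any bytes cross the network now
// depends on whether those partitions are also co-located.
val joined = left.join(right)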
Let me give an analogy to illustrate partitions.
If you had a set of documents and wanted to filter it to identify all the improperly filled ones, it is equivalent to doing a filter operation. In order to speed up the operation, you distribute the set of documents to three people so each one of them has a partition of the documents. Each person then sifts through the subset of documents given to them (in the input box) and puts the ones that are improperly filled into an output box.
The operation performed by each individual is such that the contents of their output box depend only upon the input box provided to them; the input boxes of other people have no bearing on the output. Hence it requires no network transfer.
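In Spark terms the analogy is just this (an illustrative sketch; the documents and the check are made up):

// Hypothetical data and check, just to mirror the analogy.
val allDocuments = Seq("form A", "", "form C")
def isProperlyFilled(doc: String): Boolean = doc.nonEmpty

// Three "people" = three partitions; each partition is filtered independently,
// so no document ever has to move between partitions.
val documents = sc.parallelize(allDocuments, numSlices = 3)
val improperlyFilled = documents.filter(doc => !isProperlyFilled(doc))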
Hope this explains.
From the link you provided:
A typical execution sequence is as follows:
An RDD is created originally from external data sources (e.g. HDFS, local file, etc.).
The RDD undergoes a sequence of TRANSFORMATIONS (e.g. map, flatMap, filter, groupBy, join), each providing a different RDD that feeds into the next transformation.
Finally the last step is an ACTION (e.g. count, collect, save, take), which converts the last RDD into an output to external data sources.
The above sequence of processing is called a lineage (the outcome of the topological sort of the DAG).
Now think about how the data is processed as it makes its way through the pipeline.
If there is a narrow dependency, then the child partition is only dependent on 1 parent partition. The parent partition's data can be processed on 1 machine and the child partition can then exist on the same machine. No shuffling of data is necessary.
If there is a wide dependency, then 1 child partition is dependent on many parent partitions. The parent partitions may exist on many machines, so the data must be shuffled across the network in order to complete the data processing.
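For example (a sketch; the input path is made up):

val words  = sc.textFile("hdfs:///some/logs").flatMap(_.split(" ")) // narrow: each child partition comes from one parent partition
val pairs  = words.map(word => (word, 1))                           // narrow: still no shuffle
val counts = pairs.reduceByKey(_ + _)                               // wide: values for one key may live in many parent partitions -> shuffle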
I'm very new to Spark and don't really know the basics; I just jumped into it to solve a problem. The solution to the problem involves making a graph (using GraphX) whose edges have a string attribute. A user may wish to query this graph, and I handle the queries by filtering out only those edges whose string attribute is equal to the user's query.
Now, my graph has more than 16 million edges; it takes more than 10 minutes to create the graph when I'm using all 8 cores of my computer. However, when I query this graph (like I mentioned above), I get the results instantaneously (to my pleasant surprise).
So, my question is: how exactly does the filter operation search for my queried edges? Does it look at them iteratively? Are the edges being searched on multiple cores, so it just seems very fast? Or is there some sort of hashing involved?
Here is an example of how I'm using filter: Mygraph.edges.filter(_.attr.equals("cat")), which means that I want to retrieve the edges that have the attribute "cat". How are the edges being searched?
How can the filter results be instantaneous?
Running your statement returns so fast because it doesn't actually perform the filtering. Spark uses lazy evaluation: it doesn't actually perform transformations until you perform an action that actually gathers the results. Calling a transformation method like filter just creates a new RDD that represents this transformation and its result. You will have to perform an action like collect or count to actually have it executed:
import org.apache.spark.graphx.Graph

def myGraph: Graph[Int, String] = ??? // vertex attribute type chosen arbitrarily; the edges carry a String
// No filtering actually happens yet here; the results aren't needed yet, so Spark is lazy and doesn't do anything
val filteredEdges = myGraph.edges.filter(_.attr == "cat")
// Counting how many edges are left requires the results to actually be instantiated, so this fires off the actual filtering
println(filteredEdges.count)
// Actually gathering all results also requires the filtering to be done
val collectedFilteredEdges = filteredEdges.collect
Note that in these examples the filter results are not stored in between: due to the laziness, the filtering is repeated for both actions. To prevent that duplication, you should look into Spark's caching functionality, after reading up on the details of transformations and actions and what Spark actually does behind the scenes: https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
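Continuing the sketch above, caching would look like this:

// cache() marks the RDD to be kept after its first materialization.
val cachedEdges = myGraph.edges.filter(_.attr == "cat").cache()
println(cachedEdges.count)               // the filter actually runs here; the result is cached
val collectedEdges = cachedEdges.collect // served from the cache, no second filtering pass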
How exactly does the filter operation search for my queried edges (when I execute an action)?
In Spark GraphX the edges are stored in an RDD of type EdgeRDD[ED], where ED is the type of your edge attribute, in your case String. This special RDD does some special optimizations in the background, but for your purposes it behaves like its superclass RDD[Edge[ED]], and filtering occurs like filtering any RDD: it iterates through all items, applying the given predicate to each. An RDD, however, is split into a number of partitions, and Spark will filter multiple partitions in parallel; in your case, where you seem to run Spark locally, it will do as many in parallel as the number of cores you have, or as many as you have specified explicitly with --master local[4] for instance.
The RDD with the edges is partitioned based on the PartitionStrategy that is set, for instance if you create your graph with Graph.fromEdgeTuples or by calling partitionBy on your graph. All strategies are based on the edge's vertices, however, so they have no knowledge of your attribute and thus do not affect your filtering operation, except maybe for some unbalanced network load if you run it on a cluster, all the 'cat' edges end up in the same partition/executor, and you do a collect or some shuffle operation. See the GraphX docs on Vertex and Edge RDDs for a bit more information on how graphs are represented and partitioned.
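For illustration, that repartitioning looks like this (a sketch; the strategy is an arbitrary example, not a recommendation):

import org.apache.spark.graphx.PartitionStrategy

// Edges are redistributed purely by their source/destination vertex ids;
// the String attribute has no influence on which partition an edge lands in.
val repartitioned = myGraph.partitionBy(PartitionStrategy.EdgePartition2D)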