Iterating a GraphTraversal with GraphFrame causes UnsupportedOperationException Row to Vertex conversion - cassandra

The following
GraphTraversal<Row, Edge> traversal = gf().E().hasLabel("foo").limit(5);
while (traversal.hasNext()) {}
causes the following Exception:
java.lang.UnsupportedOperationException: Row to Vertex conversion is not supported: Use .df().collect() instead of the iterator
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.iterator$lzycompute(DseGraphTraversal.scala:92)
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.iterator(DseGraphTraversal.scala:78)
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.hasNext(DseGraphTraversal.scala:129)
Exception says to use .df().collect() but gf().E().hasLabel("foo") does not allow you to do .df() afterwards. In other words, method df() is not there for object returned by hasLabel()
I'm using the Java API via dse-graph-frames:5.1.4 along with dse-byos_2.11:5.1.4.

The short answer: You need to cast GraphTraversal to DseGraphTraversal that has df() method. Then use one of spark Dataset methods to collect Rows:
List<Row> rows =
((DseGraphTraversal)graph.E().hasLabel("foo"))
.df().limit(5).collectAsList();
DseGraphFrame does not yet support full TinkerPop specification. So you can not receive TinkerPop Vertex or Edge objects. ( limit() method is also not implemented in DSE 5.1.x). It is recommended to switch to spark dataset api with df() call, get Dataset<Row> and use Dataset base filtering and collecting
If you need only Edge/Vertex properties you still can use TinkerPop valueMap() or values()
GraphTraversal<Row, Map<String,Object>> traversal = graph.E().hasLabel("foo").valueMap();
while (traversal.hasNext()) {}

Related

Loading data into a Tuple using Spark

Can anyone help me understand the below error?
java.sql.SQLException: Processing attribute att_name failed, keylist
has different length with UDT fields at
com.tigergraph.jdbc.restpp.RestppPreparedStatement.executeBatch(RestppPreparedStatement.java:222)
Through Spark, I am trying to load values into a Tuple that I created in my graph(using JDBC Driver). The above-written tuple contains 5 attributes of various data types.
How do I need to keep the data for my tuple inside a DataFrame? I tried to keep them as an Array. But, the JDBC Driver didn't allow me to write an Array into the Graph DB. I then flattened the array into a string. But, I am getting an error:
KeyList has a different length with UDT fields
Target DB: TigerGraph
Graph Name: MyGraph
Edge Name:Ed_testedge
Edge Attributes: c1,c2,mytuple(c3,c4,c5) //totally three attributes are present out of which one of them is a Tuple.

Differences: Object instantiation within mapPartitions vs outside

I'm a beginner to Apache Spark. Spark's RDD API offers transformation functions like map, mapPartitions. I can understand that map works with each element from the RDD but mapPartitions works with each partition and many people have mentioned mapPartitions are ideally used where we want to do object creation/instantiation and provided examples like:
val rddData = sc.textFile("sample.txt")
val res = rddData.mapPartitions(iterator => {
//Do object instantiation here
//Use that instantiated object in applying the business logic
})
My question is can we not do that with map function itself by doing object instantiation outside the map function like:
val rddData = sc.textFile("sample.txt")
val obj = InstantiatingSomeObject
val res = rddData.map(element =>
//Use the instantiated object 'obj' and do something with data
)
I could be wrong in my fundamental understanding of map and mapPartitions and if the question is wrong, please correct me.
All objects that you create outside of your lambdas are created on the driver. For each execution of the lambda, they are sent over the network to the specific executor.
When calling map, the lambda is executed once per data element, causing to send your serialized object once per data element over the network. When using mapPartitions, this happens only once per partition. However, even when using mapPartions, it would usually be better to create the object inside of your lambda. In many cases, your object is not serializable (like a database connection for example). In this case you have to create the object inside your lambda.

How does the filter operation of Spark work on GraphX edges?

I'm very new to Spark and don't really know the basics, I just jumped into it to solve a problem. The solution for the problem involves making a graph (using GraphX) where edges have a string attribute. A user may wish to query this graph and I handle the queries by filtering out only those edges that have the string attribute which is equal to the user's query.
Now, my graph has more than 16 million edges; it takes more than 10 minutes to create the graph when I'm using all 8 cores of my computer. However, when I query this graph (like I mentioned above), I get the results instantaneously (to my pleasant surprise).
So, my question is, how exactly does the filter operation search for my queried edges? Does it look at them iteratively? Are the edges being searched for on multiple cores and it just seems very fast? Or is there some sort of hashing involved?
Here is an example of how I'm using filter: Mygraph.edges.filter(_.attr(0).equals("cat")) which means that I want to retrieve edges that have the attribute "cat" in them. How are the edges being searched?
How can the filter results be instantaneous?
Running your statement returns so fast because it doesn't actually perform the filtering. Spark uses lazy evaluation: it doesn't actually perform transformations until you perform an action which actually gathers the results. Calling a transformation method, like filter just creates a new RDD that represents this transformation and its result. You will have to perform an action like collect or count to actually have it executed:
def myGraph: Graph = ???
// No filtering actually happens yet here, the results aren't needed yet so Spark is lazy and doesn't do anything
val filteredEdges = myGraph.edges.filter()
// Counting how many edges are left requires the results to actually be instantiated, so this fires off the actual filtering
println(filteredEdges.count)
// Actually gathering all results also requires the filtering to be done
val collectedFilteredEdges = filteredEdges.collect
Note that in these examples the filter results are not stored in between: due to the laziness the filtering is repeated for both actions. To prevent that duplication, you should look into Spark's caching functionality, after reading up on the details on transformations and actions and what Spark actually does behind the scene: https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
How exactly does the filter operation search for my queried edges (when I execute an action)?
in Spark GraphX the edges are stored in a an RDD of type EdgeRDD[ED] where ED is the type of your edge attribute, in your case String. This special RDD does some special optimizations in the background, but for your purposes it behaves like its superclass RDD[Edge[ED]] and filtering occurs like filtering any RDD: it will iterate through all items, applying the given predicate to each. An RDD however is split into a number of partitions and Spark will filter multiple partitions in parallel; in your case where you seem to run Spark locally it will do as many in parallel as the number of cores you have, or how much you have specified explicitly with --master local[4] for instance.
The RDD with edges is partitioned based on the PartitionStrategy that is set, for instance if you create your graph with Graph.fromEdgeTuples or by calling partitionBy on your graph. All strategies are based on the edge's vertices however, so don't have any knowledge about your attribute, and so don't affect your filtering operation, except maybe for some unbalanced network load if you'd run it on a cluster, all 'cat' edges end up in the same partition/executor and you do a collect or some shuffle operation. See the GraphX docs on Vertex and Edge RDDs for a bit more information on how graphs are represented and partitioned.

Accessing client side objects and code

A Spark applications needs to validate each element in an RDD.
Given a driver\client side Scala object called Validator, which of the following two solutions is better:
rdd.filter { x => if Validator.isValid(x.somefield) true else false }
or something like
// get list of the field to validate against
val list = rdd.map(x => x.somefield)
// Use the Validator to check which ones are invalid
var invalidElements = Validator.getValidElements().diff(list)
// remove invalid elements from the RDD
rdd.filter(x => !invalidElements.contains(x.somefield))
The second solution avoids referencing the driver side object from within the function passed to the RDD. The invalid elements are determined on the client, that list is then passed back to the RDD.
Or is neither recommended?
Thanks
If I understand you correctly (i.e. you have an object Validator), that's not driver code, because your job's Jar will also be distributed to the workers. So a Scala object you define will also be instantiated in the executor JVM. (That's also why you don't receive a serialization exception in contrast to using methods defined in the job, e.g. in Spark Streaming with checkpointing).
The first version should perform better because you filter first. Mapping over all of the data and then filtering it will be slower.
The second version is also problematic because if you are creating a list of valid elements on the driver, you now have to ship it back to the workers.

Caching in Spark

A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection with the Scala class which contains the function being passed to the RDD. Would this be efficient, or is there a better approach for caching in Spark?
Thanks
There isn't really any mechanism for "caching" (in the sense that you mean). Seems like the best approach would be to split this task into two phases:
Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key
Use this mapping to perform the lookup for each record in the RDD
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
map your data into the distinct keys by which you'd want to cache these fetched values, and collect it, e.g. : val keys = inputRdd.map(/* get key */).distinct().collect()
perform the fetching on driver-side (not using Spark)
use the resulting Map[Key, FetchedValues] in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).asMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively - if this map is large (but still can fit in driver memory), you can broadcast it before you use it in RDD transformation - see Broadcast Variables in Spark's Programming Guide
Otherwise (if this map might be too large) - you'll need to use join if you want keep data in the cluster, but to still refrain from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
.keys()
.distinct()
.map(k => (k, fetchValue(k))) // this time fetchValue is done in cluster - concurrently for different values
val withValues = byKeyRdd.join(lookupTableRdd)

Resources