Loading data into a Tuple using Spark - apache-spark

Can anyone help me understand the below error?
java.sql.SQLException: Processing attribute att_name failed, keylist
has different length with UDT fields at
com.tigergraph.jdbc.restpp.RestppPreparedStatement.executeBatch(RestppPreparedStatement.java:222)
Using Spark, I am trying to load values into a tuple that I created in my graph (via the JDBC driver). The tuple mentioned above contains 5 attributes of various data types.
How should I structure the data for my tuple inside a DataFrame? I tried keeping the tuple values as an array, but the JDBC driver didn't allow me to write an array into the graph DB. I then flattened the array into a string, but now I am getting the error:
KeyList has a different length with UDT fields
Target DB: TigerGraph
Graph Name: MyGraph
Edge Name: Ed_testedge
Edge Attributes: c1, c2, mytuple(c3, c4, c5) // three attributes in total, one of which is a tuple (UDT)
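As a hedged sketch of what the Spark side could look like, the DataFrame below flattens the tuple fields (c3, c4, c5) into a single delimited string column before handing the rows to the JDBC writer. The column names, the delimiter, and the sample values are assumptions rather than the driver's documented format; the point the error is making is that the number of flattened parts must match the UDT definition exactly.
// Sketch only: names, delimiter and sample values are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat_ws

val spark = SparkSession.builder().appName("LoadEdgeWithTuple").getOrCreate()
import spark.implicits._

// Raw rows: edge endpoints, plain attributes c1 and c2, and the tuple fields c3, c4, c5.
val raw = Seq(
  ("v1", "v2", 10, "abc", 1, 2.0, "x"),
  ("v3", "v4", 20, "def", 3, 4.0, "y")
).toDF("from", "to", "c1", "c2", "c3", "c4", "c5")

// Collapse the tuple fields into one delimited string column; the number of parts must
// match the UDT's field count, or the writer reports "keylist has different length with UDT fields".
val edges = raw.select(
  $"from", $"to", $"c1", $"c2",
  concat_ws(",", $"c3", $"c4", $"c5").as("mytuple")
)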

Related

Is the runtime of get_json_object constant regardless of data size?

I am trying to compare the speed hit of storing the data as JSON and then calling get_json_object to turn it into data Spark can handle whenever it is needed, versus ingesting the data with the schema attached and dropping the JSON element.
Has anyone analyzed the runtime of these two plans to deal with data?
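For reference, here is a minimal sketch of the two approaches being compared; the jsonDf DataFrame and its payload column are made up for illustration. get_json_object re-parses the JSON string every time it is evaluated, while parsing once with an explicit schema (via from_json) produces typed columns up front.
// Illustrative sketch only; the data and column names are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{get_json_object, from_json, col}
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val spark = SparkSession.builder().appName("JsonParsingComparison").getOrCreate()
import spark.implicits._

// Toy input: a single JSON string column named "payload".
val jsonDf = Seq("""{"name":"alice"}""", """{"name":"bob"}""").toDF("payload")

val schema = StructType(Seq(StructField("name", StringType)))

// Approach 1: keep the raw JSON and extract lazily - the string is re-parsed each time this is evaluated.
val lazyParsed = jsonDf.select(get_json_object(col("payload"), "$.name").as("name"))

// Approach 2: parse once with an explicit schema, then work with typed columns.
val parsedOnce = jsonDf
  .select(from_json(col("payload"), schema).as("data"))
  .select(col("data.name").as("name"))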

How to check length of text field of Cassandra table

There is a field 'name' in our Cassandra database whose data type is 'text'.
How do I retrieve the rows where the length of the 'name' field is greater than some number, using a Cassandra query?
As was pointed out in the comments, it's easy to add a user-defined function and use it to retrieve the length of the text field, but the catch is that you can't use a user-defined function in the WHERE condition (see CASSANDRA-8488).
Even if it were possible, having only this as a condition would be a bad query for Cassandra, as it would need to go through all of the data in the database and filter it out. For such tasks, tools like Spark are usually used - you can read the data via the Spark Cassandra Connector and apply the necessary filtering conditions. This still involves reading all of the data from the database and then filtering it, so it will be quite a bit slower than normal CQL queries, but at least it is automatically parallelized.
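A minimal sketch of that Spark-side filtering, assuming the Spark Cassandra Connector is on the classpath and using placeholder keyspace/table names:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length}

// "my_keyspace" and "my_table" are placeholders for your own keyspace and table.
val spark = SparkSession.builder().appName("NameLengthFilter").getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

// The length filter runs in Spark, after the data has been read from Cassandra.
val longNames = df.filter(length(col("name")) > 10)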

Iterating a GraphTraversal with GraphFrame causes UnsupportedOperationException Row to Vertex conversion

The following
GraphTraversal<Row, Edge> traversal = gf().E().hasLabel("foo").limit(5);
while (traversal.hasNext()) {}
causes the following Exception:
java.lang.UnsupportedOperationException: Row to Vertex conversion is not supported: Use .df().collect() instead of the iterator
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.iterator$lzycompute(DseGraphTraversal.scala:92)
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.iterator(DseGraphTraversal.scala:78)
at com.datastax.bdp.graph.spark.graphframe.DseGraphTraversal.hasNext(DseGraphTraversal.scala:129)
The exception says to use .df().collect(), but gf().E().hasLabel("foo") does not let you call .df() afterwards; in other words, the df() method is not available on the object returned by hasLabel().
I'm using the Java API via dse-graph-frames:5.1.4 along with dse-byos_2.11:5.1.4.
The short answer: you need to cast the GraphTraversal to DseGraphTraversal, which has a df() method. Then use one of the Spark Dataset methods to collect the Rows:
List<Row> rows = ((DseGraphTraversal) graph.E().hasLabel("foo"))
    .df().limit(5).collectAsList();
DseGraphFrame does not yet support the full TinkerPop specification, so you cannot receive TinkerPop Vertex or Edge objects (the limit() method is also not implemented in DSE 5.1.x). It is recommended to switch to the Spark Dataset API via the df() call, get a Dataset<Row>, and use Dataset-based filtering and collecting.
If you only need Edge/Vertex properties, you can still use TinkerPop valueMap() or values():
GraphTraversal<Row, Map<String, Object>> traversal = graph.E().hasLabel("foo").valueMap();
while (traversal.hasNext()) {
    Map<String, Object> properties = traversal.next(); // each element is a map of edge properties
}

Accessing client side objects and code

A Spark application needs to validate each element in an RDD.
Given a driver/client-side Scala object called Validator, which of the following two solutions is better:
rdd.filter { x => if (Validator.isValid(x.somefield)) true else false }
or something like
// get list of the field to validate against
val list = rdd.map(x => x.somefield)
// Use the Validator to check which ones are invalid
var invalidElements = Validator.getValidElements().diff(list)
// remove invalid elements from the RDD
rdd.filter(x => !invalidElements.contains(x.somefield))
The second solution avoids referencing the driver-side object from within the function passed to the RDD. The invalid elements are determined on the client, and that list is then passed back to the RDD.
Or is neither recommended?
Thanks
If I understand you correctly (i.e. you have an object Validator), that's not driver-only code, because your job's JAR will also be distributed to the workers, so a Scala object you define will also be instantiated in the executor JVM. (That's also why you don't get a serialization exception here, in contrast to using methods defined in the job itself, e.g. in Spark Streaming with checkpointing.)
The first version should perform better because you filter first. Mapping over all of the data and then filtering it will be slower.
The second version is also problematic because if you are creating a list of valid elements on the driver, you now have to ship it back to the workers.
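In other words, the first approach is the recommended shape; a minimal sketch, reusing the asker's Validator.isValid and somefield names, is simply:
// Validator is a Scala object shipped in the job JAR, so each executor instantiates its own copy.
val validRdd = rdd.filter(x => Validator.isValid(x.somefield))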

Caching in Spark

A function is defined to transform an RDD; that is, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection inside the Scala class that contains the function being passed to the RDD. Would this be efficient, or is there a better approach for caching in Spark?
Thanks
There isn't really any mechanism for "caching" in the sense that you mean. It seems like the best approach would be to split this task into two phases:
Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key
Use this mapping to perform the lookup for each record in the RDD
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
map your data into the distinct keys by which you'd want to cache these fetched values, and collect them, e.g.: val keys = inputRdd.map(/* get key */).distinct().collect()
perform the fetching on driver-side (not using Spark)
use the resulting Map[Key, FetchedValues] in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).toMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively - if this map is large (but still can fit in driver memory), you can broadcast it before you use it in RDD transformation - see Broadcast Variables in Spark's Programming Guide
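A sketch of that broadcast variant, reusing lookupTable and inputRdd from above (sc is assumed to be the application's SparkContext):
// Broadcast the lookup map once per executor instead of serializing it into every task closure.
val lookupBroadcast = sc.broadcast(lookupTable)
val withValues = inputRdd.map(record => (record, lookupBroadcast.value(record.foreignId)))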
Otherwise (if this map might be too large) - you'll need to use a join if you want to keep the data in the cluster while still refraining from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
  .keys
  .distinct()
  .map(k => (k, fetchValue(k))) // this time fetchValue runs in the cluster - concurrently for different keys
val withValues = byKeyRdd.join(lookupTableRdd)
