RDD joinWithCassandraTable - apache-spark

Can anyone please help me with the query below?
I have an RDD with 5 columns, and I want to join it with a table in Cassandra.
I know there is a way to do that by using "joinWithCassandraTable", and I saw a syntax for it somewhere.
Syntax:
RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb"))
.on(SomeColumns("colc"))
Can anyone please send me the correct syntax?
I would actually like to know where to specify the column name of the table that is the key to join on.

joinWithCassandraTable works by pulling from C* only the partition keys that match your RDD entries, so it only works on partition keys.
The documentation is here
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
and API Doc is here
http://datastax.github.io/spark-cassandra-connector/ApiDocs/1.6.0-M2/spark-cassandra-connector/#com.datastax.spark.connector.RDDFunctions
The joinWithCassandraTable method can be used without the fluent API by specifying all the arguments in the method call:
def joinWithCassandraTable[R](
keyspaceName: String,
tableName: String,
selectedColumns: ColumnSelector = AllColumns,
joinColumns: ColumnSelector = PartitionKeyColumns)
But the fluent API can also be used:
joinWithCassandraTable[R](keyspace, tableName).select(AllColumns).on(PartitionKeyColumns)
These two calls are equivalent
Your example
RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb")) .on(SomeColumns("colc"))
uses the objects in the RDD to join against colc of tablename and returns only cola and colb in the join results.
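As a rough illustration of the fluent form (a minimal sketch: the keyspace test, table users, and its columns user_id, name and city are made-up names, not from your schema):
import com.datastax.spark.connector._
// Hypothetical table: test.users (user_id int PRIMARY KEY, name text, city text)
case class UserKey(user_id: Int)
val keys = sc.parallelize(Seq(UserKey(1), UserKey(2), UserKey(3)))
// Join the RDD against the table on its partition key,
// returning only the name and city columns.
val joined = keys
  .joinWithCassandraTable("test", "users")
  .select("name", "city")
  .on(SomeColumns("user_id"))
joined.collect().foreach { case (k, row) =>
  println(s"${k.user_id} -> ${row.getString("name")}, ${row.getString("city")}")
}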

Use the syntax below for the join in Cassandra:
joinedData = rdd.joinWithCassandraTable(keyspace, table).on(partitionKeyName).select(columnNames)
It will look something like this:
joinedData = rdd.joinWithCassandraTable(keyspace,table).on('emp_id').select('emp_name', 'emp_city')

Related

Couchbase N1QL query with joined subquery with USE KEYS

I need to write an N1QL query that requires another sub-query in the SELECT clause. Since it is mandatory to use 'USE KEYS' when writing subqueries in N1QL, how do I write the USE KEYS clause for an inner-joined query? Below is an example of the same case:
select meta(m).id as _ID, meta(m).cas as _CAS,
(select c.description
from bucketName p join bucketName c on p.categoryId = c.categoryId and p.type='product' and
c.type='category' and p.masterId=m.masterId ) as description //--How to use USE KEYS here ?
from bucketName m where m.type='master' and m.caseId='12345'
My requirement is to fetch some values from another two joined tables; however, I simplified the query above to make it more understandable.
Please suggest the correct way to implement this.
Also, is writing sub-queries in N1QL better than fetching the documents separately and merging them in code?
Correlated subqueries outside the FROM clause require USE KEYS, because queries against global secondary indexes can take a long time and consume resources. This is a restriction in N1QL at present. If you can derive p's document key from m, you can give that as USE KEYS on p.
Otherwise you have two options
Option 1: As your subquery is in the projection, use an ANSI JOIN: https://blog.couchbase.com/ansi-join-support-n1ql/
SELECT META(m).id AS _ID, META(m).cas AS _CAS, c.description
FROM bucketName AS m
LEFT JOIN bucketName AS p ON p.masterId=m.masterId AND p.type='product'
LEFT JOIN bucketName AS c ON c.type='category' AND p.categoryId = c.categoryId
WHERE m.type='master' AND m.caseId='12345';
CREATE INDEX ix1 ON bucketName(caseId) WHERE type='master';
CREATE INDEX ix2 ON bucketName(masterId, categoryId) WHERE type='product';
CREATE INDEX ix3 ON bucketName(categoryId, description) WHERE type='category';
NOTE: If there is no unique relation from m to p to c, the JOIN can produce more results.
If that is the case, you can GROUP BY META(m).id, META(m).cas and use
ARRAY_AGG(c.description); all the descriptions are then returned as an ARRAY.
Option 2:
As you described, issue two separate queries and merge the results in the application.

How can you update values in a dataset?

So, as far as I know, Apache Spark doesn't have functionality that imitates the SQL UPDATE command, e.g. changing a single value in a column given a certain condition. The only way around that is to use the following command I was instructed to use (here on Stack Overflow): withColumn(columnName, where('condition', value));
However, the condition should be of Column type, meaning I have to use the built-in column filtering functions Spark has (equalTo, isin, lt, gt, etc.). Is there a way I can use an SQL statement instead of those built-in functions?
The problem is that I'm given a text file with SQL statements, like WHERE ID > 5 or WHERE AGE != 50, etc. Then I have to label values based on those conditions, and I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can get around this?
I found a way to get around this:
You want to split your dataset into two sets: the values you want to update and the values you don't want to update.
Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", lit("updateValue"));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);
This, however, doesn't keep the same order of records as the original dataset, so if order is important to you, this won't meet your needs.
In PySpark you have to use .subtract instead of .except
If you are using a DataFrame, you can register that DataFrame as a temp table
using df.registerTempTable("events")
Then you can query it like
sqlContext.sql("SELECT * FROM events " + whereClause)
where whereClause stands for whatever condition string you append (for example, one read from your text file).
The when clause translates into a CASE clause, which you can relate to the SQL CASE clause.
Example
scala> val condition_1 = when(col("col_1").isNull,"NA").otherwise("AVAILABLE")
condition_1: org.apache.spark.sql.Column = CASE WHEN (col_1 IS NULL) THEN NA ELSE AVAILABLE END
or you can chain when clause as well
scala> val condition_2 = when(col("col_1") === col("col_2"),"EQUAL").when(col("col_1") > col("col_2"),"GREATER").
| otherwise("LESS")
condition_2: org.apache.spark.sql.Column = CASE WHEN (col_1 = col_2) THEN EQUAL WHEN (col_1 > col_2) THEN GREATER ELSE LESS END
scala> val new_df = df.withColumn("condition_1",condition_1).withColumn("condition_2",condition_2)
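If the condition arrives as SQL text (as in the question), one option worth sketching is functions.expr, which parses an SQL expression string into a Column that can be fed to when; the AGE column and the labels below are assumptions, and a leading WHERE keyword would have to be stripped first.
scala> import org.apache.spark.sql.functions.{expr, when}
scala> val condition_3 = when(expr("AGE != 50"), "FLAGGED").otherwise("OK")
scala> val labeled_df = df.withColumn("condition_3", condition_3)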
Still, if you want to use a table, you can register your dataframe/dataset as a temporary table and perform SQL queries:
df.createOrReplaceTempView("tempTable")//spark 2.1 +
df.registerTempTable("tempTable")//spark 1.6
Now, you can perform sql queries
spark.sql("your queries goes here with case clause and where condition!!!")//spark 2.1
sqlContext.sql("your queries goes here with case clause and where condition!!!")//spark 1.6
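For instance, a minimal sketch against the temp view registered above (the AGE column and the label values are assumptions, not from the original data) that emulates an UPDATE with a CASE expression:
val labeled = spark.sql("SELECT *, CASE WHEN AGE != 50 THEN 'flagged' ELSE 'ok' END AS label FROM tempTable")//spark 2.1
labeled.show()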
If you are using a Java Dataset,
you can update the dataset as shown below.
Here is the code:
Dataset ratesFinal1 = ratesFinal.filter(" on_behalf_of_comp_id != 'COMM_DERIVS' ");
ratesFinal1 = ratesFinal1.filter(" status != 'Hit/Lift' ");
Dataset ratesFinalSwap = ratesFinal1.filter (" on_behalf_of_comp_id in ('SAPPHIRE','BOND') and cash_derivative != 'cash'");
ratesFinalSwap = ratesFinalSwap.withColumn("ins_type_str",functions.lit("SWAP"));
Adding a new column with the value from an existing column:
ratesFinalSTW = ratesFinalSTW.withColumn("action", ratesFinalSTW.col("status"));

Spark Cassandra Connector keyBy and shuffling

I am trying to optimize my spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create the RDD.
The column family's column names are dynamic, thus it is defined as follows:
CREATE TABLE "Profile" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
key - the RowKey
column1 - the value of column1 is the name of the dynamic column
value - the value of the dynamic column
So if I have RK='profile1', with columns name='George' and age='34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group elements that share the same key together to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
It is important to say that all the elements I need to group are on the same Cassandra node (they share the same row key), so I expect the connector to keep the data local.
The problem is that using groupBy or groupByKey causes shuffling. I would rather group them locally, because all the data is on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
.cassandraTable(ks, "Profile")
.groupBy(new Function<ColumnFamilyModel, String>() {
@Override
public String call(ColumnFamilyModel arg0) throws Exception {
return arg0.getKey();
}
})
My questions are:
Will using keyBy on the RDD cause shuffling, or will it keep the data local?
Is there a way to group the elements by key without shuffling? I read about mapPartitions, but didn't quite understand how to use it.
Thanks,
Shai
I think you are looking for spanByKey, a cassandra-connector-specific operation that takes advantage of the ordering provided by Cassandra to allow grouping of elements without incurring a shuffle stage.
In your case, it should look like:
sc.cassandraTable("keyspace", "Profile")
.keyBy(row => (row.getString("key")))
.spanByKey
Read more in the docs:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key
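The same page also documents spanBy, which skips the explicit keyBy; a minimal sketch, assuming a keyspace named "ks", would be:
import com.datastax.spark.connector._
// groups adjacent rows that share the same partition key, without a shuffle stage
val grouped = sc.cassandraTable("ks", "Profile")
  .spanBy(row => row.getString("key"))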

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0),s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name by using pattern-matching syntax:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as they would each need to be matched on the left. (This comes from a very useful comment in the source code for the Row companion object.)
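For instance, a small sketch matching both columns of the query above by name (each position still has to be bound on the left):
val pairs = original.map { case Row(myVal: Int, name: String) => (name, myVal) }
pairs.collect().foreach(println)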

JPQL / QueryDSL: join subquery and get aliased column

I'm trying to get an average for a count on a groupBy by joining with a subquery. I don't know if that's the right way to go at all, but I couldn't find anything about subqueries other than the Mysema docs.
Scenario:
How many orders per product did a customer do on average?
Meaning: A Customer orders products. So a customer ordered a specific product a number of times (count). What's the average number of orders that customer placed for any product?
This might sound a bit hypothetical; in fact it's just part of a prototype, but it made me wonder how to get a reference to a custom column created within a subquery with the fancy QueryDSL from Mysema.
In SQL you just give the count column an alias and join using a second ID column. QueryDSL has the "as()" method as well, but I have no idea how to retrieve that column, plus I don't see how it can join one query with another, since query.list() just gets a list, but for some reason the join accepts it. Feels wrong...
Here's my code:
JPQLQuery query = createJPQLQuery();
QOrdering qOrdering = QOrdering.ordering;
QProduct qProduct = QProduct.product;
QCustomer qCustomer = QCustomer.customer;
// how many of each product did a customer order?
HibernateSubQuery subQuery = new HibernateSubQuery();
subQuery.from(qOrdering).innerJoin(qOrdering.product,qProduct).innerJoin(qOrdering.customer, qCustomer);
subQuery.groupBy(qCustomer,qProduct).list(qCustomer.id,qProduct.id,qProduct.count());
// get the average number of orders per product for each customer
query.from(qCustomer);
query.innerJoin(subQuery.list(qCustomer.id,qOrdering.count().as("count_orders")));
query.groupBy(qCustomer.id);
return (List<Object[]>) query.list(qCustomer.firstname,subQuery.count_orders.avg());
Again: How do I join with a subquery?
How do I get the aliased "count" column to do more aggregation like avg (and is my groupBy right, by the way)?
I might have some other errors in this, so any help is appreciated!
Thanks!
Edit:
That's kind of the native SQL I'd like to see QueryDSL produce:
Select avg(numOrders) as average, cust.lastname from
customer cust
inner join
(select count(o.product_id) as numOrders, c.id as cid, p.name
from ordering o
inner join product p on o.product_id=p.id
inner join customer c on o.customer_id=c.id
group by o.customer_id, o.product_id) as numprods
on cust.id = numprods.cid
group by numprods.cid
order by cust.lastname;
Using subqueries in the join clause is not allowed: in JPQL, subqueries are only allowed in the WHERE and HAVING parts. The join method signatures in Querydsl JPA queries are too wide.
As this query needs two levels of grouping, maybe it can't be expressed with JPQL / Querydsl JPA.
I'd suggest writing this query using the Querydsl JPA Native query support.
As Querydsl JPA uses JPQL internally, it is restricted by the expressiveness of JPQL.
I know that this question is old and already has an accepted answer, but judging from this question, it seems to still be troubling people. See my answer to that question: the use of a JoinFlag in the join() section and Expression.path() makes it possible to left-join a subquery. Hope this helps someone.
QueryDSL does not support subqueries in joins, but you can achieve this in the following way:
We wanted to achieve the following query:
select A.* from A join (select aid from B group by aid) b on b.aid=A.id;
Map a View or SQL query to JPA entity:
import lombok.Getter;
import lombok.Setter;
import org.hibernate.annotations.Subselect;
import org.hibernate.annotations.Synchronize;
import javax.persistence.Entity;
import javax.persistence.Id;
@Entity
@Getter
@Setter
@Subselect("select aid from B group by aid")
@Synchronize("B")
public class BGroupByAid {
    @Id
    private Integer aId;
}
Then use the equivalent QueryDSL entity in the query just like a regular entity:
JPAQuery<QAsset> query = new JPAQuery<>(entityManager);
QBGroupByAid bGroupById = QBGroupByAid.bGroupByAid;
List<A> tupleOfAssets =
query.select(A)
.from(A).innerJoin(bGroupById).on(bGroupById.aId.eq(A.aId))
.fetchResults()
.getResults();
You can also use Blaze-Persistence, which also supports subqueries in joins. I have tried it and it works. You can create a SubQueryExpression, for example, like this:
SubQueryExpression<Tuple> sp2 = getQueryFactory().select(entity.id,
JPQLNextExpressions.rowNumber().over().partitionBy(entity.folId).orderBy(entity.creationDate.desc()).as(rowNumber))
.from(entity)
.where(Expressions.path(Integer.class, rowNumber).eq(1));
and then just join it like this:
return getBlazeQueryFactory()
.select(entity1, entity)
.from(entity1)
.leftJoin(sp2, entity).on(entity.id.eq(entity1.id)).fetch();
I have put just a simple example here, so it may not make perfect sense, but it may be helpful.
Also, don't be confused if it produces a UNION in the generated select; this is just for naming the columns from the subquery. You can read a better explanation about this union here: Blaze-Persistence GROUP BY in LEFT JOIN SUBQUERY with COALESCE in root query
