Spark Cassandra Connector keyBy and shuffling - cassandra

I am trying to optimize my spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create the RDD.
The column family's column names are dynamic, thus it is defined as follows:
CREATE TABLE "Profile" (
  key text,
  column1 text,
  value blob,
  PRIMARY KEY (key, column1)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='ALL' AND
  ...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
key - the RowKey
column1 - the value of column1 is the name of the dynamic column
value - the value of the dynamic column
So if I have RK='profile1', with columns name='George' and age='34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group elements that share the same key together to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
It is important to note that all the elements I need to group are on the same Cassandra node (they share the same row key), so I expect the connector to preserve the locality of the data.
The problem is that using groupBy or groupByKey causes shuffling. I would rather group them locally, because all the data is on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
    .cassandraTable(ks, "Profile")
    .groupBy(new Function<ColumnFamilyModel, String>() {
        @Override
        public String call(ColumnFamilyModel arg0) throws Exception {
            return arg0.getKey();
        }
    });
My questions are:
Will using keyBy on the RDD cause shuffling, or will it keep the data local?
Is there a way to group the elements by key without shuffling? I read about mapPartitions, but didn't quite understand how to use it.

I think you are looking for spanByKey, a cassandra-connector specific operation that takes advantage of the ordering provided by Cassandra to group elements without incurring a shuffle stage.
In your case, it should look like:
sc.cassandraTable("keyspace", "Profile")
  .keyBy(row => row.getString("key"))
  .spanByKey
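To make the idea behind spanByKey concrete, here is a minimal plain-Java sketch (no Spark or connector involved) of grouping consecutive elements that share a key in a single pass. It relies on the same assumption the answer makes: rows with the same key arrive next to each other. The class name, the method name and the "profile2" sample row are made up for illustration; this is not the connector's implementation.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SpanByKeySketch {
    // Groups *consecutive* rows that share a key, in one pass, without hashing,
    // sorting or moving data. Each row is a {key, column1, value} array.
    static List<Map.Entry<String, List<String[]>>> spanByKey(List<String[]> rows) {
        List<Map.Entry<String, List<String[]>>> groups = new ArrayList<>();
        String currentKey = null;
        List<String[]> current = null;
        for (String[] row : rows) {
            // A change of key closes the current group and opens a new one.
            if (current == null || !row[0].equals(currentKey)) {
                currentKey = row[0];
                current = new ArrayList<>();
                groups.add(new SimpleEntry<>(currentKey, current));
            }
            current.add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"profile1", "age", "34"},
            new String[] {"profile1", "name", "George"},
            new String[] {"profile2", "name", "Anna"}); // hypothetical extra row
        for (Map.Entry<String, List<String[]>> g : spanByKey(rows)) {
            System.out.println(g.getKey() + " -> " + g.getValue().size() + " columns");
        }
    }
}
```

Note that if rows with the same key are not adjacent, they end up in separate groups, which is why this only works on data that is already ordered by key.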
Read more in the docs:


Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
  key text,
  ckey text,
  value text,
  PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
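import scala.util.Random

val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
  .foreachPartition { partition => // foreachPartition rather than mapPartitions: nothing is returned
    connector.withSessionDo { session =>
      partition.foreach { row =>
        val key = row.getString("key")
        val ckey = Random.nextString(42)
        val value = row.getString("value")
        // Bind markers instead of concatenation: the second string literal had
        // no s-interpolator and the text values were not quoted.
        session.execute(
          "INSERT INTO keyspace.table (key, ckey, value) VALUES (?, ?, ?)",
          key, ckey, value)
      }
    }
  }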
Is it possible for code like this to read an inserted value within a single application (Spark job) run? A more general version of my question: can a token range scan CQL query read newly inserted values while iterating over rows?
Yes, it is possible, exactly as Alex wrote, but I don't think it will happen with the code above.
Per the data model, the rows within a partition are ordered by ckey in ascending order.
The interesting part is the page size and how many pages are prefetched. The default is 1000 rows (spark.cassandra.input.fetch.sizeInRows), so a problem could only occur if you used something bigger instead of 42 and/or the executor had not paged yet.
I also think the nesting is unnecessary, so the code to achieve what you want could be simplified (after all, cassandraTable already gives you an RDD).
(I hope I understand correctly that you want to read per partition (note that a partition in your case is all rows under one partition key, "key") and, for every row in that partition (distinguished by ckey), generate a new row that duplicates the value under a new ckey. The use case for such code is a mystery to me, but I hope it makes some sense :-))
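The paging argument above can be illustrated with a toy model in plain Java. It is only a sketch under simplifying assumptions: a single partition held in a TreeMap sorted by clustering key, a scan that fetches fixed-size pages and resumes after the last key it saw, and one insert that happens after the first page. All names here are hypothetical, and real Cassandra behaviour also depends on consistency level and prefetching, which this model ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class PagedScanSketch {
    // One partition: rows kept sorted by clustering key.
    static final class Partition {
        final TreeMap<String, String> rows = new TreeMap<>();

        // Returns up to pageSize clustering keys strictly after lastSeen (null = from the start).
        List<String> fetchPage(String lastSeen, int pageSize) {
            Iterable<String> keys = lastSeen == null
                ? rows.navigableKeySet()
                : rows.tailMap(lastSeen, false).navigableKeySet();
            List<String> page = new ArrayList<>();
            for (String k : keys) {
                if (page.size() == pageSize) break;
                page.add(k);
            }
            return page;
        }
    }

    // Scans the partition page by page and inserts lateKey once, right after the first page.
    static List<String> scanWithInsert(Partition p, int pageSize, String lateKey) {
        List<String> seen = new ArrayList<>();
        String last = null;
        boolean inserted = false;
        while (true) {
            List<String> page = p.fetchPage(last, pageSize);
            if (page.isEmpty()) return seen;
            seen.addAll(page);
            last = page.get(page.size() - 1);
            if (!inserted) {
                p.rows.put(lateKey, "new");
                inserted = true;
            }
        }
    }

    public static void main(String[] args) {
        Partition p = new Partition();
        for (String k : new String[] {"a", "b", "c", "d"}) p.rows.put(k, "old");
        // "z" sorts after the current scan position, so a later page picks it up.
        System.out.println(scanWithInsert(p, 2, "z"));
    }
}
```

In this model a row inserted behind the scan position (for example "aa" after the page [a, b] has been read) is never seen, while a row inserted ahead of it is. With a page size large enough to cover the whole partition in one fetch, the late insert is not seen either.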

SPARK Combining Neighbouring Records in a text file

I am very new to Spark.
I need to read a very large input dataset, but I fear the format of the input files is not amenable to reading with Spark. Format is as follows:
Ideally what I would like to do is pull the lines of the file into a Spark RDD, and then transform it into an RDD that only has one item per record (with the subrecords becoming part of their associated record item).
So if the example above was read in, I'd want to wind up with an RDD containing 3 objects: [record1,record2,record3]. Each object would contain the data from its RECORD and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data that links subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to Spark.
Is there a sensible way to do this using Spark (and if so, what could that be, what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to solve outside Spark?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, sum}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  // (record type, value) pairs in file order; the sample rows are missing from the post
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")
val win = Window.orderBy("id")
val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
  .withColumn("recid", sum($"newrec").over(win))
  .select($"recid", $"record", $"value")
val recs = recids.where($"record" === "RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))
recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet first uses zipWithIndex to number each row in order, then adds a boolean column that is true every time a "record" is identified and false otherwise. We then cast that boolean to a long and compute a running sum over it, which has the neat side effect of labeling every record and its sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
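+-----------------+--------------------+
|          recname|               attrs|
+-----------------+--------------------+
|record1identifier|    [value1, value2]|
|record2identifier|                  []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+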
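The running-sum labeling described in this answer can also be sketched without Spark, on an in-memory list. This is only an illustration of the idea, assuming each line has already been split into a {type, value} pair and that a new record is marked by the type "RECORD"; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RecordGroupingSketch {
    // Running sum of "is this line a new record?": every record and its
    // sub-records end up with the same id. Lines before the first record get 0.
    static long[] recordIds(List<String[]> lines) {
        long[] ids = new long[lines.size()];
        long running = 0;
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i)[0].equals("RECORD")) running++;
            ids[i] = running;
        }
        return ids;
    }

    // Collects sub-record values under the name of the record sharing their id.
    static Map<String, List<String>> group(List<String[]> lines) {
        long[] ids = recordIds(lines);
        Map<Long, String> names = new LinkedHashMap<>();
        Map<Long, List<String>> attrs = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String[] line = lines.get(i);
            if (line[0].equals("RECORD")) {
                names.put(ids[i], line[1]);
                attrs.put(ids[i], new ArrayList<>());
            } else if (attrs.containsKey(ids[i])) {
                attrs.get(ids[i]).add(line[1]);
            }
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<Long, String> e : names.entrySet()) {
            out.put(e.getValue(), attrs.get(e.getKey()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> lines = List.of(
            new String[] {"RECORD", "r1"},
            new String[] {"SUBRECORD", "v1"},
            new String[] {"RECORD", "r2"});
        System.out.println(group(lines));
    }
}
```

The sequential loop is exactly what the window with a global ordering replaces in the Spark version: the running sum needs every row to see all rows before it, which is why Window.orderBy("id") without a partition pulls the data into a single partition.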

Spark write only to one hbase region server

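import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{PairRDDFunctions, RDD}
import org.apache.spark.sql.SparkSession

def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
  val hConf = HBaseConfiguration.create()
  hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
  hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
  hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)
  val hJob = Job.getInstance(hConf)
  hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
  // The rest of the function is cut off in the post. The two lines below are an
  // assumption about how it ended (the usual TableOutputFormat write path).
  hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
}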
What I have found with this HBase bulk insertion is that Spark always writes to a single HBase region server, which becomes the bottleneck.
However, when I use almost the same approach to read from HBase, it uses multiple executors to read in parallel.
def bulkReadFromHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sourceTableName: String) = {
  val hConf = HBaseConfiguration.create()
  hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
  hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
  hConf.set(TableInputFormat.INPUT_TABLE, sourceTableName)
  // Result is org.apache.hadoop.hbase.client.Result
  val inputRDD = sparkContext.newAPIHadoopRDD(hConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
  inputRDD // assumed return value; the end of the function is cut off in the post
}
Can anyone please explain why this could happen? Or have I used spark-hbase bulk I/O the wrong way?
Question: have I used spark-hbase bulk I/O the wrong way?
No, your approach is right, but you need to pre-split: create the table with pre-split regions beforehand.
For example: create 'test_table', 'f1', SPLITS=> ['1', '2', '3', '4', '5', '6', '7', '8', '9']
The nine split points above give the table ten regions.
Design a good rowkey that starts with 1-9.
You can use a Guava murmur hash like below.
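import com.google.common.base.Charsets;
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

/**
 * getMurmurHash.
 *
 * @param content
 * @return HashCode
 */
public static HashCode getMurmurHash(String content) {
    final HashFunction hf = Hashing.murmur3_128();
    final HashCode hc = hf.newHasher().putString(content, Charsets.UTF_8).hash();
    return hc;
}

// rowKey: your original row key as a String
final long hash = getMurmurHash(rowKey).asLong();
final int prefix = Math.abs((int) hash % 9); // yields 0..8; add 1 to get the 1-9 prefixes used in the examples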
Now prepend this prefix to your rowkey.
For example:
1rowkey1 // goes into the region starting at '1'
2rowkey2 // goes into the region starting at '2'
3rowkey3 // goes into the region starting at '3'
...
9rowkey9 // goes into the region starting at '9'
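As a rough, self-contained illustration of how a salted prefix maps keys onto the pre-split regions, here is a sketch that uses only the JDK. It deliberately swaps Guava's murmur hash for java.util.zip.CRC32 so that it runs without extra dependencies, and it models region lookup as a binary search over the split points; both choices, and all names in it, are assumptions made for the example, not HBase's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

final class SaltedRowKeySketch {
    // Split points taken from the create statement in this answer.
    static final String[] SPLITS = {"1", "2", "3", "4", "5", "6", "7", "8", "9"};

    // Prepends a digit in 1..buckets derived from a hash of the key.
    // buckets must be at most 9 so that the prefix stays a single digit.
    static String salt(String rowKey, int buckets) {
        CRC32 crc = new CRC32(); // stand-in for murmur3; getValue() is never negative
        crc.update(rowKey.getBytes(StandardCharsets.UTF_8));
        int prefix = (int) (crc.getValue() % buckets) + 1;
        return prefix + rowKey;
    }

    // Index of the region a key falls into. Region 0 holds everything before
    // SPLITS[0]; region i (i >= 1) starts at SPLITS[i - 1]. String comparison
    // matches byte order here because the keys are plain ASCII.
    static int regionIndex(String key) {
        int i = Arrays.binarySearch(SPLITS, key);
        return i >= 0 ? i + 1 : -(i + 1);
    }

    public static void main(String[] args) {
        for (String k : new String[] {"rowkey1", "rowkey2", "rowkey3"}) {
            String salted = salt(k, 9);
            System.out.println(salted + " -> region " + regionIndex(salted));
        }
    }
}
```

Without the salt, keys such as rowkey1, rowkey2, rowkey3 all sort after '9' and land in the last region, which is the hot-spotting pattern discussed in this thread.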
If you are pre-splitting and want to manage region splits manually, you can also disable automatic region splits by setting hbase.hregion.max.filesize to a high number and setting the split policy to ConstantSizeRegionSplitPolicy. However, you should use a safeguard value such as 100 GB, so that regions do not grow beyond a region server's capabilities. You can consider disabling automated splitting and relying on the initial set of regions from pre-splitting if, for example, you are using uniform hashes for your key prefixes and you can ensure that the read/write load to each region, as well as its size, is uniform across the regions in the table.
1) Ensure that you pre-split the table before loading data into the HBase table. 2) Design a good rowkey, as explained above, using murmur hash or some other hashing technique, to ensure uniform distribution across the regions.
Also look at
Question : can anyone please explain why this could happen?
The reason is simple: hot-spotting of data into one specific region because of a poor rowkey for that table.
Consider a HashMap in Java whose elements all have the hash code 1234. It will put all the elements in one bucket, won't it? If the elements have well-distributed hash codes, it will put them in different buckets. The same is the case with HBase; here your rowkey plays the role of the hash code.
Furthermore,
what happens if I already have a table and I want to split the regions?
The RegionSplitter class provides several utilities to help in the administration lifecycle for developers who choose to manually split regions instead of having HBase handle that automatically.
The most useful utilities are:
Create a table with a specified number of pre-split regions
Execute a rolling split of all regions on an existing table
Example :
$ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
where -c 10, specifies the requested number of regions as 10, and -f specifies the column families you want in the table, separated by “:”. The tool will create a table named “test_table” with 10 regions:
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,,1358563771069.acc1ad1b7962564fc3a43e5907e8db33.', STARTKEY => '', ENDKEY => '19999999', ENCODED => acc1ad1b7962564fc3a43e5907e8db33,}
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,19999999,1358563771096.37ec12df6bd0078f5573565af415c91b.', STARTKEY => '19999999', ENDKEY => '33333332', ENCODED => 37ec12df6bd0078f5573565af415c91b,}
As discussed in the comments, you found that your final RDD, right before writing into HBase, has only 1 partition, which indicates that a single executor was holding all the data, and you are still trying to find out why.
Also, Check
spark.default.parallelism defaults to the number of all cores on all machines. The parallelize API has no parent RDD to determine the number of partitions, so it uses spark.default.parallelism.
So you can increase the number of partitions by repartitioning.
NOTE: I observed that in MapReduce the number of regions/input splits equals the number of mappers launched. Your case may be similar: the data was loaded into one particular region, which is why only one executor was launched. Please verify that as well.
Though you have not provided example data or enough explanation, this is most likely not due to your code or configuration.
It is happening because of a non-optimal rowkey design.
The data you are writing has improperly structured keys (HBase rowkeys), perhaps monotonically increasing or something similar, so the writes all go to one of the regions. You can prevent that in various ways (the recommended practices for rowkey design, such as salting, inverting, and other techniques).
For reference you can see
In case you are wondering whether the write is done in parallel for all regions or one by one (this is not clear from the question), look at this:

RDD joinWithCassandraTable

Can anyone please help me with the query below?
I have an RDD with 5 columns. I want to join it with a table in Cassandra.
I know that there is a way to do that using "joinWithCassandraTable".
I saw a syntax for it somewhere:
RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb"))
Can anyone please send me the correct syntax?
I would like to know where to specify the column of the table that is the key to join on.
joinWithCassandraTable works by pulling from C* only the partitions whose keys match your RDD entries, so it only works on partition keys.
The documentation is here
and API Doc is here
The joinWithCassandraTable method can be used without the fluent API by specifying all the arguments in the method call:
def joinWithCassandraTable[R](
  keyspaceName: String,
  tableName: String,
  selectedColumns: ColumnSelector = AllColumns,
  joinColumns: ColumnSelector = PartitionKeyColumns)
But the fluent API can also be used:
joinWithCassandraTable[R](keyspace, tableName).select(AllColumns).on(PartitionKeyColumns)
These two calls are equivalent.
Your example
RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb")).on(SomeColumns("colc"))
uses the objects from the RDD to join against colc of tablename and returns only cola and colb as join results.
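Conceptually, the join described above is a per-element key lookup rather than a scan of the whole table. The following plain-Java sketch shows that shape under loud assumptions: a Map stands in for the Cassandra table (partition key to rows), each lookup stands in for one targeted query, and elements without a matching partition are dropped. The names are hypothetical and none of this is the connector's code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class KeyLookupJoinSketch {
    // table: partition key -> rows stored under that key (stand-in for a Cassandra table).
    // keyOf: extracts the join key from a left-side element.
    static <L> List<Map.Entry<L, String>> joinOnPartitionKey(
            List<L> left,
            Function<L, String> keyOf,
            Map<String, List<String>> table) {
        List<Map.Entry<L, String>> out = new ArrayList<>();
        for (L item : left) {
            // One targeted lookup per left element; the rest of the table is never read.
            List<String> rows = table.get(keyOf.apply(item));
            if (rows == null) continue; // no matching partition
            for (String row : rows) {
                out.add(new SimpleEntry<>(item, row));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = Map.of(
            "k1", List.of("row-a", "row-b"),
            "k2", List.of("row-c"));
        System.out.println(joinOnPartitionKey(List.of("k1", "k3"), k -> k, table));
    }
}
```

This also shows why the join key has to identify a partition: the lookup is only cheap when the store can go straight to the data for that key.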
Use the syntax below to join with a Cassandra table:
joinedData = rdd.joinWithCassandraTable(keyspace,table).on(partitionKeyName).select(Column Names)
It will look something like this:
joinedData = rdd.joinWithCassandraTable(keyspace,table).on('emp_id').select('emp_name', 'emp_city')

Ordered union on spark RDDs

I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed in the documentation is that they sorted key-line-number pairs rather than key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:
// `input` stands in for the source RDD of text lines; its name is missing from the post
var keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
var keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98)))
var result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
var unionResult = result.union(keyValueRDD_2)
var finalResult = unionResult.foldByKey("")(_+_)
When I do a union of the result RDD and the keyValueRDD_2 RDD and print the output of unionResult, the contents of result and keyValueRDD_2 are not interleaved. In other words, it looks like the unionResult RDD has the keyValueRDD_2 contents followed by the result RDD contents. Moreover, when I do a foldByKey operation, which combines the values of the same key into a single key-value pair, the sorted order is destroyed. I need the fold-by-key operation in order to save the result as the original key-record pair. Is there an alternative RDD function that could be used to achieve this?
Any tips or suggestions would be quite useful.
The union method just puts two RDDs one after the other, unless they have the same partitioner, in which case it merges the corresponding partitions.
What you want to do is impossible.
When you have one RDD sorted (keyValueRDD_1) and another unsorted RDD with the same keys (keyValueRDD_2) then the only way to get the second RDD sorted is to sort it.
The existence of the sorted RDD does not help us sort the second RDD.
The Databricks article talks about an optimization that happens locally on the executors. After the shuffle step, the records are roughly sorted: each partition now covers a range of keys, but the records within each partition are unsorted.
Now you have to sort each partition locally, and this is where the prefix optimization helps with cache locality.
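The local step can be sketched in plain Java as sorting small (key, line number) pairs and only touching the full records once the order is known. This is a simplified illustration, assuming every record is a string whose first 10 characters are the key; it is not how Spark or the benchmark code implements the optimization, and the names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeyIndexSortSketch {
    static final int KEY_LEN = 10; // key length from the question

    // Sorts the records of one partition by key, comparing only the extracted
    // keys and moving only indices; the full records are read once at the end.
    static List<String> sortRecords(List<String> records) {
        Integer[] order = new Integer[records.size()];
        String[] keys = new String[records.size()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
            keys[i] = records.get(i).substring(0, KEY_LEN);
        }
        // Arrays.sort on objects is stable, so records with equal keys keep their order.
        Arrays.sort(order, (a, b) -> keys[a].compareTo(keys[b]));
        List<String> sorted = new ArrayList<>(order.length);
        for (int idx : order) {
            sorted.add(records.get(idx));
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> records = List.of("bbbbbbbbbb|payload-2", "aaaaaaaaaa|payload-1");
        System.out.println(sortRecords(records));
    }
}
```

The point of the design is that the comparison loop works on compact keys and indices instead of dragging the 90-byte values through the cache; the values are gathered in one final pass.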
