Keeping data together in spark based on cassandra table partition key - apache-spark

When loading data from a Cassandra table, a Spark partition holds all rows that share the same Cassandra partition key. However, when I create data in Spark with the same partition key and repartition the new RDD using the .repartitionByCassandraReplica(..) method, the rows end up in a different Spark partition. How do I achieve consistent partitioning in Spark using the partition scheme defined by the Spark-Cassandra connector?
Links to download the CQL and Spark job code that I tested:
CQL file with the keyspace and table schema.
Spark job and other classes.
Version and other information
Spark : 1.3
Cassandra : 2.1
Connector : 1.3.1
Spark nodes (5) and Cassandra cluster nodes (4) run in different data centers
Code extract. Download the code using the links above for more details
Step 1: Load data into 8 Spark partitions
Map<String, String> map = new HashMap<String, String>();
// sc is the JavaSparkContext
CassandraTableScanJavaRDD<TestTable> tableRdd = javaFunctions(sc)
        .cassandraTable("testkeyspace", "testtable", mapRowTo(TestTable.class, map));
Step 2: Repartition the data into 8 partitions
tableRdd.repartitionByCassandraReplica(
        "testkeyspace",
        "testtable",
        partitionNumPerHost,
        someColumns("id"),
        mapToRow(TestTable.class, map));
Step 3: Print the partition id and values for both RDDs
rdd.mapPartitionsWithIndex(new Function2<Integer, Iterator<TestTable>, Iterator<String>>() {
    @Override
    public Iterator<String> call(Integer integer, Iterator<TestTable> itr) throws Exception {
        List<String> list = new ArrayList<String>();
        list.add("PartitionId-" + integer);
        while (itr.hasNext()) {
            TestTable value = itr.next();
            list.add(Integer.toString(value.getId()));
        }
        return list.iterator();
    }
}, true).collect();
Step 4: Snapshot of the results printed for partition 1. The values differ between the two RDDs, but I expect them to be the same.
Load Rdd values
----------------------------
Table load - PartitionId -1
----------------------------
15
22
--------------------------------------
Repartitioned values - PartitionId -1
--------------------------------------
33
16

repartitionByCassandraReplica does not place keys deterministically. There is an open ticket to change that:
https://datastax-oss.atlassian.net/projects/SPARKC/issues/SPARKC-278
A workaround for now is to set the partitions-per-host parameter (partitionNumPerHost in your code) to 1.
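A minimal sketch of that workaround, reusing the TestTable bean, the column map, and the helpers from the question code above:
// With one Spark partition per replica host, every row whose partition key
// lives on the same Cassandra node lands in the same Spark partition.
JavaRDD<TestTable> repartitioned = javaFunctions(rdd)
        .repartitionByCassandraReplica(
                "testkeyspace",
                "testtable",
                1,                     // partitions per host: the workaround
                someColumns("id"),
                mapToRow(TestTable.class, map));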

Related

Spark-Cassandra: repartitionByCassandraReplica or converting dataset to JavaRDD and back do not maintain number of partitions?

So, I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions set to 96. I am using Spark-Cassandra Connector 3.0.0 and I am trying to join a dataset with a Cassandra table on the partition key, while also using .repartitionByCassandraReplica.
However, repartitionByCassandraReplica is implemented only on RDDs, so I convert my dataset to a JavaRDD, do the repartitionByCassandraReplica, then convert it back to a dataset and do a direct join with the Cassandra table. It seems, though, that in the process the number of partitions is "changing" or is not what I expect.
I am doing a PCA on 4 partition keys which have some thousands of rows, and for which I know the nodes where they are stored according to nodetool getendpoints. It looks like not only is the number of partitions changing, but the nodes the data are pulled from are not the ones that actually have the data. Below is the code.
// FYI: experimentlist is a List<String> which is converted to a Dataset, then a JavaRDD,
// then partitioned according to repartitionByCassandraReplica and then back to a Dataset.
// The table I want to join it with is called "experiment".
List<ExperimentForm> tempexplist = experimentlist.stream()
.map(s -> { ExperimentForm p = new ExperimentForm(); p.setExperimentid(s); return p; })
.collect(Collectors.toList());
Encoder<ExperimentForm> ExpEncoder = Encoders.bean(ExperimentForm.class);
Dataset<ExperimentForm> dfexplistoriginal = sp.createDataset(tempexplist, ExpEncoder);
//Below prints DATASET: PartNum 4
System.out.println("DATASET: PartNum "+dfexplistoriginal.rdd().getNumPartitions());
JavaRDD<ExperimentForm> predf = CassandraJavaUtil.javaFunctions(dfexplistoriginal.javaRDD())
        .repartitionByCassandraReplica(
                "mdb",
                "experiment",
                experimentlist.size(),
                CassandraJavaUtil.someColumns("experimentid"),
                CassandraJavaUtil.mapToRow(ExperimentForm.class));
//Below prints RDD: PartNum 64
System.out.println("RDD: PartNum "+predf.getNumPartitions());
Dataset<ExperimentForm> newdfexplist = sp.createDataset(predf.rdd(), ExpEncoder);
Dataset<Row> readydfexplist = newdfexplist.as(Encoders.STRING()).toDF("experimentid");
//Below prints DATASET: PartNum 64
System.out.println("DATASET: PartNum "+readydfexplist.rdd().getNumPartitions());
// and finally the direct join, which for some reason is not shown as DirectJoin in the DAGs like other times
Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "mdb");
put("table", "experiment");
}
})
.load().select(col("experimentid"), col("description"), col("intensity")).join(readydfexplist,"experimentid");
Is the code wrong? Below are also some images from the Spark UI Stages tab with the DAGs. At first I have 4 tasks/partitions, and after repartitionByCassandraReplica I get 64 or more. Why?
All the stages (Spark UI screenshots): Stage 0 DAG, Stage 0 metrics, Stage 1 DAG, Stage 1 metrics.
It turns out the code I wrote above is not entirely correct! I managed to get repartitionByCassandraReplica working by converting the dataset to an RDD, performing the repartitionByCassandraReplica, doing the join with joinWithCassandraTable, and THEN converting back to a dataset. Now it is indeed repartitioned onto the nodes that actually have the data, and the partitions are maintained across these conversions!
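A rough sketch of that order of operations, reusing the ExperimentForm bean and names from the question above; the row-reader/row-writer factory arguments are my best guess and may differ across connector versions:
// 1. Dataset -> RDD of join keys.
JavaRDD<ExperimentForm> keysRdd = dfexplistoriginal.javaRDD();

// 2. Repartition the keys so each Spark partition sits on a replica of its data.
JavaRDD<ExperimentForm> localKeys = CassandraJavaUtil.javaFunctions(keysRdd)
        .repartitionByCassandraReplica("mdb", "experiment", experimentlist.size(),
                CassandraJavaUtil.someColumns("experimentid"),
                CassandraJavaUtil.mapToRow(ExperimentForm.class));

// 3. Join on the partition key without a shuffle; each worker reads only its local keys.
CassandraJavaPairRDD<ExperimentForm, String> joined = CassandraJavaUtil.javaFunctions(localKeys)
        .joinWithCassandraTable("mdb", "experiment",
                CassandraJavaUtil.someColumns("description"),      // columns to read
                CassandraJavaUtil.someColumns("experimentid"),     // join key columns
                CassandraJavaUtil.mapColumnTo(String.class),       // read "description" as String
                CassandraJavaUtil.mapToRow(ExperimentForm.class)); // write the key bean as CQL values

// 4. Convert back to a Dataset only after the join.
Dataset<String> descriptions = sp.createDataset(joined.values().rdd(), Encoders.STRING());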

How to improve performance of my Spark job loading data into a Cassandra table?

I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1, Java 8, and Apache Cassandra 3.0.
I have my spark-submit / Spark cluster environment set up as below to load 2 billion records.
--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g
I am using a 6-node Cassandra cluster with the below settings:
cassandra.output.consistency.level=ANY
cassandra.concurrent.writes=1500
cassandra.output.batch.size.bytes=2056
cassandra.output.batch.grouping.key=partition
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
cassandra.connection.keep_alive_ms=30000
cassandra.read.timeout_ms=600000
I am loading data into Cassandra tables using a Spark dataframe. After reading into a Spark dataset, I group by certain columns as below.
Dataset<Row> dataDf = // read data from the source, i.e. HDFS files already partitioned by "load_date", "fiscal_year", "fiscal_quarter", "id", "type", "type_code"
Dataset<Row> groupedDf = dataDf.groupBy("id", "type", "value", "load_date", "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date");
groupedDf.write().format("org.apache.spark.sql.cassandra")
.option("table","product")
.option("keyspace", "dataload")
.mode(SaveMode.Append)
.save();
Cassandra table(
PRIMARY KEY (( id, type, value, item_code ), load_date)
) WITH CLUSTERING ORDER BY ( load_date DESC )
Basically I group by the "id", "type", "value", "load_date" columns. Since the other columns ("fiscal_year", "fiscal_quarter", "create_user_txt", "create_date") need to be available for storing into the Cassandra table, I have to include them in the groupBy clause as well.
1) Frankly speaking, I don't know how to get those columns into the resultant dataframe (i.e. groupedDf) after the groupBy in order to store them. Any advice on how to tackle this, please?
2) With the above process/steps, my load job is pretty slow due to a lot of shuffling, i.e. both read-shuffle and write-shuffle.
What should I do here to improve the speed?
While reading from the source (into dataDf), do I need to do anything to improve performance? It is already partitioned.
Should I still do any partitioning? If so, what is the best way/approach given the above Cassandra table?
HDFS file columns
"id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date","last_update_date","create_user_txt","update_user_txt"
Pivoting
I am using groupBy because of the pivoting below:
Dataset<Row> pivot_model_vals_unpersist_df = model_vals_df
        .groupBy("id", "type", "value", "type_code", "load_date", "item_code", "fiscal_year", "fiscal_quarter", "create_date")
        .pivot("type_code")
        .agg(first(/* business logic */));
Please advise.
Your advice/feedback would be highly appreciated.
So, as I gathered from the comments, your task is:
Take 2 billion rows from HDFS.
Save these rows into Cassandra with some conversion.
The schema of the Cassandra table is not the same as the schema of the HDFS dataset.
First of all, you definitely don't need group by. GROUP BY doesn't group columns; it groups rows, invoking some aggregate function like sum, avg, max, etc. The semantics are the same as in SQL "group by", so it's not what you need. What you really need is to make your "to save" dataset fit the desired Cassandra schema.
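To illustrate the semantics (column names borrowed from the question), group by only makes sense together with an aggregation, e.g.:
import static org.apache.spark.sql.functions.max;

// One output row per id - an aggregation, not a way to carry extra columns along.
Dataset<Row> latestPerId = dataDf
        .groupBy("id")
        .agg(max("load_date").alias("latest_load_date"));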
In Java this is a little trickier than in Scala. First, I suggest defining a bean that represents a Cassandra row.
public class MyClass {
    // Remember to declare a no-args constructor
    public MyClass() { }
    private Long id;
    private String type;
    // other fields, getters, setters, etc.
}
Your data is a Dataset<Row>; you need to convert it into a JavaRDD<MyClass>, so you need a converter.
public class MyClassFabric {
    public static MyClass fromRow(Row row) {
        MyClass myClass = new MyClass();
        // Row accessors are positional in the Java API, so look the field up by name
        myClass.setId(row.getLong(row.fieldIndex("id")));
        // ....
        return myClass;
    }
}
As a result we would have something like this:
JavaRDD<MyClass> rdd = dataDf.toJavaRDD().map(MyClassFabric::fromRow);
javaFunctions(rdd)
        .writerBuilder("keyspace", "table", mapToRow(MyClass.class))
        .saveToCassandra();
For additional info you can take a look at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md

Filter Partition Before Reading Hive table (Spark)

Currently I'm trying to filter a Hive table by the latest date_processed.
The table is partitioned by:
System
date_processed
Region
The only way I've managed to filter it is by doing a join query:
query = "select * from contracts_table a join (select max(date_processed) as maximum from contracts_table) b on a.date_processed = b.maximum"
This approach is really time consuming, as I have to do the same procedure for 25 tables.
Does anyone know a way to read directly the latest loaded partition of a table in Spark <1.6?
This is the method I'm using to read.
public static DataFrame loadAndFilter(String query) {
    return SparkContextSingleton.getHiveContext().sql(query);
}
Many thanks!
A dataframe with all the table's partitions can be obtained with:
val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")
The values can then be parsed to get the max value.
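A rough Java sketch of that parsing, reusing the SparkContextSingleton from the question and assuming partition specs come back as strings like system=A/date_processed=20170123/region=EU (the exact layout depends on the table):
// Collect the partition specs and keep the largest date_processed value.
DataFrame partitionsDF = SparkContextSingleton.getHiveContext()
        .sql("show partitions contracts_table");
String maxDate = null;
for (Row r : partitionsDF.collectAsList()) {
    for (String part : r.getString(0).split("/")) {
        if (part.startsWith("date_processed=")) {
            String d = part.substring("date_processed=".length());
            if (maxDate == null || d.compareTo(maxDate) > 0) maxDate = d;
        }
    }
}
// Filtering on the partition column prunes the scan to just the latest partition.
DataFrame latest = SparkContextSingleton.getHiveContext()
        .sql("select * from contracts_table where date_processed = '" + maxDate + "'");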

Cassandra Spark Connector

My Cassandra column family has date and id as the partition key.
While querying I only know the date, so I loop over a range of ids.
My question is about how the connector executes the following code.
The Spark driver code looks like this -
SparkConf conf = new SparkConf().setAppName("DemoApp")
        .setMaster("local[*]")
        .set("spark.cassandra.connection.host", "10.*.*.*")
        .set("spark.cassandra.connection.port", "*");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions javaFunctions = CassandraJavaUtil.javaFunctions(sc);
String date = "23012017";
for(String id : idlist) {
JavaRDD<CassandraRow> cassandraRowsRDD =
javaFunctions.cassandraTable("datakeyspace", "sample2")
.where("date = ?",date)
.where("id = ? ", id)
.select("data");
cassandraRowsRDDList.add(cassandraRowsRDD);
}
List<CassandraRow> collectAllRows = new ArrayList<CassandraRow>();
for(JavaRDD<CassandraRow> rdd : cassandraRowsRDDList){
//do transformations
collectAllRows.addAll(rdd.collect());
}
1) First of all, I wanted to ask: if I loop over the idlist (say the idlist has 1000 elements, and it may keep growing), will this be efficient? How will each select query be distributed across the cluster? In particular, how will the Cassandra DB connections be maintained?
2) In my driver program, after looping I am putting all the rows in a List and then applying transformations to each row and filtering out duplicates. Will this also be distributed by Spark across the cluster, or will it take place on the driver's side?
Kindly help!
There is a better way of doing this, provided by the Spark-Cassandra connector:
you can create an RDD of (date, id) pairs and then call the joinWithCassandraTable function on the date and id columns. The connector does this smartly: all the data is fetched by the workers only, and without a shuffle, i.e. each worker fetches data only for the (date, id) pairs it is holding.
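A minimal Java sketch of that approach, reusing the keyspace/table from the question; the DateId key bean is hypothetical, and the exact factory arguments may differ across connector versions:
// Hypothetical key bean; needs a no-args constructor plus getters/setters.
public class DateId implements java.io.Serializable {
    private String date;
    private String id;
    public DateId() { }
    public DateId(String date, String id) { this.date = date; this.id = id; }
    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}

// Build one RDD of all (date, id) keys instead of one RDD per id.
List<DateId> keys = new ArrayList<>();
for (String id : idlist) keys.add(new DateId(date, id));
JavaRDD<DateId> keysRdd = sc.parallelize(keys);

// Each worker fetches only the rows for the keys it holds - no shuffle.
CassandraJavaPairRDD<DateId, String> rows = CassandraJavaUtil.javaFunctions(keysRdd)
        .joinWithCassandraTable(
                "datakeyspace", "sample2",
                CassandraJavaUtil.someColumns("data"),        // columns to read
                CassandraJavaUtil.someColumns("date", "id"),  // join key columns
                CassandraJavaUtil.mapColumnTo(String.class),  // read "data" as a String
                CassandraJavaUtil.mapToRow(DateId.class));    // write the key bean as CQL values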

PHOENIX SPARK - Load Table as DataFrame

I have created a DataFrame from an HBase table (via Phoenix) which has 500 million rows. From the DataFrame I created an RDD of JavaBeans and use it for joining with data from a file.
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();
JavaRDD<Row> tableRows = df.toJavaRDD();
JavaPairRDD<String, String> dbData = tableRows.mapToPair(
        new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
            }
        });
Now my question: let's say the file has 2 million unique entries matching the table. Will the entire table be loaded into memory as an RDD, or will only the matching 2 million records from the table be loaded into memory as an RDD?
Your statement
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap)
.load();
will load the entire table into memory. You have not provided any filter for Phoenix to push down into HBase and thus reduce the number of rows read.
If you do a join to a non-HBase data source - e.g. a flat file - then all of the records from the HBase table would first need to be read in. The records not matching the secondary data source would not be saved in the new DataFrame, but the initial reading would still have happened.
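For illustration, assuming the target IDs were known up front, a filter on the DataFrame gives Phoenix a predicate it can try to push down (how well a large IN list pushes down depends on the phoenix-spark version):
// import static org.apache.spark.sql.functions.col;
// A pushed-down predicate lets Phoenix/HBase skip non-matching rows at the source.
DataFrame filtered = sqlContext.read()
        .format("org.apache.phoenix.spark")
        .options(phoenixInfoMap)
        .load()
        .filter(col("ID").isin("id1", "id2" /* , ... */));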
Update: A potential approach would be to pre-process the file - i.e. extract the IDs you want, store the results into a new HBase table, and then perform the join directly in HBase via Phoenix, not Spark.
The rationale for that approach is to move the computation to the data. The bulk of the data resides in HBase, so move the small data (the IDs in the file) there.
I am not directly familiar with Phoenix, except that it provides a SQL layer on top of HBase. Presumably it would be capable of doing such a join and storing the result in a separate HBase table..? That separate table could then be loaded into Spark to be used in your subsequent computations.
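A sketch of that Phoenix-side join over JDBC, assuming the pre-processed IDs were loaded into a hypothetical MATCHING_IDS table and the result goes into a hypothetical JOINED_RESULT table:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

static void joinInsidePhoenix(String tableName, String zkUrl) throws SQLException {
    // Run the join inside Phoenix so only the joined result ever leaves HBase.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:" + zkUrl);
         Statement stmt = conn.createStatement()) {
        stmt.executeUpdate(
                "UPSERT INTO JOINED_RESULT (ID, NAME) "
              + "SELECT t.ID, t.NAME FROM " + tableName + " t "
              + "JOIN MATCHING_IDS m ON t.ID = m.ID");
    }
}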
