Cassandra Spark Connector - apache-spark

My Cassandra column family has date and id as the partition key.
While querying I only know the date, so I loop over a range of ids.
My question is about how the connector executes the following code.
The Spark driver code looks like this:
SparkConf conf = new SparkConf()
        .setAppName("DemoApp")
        .setMaster("local[*]")
        .set("spark.cassandra.connection.host", "10.*.*.*")
        .set("spark.cassandra.connection.port", "*");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions javaFunctions = CassandraJavaUtil.javaFunctions(sc);

String date = "23012017";
List<JavaRDD<CassandraRow>> cassandraRowsRDDList = new ArrayList<>();
for (String id : idlist) {
    JavaRDD<CassandraRow> cassandraRowsRDD =
            javaFunctions.cassandraTable("datakeyspace", "sample2")
                    .where("date = ?", date)
                    .where("id = ?", id)
                    .select("data");
    cassandraRowsRDDList.add(cassandraRowsRDD);
}

List<CassandraRow> collectAllRows = new ArrayList<CassandraRow>();
for (JavaRDD<CassandraRow> rdd : cassandraRowsRDDList) {
    // do transformations
    collectAllRows.addAll(rdd.collect());
}
1) First of all, if I loop over the idlist, which has say 1000 elements and may keep growing, will this be efficient? How will each select query be distributed across the cluster? In particular, how will the Cassandra connections be maintained?
2) In my driver program, after the loop I put all the rows into a list, then apply transformations to each row and filter out the duplicates. Will this also be distributed by Spark on the cluster, or will it take place on the driver side?
Kindly help!

There is a better way of doing this, provided by the Spark Cassandra Connector.
You can create an RDD of (date, id) pairs and then call the joinWithCassandraTable function on the date and id columns. The connector handles this efficiently: all the data is fetched by the workers, and without a shuffle, because each worker fetches only the data for the (date, id) keys it holds.
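For illustration, here is a minimal Scala sketch of that approach (the keyspace, table, and column names come from the question; idList, the host, and the rest of the setup are assumptions):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DemoApp")
  .set("spark.cassandra.connection.host", "10.*.*.*") // as in the question
val sc = new SparkContext(conf)

val date = "23012017"
val idList: Seq[String] = Seq("id1", "id2") // assumed; in practice your 1000+ ids

// One RDD of all (date, id) partition keys instead of one query per id
val keys = sc.parallelize(idList.map(id => (date, id)))

// Each executor fetches only the Cassandra partitions whose keys it holds,
// so the lookups are distributed and no shuffle is needed
val rows = keys
  .joinWithCassandraTable("datakeyspace", "sample2")
  .on(SomeColumns("date", "id"))
  .select("data")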

Related

Apache Spark + Cassandra + Java + Spark session: filter records based on datetime between given from and to values

I am working on a Spring Java project, integrating Apache Spark and Cassandra using the DataStax connector.
I have autowired a SparkSession, and the lines of code below seem to work.
Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", tableName.toLowerCase());

Dataset<Row> ds = sparkSession.sqlContext().read()
        .format("org.apache.spark.sql.cassandra")
        .options(configMap)
        .load();
ds.show();
In the step above I am loading the Dataset, and in the step below I am filtering on the datetime field.
String s1 = "2020-06-23 18:51:41";
String s2 = "2020-06-23 18:52:21";
Timestamp from = Timestamp.valueOf(s1);
Timestamp to = Timestamp.valueOf(s2);
ds = ds.filter(ds.col("datetime").between(from, to));
Is it possible to apply this filter condition during the load itself? If so, can someone suggest how to do this?
Thanks in advance.
You don't have to do anything explicitly here; the spark-cassandra-connector supports predicate pushdown, so your filtering condition will be applied during the data selection.
Source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
The connector will automatically pushdown all valid predicates to Cassandra. The Datasource will also automatically only select columns from Cassandra which are required to complete the query. This can be monitored with the explain command.
This filter will be effectively pushed down only if the column you're filtering on is the first clustering column. As Rayan pointed out, we can use the explain command on the dataset to check that predicate pushdown happened: the corresponding predicates should have a * character next to them, like this:
val dcf3 = dc.filter(
  """event_time >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
    |AND event_time <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""".stripMargin)
dcf3.explain
// == Physical Plan ==
// *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [uuid#21,event_time#22,id#23L,value#24]
// PushedFilters: [ *GreaterThanOrEqual(event_time,2019-03-10 14:41:34.373), *LessThanOrE...,
// ReadSchema: struct<uuid:string,event_time:timestamp,id:bigint,value...
If a predicate is not pushed down, we would see an additional filtering step after the scan, with the filtering happening at the Spark level.
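As a sketch (in Scala for brevity, not the exact Java from the question), the same check applied to a Cassandra-backed Dataset with a timestamp filter could look like this; the keyspace and table names are placeholders:

import java.sql.Timestamp
import org.apache.spark.sql.functions.col

// Hypothetical keyspace/table; the filter mirrors the one in the question
val ds = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "key1", "table" -> "mytable"))
  .load()
  .filter(col("datetime").between(
    Timestamp.valueOf("2020-06-23 18:51:41"),
    Timestamp.valueOf("2020-06-23 18:52:21")))

// If pushdown happened, the plan shows PushedFilters entries marked with *
ds.explain()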

spark structured streaming dynamic string filter

We are trying to use a dynamic filter for a structured streaming application.
Let's say we have following pseudo-implementation of a Spark structured streaming application:
spark.readStream()
    .format("kafka")
    .option(...)
    ...
    .load()
    .filter(getFilter())  // <-- dynamic stuff: def filter(conditionExpr: String)
    .writeStream()
    .format("kafka")
    .option(.....)
    .start();
and getFilter returns a string:
String getFilter() {
    // dynamic stuff to create the expression
    return expression; // e.g. "column = true";
}
Is it possible in the current version of Spark to have a dynamic filter condition? I mean that the getFilter() method should dynamically return a filter condition (let's say it is refreshed every 10 minutes). We tried looking into broadcast variables, but we are not sure whether structured streaming supports such a thing.
It looks like it's not possible to update a job's configuration once it's submitted. We deploy on YARN.
Every suggestion/option is highly appreciated.
EDIT:
Assume getFilter() returns:
(columnA = 1 AND columnB = true) OR customHiveUDF(columnC, 'input') != 'required' OR columnD > 8
After 10 minutes we can have a small change (dropping the first expression before the first OR), and potentially a new expression (columnA = 2), e.g.:
customHiveUDF(columnC, 'input') != 'required' OR columnD > 10 OR columnA = 2
The goal is to have multiple filters for one Spark application without submitting multiple jobs.
A broadcast variable should be OK here. You can write a typed filter like:
query.filter(x => x > bv.value).writeStream(...)
where bv is a broadcast variable. You can update it as described here: How can I update a broadcast variable in spark streaming?
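As a minimal sketch of that typed filter, assuming a simple numeric stream read from a socket (the source, port, and threshold value are illustrative, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("bv-filter").getOrCreate()
import spark.implicits._

// Threshold broadcast from the driver (illustrative value)
val bv = spark.sparkContext.broadcast(8)

// A simple numeric stream, just to have something concrete to filter
val nums = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map(_.trim.toInt)

// Typed filter: the lambda reads the broadcast value on the executors
val query = nums.filter(x => x > bv.value)
  .writeStream
  .outputMode("append")
  .format("console")
  .start()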
Another solution is to provide e.g. an RPC or RESTful endpoint and ask this endpoint every 10 minutes. For example (Java, because it is simpler here):
class EndpointProxy {
    static Configuration lastValue;
    static long lastUpdated;
    static final long refreshRate = 10 * 60 * 1000; // 10 minutes

    public static Configuration getConfiguration() {
        if (lastUpdated + refreshRate < System.currentTimeMillis()) {
            lastUpdated = System.currentTimeMillis();
            lastValue = askMyAPI();
        }
        return lastValue;
    }
}
query.filter(x => x > EndpointProxy.getConfiguration().getX()).writeStream()
Edit: a hacky workaround for the user's problem:
You can create a 1-row view:
// confsDF should be kept in some driver-side singleton
var confsDF = Seq(/* some content */).toDF("someColumn")
and then use:
query.crossJoin(confsDF.as("conf")) // cross join as we have only 1 value
.filter("hiveUDF(conf.someColumn)")
.writeStream()...
new Thread {
  override def run(): Unit = {
    confsDF = Seq(/* some new data */).toDF("someColumn")
  }
}.start()
This hack relies on Spark's default execution model: micro-batches. On each trigger the query is rebuilt, so the new data should be taken into consideration.
You can also, in the thread, do:
Seq(/* some new data */).toDF("someColumn").createOrReplaceTempView("conf")
and then in query:
.crossJoin(spark.table("conf"))
Both should work. Keep in mind that this won't work with Continuous Processing mode.
Here is a simple example in which I dynamically filter records coming from a socket. Instead of the date, you can use any REST API that updates your filter dynamically, or a lightweight ZooKeeper instance.
Note: if you are planning to use a REST API, ZooKeeper, or any other external source, use mapPartitions instead of filter, because then you call the API/connection only once per partition (see the sketch after the example below).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Keep only the lines matching the dynamically computed filter value
val words = lines.as[String].filter(_ == new java.util.Date().getMinutes.toString)

// Generate a running count per value
val wordCounts = words.groupBy("value").count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
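To make the mapPartitions note above concrete, here is a sketch that reuses the lines stream from the example; fetchCurrentFilter() is a hypothetical helper standing in for a REST/ZooKeeper call, invoked once per partition rather than once per record:

// Hypothetical helper that asks an external service for the current filter value
def fetchCurrentFilter(): String = {
  // e.g. call a REST endpoint or read a ZooKeeper node here
  new java.util.Date().getMinutes.toString
}

val filtered = lines.as[String].mapPartitions { iter =>
  val current = fetchCurrentFilter() // one lookup per partition, not per record
  iter.filter(_ == current)
}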

Dataproc spark job not able to scan records from bigtable

We are using newAPIHadoopRDD to scan a Bigtable table and add the records to an RDD. The RDD gets populated via newAPIHadoopRDD for a smaller Bigtable table (say, fewer than 100K records). However, it fails to load records into the RDD from a larger table (say, 6M records).
SparkConf sparkConf = new SparkConf().setAppName("mc-bigtable-sample-scan");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(TableInputFormat.INPUT_TABLE, "listings");

Scan scan = new Scan();
scan.addColumn(COLUMN_FAMILY_BASE, COLUMN_COL1);
hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

JavaPairRDD<ImmutableBytesWritable, Result> source = jsc.newAPIHadoopRDD(hbaseConf,
        TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
System.out.println("source count " + source.count());
The count shows properly for the smaller table, but it shows zero for the larger table.
I tried many different configuration options, like increasing driver memory, number of executors, and number of workers, but nothing works.
Could someone help, please?
My bad. I found the issue in my code. The column COLUMN_COL1 which I was trying to scan was not available in the bigger Bigtable table, hence my count was showing as 0.

Spark RDD do not get processed in multiple nodes

I have a use case where I create an RDD from a Hive table. I wrote business logic that operates on every row in the Hive table. My assumption was that when I create the RDD and run a map over it, it utilises all my Spark executors. But what I see in my logs is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
  // do some computation on each row
}
Any clue where I went wrong?
As specified here by @jaceklaskowski:
By default, a partition is created for each HDFS partition, which by
default is 64MB (from Spark’s Programming Guide).
If your input data is less than 64MB (and you are using HDFS), then by default only one partition will be created.
Spark will use all nodes when processing bigger data.
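A quick way to confirm this (a sketch, assuming the flow DataFrame from the question) is to check how many partitions, and therefore tasks, the RDD actually has:

// If this prints 1, only a single task (and thus a single node) will do the work
println(flow.rdd.partitions.length)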
Could it be that your data is skewed?
To rule out this possibility, do the following and rerun the code:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
  // do some computation on each row
}
Further, if your map logic depends on a particular column, you can do the following:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
  // do some computation on each row
}
A good partition column could be a date column.

How to stop loading the whole table in Spark?

The thing is, I have read rights on a table which is partitioned by year, month, and day, but I do not have the right to read the data under 2016/04/24.
When I execute this in the Hive CLI:
hive> select * from table where year="2016" and month="06" and day="01";
I can read other days' data, except 2016/04/24.
But when I read it in Spark:
sqlContext.sql("select * from table where year='2016' and month='06' and day='01'")
an exception is thrown saying that I don't have the right to hdfs/.../2016/04/24.
Does this mean Spark SQL loads the whole table first and then filters? How can I avoid loading the whole table?
You can use JdbcRDD directly. With it you can bypass the Spark SQL engine, so your queries will be sent directly to Hive.
To use JdbcRDD you need to load the Hive JDBC driver and register it first (if it is not registered already):
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD:
val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
  sc,
  () => DriverManager.getConnection(connUrl),
  query,
  lowerBound,
  upperBound,
  numOfPartitions,
  (r: ResultSet) => r.getString(1) /* get data here or with a function */
)
The JdbcRDD query must contain two ? placeholders in order to partition your data, so you should write a better query than mine; this one just creates a single partition to demonstrate how it works.
However, before doing this I recommend you check HiveContext. It supports HiveQL as well. Check this.
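A sketch of that HiveContext suggestion (assuming Spark 1.x, where HiveContext is separate from SQLContext); with a Hive table partitioned by year/month/day, the predicates below are partition predicates and can be used for partition pruning:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Same HiveQL as in the Hive CLI; the year/month/day predicates are
// partition predicates, so Hive partition pruning can apply
val df = hiveContext.sql(
  "select * from table where year='2016' and month='06' and day='01'")
df.show()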
