How to convert a Cassandra ResultSet to a Spark DataFrame? - apache-spark

I would normally load data from Cassandra into Apache Spark this way using Java:
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
"WHERE CAST(store_id as string) = '" + storeId + "'");
But imagine I have a sharder and need to load several partition keys into this DataFrame. I could use WHERE IN (...) in my query and again use the cassandraSql method. But I am a bit reluctant to use WHERE IN because of the well-known problem of turning the coordinator node into a single point of failure. This is explained here:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Is there a way to use several queries but load them into a single DataFrame?

One way to do this would be to run individual queries and unionAll/union the resulting DataFrames/RDDs.
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId1 + "'");
DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId2 + "'");
DataFrame allCustomers = customersOne.unionAll(customersTwo);
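If there are more than a couple of partition keys, the same pattern generalizes: build one DataFrame per key and fold them together with unionAll. A minimal sketch in Scala (the same approach works with the Java API shown above; the storeIds list is a hypothetical input, and sqlContext is the CassandraSQLContext from the question):
// Hypothetical list of shard/partition keys to load
val storeIds = Seq("1", "2", "3")

// One single-partition query per key, so no one coordinator handles all keys
val perStore = storeIds.map { storeId =>
  sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId + "'")
}

// Fold the per-key DataFrames into a single DataFrame
val allCustomers = perStore.reduce(_ unionAll _)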

Related

How does Spark work with a JDBC connection?

I am new to Spark and I am trying to work on a spark-jdbc program to count the number of rows in a database.
I have come up with this code:
import java.io.FileInputStream
import java.util.Properties

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PartitionRetrieval {
  var conf = new SparkConf().setAppName("Spark-JDBC")
  val log = LogManager.getLogger("Spark-JDBC Program")
  Logger.getLogger("org").setLevel(Level.ERROR)

  val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
  val properties = new Properties()
  properties.load(new FileInputStream(conFile))
  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val driverClass = properties.getProperty("gpDriverClass")
  val tableName = "source.bank_accounts"

  try {
    Class.forName(driverClass).newInstance()
  } catch {
    case cnf: ClassNotFoundException =>
      log.error("Driver class: " + driverClass + " not found")
      System.exit(1)
    case e: Exception =>
      log.error("Exception: " + e.printStackTrace())
      System.exit(1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
    val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
      .option("dbtable", tableName)
      .option("user", devUserName)
      .option("password", devPassword).load()
    val rc = gpTable.filter(gpTable("source_system_name") === "ORACLE").count()
    println("gpTable Count: " + rc)
  }
}
So far, this code is working. But I have two conceptual doubts about it.
In Java, we create a connection class and use that connection to query multiple tables, closing it once our requirement is met. But this appears to work in a different way.
If I have to query 10 tables in a database, should I use this line 10 times with a different table name each time:
val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
  .option("dbtable", tableName)
  .option("user", devUserName)
  .option("password", devPassword).load()
The table used here has about 2,000 rows in total, so I can use the filter/select/aggregate functions on it as needed.
But in production there are tables with millions of rows, and if I put one of those huge tables in the statement above, even though our requirement only needs a filtered subset later, wouldn't it create a huge DataFrame first?
Could anyone give me some insight into the doubts mentioned above?
Pass an SQL query to it first; this is known as pushdown to the database.
E.g.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample where k = 1) e", connectionProperties)
You can use an s"""...""" interpolated string to substitute host variables for the k = 1, or build your own SQL string and reuse it as you suggest, but if you don't, the world will still exist.
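For instance, a minimal sketch of that pushdown against the question's table, assuming connectionUrl, devUserName and devPassword are in scope as above (the filter value is made up for illustration):
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", devUserName)
connectionProperties.put("password", devPassword)

// The aliased subquery runs on the database side, so Spark only receives
// the rows that survive the filter instead of the whole table.
val sourceSystem = "ORACLE" // hypothetical filter value
val pushdownQuery =
  s"(select * from source.bank_accounts where source_system_name = '$sourceSystem') t"

val filtered = spark.read.jdbc(connectionUrl, pushdownQuery, connectionProperties)
println("Filtered count: " + filtered.count())
This also speaks to the second doubt: a table with millions of rows never has to be materialized in full on the Spark side before the filter is applied.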

Connection to remote hbase through scala spark

I am trying to connect to a remote HBase through Scala and Spark, but have been unable to succeed.
Can anyone suggest a way to do this?
Thanks in advance.
There are two common ways to connect to HBase from Spark/Scala:
HBase REST API
Apache Phoenix -- https://phoenix.apache.org/
HBase REST API code:
import org.apache.hadoop.hbase.client.{Get, Put, ResultScanner, Scan}
import org.apache.hadoop.hbase.rest.client.{Client, RemoteHTable}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConversions._ // lets the for comprehension iterate the ResultScanner

val hbaseCluster = new org.apache.hadoop.hbase.rest.client.Cluster()
hbaseCluster.add("localhost or IP", <port>) // REST gateway host and port
val restClient = new Client(hbaseCluster)
val table = new RemoteHTable(restClient, "STUDENT")
println("connected...")

var p = new Put(Bytes.toBytes("row1"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("NAME"), Bytes.toBytes("raju"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("COURSE"), Bytes.toBytes("SCALA"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("YEAR"), Bytes.toBytes("2017"))
table.put(p)

val scan = new Scan()
val scanner: ResultScanner = table.getScanner(scan)
println("got scanner...")

val g = new Get(Bytes.toBytes("row1"))
val result = table.get(g)
val name = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("NAME")))
val course = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("COURSE")))
val year = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("YEAR")))
println("row1 " + "name: " + name + " course: " + course + " year: " + year)

for (result <- scanner) {
  val userId = Bytes.toString(result.getValue("NAME".getBytes(), "ID".getBytes()))
  println("userId " + userId)
}
Apache Phoenix
Phoenix provides a Spark plugin and a JDBC connection as well.
Spark plugin - https://phoenix.apache.org/phoenix_spark.html
JDBC connection (query server) - https://phoenix.apache.org/server.html
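For orientation, a minimal sketch of reading a Phoenix table through the phoenix-spark data source (the table name and ZooKeeper URL are placeholders; the exact options depend on your Phoenix and Spark versions):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-read").getOrCreate()

// Load a Phoenix table into a DataFrame via the phoenix-spark plugin.
val students = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "STUDENT")             // placeholder table name
  .option("zkUrl", "zookeeper-host:2181") // placeholder ZooKeeper quorum
  .load()

students.show()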
I came across a similar problem last week. Eventually I made it work using the HBase Spark connector. It takes quite a bit of setup/configuration. I've documented my steps in the link below:
Setup Apache Zeppelin with Spark and HBase

Spark not saving all data to redshift

The following code loads data from S3, cleans and removes duplicates using Spark SQL, and then saves the data to Redshift using JDBC. I have also tried using the spark-redshift Maven dependency and got the same result. I am using Spark 2.0.
What I cannot understand is that when I show the result loaded in memory, the sum is the expected number, yet when Spark saves to Redshift it is always less. Somehow not all records are saved, and I do not see any errors in STL_LOAD_ERRORS either. Has anybody encountered this, or any ideas why it happens?
// Load files that were loaded into firehose on this day
var s3Files = spark.sqlContext.read.schema(schema).json("s3://" + job.getAWSAccessKey + ":" + job.getAWSSecretKey + "#" + job.getBucketName + "/"+ job.getAWSS3RawFileExpression + "/" + year+ "/" + monthCheck+ "/" + dayCheck + "/*/" ).rdd
// Apply the schema to the RDD, here we will have duplicates
val usersDataFrame = spark.createDataFrame(s3Files, schema)
usersDataFrame.createOrReplaceTempView("results")
// Clean and use partition by the keys to eliminate duplicates and get latest record
var results = spark.sql(buildCleaningQuery(job,"results"))
results.createOrReplaceTempView("filteredResults")
// This returns the correct result!
var check = spark.sql("select sum(Reward) from filteredResults where period=1706")
check.show()
var path = UUID.randomUUID().toString()
println("s3://" + job.getAWSAccessKey + ":" + job.getAWSSecretKey + "#" + job.getAWSS3TemporaryDirectory + "/" + path)
val prop = new Properties()
results.write.jdbc(job.getRedshiftJDBC,"work.\"" + path + "\"",prop)
Using jdbc means that Spark will be trying to do repeated INSERT INTO statements, which is massively slow in Redshift. That's also why you're not seeing entries in stl_load_errors: that table only records COPY failures, and plain INSERTs never go through COPY.
I'd suggest that you use the spark-redshift library instead. It's well tested and will perform better. https://github.com/databricks/spark-redshift
Example (showing many options):
my_dataframe.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://my_cluster.qwertyuiop.eu-west-1.redshift.amazonaws.com:5439/my_database?user=my_user&password=my_password")
.option("dbtable", "my_table")
.option("tempdir", "s3://my-bucket")
.option("diststyle", "KEY")
.option("distkey", "dist_key")
.option("sortkeyspec", "COMPOUND SORTKEY(key_1, key_2)")
.option("extracopyoptions", "TRUNCATECOLUMNS COMPUPDATE OFF STATUPDATE OFF")
.mode("overwrite") // "append" / "error"
.save()

Spark query appends keyspace to the spark temp table

I have a cassandraSQLContext where I do this:
cassandraSqlContext.setKeyspace("test");
Because if I don't, it complains that I haven't set a default keyspace.
Now when I run this piece of code:
def insertIntoCassandra(siteMetaData: MetaData, dataFrame: DataFrame): Unit = {
  System.out.println(dataFrame.show())
  val tableName = siteMetaData.getTableName.toLowerCase()
  dataFrame.registerTempTable("spark_" + tableName)
  System.out.println("Registered the spark table to spark_" + tableName)
  val columns = columnMap.get(siteMetaData.getTableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName, columns)
  System.out.println("Query: " + query)
  cassandraSqlContext.sql(query)
  System.out.println("Query executed")
}
It gives me this error log:
Registered the spark table to spark_test
Query: INSERT INTO TABLE test.tablename SELECT **the columns here** FROM spark_tablename
17/02/28 04:15:53 ERROR JobScheduler: Error running job streaming job 1488255351000 ms.0
java.util.concurrent.ExecutionException: java.io.IOException: Couldn't find test.tablename or any similarly named keyspace and table pairs
What I don't understand is why the cassandraSQLContext isn't executing the printed query, and why it appends the keyspace to the Spark temp table.
public String buildInsertQuery(String activeReplicaKeySpace, String tableName, String columns) {
    String sql = "INSERT INTO TABLE " + activeReplicaKeySpace + "." + tableName +
            " SELECT " + columns + " FROM spark_" + tableName;
    return sql;
}
So the problem was that I was using two different instances of cassandraSQLContext. In one of the methods, I was instantiating a new cassandraSQLContext, which conflicted with the cassandraSQLContext being passed to the insertIntoCassandra method.
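A rough sketch of that fix, creating the context once and passing the same instance into the helper (the method signature is adapted from the question; everything else is illustrative):
// Create one CassandraSQLContext on the driver...
val cassandraSqlContext = new CassandraSQLContext(sparkContext)
cassandraSqlContext.setKeyspace("test")

// ...and pass that same instance to every helper, instead of constructing
// a second context inside a method (which would not see the temp table).
def insertIntoCassandra(siteMetaData: MetaData,
                        dataFrame: DataFrame,
                        sqlContext: CassandraSQLContext): Unit = {
  val tableName = siteMetaData.getTableName.toLowerCase()
  dataFrame.registerTempTable("spark_" + tableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName,
    columnMap.get(siteMetaData.getTableName))
  sqlContext.sql(query) // the shared context can resolve spark_<tableName>
}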

Issue while storing data from Spark-Streaming to Cassandra

A SparkStreaming context reads a stream from RabbitMQ with an interval of 30 seconds. I want to modify the values of a few columns of the corresponding rows already in Cassandra and then store the data back to Cassandra. For that I need to check whether a row for the particular primary key exists in Cassandra; if yes, fetch it and do the necessary operation. The problem is that I create the StreamingContext on the driver while the actions are performed on the workers, so they cannot get the StreamingContext object, since it wasn't serialized and sent to the workers, and I get this error:
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext. I also know that we cannot access the StreamingContext inside foreachRDD. But how do I achieve the same functionality here without getting the serialization error?
I have looked at a few examples here but they didn't help.
Here is the snippet of the code :
val ssc = new StreamingContext(sparkConf, Seconds(30))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
receiverStream.start()
val lines = receiverStream.map(EventData.fromString(_))
lines.foreachRDD { x =>
  if (x.toLocalIterator.nonEmpty) {
    x.foreachPartition { it =>
      for (tuple <- it) {
        val cookieid = tuple.cookieid
        val sessionid = tuple.sessionid
        val logdate = tuple.logdate
        val EventRows = ssc.cassandraTable("SparkTest", CassandraTable).select("*")
          .where("cookieid = '" + cookieid + "' and logdate = '" + logdate + "' and sessionid = '" + sessionid + "'")
        // some logic to check whether the row exists for this cookieid
      }
    }
  }
}
The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this you could use foreachPartition or mapPartitions.
Otherwise, do this within the function that gets passed around:
CassandraConnector(SparkWriter.conf).withSessionDo { session =>
  // ...
  session.executeAsync(<CQL Statement>)
}
and in the SparkConf you need to give the Cassandra details
val conf = new SparkConf()
  .setAppName("test")
  .set("spark.ui.enabled", "true")
  .set("spark.executor.memory", "8g")
  // .set("spark.executor.core", "4")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/ephemeral/spark-events")
  // to avoid disk space issues - default is /tmp
  .set("spark.local.dir", "/ephemeral/spark-scratch")
  .set("spark.cleaner.ttl", "10000")
  .set("spark.cassandra.connection.host", cassandraip)
  .setMaster("spark://10.255.49.238:7077")
The Java CSVParser is a library class that is not serializable, so Spark cannot send it to possibly different nodes if you call map or forEach on the RDD. One workaround is using mapPartitions, in which case one full partition is processed on one Spark node, so the parser does not need to be serialized for each call. Example:
val rdd_inital_parse = rdd.mapPartitions(pLines)

def pLines(lines: Iterator[String]) = {
  val parser = new CSVParser() // cannot be serialized; would fail if used via rdd.map(pLines)
  lines.map(x => parseCSVLine(x, parser.parseLine))
}
Try with x.sparkContext.cassandraTable() instead of ssc.cassandraTable() and see if it helps
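Putting those suggestions together, a rough sketch of the foreachPartition + CassandraConnector pattern for the existence check in the question (keyspace, the CassandraTable variable, and the column names come from the question; conf is the SparkConf with the Cassandra host shown above, and the exact CQL is illustrative):
import com.datastax.spark.connector.cql.CassandraConnector

// CassandraConnector is serializable, so it can be created on the driver
// and used safely inside executor-side closures.
val connector = CassandraConnector(conf)

lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One session per partition, opened on the executor - nothing
    // non-serializable (StreamingContext, sessions) is captured from the driver.
    connector.withSessionDo { session =>
      partition.foreach { tuple =>
        val row = session.execute(
          "SELECT * FROM SparkTest." + CassandraTable +
            " WHERE cookieid = ? AND logdate = ? AND sessionid = ?",
          tuple.cookieid, tuple.logdate, tuple.sessionid).one()
        if (row != null) {
          // row exists: apply the update logic and write it back
        } else {
          // row does not exist: insert a new record
        }
      }
    }
  }
}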
