I am trying to connect to a remote HBase through Scala and Spark but have been unable to succeed.
Can anyone suggest any methods for this?
Thanks in advance.
There are two methods to connect to HBase from Spark/Scala:
HBase REST API
Apache Phoenix -- https://phoenix.apache.org/
HBase REST API code:
import org.apache.hadoop.hbase.client.{Get, Put, ResultScanner, Scan}
import org.apache.hadoop.hbase.rest.client.{Client, Cluster, RemoteHTable}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Point the REST client at the HBase REST server (hostname or IP and port)
val hbaseCluster = new Cluster()
hbaseCluster.add("localhost or <IP>", <port>)
val restClient = new Client(hbaseCluster)
val table = new RemoteHTable(restClient, "STUDENT")
println("connected...")

// Write one row with three columns in column family "0"
val p = new Put(Bytes.toBytes("row1"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("NAME"), Bytes.toBytes("raju"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("COURSE"), Bytes.toBytes("SCALA"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("YEAR"), Bytes.toBytes("2017"))
table.put(p)

// Scan the table
val scan = new Scan()
val scanner: ResultScanner = table.getScanner(scan)
println("got scanner...")

// Read the row back with a Get
val g = new Get(Bytes.toBytes("row1"))
val result = table.get(g)
val name = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("NAME")))
val course = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("COURSE")))
val year = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("YEAR")))
println("row1 " + "name: " + name + " course: " + course + " year: " + year)

// Iterate over all rows returned by the scanner (family "0", qualifier "NAME")
for (r <- scanner.asScala) {
  val studentName = Bytes.toString(r.getValue(Bytes.toBytes("0"), Bytes.toBytes("NAME")))
  println("name " + studentName)
}
Apache Phoenix
Phoenix provides a Spark plugin and a JDBC connection as well.
Spark plugin - https://phoenix.apache.org/phoenix_spark.html
JDBC connection (query server) - https://phoenix.apache.org/server.html
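For example, a minimal sketch of reading a table through the Phoenix Spark plugin, assuming an existing SparkSession named spark; the table name and ZooKeeper URL are placeholders:
// Sketch only: read a Phoenix table as a DataFrame via the phoenix-spark plugin.
// "STUDENT" and "zkhost:2181" are assumed placeholders.
val studentDf = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "STUDENT")
  .option("zkUrl", "zkhost:2181")
  .load()
studentDf.filter(studentDf("NAME") === "raju").show()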
I came across a similar problem last week. Eventually I made it work using the HBase Spark connector. It requires quite a bit of setup/configuration. I've documented my steps in the link below:
Setup Apache Zeppelin with Spark and HBase
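For reference, a rough sketch of what a read through the HBase Spark connector can look like, assuming the hbase-spark (hbase-connectors) module and an existing SparkSession named spark; the table name and column mapping are placeholders:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// The connector picks up the HBaseContext created for this application
new HBaseContext(spark.sparkContext, HBaseConfiguration.create())

val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "id STRING :key, name STRING cf:name")
  .option("hbase.table", "STUDENT")
  .load()
df.show()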
Related
I have a Spark job which performs a right join of two tables. The reading and joining are pretty fast, but when I try to insert the join results into the Cassandra DB it is very slow: it takes more than 30 minutes to insert 1000 rows and 3 minutes to insert 9 rows. Please see my configuration below. We have 3 Cassandra and Spark nodes, and Spark is installed on all nodes. I'm pretty new to Spark and can't understand what is wrong. I can insert data of the same size with the DSE driver in less than 1 second (more than 2000 rows). I appreciate your time and help!
Spark submit:
"dse -u " + username + " -p " + password + " spark-submit --class com.SparkJoin --executor-memory=20G " +
"SparkJoinJob-1.0-SNAPSHOT.jar " + filterMap.toString()
Spark core version : 2.7.2
spark-cassandra-connector_2.11 : 2.3.1
spark-sql_2.11 : 2.3.1
Spark Conf
SparkConf conf = new SparkConf(true).setAppName("Appname");
conf.set("spark.cassandra.connection.host", host);
conf.set("spark.cassandra.auth.username", username);
conf.set("spark.cassandra.auth.password", password);
conf.set("spark.network.timeout", "600s");
conf.set("spark.cassandra.connection.keep_alive_ms", "25000");
conf.set("spark.cassandra.connection.timeout_ms", "5000000");
conf.set("spark.sql.broadcastTimeout", "5000000");
SparkContext sc = new SparkContext(conf);
SparkSession sparkSession = SparkSession.builder().sparkContext(sc).getOrCreate();
SQLContext sqlContext = sparkSession.sqlContext();
sqlContext.setConf("spark.cassandra.connection.host", host);
sqlContext.setConf("spark.cassandra.auth.username", username);
sqlContext.setConf("spark.cassandra.auth.password", password);
sqlContext.setConf("spark.network.timeout", "600s");
sqlContext.setConf("spark.cassandra.connection.keep_alive_ms", "2500000");
sqlContext.setConf("spark.cassandra.connection.timeout_ms", "5000000");
sqlContext.setConf("spark.sql.broadcastTimeout", "5000000");
sqlContext.setConf("spark.executor.heartbeatInterval", "5000000");
sqlContext.setConf("spark.sql.crossJoin.enabled", "true");
Left and right table fetch:
Dataset<Row> resultsFrame = sqlContext.sql("select * from table where conditions");
return resultsFrame.map((MapFunction<Row, JavaObject>) row -> {
// some operations here
return obj;
}, Encoders.bean(JavaObject.class)
);
Join
Dataset<Row> result = RigtTableJavaRDD.join(LeftTableJavaRDD,
(LeftTableJavaRDD.col("col1").minus(RigtTableJavaRDD.col("col2"))).
between(new BigDecimal("0").subtract(twoHundredMilliseconds), new BigDecimal("0").add(twoHundredMilliseconds))
.and(LeftTableJavaRDD.col("col5").equalTo(RigtTableJavaRDD.col("col6")))
, "right");
Insert Result
CassandraJavaUtil.javaFunctions(resultRDD.javaRDD()).
writerBuilder("keyspace", "table", CassandraJavaUtil.mapToRow(JavaObject.class)).
saveToCassandra();
I am new to Spark and I am trying to work on a spark-jdbc program to count the number of rows in a database.
I have come up with this code:
import java.io.FileInputStream
import java.util.Properties

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PartitionRetrieval {
  val conf = new SparkConf().setAppName("Spark-JDBC")
  val log = LogManager.getLogger("Spark-JDBC Program")
  Logger.getLogger("org").setLevel(Level.ERROR)

  // Read the connection details from a properties file
  val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
  val properties = new Properties()
  properties.load(new FileInputStream(conFile))
  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val driverClass = properties.getProperty("gpDriverClass")
  val tableName = "source.bank_accounts"

  // Make sure the JDBC driver class is available
  try {
    Class.forName(driverClass).newInstance()
  } catch {
    case cnf: ClassNotFoundException =>
      log.error("Driver class: " + driverClass + " not found")
      System.exit(1)
    case e: Exception =>
      log.error("Exception: " + e.getMessage, e)
      System.exit(1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
    val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
      .option("dbtable", tableName)
      .option("user", devUserName)
      .option("password", devPassword).load()
    val rc = gpTable.filter(gpTable("source_system_name") === "ORACLE").count()
    println("gpTable Count: " + rc)
  }
}
So far, this code is working. But I have two conceptual doubts about it.
In Java, we create a connection class and use that connection to query multiple tables, closing it once our requirement is met. But it appears to work in a different way here.
If I have to query 10 tables in a database, should I use this line 10 times with different table names in it:
val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable",tableName)
.option("user",devUserName)
.option("password",devPassword).load()
The current table used here has 2000 rows in total, so I can use the filter/select/aggregate functions accordingly.
But in production there are tables with millions of rows, and if I put one of those huge tables in the above statement, even though our requirement is to filter it later, wouldn't it create a huge DataFrame first?
Could anyone give me some insight into the doubts I mentioned above?
Pass an SQL query to it first, known as pushdown to the database.
E.g.:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample where k = 1) e", connectionProperties)
You can substitute the k = 1 with host variables using an s"""...""" interpolated string, or build your own SQL string and reuse it as you suggest, but if you don't, the world will still exist.
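For example, a minimal sketch of building the pushdown query with string interpolation (the variable names are assumptions, not from the original):
// Build the pushed-down subquery with an interpolated string
val kValue = 1 // hypothetical host variable
val pushdownQuery = s"(select k, v from sample where k = $kValue) e"
val dataframe_mysql = spark.read.jdbc(jdbcUrl, pushdownQuery, connectionProperties)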
I have a custom foreach writer for Spark Streaming. For each row I write to a JDBC source. I also want to do some kind of fast lookup before I perform the JDBC operation and update the value after I perform it, like "Step 1" and "Step 3" in the sample code below.
I don't want to use external databases like Redis or MongoDB. I want something with a low footprint like RocksDB, Derby, etc.
I'm okay with storing one file per application, just like checkpointing; I'll create an internal-db folder.
I could not find any in-memory DB for Spark.
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

def main(args: Array[String]): Unit = {
val brokers = "quickstart:9092"
val topic = "safe_message_landing_app_4"
val sparkSession = SparkSession.builder().master("local[*]").appName("Ganesh-Kafka-JDBC-Streaming").getOrCreate();
val sparkContext = sparkSession.sparkContext;
sparkContext.setLogLevel("ERROR")
val sqlContext = sparkSession.sqlContext;
val kafkaDataframe = sparkSession.readStream.format("kafka")
.options(Map("kafka.bootstrap.servers" -> brokers, "subscribe" -> topic,
"startingOffsets" -> "latest", "group.id" -> " Jai Ganesh", "checkpoint" -> "cp/kafka_reader"))
.load()
kafkaDataframe.printSchema()
kafkaDataframe.createOrReplaceTempView("kafka_view")
val sqlDataframe = sqlContext.sql("select concat ( topic, '-' , partition, '-' , offset) as KEY, string(value) as VALUE from kafka_view")
val customForEachWriter = new ForeachWriter[Row] {
override def open(partitionId: Long, version: Long) = {
println("Open Started ==> partitionId ==> " + partitionId + " ==> version ==> " + version)
true
}
override def process(value: Row) = {
// Step 1 ==> Lookup a key in persistent KEY-VALUE store
// JDBC operations
// Step 3 ==> Update the value in persistent KEY-VALUE store
}
override def close(errorOrNull: Throwable) = {
println(" ************** Closed ****************** ")
}
}
val yy = sqlDataframe
.writeStream
.queryName("foreachquery")
.foreach(customForEachWriter)
.start()
yy.awaitTermination()
sparkSession.close();
}
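As a sketch of what Step 1 and Step 3 could look like with an embedded store such as RocksDB (assuming the rocksdbjni library is on the classpath; the folder name and key/value layout are assumptions):
import org.rocksdb.{Options, RocksDB}

RocksDB.loadLibrary()
val rocksOptions = new Options().setCreateIfMissing(true)
val db = RocksDB.open(rocksOptions, "internal-db") // one folder per application, like checkpointing

val key = "row-key".getBytes("UTF-8")
// Step 1: fast local lookup before the JDBC operation
val previousValue = Option(db.get(key)).map(new String(_, "UTF-8"))
// ... JDBC operations ...
// Step 3: update the value after the JDBC operation
db.put(key, "new-value".getBytes("UTF-8"))
db.close()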
Manjesh,
What you are looking for, "Spark and your in-memory DB as one seamless cluster, sharing a single process space", with support for MVCC, is exactly what SnappyData provides. With SnappyData, the tables that you want to do a fast lookup on are in the same process that is running your Spark Streaming job. Check it out here.
SnappyData has an Apache V2 license for the core product, and the specific use you are referring to is available in the OSS download.
(Disclosure: I am a SnappyData employee, and it makes sense to provide a product-specific answer to this question because the product is the answer to the question.)
A Spark Streaming context reads a stream from RabbitMQ with an interval of 30 seconds. I want to modify the values of a few columns of corresponding rows existing in Cassandra and then store the data back to Cassandra. For that I need to check whether the row for the particular primary key exists in Cassandra; if yes, fetch it and do the necessary operation. The problem is that I create the StreamingContext on the driver and the actions get performed on the workers, so they are not able to get the StreamingContext object, the reason being that it wasn't serialized and sent to the workers, and I get this error:
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext. I also know that we cannot access the StreamingContext inside foreachRDD. But how do I achieve the same functionality here without getting the serialization error?
I have looked at a few examples here but they didn't help.
Here is the snippet of the code:
val ssc = new StreamingContext(sparkConf, Seconds(30))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
receiverStream.start()
val lines = receiverStream.map(EventData.fromString(_))
lines.foreachRDD { x => if (x.toLocalIterator.nonEmpty) {
  x.foreachPartition { it => for (tuple <- it) {
    val cookieid = tuple.cookieid
    val sessionid = tuple.sessionid
    val logdate = tuple.logdate
    // this is the call that drags the StreamingContext into the worker closure
    val EventRows = ssc.cassandraTable("SparkTest", CassandraTable).select("*")
      .where("cookieid = '" + cookieid + "' and logdate = '" + logdate + "' and sessionid = '" + sessionid + "'")
    // some logic: whether the row exists or not for this cookieid
  } }
} }
The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this, you could use foreachPartition or mapPartitions.
Else, do this within your function that gets passed around:
CassandraConnector(SparkWriter.conf).withSessionDo { session =>
  // ...
  session.executeAsync(<CQL Statement>)
}
and in the SparkConf you need to give the Cassandra details
val conf = new SparkConf()
.setAppName("test")
.set("spark.ui.enabled", "true")
.set("spark.executor.memory", "8g")
// .set("spark.executor.core", "4")
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "/ephemeral/spark-events")
//to avoid disk space issues - default is /tmp
.set("spark.local.dir", "/ephemeral/spark-scratch")
.set("spark.cleaner.ttl", "10000")
.set("spark.cassandra.connection.host", cassandraip)
.setMaster("spark://10.255.49.238:7077")
The Java CSVParser is a library that is not serializable, so Spark cannot send it to possibly different nodes if you call map or foreach on the RDD. One workaround is using mapPartitions, in which case one full partition will be executed on one Spark node. Hence it need not be serialized for each call. Example:
val rdd_inital_parse = rdd.mapPartitions(pLines)

def pLines(lines: Iterator[String]) = {
  val parser = new CSVParser() // cannot be serialized, would fail if using rdd.map(pLines)
  lines.map(x => parseCSVLine(x, parser.parseLine))
}
Try with x.sparkContext.cassandraTable() instead of ssc.cassandraTable() and see if it helps.
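One way to read that suggestion, sketched against the question's foreachRDD (the per-row lookup is left as an outline; this is a sketch, not tested code):
import com.datastax.spark.connector._

lines.foreachRDD { x =>
  if (x.toLocalIterator.nonEmpty) {
    // this part of foreachRDD runs on the driver, so the RDD's SparkContext is usable
    val eventRows = x.sparkContext.cassandraTable("SparkTest", CassandraTable)
    // join or filter eventRows against x here instead of looking rows up one by one
  }
}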
Referring to my other question, "Writing to HBase from Spark Streaming", I was advised to follow https://www.mapr.com/blog/spark-streaming-hbase in order to write to HBase from Spark Streaming, and that's what I did (with modifications according to my needs). When running spark-submit there is no error, but the data is also not written into HBase. I'll show you the code; could you please figure out whether I did something wrong and how to correct it?
val conf = HBaseConfiguration.create()
val jobConfig: JobConf = new JobConf(conf)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "tabName")
Dstream.foreachRDD(rdd =>
rdd.map(Convert.toPut).saveAsHadoopDataset(jobConfig))
with:
object Convert {
  def toPut(parametre: (String, String)): (ImmutableBytesWritable, Put) = {
    val put = new Put(Bytes.toBytes(1))
    put.add(Bytes.toBytes("colfamily"), Bytes.toBytes(parametre._1), Bytes.toBytes(parametre._2))
    (new ImmutableBytesWritable(Bytes.toBytes(1)), put)
  }
}
Could you please help me find out what I'm doing wrong here?
Thank you in advance.