Spark not deleting old data in MemSql when Overwrite mode is used - apache-spark

I am running a Spark job with save mode Overwrite. I expected it to delete the existing data in the table and insert the new data, but it just appends the data instead.
I expected the same behavior as when SaveMode.Overwrite is used against a file system.
object HiveToMemSQL {
  def main(args: Array[String]) {
    val log = Logger.getLogger(HiveToMemSQL.getClass)
    // Source query (command-line argument handling omitted)
    var query = "select * from default.students"
    // Destination MemSQL connection details
    val destHostName = "localhost"
    val destDBName = "tsg"
    val destTable = "ORC_POS_TEST"
    val destPort = 3308
    val destConnInfo = MemSQLConnectionInfo(destHostName, destPort, "root", "", destDBName)
    val spark = SparkSession.builder().appName("Hive To MemSQL")
      .config("maxRecordsPerBatch", "100")
      .config("spark.memsql.host", destConnInfo.dbHost)
      .config("spark.memsql.port", destConnInfo.dbPort.toString)
      .config("spark.memsql.user", destConnInfo.user)
      .config("spark.memsql.password", destConnInfo.password)
      .config("spark.memsql.defaultDatabase", destConnInfo.dbName)
      .config("spark.memsql.defaultSaveMode", "Overwrite")
      .master("local[*]").enableHiveSupport().getOrCreate()
    import spark.implicits._
    import spark.sql
    // Queries are expressed in HiveQL
    val sqlDF = spark.sql("select * from tsg.v_pos_krogus_wk_test")
    log.info("Successfully read data from source")
    sqlDF.printSchema()
    // MemSQL destination DB Master Aggregator, Port, Username and Password
    // Disabling writing to leaf nodes directly
    var saveConf = SaveToMemSQLConf(spark.memSQLConf,
      params = Map("useKeylessShardingOptimization" -> "false",
        "writeToMaster" -> "false",
        "saveMode" -> SaveMode.Overwrite.toString()))
    log.info("Save mode before: " + saveConf.saveMode)
    saveConf = saveConf.copy(saveMode = SaveMode.Overwrite)
    log.info("Save mode after: " + saveConf.saveMode)
    val tableIdent = TableIdentifier(destDBName, destTable)
    sqlDF.saveToMemSQL(tableIdent, saveConf)
    log.info("Successfully completed writing to MemSQL DB")
  }
}

The MemSQL Spark Connector maps the Overwrite save mode to a REPLACE statement rather than a truncate-and-insert. REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY, the old row is deleted before the new row is inserted; rows whose keys never reappear in the new data are left untouched, which is why the table looks like it is only being appended to. See https://docs.memsql.com/sql-reference/v6.0/replace/
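If a true overwrite is needed (all existing rows removed first), one workaround is to truncate the destination table over a plain JDBC connection before calling saveToMemSQL; MemSQL speaks the MySQL wire protocol, so a standard MySQL JDBC driver works for this. A minimal sketch, reusing destHostName, destPort, destDBName and destTable from the question (the credentials are the same placeholders):
import java.sql.DriverManager

// Clear the destination table first so the REPLACE-based save starts from an empty table.
val jdbcUrl = s"jdbc:mysql://$destHostName:$destPort/$destDBName"
val conn = DriverManager.getConnection(jdbcUrl, "root", "")
try {
  conn.createStatement().execute(s"TRUNCATE TABLE $destTable")
} finally {
  conn.close()
}
// ...then call sqlDF.saveToMemSQL(tableIdent, saveConf) as before.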

Related

Neo4j thinks that password is database

I am trying to integrate Spark and Neo4j. I am new to Neo4j. I have the following short Spark app
import com.typesafe.config._
import org.apache.spark.sql.SparkSession
import org.neo4j.spark._

object Neo4jStorer {
  var conf: Config = null

  def main(args: Array[String]): Unit = {
    val spark = getSparkSession()
    val sc = spark.sparkContext
    val g = Neo4jGraph.loadGraph(sc, label1 = "a", relTypes = Seq("rel"), label2 = "b")
    val vCount = g.toString
    println("Count= " + vCount)
  }

  def getSparkSession(): SparkSession = {
    SparkSession
      .builder
      .appName("SparkNeo4j")
      .config("spark.neo4j.bolt.url", "neo4j://127.0.0.1:7687")
      .config("spark.neo4j.bolt.user", "neo4j")
      .config("spark.neo4j.bolt.password", "FakePassword")
      .getOrCreate()
  }
}
I used https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/ as an example for this code, as I am using Spark 3.0. When I run it I get the following:
20/10/17 14:36:36 ERROR LoadBalancer: Failed to update routing table for database 'FakePassword'. Current routing table: Ttl 1602963396190, currentTime 1602963396527, routers AddressSet=[], writers AddressSet=[], readers AddressSet=[], database 'FakePassword'.
org.neo4j.driver.exceptions.FatalDiscoveryException: Unable to get a routing table for database 'FakePassword' because this database does not exist
If I change the password I get an authentication error, and again the incorrect password is shown as being a database. I created a database with the name FakePassword and I still got the same error. Why is this happening and how can I fix it?
Also, when I try to get g.vertices.count as shown in the example I am following, I get a compilation error.
With the code below I am able to get data from a DataFrame into Neo4j, which is what I really wanted to do. This does not seem to be the ideal solution since it uses foreach, so I am open to improvements.
import com.typesafe.config._
import org.apache.spark.sql.SparkSession
import org.neo4j.driver.{AuthTokens, GraphDatabase, Session}
import org.neo4j.spark._

object StackoverflowAnswer {
  def main(args: Array[String]): Unit = {
    val spark = getSparkSession()
    val sc = spark.sparkContext
    import spark.implicits._
    val df = sc.parallelize(List(1, 2, 3)).toDF
    df.foreach(
      row => {
        val query = "CREATE (n:NumLable {num: " + row.get(0).toString + "})"
        Neo4jSess.session.run(query)
        ()
      }
    )
  }

  def getSparkSession(): SparkSession = {
    SparkSession
      .builder
      .appName("SparkNeo4j")
      .getOrCreate()
  }
}

object Neo4jSess {
  /**
   * Store a Neo4j session in an object so that it can be used by Spark.
   */
  var conf: Config = null
  this.conf = ConfigFactory.load().getConfig("DeltaStorer")

  val neo4jUrl: String = "bolt://127.0.0.1:7687"
  val neo4jUser: String = "neo4j"
  val neo4jPassword: String = "FakePassword"

  val driver = GraphDatabase.driver(neo4jUrl, AuthTokens.basic(neo4jUser, neo4jPassword))
  val session: Session = driver.session()
}
Please try to update spark-defaults.conf:
spark.jars.packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
spark.neo4j.url bolt://XX.XXX.X.XXX:7687
spark.neo4j.user neo4j
spark.neo4j.password test
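If editing spark-defaults.conf is not an option, the same settings can be passed programmatically when the session is built. A minimal sketch, assuming the config keys from the answer above (host and password are placeholders; note that spark.jars.packages only takes effect if set before the SparkContext is created, otherwise pass --packages to spark-submit):
import org.apache.spark.sql.SparkSession

// Programmatic equivalent of the spark-defaults.conf entries above.
val spark = SparkSession
  .builder
  .appName("SparkNeo4j")
  .config("spark.jars.packages", "neo4j-contrib:neo4j-spark-connector:2.4.5-M2")
  .config("spark.neo4j.url", "bolt://127.0.0.1:7687")
  .config("spark.neo4j.user", "neo4j")
  .config("spark.neo4j.password", "test")
  .getOrCreate()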

How to interact with different cassandra cluster from the same spark context

I want to migrate my old Cassandra cluster's data to a new cluster and am thinking of writing some Spark jobs to do that. Is there any way to interact with multiple Cassandra clusters from the same SparkContext, so that I can read the data from one cluster and write it to the other using the saveToCassandra function inside the same Spark job?
val products = sc.cassandraTable("first_cluster","products").cache()
products.saveToCassandra("diff_cluster","products2")
Can we save the data into a different cluster?
Example from spark-cassandra-connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext

def twoClusterExample(sc: SparkContext) = {
  val connectorToClusterOne = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.1"))
  val connectorToClusterTwo = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.2"))

  val rddFromClusterOne = {
    // Sets connectorToClusterOne as default connection for everything in this code block
    implicit val c = connectorToClusterOne
    sc.cassandraTable("ks", "tab")
  }

  {
    // Sets connectorToClusterTwo as the default connection for everything in this code block
    implicit val c = connectorToClusterTwo
    rddFromClusterOne.saveToCassandra("ks", "tab")
  }
}
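A minimal sketch of wiring this up in a job, assuming twoClusterExample is defined on (or imported into) the same object and both clusters are reachable; the app name and hosts are placeholders:
import org.apache.spark.{SparkConf, SparkContext}

object ClusterMigration {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraClusterMigration")
      // Default connection host; the per-block implicits above override it.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    twoClusterExample(sc) // reads ks.tab from cluster one and writes it to cluster two
    sc.stop()
  }
}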

How to convert java Resultset into Spark dataframe

I am trying to use a PreparedStatement with JDBC. It returns a ResultSet object, and I want to convert it into a Spark DataFrame.
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet, ResultSetMetaData}

object JDBCRead {
  val tableName: String = "TABLENAME"
  val url: String = "jdbc:teradata://TERADATA_URL/user=USERNAME,password=PWD,charset=UTF8,TYPE=FASTEXPORT,SESSIONS=10"
  val selectTable: String = "SELECT * FROM " + tableName + " sample 10"

  val con: Connection = DriverManager.getConnection(url)
  val pstmt2: PreparedStatement = con.prepareStatement(selectTable)
  val rs: ResultSet = pstmt2.executeQuery
  val rsmd: ResultSetMetaData = rs.getMetaData

  // rs.next() returns a Boolean and advances the cursor; call it once per row
  while (rs.next()) {
    for (i <- 1 to rsmd.getColumnCount) {
      print(" " + rs.getObject(i))
    }
    println()
  }
}
I want to call the above code from Spark so that I can load the data into a DataFrame and get the results faster, in a distributed way.
I must use PreparedStatement. I cannot use Spark's JDBC load, since Teradata's FASTEXPORT does not work with it; it has to be used with a PreparedStatement.
How can I achieve this? How can I use a PreparedStatement together with a SELECT statement to load the data into a Spark DataFrame?
AFAIK there are two options available for this kind of requirement:
1. DataFrame
2. JdbcRDD
I'd suggest JdbcRDD, since you are so specific about PreparedStatement: it uses a PreparedStatement internally in its compute method, so you don't need to create and maintain the connection explicitly (which is error prone).
Later you can convert the result into a DataFrame (see the sketch after the example below).
For speed you can tune the other parameters (the partition bounds and the number of partitions).
Example usage of JdbcRDD is below.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}

object JdbcRddExample {
  def main(args: Array[String]) {
    // Connection string
    val url = "jdbc:teradata://SERVER/demo"
    val username = "demo"
    val password = "Spark"
    Class.forName("com.teradata.jdbc.Driver").newInstance

    // Creating & configuring the Spark context
    val conf = new SparkConf().setAppName("App1").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    println("Start...")

    // Fetching data from the database; JdbcRDD prepares the statement and binds the bounds to the two '?' placeholders
    val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url, username, password),
      "select first_name, last_name, gender from person limit ?,?",
      3, 5, 1, r => r.getString("last_name") + "," + r.getString("first_name"))

    // Displaying the content
    myRDD.foreach(println)

    // Saving the content to a text file
    myRDD.saveAsTextFile("c://jdbcrdd")
    println("End...")
  }
}
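As mentioned above, the JdbcRDD result can then be converted into a DataFrame. A minimal sketch, reusing url, username and password from the example and mapping each row to a case class (Person is introduced here purely for illustration):
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.sql.SparkSession

case class Person(firstName: String, lastName: String, gender: String)

val spark = SparkSession.builder().appName("JdbcRddToDF").master("local[2]").getOrCreate()
import spark.implicits._

// Keep the row structure by mapping each ResultSet row to a case class,
// then convert the RDD to a DataFrame with the implicit toDF.
val personRDD = new JdbcRDD(
  spark.sparkContext,
  () => DriverManager.getConnection(url, username, password),
  "select first_name, last_name, gender from person limit ?,?",
  3, 5, 1,
  r => Person(r.getString("first_name"), r.getString("last_name"), r.getString("gender")))

val personDF = personRDD.toDF()
personDF.show()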

Serialization of transform function in checkpointing

I'm trying to understand Spark Streaming's RDD transformations and checkpointing in the context of serialization. Consider the following example Spark Streaming app:
private val helperObject = HelperObject()

private def createStreamingContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName(Constants.SparkAppName)
    .setIfMissing("spark.master", Constants.SparkMasterDefault)

  implicit val streamingContext = new StreamingContext(
    new SparkContext(conf),
    Seconds(Constants.SparkStreamingBatchSizeDefault))

  val myStream = StreamUtils.createStream()
  myStream.transform(transformTest(_)).print()
  streamingContext
}

def transformTest(rdd: RDD[String]): RDD[String] = {
  rdd.map(str => helperObject.doSomething(str))
}

val ssc = StreamingContext.getOrCreate(Settings.progressDir,
  createStreamingContext)
ssc.start()

while (true) {
  helperObject.setData(...)
}
From what I've read in other SO posts, transformTest will be invoked on the driver program once for every batch after streaming starts. Assuming createStreamingContext is invoked (no checkpoint is available), I would expect that the instance of helperObject defined up top would be serialized out to workers once per batch, hence picking up the changes applied to it via helperObject.setData(...). Is this the case?
Now, if createStreamingContext is not invoked (a checkpoint is available), then I would expect that the instance of helperObject cannot possibly be picked up for each batch, since it can't have been captured if createStreamingContext is not executed. Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Is it possible to update helperObject throughout execution from the driver program when using checkpointing? If so, what's the best approach?
Will helperObject be serialized to each executor?
Ans: Yes.
val helperObject = Instantiate_SomeHow()
rdd.map { _.SomeFunctionUsing(helperObject) }
Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Ans: Yes.
If you wish to refresh helperObject's behaviour for each RDD operation, you can still do that by making helperObject smarter: do not ship helperObject directly, but ship a factory function with the signature () => HelperObject instead.
Since it is a function, it is serializable. This is a very common design pattern for sending objects that are not serializable, e.g. a database connection object, or for your use case here.
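A minimal sketch of that factory-function pattern, assuming a simple HelperObject class (the names createHelper and doSomething are illustrative, not from any library):
import org.apache.spark.rdd.RDD

// Deliberately NOT Serializable: only the factory function below is shipped.
class HelperObject {
  def doSomething(str: String): String = str.toUpperCase
}

// The factory function is a serializable closure; each executor calls it
// to build its own HelperObject instance lazily.
val createHelper: () => HelperObject = () => new HelperObject

def transformTest(rdd: RDD[String]): RDD[String] = {
  rdd.mapPartitions { iter =>
    val helper = createHelper() // constructed on the executor, once per partition
    iter.map(str => helper.doSomething(str))
  }
}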
An example of this pattern is the Kafka exactly-once-semantics example that stores offsets in a database:
package example

import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import scalikejdbc._
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkContext, SparkConf, TaskContext}
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaUtils, HasOffsetRanges, OffsetRange}

/** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
  * Offsets and results will be stored per-batch, on the driver
  */
object TransactionalPerBatch {
  def main(args: Array[String]): Unit = {
    val conf = ConfigFactory.load
    val kafkaParams = Map(
      "metadata.broker.list" -> conf.getString("kafka.brokers")
    )
    val jdbcDriver = conf.getString("jdbc.driver")
    val jdbcUrl = conf.getString("jdbc.url")
    val jdbcUser = conf.getString("jdbc.user")
    val jdbcPassword = conf.getString("jdbc.password")
    val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
    ssc.start()
    ssc.awaitTermination()
  }

  def setupSsc(
    kafkaParams: Map[String, String],
    jdbcDriver: String,
    jdbcUrl: String,
    jdbcUser: String,
    jdbcPassword: String
  )(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf, Seconds(60))
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)

    // begin from the offsets committed to the database
    val fromOffsets = DB.readOnly { implicit session =>
      sql"select topic, part, off from txn_offsets".
        map { resultSet =>
          TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
        }.list.apply().toMap
    }

    val stream: InputDStream[(String, Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
      ssc, kafkaParams, fromOffsets,
      // we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
      (mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))

    stream.foreachRDD { rdd =>
      // Note this block is running on the driver
      // Cast the rdd to an interface that lets us get an array of OffsetRange
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // simplest possible "metric", namely a count of messages per topic
      // Notice the aggregation is done using spark methods, and results collected back to driver
      val results = rdd.reduceByKey {
        // This is the only block of code running on the executors.
        // reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
        _ + _
      }.collect

      // Back to running on the driver
      // localTx is transactional, if metric update or offset update fails, neither will be committed
      DB.localTx { implicit session =>
        // store metric results
        results.foreach { pair =>
          val (topic, metric) = pair
          val metricRows = sql"""
            update txn_data set metric = metric + ${metric}
            where topic = ${topic}
          """.update.apply()
          if (metricRows != 1) {
            throw new Exception(s"""
              Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
            """)
          }
        }
        // store offsets
        offsetRanges.foreach { osr =>
          val offsetRows = sql"""
            update txn_offsets set off = ${osr.untilOffset}
            where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
          """.update.apply()
          if (offsetRows != 1) {
            throw new Exception(s"""
              Got $offsetRows rows affected instead of 1 when attempting to update offsets for
              ${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
              Was a partition repeated after a worker failure?
            """)
          }
        }
      }
    }
    ssc
  }
}

Spark: Read HBase in secured cluster

I have an easy task: I want to read HBase data in a Kerberos secured cluster.
So far I have tried two approaches:
sc.newAPIHadoopRDD(): here I don't know how to handle the Kerberos authentication
creating an HBase connection from the HBase API: here I don't really know how to convert the result into RDDs
Furthermore, there seem to be some HBase-Spark connectors, but somehow I didn't manage to find them as Maven artifacts, and/or they require a fixed structure of the result (whereas I just need the HBase Result object, since the columns in my data are not fixed).
Do you have any examples or tutorials?
I appreciate any help and hints.
Thanks in advance!
I assume that you are using Spark + Scala + HBase.
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.security.UserGroupInformation

object SparkWithMyTable {
  def main(args: Array[String]) {
    // Initiate spark context with spark master URL. You can modify the URL per your environment.
    val sc = new SparkContext("spark://ip:port", "MyTableTest")
    val tableName = "myTable"

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "list of cluster ip's")
    conf.set("hbase.zookeeper" + ".property.clientPort", "2181")
    conf.set("hbase.master", "masterIP:60000")
    conf.set("hadoop.security.authentication", "kerberos")
    conf.set("hbase.security.authentication", "kerberos")

    // Kerberos login with the principal and keytab path of your environment
    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab("user#---", keyTabPath)

    // Add local HBase conf
    // conf.addResource(new Path("file://hbase/hbase-0.94.17/conf/hbase-site.xml"))
    conf.set(TableInputFormat.INPUT_TABLE, tableName)

    // create my table with column family
    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      print("Creating MyTable")
      val tableDesc = new HTableDescriptor(tableName)
      tableDesc.addFamily(new HColumnDescriptor("cf1".getBytes()))
      admin.createTable(tableDesc)
    } else {
      print("Table already exists!!")
      val columnDesc = new HColumnDescriptor("cf1")
      admin.disableTable(Bytes.toBytes(tableName))
      admin.addColumn(tableName, columnDesc)
      admin.enableTable(Bytes.toBytes(tableName))
    }

    // first put data into table
    val myTable = new HTable(conf, tableName)
    for (i <- 0 to 5) {
      val p = new Put(new String("row" + i).getBytes())
      p.add("cf1".getBytes(), "column-1".getBytes(), new String("value " + i).getBytes())
      myTable.put(p)
    }
    myTable.flushCommits()

    // how to create the rdd
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    // get the row count
    val count = hBaseRDD.count()
    print("HBase RDD count:" + count)
    System.exit(0)
  }
}
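Since the question only needs the raw HBase Result objects (the columns are not fixed), here is a minimal sketch of pulling qualifier/value pairs out of the RDD created above using the standard HBase client API; the column family cf1 follows the example, everything else is illustrative:
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Each element of hBaseRDD is (rowKey, Result). getFamilyMap returns all
// qualifier -> value pairs of a column family, so no fixed column layout is assumed.
val rows = hBaseRDD.map { case (key, result) =>
  val rowKey = Bytes.toString(key.get(), key.getOffset, key.getLength)
  val cells = result.getFamilyMap(Bytes.toBytes("cf1")).asScala.map {
    case (qualifier, value) => Bytes.toString(qualifier) -> Bytes.toString(value)
  }.toMap
  (rowKey, cells)
}
rows.take(5).foreach(println)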
Maven Artifact
<dependency>
  <groupId>it.nerdammer.bigdata</groupId>
  <artifactId>spark-hbase-connector_2.10</artifactId>
  <!-- Version can be changed as per your Spark version; I am using Spark 1.6.x -->
  <version>1.0.3</version>
</dependency>
You can also have a look at:
Spark play with HBase's Result object: handling HBase KeyValue and ByteArray in Scala with Spark -- Real World Examples
scan-that-works-on-kerberos
HBaseScanRDDExample.scala
