SPARK JDBC Connection Reuse for many Queries Executed - apache-spark

Borrowing from SO 26634853, then the following questions :
Using an IMPALA connection like this is a one-shot set up :
val JDBCDriver = "com.cloudera.impala.jdbc41.Driver"
val ConnectionURL = "jdbc:impala://url.server.net:21050/default;auth=noSasl"
Class.forName(JDBCDriver).newInstance
val con = DriverManager.getConnection(ConnectionURL)
val stmt = con.createStatement()
val rs = stmt.executeQuery(query)
val resultSetList = Iterator.continually((rs.next(), rs)).takeWhile(_._1).map(r => {
getRowFromResultSet(r._2) // (ResultSet) => (spark.sql.Row)
}).toList
sc.parallelize(resultSetList)
What if I need to put a loop around the con.createStatement() and associated code thereunder with some logic and execute it say, some, 5000 times?
Referring to the db connection overhead discussions with map vs. mapPartitions, would I, in this case, incur 5000 x the cost of the connection, or is it re-usable the way it is done here? From documentation on SCALA JDBC it looks like it can be reused.
My thinking is that as it is not a high-level SPARK API like df_mysql = sqlContext.read.format("jdbc").options ..., then I think it should remain open, but would like to check. May be the SPARK env closes it automatically, but I think not. At the end of processing a close could be issued?
Using the HIVE Context means we do not need to open the connection every time - or is that not so? Then using a parquet or ORC table, then I presume would allow such an approach as performance is quite fast.

I tried this simulation and connection remains open, so provided not in foreach, it is not an issue performance-wise.
var counter = 0
do
{
counter = counter + 1
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select author from family) f ", connectionProperties)
dataframe_mysql.show
} while (counter < 3)

Related

How to use spark to write to HBase using multi-thread

I'm using spark to write data to HBase, but at the writing stage, only one executor and one core are executing.
I wonder why my code is not writing properly or what should I do to make it write faster?
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)
def hbaseWrite(ss:SparkSession,tableList: List[String], df:DataFrame): Unit ={
val tableName = tableList(0)
val rowKeyName = tableList(4)
val rowKeyType = tableList(5)
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
//写入到HBase
val sc = ss.sparkContext
sc.hadoopConfiguration.addResource(hbaseConf)
val columns = df.columns
val result = df.rdd.mapPartitions(par=>{
par.map(row=>{
var rowkey:String =""
if("String".equals(rowKeyType)){
rowkey = row.getAs[String](rowKeyName)
}else if("Long".equals(rowKeyType)){
rowkey = row.getAs[Long](rowKeyName).toString
}
val put = new Put(Bytes.toBytes(rowkey))
for(name<-columns){
var value = row.get(row.fieldIndex(name))
if(value!=null){
put.addColumn(Bytes.toBytes("cf"),Bytes.toBytes(name),Bytes.toBytes(value.toString))
}
}
(new ImmutableBytesWritable,put)
})
})
val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Result])
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You may not control how many parallel execute may write to HBase.
Though you can start multiple Spark jobs in multiThreaded client program.
e.g. You can have a shell script which triggers multiple spark-submit command to induce parallelism. Each spark job can work on one set of data independent to each other and push into HBase.
This can also be done using Spark Java/Scala SparkLauncher API using it with Java concurrent API (e.g. Executor framework).
val sparkLauncher = new SparkLauncher
//Set Spark properties.only Basic ones are shown here.It will be overridden if properties are set in Main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
.setAppResource("/path/to/jar/to/be/executed")
.setMainClass("MainClassName")
.setMaster("MasterType like yarn or local[*]")
.setDeployMode("set deploy mode like cluster")
.setConf("spark.executor.cores","2")
// Lauch spark application
val sparkLauncher1 = sparkLauncher.startApplication()
//get jobId
val jobAppId = sparkLauncher1.getAppId
//Get status of job launched.THis loop will continuely show statuses like RUNNING,SUBMITED etc.
while (true) {
println(sparkLauncher1.getState().toString)
}
However, the challenge is to track each of them for failure and automatic recovery. It may be tricky specially when partial data is already written into HBase. i.e. A job fails to process the complete set of data assigned to it. You may have to automatically clean the data from HBase before automatically retrigger.

Filtering and selecting data from a DataFrame in Spark

I am working on a Spark-JDBC program
I came up with the following code so far:
object PartitionRetrieval {
var conf = new SparkConf().setAppName("Spark-JDBC")
val log = LogManager.getLogger("Spark-JDBC Program")
Logger.getLogger("org").setLevel(Level.ERROR)
val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
val properties = new Properties()
properties.load(new FileInputStream(conFile))
val connectionUrl = properties.getProperty("gpDevUrl")
val devUserName = properties.getProperty("devUserName")
val devPassword = properties.getProperty("devPassword")
val driverClass = properties.getProperty("gpDriverClass")
val tableName = "source.bank_accounts"
try {
Class.forName(driverClass).newInstance()
} catch {
case cnf: ClassNotFoundException =>
log.error("Driver class: " + driverClass + " not found")
System.exit(1)
case e: Exception =>
log.error("Exception: " + e.printStackTrace())
System.exit(1)
}
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable",tableName)
.option("user",devUserName)
.option("password",devPassword).load()
val rc = gpTable.filter(gpTable("source_system_name")==="ORACLE").count()
println("gpTable Count: " + rc)
}
}
In the above code, will the statement:val gpTable = spark.read.format("jdbc").option("url", connectionUrl) dump the whole data of the table: bank_accounts into the DataFrame: gpTable and then DataFrame: rc gets the filtered data. I have this doubt as the table: bank_accounts is a very small table and it doesn't have an effect if it is loaded into memory as a dataframe as a whole. But in our production, there are tables with billions of records. In that case what is the recommended way to load data into a DataFrame using a JDBC connection ?
Could anyone let me know the concept of Spark-Jdbc's entry point here ?
will the statement ... dump the whole data of the table: bank_accounts into the DataFrame: gpTable and then DataFrame: rc gets the filtered data.
No. DataFrameReader is not eager. It only defines data bindings.
Additionally, simple predicates, like trivial equality, checks are pushed to the source and only required columns should loaded when plan is executed.
In the database log you should see a query similar to
SELECT 1 FROM table WHERE source_system_name = 'ORACLE'
if it is loaded into memory as a dataframe as a whole.
No. Spark doesn't load data in memory unless it instructed to (primarily cache) and even then it limits itself to the blocks that fit into available storage memory.
During standard process it keep only the data that is required to compute the plan. For global plan memory footprint shouldn't depend on the amount of data.
In that case what is the recommended way to load data into a DataFrame using a JDBC connection ?
Please check Partitioning in spark while reading from RDBMS via JDBC, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?, https://stackoverflow.com/a/45028675/8371915 for questions related to scalability.
Additionally you can read Does spark predicate pushdown work with JDBC?

What is the fastest way to read database using PySpark?

I am trying to read a table of a database using PySpark and SQLAlchamy as follows:
SUBMIT_ARGS = "--jars mysql-connector-java-5.1.45-bin.jar pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
sc = SparkContext('local[*]', 'testSparkContext')
sqlContext = SQLContext(sc)
t0 = time.time()
database_uri = 'jdbc:mysql://{}:3306/{}'.format("127.0.0.1",<db_name>)
dataframe_mysql = sqlContext.read.format("jdbc").options(url=database_uri, driver = "com.mysql.jdbc.Driver", dbtable = <tablename>, user= <user>, password=<password>).load()
print(dataframe_mysql.rdd.map(lambda row :list(row)).collect())
t1 = time.time()
database_uri2 = 'mysql://{}:{}#{}/{}'.format(<user>,<password>,"127.0.0.1",<db_name>)
engine = create_engine(database_uri2)
connection = engine.connect()
s = text("select * from {}.{}".format(<db_name>,<table_name>))
result = connection.execute(s)
for each in result:
print(each)
t2= time.time()
print("Time taken by PySpark:", (t1-t0))
print("Time taken by SQLAlchamy", (t2-t1))
This is the time taken to fetch some 3100 rows:
Time taken by PySpark: 12.326745986938477
Time taken by SQLAlchamy: 0.21664714813232422
Why SQLAlchamy is outperforming PySpark? Is there any way to make this faster? Is there any error in my approach?
Why SQLAlchamy is outperforming PySpark? Is there any way to make this faster? Is there any error in my approach?
More than one. Ultimately you try use Spark in a way it is not intended to be used, measure incorrect thing and introduce incredible amount of indirection. Overall:
JDBC DataSource is inefficient, and as you use it is completely sequential. Check parallellizing reads in Spark Gotchas.
Collecting data is not intended for production use in practice.
You introduce a lot of indirection, by converting data to RDD and serializing, fetching to driver and deserializing.
Your code measures not only data processing time, but also cluster / contexts initialization time.
local mode (designed for prototyping and unit testing) is just a cherry on the top.
And so on...
So at the end of the day your code is slow but it is not something you'd use in production application. SQLAlchemy and Spark are designed for complete different purposes - if you're looking for low latency database access layer Spark is not the right choice.

How to effectively read millions of rows from Cassandra?

I have a hard task to read from a Cassandra table millions of rows. Actually this table contains like 40~50 millions of rows.
The data is actually internal URLs for our system and we need to fire all of them. To fire it, we are using Akka Streams and it have been working pretty good, doing some back pressure as needed. But we still have not found a way to read everything effectively.
What we have tried so far:
Reading the data as Stream using Akka Stream. We are using phantom-dsl that provides a publisher for a specific table. But it does not read everything, only a small portion. Actually it stops to read after the first 1 million.
Reading using Spark by a specific date. Our table is modeled like a time series table, with year, month, day, minutes... columns. Right now we are selecting by day, so Spark will not fetch a lot of things to be processed, but this is a pain to select all those days.
The code is the following:
val cassandraRdd =
sc
.cassandraTable("keyspace", "my_table")
.select("id", "url")
.where("year = ? and month = ? and day = ?", date.getYear, date.getMonthOfYear, date.getDayOfMonth)
Unfortunately I can't iterate over the partitions to get less data, I have to use a collect because it complains the actor is not serializable.
val httpPool: Flow[(HttpRequest, String), (Try[HttpResponse], String), HostConnectionPool] = Http().cachedHostConnectionPool[String](host, port).async
val source =
Source
.actorRef[CassandraRow](10000000, OverflowStrategy.fail)
.map(row => makeUrl(row.getString("id"), row.getString("url")))
.map(url => HttpRequest(uri = url) -> url)
val ref = Flow[(HttpRequest, String)]
.via(httpPool.withAttributes(ActorAttributes.supervisionStrategy(decider)))
.to(Sink.actorRef(httpHandlerActor, IsDone))
.runWith(source)
cassandraRdd.collect().foreach { row =>
ref ! row
}
I would like to know if any of you have such experience on reading millions of rows for doing anything different from aggregation and so on.
Also I have thought to read everything and send to a Kafka topic, where I would be receiving using Streaming(spark or Akka), but the problem would be the same, how to load all those data effectively ?
EDIT
For now, I'm running on a cluster with a reasonable amount of memory 100GB and doing a collect and iterating over it.
Also, this is far different from getting bigdata with spark and analyze it using things like reduceByKey, aggregateByKey, etc, etc.
I need to fetch and send everything over HTTP =/
So far it is working the way I did, but I'm afraid this data get bigger and bigger to a point where fetching everything into memory makes no sense.
Streaming this data would be the best solution, fetching in chunks, but I haven't found a good approach yet for this.
At the end, I'm thinking of to use Spark to get all those data, generate a CSV file and use Akka Stream IO to process, this way I would evict to keep a lot of things in memory since it takes hours to process every million.
Well, after spending sometime reading, talking with other guys and doing tests the result could be achieve by the following code sample:
val sc = new SparkContext(sparkConf)
val cassandraRdd = sc.cassandraTable(config.getString("myKeyspace"), "myTable")
.select("key", "value")
.as((key: String, value: String) => (key, value))
.partitionBy(new HashPartitioner(2 * sc.defaultParallelism))
.cache()
cassandraRdd
.groupByKey()
.foreachPartition { partition =>
partition.foreach { row =>
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
val myActor = system.actorOf(Props(new MyActor(system)), name = "my-actor")
val source = Source.fromIterator { () => row._2.toIterator }
source
.map { str =>
myActor ! Count
str
}
.to(Sink.actorRef(myActor, Finish))
.run()
}
}
sc.stop()
class MyActor(system: ActorSystem) extends Actor {
var count = 0
def receive = {
case Count =>
count = count + 1
case Finish =>
println(s"total: $count")
system.shutdown()
}
}
case object Count
case object Finish
What I'm doing is the following:
Try to achieve a good number of Partitions and a Partitioner using the partitionBy and groupBy methods
Use Cache to prevent Data Shuffle, making your Spark move large data across nodes, using high IO etc.
Create the whole actor system with it's dependencies as well as the Stream inside the foreachPartition method. Here is a trade off, you can have only one ActorSystem but you will have to make a bad use of .collect as I wrote in the question. However creating everything inside, you still have the ability to run things inside spark distributed across your cluster.
Finish each actor system at the end of the iterator using the Sink.actorRef with a message to kill(Finish)
Perhaps this code could be even more improved, but so far I'm happy to do not make the use of .collect anymore and working only inside Spark.

Handle database connection inside spark streaming

I am not sure if I understand correctly how spark handle database connection and how to reliable using large number of database update operation insides spark without potential screw up the spark job. This is a code snippet I have been using (for easy illustration):
val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol = mongodb[BSONCollection](conf.getString("mongo.collections.dailyInventory"))
val stream: InputDStream[(String,String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
ssc, kafkaParams, fromOffsets,
(mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()));
def processRDD(rddElem: RDD[(String, String)]): Unit = {
val df = rdd.map(line => {
...
}).flatMap(x => x).toDF()
if (!isEmptyDF(df)) {
var mongoF: Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]] = Seq();
val dfF2 = df.groupBy($"CountryCode", $"Width", $"Height", $"RequestType", $"Timestamp").agg(sum($"Frequency")).collect().map(row => {
val countryCode = row.getString(0); val width = row.getInt(1); val height = row.getInt(2);
val requestType = row.getInt(3); val timestamp = row.getLong(4); val frequency = row.getLong(5);
val endTimestamp = timestamp + 24*60*60; //next day
val updateOp = dailyInventoryCol.updateModifier(BSONDocument("$inc" -> BSONDocument("totalFrequency" -> frequency)), false, true)
val f: Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult] =
dailyInventoryCol.findAndModify(BSONDocument("width" -> width, "height" -> height, "country_code" -> countryCode, "request_type" -> requestType,
"startTs" -> timestamp, "endTs" -> endTimestamp), updateOp)
f
})
mongoF = mongoF ++ dfF2
//split into small chunk to avoid drying out the mongodb connection
val futureList: List[Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]]] = mongoF.grouped(200).toList
//future list
futureList.foreach(seqF => {
Await.result(Future.sequence(seqF), 40.seconds)
});
}
stream.foreachRDD(processRDD(_))
Basically, I am using Reactive Mongo (Scala) and for each RDD, I convert it into dataframe, group/extract the necessary data and then fire a large number of database update query against mongo. I want to ask:
I am using mesos to deploy spark on 3 servers and have one more server for mongo database. Is this the correct way to handle database connection. My concern is if database connection / polling is opened at the beginning of spark job and maintained properly (despite timeout/network error failover) during the whole duration of spark(weeks, months....) and if it will be closed when each batch finished? Given the fact that job might be scheduled on different servers? Does it means that each batch, it will open different set of DB connections?
What happen if exception occurs when executing queries. The spark job for that batch will failed? But the next batch will keep continue?
If there is too many queries (2000->+) to run update on mongo-database, and the executing time is exceeding configured spark batch duration (2 minutes), will it cause the problem? I was noticed that with my current setup, after abt 2-3 days, all of the batch is queued up as "Process" on Spark WebUI (if i disable the mongo update part, then i can run one week without prob), none is able to exit properly. Which basically hang up all batch job until i restart/resubmit the job.
Thanks a lot. I appreciate if you can help me address the issue.
Please read "Design Patterns for using foreachRDD" section in http://spark.apache.org/docs/latest/streaming-programming-guide.html. This will clear your doubts about how connections should be used/ created.
Secondly i would suggest to keep the direct update operations separate from your Spark Job. Better way would be that your spark job, process the data and then post it into a Kafka Queue and then have another dedicated process/ job/ code which reads the data from Kafka Queue and perform insert/ update operation on Mongo DB.

Resources