I use Spark v1.6 and have the DataFrame below.
Primary_key | Dim_id
PK1 | 1
PK2 | 2
PK3 | 3
I would like to create a new DataFrame with new sequence numbers whenever new records come in. Let's say I get 2 new records from the source with the values PK4 and PK5; I would like to create new dim_ids with the values 4 and 5, so my new DataFrame should look like the one below.
Primary_key | Dim_id
PK1 | 1
PK2 | 2
PK3 | 3
PK4 | 4
PK5 | 5
How can I generate a running sequence number for the new records in a Spark 1.6 DataFrame?
If you have a database somewhere, you can create a sequence in it and use it from a user-defined function (like you, I stumbled upon this problem...).
Reserve a bucket of sequence numbers and iterate over it (the incrementBy parameter must match the increment used when the sequence was created). Because it is an object, SequenceID is a singleton on each worker node, and the AtomicLong walks through the reserved bucket.
It's far from perfect (possible connection leaks, reliance on a DB, locks on a static class), so comments are welcome.
import java.sql.Connection
import java.sql.DriverManager
import java.util.concurrent.locks.ReentrantLock
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.sql.functions.udf
// Per-JVM singleton: holds the current value and the upper bound of the
// reserved bucket, plus the JDBC connection used to fetch new buckets.
object SequenceID {
  var current: AtomicLong = new AtomicLong
  var max: Long = 0
  var connection: Connection = null
  var connectionLock = new ReentrantLock
  var seqLock = new ReentrantLock

  def getConnection(): Connection = {
    if (connection != null) {
      return connection
    }
    connectionLock.lock()
    if (connection == null) {
      // create your jdbc connection here
    }
    connectionLock.unlock()
    connection
  }

  def next(sequence: String, incrementBy: Long): Long = {
    if (current.get == max) {
      // sequence bucket exhausted, get a new one
      seqLock.lock()
      if (current.get == max) {
        val rs = getConnection().createStatement().executeQuery(s"SELECT NEXT VALUE FOR ${sequence} FROM sysibm.sysdummy1")
        rs.next()
        current.set(rs.getLong(1))
        max = current.get + incrementBy
      }
      seqLock.unlock()
    }
    return current.getAndIncrement
  }
}

// Serializable wrapper shipped to the executors; it delegates to the singleton.
class SequenceID() extends Serializable {
  def next(sequence: String, incrementBy: Long): Long = {
    return SequenceID.next(sequence, incrementBy)
  }
}

val sequenceGenerator = new SequenceID()

def sequenceUDF(seq: SequenceID) = udf[Long](() => {
  seq.next("PK_SEQUENCE", 500L)
})

val seq = sequenceUDF(sequenceGenerator)
myDataframe.select(myDataframe("foo"), seq())
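If the goal is the Dim_id column from the question, the same UDF can be attached with withColumn; here is a minimal sketch (the newRecords DataFrame and the column name are hypothetical, not from the original answer):

// Hypothetical: newRecords holds only the incoming rows (PK4, PK5, ...)
val withDimIds = newRecords.withColumn("Dim_id", seq())

Note that the values come from the database sequence, so they continue from wherever that sequence currently stands rather than from the existing maximum Dim_id in the DataFrame.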
In my Scala application I create several threads. In each thread I write different data from an array to the same PostgreSQL table. I noticed that some threads do not write their data to the PostgreSQL table, yet there are no errors in the application logs. Is it possible for the database to block parallel accesses to the table? What can be the cause of this behavior?
MainApp.scala:
val postgreSQL = new PostgreSQL(configurations)
val semaphore = new Semaphore(5)
for (item <- array) {
  semaphore.acquire()
  val thread = new Thread(new CustomThread(postgreSQL, semaphore, item))
  thread.start()
}
CustomThread.scala:
import java.util.concurrent.Semaphore
import java.util.UUID.randomUUID
import utils.PostgreSQL
class CustomThread(postgreSQL: PostgreSQL, semaphore: Semaphore, item: Item) extends Runnable {
  override def run(): Unit = {
    try {
      // Create the unique filename.
      val filename: String = randomUUID().toString
      // Write the item's filename to the database.
      postgreSQL.changeItemFilename(filename, item.id)
      // Change the status type of the item.
      postgreSQL.changeItemStatusType(3, item.id)
    } catch {
      case e: Throwable =>
        e.printStackTrace()
    } finally {
      semaphore.release()
    }
  }
}
PostgreSQL.scala:
package utils
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import java.util.Properties
class PostgreSQL(configurations: Map[String, String]) {
  val host: String = configurations("postgresql.host")
  val port: String = configurations("postgresql.port")
  val user: String = configurations("postgresql.user")
  val password: String = configurations("postgresql.password")
  val db: String = configurations("postgresql.db")
  val url: String = "jdbc:postgresql://" + host + ":" + port + "/" + db
  val driver: String = "org.postgresql.Driver"
  val properties = new Properties()
  val connection: Connection = getConnection
  var statement: PreparedStatement = _

  def getConnection: Connection = {
    properties.setProperty("user", user)
    properties.setProperty("password", password)
    var connection: Connection = null
    try {
      Class.forName(driver)
      connection = DriverManager.getConnection(url, properties)
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
    connection
  }

  def changeItemFilename(filename: String, id: Int): Unit = {
    try {
      statement = connection.prepareStatement("UPDATE REPORTS SET FILE_NAME = ? WHERE ID = ?;", ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY)
      statement.setString(1, filename)
      statement.setInt(2, id)
      statement.execute()
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
  }
}
Just for your interest: by default, JDBC is synchronous, which means it blocks your thread until the operation on a given connection is done. So if you try to do multiple things on a single connection at the same time, the actions are performed sequentially instead.
More on that:
https://dzone.com/articles/myth-asynchronous-jdbc
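To illustrate, below is a minimal sketch (plain JDBC; the helper names are hypothetical) that gives each task its own Connection and PreparedStatement instead of sharing one PostgreSQL instance across threads:

import java.sql.{Connection, DriverManager}
import java.util.Properties

// Hypothetical helper: open a dedicated connection per task so concurrent writes
// do not serialize on (or race over) one shared Connection/PreparedStatement.
def withConnection[T](url: String, props: Properties)(work: Connection => T): T = {
  val conn = DriverManager.getConnection(url, props)
  try work(conn) finally conn.close()
}

def changeItemFilename(url: String, props: Properties)(filename: String, id: Int): Int =
  withConnection(url, props) { conn =>
    val stmt = conn.prepareStatement("UPDATE REPORTS SET FILE_NAME = ? WHERE ID = ?")
    try {
      stmt.setString(1, filename)
      stmt.setInt(2, id)
      stmt.executeUpdate() // returns the number of rows updated
    } finally stmt.close()
  }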
That (JDBC blocking on the shared connection) is the first and most probable reason. The second possible reason: the database blocks modifications to rows that are being updated by another transaction; how exactly depends on the isolation level.
https://www.sqlservercentral.com/articles/isolation-levels-in-sql-server
That's the second probable reason.
Last but not least, it is not necessary to use bare threads in Scala. Many libraries such as cats-effect, Monix, and ZIO were developed for concurrent/asynchronous programming, and there are database-access libraries built on top of them, such as Slick or doobie.
For numerous reasons it is better to use them than bare threads.
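For illustration, here is a minimal sketch of the same update with doobie (assuming doobie 0.13.x on Cats Effect 2; the connection settings, the item list, and the updateAll helper are hypothetical, only the table and column names come from the question):

import cats.effect.{ContextShift, IO}
import cats.implicits._
import doobie._
import doobie.implicits._
import scala.concurrent.ExecutionContext

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

// One transactor for the whole application; doobie manages the underlying connections.
val xa = Transactor.fromDriverManager[IO](
  "org.postgresql.Driver",
  "jdbc:postgresql://localhost:5432/mydb",
  "user",
  "pass"
)

def changeItemFilename(filename: String, id: Int): ConnectionIO[Int] =
  sql"UPDATE REPORTS SET FILE_NAME = $filename WHERE ID = $id".update.run

// Run the updates in parallel; each one runs in its own transaction.
def updateAll(items: List[(String, Int)]): IO[List[Int]] =
  items.parTraverse { case (filename, id) =>
    changeItemFilename(filename, id).transact(xa)
  }

Each update gets its own transaction, and parTraverse runs them concurrently without managing threads by hand.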
I am trying to work with Flink and Cassandra. Both are massively parallel environments, but I have difficulties making them work together.
Right now I need an operation that reads from Cassandra in parallel over different token ranges, with the possibility of terminating the query after N objects have been read.
Batch mode suits me better, but DataStreams are also possible.
I tried a LongCounter (see below), but it does not work as I expected: I failed to get the global sum with it, only the local values.
Async mode is not necessary, since this CassandraRequester operation is performed in a parallel context with a parallelism of about 64 or 128.
This is my attempt:
class CassandraRequester<T>(val klass: Class<T>, private val context: FlinkCassandraContext) :
    RichFlatMapFunction<CassandraTokenRange, T>() {

    companion object {
        private val session = ApplicationContext.session!!
        private var preparedStatement: PreparedStatement? = null
        private val manager = MappingManager(session)
        private var mapper: Mapper<*>? = null
        private val log = LoggerFactory.getLogger(CassandraRequesterStateless::class.java)
        public const val COUNTER_ROWS_NUMBER = "flink-cassandra-select-count"
    }

    private lateinit var counter: LongCounter

    override fun open(parameters: Configuration?) {
        super.open(parameters)
        if (preparedStatement == null)
            preparedStatement = session.prepare(context.prepareQuery()).setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)
        if (mapper == null) {
            mapper = manager.mapper<T>(klass)
        }
        counter = runtimeContext.getLongCounter(COUNTER_ROWS_NUMBER)
    }

    override fun flatMap(tokenRange: CassandraTokenRange, collector: Collector<T>) {
        val bs = preparedStatement!!.bind(tokenRange.start, tokenRange.end)
        val rs = session.execute(bs)
        val resultSelect = mapper!!.map(rs)
        val iter = resultSelect.iterator()
        while (iter.hasNext()) when {
            this.context.maxRowsExtracted == 0L || counter.localValue < context.maxRowsExtracted -> {
                counter.add(1)
                collector.collect(iter.next() as T)
            }
            else -> {
                collector.close()
                return
            }
        }
    }
}
Is it possible to terminate the query in such a case?
I need to read data from two different tables (both with over 100k rows) in the same database, so I tried creating two Futures with a connection pool size of 50, but the performance doesn't seem to improve (total time is around 5 seconds). Then I found this article:
So if you want to run multiple queries in parallel: no problem, just start them in separate Futures. However you won't have performance benefits, JDBC simply blocks a different Thread, not your main Thread of execution.
This means all the threads will be stuck at JDBC and processed sequentially. Is this true even if my connection pool size is 50? If yes, could you suggest an efficient way of dealing with tables with this many rows (for example, loading the data in less than 2 seconds)?
Here is my piece of code:
case class User(user_id: Long, name: String, age: Int)

class Users(tag: Tag) extends Table[User](tag, "User") {
  def user_id = column[Long]("user_id")
  def name = column[String]("user_name")
  def age = column[Int]("user_age")
  def * = (user_id, name, age) <> (
    { (row: (Long, String, Int)) => User(row._1, row._2, row._3) },
    { (u: User) => Some((u.user_id, u.name, u.age)) }
  )
}
val users = TableQuery[Users]

case class Patron(patron_id: Long, name: String, patronType: Int)

class Patrons(tag: Tag) extends Table[Patron](tag, "Patron") {
  def patron_id = column[Long]("patron_id")
  def name = column[String]("patron_name")
  def patronType = column[Int]("patron_type")
  def * = (patron_id, name, patronType) <> (
    { (row: (Long, String, Int)) => Patron(row._1, row._2, row._3) },
    { (p: Patron) => Some((p.patron_id, p.name, p.patronType)) }
  )
}
val patrons = TableQuery[Patrons]

def getUsers(implicit session: Session): Future[Map[String, Int]] = Future {
  val allUserQuery = for (
    user <- users
  ) yield (user.name, user.age)
  allUserQuery.run.toMap
}

def getPatrons(implicit session: Session): Future[Map[String, Int]] = Future {
  val allPatronQuery = for (
    patron <- patrons
  ) yield (patron.name, patron.patronType)
  allPatronQuery.run.toMap
}

val (userMap: Map[String, Int], patronMap: Map[String, Int]) = Await.result(
  for {
    userData <- getUsers
    patronData <- getPatrons
  } yield (userData, patronData),
  10.seconds)
This might be a very trivial question, but I just don't know what is wrong.
I have a JavaPairDStream, and for each batch interval I want to get the number of keys in my RDD within the stream, so that I can use this number later in the application.
I can get the number of keys by doing:
streamGiveKey.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> stringStringJavaPairRDD) throws Exception {
        int a = stringStringJavaPairRDD.countByKey().size();
        countPartitions = a;
        System.out.print(a + "\r\n");
        return null;
    }
});
JavaPairDStream<String, Iterable<String>> groupingEachBoilerValues = streamGiveKey.groupByKey(countPartitions);
Here, countPartitions is a global variable that stores the number of keys for one batch interval.
The problem is that the application never reaches groupingEachBoilerValues; it just keeps printing inside the foreachRDD in an endless loop.
Is there another way for me to do this?
Thank you so much.
You can keep a global count in the driver:
long globalCount = 0L;
.. foreachRDD(
.. globalCount += rdd.count();
This globalCount variable will reside in the driver and will keep being updated after every batch.
Update: Skeptics ahoy! The above code is specific to streaming; I am well aware it would not work in standard non-streaming RDD code.
I have created test code encompassing the above approach, and the counter works fine. Here it is:
import org.apache.spark._
import org.apache.spark.streaming._
var globalCount = 0L
val ssc = new StreamingContext(sc, Seconds(4))
val lines = ssc.socketTextStream("localhost", 19999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
lines.count().foreachRDD(rdd => { globalCount += rdd.count; println(globalCount) } )
ssc.start
ssc.awaitTermination
Here it is running
scala> ssc.awaitTermination
-------------------------------------------
Time: 1466366660000 ms
-------------------------------------------
1
-------------------------------------------
Time: 1466366664000 ms
-------------------------------------------
2
-------------------------------------------
Time: 1466366668000 ms
-------------------------------------------
3
Here is a tiny data generator program to test with:
import java.net._
import java.io._
case class ClientThread(sock: Socket) {
  new Thread() {
    override def run() {
      val bos = new BufferedOutputStream(sock.getOutputStream)
      while (true) {
        bos.write(s"Hello there it is ${new java.util.Date().toString}\n".getBytes)
        bos.flush() // push each line immediately instead of waiting for the buffer to fill
        Thread.sleep(3000)
      }
    }
  }.start
}

val ssock = new ServerSocket(19999)
while (true) {
  val sock = ssock.accept()
  ClientThread(sock)
}
First off, I would like to state that I am new to Slick and am using version 3.1.1. I have been reading the manual, but I am having trouble getting my query to work; either something is wrong with my connection string or something is wrong with my Slick code. I got my config from http://slick.typesafe.com/doc/3.1.1/database.html and my update example from the bottom of http://slick.typesafe.com/doc/3.1.1/queries.html. OK, so here is my code.
Application Config.
mydb = {
  dataSourceClass = org.postgresql.ds.PGSimpleDataSource
  properties = {
    databaseName = "Jsmith"
    user = "postgres"
    password = "unique"
  }
  numThreads = 10
}
My controller (the database table is called relations):
package controllers

import play.api.mvc._
import slick.driver.PostgresDriver.api._

class Application extends Controller {

  class relations(tag: Tag) extends Table[(Int, Int, Int)](tag, "relations") {
    def id = column[Int]("id", O.PrimaryKey)
    def me = column[Int]("me")
    def following = column[Int]("following")
    def * = (id, me, following)
  }

  val profiles = TableQuery[relations]

  val db = Database.forConfig("mydb")
  try {
    // ...
  } finally db.close()

  def index = Action {
    val q = for { p <- profiles if p.id === 2 } yield p.following
    val updateAction = q.update(322)
    val invoker = q.updateStatement
    Ok()
  }
}
What could be wrong with my code above? I have a separate project that uses plain JDBC, and this configuration works perfectly for it:
db.default.driver=org.postgresql.Driver
db.default.url="jdbc:postgresql://localhost:5432/Jsmith"
db.default.user="postgres"
db.default.password="unique"
You did not run your action yet. db.run(updateAction) executes your query, or more precisely your action, against the database (untested):
def index = Action.async {
  val q = for { p <- profiles if p.id === 2 } yield p.following
  val updateAction = q.update(322)
  val db = Database.forConfig("mydb")
  db.run(updateAction).map(_ => Ok())
}
db.run() returns a Future which will eventually be completed. It is then simply mapped to a Result in Play.
q.updateStatement, on the other hand, just generates the SQL statement without running it. This can be useful while debugging.
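For example (using the q from the snippet above):

// Prints the generated SQL; nothing is executed
println(q.updateStatement)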
Look at the code from my project:
def updateStatus(username: String, password: String, status: Boolean): Future[Boolean] = {
  db.run(
    (for {
      user <- Users if user.username === username
    } yield user).map(_.online).update(status)
  ).map(_ > 0) // true when at least one row was updated; needs an implicit ExecutionContext in scope
}
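A hypothetical call site for the snippet above (the credentials and the imported ExecutionContext are assumptions, not part of the original code):

import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical usage: react to whether a row was actually updated
updateStatus("someUser", "somePassword", status = true).foreach { updated =>
  println(s"online flag updated: $updated")
}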