Spark worker shutdown - how to free shared resources - apache-spark

The Spark manual recommends using a shared static resource (e.g. a connection pool) inside the worker code.
Example from the manual:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
How can the static resource be freed or closed before the executor shuts down? There is no obvious place to call close(). I tried a shutdown hook, but it does not seem to help.
My worker process currently becomes a zombie: the shared resource I use (the HBase async client) creates a pool of non-daemon threads, so the JVM hangs forever.
I am using the Spark Streaming graceful shutdown called on the driver:
streamingContext.stop(true, true);
EDIT:
It seems there already exists an issue in the Spark JIRA that's dealing with the same problem
https://issues.apache.org/jira/browse/SPARK-10911
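A note on why the hook may never fire: JVM shutdown hooks run only once the JVM starts shutting down, which happens when the last non-daemon thread exits or System.exit is called. A pool of non-daemon threads therefore blocks the very event the hook is waiting for. The following is a minimal sketch in plain Java, assuming the client library lets you supply its own threads or executor (not confirmed here for the HBase async client). SharedResource and all of its members are hypothetical names, not Spark or HBase APIs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for a per-JVM shared resource; not part of Spark or HBase.
final class SharedResource {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Daemon threads do not keep the JVM alive once all non-daemon threads have ended.
    static final ThreadFactory DAEMON_FACTORY = runnable -> {
        Thread t = new Thread(runnable, "shared-resource-" + COUNTER.incrementAndGet());
        t.setDaemon(true);
        return t;
    };

    // Initialization-on-demand holder: the pool is created on first use only.
    private static final class Holder {
        static final ExecutorService POOL = createPool();
    }

    private static ExecutorService createPool() {
        ExecutorService pool = Executors.newFixedThreadPool(2, DAEMON_FACTORY);
        // The hook runs when the JVM begins shutting down and closes the pool.
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> close(pool), "shared-resource-close"));
        return pool;
    }

    static ExecutorService pool() {
        return Holder.POOL;
    }

    // Stop accepting work, wait briefly for running tasks, then force termination.
    static void close(ExecutorService pool) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        pool().submit(() -> System.out.println("working")).get();
        // main returns here; the JVM can exit because the pool threads are daemons.
    }
}
```

If the library creates its own non-daemon threads and offers no way to replace them, this sketch does not apply, and an explicit close() call from the worker code would still be needed.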

Related

EF6 Garbage collection increases unmanaged memory instead of releasing it

I have a .NET Web API that executes tasks in a background queue and is deployed on a k8s cluster. The pods keep getting evicted because they exceed the memory threshold.
private async Task BackgroundProcessing(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        var task = await _taskQueue.DequeueAsync(stoppingToken);
        var scope = _services.CreateScope();
        await task(scope.ServiceProvider, stoppingToken);
        scope.Dispose();
    }
}
My application seems to run "normally", but when garbage collection runs, it seems to increase the unmanaged memory instead of releasing it.

How to create connection(s) to a Datasource in Spark Streaming for Lookups

I have a use case where we stream events, and for each event I have to do some lookups. The lookups are in Redis, and I am wondering what the best way to create the connections is. The Spark Streaming job would run 40 executors, and I have 5 such streaming jobs all connecting to the same Redis cluster. I am unsure which approach to take to create the Redis connection:
1. Create a connection object on the driver and broadcast it to the executors (not sure if this really works, as I would have to make the object Serializable). Can I do this with broadcast variables?
2. Create a Redis connection for each partition. However, I have the code written this way:
val update = xyz.transform(rdd => {
  // on driver
  if (xyz.isNewDay) {
    .....
  }
  rdd
})
update.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach(Key_trans => {
      // perform some lookups logic here
    })
  })
})
So if I create a connection inside each partition, I would be creating a new connection for every RDD and for each partition in that RDD.
Is there a way I can maintain one connection for each partition and cache that object, so that I do not have to create connections again and again?
I can add more context/info if required.
1. Create a connection object on the driver and broadcast it to the executors (not sure if this really works, as I would have to make the object Serializable). Can I do this with broadcast variables?
Answer: No. Most connection objects are not serializable, because they hold machine-dependent state such as open sockets.
2. Is there a way I can maintain one connection for each partition and cache that object, so that I do not have to create connections again and again?
Answer: Yes. Create a connection pool and use it inside each partition. You can create a connection pool like this one: https://github.com/RedisLabs/spark-redis/blob/master/src/main/scala/com/redislabs/provider/redis/ConnectionPool.scala
and then use it in this style:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
See also the "Design Patterns for using foreachRDD" section of the Spark Streaming Programming Guide.
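To make the "static, lazily initialized pool" idea concrete, here is a minimal sketch in plain Java. It is not the spark-redis ConnectionPool linked above: SimplePool, FakeConnection and LookupPool are hypothetical names, and FakeConnection merely stands in for a real client connection. The point it illustrates is that a static field is initialized once per JVM (that is, once per executor process), so successive partitions on the same executor reuse the same connections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a real client connection.
final class FakeConnection {
    static final AtomicInteger CREATED = new AtomicInteger();
    final List<String> sent = new ArrayList<>();

    FakeConnection() {
        CREATED.incrementAndGet();
    }

    void send(String record) {
        sent.add(record);
    }
}

// Hypothetical minimal pool: reuse an idle connection, else create a new one.
final class SimplePool<C> {
    private final BlockingQueue<C> idle = new LinkedBlockingQueue<>();
    private final Supplier<C> factory;

    SimplePool(Supplier<C> factory) {
        this.factory = factory;
    }

    C borrow() {
        C c = idle.poll();
        return c != null ? c : factory.get();
    }

    void giveBack(C c) {
        idle.offer(c);
    }

    int idleCount() {
        return idle.size();
    }
}

final class LookupPool {
    // Initialized on first use of LookupPool, once per JVM (per class loader).
    static final SimplePool<FakeConnection> POOL = new SimplePool<>(FakeConnection::new);

    // Body of a hypothetical foreachPartition call.
    static void processPartition(Iterable<String> records) {
        FakeConnection conn = POOL.borrow();
        try {
            for (String r : records) {
                conn.send(r);
            }
        } finally {
            POOL.giveBack(conn); // returned even if a record fails
        }
    }
}
```

Returning the connection in a finally block keeps the pool from leaking connections when processing a record throws.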

scala: apache httpclient in multi-threaded environment

I am writing a singleton class (an object in Scala) which uses Apache HttpClient (4.5.2) to post some file content and return the status to the caller.
object HttpUtils {
  protected val retryHandler = new HttpRequestRetryHandler() {
    def retryRequest(exception: IOException, executionCount: Int, context: HttpContext): Boolean = {
      // retry logic
      true
    }
  }
  private val connectionManager = new PoolingHttpClientConnectionManager()
  // Reusing the same client for each request, which might come from different threads.
  // Is this correct?
  val httpClient = HttpClients.custom()
    .setConnectionManager(connectionManager)
    .setRetryHandler(retryHandler)
    .build()
  def restApiCall(url: String, rDD: RDD[SomeMessage]): Boolean = {
    // Creating a new context for each request
    val httpContext: HttpClientContext = HttpClientContext.create
    val post = new HttpPost(url)
    // convert RDD to text file using rDD.collect
    // add this file as MultipartEntity to post
    var response = None: Option[CloseableHttpResponse] // Is this the correct way of using it?
    try {
      response = Some(httpClient.execute(post, httpContext))
      val responseCode = response.get.getStatusLine.getStatusCode
      EntityUtils.consume(response.get.getEntity) // Is this required?
      if (responseCode == 200) true
      else false
    }
    finally {
      if (response.isDefined) response.get.close
      post.releaseConnection() // Is this required?
    }
  }
  def onShutDown = {
    connectionManager.close()
    httpClient.close()
  }
}
Multiple threads (more specifically, from a Spark Streaming context) call the restApiCall method. I am relatively new to Scala and Apache HttpClient. I have to make frequent connections to only a few fixed servers (i.e. 5-6 fixed URLs with different request parameters).
I went through multiple online resources but am still not confident about it.
Is this the best way to use the HTTP client in a multi-threaded environment?
Is it possible to keep connections alive and reuse them for various requests? Would that be beneficial in this case?
Am I using and releasing all resources efficiently? If not, please suggest improvements.
Is it good to use it in Scala, or is there a better library?
Thanks in advance.
It seems the official docs have answers to all your questions:
2.3.3. Pooling connection manager
PoolingHttpClientConnectionManager is a more complex implementation
that manages a pool of client connections and is able to service
connection requests from multiple execution threads. Connections are
pooled on a per route basis. A request for a route for which the
manager already has a persistent connection available in the pool will
be serviced by leasing a connection from the pool rather than creating
a brand new connection.
PoolingHttpClientConnectionManager maintains a maximum limit of
connections on a per route basis and in total. Per default this
implementation will create no more than 2 concurrent connections per
given route and no more 20 connections in total. For many real-world
applications these limits may prove too constraining, especially if
they use HTTP as a transport protocol for their services.
2.4. Multithreaded request execution
When equipped with a pooling connection manager such as
PoolingClientConnectionManager, HttpClient can be used to execute
multiple requests simultaneously using multiple threads of execution.
The PoolingClientConnectionManager will allocate connections based on
its configuration. If all connections for a given route have already
been leased, a request for a connection will block until a connection
is released back to the pool. One can ensure the connection manager
does not block indefinitely in the connection request operation by
setting 'http.conn-manager.timeout' to a positive value. If the
connection request cannot be serviced within the given time period
ConnectionPoolTimeoutException will be thrown.
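The leasing behaviour quoted above (a fixed number of connections per route, with callers waiting up to a timeout for one to be released) can be pictured with a plain semaphore. This is only an analogy written with java.util.concurrent, not HttpClient's implementation; RouteLimiter and its methods are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Illustration only: "at most N leases per route, wait up to a timeout for one".
final class RouteLimiter {
    private final Semaphore permits;

    RouteLimiter(int maxPerRoute) {
        this.permits = new Semaphore(maxPerRoute);
    }

    // Returns true if a lease was obtained within the timeout, false otherwise.
    boolean tryLease(long timeoutMillis) {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Must be called once per successful lease, typically in a finally block.
    void release() {
        permits.release();
    }

    public static void main(String[] args) {
        RouteLimiter route = new RouteLimiter(2); // 2 mirrors the quoted per-route default
        System.out.println(route.tryLease(50));
        System.out.println(route.tryLease(50));
        System.out.println(route.tryLease(50)); // both leases are taken, so this times out
        route.release();
        System.out.println(route.tryLease(50)); // a released lease can be taken again
    }
}
```

This is why every response has to be closed or fully consumed in the question's code: a lease that is never released eventually starves other threads using the same route.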

Transaction management and Multithreading in Hibernate 4

I have a requirement to execute a parent task which may or may not have a child task. Each parent and child task should run in its own thread. If something goes wrong in the parent or child execution, the transactions of both the parent and the child task must be rolled back. I am using Hibernate 4.
If I understand correctly, the parent and the child task will run in different threads.
In my opinion this is a very bad idea that is not worth considering.
While it may be possible using a JTA transaction, it is clearly not possible when Hibernate delegates transaction management to the underlying JDBC connection (you have one connection per session and MUST NOT share a Hibernate session between threads).
Using JTA you would have to handle connection retrieval and transactions yourself, so you could not take advantage of connection pooling and container-managed transactions (Spring or Java EE). It may be overcomplicated for almost no performance improvement, as sharing the database connection between two threads will probably just move the bottleneck one level down.
See how to share one transaction between multi threads
Following the OP's expectation, here is pseudocode for Hibernate 4 standalone session management with a JDBC transaction (I personally advise going with a container (Java EE or Spring) and container-managed JTA transactions).
In hibernate.cfg.xml
<property name="hibernate.current_session_context_class">thread</property>
SessionFactory :
Configuration configuration = new Configuration();
configuration.configure("hibernate.cfg.xml");
StandardServiceRegistryBuilder builder = new StandardServiceRegistryBuilder().applySettings(configuration.getProperties());
SessionFactory sessionFactory = configuration.buildSessionFactory(builder.build());
The session factory should be exposed using a singleton (any way you choose you must have only one instance for the whole app)
public void executeParentTask() {
    try {
        sessionFactory.getCurrentSession().beginTransaction();
        sessionFactory.getCurrentSession().persist(someEntity);
        myChildTask.execute();
        sessionFactory.getCurrentSession().getTransaction().commit();
    }
    catch (RuntimeException e) {
        sessionFactory.getCurrentSession().getTransaction().rollback();
        throw e; // or display error message
    }
}
getCurrentSession() returns the session bound to the current thread. If you manage thread execution yourself, you should create the session at the beginning of the thread's execution and close it at the end.
The child task retrieves the same session as the parent by calling sessionFactory.getCurrentSession(), provided it runs in the same thread.
See https://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch03.html#configuration-sessionfactory
http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html_single/#transactions-demarcation-nonmanaged
You may find this interesting too : How to configure and get session in Hibernate 4.3.4.Final?
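The thread-bound behaviour of getCurrentSession() described above can be sketched with a ThreadLocal. This is not Hibernate code: ThreadBoundSessions and its nested Session class are hypothetical stand-ins that only show why a child task sees the parent's session when it runs in the same thread, and a different one when it runs in another thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a thread-bound session context; not Hibernate code.
final class ThreadBoundSessions {
    static final AtomicInteger OPENED = new AtomicInteger();

    static final class Session {
        final int id = OPENED.incrementAndGet();
    }

    private static final ThreadLocal<Session> CURRENT = new ThreadLocal<>();

    // Returns the session bound to the calling thread, opening one if needed.
    static Session currentSession() {
        Session s = CURRENT.get();
        if (s == null) {
            s = new Session();
            CURRENT.set(s);
        }
        return s;
    }

    // Unbinds the session from the calling thread, e.g. at the end of a unit of work.
    static void closeCurrent() {
        CURRENT.remove();
    }

    public static void main(String[] args) {
        Session parent = currentSession();
        Session childSameThread = currentSession();
        Session childOtherThread =
                CompletableFuture.supplyAsync(ThreadBoundSessions::currentSession).join();
        System.out.println(parent == childSameThread);  // same thread, same session
        System.out.println(parent == childOtherThread); // other thread, other session
        closeCurrent();
    }
}
```

With a JDBC-backed transaction tied to that session, work done in another thread ends up in a different session and therefore outside the parent's transaction.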

Closing Netty server cleanly

Hello, I am currently developing an Arquillian extension for the Moco framework (https://github.com/dreamhead/moco). Moco is used for testing RESTful services and relies on Netty for communication. Moco currently uses Netty 4.0.18.Final.
I have found a problem when running Moco (and its Netty server) inside a container (Arquillian runs tests within the container): it starts correctly, but when the application is undeployed and the server is shut down, the following error messages are logged:
SEVERE: The web application [/ba32e781-3a18-44b3-9547-7c26787f3fe7] appears to have started a thread named [pool-2-thread-1] but has failed to stop it. This is very likely to create a memory leak.
abr 08, 2014 10:29:06 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/ba32e781-3a18-44b3-9547-7c26787f3fe7] created a ThreadLocal with key of type [io.netty.util.internal.ThreadLocalRandom$2] (value [io.netty.util.internal.ThreadLocalRandom$2#77468cae]) and a value of type [io.netty.util.internal.ThreadLocalRandom] (value [io.netty.util.internal.ThreadLocalRandom#6cd3851]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
Basically, it seems that some threads are still running when the server tries to shut down.
From the point of view of the Arquillian extension, Moco's start method is called when the application is deployed into the server, and Moco's stop method is called before the application is undeployed.
Here is the relevant Moco code:
public int start(final int port, ChannelHandler pipelineFactory) {
    ServerBootstrap bootstrap = new ServerBootstrap();
    bootstrap.group(bossGroup, workerGroup)
            .channel(NioServerSocketChannel.class)
            .childHandler(pipelineFactory);
    try {
        future = bootstrap.bind(port).sync();
        SocketAddress socketAddress = future.channel().localAddress();
        address = (InetSocketAddress) socketAddress;
        return address.getPort();
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
}
and the stop method looks like:
private void doStop() {
    if (future != null) {
        future.channel().close().syncUninterruptibly();
        future = null;
    }
}
So it seems that the close method returns before all the threads have been stopped, and for this reason the container warns about possible memory leaks.
Because I have never used Netty, I was wondering if there is a way to ensure that the whole Netty runtime is shut down.
Thank you so much for your help.
I am new to Netty as well (and unfamiliar with Arquillian), but based on the Netty Docs examples I believe you might not be shutting down the EventLoopGroups you created (bossGroup, workerGroup). From the Netty 4.0 User Guide:
Shutting down a Netty application is usually as simple as shutting down all EventLoopGroups you created via shutdownGracefully(). It returns a Future that notifies you when the EventLoopGroup has been terminated completely and all Channels that belong to the group have been closed.
So your doStop() method might look like:
private void doStop() {
    workerGroup.shutdownGracefully();
    bossGroup.shutdownGracefully();
}
An example in the Netty docs: Http Static File Server Example
