Scala Future concurrency Issue - multithreading

Following is my class where I run tasks concurrently. My problem is that my application never terminates, even after getting results for all the futures. I suspect the thread pool is not shutting down, which keeps my application alive even after my tasks finish. Believe me, I googled a lot to figure it out, but no luck. What am I missing here?
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.collection.mutable.ListBuffer
import scala.util.Failure
import scala.util.Success

object AppLauncher {

  def launchAll(): ListBuffer[Future[String]] = {
    // My code logic where I launch all my threads, say 50
    null
  }

  def main(args: Array[String]): Unit = {
    register(launchAll())
  }

  def register(futureList: ListBuffer[Future[String]]): Unit = {
    futureList.foreach { future =>
      future.onComplete {
        case Success(successResult) => println(successResult)
        case Failure(failureResult) => println(failureResult)
      }
    }
  }
}

Usually, when you operate on an iterable of Futures, you should use Future.sequence, which turns, say, a Seq[Future[T]] into a Future[Seq[T]].
So, use something like:
def register(futureList: Seq[Future[String]]) = Future.sequence(futureList) foreach { results =>
println("received result")
}
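Note that foreach on a Future only runs when the future succeeds, and Future.sequence fails as soon as any element fails, so a failure would be dropped silently above. Below is a small sketch that also reports the failure case; the name registerAll is illustrative, not from the original code:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.{Failure, Success}

def registerAll(futureList: Seq[Future[String]]): Unit =
  Future.sequence(futureList).onComplete {
    case Success(results) => results.foreach(println)            // every future succeeded
    case Failure(t)       => println(s"at least one future failed: $t")
  }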
If you'd like to map each future and print its output as it completes, you can also do something along the lines of:
def register(futureList: Seq[Future[String]]) = Future.sequence(
  futureList.map(f => f.map { v =>
    println(s"$v is complete")
    v
  })
) map { vs =>
  println(s"all values done $vs")
  vs
}

Finally I was able to figure out the issue. The problem was indeed that the thread pool was not terminated even after my futures had completed successfully. I isolated the issue by changing my implementation slightly, as below.
//import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.collection.mutable.ListBuffer
import scala.util.Failure
import scala.util.Success
// Added ExecutionContext explicitly
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

object AppLauncher {

  // Implemented EC explicitly
  private val pool = Executors.newFixedThreadPool(1000)
  private implicit val executionContext = ExecutionContext.fromExecutorService(pool)

  def launchAll(): ListBuffer[Future[String]] = {
    // My code logic where I launch all my threads, say 50
    null
  }

  def main(args: Array[String]): Unit = {
    register(launchAll())
  }

  def register(futureList: ListBuffer[Future[String]]): Unit = {
    futureList.foreach { future =>
      println("Waiting...")
      val result = Await.result(future, scala.concurrent.duration.Duration.Inf)
      println(result)
    }
    pool.shutdownNow()
    executionContext.shutdownNow()
    println(pool.isTerminated() + " Pool terminated")
    println(pool.isShutdown() + " Pool shutdown")
    println(executionContext.isTerminated() + " executionContext terminated")
    println(executionContext.isShutdown() + " executionContext shutdown")
  }
}
Result before adding the highlighted code to shut down the pools:
false Pool terminated
true Pool shutdown
false executionContext terminated
true executionContext shutdown
Adding the highlighted code solved my issue. I made sure there is no resource leak in my code. My scenario permits me to kill the pool once all the futures are done. I'm aware that I changed an elegant callback implementation into a blocking one, but it still solved my problem.
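For completeness, a non-blocking variant is also possible: aggregate the futures with Future.sequence and shut the pool down from the completion callback, so the pool is released exactly once, after the last future finishes. This is only a sketch (the object name and pool size are illustrative, not the code above), but it keeps the callback style while still letting the JVM exit:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object AppLauncherNonBlocking {
  private val pool = Executors.newFixedThreadPool(50)
  private implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

  def register(futureList: Seq[Future[String]]): Unit =
    Future.sequence(futureList).onComplete { result =>
      result match {
        case Success(values) => values.foreach(println)
        case Failure(t)      => println(t)
      }
      // Every future has completed at this point, so the pool can be shut down
      // and the JVM is free to exit.
      pool.shutdown()
    }
}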


Flowfiles are stuck in queue while calling store procedure using Execute Script

I am trying to call a stored procedure using a Groovy script, and the processor I am using is ExecuteScript (I am using Groovy because I want to capture the response of the stored procedure).
But the flow files get stuck, and when I restart the processor they pass through.
The same code works fine on another environment without any issue.
Below is the code I am using to call the stored procedure:
import org.apache.commons.io.IOUtils
import org.apache.nifi.controller.ControllerService
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.*
import groovy.sql.OutParameter
import groovy.sql.Sql
import java.sql.ResultSet
import java.sql.Clob
try{
def lookup = context.controllerServiceLookup
def dbServiceName = ConncationPool.value
def dbcpServiceId = lookup.getControllerServiceIdentifiers(ControllerService).find {
cs -> lookup.getControllerServiceName(cs) == dbServiceName
}
// declared without 'def' so the catch/finally blocks below can still see it
conn = lookup.getControllerService(dbcpServiceId).getConnection();
sql = Sql.newInstance(conn);
// declared without 'def' so the catch block can still reference it
flowFile = session.get()
if(!flowFile) return
attr1= flowFile.getAttribute('attr1')
attr2= flowFile.getAttribute('attr2')
attr3= flowFile.getAttribute('attr3')
def data = []
String sqlString ="""{call procedure_name(?,?,?,?)}""";
def OUT_JSON
def parametersList = [attr1,attr2,attr3,Sql.VARCHAR];
sql.call(sqlString, parametersList) {out_json_response ->
OUT_JSON = out_json_response
};
def attrMap = ['out_json_response':String.valueOf(OUT_JSON),'Conn':String.valueOf(conn)]
flowFile = session.putAllAttributes(flowFile, attrMap)
conn.close()
sql.close();
session.transfer(flowFile, REL_SUCCESS)
}
catch (e){
if (conn != null) conn.close();
if (sql != null) sql.close();
log.error('Scripting error', e)
flowFile = session.putAttribute(flowFile, "error", e.getMessage())
session.transfer(flowFile, REL_FAILURE)
} finally {
if (conn != null) conn.close();
if (sql != null) sql.close();
}
Can you please help me solve this issue? Has anyone else faced the same problem?
I cannot see the run schedule in your screenshot.
So first go to the configuration by right clicking on the processor.
There you'll find a tab named Scheduling. Click on it.
Now, check if the Scheduling Strategy is marked as CRON driven or Timer driven.
There you can also check the Run Schedule.
Since you mentioned that your script runs after restarting or every 15 minutes, I suspect your Run Schedule is set to run roughly every 15 minutes.
Just verify that, and if that is the case, stop the processor and change the configuration to:
Scheduling Strategy: Timer Driven
Run Schedule: 0 sec
NOTE: Before making any changes, if the initial flow was not developed by you, check if changing the Scheduling of the processor will not have any undesired effect. Or just make sure why it was set to run every 15 minutes.

How to get Scala future output into a separate variables

I have 4 Databricks notebooks running concurrently using Scala futures, with the code below.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

case class NotebookData(path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String])
def parallelNotebooks(notebooks: Seq[NotebookData]): Future[Seq[String]] = {
import scala.concurrent.{Future, blocking, Await}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import com.databricks.WorkflowException
val numNotebooksInParallel = 4
// If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once.
// This code limits the number of parallel notebooks.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
val ctx = dbutils.notebook.getContext()
Future.sequence(
notebooks.map { notebook =>
Future {
dbutils.notebook.setContext(ctx)
if (notebook.parameters.nonEmpty)
dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
else
dbutils.notebook.run(notebook.path, notebook.timeout)
}
.recover {
case NonFatal(e) => s"ERROR: ${e.getMessage}"
}
}
)
}
def parallelNotebook(notebook: NotebookData): Future[String] = {
import scala.concurrent.{Future, blocking, Await}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext.Implicits.global
import com.databricks.WorkflowException
val ctx = dbutils.notebook.getContext()
// The simplest interface we can have but doesn't
// have protection for submitting to many notebooks in parallel at once
Future {
dbutils.notebook.setContext(ctx)
if (notebook.parameters.nonEmpty)
dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
else
dbutils.notebook.run(notebook.path, notebook.timeout)
}
.recover {
case NonFatal(e) => s"ERROR: ${e.getMessage}"
}
}
val notebooks = Seq(
NotebookData("/notebook1",0,Map("Env" -> test.toString())),
NotebookData("/notebook2",0,Map("Env" -> test.toString())),
NotebookData("/notebook3",0,Map("Env" -> test.toString())),
NotebookData("/notebook4",0,Map("Env" -> test.toString()))
)
val res = parallelNotebooks(notebooks)
Await.result(res, 3600000 seconds) // this is a blocking call.
res.value
Here each notebook returns some value with dbutils.notebook.exit(). I am getting return values like this:
parallelNotebooks: (notebooks: Seq[NotebookData])scala.concurrent.Future[Seq[String]]
parallelNotebook: (notebook: NotebookData)scala.concurrent.Future[String]
notebooks: Seq[NotebookData] = List(NotebookData(/notebook1,0,Map(Env -> true)), NotebookData(/notebook2,0,Map(Env -> true)),NotebookData(/notebook3,0,Map(Env -> true)),NotebookData(/notebook4,0,Map(Env -> true)))
res: scala.concurrent.Future[Seq[String]] = Future(Success(List(0, 0,0,0)))
res2: Option[scala.util.Try[Seq[String]]] = Some(Success(List(0, 0,0,0)))
In Future(Success(List(0, 0, 0, 0))) and Some(Success(List(0, 0, 0, 0))), the 0s are the return values from my notebooks.
How can I get these values into separate variables, so that I can use them later when I need them?
val value: Seq[String] = Await.result(res, 3600000 seconds)
value will be a List of the String results from the dbutils.notebook.run calls.
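If you want those results bound to individual names, the sketch below pattern-matches on the awaited sequence. It assumes exactly four notebooks, as in the question, and relies on Future.sequence preserving the order of the input sequence; a MatchError is thrown if the size ever differs:
import scala.concurrent.Await
import scala.concurrent.duration._

val Seq(result1, result2, result3, result4) =
  Await.result(res, 3600000.seconds)

println(result1) // exit value of /notebook1 (results come back in submission order)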

Is there a way to use Spark to load a file from FTP using TLS

I am in the process of moving a Python process to Spark. In Python we are using ftplib to connect and download a file to an EC2 instance. Once the file is downloaded, we upload it to S3. We are transitioning to serverless infrastructure and would like to load the file in Spark via AWS Glue and then use multi-part upload to move it to S3. I have tried to just run the current code on a larger Glue instance type, but the machine still runs out of memory (20 GB file).
old python code
"""
This script will get the backup file
"""
import sys
from datetime import datetime
import re
import ftplib
from retry import retry
import shutil
from tools.python.s3_functions import s3_upload
from python_scripts.get import *
def get_ftp_connector(path, user, password):
ftp = ftplib.FTP_TLS(path)
ftp.login(user, password)
ftp.prot_p()
return ftp
def get_ftp_files_list(ftp, dir):
ftp.cwd(dir)
files = ftp.nlst()
print(str("-".join(files)))
if "filecompleted.txt" not in files:
print("Failed to find filescompleted.txt file in ftp server.")
raise Exception("Failed to find filescompleted.txt file in ftp server.")
regex_str = 'Backup_File_Mask_Goes_here([\d]{8}).bak'
find_date_regex = re.compile(regex_str)
searched = [(f, find_date_regex.match(f)) for f in files if find_date_regex.match(f)]
searched = \
[(file_name, datetime.strptime(regex_result.groups()[0], '%Y%m%d')) for file_name, regex_result in searched]
searched = sorted(searched, key=lambda elem: elem[1], reverse=True)
if not searched:
print("Failed to find appropriate file in ftp server.")
raise Exception("Failed to find appropriate file in ftp server.")
return searched[0]
class FtpUploadTracker:
size_written = 0
total_size = 0
last_shown_percent = "X"
def __init__(self, total_size, bk_file):
self.total_size = total_size
self.bk_file = bk_file
self.output_file = open(self.bk_file, 'wb')
self.start_time = datetime.now()
def handle(self, block):
self.size_written += len(block)
percent_complete = str(round((self.size_written / self.total_size) * 100, 1))
self.output_file.write(block)
time_elapsed = (datetime.now() - self.start_time).total_seconds()
speed = round(self.size_written / (1000 * 1000 * time_elapsed), 2)
msg = "{percent}% complete # average speed of {speed}MB/s : total run time {minutes}m".\
format(percent=percent_complete, speed=speed, minutes=round(time_elapsed/60))
if time_elapsed > 600 and speed < 1:
print("Zombie connection, failing dl.")
raise Exception("Zombie connection, failing dl.")
if self.last_shown_percent != percent_complete:
self.last_shown_percent = percent_complete
print(msg)
def close(self):
self.output_file.close()
@retry(tries=4, delay=300)
def retrieve_db():
"""
This function will retrieve via FTP the backup
:return: None
"""
ftp = get_ftp_connector(FTP_PATH, FTP_USER, FTP_PASSWORD)
# return back the most recent entry
file_name, file_date = get_ftp_files_list(ftp, 'database')
file_epoch = (file_date - datetime(1970, 1, 1)).total_seconds()
new_file_name = "backup_{epoch}.bak".format(epoch=str(int(file_epoch)))
if os.path.exists(DATAFILEPATH):
shutil.rmtree(DATAFILEPATH)
if not os.path.exists(DATAFILEPATH):
os.makedirs(DATAFILEPATH)
temp_backup_file_location = os.path.join(DATAFILEPATH + new_file_name)
print("Found file {file_name}, and downloading it to {loc}".
format(file_name=file_name, loc=temp_backup_file_location))
ftp_handler = FtpUploadTracker(ftp.size(file_name), temp_backup_file_location)
ftp.retrbinary("RETR " + file_name, ftp_handler.handle)
ftp.quit()
ftp_handler.close()
print("Finished download. Uploading to S3.")
s3_upload(DATAFILEPATH, new_file_name, bucket, "db_backup")
os.remove(temp_backup_file_location)
def main():
try:
retrieve_db()
except Exception as e:
print("Failed to download backup after 4 tries with error {e}.".format(e=e))
return 1
return 0
if __name__ == "__main__":
rtn = main()
sys.exit(rtn)
New Spark code (in progress): the username has a | character, which made me encode the URI. When I run the code, I get a connection refused error. I am able to use the same connection info from Python.
from pyspark import SparkContext
from pyspark import SparkFiles
import urllib
sc = SparkContext()
ftp_path = "ftp://Username:password#ftplocation.com/path_to_file"
file_path_clean = urllib.parse.urlencode(ftp_path, safe='|')
print(f"file_path_clean: {file_path_clean}")
sc.addFile(ftp_path)
filename = SparkFiles.get(ftp_path.split('/')[-1])
print(f"filename: {filename}")
rdd = sc.textFile("file://" + filename)
print("We got past rdd = sc.textFile(file:// + filename)")
rdd.take(10)
rdd.collect()
print(rdd)
There are three ways to approach the problem:
Use a mounted file system backed by FTP and write to it from Spark.
Use a Spark to SFTP connector such as spark-sftp.
Write the files with Spark somewhere else and copy them to SFTP as a separate step. Due to the various reliability issues with SFTP, and the fact that Spark leaves partial output behind during failed write operations, this is the path we've taken. We write terabytes to SFTP endpoints using code that looks like the following in Scala. I hope it can be helpful for your Python work.
/** Defines some high-level operations for interacting with remote file protocols like FTP, SFTP, etc.
*/
trait RemoteFileOperations extends Closeable {
var backoff: BlockingRetry.Backoff = Backoff.linear(3000)
var retry: BlockingRetry.Retry = Retry.maxRetries(3)
var recover: Recovery = recoverable(this)
var ignore: Ignored = nonRecoverable
def listFiles(path: String = ""): Seq[FInfo]
def uploadFile(localPath: String, remoteDirectory: String): Unit
def downloadFile(localPath: String, remotePath: String): Unit
def deleteAll(path: String): Unit
def connect(): Unit = {}
def disconnect(): Unit = {}
def reconnect(): Unit = {
disconnect()
connect()
}
override def close(): Unit = disconnect()
/** Wraps a block of code and allows it to be retried when [[recoverable()]] conditions
* are met. [[BlockingRetry.retry()]] is called with the var fields
* [[backoff]], [[retry]], [[recover]], and [[ignore]], which can all be reconfigured.
*/
def retryable[A](f: => A): A = {
BlockingRetry.retry(retry, backoff, recover, ignore) {
f
}
}
def recoverable(fileOp: RemoteFileOperations): Recovery = {
case (_: SocketTimeoutException, _: Int) =>
fileOp.reconnect()
None
}
def nonRecoverable: Ignored = {
case _: UnknownHostException |
_: SSLException |
_: SocketException |
_: IllegalStateException =>
}
}
class SSHJClient(host: String, username: String, password: String) extends RemoteFileOperations {
import net.schmizz.keepalive.KeepAliveProvider
import net.schmizz.sshj.connection.ConnectionException
import net.schmizz.sshj.sftp.SFTPClient
import net.schmizz.sshj.transport.verification.PromiscuousVerifier
import net.schmizz.sshj.xfer.FileSystemFile
import net.schmizz.sshj.{DefaultConfig, SSHClient}
override def listFiles(path: String): Seq[FInfo] = {
import collection.JavaConverters._
retryable {
sftpSession(sftp => {
sftp.ls(path).asScala
.filter(f => f.getName != "." && f.getName != "..")
.map(f => FInfo(f.getPath, f.getParent, f.isDirectory, f.getAttributes.getSize, f.getAttributes.getMtime))
})
}
}
override def uploadFile(localPath: String, remoteDirectory: String): Unit = {
retryable {
sftpSession(sftp => {
sftp.getFileTransfer.setPreserveAttributes(false)
sftp.put(new FileSystemFile(localPath), remoteDirectory)
})
}
}
override def downloadFile(localPath: String, remotePath: String): Unit = {
retryable {
sftpSession(sftp => {
sftp.getFileTransfer.setPreserveAttributes(false)
sftp.get(remotePath, new FileSystemFile(localPath))
})
}
}
override def deleteAll(path: String): Unit =
throw new UnsupportedOperationException("#deleteAll is unsupported for SSHJClient")
private def sftpSession[A](f: SFTPClient => A): A = {
val defaultConfig = new DefaultConfig()
defaultConfig.setKeepAliveProvider(KeepAliveProvider.KEEP_ALIVE)
val ssh = new SSHClient(defaultConfig)
try {
// This is equivalent to StrictHostKeyChecking=no which is disabled since we don't usually know
// the SSH remote host key ahead of time.
ssh.addHostKeyVerifier(new PromiscuousVerifier())
ssh.connect(host)
ssh.authPassword(username, password)
val sftp = ssh.newSFTPClient()
try {
f(sftp)
} finally {
sftp.close()
}
} finally {
ssh.disconnect()
}
}
override def recoverable(fileOp: RemoteFileOperations): Recovery = {
super.recoverable(fileOp).orElse {
case (e: ConnectionException, _: Int) =>
println(s"Recovering session from exception: $e")
None
}
}
}
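For context, a hypothetical call site could look like the sketch below. The host, credentials, and paths are placeholders, and since the fields of FInfo are not shown above we just print the whole value; the client is closed after use because RemoteFileOperations extends Closeable:
val client = new SSHJClient("sftp.example.com", "etl-user", sys.env("SFTP_PASSWORD"))
try {
  // Upload a part file that Spark wrote to a local (or mounted) staging directory.
  client.uploadFile("/tmp/staging/part-00000.csv", "/inbound/backups")
  client.listFiles("/inbound/backups").foreach(println)
} finally {
  client.close()
}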

Gpars, name the threads

List<byte[]> outputBytes = []
GParsPool.withPool( ) {
outputBytes = events.collectParallel{
depositNoticeService.getDepositNotice( it.id )
}
}
I want to name each thread so I can identify it in logging, but I do not see any documentation on how to do that.
Get the current thread, and then you can read or change its name:
Thread.currentThread().getName()
Thread.currentThread().setName("NEW NAME")
I'm using Executors, and here is an example that works for me:
import java.util.concurrent.*
import javax.annotation.*
def threadExecute = Executors.newSingleThreadExecutor()
threadExecute.execute {
    log.info(Thread.currentThread().getName())
    Thread.currentThread().setName("NEW NAME")
    log.info(Thread.currentThread().getName())
}
threadExecute.shutdown()
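If the goal is stable names in your logs rather than renaming inside each task, another option is to name the threads when the pool is created, via a ThreadFactory. The sketch below is plain JVM code written in Scala (the language used elsewhere on this page), not GPars-specific, and the pool and prefix names are made up:
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

// Every thread created by this pool gets the given prefix plus a counter.
def namedPool(prefix: String, size: Int) = {
  val counter = new AtomicInteger(0)
  val factory = new ThreadFactory {
    def newThread(r: Runnable): Thread =
      new Thread(r, s"$prefix-${counter.incrementAndGet()}")
  }
  Executors.newFixedThreadPool(size, factory)
}

val pool = namedPool("deposit-notice", 4)
pool.execute(() => println(Thread.currentThread().getName)) // prints "deposit-notice-1"
pool.shutdown()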

How to run a Data Import Script in a Separate Thread in Grails/Groovy?

I import data from Excel, but I wrote the import code in BootStrap.groovy, so my import script method is called when the application starts.
The scenario is that I have about 8000 related records to import at once, if they are not already in my database. Also, when I deploy to Tomcat 6, the import blocks other apps from being deployed until it finishes. So I want to run the script in a separate thread, in any way that does not affect performance or block other deployments.
code excerpt ...
class BootStrap {
def grailsApplication
def sessionFactory
def excelService
def importStateLgaArea(){
String fileName = grailsApplication.mainContext.servletContext.getRealPath('filename.xlsx')
ExcelImporter importer = new ExcelImporter(fileName)
def listState = importer.getStateLgaTerritoryList() //get the map,form excel
log.info "List form excel:${listState}"
def checkPreviousImport = Area.findByName('Osusu')
if(!checkPreviousImport) {
int i = 0
int j = 0 // update cases
def beforeTime = System.currentTimeMillis()
for(row in listState){
def state = State.findByName(row['state'])
if(!state) {
// log.info "Saving State:${row['state']}"
row['state'] = row['state'].toString().toLowerCase().capitalize()
// log.info "after capitalized" + row['state']
state = new State(name:row['state'])
if(!state.save(flush:true)){
log.info "${state.errors}"
break;
}
}
}
}
For importing large amounts of data I suggest taking Spring Batch into consideration. It is easy to integrate into Grails: you can try this plugin or integrate it manually.
