Abort the driver immediately when any executor fails - apache-spark

While loading data into the database, a particular column may contain a bad record. If one executor fails, the driver needs to be notified with a message and the job has to be terminated.
I thought of doing this with an accumulator. Please give me a suggestion on how to do this.
My code is attached below:
public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("loadSqlData").master("local[*]").getOrCreate();

    Properties connectionProperties = new Properties();
    connectionProperties.put("user", "postgres");
    connectionProperties.put("password", "root");

    Dataset<Row> personcsvdata = spark.read().option("header", "true").csv("C:\\Users\\Manasa\\Documents\\nulldata.csv");
    personcsvdata.show();

    LongAccumulator countErrors = spark.sparkContext().longAccumulator();
    try {
        personcsvdata.write().mode(SaveMode.Append).jdbc("jdbc:postgresql://localhost:5432/postgres", "public.employee", connectionProperties);
        countErrors.add(1);
    }
    catch (Exception e) {
    }
}
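For illustration, here is a minimal sketch of the accumulator idea the question describes, not a verified solution: detect bad rows on the executors and fail fast by throwing, since a task that keeps failing makes Spark abort the whole job, and the driver then sees the failure as an exception it can catch. The null check standing in for the "bad record" test is an assumption.

LongAccumulator badRecords = spark.sparkContext().longAccumulator("badRecords");
// Validate on the executors; throwing fails the task, and repeated task
// failures abort the job, which surfaces on the driver as a SparkException.
Dataset<Row> validated = personcsvdata.map((MapFunction<Row, Row>) row -> {
    if (row.anyNull()) {                    // hypothetical "bad record" test
        badRecords.add(1);
        throw new RuntimeException("Bad record: " + row);
    }
    return row;
}, RowEncoder.apply(personcsvdata.schema()));
try {
    validated.write().mode(SaveMode.Append).jdbc("jdbc:postgresql://localhost:5432/postgres", "public.employee", connectionProperties);
} catch (Exception e) {
    // The job has been aborted; report and terminate the driver.
    System.err.println("Write aborted, bad records seen so far: " + badRecords.value());
}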

Related

How can I consume Kafka in parallel with Spark Streaming? I set concurrentJobs but got an error [duplicate]

The Kafka documentation describes the following approach:
One Consumer Per Thread: A simple option is to give each thread its own consumer instance.
My code:
public class KafkaConsumerRunner implements Runnable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final CloudKafkaConsumer consumer;
    private final String topicName;

    public KafkaConsumerRunner(CloudKafkaConsumer consumer, String topicName) {
        this.consumer = consumer;
        this.topicName = topicName;
    }

    @Override
    public void run() {
        try {
            this.consumer.subscribe(topicName);
            ConsumerRecords<String, String> records;
            while (!closed.get()) {
                synchronized (consumer) {
                    records = consumer.poll(100);
                }
                for (ConsumerRecord<String, String> tmp : records) {
                    System.out.println(tmp.value());
                }
            }
        } catch (WakeupException e) {
            // Ignore exception if closing
            System.out.println(e);
            //if (!closed.get()) throw e;
        }
    }

    // Shutdown hook which can be called from a separate thread
    public void shutdown() {
        closed.set(true);
        consumer.wakeup();
    }

    public static void main(String[] args) {
        CloudKafkaConsumer kafkaConsumer = KafkaConsumerBuilder.builder()
                .withBootstrapServers("172.31.1.159:9092")
                .withGroupId("test")
                .build();
        ExecutorService executorService = Executors.newFixedThreadPool(5);
        executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log"));
        executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log.info"));
        executorService.shutdown();
    }
}
but it doesn't work and throws an exception:
java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
Furthermore, I read the source of Flink (an open source platform for distributed stream and batch data processing). Flink's multi-threaded consumer is similar to mine:
long pollTimeout = Long.parseLong(flinkKafkaConsumer.properties.getProperty(KEY_POLL_TIMEOUT, Long.toString(DEFAULT_POLL_TIMEOUT)));
pollLoop: while (running) {
    ConsumerRecords<byte[], byte[]> records;
    //noinspection SynchronizeOnNonFinalField
    synchronized (flinkKafkaConsumer.consumer) {
        try {
            records = flinkKafkaConsumer.consumer.poll(pollTimeout);
        } catch (WakeupException we) {
            if (running) {
                throw we;
            }
            // leave loop
            continue;
        }
    }
(Flink's multi-threaded consumer code)
What's wrong?
The Kafka consumer is not thread-safe. As you pointed out in your question, the documentation states:
A simple option is to give each thread its own consumer instance
But in your code, the same consumer instance is wrapped by different KafkaConsumerRunner instances, so multiple threads are accessing the same consumer. The Kafka documentation clearly states:
The Kafka consumer is NOT thread-safe. All network I/O happens in the thread of the application making the call. It is the responsibility of the user to ensure that multi-threaded access is properly synchronized. Un-synchronized access will result in ConcurrentModificationException.
That's exactly the exception you received.
It is throwing the exception on your call to subscribe: this.consumer.subscribe(topicName);
Move that call into a synchronized block, like this:
@Override
public void run() {
    try {
        synchronized (consumer) {
            this.consumer.subscribe(topicName);
        }
        ConsumerRecords<String, String> records;
        while (!closed.get()) {
            synchronized (consumer) {
                records = consumer.poll(100);
            }
            for (ConsumerRecord<String, String> tmp : records) {
                System.out.println(tmp.value());
            }
        }
    } catch (WakeupException e) {
        // Ignore exception if closing
        System.out.println(e);
        //if (!closed.get()) throw e;
    }
}
Maybe this is not your case, but if you are merging the processing of data from several topics, you can read from multiple topics with the same consumer. If not, it is preferable to create separate jobs, each consuming one topic.
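For completeness, here is a minimal sketch of the one-consumer-per-thread option the documentation recommends, written against the plain KafkaConsumer API rather than the CloudKafkaConsumer wrapper from the question (the configuration Properties are assumed to be supplied by the caller):

// Each runnable owns its own KafkaConsumer, so no synchronization is needed.
public class PerThreadConsumerRunner implements Runnable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final KafkaConsumer<String, String> consumer;
    private final String topicName;

    public PerThreadConsumerRunner(Properties props, String topicName) {
        this.consumer = new KafkaConsumer<>(props); // one instance per thread
        this.topicName = topicName;
    }

    @Override
    public void run() {
        try {
            consumer.subscribe(Collections.singletonList(topicName));
            while (!closed.get()) {
                for (ConsumerRecord<String, String> record : consumer.poll(100)) {
                    System.out.println(record.value());
                }
            }
        } catch (WakeupException e) {
            // Expected during shutdown.
        } finally {
            consumer.close();
        }
    }

    public void shutdown() {
        closed.set(true);
        consumer.wakeup(); // wakeup() is the one thread-safe method on the consumer
    }
}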

Apache Spark -- Data Grouping and Execution in worker nodes

We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and below is what we want to achieve:
Group the data by deviceId. This is done, but we are not sure if we can get a Dataset out of it.
Loop through the grouped data above and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to Datasets, and applied the aggregation logic on the Dataset using the Spark SQLContext APIs.
During load testing we saw that 90% of the processing was happening on the master node, and after a while the CPU usage spiked to 100% and the process died.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far. It actually runs on the worker nodes, but we have yet to get a Dataset for the aggregation logic:
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("OnPrem");
        mconf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(mconf);
        jssc = new JavaStreamingContext(sc, Durations.seconds(60));
        SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
        //spksess.sparkContext().setLogLevel("ERROR");

        Map<String, String> rabbitMqConParams = new HashMap<String, String>();
        rabbitMqConParams.put("hosts", "localhost");
        rabbitMqConParams.put("userName", "guest");
        rabbitMqConParams.put("password", "guest");
        rabbitMqConParams.put("vHost", "/");
        rabbitMqConParams.put("durable", "true");

        List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
        distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));

        Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
            public String call(Delivery message) {
                return new String(message.getBody());
            }
        };

        JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);

        JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2), Durations.seconds(60)); //every 60 seconds one RDD is created
        machineDataRDD.print();

        JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
        JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();

        groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String, Iterable<String>>>() {
            @Override
            public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
                data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
                        while (data.hasNext()) {
                            LOGGER.error("Machine Data == >>" + data.next());
                        }
                    }
                });
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
The grouping code below gives us an Iterable<String> per device; ideally we would like to get a Dataset:
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is the looping with foreachPartition, so that the code execution gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will change our strategy and not try to build a Dataset within the foreachPartition loop.
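For anyone chasing the Dataset angle before giving up on it, here is a sketch of one common alternative (assuming Spark 2.x and the nested schema from the JSON sample above; the aggregates are placeholders): turn each windowed RDD of JSON strings into a DataFrame inside foreachRDD and let Spark SQL distribute the groupBy/aggregation across the workers, instead of building Datasets inside foreachPartition.

machineDataRDD.foreachRDD(rdd -> {
    if (rdd.isEmpty()) return;
    // The session is obtained on the driver; the shuffle and aggregation
    // below still execute on the worker nodes.
    SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();
    Dataset<Row> df = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
    Dataset<Row> perDevice = df.groupBy("DeviceId")
            .agg(functions.avg("data.speed"), functions.max("data.Rate")); // sample aggregates
    perDevice.show();
});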

How to retry asynchronous writes on failure in datastax-jdbc?

I am using the DataStax driver to do Cassandra writes asynchronously, and I would like to see if there is a way to retry writes on failure. On failure, the Throwable passed to onFailure does not seem to contain the original request.
public void onQueryComplete(final ResultSetFuture rsf)
{
    Futures.addCallback(rsf, new FutureCallback<ResultSet>()
    {
        @Override
        public void onSuccess(ResultSet resultSet)
        {
            totalRecordsWritten.incrementAndGet();
            jobContext.putLong("MDPREC_WRITE_CNT", totalRecordsWritten.get());
            System.out.println("Ingestion successful " + totalRecordsWritten.get());
            Logging.log(Logging.INFO, "CassandraPersistence.java ingest() Ingestion successful");
        }

        @Override
        public void onFailure(Throwable throwable)
        {
            jobContext.putInt("MDP_WRITE_FAILED", 1);
            Logging.log(Logging.INFO, "CassandraPersistence.java ingest() Ingestion failed");
            throw new UnexpectedJobExecutionException("Exception while inserting data. Job terminated: " + throwable.getMessage());
        }
    });
}
You can implement your own retry strategy if you don't want to save each query along with its result handler:
http://christopher-batey.blogspot.de/2013/10/cassandra-datastax-java-driver-retry.html
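Alternatively, here is a minimal sketch of retrying without a full retry-strategy class: keep a reference to the statement next to the callback, so that onFailure can re-submit it. This assumes the 3.x driver with Guava futures as in the question; writeWithRetry and the retry bookkeeping are hypothetical names.

public void writeWithRetry(final Session session, final Statement statement, final int attemptsLeft)
{
    ResultSetFuture rsf = session.executeAsync(statement);
    Futures.addCallback(rsf, new FutureCallback<ResultSet>()
    {
        @Override
        public void onSuccess(ResultSet resultSet)
        {
            totalRecordsWritten.incrementAndGet();
        }

        @Override
        public void onFailure(Throwable throwable)
        {
            if (attemptsLeft > 0) {
                // The statement captured here is the "request" the Throwable lacks.
                writeWithRetry(session, statement, attemptsLeft - 1);
            } else {
                Logging.log(Logging.INFO, "Ingestion failed after retries: " + throwable.getMessage());
            }
        }
    });
}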

NoHostAvailable Exception on Async operations with Datastax Cassandra driver

I'm using nested asynchronous query execution with Cassandra. Data is continuously streamed in, and for each incoming record the block of Cassandra operations below is executed. It works fine for a while but then starts throwing a lot of NoHostAvailableExceptions.
Please help me out here.
Cassandra session connection code:
I use separate sessions for reads and writes. Each of these sessions connects to a different seed, as I was told this would improve performance.
final com.datastax.driver.core.Session readSession = CassandraManager.connect("10.22.1.144", "fr_repo", "READ");
final com.datastax.driver.core.Session writeSession = CassandraManager.connect("10.1.12.236", "fr_repo", "WRITE");
The CassandraManager.connect method is below:
public static Session connect(String ip, String keySpace, String type) {
    PoolingOptions poolingOpts = new PoolingOptions();
    poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
    poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 400);
    poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
    poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
    cluster = Cluster
            .builder()
            .withPoolingOptions(poolingOpts)
            .addContactPoint(ip)
            .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
            .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
            .build();
    Session s = cluster.connect(keySpace);
    return s;
}
Database operation code:
ResultSetFuture resultSetFuture = readSession.executeAsync(selectBound.bind(fr.getHashcode()));
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
    public void onSuccess(com.datastax.driver.core.ResultSet resultSet) {
        try {
            Iterator<Row> rows = resultSet.iterator();
            if (!rows.hasNext()) {
                ResultSetFuture resultSetFuture = readSession.executeAsync(selectPrimaryBound.bind(fr.getPrimaryKeyHashcode()));
                Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
                    public void onFailure(Throwable arg0) {
                    }
                    public void onSuccess(ResultSet arg0) {
                        Iterator<Row> rows = arg0.iterator();
                        if (!rows.hasNext()) {
                            writeSession.executeAsync(insertBound.bind(fr.getHashcode(), fr, System.currentTimeMillis()));
                            writeSession.executeAsync(insertPrimaryBound.bind(fr.getHashcode(), fr.getCombinedPrimaryKeys(), System.currentTimeMillis()));
                            produceintoQueue(new Gson().toJson(frCompleteMap));
                        } else {
                            writeSession.executeAsync(updateBound.bind(fr, System.currentTimeMillis(), fr.getHashcode()));
                            produceintoQueue(new Gson().toJson(frCompleteMap));
                        }
                    }
                });
            } else {
                writeSession.executeAsync(updateLastSeenBound.bind(System.currentTimeMillis(), fr.getHashcode()));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void onFailure(Throwable t) {
        // Added for completeness: the outer callback also needs an onFailure.
        t.printStackTrace();
    }
});
It sounds like you're sending more requests than your pool/cluster can handle. This is easy to do when you never actually wait for a result, as is the case in your code: you're essentially throwing as many requests as you can into the pipeline with no blocking, and there's no natural back pressure to slow your app down if the pool or cluster gets backed up. So if your request volume is too high, eventually all the hosts will be busy with the backed-up work queue. You can use nodetool tpstats to see what your request queues look like on each node.
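One common way to add that missing back pressure (a sketch, not part of the original answer; the permit count of 256 is a tunable assumption) is to bound the number of in-flight requests with a semaphore:

final Semaphore inFlight = new Semaphore(256); // max concurrent in-flight requests

inFlight.acquire(); // blocks the producing thread when the cluster falls behind
ResultSetFuture resultSetFuture = readSession.executeAsync(selectBound.bind(fr.getHashcode()));
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
    public void onSuccess(ResultSet resultSet) {
        inFlight.release(); // free the slot before the existing handling logic
        // ... existing nested logic ...
    }
    public void onFailure(Throwable t) {
        inFlight.release();
        t.printStackTrace();
    }
});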

Java: Running transaction in multithreaded environment

We are launching a website that will see very heavy volume for a short period of time; it is basically handing out tickets. The code is written in Java, Spring, and Hibernate. I want to mimic the high volume by spawning multiple threads that try to get tickets via a JUnit test case. The problem is that in my DAO class the code simply dies after I begin the transaction. I mean there is no error trace in the log file or anything like that. Let me give some idea of how my code is laid out.
DAO code:
@Repository("customerTicketDAO")
public class CustomerTicketDAO extends BaseDAOImpl { // BaseDAOImpl extends HibernateDaoSupport

    public void saveCustomerTicketUsingJDBC(String customerId) {
        try {
            getSession().getTransaction().begin(); //NOTHING HAPPENS AFTER THIS LINE OF CODE
            // A select query
            Query query1 = getSession().createSQLQuery("my query omitted on purpose");
            .
            .
            // An update query
            Query query2 = getSession().createSQLQuery("my query omitted on purpose");
            getSession().getTransaction().commit();
        } catch (Exception e) {
        }
    }
}
Runnable code:
public class InsertCustomerTicketRunnable implements Runnable {

    @Autowired
    private CustomerTicketDAO customerTicketDAO;

    private String customerId;

    public InsertCustomerTicketRunnable(String customerId) {
        this.customerId = customerId;
    }

    @Override
    public void run() {
        if (customerTicketDAO != null) {
            customerTicketDAO.saveCustomerTicketUsingJDBC(customerId);
        }
    }
}
JUnit method:
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = {"file:src/test/resources/applicationContext-test.xml"})
public class DatabaseTest {

    @Before
    public void init() {
        sessionFactory = (SessionFactory) applicationContext.getBean("sessionFactory");
        Session session = SessionFactoryUtils.getSession(sessionFactory, true);
        TransactionSynchronizationManager.bindResource(sessionFactory, new SessionHolder(session));
        customerTicketDAO = (CustomerTicketDAO) applicationContext.getBean("customerTicketDAO");
    }

    @After
    public void end() throws Exception {
        SessionHolder sessionHolder = (SessionHolder) TransactionSynchronizationManager.unbindResource(sessionFactory);
        SessionFactoryUtils.closeSession(session);
    }

    @Test
    public void saveCustomerTicketInMultipleThreads() throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
        for (int i = 0; i < 1000; i++) {
            executor.submit(new InsertCustomerTicketRunnable(String.valueOf(i)));
        }
        // This will make the executor accept no new threads
        // and finish all existing threads in the queue
        executor.shutdown();
        // Wait until all threads are finished
        executor.awaitTermination(1, TimeUnit.SECONDS);
    }
}
I see no data being inserted into the database. Can someone please point out where I am going wrong?
Thanks
Raj
SessionFactory is thread-safe but Session is not, so my guess is that you need to call SessionFactoryUtils.getSession() from within each thread, so that each thread gets its own instance. You are currently calling it from the main thread, so all the child threads try to share the same instance.
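For illustration, here is a sketch of that suggestion using the same SessionFactoryUtils calls the test already makes in init()/end(), moved into the Runnable so each worker thread binds its own Session (this assumes sessionFactory is injected or otherwise reachable from the Runnable):

@Override
public void run() {
    // Each thread opens and binds its own Session instead of sharing one.
    Session session = SessionFactoryUtils.getSession(sessionFactory, true);
    TransactionSynchronizationManager.bindResource(sessionFactory, new SessionHolder(session));
    try {
        customerTicketDAO.saveCustomerTicketUsingJDBC(customerId);
    } finally {
        TransactionSynchronizationManager.unbindResource(sessionFactory);
        SessionFactoryUtils.closeSession(session);
    }
}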
Naughty, naughty!
public void saveCustomerTicketUsingJDBC(String customerId) {
    try {
        getSession().getTransaction().begin(); //NOTHING HAPPENS AFTER THIS LINE OF CODE
        .
        .
    } catch (Exception e) {
    }
}
You should never (well, hardly ever) have an empty catch block; if there is a problem, you will find that your code 'just simply dies' with no log messages. Oh look, that's what's happening ;)
At the very minimum you should log the exception; that will go a long way towards helping you find what the problem is (and from there, the solution).
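For illustration, here is a minimal version of what that method might look like with the failure made visible (the logger field is hypothetical; use whatever logging the project already has):

public void saveCustomerTicketUsingJDBC(String customerId) {
    try {
        getSession().getTransaction().begin();
        // ... queries ...
        getSession().getTransaction().commit();
    } catch (Exception e) {
        // Log it, leave no half-finished transaction behind, and let the failure propagate.
        logger.error("Ticket save failed for customer " + customerId, e);
        getSession().getTransaction().rollback();
        throw new RuntimeException(e);
    }
}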
