In apache spark how we can remove data from file once spark reads it - apache-spark

I am using custom spark receiver for spark streaming. In the receiver, I am reading text from the file and make its RDD.
The problem is, that data remains in file and spark custom receiver reads it again. So I want to remove data from the file to avoid duplication once spark reads it.
The receiver function looks like this
private void receive() {
try {
List<String> blocks = new ArrayList<>();
while (!isStopped()){
JavaSparkContext sc = spark.getSparkContext();
JavaRDD<String> lines = sc.textFile("src/dummy2.csv");
blocks = lines.collect();
store(blocks.iterator());
blocks.clear();
}
// Restart in an attempt to connect again when server is active again
restart("Trying to connect again");
} catch(Throwable t) {
// restart if there is any other error
restart("Error receiving data", t);
}
}

Related

Register Java Class in Flink Cluster

I am running my Fat Jar in Flink Cluster which reads Kafka and saves in Cassandra, the code is,
final Properties prop = getProperties();
final FlinkKafkaConsumer<String> flinkConsumer = new FlinkKafkaConsumer<>
(kafkaTopicName, new SimpleStringSchema(), prop);
flinkConsumer.setStartFromEarliest();
final DataStream<String> stream = env.addSource(flinkConsumer);
DataStream<Person> sensorStreaming = stream.flatMap(new FlatMapFunction<String, Person>() {
#Override
public void flatMap(String value, Collector<Person> out) throws Exception {
try {
out.collect(objectMapper.readValue(value, Person.class));
} catch (JsonProcessingException e) {
logger.error("Json Processing Exception", e);
}
}
});
savePersonDetails(sensorStreaming);
env.execute();
and The Person POJO contains,
#Column(name = "event_time")
private Instant eventTime;
There is codec required to store Instant as below for Cassandra side,
final Cluster cluster = ClusterManager.getCluster(cassandraIpAddress);
cluster.getConfiguration().getCodecRegistry().register(InstantCodec.instance);
When i run standalone works fine, but when i run local cluster throws me an error as below,
Caused by: com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [timestamp <-> java.time.Instant]
at com.datastax.driver.core.CodecRegistry.notFound(CodecRegistry.java:679)
at com.datastax.driver.core.CodecRegistry.createCodec(CodecRegistry.java:526)
at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:506)
at com.datastax.driver.core.CodecRegistry.access$200(CodecRegistry.java:140)
at com.datastax.driver.core.CodecRegistry$TypeCodecCacheLoader.load(CodecRegistry.java:211)
at com.datastax.driver.core.CodecRegistry$TypeCodecCacheLoader.load(CodecRegistry.java:208)
I read the below document for registering,
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/custom_serializers.html
but InstantCodec is 3rd party one. How can i register it?
I solved the problem, there was LocalDateTime which was emitting from and when i was converting with same type, there was above error. I changed the type into java.util Date type then it worked.

How to read messages in MQs using spark streaming,i.e ZeroMQ,RabbitMQ?

As the spark docs says,it support kafka as data streaming source.but I use ZeroMQ,And there is not a ZeroMQUtils.so how can I use it? and generally,how about other MQs. I am totally new to spark and spark streaming, so I am sorry if the question is stupid.Could anyone give me a solution.Thanks
BTW,I use python.
Update, I finally did it in java with a Custom Receiver. Below is my solution
public class ZeroMQReceiver extends Receiver<T> {
private static final ObjectMapper mapper = new ObjectMapper();
public ZeroMQReceiver() {
super(StorageLevel.MEMORY_AND_DISK_2());
}
#Override
public void onStart() {
// Start the thread that receives data over a connection
new Thread(this::receive).start();
}
#Override
public void onStop() {
// There is nothing much to do as the thread calling receive()
// is designed to stop by itself if isStopped() returns false
}
/** Create a socket connection and receive data until receiver is stopped */
private void receive() {
String message = null;
try {
ZMQ.Context context = ZMQ.context(1);
ZMQ.Socket subscriber = context.socket(ZMQ.SUB);
subscriber.connect("tcp://ip:port");
subscriber.subscribe("".getBytes());
// Until stopped or connection broken continue reading
while (!isStopped() && (message = subscriber.recvStr()) != null) {
List<T> results = mapper.readValue(message,
new TypeReference<List<T>>(){} );
for (T item : results) {
store(item);
}
}
// Restart in an attempt to connect again when server is active again
restart("Trying to connect again");
} catch(Throwable t) {
// restart if there is any other error
restart("Error receiving data", t);
}
}
}
I assume you are talking about Structured Streaming.
I am not familiar with ZeroMQ, but an important point in Spark Structured Streaming sources is replayability (in order to ensure fault tolerance), which, if I understand correctly, ZeroMQ doesn't deliver out-of-the-box.
A practical approach would be buffering the data either in Kafka and using the KafkaSource or as files in a (local FS/NFS, HDFS, S3) directory and using the FileSource for reading. Cf. Spark Docs. If you use the FileSource, make sure not to append anything to an existing file in the FileSource's input directory, but move them into the directory atomically.

Apache Spark -- Data Grouping and Execution in worker nodes

We are getting live machine data as json and we get this data from RabbitMQ. below is a sample of the json,
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for duration of 'X' minutes and then below is what we want to achieve
Group the data by deviceId, this is done but not sure if we can get a DataSet
We want to loop through the above grouped data and execute for aggregation logic for each device using the foreachPartition so that the code is executed within worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code was collecting the data,looping through the RDD's,convert them to DataSet and applying aggregation logic on the DataSet using Spark SqlContext api's.
When doing load testing we saw 90% of the processing was happening in Master node and after a while the cpu usage spiked to 100% and the process bombed out.
So we are now trying to re-engineer the whole process to execute maximum of logic in worker nodes.
Below is the code so far that actually works in worker node but we are yet to get a DataSet for aggregating Logic
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("OnPrem");
mconf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(mconf);
jssc = new JavaStreamingContext(sc, Durations.seconds(60));
SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
//spksess.sparkContext().setLogLevel("ERROR");
Map<String, String> rabbitMqConParams = new HashMap<String, String>();
rabbitMqConParams.put("hosts", "localhost");
rabbitMqConParams.put("userName", "guest");
rabbitMqConParams.put("password", "guest");
rabbitMqConParams.put("vHost", "/");
rabbitMqConParams.put("durable", "true");
List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
public String call(Delivery message) {
return new String(message.getBody());
}
};
JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2),Durations.seconds(60)); //every 60 seconds one RDD is Created
machineDataRDD.print();
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String,Iterable<String>>>(){
#Override
public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
data.foreachPartition(new VoidFunction<Iterator<Tuple2<String,Iterable<String>>>>(){
#Override
public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
while(data.hasNext()){
LOGGER.error("Machine Data == >>"+data.next());
}
}
});
}
});
jssc.start();
jssc.awaitTermination();
}
catch (Exception e)
{
e.printStackTrace();
}
The below grouping code gives us a Iterable of string for a Device , ideally we would like to get a DataSet
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
Important thing for me is the looping using foreachPartition so that code executing gets pushed to Worker Nodes.
After looking through more code samples and guidelines sqlcontext , sparksession are not serialized and available on the worker nodes , so we will be changing the strategy of not trying to build a dataset withing foreachpartition loop.

How to write JavaRDD to marklogic database

I am evaluating spark with marklogic database. I have read a csv file, now i have a JavaRDD object which i have to dump into marklogic database.
SparkConf conf = new SparkConf().setAppName("org.sparkexample.Dataload").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("/root/ml/workArea/data.csv");
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<Record> rdd_records = data.map(
new Function<String, Record>() {
public Record call(String line) throws Exception {
String[] fields = line.split(",");
Record sd = new Record(fields[0], fields[1], fields[2], fields[3],fields[4]);
return sd;
}
});
This JavaRDD object i want to write to marklogic database.
Is there any spark api available for faster writing to the marklogic database ?
Lets say, If we could not write JavaRDD directly to marklogic then what is the currect approach to achieve this ?
Here is the code which i am using to write the JavaRDD data to marklogic database, let me know if it is wrong way to do that.
final DatabaseClient client = DatabaseClientFactory.newClient("localhost",8070, "MLTest");
final XMLDocumentManager docMgr = client.newXMLDocumentManager();
rdd_records.foreachPartition(new VoidFunction<Iterator<Record>>() {
public void call(Iterator<Record> partitionOfRecords) {
while (partitionOfRecords.hasNext()) {
Record record = partitionOfRecords.next();
System.out.println("partitionOfRecords - "+record.toString());
String docId = "/example/"+record.getID()+".xml";
JAXBContext context = JAXBContext.newInstance(Record.class);
JAXBHandle<Record> handle = new JAXBHandle<Record>(context);
handle.set(record);
docMgr.writeAs(docId, handle);
}
}
});
client.release();
I have used java client api to write the data, but i am getting below exception even though POJO class Record is implementing Serializable interface. Please let me know what could be the reason & how to solve that.
org.apache.spark.sparkexception task not Serializable .
The easiest way to get data into MarkLogic is via HTTP and the client REST API - specifically the /v1/documents endpoints - http://docs.marklogic.com/REST/client/management .
There are a variety of ways to optimize this, such as via a write set, but based on your question, I think the first thing to decide is - what kind of document do you want to write for each Record? Your example shows 5 columns in the CSV - typically, you'll write either a JSON or XML document with 5 fields/elements, each named based on the column index. So you'd need to write a little code to generate that JSON/XML, and then use whatever HTTP client you prefer (and one option is the MarkLogic Java Client API) to write that document to MarkLogic.
That addresses your question of how to write a JavaRDD to MarkLogic - but if your goal is to get data from a CSV into MarkLogic as fast as possible, then skip Spark and use mlcp - https://docs.marklogic.com/guide/mlcp/import#id_70366 - which involves zero coding.
Modified example from spark streaming guide, Here you will have to implement connection and writing logic specific to database.
public void send(JavaRDD<String> rdd) {
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
#Override
public void call(Iterator<String> partitionOfRecords) {
// ConnectionPool is a static, lazily initialized pool of
Connection connection = ConnectionPool.getConnection();
while (partitionOfRecords.hasNext()) {
connection.send(partitionOfRecords.next());
}
ConnectionPool.returnConnection(connection); // return to the pool
// for future reuse
}
});
}
I'm wondering if you just need to make sure everything you access inside your VoidFunction that was instantiated outside it is serializable (see this page). DatabaseClient and XMLDocumentManager are of course not serializable, as they're connected resources. You're right, however, to not instantiate DatabaseClient inside your VoidFunction as that would be less efficient (though it would work). I don't know if the following idea would work with spark. But I'm guessing you could create a class that keeps hold of a singleton DatabaseClient instance:
public static class MLClient {
private static DatabaseClient singleton;
private MLClient() {}
public static DatabaseClient get(DatabaseClientFactory.Bean connectionInfo) {
if ( connectionInfo == null ) {
throw new IllegalArgumentException("connectionInfo cannot be null");
}
if ( singleton == null ) {
singleton = connectionInfo.newClient();
}
return singleton;
}
}
then you just create a serializable DatabaseClientFactory.Bean outside your VoidFunction so your auth info is still centralized
DatabaseClientFactory.Bean connectionInfo =
new DatabaseClientFactory.Bean();
connectionInfo.setHost("localhost");
connectionInfo.setPort(8000);
connectionInfo.setUser("admin");
connectionInfo.setPassword("admin");
connectionInfo.setAuthenticationValue("digest");
Then inside your VoidFunction you could get that singleton DatabaseClient and new XMLDocumentManager like so:
DatabaseClient client = MLClient.get(connectionInfo);
XMLDocumentManager docMgr = client.newXMLDocumentManager();

Spark Streaming: how not to restart receiver after receiver's failure

We are using a custom spark receiver that reads streamed data from a provided http link. If the provided http link is incorrect, the receiver fails. The problem is that spark will continuously restart the receiver, and the application will never terminate. The question is how to tell Spark to terminate the application if the receiver fails.
This is an extract of our custom receiver:
def onStart() {
// Start the thread that receives data over a connection
new Thread("Receiver") {
override def run() { receive() }
}.start()
}
private def receive(): Unit = {
....
val response: CloseableHttpResponse = httpclient.execute(req)
try {
val sl = response.getStatusLine()
if (sl.getStatusCode != 200){
val errorMsg = "Error: " + sl.getStatusCode
val thrw = new RuntimeException(errorMsg)
stop(errorMsg, thrw)
} else {
...
store(doc)
}
We have a spark streaming application that uses this receiver:
val ssc = new StreamingContext(sparkConf, duration)
val changes = ssc.receiverStream(new CustomReceiver(...
...
ssc.start()
ssc.awaitTermination()
Everything works as expected if the receiver doesn't have errors. If the receiver fails (e.g. with a wrong http link), spark will continuously restart it and the application will never terminate.
16/05/31 17:03:38 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
16/05/31 17:03:38 ERROR ReceiverTracker: Receiver has been stopped. Try to restart it.
We just want to terminate the whole application if a receiver fails.
There is a way to control the life cycle of Custom receiver based spark-streaming applications. Define job progress listener for your application and keep track of what is happening.
class CustomReceiverListener extends StreamingJobProgressListener {
private boolean receiverStopped = false;
public CustomReceiverListener(StreamingContext ssc) { super(ssc);}
public boolean isReceiverStopped() {
return receiverStopped;
}
#Override
public void onReceiverStopped(StreamingListenerReceiverStopped receiverStopped) {
LOG.info("Update the flag field");
this.receiverStopped = true;
}
}
And in your driver, initialize a thread to monitor the state of receiverStopped flag. Driver will stop the stream app when this thread is finished. (Better approach is to define a callback method defined by the driver, that will stop the streaming application).
CustomReceiverListener listener = new CustomReceiverListener(ssc);
ssc.addStreamingListener(listener);
ssc.start();
Thread thread = new Thread(() -> {
while (!listener.isReceiverStopped()) {
LOG.info("Sleepy head...");
try {
Thread.sleep(2 * 1000); /*check after 2 seconds*/
} catch (InterruptedException e) {
e.printStackTrace();
}
}
});
thread.start();
thread.join();
LOG.info("Listener asked to die! Going to commit suicide :(");
ssc.stop(true, false);
Note: In case of multiple instances of your receivers, change the implementation of CustomReceiverListener to make sure all the receiver instances are stopped.
It seems that the scheduling in Spark Streaming works in such a way that ReceiverTracker will keep restarting a failed receiver until ReceiverTracker is not stopped itself.
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L618
To stop ReceiverTracker, we need to stop the whole application. Thus, it seems there is no a way to control this process from a receiver itself.

Resources