I have two separate Jet pipelines, and I have to execute them one after the other.
public static void main(String[] args) {
JetInstance instance = Jet.newJetInstance();
// this job is for stream processing
JobConfig cfg1 = new JobConfig().setName("CDC-monitor-stream");
instance.newJob(pipeLine1(), cfg1);
// this job is for batch processing
JobConfig cfg2 = new JobConfig().setName("Batch-Processing-monitor");
instance.newJobIfAbsent(pipeLine2(), cfg2).join();
}
public static Pipeline pipeLine1() {
StreamSource<ChangeRecord> source = PostgresCdcSources.postgres("source")
.setCustomProperty("plugin.name", "pgoutput").setDatabaseAddress("127.0.0.1").setDatabasePort(5432)
.setDatabaseUser("postgres").setDatabasePassword("root").setDatabaseName("postgres")
.setTableWhitelist("public.customers").build();
Pipeline pipeline = Pipeline.create();
pipeline.readFrom(source).withoutTimestamps().writeTo(CdcSinks.map("customers", r -> r.key().toMap().get("id"),
r -> r.value().toObject(Customer.class).toString()));
return pipeline;
}
public static Pipeline pipeLine2() {
Pipeline pipeline = Pipeline.create();
pipeline.readFrom(Sources.map("customers")).filter(e -> e.getKey().equals("0001")).writeTo(Sinks.logger());
return pipeline;
}
The first job keeps the cache in sync with the database (a CDC sink from the DB into the map).
The second job performs the batch processing, using Sources.map("customers") as its source. (I'm using batch processing here because the data extraction depends on user input,
such as String id="0001" or String userName="harry". Or is this possible using a stream?)
Or is there a better approach?
I need to load around 10 million records from a flat file into a Hazelcast map. Also, the TTL needs to be set individually for each map entry.
What is the most efficient way to do this?
Currently I am using IMap.putAll(). Is there a way to set the TTL per map entry using putAll?
There isn't an API that allows you to do bulk put with individual expiry.
The code below would be a way to do it with Hazelcast Jet writing into Hazelcast's IMap.
The client submits this job and the grid servers process it, reading a single input file on the server side. The .groupingKey line partitions the input stream by the entry key, so each server does a map.put where the key will be local, but enriched with a different TTL for each entry.
This is an alternative to iterating across your input file and inserting each key individually. Whether it is faster will depend on factors such as the network speed, number of servers, and so on. It is certainly more complicated than simple iteration, so the speed gain would need to justify the complexity.
public class MyClient implements EntryExpiredListener<Long, Long> {
private static final String INPUT_DIRECTORY = System.getProperty("user.home") + "/input_data";
private static final String MAP_NAME = "test";
public static void main(String[] args) {
new MyClient().go();
}
public void go() {
JetInstance jetInstance = Jet.newJetClient();
jetInstance.getMap(MAP_NAME).addEntryListener(this, false);
Pipeline pipeline = MyClient.buildPipeline();
JobConfig jobConfig = new JobConfig();
jobConfig.addClass(MyClient.class);
try {
jetInstance.newJob(pipeline, jobConfig).join();
} catch (Exception e) {
e.printStackTrace();
}
}
/**
* Process a file that looks like <pre>
* % cat test/input
* 1
* 2
* 3
* 4
* 5
* </pre>
* @return the pipeline to submit
*/
private static Pipeline buildPipeline() {
ComparatorEx<Tuple3<Long, Long, Long>> comparatorEx = ComparatorEx.comparingLong(Tuple3::f0);
Pipeline pipeline = Pipeline.create();
BatchStage<String> input = pipeline.readFrom(MyClient.mySource(INPUT_DIRECTORY));
// Convert to trios of key, value, expiry
BatchStage<Tuple3<Long, Long, Long>> tuples
= input
.map(line -> {
long l = Long.parseLong(line);
return Tuple3.<Long, Long, Long>tuple3(100 * l, 200 * l, 300 * l);
});
// Route per JVM based on entry key
BatchStage<Entry<Long, Tuple3<Long, Long, Long>>> routedEntries
= tuples
.groupingKey(Tuple3::f0)
.rollingAggregate(AggregateOperations.maxBy(comparatorEx));
// Custom map save using expiry
routedEntries.writeTo(MyClient.mySink(MAP_NAME));
// [Optional] log all entries to stdout
routedEntries.writeTo(Sinks.logger());
return pipeline;
}
private static BatchSource<String> mySource(String directory) {
return Sources.filesBuilder(directory)
.sharedFileSystem(true)
.build();
}
private static Sink<? super Entry<Long, Tuple3<Long, Long, Long>>> mySink(String mapName) {
return SinkBuilder.sinkBuilder("mySink",
processorContext -> processorContext.jetInstance().<Long, Long>getMap(mapName))
.receiveFn((IMap<Long, Long> map, Entry<Long, Tuple3<Long, Long, Long>> entry) -> {
map.put(entry.getKey(), entry.getValue().f1(), entry.getValue().f2(), TimeUnit.SECONDS);
})
.build();
}
@Override
public void entryExpired(EntryEvent<Long, Long> entryEvent) {
System.out.println(entryEvent.getEventType() + " for " + entryEvent.getKey());
}
}
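For comparison, the simple iteration that the pipeline above is an alternative to can be sketched in a few lines. This is a minimal sketch in plain Java, not Hazelcast code: TtlWriter is a hypothetical stand-in shaped like IMap.put(key, value, ttl, timeUnit), SimpleTtlLoader is an invented name, and the key/value/TTL arithmetic simply mirrors the map step of the pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a map that accepts a per-entry TTL,
// shaped like IMap.put(key, value, ttl, timeUnit).
interface TtlWriter {
    void put(long key, long value, long ttl, TimeUnit unit);
}

class SimpleTtlLoader {
    // Parses each non-blank line as a long and writes one entry per line,
    // deriving key, value and TTL the same way as the pipeline's map step.
    static int load(List<String> lines, TtlWriter writer) {
        int written = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            long l = Long.parseLong(trimmed);
            writer.put(100 * l, 200 * l, 300 * l, TimeUnit.SECONDS);
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        // Record the puts in a list instead of talking to a real cluster.
        List<String> puts = new ArrayList<>();
        int n = load(List.of("1", "2", "3"),
                (k, v, ttl, unit) -> puts.add(k + "=" + v + " ttl=" + ttl + " " + unit));
        System.out.println(n + " entries: " + puts);
    }
}
```

Against a real cluster the writer could be something like (k, v, ttl, unit) -> map.put(k, v, ttl, unit) on an IMap<Long, Long>. Each call would then be a separate remote operation, which is presumably where the Jet job can gain speed.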
We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and this is what we then want to achieve:
Group the data by DeviceId. This is done, but we are not sure if we can get a DataSet.
Loop through the grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code was collecting the data, looping through the RDDs, converting them to a DataSet and applying the aggregation logic on the DataSet using the Spark SQLContext APIs.
When doing load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process crashed.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far, which actually runs on the worker nodes, but we have yet to get a DataSet for the aggregation logic.
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("OnPrem");
mconf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(mconf);
jssc = new JavaStreamingContext(sc, Durations.seconds(60));
SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
//spksess.sparkContext().setLogLevel("ERROR");
Map<String, String> rabbitMqConParams = new HashMap<String, String>();
rabbitMqConParams.put("hosts", "localhost");
rabbitMqConParams.put("userName", "guest");
rabbitMqConParams.put("password", "guest");
rabbitMqConParams.put("vHost", "/");
rabbitMqConParams.put("durable", "true");
List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
public String call(Delivery message) {
return new String(message.getBody());
}
};
JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2),Durations.seconds(60)); //every 60 seconds one RDD is Created
machineDataRDD.print();
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String,Iterable<String>>>(){
@Override
public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
data.foreachPartition(new VoidFunction<Iterator<Tuple2<String,Iterable<String>>>>(){
@Override
public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
while(data.hasNext()){
LOGGER.error("Machine Data == >>"+data.next());
}
}
});
}
});
jssc.start();
jssc.awaitTermination();
}
catch (Exception e)
{
e.printStackTrace();
}
}
The grouping code below gives us an Iterable of String per device; ideally we would like to get a DataSet.
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is looping with foreachPartition, so that the executing code gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will be changing strategy and will not try to build a DataSet within the foreachPartition loop.
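Since a DataSet cannot be built inside foreachPartition, one option is to express the per-device aggregation as a keyed reduce over plain values. The following is a minimal sketch in plain Java, not Spark code: Reading, SumCount and DeviceAggregation are hypothetical names, the JSON is assumed to be already parsed, and the average speed is only an example metric.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DeviceAggregation {
    // Hypothetical, already-parsed reading: only the fields needed here.
    static final class Reading {
        final String deviceId;
        final int speed;

        Reading(String deviceId, int speed) {
            this.deviceId = deviceId;
            this.speed = speed;
        }
    }

    // Partial result (sum, count); two partial results can be merged in any order.
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double average() {
            return (double) sum / count;
        }
    }

    // Folds every reading into the running partial result for its device,
    // without first materializing a list of readings per device.
    static Map<String, SumCount> aggregate(List<Reading> readings) {
        Map<String, SumCount> perDevice = new TreeMap<>();
        for (Reading r : readings) {
            perDevice.merge(r.deviceId, new SumCount(r.speed, 1), SumCount::merge);
        }
        return perDevice;
    }

    public static void main(String[] args) {
        List<Reading> window = Arrays.asList(
                new Reading("MAC-1001", 2493),
                new Reading("MAC-1001", 2493),
                new Reading("MAC-1002", 2493));
        aggregate(window).forEach((device, sc) ->
                System.out.println(device + " count=" + sc.count + " avgSpeed=" + sc.average()));
    }
}
```

Because merge is associative, the same shape could in principle be handed to a keyed reduce such as reduceByKey on the paired stream, which combines values on the workers instead of building an Iterable per device first.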
I have an ASP.NET Web API application with a supporting Azure WebJob whose functions are triggered by messages added to a storage queue by the API's controllers. Testing the Web API is simple enough using OWIN, but how do I test the WebJobs?
Do I run a console app in memory in the test runner? Execute the function directly (that wouldn't be a proper integration test, though)? It is a continuous job, so the app doesn't exit. To make matters worse, Azure WebJob functions are void, so there's no output to assert.
There is no need to run a console app in memory. You can run a JobHost in the memory of your integration test.
var host = new JobHost();
You could use host.Call() or host.RunAndBlock(). You would need to point to an Azure storage account, as WebJobs are not supported on localhost.
It depends on what your function is doing, but you could manually add a message to a queue, add a blob, or whatever triggers it. You could then assert by querying the storage where your WebJob wrote its result.
While @boris-lipschitz is correct, when your job is continuous (as the OP says it is), you can't do anything after calling host.RunAndBlock().
However, if you run the host in a separate thread, you can continue with the test as desired. You do have to do some kind of polling at the end of the test, though, to know when the job has run.
Example
Function to be tested (A simple copy from one blob to another, triggered by created blob):
public void CopyBlob(
[BlobTrigger("input/{name}")] TextReader input,
[Blob("output/{name}")] out string output)
{
output = input.ReadToEnd();
}
Test function:
[Test]
public void CopyBlobTest()
{
var blobClient = GetBlobClient("UseDevelopmentStorage=true;");
//Start host in separate thread
var thread = new Thread(() =>
{
Thread.CurrentThread.IsBackground = true;
var host = new JobHost();
host.RunAndBlock();
});
thread.Start();
//Trigger job by writing some content to a blob
using (var stream = new MemoryStream())
using (var stringWriter = new StreamWriter(stream))
{
stringWriter.Write("TestContent");
stringWriter.Flush();
stream.Seek(0, SeekOrigin.Begin);
blobClient.UploadStream("input", "blobName", stream);
}
//Check every second for up to 20 seconds to see if the blob has been created in output, and assert its content if it has
var maxTries = 20;
var found = false;
while (maxTries-- > 0)
{
if (!blobClient.Exists("output", "blobName"))
{
Thread.Sleep(1000);
continue;
}
using (var stream = blobClient.OpenRead("output", "blobName"))
using (var streamReader = new StreamReader(stream))
{
Assert.AreEqual("TestContent", streamReader.ReadToEnd());
}
found = true;
break;
}
//Fail instead of passing silently if the output blob never appeared
Assert.IsTrue(found, "Output blob was not created within 20 seconds");
}
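The start-in-background-then-poll pattern above is not specific to C#. Here is a minimal sketch of the polling part in Java, with hypothetical names Polling and waitUntil. It returns false on timeout, so the caller can fail the test rather than let it pass without asserting anything.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class Polling {
    // Checks the condition, sleeping sleepMillis after each failed check,
    // for at most maxTries sleeps. Returns true as soon as the condition holds,
    // false if it still does not hold after the last sleep or if interrupted.
    static boolean waitUntil(BooleanSupplier condition, int maxTries, long sleepMillis) {
        for (int attempt = 0; attempt < maxTries; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        AtomicBoolean done = new AtomicBoolean(false);
        // Stand-in for the continuously running job: finishes its work after ~200 ms.
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(200);
                done.set(true);
            } catch (InterruptedException ignored) {
                // exit without setting the flag
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive, like IsBackground above
        worker.start();

        boolean finished = waitUntil(done::get, 20, 100);
        System.out.println(finished ? "job finished" : "timed out");
    }
}
```

A test would assert on the return value of waitUntil before checking the produced output.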
I've been able to simulate this really easily by simply doing the following, and it seems to work fine for me:
private JobHost _webJob;
[OneTimeSetUp]
public void StartupFixture()
{
_webJob = Program.GetHost();
_webJob.Start();
}
[OneTimeTearDown]
public void TearDownFixture()
{
_webJob?.Stop();
}
Where the WebJob Code looks like:
public class Program
{
public static void Main()
{
var host = GetHost();
host.RunAndBlock();
}
public static JobHost GetHost()
{
...
}
}
I would like to have a running task (ListenResponse()) that continuously listens and sends data to a serial port. At the same time, I would like to add another task (WriteParameter) through which I set some parameters, which are then sent to the serial port. The problem is that I'm now using two async tasks, but the data in the thread is not getting updated fast enough. It updates only after a few iterations of the ListenResponse() task.
Pseudocode is below:
string data;
void ListenResponseSerial() {
byte b;
while (true) {
b = getSerialData(&b);
parse(b);
if (parsed) {
Respond(data);
}
}
}
public async Task<object> ListenResponse(dynamic input) {
(new Task(ListenResponseSerial)).Start();
}
public async Task<object> Write(dynamic input) {
IDictionary<string, object> payload = (IDictionary<string, object>)input;
data = (string) payload["data"];
return null;
}
Thank you very much for your help.
I want to measure the time that the execution of the combineByKey function needs. I always get a result of 20-22 ms (HashPartitioner) and ~350 ms (without partitioning) with the code below, independent of the file size I use (file0: ~300 kB, file1: ~3 GB, file2: ~8 GB)! Can this be true? Or am I doing something wrong?
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
boolean partitioning = true; //or false
int partitionCount = 100; // between 1 and 200 I can't see any difference in the duration!
SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionCount);
long duration = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration; // Measured time always the same, independent of file size (~20ms with / ~350ms without partitioning)
// Do an action
Tuple2<Integer, Float> test = consumptionRDD.takeSample(true, 1).get(0);
sc.stop();
Some helper methods (shouldn't matter):
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
public Float call(Float sumYet, String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
sumYet += value;
return sumYet;
}
};
// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
public Float call(Float a, Float b) throws Exception {
a += b;
return a;
}
};
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
if (partitioning) {
return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
} else {
return pairRDD;
}
}
private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
return input.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
String[] data = debsDataSet.split(",");
int houseId = Integer.valueOf(data[6]);
return new Tuple2<Integer, String>(houseId, debsDataSet);
}
});
}
The web UI provides you with details on the jobs and stages that your application has run. It shows the time taken by each of them, and you can filter on various details such as Scheduler Delay, Task Deserialization Time, and Result Serialization Time.
The default port for the web UI is 8080 (this is the standalone master's UI; a running application's own UI defaults to port 4040). Completed applications are listed there, and you can then click on the name, or craft the URL like this: x.x.x.x:8080/history/app-[APPID] to access those details.
I don't believe any other "built-in" methods exist to monitor the running time of a task or stage. Otherwise, you may want to go deeper and use a JVM debugging framework.
EDIT: combineByKey is a transformation, which means that it is not applied to your RDD at the moment you call it; only actions trigger the computation (read more about the lazy behaviour of RDDs here, chapter 3.1). I believe the time difference you're observing comes from the time Spark takes to create the actual data structure when partitioning or not.
If there is a difference, you'll see it when the action runs (takeSample here).
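The effect can be reproduced without Spark. The sketch below is only an analogy using java.util.stream, whose intermediate operations are lazy in the same spirit as RDD transformations; LazyTimingDemo and describe are hypothetical names. Timing the line that builds the pipeline measures almost nothing, because the mapping function has not been called yet.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class LazyTimingDemo {
    // Counts how many elements the mapping function has actually processed.
    static final AtomicInteger CALLS = new AtomicInteger();

    // Only describes the computation: map() is a lazy intermediate operation.
    static IntStream describe(int n) {
        return IntStream.range(0, n).map(i -> {
            CALLS.incrementAndGet();
            return i * 2;
        });
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        IntStream pipeline = describe(5_000_000);
        long describeNanos = System.nanoTime() - t0;
        int callsBefore = CALLS.get(); // still 0: nothing has run yet

        long t1 = System.nanoTime();
        long sum = pipeline.asLongStream().sum(); // terminal operation: the work happens here
        long runNanos = System.nanoTime() - t1;

        System.out.println("describe: " + describeNanos + " ns, calls so far: " + callsBefore);
        System.out.println("run:      " + runNanos + " ns, calls: " + CALLS.get() + ", sum: " + sum);
    }
}
```

In the Spark code from the question, the equivalent would be to start the timer before the action (takeSample) rather than around combineByKey, keeping in mind that the measured time then covers the whole lineage, including reading the file.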