Hazelcast Jet cluster processes duplicates - hazelcast-jet

I have deployed 3 Spring Boot apps with Hazelcast Jet embedded. The nodes recognize each other and run as a cluster. I have the following code: a simple read from a CSV and write to a file. But Jet writes duplicates to the file sink. To be precise, Jet processes the total number of entries in the CSV multiplied by the number of nodes. So if I have 10 entries in the source and 3 nodes, I see 3 files in the sink, each containing all 10 entries. I want to process each record once and only once. Here is my code:
Pipeline p = Pipeline.create();
BatchSource<List<String>> source = Sources.filesBuilder("files")
.glob("*.csv")
.build(path -> Files.lines(path).skip(1).map(line -> split(line)));
p.readFrom(source)
.writeTo(Sinks.filesBuilder("out").build());
instance.newJob(p).join();

If it's a shared file system, then the sharedFileSystem option on the file source builder must be set to true, e.g. Sources.filesBuilder("files").glob("*.csv").sharedFileSystem(true).build(...). Otherwise each member assumes the files are local to it, reads all of them, and you get one full copy of the output per member.
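The mechanism behind the duplicates can be sketched language-neutrally. Below is a small Python simulation (not the Jet API; the real source partitions work by file rather than by record, so this is a simplification) showing why 10 records on 3 members become 30 outputs without the shared-file-system flag:

```python
def process(records, num_members, shared_file_system):
    """Simulate a cluster of file-source readers over one directory.

    Without shared_file_system, every member assumes the files are
    local-only and reads everything; with it, the work is divided so
    each record is processed exactly once.
    """
    processed = []
    for member in range(num_members):
        if shared_file_system:
            # Work is partitioned across members (here: round-robin).
            share = [r for i, r in enumerate(records) if i % num_members == member]
        else:
            # Every member reads the full directory.
            share = list(records)
        processed.extend(share)
    return processed

rows = [f"row{i}" for i in range(10)]
assert len(process(rows, 3, shared_file_system=False)) == 30  # duplicates
assert sorted(process(rows, 3, shared_file_system=True)) == sorted(rows)
```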

Related

Aggregate a continuous stream of numbers from a file using Hazelcast Jet

I am trying to sum a continuous stream of numbers from a file using Hazelcast Jet:
pipe
.drawFrom(Sources.fileWatcher(<dir>))
.map(s->Integer.parseInt(s))
.addTimestamps()
.window(WindowDefinition.sliding(10000,1000))
.aggregate(AggregateOperations.summingDouble(x->x))
.drainTo(Sinks.logger());
A few questions:
1. It doesn't give the expected output; my expectation is that as soon as a new number appears in the file, it should add it to the existing sum.
2. Why do I need the window and addTimestamps calls to do this? I just need a sum over an infinite stream.
3. How can we achieve fault tolerance, i.e., if the server restarts, will it save the aggregated result and, when it comes back up, continue aggregating from the last computed sum?
4. If the server is down and a few numbers arrive in the file, will it, when it comes back up, read from the point where it went down, or will it miss the numbers added while it was down and only read those added after it was back up?
Answer to Q1 & Q2:
You're looking for rollingAggregate, you don't need timestamps or windows.
pipe
.drawFrom(Sources.fileWatcher(<dir>))
.rollingAggregate(AggregateOperations.summingDouble(Double::parseDouble))
.drainTo(Sinks.logger());
Answer to Q3 & Q4: the fileWatcher source isn't fault-tolerant. The reason is that it reads local files, and when a member dies, its local files won't be available anyway. When the job restarts, it will start reading from the current position and will miss numbers added while the job was down.
Also, since you use a global aggregation, data from all files will be routed to a single cluster member and the other members will be idle.
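The difference between the windowed pipeline and rollingAggregate can be illustrated with a minimal Python sketch (not the Jet API): a rolling aggregation emits the updated running total after every input item, with no timestamps or windows involved.

```python
def rolling_sum(stream):
    """Emit the running total after every item, analogous to
    rollingAggregate(summingDouble(...)): no windowing required."""
    total = 0.0
    for item in stream:
        total += float(item)
        yield total

# Each new number immediately updates the existing sum.
assert list(rolling_sum(["1", "2", "3.5"])) == [1.0, 3.0, 6.5]
```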

View number of nodes used in Hive queries

I need to view the number of nodes used in my HDInsight cluster while running Hive queries. How can I view this while my queries are running? I know the Ambari view provides this, but where can I get the exact number of nodes and the storage used? Thanks.
After you run the job, review the current JobTracker log; you may see entries like this:
2014-01-23 20:14:59,136 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201401221948_0006 = 1395667. Number of splits = 7
2014-01-23 20:14:59,137 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201401221948_0006_m_000000 has split on node:/fd0/ud0/localhost
2014-01-23 20:14:59,137 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201401221948_0006_m_000001 has split on node:/fd0/ud0/localhost
......
If you see Number of splits = 1, there will be one map task, and you know that only one node will be utilized.
When Number of splits > 1, for each split you will see a map task created, with TaskTracker node info like this:
2014-01-23 20:14:59,153 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201401221948_0006_m_000008_0' to tip task_201401221948_0006_m_000008, for tracker 'tracker_workernode7:127.0.0.1/127.0.0.1:49200'
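As a rough rule of thumb (the exact computation depends on the InputFormat, the min/max split size settings, and file boundaries), the number of splits for a single splittable file is about ceil(inputSize / splitSize). A small sketch, with the split size as an assumption:

```python
import math

def num_splits(input_size_bytes, split_size_bytes):
    # Approximation only: real InputFormats also respect per-file
    # boundaries, mapred min/max split sizes, and slack on the last split.
    return max(1, math.ceil(input_size_bytes / split_size_bytes))

# The log above shows input size 1395667 with 7 splits. With a typical
# 128 MB split size that input would fit in one split, so 7 splits
# implies a much smaller effective split size or multiple input files.
assert num_splits(1_395_667, 128 * 1024 * 1024) == 1
assert num_splits(1_395_667, 200_000) == 7
```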

Where are my extra Spark tasks coming from

I have a Spark program that is training several ML algorithms. The code that generates the final stage of my job looks like this (in Kotlin):
val runConfigs = buildOptionsCrossProduct(opts)
log.info("Will run {} different configurations.", runConfigs.size)
val runConfigsRdd: JavaRDD<RunConfiguration> = sc.parallelize(runConfigs)
// Create an RDD mapping window size to the score for that window size.
val accuracyRdd = runConfigsRdd.mapToPair { runConfig: RunConfiguration ->
runSingleOptionSet(runConfig, opts, trainingBroadcast, validBroadcast) }
accuracyRdd.saveAsTextFile(opts.output)
runConfigs is a list of 18 items. The log line right after the configs are generated shows:
17/02/06 19:23:20 INFO SparkJob: Will run 18 different configurations.
So I'd expect at most 18 tasks, as there should be at most one task per stage per partition (at least that's my understanding). However, the History Server reports 80 tasks, most of which finish very quickly and, not surprisingly, produce no output:
There are in fact 80 output files generated, all but 18 of them empty. My question is: what are the other 80 - 18 = 62 tasks in this stage doing, and why were they generated?
You use SparkContext.parallelize without providing the numSlices argument, so Spark falls back to defaultParallelism, which in your case is apparently 80. In general, parallelize tries to spread data uniformly between partitions, but it doesn't remove empty ones, so if you want to avoid scheduling empty tasks you should set numSlices to a number smaller than or equal to runConfigs.size.
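Spark slices a parallelized collection positionally. A small Python re-implementation of that slicing scheme (it loosely mirrors the range-based formula Spark uses; treat the exact boundaries as an assumption) shows how 18 items spread over 80 slices leaves exactly 62 empty partitions, hence 62 do-nothing tasks:

```python
def slice_collection(items, num_slices):
    # Slice i covers [i*n // num_slices, (i+1)*n // num_slices).
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

parts = slice_collection(list(range(18)), 80)
assert len(parts) == 80                             # one task per slice
assert sum(len(p) for p in parts) == 18             # no data lost
assert sum(1 for p in parts if not p) == 62         # empty partitions
```

With numSlices=18 (or anything <= 18) every partition gets at least one item and no empty tasks are scheduled.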

How to access this kind of data in Spark

The data is stored in the following forms:
data/file1_features.mat
data/file1_labels.txt
data/file2_features.mat
data/file2_labels.txt
...
data/file100_features.mat
data/file100_labels.txt
Each data/file*_features.mat stores the features of some samples, one sample per row. Each data/file*_labels.txt stores the labels of those samples, one number per row (e.g., 1, 2, 3, ...). Across all 100 files there are about 80 million samples in total.
In Spark, how to access this data set?
I have checked the spark-2.0.0-preview/examples/src/main/python/mllib/random_forest_classification_example.py. It has the following lines:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])
I ran this example in ./bin/pyspark; it shows that the data object is a PythonRDD:
PythonRDD[32] at RDD at PythonRDD.scala:48
The data/mllib/sample_libsvm_data.txt is just one file; in my case there are many files. Is there an RDD in Spark that handles this case conveniently? Do I need to merge all 100 files into one big file and process it as in the example? I want to use the Spark engine to scale over the data set (mean-std normalization or min-max normalization).
Simply point sc.textFile at the directory:
dir = "<path_to_data>/data"
sc.textFile(dir)
Spark automatically picks up all of the files inside that directory.
If you want to load only a specific file type, you can use a glob pattern when loading files into the RDD:
dir = "data/*.txt"
sc.textFile(dir)
Spark will then load all files ending with the .txt extension.
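The pattern matching can be previewed locally with Python's glob module; Spark's textFile accepts similar Hadoop-style glob patterns (this is an illustration of the matching only, not a Spark call):

```python
import glob
import os
import tempfile

# Create a throwaway directory shaped like the question's layout.
d = tempfile.mkdtemp()
for name in ("file1_labels.txt", "file2_labels.txt", "file1_features.mat"):
    open(os.path.join(d, name), "w").close()

# "*.txt" picks up only the label files, leaving the .mat files out,
# just as sc.textFile(d + "/*.txt") would.
txt_files = sorted(os.path.basename(p)
                   for p in glob.glob(os.path.join(d, "*.txt")))
assert txt_files == ["file1_labels.txt", "file2_labels.txt"]
```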

How does the Apache Spark scheduler split files into tasks?

At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; slide 17 shows a stage being split into 4 tasks, as below:
Here I want to know three things about how a stage is split into tasks:
1. In the example above, it seems that the number of tasks is based on the number of files; am I right?
2. If I'm right in point 1, then if there were just 3 files under the directory, would it create just 3 tasks?
3. If I'm right in point 2, what if there is just one very large file? Does this stage become just 1 task? And what if the data is coming from a streaming source?
Thanks a lot; I'm confused about how a stage is split into tasks.
You can configure the number of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly; you can see the actual number of partitions with:
rdd.partitions.size
So, no, you will not end up with a single worker chugging away for a long time on a single file.
(*) Very small files may change this behavior, but in any case large files will follow this pattern.
The split occurs in two stages:
First, HDFS splits the logical file into 64 MB or 128 MB blocks when the file is loaded.
Second, Spark schedules a map task to process each block.
There is a fairly complex internal scheduling process, since three copies of each block are stored on three different servers and, for large logical files, it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between Hadoop distributions.
When all the map tasks have run, the collectors, shuffle, and reduce tasks can then be run.
Stage: a new stage is created when a wide transformation occurs.
Task: one task is created per partition on a worker.
See this link for more explanation: How DAG works under the covers in RDD?
Question 1: in the example above, it seems that the number of tasks is based on the number of files; am I right?
Answer: it's not based on the number of files; it's based on your Hadoop blocks (0.gz and 1.gz are blocks of data saved or stored in HDFS).
Question 2: if I'm right in point 1, then if there were just 3 files under the directory, would it create just 3 tasks?
Answer: by default the block size in Hadoop is 64 MB, and each block of data is treated as a partition in Spark.
Note: number of partitions = number of tasks; that is why 3 tasks were created.
Question 3: what if there is just one very large file? Does this stage become just 1 task? And what if the data is coming from a streaming source?
Answer: no, a very large file will be split into partitions and, as answered for question 2, the number of tasks created is based on the number of partitions.
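The "one very large file" case can be put into numbers. A minimal sketch, assuming a splittable file format and the 64 MB block size mentioned above (real deployments often use 128 MB and apply further split-size settings):

```python
import math

def partitions_for_file(file_size_mb, block_size_mb=64):
    # One HDFS block ~= one Spark input partition ~= one task.
    return max(1, math.ceil(file_size_mb / block_size_mb))

assert partitions_for_file(10) == 1     # small file: a single task
assert partitions_for_file(640) == 10   # 640 MB over 64 MB blocks: 10 tasks
assert partitions_for_file(1000) == 16  # ceil(1000 / 64)
```

So a single large file does not mean a single task; the file is read in parallel, one task per block-sized partition.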
