With Prometheus, if I have several metrics that need to be collected at once, I would create a collector like so:
public List<MetricFamilySamples> collect() {
    List<NetworkInterface> networkInterfaces = getNetworkInterfaces();
    if (networkInterfaces.isEmpty()) {
        return new ArrayList<>();
    }
    NetworkMetricFamilies networkMetrics = new NetworkMetricFamilies();
    collect(networkMetrics, networkInterfaces);
    return networkMetrics.asList();
}
I would implement a collect() method that would return a list of the needed metrics.
What is the equivalent with Micrometer?
All metrics are collected on each scrape. There is no efficiency to be gained by packing distinct metric names or combinations of tags into a single set of measurements.
Rather, iterate over the network interfaces and register any number of metrics for each.
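For example, here is a minimal sketch registering one gauge per interface against a MeterRegistry. The NetworkInterfaceStats type and the metric names are illustrative placeholders, not part of Micrometer itself:
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.List;

public class NetworkMetricsBinder {

    /** Hypothetical per-interface counters; stands in for however you read them. */
    public interface NetworkInterfaceStats {
        String name();
        double bytesReceived();
        double bytesSent();
    }

    public void bindTo(MeterRegistry registry, List<NetworkInterfaceStats> interfaces) {
        for (NetworkInterfaceStats iface : interfaces) {
            // One gauge per interface; the "interface" tag distinguishes the time series.
            Gauge.builder("network.bytes.received", iface, NetworkInterfaceStats::bytesReceived)
                    .tag("interface", iface.name())
                    .register(registry);
            Gauge.builder("network.bytes.sent", iface, NetworkInterfaceStats::bytesSent)
                    .tag("interface", iface.name())
                    .register(registry);
        }
    }
}
Since the registry holds the meters, the gauges are read lazily on each scrape; there is nothing to return from a collect()-style method.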
Using map/reduce functions only (not Mango), and the following example from the documentation, one may obtain the number of occurrences of each unique label with the map and reduce functions below:
Rows returned by the view
{"total_rows":9,"offset":0,"rows":[
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"bike","value":null},
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"couchdb","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"couchdb","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"couchdb","value":null},
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"drums","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"hypertext","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"music","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"mustache","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"philosophy","value":null}
]}
Map function
function(doc) {
    if(doc.name && doc.tags) {
        doc.tags.forEach(function(tag) {
            emit(tag, 1);
        });
    }
}
Reduce function
function(keys, values) {
    return sum(values);
}
Response with grouping (group=true)
{"rows":[
{"key":"bike","value":1},
{"key":"couchdb","value":3},
{"key":"drums","value":1},
{"key":"hypertext","value":1},
{"key":"music","value":1},
{"key":"mustache","value":1},
{"key":"philosophy","value":1}
]}
Now my question is: using map/reduce views only (not Mango), how can I query the view to select only rows having a specific value after the reduce (for example "3")? It looks like all view parameters focus on filtering based on the key, but I need to filter based on the value. Ideally, being able to use greater-than and less-than comparisons on the reduced value would also be great.
The ability to filter based on the value is essential for scenarios like the one above, but also for more advanced scenarios involving linked documents. Of course, I am not interested in filtering in memory in the application layer, since in real-world scenarios the result set would be much larger than a dozen rows.
I have a list of accounts and perform a hash join with ticks, returning the accounts with tick data. But after the hash join I drainTo an IListJet and then read it with DistributedStream and return it.
public List<Account> populateTicksInAccounts(List<Account> accounts) {
    ...
    ...
    Pipeline p = Pipeline.create();
    BatchSource<Tick> ticksSource = Sources.list(TICKS_LIST_NAME);
    BatchSource<Account> accountSource = Sources.fromProcessor(AccountProcessor.of(accounts));

    p.drawFrom(ticksSource)
     .hashJoin(p.drawFrom(accountSource), JoinClause.joinMapEntries(Tick::getTicker), accountMapper())
     .drainTo(Sinks.list(TEMP_LIST));

    jet.newJob(p).join();

    IListJet<Account> list = jet.getList(TEMP_LIST);
    return DistributedStream.fromList(list).collect(DistributedCollectors.toIList());
}
Is it possible to drainTo a Java List instead of an IListJet after performing the hash join?
Is something like the below possible?
List<Account> accountWithTicks = new ArrayList<>();
p.drawFrom(ticksSource)
 .hashJoin(p.drawFrom(accountSource), JoinClause.joinMapEntries(Tick::getTicker), accountMapper())
 .drainTo(<CustomSinkProcessor(accountWithTicks)>);
return accountWithTicks;
where CustomSinkProcessor would take the empty Java list and fill it with the accounts?
Keep in mind that the code you submit to Jet for execution runs outside the process where you submit it from. While it would be theoretically possible to provide the API you're asking for, under the hood it would just have to perform some tricks to run the code on each member of the cluster, let all members send their results to one place, and fill up a list to return to you. It would go against the nature of distributed computing.
If you think it will help the readability of your code, you can write a helper method such as this:
public <T, R> List<R> drainToList(GeneralStage<T> stage) {
    String tmpListName = randomListName();
    SinkStage sinkStage = stage.drainTo(Sinks.list(tmpListName));
    IListJet<R> tmpList = jet.getList(tmpListName);
    try {
        jet.newJob(sinkStage.getPipeline()).join();
        return new ArrayList<>(tmpList);
    } finally {
        tmpList.destroy();
    }
}
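A hypothetical call site, reusing the stages and mapper names from your own pipeline (they are not part of the helper), could then look like:
List<Account> accountsWithTicks = drainToList(
        p.drawFrom(ticksSource)
         .hashJoin(p.drawFrom(accountSource),
                   JoinClause.joinMapEntries(Tick::getTicker),
                   accountMapper()));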
Especially note the line
return new ArrayList<>(tmpList);
as opposed to your
IListJet<Account> list = jet.getList(TEMP_LIST);
return DistributedStream.fromList(list).collect(DistributedCollectors.toIList());
This just copies one Hazelcast list to another one and returns a handle to it. Now you have leaked two lists in the Jet cluster. They don't automatically disappear when you stop using them.
Even the code I provided can still be leaky. The JVM process that runs it can die during Job.join() without reaching finally. Then the temporary list lingers on.
No, it's not, due to the distributed nature of Jet. The sink will execute in multiple parallel processors (workers), so it can't add to a plain Collection. The sink has to be able to insert items on multiple cluster members.
In general I know what the problem is, but I have no idea how to solve it.
I have a simple map-function:
function(doc) {
    if(doc.Type === 'Mission'){
        for(var i in doc.Sections){
            emit(doc._id, {_id: doc.Sections[i].id});
        }
    }
}
Based on the result of the map-function, I use a list-function to do some formatting:
function(head, req){
    var result = [];
    var row;
    topo = require('lib/topojson');
    while(row = getRow()){
        if (row !== null) {
            if(row.value._id){
                row.doc.Geometry.properties.IDs.Section_ID = row.value._id;
            }else{
                row.doc.Geometry.properties.IDs.Section_ID = row.value;
            }
            geojson = {
                type: "Feature",
                geometry: row.doc.Geometry.geometry,
                properties: row.doc.Geometry.properties
            };
            result.push(geojson);
        }else{
            send(JSON.stringify({
                status_code: 404
            }));
        }
    }
    send(JSON.stringify(result));
}
The more documents match the map function, the longer the processing in the list function takes. The limiting factor is the couchjs view server: first the result of the map function has to be serialized, and only then can the list function do its work.
As I wrote, for a small number of documents the processing time isn't dramatic, but as the number of documents grows, the time the list function needs grows as well.
Has someone an idea to improve my way to format the result?
Is it better to let the client do the work?
There exist several tricks to speed up _list functions.
Make your list and map functions live in two different design docs, to ensure they run in different SpiderMonkey instances.
Send the response in large chunks, tens or even hundreds of kilobytes. Find out the optimal chunk size: chunks that are too large are bad in terms of TTFB and memory consumption, while small chunks produce IO overhead between SM and Erlang.
Minimize the overhead of Storage->Erlang->JS serialization/deserialization. Make your map function emit strings that are serialized JSON and parse each row's JSON inside your list function from a plain string. The simpler the structure you pass to Erlang, the less time is spent on the Erlang side to process it and pass it to SM.
You can also use a cache approach, but you must clearly understand what you're doing. Read more details here.
A list function is executed at runtime, which means the processing time is proportional to the number of documents the view returns. You can use a list function to display the last 20 posts of a blog, but you can't use it to process 100,000 documents. That's something that must be done inside a map function. In your place I would modify the map function to perform the operations you are doing inside the list function, or, even better, perform them before saving the document.
I want to instantiate a large number of StringProperty fields to hold text values
(>100,000). All in all my code performs well so far. I'm still trying to optimize my code as much as possible to harness the full capabilities of my weak CPU (Intel Atom N2600, 1.6 GHz, 2 GB RAM).
I'm calling the following method 100,000 times, and it takes some seconds
until all values are stored in my array of StringProperty.
public void setData(int row, int numberOfCols, String data[][]) {
    this.dataValue = new StringProperty[numberOfCols];
    for (int i = 0; i < numberOfCols; i++) {
        dataValue[i] = new SimpleStringProperty(data[row][i]);
    }
}
Is the method above good enough for instantiating fields and putting values?
Any alternative ideas of how to tweak the method above?
So the problem I'm trying to tackle is the following:
I need a data source that emits messages at a certain frequency.
There are N neural nets that need to process each message individually.
The outputs from all neural nets are aggregated, and only when all N outputs for a message are collected should that message be declared fully processed.
At the end I should measure the time it took for a message to be fully processed (the time between when it was emitted and when all N neural net outputs for that message have been collected).
I'm curious as to how one would approach such a task using spark streaming.
My current implementation uses 3 types of components: a custom receiver and two classes that implement Function, one for the neural nets, one for the end aggregator.
In broad strokes, my application is built as follows:
JavaReceiverInputDStream<...> rndLists = jssc.receiverStream(new JavaRandomReceiver(...));
Function<JavaRDD<...>, Void> aggregator = new JavaSyncBarrier(numberOfNets);

for(int i = 0; i < numberOfNets; i++){
    rndLists.map(new NeuralNetMapper(neuralNetConfig)).foreachRDD(aggregator);
}
The main problem I'm having with this, though, is that it runs faster in local mode than when submitted to a 4-node cluster.
Is my implementation wrong to begin with or is something else happening here ?
There's also a full post here http://apache-spark-user-list.1001560.n3.nabble.com/Developing-a-spark-streaming-application-td12893.html with more details regarding the implementation of each of the three components mentioned previously.
It seems there might be a lot of repetitive instantiation and serialization of objects. The latter might be hurting your performance in a cluster.
You should try instantiating your neural networks only once. You will have to ensure that they are serializable. You should use flatMap instead of multiple maps + union. Something along these lines:
// Initialize the neural nets first
List<NeuralNetMapper> neuralNetMappers = new ArrayList<>(numberOfNets);
for(int i = 0; i < numberOfNets; i++){
    neuralNetMappers.add(new NeuralNetMapper(neuralNetConfig));
}

// Then create a DStream applying all of them
JavaDStream<Result> neuralNetResults = rndLists.flatMap(new FlatMapFunction<Item, Result>() {
    @Override
    public Iterable<Result> call(Item item) {
        List<Result> results = new ArrayList<>(numberOfNets);
        for (int i = 0; i < numberOfNets; i++) {
            results.add(neuralNetMappers.get(i).doYourNeuralNetStuff(item));
        }
        return results;
    }
});

// The aggregation stuff
neuralNetResults.foreachRDD(aggregator);
If you can afford to initialize the networks this way, you can save quite a lot of time. Also, the union stuff you included in your linked posts seems unnecessary and is penalizing your performance: a flatMap will do.
Finally, in order to further tune your performance in the cluster, you can use the Kryo serializer.
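As a rough sketch, enabling Kryo happens when building the streaming context; the app name, batch interval, and registered classes below are placeholders, not taken from your setup:
SparkConf conf = new SparkConf()
        .setAppName("neural-net-streaming")
        // Ship data between nodes with Kryo instead of default Java serialization.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Registering the classes that travel across the wire avoids writing full class names.
        .registerKryoClasses(new Class<?>[]{ NeuralNetMapper.class, Item.class, Result.class });
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));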