Spark: count two fields together

I am trying to count some parameters with Spark, starting from the word count example.
That example counts single words, but I wonder how I can count two fields at the same time.
Here is what I want to do:
Input files
{
"redundancy":1,
"deviceID":"dv1"
}
{
"redundancy":1,
"deviceID":"dv2"
}
{
"redundancy":2,
"deviceID":"dv1"
}
{
"redundancy":1,
"deviceID":"dv1"
}
{
"redundancy":2,
"deviceID":"dv5"
}
Output files
{
"redundancy":1,
"count":3,
"nbDevice":2
}
{
"redundancy":2,
"count":2,
"nbDevice":2
}
I wonder if there is already an example of this use case. If you have any documentation or links, I would be very thankful.

You can use pairs as keys.
The solution can look like:
rdd.map(record => (record.firstField, record.secondField) -> 1)
.reduceByKey(_ + _)
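If you also need the number of distinct devices per redundancy (nbDevice in your expected output), here is a rough sketch of the same pair-key idea written against the Java API, assuming the JSON has already been parsed into a Record class with getRedundancy() and getDeviceID() (those names are placeholders, not from your input):
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
// records: JavaRDD<Record>, parsed from the JSON input (Record is an assumed class)
JavaPairRDD<Integer, Integer> counts = records
    .mapToPair(r -> new Tuple2<>(r.getRedundancy(), 1))
    .reduceByKey(Integer::sum);                 // "count" per redundancy
JavaPairRDD<Integer, Integer> nbDevices = records
    .mapToPair(r -> new Tuple2<>(r.getRedundancy(), r.getDeviceID()))
    .distinct()                                 // one entry per (redundancy, device) pair
    .mapToPair(t -> new Tuple2<>(t._1(), 1))
    .reduceByKey(Integer::sum);                 // "nbDevice" per redundancy
// (redundancy, (count, nbDevice)), ready to be formatted as the output documents
JavaPairRDD<Integer, Tuple2<Integer, Integer>> result = counts.join(nbDevices);
A final map over result can then turn each entry into one of the JSON output documents.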

Related

How to check for an index with a date in logstash output

I have an output like so:
output {
if [target_index] == "mystream-%{+YYYY.MM.dd}"{
kinesis {
stream_name => "mystream"
region => "us-east-1"
}
}
}
I'd like to filter by the index pattern mystream-{date}. For some reason this conditional is not working. I'm not sure what the problem is here. Any help would be greatly appreciated.
Values compared in conditionals are not sprintf'd. You would have to use mutate to add a value that gets sprintf'd, and then compare to that. For example
filter { mutate { add_field => { "[@metadata][streamName]" => "mystream-%{+YYYY.MM.dd}" } } }
output {
if [target_index] == [@metadata][streamName] { ...

Filter a JavaRDD into multiple JavaRDDs based on a condition

I have one JavaRDD, records.
I would like to create 3 JavaRDDs from records depending on a condition:
JavaRDD<MyClass> records1 = records.filter(record -> "A".equals(record.getName()));
JavaRDD<MyClass> records2 = records.filter(record -> "B".equals(record.getName()));
JavaRDD<MyClass> records3 = records.filter(record -> "C".equals(record.getName()));
The problem is that I can do it as shown above, but my data may have millions of records and I don't want to scan all of them 3 times.
So I want to do it in one pass over the records.
I need something like this:
records
.forEach(record -> {
if ("A".equals(record.getName()))
{
records1(record);
}
else if ("B".equals(record.getName()))
{
records2(record);
}
else if ("C".equals(record.getName()))
{
records3(record);
}
});
How can I achieve this in Spark using JavaRDD?
My idea is to use mapToPair and create a new Tuple2 in each of your if-condition blocks. The key of the Tuple2 then tells you which type each record belongs to. In other words, the Tuple2's key marks the type of object you want to store in one RDD, and its value is your main data.
Your code would be something like below:
JavaPairRDD<String, MyClass> pairs = records.mapToPair(record -> {
    String key = "";
    if ("A".equals(record.getName())) {
        key = "A";
    } else if ("B".equals(record.getName())) {
        key = "B";
    } else if ("C".equals(record.getName())) {
        key = "C";
    }
    return new Tuple2<>(key, record);
});
The resulting pair RDD can then be divided up by the different keys you assigned in mapToPair; see the sketch below.
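A rough sketch of that last step, assuming the pairs RDD from above is cached so the input is only read once and the three filters run over the cached data:
JavaPairRDD<String, MyClass> tagged = pairs.cache();   // 'pairs' is the mapToPair result above
JavaRDD<MyClass> records1 = tagged.filter(t -> "A".equals(t._1())).values();
JavaRDD<MyClass> records2 = tagged.filter(t -> "B".equals(t._1())).values();
JavaRDD<MyClass> records3 = tagged.filter(t -> "C".equals(t._1())).values();
If you only need per-group processing rather than three separate RDDs, groupByKey or aggregateByKey on the same pair RDD avoids the extra filter passes.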

Indexing logs into different types (schemas) in Elasticsearch based on matching patterns

For example, here is my log file:
[2016-10-18 12:05:53.228] log example
[2016-10-18 11:55:53.228] 19249060-91df-11e6-be68-753fa0e2c729 logg example
[2016-10-18 11:35:53.228] 19249060-91ff-11e6-be68-753fa0e2c729 loggg example /api/userbasic/userinfo?requestedUserId=19249060-91df-11e6-be68-753fa0e2c729
Here is the grok filter for my log; I have used multiple patterns:
filter {
grok {
match => [
"message","\[%{TIMESTAMP_ISO8601:timestamp1}\] %{WORDS_EX:msg}",
"message","\[%{TIMESTAMP_ISO8601:timestamp2}\] %{UUID:user_id1} %{WORDS_EX:msg2} %{URIPATHPARAM:path}",
"message","\[%{TIMESTAMP_ISO8601:timestamp3}\] %{UUID:user_id2} %{WORDS_EX:msg3}"
]
}
}
Now I want to index the logs into Elasticsearch with different types (schemas), like
logstash/type1,
logstash/type2,
logstash/type3,
Any help appreciated!
First, there is a problem with your filters: the grok patterns are evaluated one by one, and when one pattern matches, the others are not evaluated. So the patterns need to be sorted from the most specific (the one with %{URIPATHPARAM:path}) to the most general (the one with %{WORDS_EX:msg}), like so:
"message","\[%{TIMESTAMP_ISO8601:timestamp2}\] %{UUID:user_id1} %{WORDS_EX:msg2} %{URIPATHPARAM:path}",
"message","\[%{TIMESTAMP_ISO8601:timestamp3}\] %{UUID:user_id2} %{WORDS_EX:msg3}",
"message","\[%{TIMESTAMP_ISO8601:timestamp1}\] %{WORDS_EX:msg}"
Then you can use the presence or absence of the various fields in conditionals, like so:
if [path] {
elasticsearch {
...
}
} else if [user_id2] {
elasticsearch {
...
}
} else {
elasticsearch {
...
}
}

Compare two maps and find differences using Groovy or Java

I would like to find the differences between two maps and create a new CSV file with the differences (putting each difference between **) like below:
Map 1
[
[cuInfo:"T12",service:"3",startDate:"14-01-16 13:22",appId:"G12355"],
[cuInfo:"T13",service:"3",startDate:"12-02-16 13:00",appId:"G12356"],
[cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12300"],
[cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
Map 2
[
[name:"Apple", cuInfo:"T12",service:"3",startDate:"14-02-16 10:00",appId:"G12351"],
[name:"Apple",cuInfo:"T13",service:"3",startDate:"14-01-16 13:00",appId:"G12352"],
[name:"Apple",cuInfo:"T16",service:"3",startDate:"14-01-16 13:00",appId:"G12353"],
[name:"Google",cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12301"],
[name:"Microsoft",cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"],
[name:"Microsoft",cuInfo:"T18",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
How can I get the output CSV like below?
Map 1 data | Map 2 data
service 3;name Apple;
cuInfo;startDate;appId | cuInfo;startDate;appId
T12;*14-02-16 10:00*;*G12351* | T12;*14-01-16 13:22*;*G12355*
T13;*14-01-16 13:00*;*G12352* | T13;*12-02-16 13:00*;*G12356*
service 9;name Google;
T14;*10-01-16 11:20*;*G12301* | T12;*10-01-16 11:20*;*G12300*
Thanks
In the following I'm assuming that the list of maps is sorted appropriately so that the comparison is fair, and that both lists are of the same length:
First, create an Iterator to traverse both lists simultaneously:
@groovy.transform.TupleConstructor
class DualIterator implements Iterator<List> {
Iterator iter1
Iterator iter2
boolean hasNext() {
iter1.hasNext() && iter2.hasNext()
}
List next() {
[iter1.next(), iter2.next()]
}
void remove() {
throw new UnsupportedOperationException()
}
}
Next, process the lists to get rows for the CSV file:
def rows = new DualIterator(list1.iterator(), list2.iterator())
.findAll { it[0] != it[1] } // Grab the non-matching lines.
.collect { // Mark the non-matching values.
def (m1, m2) = it
m1.keySet().each { key ->
if(m1[key] != m2[key]) {
m1[key] = "*${m1[key]}*"
m2[key] = "*${m2[key]}*"
}
}
[m1, m2]
}.collect { // Merge the map values into a List of String arrays
[it[0].values(), it[1].values()].flatten() as String[]
}
Finally, write the header and rows out in CSV format. NOTE: I'm using a proper CSV; your example is actually invalid because the number of columns is inconsistent:
def writer = new CSVWriter(new FileWriter('blah.csv')) // CSVWriter from opencsv
writer.writeNext(['name1', 'cuInfo1', 'service1', 'startDate1', 'appId1', 'name2', 'cuInfo2', 'service2', 'startDate2', 'appId2'] as String[])
writer.writeAll(rows)
writer.close()
The output looks like this:
"name1","cuInfo1","service1","startDate1","appId1","name2","cuInfo2","service2","startDate2","appId2"
"Apple","T12","3","*14-02-16 10:00*","*G12351*","Apple","T12","3","*14-01-16 13:22*","*G12355*"
"Apple","T13","3","*14-01-16 13:00*","*G12352*","Apple","T13","3","*12-02-16 13:00*","*G12356*"
"Google","T14","9","10-01-16 11:20","*G12301*","Google","T14","9","10-01-16 11:20","*G12300*"

Map/Reduce differences between Couchbase & Cloudant

I've been playing around with Couchbase Server and now just tried replicating my local db to Cloudant, but am getting conflicting results for my map/reduce function pair to build a set of unique tags with their associated projects...
// map.js
function(doc) {
if (doc.tags) {
for(var t in doc.tags) {
emit(doc.tags[t], doc._id);
}
}
}
// reduce.js
function(key,values,rereduce) {
if (!rereduce) {
var res=[];
for(var v in values) {
res.push(values[v]);
}
return res;
} else {
return values.length;
}
}
In Couchbase Server this returns JSON like:
{"rows":[
{"key":"3d","value":["project1","project3","project8","project10"]},
{"key":"agents","value":["project2"]},
{"key":"fabrication","value":["project3","project5"]}
]}
That's exactly what I wanted and expected. However, the same query on the Cloudant replica returns this:
{"rows":[
{"key":"3d","value":4},
{"key":"agents","value":1},
{"key":"fabrication","value":2}
]}
So it somehow only returns the length of the value array... Highly confusing & am grateful for any insights by some M&R ninjas... ;)
It looks like this is exactly the behavior you would expect given your reduce function. The key part is this:
else {
return values.length;
}
In Cloudant, rereduce is always called (since the reduce needs to span multiple shards). In this case, the rereduce branch returns values.length, which is only the length of the array.
I prefer to reduce/re-reduce implicitly rather than depending on the rereduce parameter.
function(doc) { // map
if (doc.tags) {
for(var t in doc.tags) {
emit(doc.tags[t], {id:doc._id, tag:doc.tags[t]});
}
}
}
Then the reduce checks whether it is accumulating document ids for the same tag, or whether it is just counting different tags.
function(keys, vals, rereduce) {
var initial_tag = vals[0].tag;
return vals.reduce(function(state, val) {
if(initial_tag && val.tag === initial_tag) {
// Accumulate ids which produced this tag.
var ids = state.ids;
if(!ids)
ids = [ state.id ]; // Build initial list from the state's id.
return { tag: val.tag,
         ids: ids.concat([val.id])
       };
} else {
var state_count = state.ids ? state.ids.length : state;
var val_count = val.ids ? val.ids.length : val;
return state_count + val_count;
}
})
}
(I didn't test this code, but you get the idea. As long as the tag value is the same, it doesn't matter whether it's a reduce or a rereduce. Once different tags start reducing together, it detects that because the tag value changes, and at that point it stops accumulating ids and just counts.)
I have used this trick before, although IMO it's rarely worth it.
Also in your specific case, this is a dangerous reduce function. You are building a wide list to see all the docs that have a tag. CouchDB likes tall lists, not fat lists. If you want to see all the docs that have a tag, you could map them.
for(var a = 0; a < doc.tags.length; a++) {
emit(doc.tags[a], doc._id);
}
Now you can query /db/_design/app/_view/docs_by_tag?key="3d" and you should get
{"total_rows":287,"offset":30,"rows":[
{"id":"project1","key":"3d","value":"project1"}
{"id":"project3","key":"3d","value":"project3"}
{"id":"project8","key":"3d","value":"project8"}
{"id":"project10","key":"3d","value":"project10"}
]}
