Hazelcast Jet - drain the list to a stream - hazelcast-jet

My Jet job transforms Redis stream data. The transformation is: for every item in the stream I look up a map; if found, the entry contains one or more items (a list). I would like to write those items to the draining stream as separate items, not as one list element.
My code works, however it writes the list as a single item to another Redis stream. What I need is to write each element of the list separately to the stream (so that the other job can work on the items independently).
Code
pipeline.drawFrom(RedisSources.stream("source", uri, "payloads", "$"))
.withIngestionTimestamps()
.groupingKey(k -> k.get("eventType"))
.mapUsingContext(lookupService(), (svc, event, item) -> svc.findHooks(event) /*returns list*/)
.drainTo(RedisSinks.stream("drain", uri, "hooks"));
So, the returned list from the service should be written as separate elements in the output stream.
Which API can I use to emit each item? I couldn't find much in the docs.

To map one item into multiple items, you need to use the flat map transform instead of the simple map transform.
Example below:
pipeline.drawFrom(RedisSources.stream("source", uri, "payloads", "$"))
.withIngestionTimestamps()
.groupingKey(k -> k.get("eventType"))
.flatMapUsingContext(lookupService(), (svc, event, item) -> Traversers.traverseIterable(svc.findHooks(event)) /*returns list*/)
.drainTo(RedisSinks.stream("drain", uri, "hooks"));

Related

Transforming large array of objects to csv using json2csv

I need to transform a large array of JSON objects (which can have over 100k entries) into a CSV.
This array is created directly in the application, it's not the result of an uploaded file.
Looking at the documentation, I've thought of using the parser, but it says that:
For that reason is rarely a good reason to use it until your data is very small or your application doesn't do anything else.
Because the data is not small and my app will do other things besides creating the CSV, I don't think it's the best approach, but I may be misunderstanding the documentation.
Is it possible to use the other options (async parser or transform) with already-created data (and not a stream of data)?
FYI: it's a NestJS application, but I'm using this Node.js lib.
Update: I've tried it with an array of over 300k entries, and it went smoothly.
Why do you need any external modules?
Converting JSON into a javascript array of javascript objects is a piece of cake with the native JSON.parse() function.
import fs from 'node:fs/promises';

let jsontxt = await fs.readFile('mythings.json', 'utf8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw new Error("Oooops, stranger things happen!");
And, then, converting a javascript array into a CSV is very straightforward.
The most obvious and absurd case is just mapping every element of the array into a string that is the JSON representation of that element; you end up with a useless CSV with a single column containing every element of your original array. Then you join the resulting array of strings into a single string, separated by newlines (\n). It's good for nothing but, heck, it's a CSV!
let csvtxt = mythings.map(JSON.stringify).join("\n");
await fs.writeFile("mythings.csv",csvtxt,"utf8");
Now, you can feel that you are almost there. Replace the useless mapping function with your own
let csvtxt = mythings.map(mapElementToColumns).join("\n");
and choose a good mapping between the fields of the objects of your array, and the columns of your csv.
function mapElementToColumns(element) {
  return `${JSON.stringify(element.id)},${JSON.stringify(element.name)},${JSON.stringify(element.value)}`;
}
or, in a more thorough way
function mapElementToColumns(fieldNames) {
  return function (element) {
    let fields = fieldNames.map(n => element[n] ? JSON.stringify(element[n]) : '""');
    return fields.join(',');
  };
}
that you may invoke in your map:
let csvtxt = mythings.map(mapElementToColumns(["id", "name", "value"])).join("\n");
Finally, you might decide to use an automated for "all fields in all objects" approach; which requires that all the objects in the original array maintain a similar fields schema.
You extract all the fields of the first object of the array, and use them as the header row of the csv and as the template for extracting the rest of the elements.
let fieldnames = Object.keys(mythings[0]);
and then use this field names array as parameter of your map function
let csvtxt= mythings.map(mapElementToColumns(fieldnames)).join("\n");
and, also, prepending them as the CSV header
csvtxt = fieldnames.join(',') + "\n" + csvtxt;
Putting all the pieces together...
import fs from 'node:fs/promises';

function mapElementToColumns(fieldNames) {
  return function (element) {
    let fields = fieldNames.map(n => element[n] ? JSON.stringify(element[n]) : '""');
    return fields.join(',');
  };
}

let jsontxt = await fs.readFile('mythings.json', 'utf8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw new Error("Oooops, stranger things happen!");
let fieldnames = Object.keys(mythings[0]);
let csvtxt = mythings.map(mapElementToColumns(fieldnames)).join("\n");
// csvtxt is a string, so prepend the header row by concatenation (not unshift)
csvtxt = fieldnames.join(',') + "\n" + csvtxt;
await fs.writeFile("mythings.csv", csvtxt, "utf8");
And that's it. Pretty neat, uh?
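That said, if you do want to stay with the json2csv package the question mentions and feed it the already-created (in-memory) array, a minimal sketch along these lines should work, assuming the synchronous Parser API that the package exports (the field names are illustrative):
// Minimal sketch using json2csv's synchronous Parser on an in-memory array.
// 'id', 'name' and 'value' are illustrative field names; adapt them to your objects.
const { Parser } = require('json2csv');
const fs = require('fs');

const mythings = [/* your already-created array of objects */];
const parser = new Parser({ fields: ['id', 'name', 'value'] });
const csv = parser.parse(mythings);
fs.writeFileSync('mythings.csv', csv, 'utf8');
One advantage over the hand-rolled mapping above is that the library handles CSV escaping of quotes, commas and newlines for you.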

Elasticsearch dsl - large unique list of single column in python

I have a large set of Windows event logs, and I am attempting to get a unique list of users from a single column for a single event ID. This runs, but takes an extremely long time. How would you use the Python elasticsearch-dsl and elasticsearch-py libraries to accomplish this?
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch([localhostmines], timeout=30)
s = Search(using=es, index="logindex-*").filter('term', EventID="4624")

users = set()
for hit in s.scan():
    users.add(hit.TargetUserName)

print(users)
The TargetUserName column contains names as strings; the EventID column contains Windows event IDs as strings.
You need to use a terms aggregation, which will do exactly what you expect.
s = Search(using=es, index="logindex-*").filter('term', EventID="4624")
s.aggs.bucket('per_user', 'terms', field='TargetUserName')
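# note: a terms aggregation returns only the top 10 buckets by default; if you expect
# many distinct users, pass a larger size here (e.g. size=10000) or use a composite aggregation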
response = s.execute()
for user in response.aggregations.per_user.buckets:
    print(user.key, user.doc_count)

How to render JSON using Stream Analytics Query

I have inputs in the form of JSON stored in Blob Storage.
I have output in the form of a SQL Azure table.
I wrote a query that successfully moves the value of a specific property in the JSON to the corresponding column of the SQL Azure table.
Now, for one column, I want to copy the entire JSON payload as a serialized string into one SQL column, but I cannot find a proper library function to do that.
SELECT
    CASE
        WHEN GetArrayLength(E.event) > 0
        THEN GetRecordPropertyValue(GetArrayElement(E.event, 0), 'name')
        ELSE ''
    END AS EventName
    ,E.internal.data.id AS DataId
    ,E.internal.data.documentVersion AS DocVersion
    ,E.context.custom AS CustomDimensionsPayload
INTO OutputTblEvents
FROM InputBlobEvents E
This CustomDimensionsPayload column should actually contain the JSON.
I made a user-defined function which did the job for me:
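// Azure Stream Analytics JavaScript UDF: queries refer to it by the alias it was
// registered under (udf.<alias>), which is why the queries below use different names.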
function main(InputJSON) {
var InputJSONString = JSON.stringify(InputJSON);
return InputJSONString;
}
Then, inside the Query, I used the function like this:
SELECT udf.ConvertToJSONString(COLLECT()) AS InputJSON
INTO outputX
FROM inputY
You just need to reference the input object itself instead of COLLECT() if you want the entire payload to be converted. I was trying to do this as well, so I figured I'd add what I did.
I used the same function suggested by PerSchjetne; the query then becomes:
SELECT udf.JSONToString(IoTInputStream)
INTO [SQLTelemetry]
FROM [IoTInputStream]
Your output will now be the full JSON string, including all the metadata extras that IoT Hub adds on.

Map three different functions to Observable in Node.js

I am new to Rxjs. I want to follow best practices if possible.
I am trying to perform three distinct functions on the same data that is returned in an observable. Following the 'streams of data' concept, I keep on thinking I need to split this Observable into three streams and carry on.
Here is my code, so I can stop talking abstractly:
// NotEmptyResponse splits the stream in two, based on whether I get an empty observable back.
let base_subscription = RxNode.fromStream(siteStream).partition(NotEmptyResponse);
// Success Stream to perform further actions upon.
let successStream = base_subscription[0];
// The Empty stream for error reporting
let failureStream = base_subscription[1];
//Code works up until this point. I don't know how to split into 3 different streams.
successStream.filter(isSite)
.map(grabData)// Async action that returns data
/*** Perform 3 separate actions upon data that .map(grabData) returned **/
.subscribe();
How can I split this data stream into three, and map each instance of the data to a different function?
In fact, the partition() operator internally just calls the filter() operator twice: first to create an Observable from values matching the predicate, and then from values not matching the predicate.
So you can do the exact same thing with the filter() operator:
let obs1 = base_subscription.filter(val => predicate1);
let obs2 = base_subscription.filter(val => predicate2);
let obs3 = base_subscription.filter(val => predicate3);
Now you have three Observables, each of them emitting only some specific values. Then you can carry on with your existing code:
obs2.filter(isSite)
.map(grabData)
.subscribe();
Just be aware that calling subscribe() triggers generating values from the source Observable. This doesn't always have to be the case, depending on what kind of Observable you use; see "Hot" and "Cold" Observables in the documentation. The connect() operator might be useful for you, depending on your use case.
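For example, here is a minimal sketch of sharing a single upstream subscription before fanning out to three handlers (assuming RxJS 5 instance operators; processA, processB and processC are illustrative placeholders for your three functions):
// Run the filter/map chain once and multicast the result to all three subscribers.
let shared = successStream
  .filter(isSite)
  .map(grabData)
  .share();

shared.subscribe(processA);
shared.subscribe(processB);
shared.subscribe(processC);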

Couchdb: filter and group in a single view

I have a CouchDB database with documents of the form: { Name, Timestamp, Value }
I have a view that shows a summary grouped by name with the sum of the values. This is a straightforward reduce function.
Now I want to filter the view to only take into account documents where the timestamp occurred in a given range.
AFAIK this means I have to include the timestamp in the emitted key of the map function, e.g. emit([doc.Timestamp, doc.Name], doc)
But as soon as I do that, the reduce function no longer sees the rows grouped together to calculate the sum. If I put the name first I can group at level 1 only, but how do I filter at level 2?
Is there a way to do this?
I don't think this is possible with only one HTTP fetch and/or without additional logic in your own code.
If you emit([time, name]) you would be able to query startkey=[timeA]&endkey=[timeB]&group_level=2 to get items between timeA and timeB grouped where their timestamp and name were identical. You could then post-process this to add up whenever the names matched, but the initial result set might be larger than you want to handle.
An alternative would be to emit([name,time]). Then you could first query with group_level=1 to get a list of names [if your application doesn't already know what they'll be]. Then for each one of those you would query startkey=[nameN]&endkey=[nameN,{}]&group_level=2 to get the summary for each name.
(Note that in my query examples I've left the JSON start/end keys unencoded, so as to make them more human readable, but you'll need to apply your language's equivalent of JavaScript's encodeURIComponent on them in actual use.)
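As a sketch of that second option, emitting [name, time] (the design document and view names below are illustrative, and the fields are assumed to be literally Name, Timestamp and Value):
// map: key each row by [name, time] so results can be grouped per name (level 1)
// or narrowed to one name, or one name plus a time range, via startkey/endkey
function (doc) {
  emit([doc.Name, doc.Timestamp], doc.Value);
}
// reduce: the built-in _sum (or an equivalent JavaScript sum of the values)
//
// Example queries (keys left unencoded for readability, as noted above):
//   one summed row per name:       /db/_design/stats/_view/by_name_time?group_level=1
//   per-timestamp rows for nameN:  /db/_design/stats/_view/by_name_time?group_level=2&startkey=["nameN"]&endkey=["nameN",{}]
//   single total for nameN between timeA and timeB:
//                                  /db/_design/stats/_view/by_name_time?startkey=["nameN",timeA]&endkey=["nameN",timeB]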
You cannot make a view on top of a view. You need to write another map-reduce view that does the filtering and then the grouping. Something like:
map:
function (doc) {
  // start and end would have to be hard-coded into this ad-hoc view (see the note below)
  if (doc.Timestamp > start && doc.Timestamp < end) {
    emit(doc.Name, doc.Value);
  }
}
reduce:
function (key, values, rereduce) {
  return sum(values);
}
I suppose you cannot store this view, and would have to issue it as an ad-hoc query from your application.
