Adding forks after a stream has started? - node.js

I have a node.js application using Highland streams. It consumes a data source and needs to send the data to multiple destinations. The trouble is that the destinations are discovered dynamically, by reading the data and creating destination streams based on what it contains.
This means I cannot .fork the stream to add destinations (because a stream that has already started consuming cannot be forked, right?).
Is there anything else I can do?
My current approach is a .consume handler in which I create the destination streams and write the data into them. The result is ugly and messy.

Related

How to read a CSV stream from S3, but starting from somewhere in the middle of the file?

As the title states, my question pertains mostly to reading CSV data from AWS S3. I will provide details about the other technologies I am using, but they are not important to the core problem.
Context (not the core issue, just some extra detail)
I have a use case where I need to process some very large CSVs using a Node.js API on AWS Lambda and store some data from each CSV row to DynamoDB.
My implementation works well for small-to-medium-sized CSV files. However, for large CSV files (think 100k - 1m rows), the process takes way more than 15 minutes (the maximum execution time for an AWS Lambda function).
I really need this implementation to be serverless (because the rest of the project is serverless, because of a lack of predictable usage patterns, etc...).
So I decided to try and process the beginning of the file for 14.5 minutes or so, and then queue a new Lambda function to pick up where the last one left off.
I can easily pass the row number from the last function to the new function, so the new Lambda function knows where to start from.
So if the 1st function processed lines 1 - 15,000, then the 2nd function would pick up the processing job at row 15,001 and continue from there. That part is easy.
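That hand-off is essentially an asynchronous invoke of the next function with a small payload saying where to resume. A rough sketch with the AWS SDK for Java, purely as an illustration; the function name and payload fields are placeholders:

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;

public class QueueNextChunk {
    public static void main(String[] args) {
        AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();

        // "Event" means an asynchronous invocation, so the current run can wrap up
        // while the next one starts on the remaining rows. The function name and
        // payload fields are made-up placeholders.
        InvokeRequest request = new InvokeRequest()
                .withFunctionName("process-csv-chunk")
                .withInvocationType("Event")
                .withPayload("{\"startRow\": 15001, \"startByte\": 689475}");

        lambda.invoke(request);
    }
}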
But I can't figure out how to start a read stream from S3 beginning in the middle. No matter how I set up my read stream, it always starts data flow from the beginning of the file.
Breaking the processing task into smaller pieces (like queueing a new Lambda for each row) is not an option; I have already gone down that road and optimized the per-row work to be as minimal as possible.
Even if the 2nd job starts reading at the beginning of the file and I set it up to skip the already-processed rows, it will still take too long to get to the end of the file.
And even if I do some other implementation (like using EC2 instead of Lambda), I still run into the same problem. What if the EC2 process fails at row 203,001? I would need to queue up a new job to pick up from the next row. No matter what technology I use or what container/environment, I still need to be able to read from the middle of a file.
Core Problem
So... let's say I have a CSV file saved to S3. And I know that I want to start reading from row 15,001. Or alternatively, I want to start reading from the 689,475th byte. Or whatever.
Is there a way to do that? Using the AWS SDK for Node.js or any other type of request?
I know how to set up a read stream from S3 in Node.js, but I don't know how it works under the hood as far as how the requests are made. Maybe that knowledge would be helpful.
Ah it was so much easier than I was making it... Here is the answer in Node.js:
const aws = require('aws-sdk'); // assuming the standard aws-sdk v2 package

new aws.S3()
  .getObject({
    Key: 'bigA$$File.csv',
    Bucket: 'bucket-o-mine',
    // Only this byte range is downloaded; S3 answers with 206 Partial Content.
    Range: 'bytes=65000-100000',
  })
  .createReadStream()
Here is the doc: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html
You can do this in any of the AWS SDKs or via HTTP header.
Here's what AWS has to say about the range header:
Downloads the specified range bytes of an object. For more information about the HTTP Range header, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
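To make "any of the AWS SDKs" concrete, here is a hedged sketch of the same ranged read with the AWS SDK for Java v1, reusing the bucket, key, and byte range from the snippet above. The skip-the-first-line step is my own assumption, since a byte offset rarely lands exactly on a row boundary:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class RangedCsvRead {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Ask S3 for bytes 65,000-100,000 only; the stream then starts mid-file
        // instead of at byte 0.
        GetObjectRequest request = new GetObjectRequest("bucket-o-mine", "bigA$$File.csv")
                .withRange(65_000, 100_000);
        S3Object object = s3.getObject(request);

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8))) {
            // Discard the first (probably partial) line before treating lines as CSV rows.
            reader.readLine();
            String row;
            while ((row = reader.readLine()) != null) {
                System.out.println(row);
            }
        }
    }
}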

How is data chunked when using UploadFromStreamAsync and DownloadToStreamAsync to upload to a block blob?

I just started learning about Azure blob storage. I have come across various ways to upload and download data. One thing that puzzles me is when to use what.
I am mainly interested in PutBlockAsync in conjunction with PutBlockListAsync and UploadFromStreamAsync.
As far as I understand, when using PutBlockAsync it is up to the user to break the data into chunks and make sure each chunk stays within the Azure block blob size limits. An ID is associated with each chunk that is uploaded, and at the end all the IDs are committed.
When using UploadFromStreamAsync, how does this work? Who handles chunking the data and uploading it?
Why not convert the data into a Stream and use UploadFromStreamAsync all the time, and avoid the two-step commit?
You can use Fiddler and observe what happens when you use UploadFromStreamAsync.
If the file is large (more than 256 MB), say 500 MB, the Put Block and Put Block List APIs are called in the background (the same APIs that are called when you use the PutBlockAsync and PutBlockListAsync methods).
If the file is smaller than 256 MB, then UploadFromStreamAsync calls the Put Blob API in the background.
I used UploadFromStreamAsync to upload a file whose size is 600 MB, then opened Fiddler.
Here are some findings from Fiddler:
1. The large file is broken into small (4 MB) blocks, one by one, and each is uploaded with a Put Block call in the background.
2. At the end, the Put Block List API is called.
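For comparison, this is roughly what that two-phase flow looks like when you drive it yourself. The sketch uses the Azure Storage SDK for Java v12, where stageBlock and commitBlockList correspond to the Put Block and Put Block List calls seen in Fiddler; the connection string, container, blob, and file names are placeholders, and the 4 MB block size simply mirrors the finding above:

import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.specialized.BlockBlobClient;

import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class ManualBlockUpload {
    // 4 MB blocks, mirroring the size observed in Fiddler above.
    private static final int BLOCK_SIZE = 4 * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        BlobServiceClient service = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING")) // placeholder
                .buildClient();
        BlockBlobClient blob = service.getBlobContainerClient("my-container")       // placeholder
                .getBlobClient("large-file.bin")                                    // placeholder
                .getBlockBlobClient();

        List<String> blockIds = new ArrayList<>();
        byte[] buffer = new byte[BLOCK_SIZE];
        int index = 0;
        try (InputStream in = new FileInputStream("large-file.bin")) {              // placeholder
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Block IDs must be base64 strings of equal length, unique within the blob.
                String blockId = Base64.getEncoder().encodeToString(
                        String.format("%06d", index++).getBytes(StandardCharsets.UTF_8));
                // Put Block: upload one chunk and associate it with its ID.
                blob.stageBlock(blockId, new ByteArrayInputStream(buffer, 0, read), read);
                blockIds.add(blockId);
            }
        }
        // Put Block List: commit the staged blocks, in order, as the blob's content.
        blob.commitBlockList(blockIds);
    }
}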

Spring batch remote chunking - returning data from slave node

I am using Spring Batch remote chunking for distributed processing.
When a slave node is done processing a chunk, I would like to return some additional data along with the ChunkResponse.
For example, if a chunk consists of 10 user IDs, I would like to return in the response how many of those user IDs were processed successfully.
The response could include some other data as well. I have spent considerable time trying to figure out ways to achieve this, but without any success.
For example, I have tried extending the ChunkResponse class to add some additional fields to it, and then extending ChunkProcessorChunkHandler to return the customized ChunkResponse from it. But I am not sure whether this is the proper approach.
I also need a way, on the master node, to read the ChunkResponse in some callback. I guess I can use the afterChunk(ChunkContext) method of ChunkListener, but I couldn't find a way to get the ChunkResponse from the ChunkContext in that method.
So to sum it up, I would like to know how I can pass data from slave to master per chunk, and how the master node can read this data.
Thanks a lot.
EDIT
In my case the master node reads user records and the slave nodes process those records. At the end of the job, the master needs to take conditional action based on whether processing of a particular user failed or succeeded. The failure/success on the slave node is not based on any exception thrown there, but on some business rules. And there is other data the master needs to know about, for example how many emails were sent for each user. If I were using remote partitioning I could use the jobContext to put and get this data, but in remote chunking the jobContext is not available. So I was wondering whether, along with the ChunkResponse, I could send back some additional data from slave to master.

Azure Stream Analytics Get Previous Output Row for Join to Input

I have the following scenario:
Mobile app produces events that are sent to Event Hub, which is the input stream source to a Stream Analytics query. From there they are passed through a sequential flow of queries that splits the stream into 2 streams based on criteria, evaluates other conditions and decides whether or not to let the event keep flowing through the pipeline (if it doesn't, it is simply discarded). You could classify what we are doing as noise reduction/event filtering. Basically, if A just happened, don't let A happen again unless B & C happened or X time passes. At the end of the query gauntlet the streams are merged again and the "selected" events are propagated as "chosen" outputs.
My problem is that I need the ability to compare the current event to the previous "chosen" event (not just the previous input event), so in essence I need to join my input stream to my output stream. I have tried various ways to do this and so far none have worked; I know that other CEP engines support this concept. My queries are mostly all defined as temporary result sets inside a WITH statement (that's where my initial input stream is pulled into the first query, and each following query depends on the one above it), but I see no way to either join my input to my output or to join my input to another temporary result set that is further down the chain. It appears that JOIN only supports inputs?
For the moment I am attempting to work around this limitation with something I really don't want to do in production: I have an output defined going to an Azure Queue, and an Azure Function triggered by events on that queue wakes up and posts each event to a different Event Hub, which is mapped as a recirculation feed input back into my queries and which I can join to. I'm still wiring all of that up, so I'm not 100% sure it will work, but surely there has to be a better option for this relatively common pattern?
The WITH statement is indeed the right way to get a previous input joined with some other data.
You may need to combine it with the LAG operator, which returns the previous event in a data stream.
Let us know if it works for you.
Thanks,
JS - Azure Stream Analytics
AFAIK, a Stream Analytics job supports two distinct data input types: data stream inputs and reference data inputs. Per my understanding, you could leverage reference data to perform a lookup or to correlate with your data stream. For more details, you could refer to the following tutorials:
Data input types: Data stream and reference data
Configuring reference data
Tips on refreshing your reference data
Reference Data JOIN (Azure Stream Analytics)

What is the simplest way to write to Kafka from a Spark stream?

I would like to write data from a Spark stream to Kafka.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and its sample code.
Is the above sample code the simplest way to write to Kafka?
If I adopt an approach like the above sample, I must create many classes...
Do you know of a simpler way, or a library, that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are written up in different variations in the link you provided.
If we look at your task straightforwardly, we can make several assumptions:
Your output data is divided to several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory/sink, but the essential operation remains the same: you will still request a producer object for each partition and use it to send that partition's records.
I'd suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you'll find would most probably do the exact same thing behind the scenes.
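A minimal sketch of that producer-per-partition pattern, using the Spark Streaming Java API and the plain Kafka producer; the socket source, broker address, topic name, and batch interval are placeholders:

import java.util.Iterator;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamToKafka {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("stream-to-kafka");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Placeholder source; in a real job this could just as well come from KafkaUtils.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(rdd -> rdd.foreachPartition((Iterator<String> records) -> {
            // One producer per partition, created on the executor that already holds
            // the data, so no records are moved between machines before being sent.
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (records.hasNext()) {
                    producer.send(new ProducerRecord<>("output-topic", records.next())); // placeholder topic
                }
            }
        }));

        ssc.start();
        ssc.awaitTermination();
    }
}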
