Aggregator that releases partial group based on correlation but holds on to rest of the messages - spring-integration

I want to set the correlation strategy on an aggregator so that it uses a date taken from the name of the incoming file (the message) to correlate files, so that all files with today's date belong to the same group. Since I might have multiple days' worth of data, it's possible that I have aggregated 2 days of files. I want to base the release strategy on a done file (message) that includes the date in its filename as well, so essentially each day will have a bunch of files and a done file. Ingesting the done file should release the files for that day from the aggregator but still keep the other day's files until the done file for that day is ingested.
So in this scenario, correlation is obviously simple, but what I am not sure about is how to release not all but only some specific messages based on the correlation key. The documentation talks about a message reaper, but that goes into message store stuff and I want to do all this in memory.
Let me elaborate with an example.
I have these files in a directory which I'm polling with a file inbound channel adapter:
file-1-2014.04.27.dat
file-2-2014.04.27.dat
file-3-2014.04.27.dat
done-2014.04.27.dat
file-1-2014.04.28.dat
file-2-2014.04.28.dat
done-2014.04.28.dat
As these files are polled in, I have an aggregator in the flow where all incoming files are aggregated. To correlate, I was thinking I could extract the date and put it in the correlation_id header so that the first 3 files are considered to belong to one group and the second 2 files belong to the second group. Once I consume the done-2014.04.27.dat file, I want to release the first 3 files to be further processed in the flow but hold on to
file-1-2014.04.28.dat
file-2-2014.04.28.dat
until I receive the
done-2014.04.28.dat
and then release these 2 files.
Any help would be appreciated.
Thanks

I am not sure what you mean when you say "correlation is simple" but then go on to say you only want to release part of the group. If they have different dates then they will be in different groups, so there's no need to release part of a group, just release the whole group by running the reaper just after midnight (or any time the next day). It's not at all clear why you need a "done" message.
By default, the aggregator uses an in-memory message store (SimpleMessageStore).
EDIT:
Just put the done file in the same group and have your release strategy detect the presence of the done file. You could use an expression, but if the group can be large, it would be more efficient to implement ReleaseStrategy and iterate over MessageGroup.getMessages() looking for the done file.
The next step depends on what's downstream of the aggregator. If you use a splitter to split them back to separate files, you can simply add a filter to drop the done file. If you deal with the collection of files directly, either ignore the done file, or add a transformer to remove it from the collection.
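The idea above (done file in the same group, release when it shows up, drop it afterwards) can be sketched in plain Python. This is only an illustration of the logic, not Spring Integration code: `correlation_key`, `can_release` and `DateAggregator` are hypothetical names, and the filename pattern is assumed from the example files in the question.

```python
import re
from collections import defaultdict

# Assumed pattern: names end in <yyyy.MM.dd>.dat, e.g. file-1-2014.04.27.dat
DATE_RE = re.compile(r"(\d{4}\.\d{2}\.\d{2})\.dat$")


def correlation_key(filename):
    """Return the date part of the filename; it identifies the group."""
    match = DATE_RE.search(filename)
    if match is None:
        raise ValueError(f"no date in filename: {filename}")
    return match.group(1)


def is_done_file(filename):
    """Marker files are assumed to be named done-<date>.dat."""
    return filename.startswith("done-")


def can_release(group):
    """A group is complete once it contains its done file."""
    return any(is_done_file(name) for name in group)


class DateAggregator:
    """In-memory stand-in for the aggregator: one group per date."""

    def __init__(self):
        self.groups = defaultdict(list)

    def add(self, filename):
        """Store the file; return the released data files, or None."""
        key = correlation_key(filename)
        self.groups[key].append(filename)
        if can_release(self.groups[key]):
            released = self.groups.pop(key)
            # Equivalent of the downstream filter: drop the done file.
            return [name for name in released if not is_done_file(name)]
        return None
```

Feeding the 2014.04.27 and 2014.04.28 files in any interleaved order, only the group whose done file has arrived is returned; the other date stays in `groups` until its own done file comes in.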
With respect to the reaper: assuming files arrive in real time, I was simply suggesting that if you run the reaper once a day (say at 01:00) with a group timeout of, say, 30 minutes, then the reaper will release yesterday's files (without the need for a done file).
EDIT:
See my comment on your "answer" below - you have 2 subscribers on filesLogger.

Related

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10MB) and GZ, and I know it's inefficient for Spark. I am running a simple batch job to convert these files to parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning for the driver indexing files (looks like around 20 minutes before the batch starts). In the UI for 1 directory, there is 1 task after this 20 minutes which looks like the conversion itself.
However, with individual filenames, this time for indexing increases to 2+ hours, and my job to do the conversion in the UI doesn't show up until this time. For the list of files, there are 2 tasks: (1) First one is listing leafs for 8mil files, and then (2) job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?

Spark assumes every path passed in is a directory,
so when given a list of paths, it has to do a list call on each,
which for S3 means 8M LIST calls against the S3 servers,
which is rate limited to about 3k/second, ignoring details like thread count on the client, HTTP connections etc.,
and with LIST billed at $0.005 per 1000 calls, 8M requests comes to $40.
Oh, and as the LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call, doubling execution time and adding another $3.20 (at about $0.0004 per 1000 requests) to the query cost.
In contrast,
listing a dir with 8M entries kicks off a single LIST request for the first 1K entries
and 7999 followups.
s3a releases do async prefetch of the next page of results (faster, esp. if the incremental list iterators are used): one thread to fetch, one to process, and it will cost you 4c.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
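For anyone who wants to check the numbers, here is the arithmetic as a small Python sketch. The LIST price, the 3k/second rate and the 1K page size are the figures quoted above; the HEAD price of $0.0004 per 1000 requests is my assumption (the usual price for GET-class requests), so adjust it for your region.

```python
def request_cost(num_requests, price_per_thousand):
    """Cost in dollars for a number of S3 API requests."""
    return num_requests / 1000 * price_per_thousand


NUM_FILES = 8_000_000
PAGE_SIZE = 1000            # entries returned per LIST page
LIST_PRICE = 0.005          # $ per 1000 LIST calls (quoted above)
HEAD_PRICE = 0.0004         # $ per 1000 HEAD calls (assumption)
LIST_RATE_PER_SECOND = 3000 # approximate rate limit (quoted above)

# One LIST (plus one fallback HEAD) per path when a list of files is passed.
per_path_list_cost = request_cost(NUM_FILES, LIST_PRICE)
per_path_head_cost = request_cost(NUM_FILES, HEAD_PRICE)
per_path_min_seconds = NUM_FILES / LIST_RATE_PER_SECOND

# One paged listing when the directory is passed (ceiling division).
pages = -(-NUM_FILES // PAGE_SIZE)
directory_list_cost = request_cost(pages, LIST_PRICE)

print(f"per-path LIST:  ${per_path_list_cost:.2f}")
print(f"per-path HEAD:  ${per_path_head_cost:.2f}")
print(f"per-path time:  at least {per_path_min_seconds / 60:.0f} minutes of LIST calls")
print(f"directory LIST: {pages} pages, ${directory_list_cost:.2f}")
```

The time figure is only a lower bound from the rate limit; it ignores the HEAD calls and any client-side throttling.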

Aggregate continuous stream of number from a file using hazelcast jet

I am trying to sum continuous stream of numbers from a file using hazelcast jet
pipe
.drawFrom(Sources.fileWatcher(<dir>))
.map(s->Integer.parseInt(s))
.addTimestamps()
.window(WindowDefinition.sliding(10000,1000))
.aggregate(AggregateOperations.summingDouble(x->x))
.drainTo(Sinks.logger());
A few questions:
1. It doesn't give the expected output. My expectation is that as soon as a new number appears in the file, it should just be added to the existing sum.
2. Why do I need to specify a window and call addTimestamps to do this? I just need the sum of an infinite stream.
3. How can we achieve fault tolerance, i.e. if the server restarts, will it save the aggregated result and, when it comes back up, continue aggregating from the last computed sum?
4. If the server is down and a few numbers are added to the file, then when the server comes up, will it read from the point where it went down, or will it miss the numbers added while it was down and only read the numbers that arrive after it is up?

Answer to Q1 & Q2:
You're looking for rollingAggregate, you don't need timestamps or windows.
pipe
.drawFrom(Sources.fileWatcher(<dir>))
.rollingAggregate(AggregateOperations.summingDouble(Double::parseDouble))
.drainTo(Sinks.logger());
Answer to Q3 & Q4: the fileWatcher source isn't fault tolerant. The reason is that it reads local files and when a member dies, the local files won't be available anyway. When the job restarts, it will start reading from current position and will miss numbers added while the job was down.
Also, since you use global aggregation, data from all files will be routed to single cluster member and other members will be idle.
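To make the rolling behaviour concrete, here is a plain-Python sketch of what a rolling sum does (this is not the Jet API; `rolling_sum` is a hypothetical helper). Every incoming line produces an updated total, with no window or timestamp involved.

```python
from itertools import accumulate


def rolling_sum(lines):
    """Yield the running total after each incoming line."""
    return accumulate(float(line) for line in lines)


for total in rolling_sum(["1", "2.5", "4"]):
    print(total)  # prints 1.0, then 3.5, then 7.5
```

Because the running total is a single piece of state, it has to live in one place, which is why a global aggregation ends up on a single member.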

Bull queue: Ensure unique job within time period by using partial timestamp in jobId

I need to ensure the same job added to queue isn't duplicated within a certain period of time.
Is it worth including partial timestamps (i.e. D/M/Y-HH:M) in my unique jobId strings, so it processes only if not in the same Minute?
It would still duplicate if one job was added at 12:01 and the other at 12:09 – or does Bull have a much better way of doing this?

Bull is designed to support idempotence by ignoring jobs that are added with an existing job id. Be careful not to enable options such as removeOnComplete, since the job will then be removed after completion and will not be considered the next time you add a job.
In your case, where you want to make sure that no new jobs are added during a given timespan, just make sure that all the job ids during that timespan are the same, for example, as you wrote in your comment, by removing the 4 last digits of your UNIX timestamp.
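A minimal sketch of the id scheme, in Python just to show the arithmetic (it does not call Bull; `bucketed_job_id` and the 10-minute default are made up for the example). Integer division of the timestamp by the bucket length plays the same role as chopping digits off the end.

```python
def bucketed_job_id(name, timestamp_seconds, bucket_seconds=600):
    """Build a job id that is identical for all calls in the same time bucket."""
    bucket = int(timestamp_seconds) // bucket_seconds
    return f"{name}:{bucket}"


start = 1_700_000_400  # a timestamp that falls exactly on a 10-minute boundary
print(bucketed_job_id("send-report", start + 60))   # one minute into the bucket
print(bucketed_job_id("send-report", start + 540))  # nine minutes in: same id
print(bucketed_job_id("send-report", start + 600))  # next bucket: new id
```

Note that the buckets are fixed, not sliding: two jobs added a couple of seconds apart on either side of a boundary still get different ids, which is the limitation raised in the question.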

I feel you should use Bull's API to check whether the job is already running, and only add the job to the queue if it is not (a patch on the producer).
You can also check whether a similar job is already running when you are running the job (inside the process function) and return early instead of executing the job (a patch on the consumer).
You can use the Queue getJobs function to do so:
getJobs(types: string[], start?: number, end?: number, asc?: boolean):Promise<Job[]>
"Returns a promise that will return an array of job instances of the given types. Optional parameters for range and ordering are provided."
From documentation:
https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md#queuegetjobs
The Job item should provide enough data so you can find the one you are looking for.

Getting Multiple Last Price Quotes from Interactive Brokers's API

I have a question regarding the Python API of Interactive Brokers.
Can multiple asset and stock contracts be passed into reqMktData() function and obtain the last prices? (I can set the snapshots = TRUE in reqMktData to get the last price. You can assume that I have subscribed to the appropriate data services.)
To put things in perspective, this is what I am trying to do:
1) Call reqMktData, get last prices for multiple assets.
2) Feed the data into my prediction engine, and do something
3) Go to step 1.
When I contacted Interactive Brokers, they said:
"Only one contract can be passed to reqMktData() at one time, so there is no bulk request feature in requesting real time data."
Obviously one way to get around this is to do a loop but this is too slow. Another way to do this is through multithreading but this is a lot of work plus I can't afford the extra expense of a new computer. I am not interested in either one.
Any suggestions?

You can only specify 1 contract in each reqMktData call. There is no choice but to use a loop of some type. The speed shouldn't be an issue as you can make up to 50 requests per second, maybe even more for snapshots.
The speed issue could be that you want too much data (> 50/s) or that you're using an old version of the IB Python API. Check connection.py for lock.acquire; I've deleted all of them. Also, if there has been no trade for >10 seconds, IB will wait for a trade before sending a snapshot. Test with active symbols.
However, what you should do is request live streaming data by setting snapshot to false and just keep track of the last price in the stream. You can stream up to 100 tickers with the default minimums. You keep them separate by using unique ticker ids.
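Keeping track of the last price per ticker id could look roughly like the sketch below. It deliberately does not import the IB API: `LastPriceTracker` and its methods are hypothetical, and the idea is that your tick-price callback forwards its ticker id, tick type and price to `on_tick_price`. The value 4 for the "last price" tick type is my understanding of the TWS tick type table, so treat it as an assumption and check it against your API version.

```python
LAST_PRICE_TICK = 4  # assumed tick type id for "last price"


class LastPriceTracker:
    """Keeps the most recent last-trade price for each streamed ticker."""

    def __init__(self):
        self.symbol_by_ticker_id = {}
        self.last_price = {}

    def register(self, ticker_id, symbol):
        """Remember which symbol a unique ticker id was requested for."""
        if ticker_id in self.symbol_by_ticker_id:
            raise ValueError(f"ticker id {ticker_id} is already in use")
        self.symbol_by_ticker_id[ticker_id] = symbol

    def on_tick_price(self, ticker_id, tick_type, price):
        """Call this from the streaming tick-price callback."""
        if tick_type != LAST_PRICE_TICK:
            return  # ignore bid, ask and other tick types
        symbol = self.symbol_by_ticker_id.get(ticker_id)
        if symbol is None:
            return  # tick for a ticker id we never registered
        self.last_price[symbol] = price

    def snapshot(self):
        """Return a copy of the latest prices, e.g. to feed a model."""
        return dict(self.last_price)
```

With this in place, step 1 of the loop in the question becomes a call to `snapshot()` instead of a round of new requests.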

Any intelligence to run the Azure Data Factory other than Schedule Basis

I have a client request for my Data Factory solution.
They want to run my Data Factory whenever the input file is available in Blob Storage (or any other location). To be very clear, they don't want to run the solution on a schedule, because on some days the file won't show up. So I want some intelligence to check whether the file is available for processing in that location. If it is, I have to run my Data Factory solution to process that file; otherwise there is no need to run the Data Factory.
Thanks in Advance
Jay

I think you've currently got 3 options for dealing with this, none of which are exactly what you want...
Option 1 - use C# to create a custom activity that does some sort of checking on the directory before proceeding with other downstream pipelines.
Option 2 - Add a long delay to the activity so the processing retries for the next X days. Sadly only a maximum of 10 long retries is allowed currently.
Option 3 - Wait for a newer version of Azure Data Factory that might allow the possibility of more event-driven activities, rather than using a scheduled time slice approach.
Apologies, this isn't exactly the answer you want, but it gives you the current options.
