We're using the S3InboundFileSynchronizingMessageSource feature of Spring Integration to locally sync and then send messages for any files retrieved from an S3 bucket.
Before syncing, we apply a couple of S3PersistentAcceptOnceFileListFilter filters (to check the file's TimeModified and Hash/ETag) to make sure we only sync "new" files.
Note: We use the JdbcMetadataStore table to persist the record of the files that have previously made it through the filters (using a different REGION for each filter).
Finally, for the S3InboundFileSynchronizingMessageSource local filter, we have a FileSystemPersistentAcceptOnceFileListFilter -- again on TimeModified and again persisted, but in a different region.
The issue is: if the service is restarted after the file has made it through the 1st filter but before the message source successfully sent the message along, we essentially drop the file and never actually process it.
What are we doing wrong? How can we avoid this "dropped file" issue?
I assume you use a FileSystemPersistentAcceptOnceFileListFilter for the localFilter since S3PersistentAcceptOnceFileListFilter is not going to work there.
Let's see how you use those filters in the configuration! I wonder if switching to a ChainFileListFilter for your remote files would help you somehow.
See docs: https://docs.spring.io/spring-integration/docs/current/reference/html/file.html#file-reading
EDIT
if the service is restarted after the file has made it through the 1st filter but before the message source successfully sent the message along
I think Gary is right: you need a transaction around that polling operation which includes filter logic as well.
See docs: https://docs.spring.io/spring-integration/docs/current/reference/html/jdbc.html#jdbc-metadata-store
This way the TX is not committed until the message for a file leaves the polling channel adapter. Therefore, after a restart you will simply be able to synchronize the rolled-back files again, because they are no longer present in the store used for filtering.
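To illustrate, here is a minimal sketch (Spring Integration 5.x Java DSL; it assumes a JDBC DataSource and transaction manager are available, and the bean/channel names are just placeholders) of wrapping the poll in a transaction so the metadata rows written by the filters are only committed once the message has been handed off:

```java
import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aws.inbound.S3InboundFileSynchronizingMessageSource;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.jdbc.metadata.JdbcMetadataStore;
import org.springframework.integration.metadata.ConcurrentMetadataStore;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class S3SyncConfig {

    @Bean
    public ConcurrentMetadataStore metadataStore(DataSource dataSource) {
        // backs the remote S3PersistentAcceptOnceFileListFilter(s) and the local
        // FileSystemPersistentAcceptOnceFileListFilter (each with its own region);
        // it must use the same DataSource the transaction manager manages
        return new JdbcMetadataStore(dataSource);
    }

    @Bean
    public IntegrationFlow s3SyncFlow(S3InboundFileSynchronizingMessageSource s3MessageSource,
                                      PlatformTransactionManager transactionManager) {
        return IntegrationFlows
                .from(s3MessageSource,
                        e -> e.poller(Pollers.fixedDelay(5000)
                                // filter bookkeeping and the message hand-off share one TX;
                                // a crash before hand-off rolls the metadata rows back
                                .transactional(transactionManager)))
                .channel("s3Files")
                .get();
    }
}
```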
Related
I'm struggling with a cached SFTP session factory. Namely, I suffered from sessions being unavailable because I used too many in my application. Currently I have one default non-cached session, which writes files to the SFTP server but sets up locks on them, so they can't be read by any other user. I'd like to avoid that. Ideally, the lock would be released after a single file is uploaded. Is that possible?
Test structure
Start polling adapter
Upload file to remote
Check whether files are uploaded
Stop polling adapter
Clean up remote
When you deal with transferring data over the network, you need to be sure that you release the resources you use to do that. For example, be sure to close the InputStream after sending data to the SFTP server. It is really not the framework's responsibility to close it automatically. Moreover, you may give us not an InputStream at all, but just a plain byte[] read from it. That's the only reason I can think of for the locking-like behavior.
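For example, a minimal sketch (the gateway interface and file path here are hypothetical) that reads the content, closes the stream itself, and hands plain bytes to whatever fronts the SFTP outbound adapter:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class UploadExample {

    // assumed messaging gateway in front of the SFTP outbound adapter
    public interface SftpUploadGateway {
        void upload(byte[] content, String remoteFileName);
    }

    public static void upload(SftpUploadGateway sftpUploadGateway) throws Exception {
        Path localFile = Paths.get("/tmp/report.csv");    // hypothetical local file
        byte[] content;
        try (InputStream in = Files.newInputStream(localFile)) {
            content = in.readAllBytes();                  // stream is closed when this block exits
        }
        // send plain bytes instead of the open stream, so no handle stays around
        sftpUploadGateway.upload(content, localFile.getFileName().toString());
    }
}
```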
I have been working with the FTP connector in my Azure Logic App for unzipping files on the FTP server from a Source to a Destination folder.
I have configured the FTP connector to trigger whenever a file is added to the Source folder.
The problem I face is the delay before the connector triggers.
Once I add the zip file to the Source folder, it takes around 1 minute for the Azure FTP connector to identify and pick up the file over FTP.
To identify whether the issue is with the Azure FTP connector or the FTP server, I tried using Blob storage instead of the FTP server, and the connector was triggered in a second!
What I understand from this is that the delay happens on the FTP side, or in the way the FTP connector communicates with the FTP server.
Can anyone point out the areas of optimization here? What changes can I make to minimize this delay?
I also noticed this behaviour of the FTP trigger and found the reason for the delay in the FTP trigger docs here:
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-ftp#how-ftp-triggers-work
...when a trigger finds a new file, the trigger checks that the new file is complete, and not partially written. For example, a file might have changes in progress when the trigger checks the file server. To avoid returning a partially written file, the trigger notes the timestamp for the file that has recent changes, but doesn't immediately return that file. The trigger returns the file only when polling the server again. Sometimes, this behavior might cause a delay that is up to twice the trigger's polling interval.
Firstly, you need to know that the Logic App file trigger is different from a Function: mostly, it won't trigger immediately. When you set up the trigger you will find it needs an interval, so even if a file is already there, it won't trigger until the interval elapses.
Then it's about how the FTP trigger works: when it triggers the Logic App, if you check the trigger history you will find multiple Succeeded entries but only one Fired entry, with about a 2-minute delay. For the reason, you can check the connector reference, How FTP triggers work, which describes this behavior:
The trigger returns the file only when polling the server again. Sometimes, this behavior might cause a delay that is up to twice the trigger's polling interval.
We have a requirement to scan the files uploaded by users, check whether a file has a virus, and then tag it as infected. I checked a few blogs and other Stack Overflow answers and learned that we can use clamscan for this.
However, I'm confused about what the path for the virus scan should be in the clamscan config. Also, is there a tutorial I can refer to? Our application backend is in Node.js.
I'm open to other libraries/services as well
Hard to say without further info (i.e. the architecture your code runs on, etc.).
I would say the easiest possible way to achieve what you want is to hook up a trigger on every PUT event on your S3 bucket. I have never used any virus scan tool, but I believe that all of them run as a daemon within a server, so you could subscribe an SQS queue to your S3 bucket events and have a server (which could be an EC2 instance or an ECS task) with a virus scan tool installed poll the SQS queue for new messages.
Once the message is processed and a vulnerability is detected, you could simply invoke the putObjectTagging API on the malicious object.
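For the tagging step, a minimal sketch with the AWS SDK for Java v2 (bucket, key, and tag names are placeholders; the JavaScript SDK exposes the same putObjectTagging operation):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectTaggingRequest;
import software.amazon.awssdk.services.s3.model.Tag;
import software.amazon.awssdk.services.s3.model.Tagging;

public class InfectedTagger {

    // tag an object as infected once the scanner reports a hit
    public static void tagInfected(String bucket, String key) {
        try (S3Client s3 = S3Client.create()) {
            Tagging tagging = Tagging.builder()
                    .tagSet(Tag.builder().key("scan-status").value("INFECTED").build())
                    .build();
            s3.putObjectTagging(PutObjectTaggingRequest.builder()
                    .bucket(bucket)
                    .key(key)
                    .tagging(tagging)
                    .build());
        }
    }
}
```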
We have been doing something similar, but in our case it happens before the file is stored in S3. Which is OK, I think; the solution would still work for you.
We have one EC2 instance where we have installed ClamAV. Then we wrote a web service that accepts a multipart file, takes the file content, and internally invokes the ClamAV command to scan that file. In response, the service returns whether the file is infected or not.
Your solution could be:
Create a web service as mentioned above and host it on EC2 (let's call it the virus scan service); see the sketch below.
In the Lambda function, call the virus scan service, passing the content.
Based on the virus scan service's response, tag your S3 file appropriately.
If you're open to a paid service too, then in the above steps #1 won't be applicable; just replace it with a call to the virus-scan service of Symantec or other such providers.
I hope it helps.
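As a rough illustration of step #1 (a sketch only: it assumes clamscan is installed on the EC2 instance, and the endpoint and names are made up), the scan service could look like this:

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
public class ScanController {

    @PostMapping("/scan")
    public String scan(@RequestParam("file") MultipartFile file) throws Exception {
        Path tmp = Files.createTempFile("scan-", ".bin");
        try {
            file.transferTo(tmp);                         // persist the multipart content
            Process p = new ProcessBuilder("clamscan", "--no-summary", tmp.toString())
                    .redirectErrorStream(true)
                    .start();
            int exit = p.waitFor();                       // clamscan: 0 = clean, 1 = infected, 2 = error
            return exit == 1 ? "INFECTED" : exit == 0 ? "CLEAN" : "ERROR";
        } finally {
            Files.deleteIfExists(tmp);                    // don't leave scanned payloads on disk
        }
    }
}
```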
You can check this solution by AWS, it will give you an idea of a similar architecture: https://aws.amazon.com/blogs/developer/virus-scan-s3-buckets-with-a-serverless-clamav-based-cdk-construct/
We have more than 50k files coming in every day that need to be processed. For that, we have developed POC apps with a design like this:
A polling app continuously picks up files from the FTP zone.
It validates each file and creates metadata in a db table.
Another poller picks 10-20 files from the db (only file id and status) and delivers them to slave apps as messages.
A slave app takes a message and launches a Spring Batch job, which reads the data, does business validation in processors, and writes the validated data to the db/another file.
We used Spring Integration and Spring Batch for this POC.
Is it a good idea to launch a Spring Batch job in the slaves, or should we directly implement the read, process and write logic as plain Java or Spring bean objects?
I need some insight on launching this job, where a slave can have 10-25 MDPs (Spring message-driven POJOs) and each of these MDPs launches a job.
Note: Each file will have at most 30-40 thousand records.
Generally, using Spring Integration and Spring Batch for such tasks is a good idea. This is what they are intended for.
With regard to Spring Batch, you get the whole retry, skip and restart handling out of the box. Moreover, you have all these readers and writers that are optimised for bulk operations. This works very well and you only have to concentrate on writing the appropriate mappers and such stuff.
If you want to use plain Java or Spring bean objects, you will probably end up developing such infrastructure code by yourself... incl. all the needed effort for testing and so on.
Concerning your design:
Besides validating and creating the metadata entry, you could consider loading the entries directly into a database table. This would give you better "transactional" control if something fails. Your load job could look something like this:
step1:
  tasklet to create an entry in the metadata table, with columns like
    FILE_TO_PROCESS: XY.txt
    STATE: START_LOADING
    DATE: ...
    ATTEMPT: ... first attempt
step2:
  read and validate each line of the file and store it in a data table
    DATA: ........
    STATE:
    FK_META_TABLE: foreign key to the meta table
step3:
  update the meta table entry
    STATE: LOAD_COMPLETED
So, as soon as your meta table entry gets the state LOAD_COMPLETED, you know that all entries of the file have been validated and are ready for further processing.
If something fails, you can just fix the file and reload it.
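A minimal sketch of such a three-step load job (Spring Batch 5 builder style; the readers, writers, tasklets and the DataRow type are placeholders you would supply):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class FileLoadJobConfig {

    // hypothetical row type for the data table
    public record DataRow(String data, long metaId) { }

    @Bean
    public Job loadFileJob(JobRepository jobRepository,
                           PlatformTransactionManager txManager,
                           ItemReader<String> lineReader,
                           ItemProcessor<String, DataRow> lineValidator,
                           ItemWriter<DataRow> dataTableWriter,
                           Tasklet createMetadataEntry,     // step1: STATE = START_LOADING
                           Tasklet markLoadCompleted) {     // step3: STATE = LOAD_COMPLETED
        Step step1 = new StepBuilder("createMetadataEntry", jobRepository)
                .tasklet(createMetadataEntry, txManager)
                .build();
        Step step2 = new StepBuilder("loadAndValidate", jobRepository)
                .<String, DataRow>chunk(1000, txManager)    // read/validate lines, write data table
                .reader(lineReader)
                .processor(lineValidator)
                .writer(dataTableWriter)
                .build();
        Step step3 = new StepBuilder("markLoadCompleted", jobRepository)
                .tasklet(markLoadCompleted, txManager)
                .build();
        return new JobBuilder("loadFileJob", jobRepository)
                .start(step1)
                .next(step2)
                .next(step3)
                .build();
    }
}
```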
Then, to process further, you could just have jobs which poll periodically and check whether there is new data in the database that should be processed. If more than one file has been loaded during the last period, simply process all files that are ready.
You could even have several slave processes polling from time to time. Just do a select-for-update on the state column of the metadata table, or use an optimistic locking approach, to prevent several slaves from trying to process the same entries.
With this solution, you don't need a message infrastructure and you can still scale the whole application without any problems.
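For the select-for-update idea, a minimal sketch (hypothetical table/column names; it assumes a database that supports SELECT ... FOR UPDATE SKIP LOCKED, PostgreSQL syntax shown) where each slave claims ready files without stepping on the others:

```java
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

public class MetaTableClaimer {

    private final JdbcTemplate jdbc;

    public MetaTableClaimer(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Transactional
    public List<Long> claimNextBatch(int batchSize) {
        // lock a handful of LOAD_COMPLETED rows; concurrent slaves skip the locked ones
        List<Long> ids = jdbc.queryForList(
                "SELECT id FROM meta_table WHERE state = 'LOAD_COMPLETED' " +
                "LIMIT ? FOR UPDATE SKIP LOCKED", Long.class, batchSize);
        ids.forEach(id ->
                jdbc.update("UPDATE meta_table SET state = 'PROCESSING' WHERE id = ?", id));
        return ids;
    }
}
```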
I have a scenario where we have a Spring Integration file poller waiting for files to be added to a directory; once a file is written, we process it. We have some large files and a slow network, so in some cases we are worried that the file transfer may not have finished when the poller wakes up and attempts to process the file.
I've found this topic on 'file-inbound-channel-unable-to-read-file' which suggest using a custom filter to check if the file is readable before processing.
This second topic 'how-to-know-whether-a-file-copying-is-in-progress-complete' suggest that the file must be writable before it can be considered to be ready for processing.
I might have expected that this check that the file is readable/writable would already be done by the core Spring Integration code?
In the meantime, I'm planning on creating a filter as per the first topic, but using the 'rw' check that the second suggests.
This is a classic problem, and just checking whether the file is writable is not reliable - what happens if the network crashes during the file transfer? You will still have an incomplete file.
The most reliable way to handle this is to have the sender send the file with a temporary suffix and rename it after the transfer is complete. Another common technique is to send a second file foo.done indicating that foo.xxx is complete, and use a custom filter for that.
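A minimal sketch of the second technique as a custom filter (the class name is made up; it only emits foo.xxx once a matching foo.done marker has arrived):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.springframework.integration.file.filters.FileListFilter;

public class DoneMarkerFileListFilter implements FileListFilter<File> {

    @Override
    public List<File> filterFiles(File[] files) {
        List<File> accepted = new ArrayList<>();
        for (File file : files) {
            if (file.getName().endsWith(".done")) {
                continue;                                  // never emit the marker files themselves
            }
            String base = file.getName().replaceFirst("\\.[^.]+$", "");   // foo.xxx -> foo
            File marker = new File(file.getParentFile(), base + ".done");
            if (marker.exists()) {
                accepted.add(file);                        // transfer finished, safe to process
            }
        }
        return accepted;
    }
}
```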