Download a file using http with Spring Integration - spring-integration

I am making the plunge into Java after a long time in .NET.
What I am looking for is an example of how to periodically download a file, read the text from it, and then take some action based on the contents, using Spring's Integration library and the annotation-based approach.
I want to pull a GTFS-formatted zip file from a transit provider. This provider produces a simple text file with a timestamp to indicate the last publishing time.
Specifically the producers of the data publish a text file at:
https://someserver.com/gtfs/published.txt
This file has a simple timestamp to indicate when the last time their data file was published.
Then there is the data:
https://someserver.com/gtfs/schedule.zip
I have tried to find some examples on how to go about polling the "published" file. Basically I want to periodically download the file and check the timestamp to determine if the schedule should be downloaded.
Most of the examples I have seen use XML-based configuration with Spring, and I am barely holding on with the annotation-based approach. I have also seen examples of downloading a file using FTP/SFTP.
I need to use HTTP, and I also need to include Basic Authorization (in the header).
This is as far as I have gotten. I am not sure how to go about wiring this up.
From the Spring Integration docs, this is how I am supposed to declare an outbound gateway (I think that is what I need?).
The question is: now what? Do I need that HttpRequestExecutingMessageHandler to save the stream (file) to a local file so I can read the contents and take some other action?
@Configuration
@EnableIntegration
public class GtfsConfiguration {

    @Bean
    public MessageChannel fileUpdateChannel() {
        return new DirectChannel();
    }

    @Bean
    @ServiceActivator(inputChannel = "fileUpdateChannel", poller = @Poller(fixedDelay = "5000"))
    public HttpRequestExecutingMessageHandler fileUpdateGateway() {
        HttpRequestExecutingMessageHandler handler =
                new HttpRequestExecutingMessageHandler("https://someserver.com/gtfs/raw/published.txt");
        handler.setHttpMethod(HttpMethod.GET);
        handler.setExpectedResponseType(byte[].class);
        return handler;
    }
}

If you need to download such a file periodically, you need to use a "fake" Inbound Channel Adapter, for example:
@Bean
@InboundChannelAdapter(value = "fileUpdateChannel",
        poller = @Poller(fixedDelay = "1000", maxMessagesPerPoll = "1"))
public Supplier<String> downloadFileSchedule() {
    return () -> "";
}
The @ServiceActivator for the HttpRequestExecutingMessageHandler is then going to be called every second. You don't need a @Poller on the @ServiceActivator: it isn't going to do anything by itself, because your fileUpdateChannel is a DirectChannel, not a QueueChannel.
I don't think you need to save the downloaded file locally. I would even say that handler.setExpectedResponseType(String.class); is fully enough to get the file content as a reply message payload for downstream analysis.
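For instance, assuming the handler's reply is routed to a further channel (say via handler.setOutputChannelName("publishedTimestampChannel"); that channel name and the last-seen comparison below are illustrative assumptions, not part of the answer), a plain service activator can inspect the timestamp and decide whether the schedule needs to be fetched:

private String lastSeenTimestamp;

// Hypothetical downstream consumer of the published.txt content
@ServiceActivator(inputChannel = "publishedTimestampChannel")
public void checkPublished(String publishedTimestamp) {
    if (!publishedTimestamp.equals(this.lastSeenTimestamp)) {
        this.lastSeenTimestamp = publishedTimestamp;
        // the schedule has been republished: trigger the download of schedule.zip here,
        // e.g. with another HTTP outbound gateway pointed at the zip URL
    }
}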
The easiest way to configure Basic Authorization is with the Apache HttpComponents HttpClient:
CredentialsProvider provider = new BasicCredentialsProvider();
UsernamePasswordCredentials credentials =
        new UsernamePasswordCredentials("user1", "user1Pass");
provider.setCredentials(AuthScope.ANY, credentials);

HttpClient client = HttpClientBuilder.create()
        .setDefaultCredentialsProvider(provider)
        .build();
and use this client in an HttpComponentsClientHttpRequestFactory, which you should then inject into the aforementioned HttpRequestExecutingMessageHandler via its setRequestFactory(ClientHttpRequestFactory requestFactory).
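Putting those pieces together with the handler from the question might look roughly like the sketch below; the credentials and URL are the placeholders already used above, and the output channel name is an assumption for the downstream step:

@Bean
@ServiceActivator(inputChannel = "fileUpdateChannel")
public HttpRequestExecutingMessageHandler fileUpdateGateway() {
    // Basic Authorization via the Apache HttpComponents client described above
    CredentialsProvider provider = new BasicCredentialsProvider();
    provider.setCredentials(AuthScope.ANY,
            new UsernamePasswordCredentials("user1", "user1Pass"));
    HttpClient client = HttpClientBuilder.create()
            .setDefaultCredentialsProvider(provider)
            .build();

    HttpRequestExecutingMessageHandler handler =
            new HttpRequestExecutingMessageHandler("https://someserver.com/gtfs/published.txt");
    handler.setHttpMethod(HttpMethod.GET);
    handler.setExpectedResponseType(String.class);
    handler.setRequestFactory(new HttpComponentsClientHttpRequestFactory(client));
    handler.setOutputChannelName("publishedTimestampChannel"); // assumed downstream channel
    return handler;
}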

Related

Trying to use HttpClient.GetStreamAsync straight to the ADLS FileClient.UploadAsync

I have an Azure Function that will call an external API via HttpClient. The external API returns a JSON response. I want to save the response directly to an ADLS File.
My simplistic code is:
public async Task UploadFileBulk(Stream contentToUpload)
{
    await this._theClient.FileClient.UploadAsync(contentToUpload);
}
The this._theClient is a simple wrapper class around the various Azure Data Lake classes such as DataLakeServiceClient, DataLakeFileSystemClient, DataLakeDirectoryClient, DataLakeFileClient.
I'm happy that this wrapper works as I expect: I spin one up, set the service, filesystem, directory, and then a filename to create. I've used this wrapper class to create directories etc., so it works as I expect.
I am calling the above method as follows:
await dlw.UploadFileBulk(await this._httpClient.GetStreamAsync("<endpoint>"));
I see the file getting created in the Lake directory with the name I want; however, if I then download the file using Storage Explorer and try to open it in, say, VS Code, it's not in a recognisable format (I can "force" Code to open it, but it looks like a binary format to me).
If I sniff the traffic with fiddler I can see the content from the external API is JSON, content-type is application/json and the body shows in fiddler as JSON.
If I look at the calls to the ADLS endpoint I can see a PUT call followed by two PATCH calls.
The first PATCH call looks like it is the one sending the content; it has a Content-Type header of application/octet-stream and the request body is the "binary looking content".
I am using HttpClient.GetStreamAsync as I don't want my Function to have to load the entire API payload into memory (some of the external API endpoints return very large files, over 100 MB). I am thinking I can "stream the response from the external API straight into ADLS".
Is there a way to change how the ADLS FileClient.UploadAsync(Stream stream) method works so I can tell it to upload the file as a JSON file with a content type of application/json?
EDIT:
So it turns out the external API was sending back zipped content, and once I added the following extra AutomaticDecompression code to my Function's startup I got the files uploaded to ADLS as expected.
public override void Configure(IFunctionsHostBuilder builder)
{
    builder.Services.AddHttpClient("default", client =>
    {
        client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate");
    }).ConfigurePrimaryHttpMessageHandler(() => new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
    });
}
@Gaurav Mantri has given me some pointers on whether the pattern of "streaming from an output to an input" is actually correct; I will research this further.
Regarding the issue, please refer to the following code:
var uploadOptions = new DataLakeFileUploadOptions();
uploadOptions.HttpHeaders = new PathHttpHeaders();
uploadOptions.HttpHeaders.ContentType = "application/json";
await fileClient.UploadAsync(stream, uploadOptions);

Representing thread pooling in Spring Integration rather than ExecutorService

Currently, code similar to the following exists in one of our applications:
@Component
public class ProcessRequestImpl {

    private ExecutorService executorService;

    public void processRequest(...) {
        // code to pre-process request
        executorService.execute(new Runnable() {
            public void run() {
                ProcessRequestImpl.this.doWork(...);
            }
        });
    }

    private void doWork(...) {
        // register in external file that request is being processed
        // call external service to handle request
    }
}
The intent of this is to create a queue of requests to the external service. The external service may take some time to process each incoming request. After it handles each one, it will update the external file to register that the specific request has been processed.
ProcessRequestImpl itself is stateless, in that all of its state is set in the constructor and there is no external access to that state. The processRequest() method is called by another component in the application.
If this were to be implemented in a Spring Integration application, which of the following two approaches would be best recommended:
Keep the above code as is.
Extract doWork(), into a separate endpoint, configure that endpoint to receive messages on a channel, and to use configuration to achieve the multi threading in place of the executor service.
Some of the reasons we are looking at Spring Integration are as follows:
To remove the workflow logic from the code itself, so that the workflow and the chain of processing is evident on a higher level.
To simplify each class, enhancing readability and testability.
To avoid threading code if possible, and define that at a higher level of abstraction in configuration.
Given the sample code, could those goals be achieved using Spring Integration? Also, what would an example of the DSL to achieve that look like?
Thanks
Something like
@Bean
public IntegrationFlow flow() {
    return IntegrationFlows.from(SomeGatewayInterface.class)
            .handle("someBean", "preProcess")
            .channel(MessageChannels.executor(someTaskExecutorBean()))
            .handle("someBean", "doWork")
            .get();
}
The argument passed to the gateway method becomes the payload for the preProcess method, which returns some object that becomes the message payload, which in turn becomes the parameter passed to doWork.
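The gateway interface and the executor bean are not shown in the answer; a minimal sketch of what they might look like (the names, the void one-way signature, and the pool sizes are assumptions):

// The DSL builds a messaging gateway proxy from this interface;
// a void method keeps it one-way, matching the original fire-and-forget execute().
public interface SomeGatewayInterface {
    void process(Object request);
}

@Bean
public ThreadPoolTaskExecutor someTaskExecutorBean() {
    // Replaces the hand-rolled ExecutorService; MessageChannels.executor(...)
    // hands each message to this pool before doWork is invoked.
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(4);
    executor.setMaxPoolSize(10);
    return executor;
}

This moves the thread hand-off out of the component and into the flow definition, which is the "define threading at a higher level of abstraction" goal from the question.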

Service Fabric reverse proxy port configurability

I'm trying to write an encapsulation to get the uri for a local reverse proxy for Service Fabric, and I'm having a hard time deciding how I want to approach configurability for the port (known as "HttpApplicationGatewayEndpoint" in the service manifest or "reverseProxyEndpointPort" in the ARM template).

The best way I've thought of is to call "GetClusterManifestAsync" from the fabric client and parse it from there, but I'm not a fan of that for a few reasons. For one, the call returns a string xml blob, which isn't guarded against changes to the manifest schema. I've also not yet found a way to query the cluster manager to find out which node type I'm currently on, so if for some silly reason the cluster has multiple node types and each one has a different reverse proxy port (just being a defensive coder here), that could potentially fail.

It seems like an awful lot of effort to go through to dynamically discover that port number, and I've definitely missed things in the fabric api before, so any suggestions on how to approach this issue?
Edit:
I'm seeing from the example project that it's getting the port number from a config package in the service. I would rather not do it that way, as then I'm going to have to write a ton of boilerplate for every service that needs to use this, just to read configs and pass the value around. Since this is more or less a constant at runtime, it seems to me that it could be treated as such and fetched somewhere from the fabric client?
After some time spent in the object browser I was able to find the various pieces I needed to make this work properly.
public class ReverseProxyPortResolver
{
    /// <summary>
    /// Represents the port that the current fabric node is configured
    /// to use when using a reverse proxy on localhost
    /// </summary>
    public static AsyncLazy<int> ReverseProxyPort = new AsyncLazy<int>(async () =>
    {
        // Get the cluster manifest from the fabric client & deserialize it into a hardened object
        ClusterManifestType deserializedManifest;
        using (var cl = new FabricClient())
        {
            var manifestStr = await cl.ClusterManager.GetClusterManifestAsync().ConfigureAwait(false);
            var serializer = new XmlSerializer(typeof(ClusterManifestType));
            using (var reader = new StringReader(manifestStr))
            {
                deserializedManifest = (ClusterManifestType)serializer.Deserialize(reader);
            }
        }

        // Fetch the setting from the correct node type
        var nodeType = GetNodeType();
        var nodeTypeSettings = deserializedManifest.NodeTypes.Single(x => x.Name.Equals(nodeType));
        return int.Parse(nodeTypeSettings.Endpoints.HttpApplicationGatewayEndpoint.Port);
    });

    private static string GetNodeType()
    {
        try
        {
            return FabricRuntime.GetNodeContext().NodeType;
        }
        catch (FabricConnectionDeniedException)
        {
            // This code was invoked from a non-fabric started application,
            // likely a unit test
            return "NodeType0";
        }
    }
}
News to me in this investigation was that all of the schemas for the Service Fabric XML are squirreled away in an assembly named System.Fabric.Management.ServiceModel.

How to add EventSource to a web application

We finally got EventSource and ElasticSearch correctly configured in our Service Fabric cluster. Now that we have that, we want to add EventSources to the web applications that interact with our Service Fabric applications, so that we can view all events (application logs) in one location and filter/query via Kibana.
Our issue seems to be related to the differences between a Service Fabric app, which is an exe, and a .NET 4.6 (not .NET Core) web app, which is stateless. In Service Fabric we place the using statement that instantiates the pipeline in Program.cs and set an infinite sleep.
private static void Main()
{
    try
    {
        using (var diagnosticsPipeline = ServiceFabricDiagnosticPipelineFactory.CreatePipeline("CacheApp-CacheAPI-DiagnosticsPipeline"))
        {
            ServiceEventSource.Current.ServiceTypeRegistered(Process.GetCurrentProcess().Id, typeof(Endpoint).Name);
            // Prevents this host process from terminating so services keep running.
            Thread.Sleep(Timeout.Infinite);
        }
    }
    catch (Exception e)
    {
        ServiceEventSource.Current.Message("Service host initialization failed: " + e);
        throw;
    }
}
How do I do this in a web app? This is the pipeline code we are using for a non-Service-Fabric implementation of the EventSource:
using (var pipeline = DiagnosticPipelineFactory.CreatePipeline("eventFlowConfig.json"))
{
    IEnumerable ie = System.Diagnostics.Tracing.EventSource.GetSources();
    ServiceEventSource.Current.Message("initialize eventsource");
}
We are able to see the pipeline and send events to ElasticSearch from within the using statement but not outside of it. So the question is:
How and where do we place our pipeline using statement for a web app?
Do we need to instantiate and destroy the pipeline every time we log, or is there a way to reuse the pipeline across the stateless web events? It seems like that would be very expensive and hurt performance. Maybe we can cache a pipeline?
That's the gist; let me know if you need clarification. I see lots of documentation out there for client apps but not much for web apps.
Thanks,
Greg
UPDATE WITH SOLUTION CODE
DiagnosticPipeline pipeline;

protected void Application_Start(Object sender, EventArgs e)
{
    try
    {
        pipeline = DiagnosticPipelineFactory.CreatePipeline("eventFlowConfig.json");
        IEnumerable ie = System.Diagnostics.Tracing.EventSource.GetSources();
        AppEventSource.Current.Message("initialize eventsource");
    }
    catch (Exception ex)
    {
        AppEventSource.Current.Message("EventFlow pipeline initialization failed: " + ex);
    }
}

protected void Application_End(Object sender, EventArgs e)
{
    pipeline?.Dispose();
}
Assuming ASP.NET Core, the simplest way to initialize the EventFlow pipeline would be in the Program.cs Main() method, for example:
public static void Main(string[] args)
{
    using (var pipeline = DiagnosticPipelineFactory.CreatePipeline("eventFlowConfig.json"))
    {
        var host = new WebHostBuilder()
            .UseKestrel()
            .UseContentRoot(Directory.GetCurrentDirectory())
            .UseIISIntegration()
            .UseStartup<Startup>()
            .UseApplicationInsights()
            .Build();

        host.Run();
    }
}
This takes advantage of the fact that host.Run() will block until the server is shut down, and so the pipeline will exist during the time when requests are received and served.
Depending on the web framework you use things might vary. E.g. if the one you use offers "setup" and "cleanup" hooks, you could create a diagnostic pipeline during setup phase (and store a reference to it in some member variable), then dispose of it during cleanup phase. For example, in ASP.NET classic you'd put the code in global.asax.cs and leverage Application_OnStart and Application_OnEnd methods. See Application Instances, Application Events, and Application State in ASP.NET for details.
Creating a pipeline instance every time a request is served is quite inefficient, like you said. There is really no good reason to do that.

Spring Integration: Get rid of code duplication for setting up beans

For my SFTP client project, I am using Spring Integration. We have different clients and have to connect to different SFTP servers, but all of the logic is the same, so I have abstracted it out into AbstractSFTPEndPoint. Each client-specific class implements getClientId(), which is used by AbstractSFTPEndPoint to get client-specific details like SFTP credentials.
However, even though the entire logic is the same for all the clients, I am still having to implement a specific class for each client. This is mainly because we need a separate "MessageSource" for each client.
How can I get rid of this duplication?
public class SFTPEndPointForClientAAAA extends AbstractSFTPEndPoint {

    public String getClientId() {
        return "clientAAAA";
    }

    @Bean(name = "channelForClientAAAA")
    public QueueChannel inputFileChannel() {
        return super.inputFileChannel();
    }

    @ServiceActivator(inputChannel = "channelForClientAAAA", poller = @Poller(fixedDelay = "500"))
    public void serviceActivator(Message message) {
        super.serviceActivator(message);
    }

    @Bean(name = "messageSourceForClientAAAA")
    @InboundChannelAdapter(value = "channelForClientAAAA",
            poller = @Poller(fixedDelay = "50", maxMessagesPerPoll = "2"))
    public MessageSource messageSource() {
        return super.messageSource();
    }
}
Basically I have a bunch of SFTP hosts to connect to, with the same logic applied to each. I want that to be done automatically, without having to implement a class for each SFTP host.
See the dynamic ftp sample. It uses XML, but the same techniques apply to Java configuration. It uses outbound adapters; inbound adapters are a little more complicated because you might need to hook them into a common context. There are links in the readme for how to do that.
However, I recently answered a similar question for multiple IMAP mail adapters using Java configuration and then a follow-up question.
You should be able to use the technique used there.
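As a rough illustration of that technique (a sketch only: the credential parameters, directory names, and per-file logic are assumptions, not code from the linked samples), each SFTP host could get its own inbound flow registered at runtime through IntegrationFlowContext, so no per-client subclass is needed:

@Component
public class DynamicSftpFlowRegistrar {

    @Autowired
    private IntegrationFlowContext flowContext;

    public void registerClientFlow(String clientId, String host, String user, String password) {
        DefaultSftpSessionFactory sessionFactory = new DefaultSftpSessionFactory();
        sessionFactory.setHost(host);
        sessionFactory.setUser(user);
        sessionFactory.setPassword(password);
        sessionFactory.setAllowUnknownKeys(true);

        IntegrationFlow flow = IntegrationFlows
                .from(Sftp.inboundAdapter(sessionFactory)
                                .remoteDirectory("/outbound")                  // assumed remote layout
                                .localDirectory(new File("downloads/" + clientId)),
                        e -> e.poller(Pollers.fixedDelay(50).maxMessagesPerPoll(2)))
                .handle(message -> processFile(clientId, message))             // shared per-client logic
                .get();

        flowContext.registration(flow)
                .id(clientId + ".sftpFlow")
                .register();
    }

    private void processFile(String clientId, Message<?> message) {
        // apply the common logic here
    }
}

Calling registerClientFlow once per configured host (for example at startup, iterating over the client credentials) gives each client its own adapter, channel, and poller without any duplicated bean definitions.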
