best practice for logging mechanisam in ETL processing

best practice for logging mechanisam in ETL processing - azure

What is the best practice for logging mechanisam in ETL processing?
Actually we are developing ETL application .in this we want to use log analaytics to log data
Could anybody provide best practice for logging mechanism at industry standards.
i have googled below link :https://www.timmitchell.net/post/2016/03/14/etl-logging/
any help is appreciated.
Thanks in advance

I recently implemented one in one of the organisation. It is a custom built because of the technology choice. Following is what is included in the logging.
It acts as a wrapper around any ETL job aka there is a template developed and the template has in built logging
The template has a feature of master and child job and logs based on master or child
The logging captures the following:
Status of the job - success, failure, warning
Source details (e.g name of file or source table etc name)
data classification tagging
business owner of the incoming data source
row count of raw file vs the row count loaded
Send an alert to a distribution list if the job fails
Raises a ticket via service desk if job fails
It depends on your requirements, you may want to capture more or less.
Good luck

Related

Azure data factory - Dashboard Log Query - Filter failed pipelines who successfully rerun

I've been tasked with reducing monitor overhead of a data lake (~80TiB) with multiple ADF pipelines running (~2k daily). Currently we are logging Failed pipeline runs by doing a query on ADFPipelineRun. I do not own these pipelines, nor do I know the inner workings of existing and future pipes, I cannot make assumptions on how to filter these by custom logic in my queries. Currently the team is experiencing fatigue with these, most failed piperuns are solved during their reruns.
How can I filter these failures so they dont show up when a rerun succeeds?
The logs exposes a few id's that initially looks interesting, like Id, PipelineRunId, CorrelationId, RunId, but none of these will link a failed pipe to a successful one.
The logs does however show an interesting column, UserProperties, that apparently can be dynamically populated during the pipeline run. There may be a solution to be found here, however it would require time and friction for all existing factories to be reconfigured.
Are there any obvious solutions I have overlooked here? Preferably Azure native solutions. I can see that reruns and failures are linked inside ADF Studio, but I cannot see a way to query it externally.

After a discussion with the owner of the ADF pipes we realized the naming convention of the pipelines would allow me to filter out the noisy failing pipes that would later succeed. It's not a universal solution but it will work for us as the naming convention is enforced across the business unit I am supporting

Existing tool to parse and analyze logs

I'm coding an application via nodejs that parses APIs to collect data and organize it. However, the need for systematic logging and display of differential logs has risen. The application needs to show users what changed with each consecutive state changes or within a specified time span. Is there any existing tool that would help me achieve that?

Would Prometheus and Grafana be an incorrect tool to use for request logging, tracking and analysis?

I currently am creating a faster test harness for our team and will be recording a baseline from our prod sdk run and our staging sdk run. I am running the tests via jest and want to eventually fire the parsed requests and their query params to a datastore of sorts and have a nice UI around it for tracking.
I thought that Prometheus and Grafana would be able to provide that, but after getting a little POC for myself working yesterday it seems that this combo is more used for tracking application performance rather than request log handling/manipulation/tracking.
Is this the right tool to be using for what I am trying to achieve and if so might someone shed some light on where I might find some more reading aligned with what I am trying to do?

Prometheus does only one thing and it well. It collects metrics and store them. It is used for monitoring your infrastructure or applications to monitor performance, availability, error rates etc. You can write rules using PromQL expression to create alert based on conditions and send them to alert manager which can send it to Pager duty, slack, email or any ticketing system. Even though Prometheus comes with a UI for visualising the data it's better to use Grafana since it's pretty good with it and easy to analyse data.
If you are looking tools for distributed tracing you can check Jaeger

Spring Cloud DataFlow http polling and deduplication

I have been reading much Spring Cloud DataFlow and related documentation in order to produce a data ingest solution that will run in my organization's Cloud Foundry deployment. The goal is to poll an HTTP service for data, perhaps three times per day for the sake of discussion, and insert/update that data in a PostgreSQL database. The HTTP service seems to provide 10s of thousands of records per day.
One point of confusion thus far is a best practice in the context of a DataFlow pipeline for deduplicating polled records. The source data do not have a timestamp field to aid in tracking polling, only a coarse day-level date field. I also have no guarantee that records are not ever updated retroactively. The records appear to have a unique ID, so I can dedup the records that way, but I am just not sure based on the documentation how best to implement that logic in DataFlow. As far as I can tell, the Spring Cloud Stream starters do not provide for this out-of-the-box. I was reading about Spring Integration's smart polling, but I'm not sure that's meant to address my concern either.
My intuition is to create a custom Processor Java component in a DataFlow Stream that performs a database query to determine whether polled records have already been inserted, then inserts the appropriate records into the target database, or passes them on down the stream. Is querying the target database in an intermediate step acceptable in a Stream app? Alternatively, I could implement this all in a Spring Cloud Task as a batch operation which triggers based on some schedule.
What is the best way to proceed with respect to a DataFlow app? What are common/best practices for achieving deduplication as I described above in a DataFlow/Stream/Task/Integration app? Should I copy the setup of a starter app or just start from scratch, because I am fairly certain I'll need to write custom code? Do I even need Spring Cloud DataFlow, because I'm not sure I'll be using its DSL at all? Apologies for all the questions, but being new to Cloud Foundry and all these Spring projects, it's daunting to piece it all together.
Thanks in advance for any help.

You are on the right track, given your requirements you will most likely need to create a custom processor. You need to keep track of what has been inserted in order to avoid duplication.
There's nothing preventing you from writing such processor in a stream app, however performance may take a hit, since for each record you will issue a DB query.
If order is not important, you could parallelize the query so you could process several concurrent messages, but in the end your DB would still pay the price.
Another approach would to use a bloomfilter that can help quite a lot on speeding up your checking for inserted records.
You can start by cloning the starter apps, you could have a poller trigger an http client processor that fetches your data and then go through your custom code processor and finally to a jdbc-sink. Something like stream create time --triger.cron=<CRON_EXPRESSION> | httpclient --httpclient.url-expression=<remote_endpoint> | customProcessor | jdbc
One of the advantages of using SCDF is that you could independently scale your custom processor via deployment properties such as deployer.customProcessor.count=8

Spring Cloud Data Flow builds integration streams for data based on the Spring Cloud Stream, which, in turn, is fully based on the Spring Integration. And all the principles exist in Spring Integration can be applied everywhere there on the SCDF level.
That really might be a case that you won't be able to avoid some codding, but what you need is called in EIP Idempotent Receiver. And Spring Integration provides one for us:
#ServiceActivator(inputChannel = "processChannel")
#IdempotentReceiver("idempotentReceiverInterceptor")
public void handle(Message<?> message)

Ant script for message broker monitoring

Context
I want to develop an automated script for broker (IIB9/10) resource monitoring, capturing information about broker running status, message flows deployed, jvm usage, number of threads running, etc.
The initial thought is to have a report generated using scripts and then displayed over a browser.
Question
Can this be entirely done using only Ant scripts (i am not sure as have not explored iterative processing in Ant in detail) or a combination of Ant and batch/shell scripts is the best bet?
I know Web user interface in IIB10 does most of it but i want to add some features.

I suggest you to take a look at message flow statistics and accounting:
http://www-01.ibm.com/support/knowledgecenter/SSMKHH_9.0.0/com.ibm.etools.mft.doc/ac19100_.htm?lang=en
This is a feature of IIB by which it is capable of emitting resource statistics. The statistics are published to a topic in a well defined XML format. I would try solving your requirement by writing an application to read these messages and use the data in them to generate your graphs or other reports.
There is a support pack, IS03 which can give you an idea of such an application.
This will not cover everything you mentioned, for example monitoring what flows are deployed cannot be achieved like this, but it gives a comprehensive view of the load and performance of your applications:
http://www-01.ibm.com/support/knowledgecenter/SSMKHH_9.0.0/com.ibm.etools.mft.doc/bj10440_.htm?lang=en
And there is a resource statistics feature as well for monitoring resources used by your applications:
http://www-01.ibm.com/support/knowledgecenter/SSMKHH_9.0.0/com.ibm.etools.mft.doc/bj43310_.htm?lang=en

To get everything you will need a variety of tools I think. You can use Resource Stats and Accounting / Stats as suggested by Attila to get JVM and thread usage. The Broker publishes updates to a topic so you can create a simple subscriber to grab that info.
For deploy related info, stop / start state and so forth I would be looking at building simple Integration API or REST API applications to call from ant.
You can find documentation for these API's here:
http://www-01.ibm.com/support/knowledgecenter/SSMKHH_10.0.0/com.ibm.etools.mft.doc/be43410_.htm?lang=en
and here:
http://www-01.ibm.com/support/knowledgecenter/api/content/nl/en-us/SSMKHH_10.0.0/com.ibm.etools.mft.restapi.doc/index.html

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string