Is there a way to log stats/artifacts from an AWS Glue job using MLflow? - python-3.x

Could you please let me know if any such feature is available in the current version of MLflow?

I think the general answer here is that you can log arbitrary data and artifacts from your experiment to your MLflow tracking server using mlflow.log_artifact() or mlflow.set_tag(), depending on how you want to do it. If there is an API to get data from Glue and you can fetch it during your MLflow run, then you can log it: write a CSV or save a .png to disk and log that as an artifact, or capture a value in a variable and use it when setting the tag.
This applies to Glue or any other API you get a response from. One of the key benefits of MLflow is that it is a very general framework, so you can track whatever matters to that particular experiment.
Hope this helps!
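For illustration, here is a minimal sketch of what that could look like, assuming you have boto3 access to the Glue API and a reachable MLflow tracking server; the job name and run ID are placeholders.

```python
import json

import boto3
import mlflow

# Fetch run metadata for a Glue job via the AWS Glue API (job name and run ID are placeholders).
glue = boto3.client("glue")
job_run = glue.get_job_run(JobName="my-glue-job", RunId="jr_1234567890")["JobRun"]

with mlflow.start_run():
    # Record simple values as tags on the MLflow run.
    mlflow.set_tag("glue.job_name", "my-glue-job")
    mlflow.set_tag("glue.job_run_state", job_run["JobRunState"])

    # Write the full response to disk and log it as an artifact.
    with open("glue_job_run.json", "w") as f:
        json.dump(job_run, f, default=str)
    mlflow.log_artifact("glue_job_run.json")
```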

Related

Cognos REST API and scheduling schema loading

I am trying to find out more information about using the REST API to create a schedule for schema loading. Right now, I have to reload the particular schemas via my data server connections manually (click on every schema and Load Metadata) and would like to automate this process.
Any pointers will be much appreciated.
Thank you
If the metadata of your data warehouse is in such flux that you need to reload it frequently enough to want to automate the process, then you need to understand that your data warehouse is in no way ready for use.
So the question becomes: why would you want to frequently reload the metadata of a data source schema? I'm guessing that you are refreshing the data in your database and, because your query cache has not expired, you are not seeing the new data.
So the answer is, you probably don't want to do what you think you need to do unless you can convince me otherwise.
Also, if you enter some obvious search terms you will find the Cognos Analytics REST API documentation without too much difficulty.

Deploying PySpark as a service

I have code in PySpark that parallelizes heavy computation. It takes two files and several parameters, performs the computation, and generates information that is currently stored in a CSV file (ideally it would eventually be stored in a Postgres database).
Now, I want to consume this functionality as a service from a system built in Django, from which users will set the parameters of the Spark service, select the two files, and then query the results of the process.
I can think of several ways to cover this, but I don't know which one is the most convenient in terms of simplicity of implementation:
Use the Spark REST API: this would allow a request to be made from Django to the Spark cluster. The problem is that I found no official documentation; everything I get comes from blogs whose parameters and experiences correspond to one particular situation or solution. At no point do they mention, for example, how I could send the files to be consumed by Spark through the API, or retrieve the result.
Develop a simple API on Spark's master node to receive all the parameters and execute the spark-submit command at the OS level (see the sketch after this question). The awkward part of this solution is that everything must be handled at the OS level: the parameter files and the final result of the process must be saved on disk where the Django server can access them to later store the information in the DB.
Integrate the Django app on the master server, writing the PySpark code inside it; Spark connects to the master server and runs the code that manipulates the RDDs. This scheme does not convince me because it sacrifices the independence between Spark and the Django application, which is already huge.
If someone could enlighten me about this, maybe due to lack of experience I am overlooking a cleaner, more robust, or idiomatic solution.
Thanks in advance
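As a rough illustration of option 2 above (not a recommendation), a minimal HTTP endpoint that shells out to spark-submit might look like the sketch below; Flask, the master URL, the script path, and the parameter names are all assumptions made for the example.

```python
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/run-job", methods=["POST"])
def run_job():
    # Parameters come from the Django side; names and paths here are placeholders.
    params = request.get_json()
    cmd = [
        "spark-submit",
        "--master", "spark://spark-master:7077",
        "/opt/jobs/heavy_computation.py",
        params["file_a"],
        params["file_b"],
        str(params["threshold"]),
    ]
    # Run spark-submit at the OS level and report its exit status back to the caller.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return jsonify({"returncode": result.returncode, "stderr": result.stderr[-2000:]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A managed REST layer such as Apache Livy covers a similar need without a hand-rolled endpoint, which may be worth evaluating before building option 2 yourself.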

Existing tool to parse and analyze logs

I'm building an application with Node.js that calls APIs to collect data and organize it. However, the need has arisen for systematic logging and the display of differential logs: the application needs to show users what changed between consecutive states, or within a specified time span. Is there any existing tool that would help me achieve that?

How to process large .kryo files for graph data using TinkerPop/Gremlin

I am new to Apache TinkerPop.
I have done some basic things like installing the TinkerPop Gremlin Console, creating a graph .kryo file, loading it in the Gremlin Console, and executing some basic Gremlin queries. All good till now.
But I wanted to check how we can process .kryo files that are very large, say more than 1000 GB. If I create a single .kryo file, loading it in the console (or via some code) is not feasible, I think.
Is there any way we can deal with graph data that is this huge?
Basically, I have some graph data stored in an Amazon Neptune DB; I want to take it out, store it in files (e.g. .kryo), and process it later with Gremlin queries. Thanks in advance.
Rather than use Kryo, which is Java specific, I would recommend using something more language agnostic such as CSV files. If you are using Amazon Neptune you can use the Neptune Export tool to export your data as CSV files.
See the Neptune Export documentation, Git repo, and CloudFormation template for details.
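If the goal is simply to run Gremlin queries against the data later, another option is to skip the giant single file entirely and query Neptune directly over its Gremlin endpoint, pulling data out in batches. Here is a minimal gremlinpython sketch of that idea; the endpoint address is a placeholder, and this is not the Neptune Export tool itself.

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Placeholder endpoint; replace with your Neptune cluster's Gremlin endpoint.
conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Pull vertices out in pages instead of materialising the whole graph at once.
page_size = 10_000
offset = 0
while True:
    batch = g.V().range_(offset, offset + page_size).valueMap(True).toList()
    if not batch:
        break
    # Process or persist the batch here (e.g. append rows to CSV files).
    offset += page_size

conn.close()
```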

Best practice for logging mechanism in ETL processing

What is the best practice for a logging mechanism in ETL processing?
We are developing an ETL application, and we want to use log analytics to log data.
Could anybody provide best practices for a logging mechanism that meets industry standards?
I have googled and found this link: https://www.timmitchell.net/post/2016/03/14/etl-logging/
Any help is appreciated.
Thanks in advance
I recently implemented one at one of the organisations I worked with. It is custom built because of the technology choice. The following is what is included in the logging.
It acts as a wrapper around any ETL job; that is, there is a template, and the template has built-in logging.
The template has the notion of master and child jobs and logs according to whether the job is a master or a child.
The logging captures the following:
Status of the job: success, failure, or warning
Source details (e.g. name of the file or source table)
Data classification tagging
Business owner of the incoming data source
Row count of the raw file vs the row count loaded
An alert sent to a distribution list if the job fails
A ticket raised via the service desk if the job fails
It depends on your requirements; you may want to capture more or less.
Good luck
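To make the wrapper idea concrete, here is a minimal Python sketch of a job template with built-in logging; all names (run_etl_job, send_alert, the captured fields) are hypothetical and would map onto whatever your stack actually provides.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def send_alert(message: str) -> None:
    # Placeholder: wire this up to email, a distribution list, or a service-desk API.
    log.error("ALERT: %s", message)

def run_etl_job(job_name: str, source_name: str, owner: str, load_fn) -> None:
    """Template wrapper: runs load_fn() and logs status, source details, and row counts."""
    start = time.time()
    log.info("job=%s source=%s owner=%s status=started", job_name, source_name, owner)
    try:
        # load_fn returns (raw row count, loaded row count) in this sketch.
        rows_in, rows_loaded = load_fn()
        status = "success" if rows_in == rows_loaded else "warning"
        log.info(
            "job=%s status=%s rows_in=%d rows_loaded=%d duration=%.1fs",
            job_name, status, rows_in, rows_loaded, time.time() - start,
        )
    except Exception as exc:
        log.exception("job=%s status=failure", job_name)
        send_alert(f"{job_name} failed: {exc}")
        raise

# Example usage with a dummy loader.
run_etl_job("daily_sales", "sales.csv", "finance-team", lambda: (1000, 1000))
```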
