For those who don't know, Matillion is an ETL/ELT tool that can be used to handle Snowflake data flows (among others).
One interesting feature is that we can write script tasks in either Bash or Python.
I had a similar experience in the past with SQL Server Integration Services where it was possible to write C# within tasks as well.
IMHO this presented two big flaws:
SSIS packages being stored as a "blob" made them extremely ill-suited to version control. Any tiny change (like just adjusting a task in a pipeline) usually made comparison between two versions practically impossible.
Sharing code between tasks was extremely difficult (was it even possible?).
Matillion "jobs" are stored as JSON and, like SSIS, it is impossible to compare two versions of the same job, regardless of how tiny the change.
Also, coding something big in Python within a simple text window is simply unthinkable.
So, I would like to write my Python code outside Matillion and just use Matillion tasks as "glue" between the different functions/packages I would write outside.
Does anyone have experience doing this?
How can I make my Python file/package available to Matillion Python scripts?
How could I handle different versions of my Python packages in the different Matillion "Versions" of my jobs?
Thanks
I am sure that by now you might have figured something out.
For anyone else who is looking, here is another method:
Within Matillion there is a Python Script component.
We can use it to trigger Python scripts outside of Matillion.
I am doing it all the time with the following snippet:
import os

# Run a script on a remote machine over SSH; sshpass supplies the password non-interactively.
# Replace the <...> placeholders and paths with your own values (note the user@host form).
return_var = os.system("sshpass -p <Remote_Machine_User_Password> ssh <Remote_Machine_User>@<Remote_Machine_IP> -o StrictHostKeyChecking=no '/path/to/python/executable/in/remote/machine /path/to/python/script/in/remote/machine'")
print(return_var)
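If you'd rather not shell out through os.system, a subprocess-based variant of the same idea is possible. This is just a sketch, assuming key-based SSH auth is already set up between the Matillion server and the remote machine and that the component runs Python 3.7+; the user, host and paths are placeholders:

import subprocess

# Placeholders: replace with your remote user/host and the remote paths.
remote = "remote_user@remote_host"
remote_python = "/path/to/python/executable/in/remote/machine"
remote_script = "/path/to/python/script/in/remote/machine"

# Run the remote script over SSH and capture its output and exit code.
result = subprocess.run(
    ["ssh", "-o", "StrictHostKeyChecking=no", remote, f"{remote_python} {remote_script}"],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    # A non-zero exit code fails the Matillion component instead of silently succeeding.
    raise RuntimeError(f"Remote script failed: {result.stderr}")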
Thanks!
To answer your question: "How can I make my Python file/package available to Matillion Python scripts? How could I handle different versions of my Python packages in the different Matillion "Versions" of my jobs?"
I think "bash" component within Matillion will be helpful for you. The only requirement here is you should be able to ssh from matillion server to any other server where your external python package exists.
Within the bash component of your Matillion job, you just need to include the ssh code to the remote server and command line to execute your python package. I believe "Here doc" or Bash Here Document is what will help you:
ssh -T $remote_server_ip << doc
ls -lrt
python /path/to/python_script/python_script.py
doc
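One detail worth noting with the here document: with an unquoted delimiter (<< doc) any $variables inside it are expanded on the Matillion side before being sent; quote the delimiter (<< 'doc') if you want them expanded on the remote server instead. Setting up key-based SSH authentication between the two servers also avoids embedding passwords in the job.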
Regarding your question about versioning, I am not sure, but I am also interested in knowing a solution.
Thanks
In the documentation, it is written that they can be used for writing custom django-admin commands. But my question is: why do we need to write custom Django admin commands? The example given in the official documentation is a bit dry to me. I would be really grateful if someone could give real-world examples that show how they are used in real life.
Django docs on management/commands: https://docs.djangoproject.com/en/2.2/howto/custom-management-commands/
I mainly use them from cron / scheduled tasks.
Some potential examples would be:
Sending out Reports/Emails
Running Scripts to Update+Sync some Values
Updating the Cache
Any large update to values: save it as a command to run on the prod environment
I make it and test it locally, but then I don't want to copy and paste it into an SSH terminal because it sometimes gets all sorts of messed up in the paste.
I also have a management command dothing that sets up the entire project: runs migrations, collects static files, imports the DB, creates test users, creates required folder structures, etc.
I also have a couple of commands that I haven't made into views: little tools that help me validate and clean data and spit out a representation of it.
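For anyone who hasn't written one before, a management command is just a file under an app's management/commands/ directory. Here is a minimal sketch of the reports/emails idea above; the app name, command name and report logic are made-up placeholders:

# myapp/management/commands/send_report.py  (hypothetical app and command names)
from django.core.management.base import BaseCommand
from django.core.mail import send_mail

class Command(BaseCommand):
    help = "Send the daily report email (meant to be run from cron)."

    def add_arguments(self, parser):
        parser.add_argument("--recipient", default="ops@example.com")

    def handle(self, *args, **options):
        # Real logic would query your models and build the report body here.
        send_mail(
            subject="Daily report",
            message="Everything is fine.",
            from_email="noreply@example.com",
            recipient_list=[options["recipient"]],
        )
        self.stdout.write(self.style.SUCCESS("Report sent"))

A cron entry then just calls python manage.py send_report --recipient someone@example.com.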
Django scheduled operations and report generation from cron is the obvious one.
Another use I have is loading data into the DB from CSV files. It's easy in the management-command environment to handle bad rows: I write the original CSV row into an exceptions file (with an error-description column appended) and can then look at it and decide what to do about those rows. Sometimes a trivial edit is enough and I feed it through the management command again. It's possible to do the same via a view, but that's extra work for, IMO, no gain.
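A stripped-down sketch of that pattern, with the model, columns and defaults invented purely for illustration:

# myapp/management/commands/load_rows.py  (hypothetical names; adapt the parsing to your columns)
import csv
from django.core.management.base import BaseCommand
from myapp.models import Measurement  # placeholder model

class Command(BaseCommand):
    help = "Load rows from a CSV file, diverting bad rows to an exceptions file."

    def add_arguments(self, parser):
        parser.add_argument("csv_path")
        parser.add_argument("--exceptions", default="exceptions.csv")

    def handle(self, *args, **options):
        with open(options["csv_path"], newline="") as src, \
             open(options["exceptions"], "w", newline="") as bad:
            writer = csv.writer(bad)
            for row in csv.reader(src):
                try:
                    Measurement.objects.create(name=row[0], value=float(row[1]))
                except Exception as exc:
                    # Keep the original row and append an error-description column.
                    writer.writerow(row + [str(exc)])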
I am new to writing Python code. I have currently written a few modules for data analysis projects. The data is queried from AWS Redshift tables and summarized in CSVs and Excel spreadsheets.
At this point I do not want to pass it on to other users in the org as I do not want to expose the code.
Is there an easy way to operationalize the code without exposing it?
PS: I am in the process of learning front-end development (Flask, HTML, CSS) so users can input data and get results back.
Python programs are almost always shipped as bare source. There are ways of compiling Python code into binaries, but this is not a common thing to do and usually I would not recommend it, as it's not as easy as one might expect (which is too bad, really).
That said, you can check out cx_Freeze and Cython.
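For reference, a cx_Freeze build script is quite small. A minimal sketch, where the script name and metadata are placeholders and the available options vary between cx_Freeze versions:

# setup.py -- hypothetical cx_Freeze build script; names are placeholders.
from cx_Freeze import setup, Executable

setup(
    name="analysis_tool",
    version="0.1",
    description="Standalone build of the analysis script",
    executables=[Executable("analysis.py")],
)

Running python setup.py build then produces a build/ directory containing the executable plus bundled dependencies. Keep in mind this ships compiled bytecode rather than truly hiding the logic, so it obscures the source rather than protecting it.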
I have several apps I'm developing for end users who have no idea how to use Python. I have already discovered how to set up a package that allows them to run any script without Python knowledge, but I don't know how to minimize the distribution size by only including the subsets of each imported library (i.e. the actual function calls in large libs like NumPy) that are required. Is there a way to output the actual subcomponents of each imported library that are accessed during execution? All my internet searches end up at circular imports, which is not what I need. There must be some Python equivalent of Dependency Walker I have yet to discover. Any libs that can outline this would be much appreciated.
[UPDATE]
I converted Snakefood 1.4 over to Python 3.x (tested to build on 3.5) with python setup.py install and saved it here: https://github.com/mrslezak/snakefood, per the accepted answer.
Use Snakefood
Here's a command:
sfood -i -r myscript.py | sfood-cluster > dependencies.txt
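If installing another tool is a problem, the standard library's modulefinder gives a rough, module-level equivalent (it won't go down to individual NumPy functions, and neither does Snakefood). A quick sketch; the script path is a placeholder:

# Hypothetical sketch using the stdlib ModuleFinder as a coarse alternative.
from modulefinder import ModuleFinder

finder = ModuleFinder()
finder.run_script("myscript.py")  # path to the entry-point script

# List every module the script ends up importing, with its source file where known.
for name, module in sorted(finder.modules.items()):
    print(name, getattr(module, "__file__", None))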
I'm looking for a way for one program to send a string to another program (both in Tcl). I've been looking into threading, but I haven't been able to understand how it works or how to do what I want with it.
I would suggest you look at the comm package in tcllib. This package provides remote script execution between Tcl interpreters using sockets as the communications mechanism. Since both sides are in Tcl, this is an easy way to go.
I'm looking for some software that allows me to control a server-based application, that is, a bunch of interdependent processes that I'd like to be able to start up, shut down and monitor in a controlled manner.
I've come across programs like Autosys, but that's expensive and very much over the top for what I want. I've also seen AppCtl, but that seems not to handle dependencies. Maybe it would be possible to repurpose the init scripts?
Oh, and as an added complication it should be able to run on a Solaris 10 or Linux box without installing any new binaries. On the boxes I've seen recently, that means shell scripts and Perl but not Python.
Do any such programs exist or do I need to dust off my copy of Programming Perl?
Try supervise, which is what qmail uses to keep track of its services/startup applications:
http://cr.yp.to/daemontools/supervise.html
G'day,
Have a look in /etc/init.d for something similar and use that as a basis. See also crontab, or maybe at, for running things on a regular basis.
cheers,
Rob
Solaris-only as far as I know, but wouldn't Solaris 10's SMF do what you want?
Try GNU Batch. It looks like it supports what you need.
http://www.gnu.org/software/gnubatch/