How to change snaplex dynamically based upon usage? - snaplogic

While executing a child pipeline I want to fetch all the available Snaplexes and assign one to the child pipeline based on usage. So far I am able to randomize the Snaplex while spawning the child pipeline.
Any help is appreciated.

I'm not sure exactly what the requirement is, but if performance is what you are concerned with, then I would suggest running child pipelines on the same Snaplex as the parent.
As per SnapLogic documentation -
When running a parent and child Pipeline on the same node, data is not transferred over a network, which improves the performance of the Pipeline execution.
This is because -
When using the Pipeline Execute Snap, if you specify a different Snaplex to run a child Pipeline, then the input documents and Pipeline parameters are passed (through the control plane using encrypted transport) to a node in the designated Snaplex.
Please refer to - SnapLogic Docs - Managing Snaplexes
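If you do still need usage-based selection, a rough sketch of the fetch-and-pick step is below. The endpoint path, response shape, and field names are assumptions to verify against the SnapLogic Public API documentation for your pod/org; the chosen label would then be passed to the child via a pipeline parameter bound to the Pipeline Execute Snap's Snaplex property.

```python
# Hedged sketch: pick the least-loaded Snaplex before spawning a child pipeline.
# The endpoint path, response shape, and field names below are ASSUMPTIONS --
# verify them against the SnapLogic Public API documentation for your pod/org.
import requests

POD = "https://elastic.snaplogic.com"
ORG = "<your-org>"                                  # placeholder

resp = requests.get(
    f"{POD}/api/1/rest/public/snaplex/{ORG}",       # assumed monitoring endpoint
    auth=("<user>", "<password>"),                  # placeholder credentials
)
resp.raise_for_status()
snaplexes = resp.json()["response_map"]             # assumed response shape

# Pick the Snaplex with the fewest running pipelines (field name assumed).
least_loaded = min(
    snaplexes.values(),
    key=lambda plex: plex.get("running_pipelines", 0),
)

# Pass this label to the child pipeline, e.g. as the value of a pipeline
# parameter consumed by the Pipeline Execute Snap's Snaplex property.
print(least_loaded["label"])
```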

Related

Cross Job Dependencies in Databricks Workflow

I am trying to create a data pipeline in Databricks using the Workflows UI. I have a significant number of tasks which I want to split across multiple jobs, with dependencies defined across them. But it seems that in Databricks there cannot be cross-job dependencies, so all tasks must be defined in the same job, with dependencies defined only between tasks. This results in a very big and messy job diagram.
Is there any better way here?
P.S. I have access only to the UI portal and won't be able to use the Jobs API (in case there is some way to do this via the API).
It's possible to trigger another job, but you will need to use the REST API for that, plus you will need to handle its execution, etc.
But the ability to have another job as a subtask is coming: if you watch the recent quarterly roadmap webinar, you will see a slide about "Enhanced control flow" that mentions "Trigger another job" functionality.
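For anyone who does have API access, here is a minimal sketch of that approach against the Jobs 2.1 REST endpoints (the workspace URL, token, and job ID are placeholders):

```python
# Minimal sketch: trigger a downstream Databricks job and wait for it to finish
# using the Jobs 2.1 REST API. Host, token, and job_id are placeholders.
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Kick off the downstream job.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers=HEADERS,
    json={"job_id": 12345},           # placeholder job ID
)
resp.raise_for_status()
run_id = resp.json()["run_id"]

# Handle its execution yourself, as noted above: poll until the run terminates.
while True:
    state = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run_id},
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

print("downstream run finished:", state.get("result_state"))
```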

Is it possible to run different tasks on different schedules with prefect?

I'm taking my first steps with Prefect, and I'm trying to see what its degrees of freedom are. To this end, I'm investigating whether Prefect supports running different tasks on different schedules in the same Python process. For example, Task A might have to run every 5 minutes, while Task B might run twice a day on a cron schedule.
It seems to me that schedules are associated with a Flow, not with a task, so to do the above one would have to create two distinct one-task Flows, each with its own schedule. But even so, given that running a flow is a blocking operation, I can't see how to "start" both flows concurrently (or pseudo-concurrently; I'm perfectly aware the flows won't execute on separate threads).
Is there a built-in way of getting the tasks running on their independent schedules? I'm under the impression that there is a way to achieve this, but given my limited experience with prefect, I'm completely missing it.
Many thanks in advance for any pointers.
You are right that schedules are associated with Flows and not Tasks, so the only place to add a schedule is a Flow. Running a Flow is a blocking operation if you are using the open-source Prefect core only. For production use cases, it's recommended to run your Flows against Prefect Cloud or Prefect Server. Cloud is the managed offering, and Server is when you host it yourself. Note that Cloud has a very generous free tier.
When using a backend, you will use an agent that will kick off the flow run in a new process. This will not be blocking.
To get started with a backend, you can check the docs here.
This Prefect Discourse topic discusses a very similar problem and shows how you could solve it using a flow-of-flows orchestrator pattern.
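As a minimal sketch of the two one-task Flows with their own schedules (Prefect 1.x API; the project name is a placeholder):

```python
# Minimal sketch (Prefect 1.x): two one-task flows, each with its own schedule.
# Registered with Prefect Cloud/Server, an agent launches each run in its own
# process, so neither schedule blocks the other. Project name is a placeholder.
from datetime import timedelta

from prefect import task, Flow
from prefect.schedules import CronSchedule, IntervalSchedule

@task
def task_a():
    print("Task A: runs every 5 minutes")

@task
def task_b():
    print("Task B: runs twice a day")

with Flow("flow-a", schedule=IntervalSchedule(interval=timedelta(minutes=5))) as flow_a:
    task_a()

with Flow("flow-b", schedule=CronSchedule("0 6,18 * * *")) as flow_b:
    task_b()

# Requires a configured backend (Prefect Cloud or Server) and a running agent.
flow_a.register(project_name="demo")
flow_b.register(project_name="demo")
```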
One way to approach it is to leverage Caching to avoid recomputation of certain tasks that require lower-frequency scheduling than the main flow.
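A minimal sketch of that caching idea (Prefect 1.x API; the 12-hour window is just an example):

```python
# Minimal sketch (Prefect 1.x): one flow on a 5-minute schedule, with the
# low-frequency task's output cached so it is recomputed at most every 12 hours.
from datetime import timedelta

from prefect import task, Flow
from prefect.schedules import IntervalSchedule

@task
def task_a():
    print("Task A: every 5 minutes")

@task(cache_for=timedelta(hours=12))   # reuse the cached result for 12 hours
def task_b():
    print("Task B: recomputed at most every 12 hours")
    return "expensive result"

with Flow("main", schedule=IntervalSchedule(interval=timedelta(minutes=5))) as flow:
    task_a()
    task_b()

flow.run()  # blocking scheduled runs; fine for a single-process setup
```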

Azure Yaml Schema Batch Trigger

Can anyone explain what batch in the Azure YAML schema trigger does?
The only explanation on the MSFT website is:
batch changes if true; start a new build for every push if false (default)
and this isn't really clear to me.
Batch changes or Batch trigger actually means batching your CI runs.
If you have many team members uploading changes often, you may want to reduce the number of runs you start. If you set batch to true, when a pipeline is running, the system waits until the run is completed, then starts another run with all changes that have not yet been built.
To clarify with an example, let us say that a push A to master caused the pipeline to run. While that pipeline is running, additional pushes B and C occur in the repository. These updates do not start new independent runs immediately. After the first run is completed, all pushes up to that point in time are batched together and a new run is started.
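A minimal trigger block that turns this behavior on looks like the following (the branch name is just an example):

```yaml
trigger:
  batch: true        # queue at most one pending run; batch intervening pushes
  branches:
    include:
      - master       # example trigger branch
```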
My interpretation of the MS documentation is that the batch boolean is meant to address frequent pushes to the same trigger branch or set of branches (and possibly tags). It works such that if the build pipeline is already running, any additional changes pushed to the listed branches are batched together and queued behind the current run. This does mean that those subsequent pushes will be part of the same subsequent pipeline run, which is a little strange, but given that's how Microsoft intended it to work, it should be fine.
Basically, batching is great for repos with demanding pipeline runs and a high potential for multiple overlapping pushes.
For reference, here is the documentation link: https://learn.microsoft.com/en-us/azure/devops/pipelines/repos/azure-repos-git?view=azure-devops&tabs=yaml#batching-ci-runs

Migrating multi-threaded program to Docker containers

I have an old algorithm that executes two steps, say A and B. A connects to an external service and retrieves data points. A needs to be executed n times, and all the data so gathered is passed as input to B. To help scale the A part, it was implemented using multi-threading: 10 threads are spawned, and each connects to n/10 endpoints. Once all threads complete execution, the complete dataset is provided as input to B.
We are planning to create a Docker image of this algorithm. While we do that, I would like to explore whether we can do away with the multi-threading and instead deploy multiple containers. This gives us better scalability, as n is variable and is sometimes very small or very large.
I could use Kubernetes to orchestrate these containers. However, I see two challenges:
1. How do I gather back all the data points into my core algo?
2. How do I know that all containers have finished processing, and that I could move to step B?
Any pointers or help is appreciated.
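One common pattern that addresses both challenges is to run step A as a Kubernetes Indexed Job: each pod reads its JOB_COMPLETION_INDEX to decide which slice of endpoints to fetch and writes its partition to shared storage (gathering the data back), and the Job's status tells you when every pod has finished. A hedged sketch using the Kubernetes Python client (the image name, namespace, and storage layout are assumptions):

```python
# Hedged sketch: fan out step A as a Kubernetes Indexed Job, then start step B
# once every pod has finished. Each pod reads JOB_COMPLETION_INDEX to decide
# which slice of the n endpoints to fetch, and writes its results to shared
# storage (e.g. a ReadWriteMany volume or an object store) -- that shared
# storage is how the data points are gathered back (challenge 1).
import time
from kubernetes import client, config

config.load_kube_config()                 # or load_incluster_config()
batch = client.BatchV1Api()

n = 50                                    # number of work slices for step A

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="step-a"),
    spec=client.V1JobSpec(
        completions=n,
        parallelism=10,                   # at most 10 pods at a time
        completion_mode="Indexed",        # pods get JOB_COMPLETION_INDEX
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="step-a",
                    image="myorg/step-a",  # placeholder image for step A
                )],
            )
        ),
    ),
)
batch.create_namespaced_job(namespace="default", body=job)

# Challenge 2: poll the Job status until all completions have succeeded.
while True:
    status = batch.read_namespaced_job(name="step-a", namespace="default").status
    if (status.succeeded or 0) >= n:
        break
    time.sleep(10)

# Step B can now read the per-index result files from the shared storage.
```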

Scheduling Azure container instances on demand

I have tasks running on a VM, with the following sequence of events. For scaling purposes I need to be able to run operations on demand, and possibly in parallel.
A simple sequence of events
1. Execute task
2. Task creates dataset file.
3. Start up container instance (Linux).
4. In container, execute operations on the dataset.
5. Write updated dataset.
6. VM consumes dataset.
The environment is Azure.
Azure Files for exchanging the dataset (steps 2 and 5).
PowerShell for creating and starting the container.
PowerShell could also be used for step 4.
I do not wish to use platform-specific event handlers, as it may be necessary to port to other runtime environments. This is a simple use case which I guess many have touched on before. Does anyone have any idea whether HashiCorp Nomad could bring value? Any tips for other tooling that could add value?
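Since the creation/start step is already scripted, one portable-ish alternative to PowerShell is the Azure SDK for Python. A hedged sketch of step 3 with the dataset share mounted (resource names, image, and credentials are placeholders):

```python
# Hedged sketch: create and start a Linux container instance with the Azure
# Files share (used to exchange the dataset) mounted. All names, the image,
# and the storage key are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    AzureFileVolume, Container, ContainerGroup, ResourceRequests,
    ResourceRequirements, Volume, VolumeMount,
)

client = ContainerInstanceManagementClient(
    DefaultAzureCredential(), subscription_id="<subscription-id>"
)

group = ContainerGroup(
    location="westeurope",
    os_type="Linux",
    restart_policy="Never",                    # run once per dataset, then stop
    containers=[
        Container(
            name="worker",
            image="myorg/dataset-worker:latest",   # placeholder image (step 4)
            resources=ResourceRequirements(
                requests=ResourceRequests(cpu=1.0, memory_in_gb=1.5)
            ),
            volume_mounts=[VolumeMount(name="data", mount_path="/mnt/data")],
        )
    ],
    volumes=[
        Volume(
            name="data",
            azure_file=AzureFileVolume(        # dataset exchange (steps 2 and 5)
                share_name="datasets",
                storage_account_name="<account>",
                storage_account_key="<key>",
            ),
        )
    ],
)

# Create (and implicitly start) the container group on demand.
client.container_groups.begin_create_or_update(
    "<resource-group>", "dataset-worker", group
).result()
```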
