Spark: How can I analyze the physical plan before query execution? - apache-spark

I am trying to solve the following problem on Databricks (on Azure): I want to analyze the physical plan of a query before its execution. The idea is that if the physical plan contains a certain path, I want to fail the query execution.
I need to analyze the Physical Plan and not the Logical Plan, because I want to block commands that read from a certain path. However, when I use spark.read.parquet(path), the path does not show up in the Logical Plan but does show up in the Physical Plan. Further, I cannot use access restrictions, because I want to block this only for certain clusters in a Databricks workspace and not for all clusters.
I found the QueryExecutionListener, which can be extended to create a custom class and override the functions onSuccess and onFailure. However, these functions are executed only after the query succeeds or fails, so they don't suit my case. Alternatively, I found that we can extend the Rule class from org.apache.spark.sql.catalyst.rules.Rule and override the apply function. However, in this scenario I can only analyze the Logical Plan and not the Physical Plan.
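For reference, one way to reach the Physical Plan before anything runs is to inspect the Dataset's QueryExecution eagerly, before triggering an action. The sketch below (Java API; the guard method failIfPlanReadsPath and both paths are made-up names) only illustrates that idea for a single query in user code; it is not an automatic, cluster-wide hook:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PhysicalPlanGuard {

        // Inspect the physical plan of a Dataset before any action is triggered
        // and fail fast if it references a blocked path.
        static void failIfPlanReadsPath(Dataset<Row> df, String blockedPath) {
            // queryExecution() is a developer-level API on Dataset;
            // executedPlan() is the physical plan Spark will actually run.
            String physicalPlan = df.queryExecution().executedPlan().toString();
            if (physicalPlan.contains(blockedPath)) {
                throw new IllegalStateException(
                    "Blocked: physical plan reads from " + blockedPath);
            }
        }

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().getOrCreate();
            Dataset<Row> df = spark.read().parquet("/mnt/data/some/table");
            failIfPlanReadsPath(df, "/mnt/data/blocked");  // hypothetical blocked path
            df.count();  // the action only runs if the check above passed
        }
    }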

Related

One Azure Function that can listen to Cosmos DB Change Feed for all containers

Right now I have one Cosmos DB that has three different containers, so I use three different functions, each listening for Change Feed events from this Cosmos DB.
In the future, the number of containers will grow from 3 to 100.
So, is it possible to have one function that listens for all changes in all containers and that can detect which container the changes came from?
The recommended pattern with Cosmos DB is to have a single or few data containers and partition data logically via property values, rather than segmenting into many containers. If at all possible, for change feed and other reasons, it would be worthwhile to review the proposed design to see if there is a way to consolidate containers and avoid this pain.
That said, if an unknown and growing number of containers must be supported, one way this might be achieved dynamically is with the Change Feed Processor via the SDK. When instantiating a processor instance using GetChangeFeedProcessorBuilder, you can provide the container name as a parameter. Given a configured or discovered list of all target containers, multiple change feed processor instances could be created and run in parallel.
This could be hosted in multiple ways. Consider using an ASP.NET Core app with an IHostedService, and avoiding Azure Functions in this case.
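As a rough sketch of that multi-processor idea, using the Java SDK v4 builder (ChangeFeedProcessorBuilder, the counterpart of the .NET GetChangeFeedProcessorBuilder mentioned above); the host names, the shared lease container, and printing the changes are assumptions for illustration only:

    import com.azure.cosmos.ChangeFeedProcessor;
    import com.azure.cosmos.ChangeFeedProcessorBuilder;
    import com.azure.cosmos.CosmosAsyncContainer;
    import com.azure.cosmos.CosmosAsyncDatabase;

    import java.util.ArrayList;
    import java.util.List;

    public class MultiContainerChangeFeed {

        // Build and start one Change Feed Processor per monitored container, so a
        // single deployment can follow the change feed of many containers in parallel.
        static List<ChangeFeedProcessor> startProcessors(CosmosAsyncDatabase database,
                                                         CosmosAsyncContainer leaseContainer,
                                                         List<String> containerNames) {
            List<ChangeFeedProcessor> processors = new ArrayList<>();
            for (String name : containerNames) {
                CosmosAsyncContainer monitored = database.getContainer(name);
                // Lease state for this processor; depending on configuration you may
                // prefer a dedicated lease container (or prefix) per monitored container.
                ChangeFeedProcessor processor = new ChangeFeedProcessorBuilder()
                        .hostName("host-" + name)          // distinct host name per processor
                        .feedContainer(monitored)          // the container being monitored
                        .leaseContainer(leaseContainer)
                        .handleChanges(changes ->
                                // "name" tells you which container the changes came from
                                changes.forEach(item -> System.out.println(name + ": " + item)))
                        .buildChangeFeedProcessor();
                processor.start().subscribe();
                processors.add(processor);
            }
            return processors;
        }
    }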
In short: no.
Change feed in Azure Cosmos DB is a persistent record of changes to a container in the order they occur. Change feed support in Azure Cosmos DB works by listening to an Azure Cosmos container for any changes.
and
Change feed is available for each logical partition key within the container, and it can be distributed across one or more consumers for parallel processing.
The documentation on Change feed in Azure Cosmos DB clearly states a Change Feed is for one specific container.
There probably are, however, different approaches you could take to solve your problem. The most important question is: what actually is the problem you are trying to solve?
If you need to process the changes in Cosmos DB using a Function, I can imagine the logic for processing the changes can (will) be different for each type of data, and thus for each container. If that is not the case, does the data even have to be in different containers?
One option could be to create a timer-triggered Function that reads the change feed with the pull model. This enables you to loop over the containers in that Function and prepare processing of the changes per container (for instance by putting the information on a queue or by using Durable Functions with the Fan-Out/Fan-In pattern).
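A very rough sketch of what one pass of such a timer-triggered Function could look like with the Java SDK v4 pull API (queryChangeFeed); the container list, page size, and starting from the beginning are placeholders, and a real implementation would persist the continuation token between runs instead of re-reading everything:

    import com.azure.cosmos.CosmosContainer;
    import com.azure.cosmos.CosmosDatabase;
    import com.azure.cosmos.models.CosmosChangeFeedRequestOptions;
    import com.azure.cosmos.models.FeedRange;
    import com.azure.cosmos.models.FeedResponse;
    import com.fasterxml.jackson.databind.JsonNode;

    import java.util.List;

    public class ChangeFeedPoller {

        // One polling pass over a known list of containers using the pull model.
        static void pollOnce(CosmosDatabase database, List<String> containerNames) {
            for (String name : containerNames) {
                CosmosContainer container = database.getContainer(name);
                CosmosChangeFeedRequestOptions options =
                        CosmosChangeFeedRequestOptions.createForProcessingFromBeginning(
                                FeedRange.forFullRange());

                for (FeedResponse<JsonNode> page :
                        container.queryChangeFeed(options, JsonNode.class).iterableByPage(100)) {
                    for (JsonNode change : page.getResults()) {
                        // We know which container the change came from because we asked
                        // for it explicitly; enqueue it or hand it to Durable Functions here.
                        System.out.println(name + ": " + change);
                    }
                    // A real job would save page.getContinuationToken() and resume from it
                    // on the next timer run via createForProcessingFromContinuation(...).
                }
            }
        }
    }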

Azure Functions: Understanding Change Feed in the context of multiple apps

According to the diagram at https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor, at least 4 partition key ranges are distributed between two hosts. What I'm struggling to understand in this diagram is the distinction between a host and a consumer. In the context of Azure Functions, would it be true to say that a host is a Function app whereas a consumer is an active/warm instance?
I'd like to create a setup with N Function apps, each with 0-200 active instances (depending on workload). At the same time, I'd like to read the Change Feed. If I use a CosmosDBTrigger with the same connection string and lease container in each app, is this taken care of automatically, or do I need a manual implementation?
The documentation you linked is mainly for the Change Feed Processor, but the Azure Functions binding actually runs the Change Feed Processor underneath.
When just using CFP, it's maybe easier to understand because you are mainly in control of the instances and distribution, but I'll try to map it to Functions.
The document mentions a deployment unit concept:
A single change feed processor deployment unit consists of one or more instances with the same processorName and lease container configuration. You can have many deployment units where each one has a different business flow for the changes and each deployment unit consisting of one or more instances.
For example, you might have one deployment unit that triggers an external API anytime there is a change in your container. Another deployment unit might move data, in real time, each time there is a change. When a change happens in your monitored container, all your deployment units will get notified.
The deployment unit in Functions is the Function App. One Function App can span many instances. So each instance/host within that Function App deployment will act as an available host/consumer.
Further down, the article talks about the dynamic scaling and what it says is basically that, within a Deployment Unit (Function App), the leases will get evenly distributed. So if you have 20 leases and 10 Function App instances, then each instance will own 2 leases and process them independently from the other instances.
One important note on that article: scaling gives you a bigger CPU pool, but not necessarily higher parallelism.
As the documentation mentions, even on a single instance, CFP will read and process each lease it owns on an independent Task. The problem is that all this parallel processing shares the same CPU, so adding more instances will help if you currently see the instance hitting a CPU bottleneck.
Now, in your example, you want to have N Function Apps, each one, I assume, doing something different. Basically, microservice deployments which would all trigger on any change, but do a different task or fire a different business flow.
This other article covers that. Basically, you can either have each Function App use a separate lease collection (keeping the monitored collection the same), or you can share the lease collection but use a different LeaseCollectionPrefix for each Function App deployment. If the number of Function Apps that will share the lease collection is high, please check the RU usage on the lease collection, as you might need to increase it (there is a note about this in the article).
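To make the second option concrete, here is a hedged sketch of what the trigger could look like in one of the Function Apps, using the Java worker annotations; attribute names vary between binding-extension versions, and the database, collection, lease, and prefix names are placeholders:

    import com.microsoft.azure.functions.ExecutionContext;
    import com.microsoft.azure.functions.annotation.CosmosDBTrigger;
    import com.microsoft.azure.functions.annotation.FunctionName;

    public class OrdersChangeHandler {

        // Function App "A": shares the lease collection with the other Function Apps
        // but keeps its leases separate through its own prefix.
        @FunctionName("ProcessOrderChanges")
        public void run(
                @CosmosDBTrigger(
                        name = "items",
                        databaseName = "mydb",
                        collectionName = "orders",
                        leaseCollectionName = "leases",
                        leaseCollectionPrefix = "funcAppA",   // different prefix per Function App
                        createLeaseCollectionIfNotExists = true,
                        connectionStringSetting = "CosmosConnection")
                Object[] items,
                ExecutionContext context) {
            context.getLogger().info("Changes received: " + items.length);
        }
    }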

How to ensure only one Azure Function BlobTrigger runs at a time?

I have a use case to implement multiple BlobTriggers in Azure Functions (using the Linux Consumption Plan). For example in Azure Storage I would have 5 different clients with a directory structure like:
client1/file.txt
client2/file.txt
client3/file.txt
client4/file.txt
client5/file.txt
It's possible for both client1/file.txt and client2/file.txt to be dropped off at the same time in Azure Storage. To prevent race conditions and exceeding the 1.5 GB memory limit, I would like the BlobTrigger for client1/file.txt to wait for the BlobTrigger for client2/file.txt to finish or vice versa (the order doesn't matter here, just that both of them eventually execute).
Do I have to set up a queue process separately? Can I use the preview setting WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT to achieve this easily?
Edit: Would using Durable Functions be a better solution?
You should be able to do this by making sure the maximum scale-out value is set to 1; this way it will only process one file at a time. You can also change your consumption/pricing model from Consumption to an App Service plan. That way you can use the tier you want, and you can have more memory available as well (depending on the tier you choose).

Running HDInsight jobs howto

A few questions regarding the HDInsight jobs approach.
1) How do I schedule an HDInsight job? Is there any ready-made solution for it? For example, if my system will constantly collect a large number of new input files that we need to run a map/reduce job on, what is the recommended way to implement ongoing processing?
2) From a price perspective, it is recommended to remove the HDInsight cluster when there is no job running. As I understand it, there is no way to automate this process if we decide to run the job daily? Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every HDInsight job requires a new output storage folder to store reducer results in. What is the best practice for merging those results so that reporting always works on the whole data set?
OK, there are a lot of questions in there! Here are, I hope, a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
On the price front, I would recommend that if you're not using the cluster, you destroy it and bring it back again when you need it (those compute hours can really add up!). Note that this will lose anything you have in HDFS, which should be mainly intermediate results; any output or input data held in asv storage will persist in the Azure Storage account. You can certainly automate this by using the CLI tools, or the REST interface used by the CLI tools (see my answer on Hadoop on Azure Create New Cluster; the first one is out of date).
I would do this by making sure I only submitted the job once for each file, relying on Hadoop to handle the retry and reliability side and removing the need to manage any retries in your application.
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting the best bet is probably a secondary MapReduce job with the outputs as its inputs.
If you don't care about the individual intermediate jobs, you can just chain these directly into the one MapReduce job (which can contain as many map and reduce steps as you like) through job chaining; see Chaining multiple MapReduce jobs in Hadoop for a Java-based example. Sadly, the .NET API does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.
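To make the chaining idea concrete, here is a minimal Java sketch of a driver that runs two MapReduce jobs back to back, with the second job consuming the first job's output to produce one merged data set for reporting; the word-count-style mappers and reducer are placeholders for whatever your real processing does:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobs {

        // First-stage mapper: emits (token, 1) for every whitespace-separated token.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        ctx.write(new Text(token), ONE);
                    }
                }
            }
        }

        // Second-stage mapper: re-parses "key<TAB>count" lines written by the first job.
        public static class ReparseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t");
                if (parts.length == 2) {
                    ctx.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
                }
            }
        }

        // Shared reducer: sums the counts for each key.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
            Path output = new Path(args[2]);         // merged output used for reporting

            Job first = Job.getInstance(conf, "initial-processing");
            first.setJarByClass(ChainedJobs.class);
            first.setMapperClass(TokenMapper.class);
            first.setReducerClass(SumReducer.class);
            first.setOutputKeyClass(Text.class);
            first.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(first, input);
            FileOutputFormat.setOutputPath(first, intermediate);
            if (!first.waitForCompletion(true)) {
                System.exit(1);   // don't run the merge if the first job failed
            }

            Job second = Job.getInstance(conf, "merge-for-reporting");
            second.setJarByClass(ChainedJobs.class);
            second.setMapperClass(ReparseMapper.class);
            second.setReducerClass(SumReducer.class);
            second.setOutputKeyClass(Text.class);
            second.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(second, intermediate);
            FileOutputFormat.setOutputPath(second, output);
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }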

Resource manage external nodes in Jenkins for tests

My problem is that I have code that needs a rebooted node. I have many long-running Jenkins test jobs that need to be executed on rebooted nodes.
My existing solution is to define multiple "proxy" machines in Jenkins with the same label (TestLable) and 1 executor per machine. I bind all the test jobs to the label (TestLable). In the test execution script I detect the Jenkins machine (Jenkins env. NODE_NAME) and use that to know what physical machine the tests should use.
Does anybody know of a better solution?
The above works, but I need to define a high number of “nodes/machines” that may not be needed. What I would like is a plugin that would be able to grant a token to a Jenkins job. This way a job would not be executed until both a Jenkins executor and a token were free. The token should be a string so that my test jobs could use it to know which external node to use.
We have written our own scheduler that allocates stuff before starting Jenkins nodes. There may be a better solution - but this works for us mostly. I've yet to come across an off-the-shelf scheduler that can deal with complicated allocation of different hardware resources. We have n box types, allocated to n build types.
Some build types we have are not compatible with each other without destroying all persistent data, which may be required as it takes a long time to gather. Some jobs require combinations of these hardware types. We store the details in a DB, and then use business logic to determine how it is allocated. We've often found that particular job types need additional business logic or extra data fields to account for their specific requirements.
So it may be that the best way is to write your own scheduler, in your own language of choice, which takes into account your particular needs.
