U-SQL - Execution-related queries - Azure

I wrote multiple U-SQL scripts and their output is stored in ADLA. Based on this, I have a few questions.
How can we run dependent jobs in U-SQL?
How do we execute a statement based on some condition, like:
If RecordCount > 0 then
insert into table1
endif
How can we schedule U-SQL jobs?
Can we write multiple scripts and call them from a main script?
During script execution, the compiler prepares and compiles the code, which takes almost 30-40 seconds. How can we bundle the compiled code and create the ADF pipeline?

You can schedule and orchestrate U-SQL jobs with Azure Data Factory or by writing your own scheduler with one of the SDKs (PowerShell, C#, Java, Node.js, Python).
U-SQL supports two ways for conditional execution:
If your conditional can be evaluated at compile time, e.g., when you pass a parameter value or check for the existence of a file, you can use the IF statement.
If your conditional can only be determined during the execution of the script, then you can use the WHERE clause as wBob outlines in his comment.
As wBob mentions, you can encapsulate most of the U-SQL statements in procedures and then call them from other scripts/procedures, or you can write your own way of inclusion/orchestration if you need script file reuse.
There is currently no ability to reuse and submit just the compiled code, since compilation depends on exact information such as which files are present and the statistics of the accessed data.
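For the scheduling/SDK route, here is a minimal sketch of submitting a U-SQL script from Python. The account name, credentials, and script path are placeholders, and the class and method names are taken from the azure-mgmt-datalake-analytics package as I recall them, so verify them against the current SDK documentation.

# Minimal sketch: submit a U-SQL job via the ADLA Python SDK.
# All values in angle brackets are placeholders.
import uuid

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

credentials = ServicePrincipalCredentials(
    client_id='<app-id>', secret='<app-secret>', tenant='<tenant-id>')
job_client = DataLakeAnalyticsJobManagementClient(
    credentials, 'azuredatalakeanalytics.net')

with open('myscript.usql') as f:
    script = f.read()

job_info = JobInformation(
    name='my-scheduled-usql-job',
    type='USql',
    degree_of_parallelism=1,
    properties=USqlJobProperties(script=script))

# A scheduler (cron, Azure Functions, ADF, ...) would call this on a timer;
# dependent jobs can be chained by polling the job state before submitting the next one.
job_client.job.create('<your-adla-account>', str(uuid.uuid4()), job_info)

The same chain-and-poll approach (submit, wait for the job to reach a terminal state, then submit the next script) is how dependent jobs are usually handled outside of Data Factory.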

Related

How can a failed Databricks job continue where it left off?

I have a Databricks job that runs many commands and at the end tries to save the results to a folder. However, it failed because it tried to write a file to a folder that did not exist.
I simply created the folder.
However, how can I make it continue where it left off without executing all the previous commands?
I assume that by Databricks job you refer to the way to run non-interactive code in a Databricks cluster.
I do not think that what you ask is possible, namely getting the output of a certain Spark task from a previous job run on Databricks. As pointed out in the other answer, "if job is finished, then all processed data is gone". This has to do with the way Spark works under the hood. If you are curious about this topic, I suggest you start reading this post about Transformations and Actions in Spark.
There are a few workarounds you can think of, though. For instance, if you are interested in certain intermediate outputs of your job, you could decide to temporarily write your DataFrame/Dataset to some external location. In this way you can easily resume the job from your preferred point by reading one of your checkpoints as input. This approach is a bit clunky and I do not recommend it, but it's a quick and dirty solution you might want to choose if you are in the testing/designing phase.
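As a rough illustration of that checkpoint workaround, a PySpark sketch could look like the following; the storage paths and table name are placeholders, not something from the original question.

# Sketch of the manual-checkpoint workaround: persist an expensive intermediate
# result to external storage so a later run can resume from it instead of recomputing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
checkpoint_path = "dbfs:/tmp/my_job/intermediate_result"  # placeholder location

# First run: write the expensive intermediate result before the fragile final step.
expensive_df = spark.read.table("source_table").groupBy("key").count()
expensive_df.write.mode("overwrite").parquet(checkpoint_path)

# Re-run after fixing the problem: read the checkpoint and skip the expensive part.
resumed_df = spark.read.parquet(checkpoint_path)
resumed_df.write.mode("overwrite").parquet("dbfs:/mnt/output/results")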
A more robust solution would involve splitting your job in multiple sub-jobs and setting upstream & downstream dependencies among them. You can do that using Databricks natively (Task dependencies section) or an external scheduler that integrates with Databricks, like Airflow.
In this way you can split your tasks and gain finer-grained control over your application. So, in case of another failure on the writing step, you will be able to easily re-run only the writing.
If the job is finished, then all processed data is gone, unless you write some intermediate state (additional tables, etc.) from which you can continue processing. In most cases, Spark actually executes the code only when it writes out the results of the provided transformations.
So right now you just need to rerun the job.

Cassandra: Executing multiple DDL statements in a .cql file

I would like to execute over a hundred user-defined-type statements. These statements are encapsulated in a .cql file.
While executing the .cql file each time for new cases, I find that many of the statements within it get skipped.
Therefore, I would like to know if there are any performance issues with executing hundreds of statements composed in a .cql file.
Note: I am executing the .cql files from a Python script via the os.system method.
The time taken to execute hundreds of DDL statements via code (or a .cql file/cqlsh) is proportional to the number of nodes in the cluster. In a distributed system like Cassandra, all nodes have to agree on the schema change, and the more nodes there are, the longer schema agreement takes.
There is essentially a timeout value, maxSchemaAgreementWaitSeconds, which determines how long the coordinator node will wait before replying to the client. The typical case for schema deployment is one or two tables, and the default value for this parameter works just fine.
In the special case of many DDL statements executed at once via code/cqlsh, it is better to increase the value of maxSchemaAgreementWaitSeconds, say to 20 seconds. The schema deployment will take a little longer, but it will make sure the deployment succeeds.
Java reference
Python reference
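If you switch from os.system/cqlsh to the DataStax Python driver, the equivalent knob is the max_schema_agreement_wait argument on Cluster. A minimal sketch, assuming a single local contact point and a schema.cql file (both placeholders):

# Sketch: run the DDL statements from a .cql file through the Python driver
# with a larger schema agreement wait, instead of shelling out to cqlsh.
from cassandra.cluster import Cluster

cluster = Cluster(
    contact_points=['127.0.0.1'],
    max_schema_agreement_wait=20)  # wait up to 20 s for schema agreement
session = cluster.connect()

with open('schema.cql') as f:
    # Naive split on ';' -- assumes no semicolons inside the statements themselves.
    statements = [s.strip() for s in f.read().split(';') if s.strip()]

for stmt in statements:
    session.execute(stmt)  # the driver checks schema agreement after each DDL

cluster.shutdown()

Running the statements through the driver also surfaces failures explicitly, instead of silently skipping statements as can happen with a fire-and-forget os.system call.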

Is it possible to run a table-to-table mapping scenario in parallel (multi-threading)?

Is it possible to run a table-to-table mapping scenario in parallel (multi-threading)?
We have a huge table, and we have already created a table mapping and a scenario on the mapping.
We are also executing it from a load plan.
But is there a way I can run the scenario in multiple threads to make the data transfer faster?
I am using Groovy to script all these tasks.
It would be better if I could find some way to script it in Groovy.
A load plan with parallel steps or a package with scenarios in asynchronous mode will do for the parallelism part.
An issue you might run into, depending on which KMs are used, is that the same name will be used by the temporary tables in all mappings. To avoid that, select the "Use Unique Temporary Object Names" checkbox that appears in the Physical tab of your mapping. It will generate a different name for these objects for each execution.
It is possible on the ODI side, but you may need some modifications to the mapping so that it does not load any duplicate data. We have a similar flow where we use a modulo function on a numeric key to split the source data into partitions. Then this data gets loaded into the target.
To run this interface in a multi-threaded way, we have a package with a loop that asynchronously executes the scenario of this mapping with a MODULO_VALUE variable.
For loading data we are using the Oracle SQL*Loader utility, which is able to load data into one target table in parallel. I am not sure whether the Data Pump utility also has this ability. But I know that if you try to load data with plain SQL in a multithreaded approach, you will get an "ORA-00054: resource busy and acquire with NOWAIT specified" error.
As you can see, there is no Groovy code involved in this flow; it is all handled by ODI mappings, packages and KMs. I hope this helps.
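The modulo-split idea itself is generic. Purely as an illustration (not ODI or Groovy), a Python sketch of extracting the source in MOD(key, N) slices might look like this; the table and column names and the fetch_rows helper are hypothetical:

# Generic sketch of the modulo split: N workers each extract the slice of the
# source table where MOD(numeric_key, N) = i into its own flat file, which a
# parallel-capable loader (e.g. SQL*Loader) can then load into the target.
import csv
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # plays the role of the MODULO_VALUE variable

def extract_partition(modulo_value):
    query = (
        "SELECT * FROM source_table "
        f"WHERE MOD(numeric_key, {NUM_PARTITIONS}) = {modulo_value}"
    )
    rows = fetch_rows(query)  # hypothetical helper that runs the query against the source DB
    with open(f"partition_{modulo_value}.csv", "w", newline="") as out:
        csv.writer(out).writerows(rows)

with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    list(pool.map(extract_partition, range(NUM_PARTITIONS)))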

Is there any way to pass a U-SQL script a parameter from a C# program?

I'm using U-SQL with a table in Azure Data Lake Analytics. Is there any way to pass a list of partition keys generated in a C# program to the U-SQL script and then have the script return all the elements in those partitions?
Do you want to run the C# code on your dev box and pass values to a U-SQL script, or run C# code inside your U-SQL script? Your description is not clear. Based on your question title, I will answer the first question.
Passing values as parameters from a C# program: The ADLA SDK (unlike Azure Data Factory) does not yet provide a parameter model for U-SQL scripts (please file a request at http://aka.ms/adlfeedback; although I know it is on our backlog already, having external customer demand helps with prioritization).
However, it is fairly easy to add your parameter values by prepending DECLARE statements like the following at the beginning of the script and having the script refer to them as variables.
DECLARE @param SqlArray<int> = new SqlArray<int>(1, 2, 3, 4); // 1, 2, 3, 4 were calculated in your C# code (I assume you have int partition keys).
Then you should be able to use the array in a predicate (e.g., @param.Contains(partition_col)). That will not (yet; we have a work item for it) trigger partition elimination though.
If you want partition elimination, you will have to have a fixed set of parameter values and use them in an IN clause. E.g., if you want to check up to 3 months, you would write the query predicate as:
WHERE partition_col IN (@p1, @p2, @p3);
And you prepend definitions for @p1, @p2 and @p3, possibly duplicating values for the parameters you do not need.
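The prepend pattern itself is just string concatenation; here it is sketched in Python for brevity (the same approach works from C#), with the key list and script file name as placeholders:

# Sketch of the "prepend DECLARE statements" pattern: turn the computed partition
# keys into a U-SQL DECLARE line and glue it onto the front of the script.
partition_keys = [1, 2, 3, 4]  # computed by your program

declare_line = (
    "DECLARE @param SqlArray<int> = new SqlArray<int>("
    + ", ".join(str(k) for k in partition_keys)
    + ");\n"
)

with open("query.usql") as f:
    script = declare_line + f.read()  # the script refers to @param as a variable

# 'script' is then submitted as the job body via the ADLA SDK of your choice.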

How do we get the current date in a PS file name qualifier using JCL?

How do we get the current date in a PS file name qualifier using JCL?
Example output file name: Z000417.BCV.TEST.D120713 (YYMMDD format).
This can be done, but not necessarily in a straightforward manner. The straightforward manner would be to use a system symbol in your JCL. Unfortunately this only works for batch jobs if it has been enabled for the job class on more recent versions of z/OS.
Prior to z/OS v2, IBM's stated reason this didn't work is that your job could be submitted on a machine in London, the JCL could be interpreted on a machine in Sydney, and the job could actually execute on a machine in Chicago. Which date (or time) should be on the dataset? There is no one correct answer, and so we all created our own solutions to the problem that incorporates the answer we believe to be correct for our organization.
If you are able to use system symbols in your batch job JCL, there is a list of valid symbols available to you.
One way to accomplish your goal is to use a job scheduling tool. I am familiar with Control-M, which uses what are called "auto-edit variables." These are special constructs that the product provides. The Control-M solution would be to code your dataset name as
Z000417.BCV.TEST.D%%ODATE.
Some shops implement a scheduled job that creates a member in a shared PDS. The member consists of a list of standard JCL SET statements...
// SET YYMMDD=120713
// SET CCYYMMDD=20120713
// SET MMDDYY=071312
...and so on. This member is created once a day, at midnight, by a job scheduled for that purpose. The job executes a program written in that shop to create these SET statements.
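Purely as an illustration of what such a shop-written program does (the real one would run on z/OS and write the member to the shared PDS rather than to a local file), a small Python sketch:

# Sketch of the midnight job that rebuilds the shared member of SET statements.
# Writing to a local file stands in for writing the PDS member.
from datetime import date

today = date.today()
lines = [
    f"// SET YYMMDD={today:%y%m%d}",
    f"// SET CCYYMMDD={today:%Y%m%d}",
    f"// SET MMDDYY={today:%m%d%y}",
]

with open("DATESETS.txt", "w") as member:
    member.write("\n".join(lines) + "\n")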
Another option is to use ISPF file tailoring in batch to accomplish your goal. This works because the date is set in the JCL before the job is submitted. While this will work, I don't recommend it unless you're already familiar with file tailoring and executing ISPF in batch in your shop. I think it's overly complicated for something that is simple to accomplish in the other ways outlined in this reply.
You could use a GDG instead of a dataset with a date in its name. If what you're looking for is a unique name, that's what GDGs accomplish (among other things).
The last idea that comes to my mind is to create your dataset with a name not containing the date, then use a Unix System Services script to construct an ALTER command (specifying the NEWNAME parameter) for IDCAMS, then execute IDCAMS to rename your dataset.
If you are loading the jobs using the JOBTRAC/CONTROL-M schedulers, getting the date in the required format is fairly easy. The format could be 'OSYMD', which will be replaced by the scheduler on the fly before it loads the job. It has many formats to satisfy this need.
You can also make use of a JCL utility whose name I don't remember exactly. It takes the file name from a SYSIN dataset and uses it as the DSN name of the output. The SYSIN dataset can be created in a previous step using simple DFSORT &DATE commands. Let me know if you need the syntax; I prefer to google it and try it hands-on.
