The question is simple:
master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark?
The image below explains what I am trying to do; it errors for some reason. Am I missing something here?
Just for others, in case they are after how it worked:
from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
# Run both notebooks concurrently, passing each its own path as a parameter
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)
Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name.
You either need to change the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...),
or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...).
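A minimal sketch of the second variant, keeping the longer paths in the list (it mirrors the working snippet above; the /Test/Threading/ layout is taken from the question):

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)
notebooks = ['Threading/dim_1', 'Threading/dim_2']

# Prepend the workspace folder so the full notebook path is passed to run()
pool.map(
    lambda path: dbutils.notebook.run(
        '/Test/' + path,
        timeout_seconds=60,
        arguments={"input-data": path},
    ),
    notebooks,
)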
Databricks now has Workflows (multitask jobs). Your master_dim can call the other jobs/tasks to execute in parallel after it finishes, passing task-value parameters to dim_1, dim_2, etc.
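As a rough illustration of that task-value hand-off (the task key, value key, and values below are made up; this assumes all three notebooks run as tasks of one multitask job):

# In the upstream task (e.g. master_dim): publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="run_date", value="2024-01-01")

# In a downstream task (e.g. dim_1): read it back; debugValue is what you get
# when the notebook is run interactively outside of a job.
run_date = dbutils.jobs.taskValues.get(
    taskKey="master_dim",
    key="run_date",
    default="1970-01-01",
    debugValue="1970-01-01",
)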
Related
I want the outputs/functions/definitions of a notebook to be available to other notebooks on the same cluster without always having to run the original one over and over...
For instance, I want to avoid:
definitions_file: has multiple commands, functions etc...
notebook_1
#invoking definitions file
%run ../../0_utilities/definitions_file
notebook_2
#invoking definitions file
%run ../../0_utilities/definitions_file
.....
Therefore I want definitions_file to be available to all other notebooks running on the same cluster.
I am using Azure Databricks.
Thank you!
No, there is no such thing as a "shared notebook" that is implicitly imported. The closest thing you can do is to package your code as a Python library or put it into a Python file inside Repos, but you will still need to write from my_cool_package import * in all notebooks.
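A rough sketch of the "Python file inside Repos" variant, with hypothetical paths (the 0_utilities folder name comes from the question; the Workspace/Repos location is an assumption):

import sys

# Assumption: the repo is checked out under /Workspace/Repos/<user>/<repo> and the
# shared code lives in 0_utilities/definitions_file.py inside it.
sys.path.append("/Workspace/Repos/<user>/<repo>/0_utilities")

from definitions_file import *  # still one import line per notebook, as noted above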
I am doing a simple Cloud Function based on a file upload into GCS, which would trigger a Dataflow job. For the sake of simplicity, my current pipeline simply reads the file from GCS and then writes it to another bucket. While this Dataflow job works well when launched without the Cloud Function, going through the Cloud Function behaves differently: it logs the file details correctly and triggers a Dataflow job, but then Dataflow fails with a "module not found" error. Hence, while the function executes and triggers the job properly, the Dataflow job does not come through. Here is the code that I have:
def hello_gcs(event, context):
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    input_file = f"gs://{event['bucket']}/{event['name']}"
    output_path = 'gs://<gcs_output_path>'
    dataflow_options = ['--project=<project_name>', '--runner=DataflowRunner', '--region=<region>', '--temp_location=gs://<temp_location>']
    options = PipelineOptions(dataflow_options, save_main_session=True)

    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))

    p = beam.Pipeline(options=options)
    print_files = (p | beam.io.ReadFromText(input_file)
                     | beam.io.WriteToText(output_path, file_name_suffix='.txt'))
    result = p.run()
I also have a "requirements.txt" file added in the same directory as my function for the following two dependencies:
apache-beam[gcp]==2.39.0
functions-framework==3.*
I have seen in multiple comments that making a Dataflow template bypasses this issue, but I am wondering whether anyone has an idea why this error is being thrown, whether it can be circumvented by modifying the current setup, and if not, how to alternatively create a template such that this input file can be fed in as a parameter?
Thank you!
This is probably a limitation of the save_main_session approach to staging dependencies. The functions-framework is not needed for Beam or Dataflow, but is just something that is loaded into the interpreter during the execution of your Cloud Function.
I suggest disabling the save_main_session option and/or using the --requirements_file or --setup_file options to provide a specification of the dependencies your pipeline will need at runtime.
Detailed documentation for dependency management is at https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
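Applied to the snippet in the question, that change could look roughly like this (the placeholders are kept as-is; the requirements file referenced here would list only what the pipeline itself needs at runtime, i.e. not functions-framework):

from apache_beam.options.pipeline_options import PipelineOptions

dataflow_options = [
    '--project=<project_name>',
    '--runner=DataflowRunner',
    '--region=<region>',
    '--temp_location=gs://<temp_location>',
    # Stage the pipeline's dependencies explicitly instead of pickling __main__:
    '--requirements_file=requirements.txt',
]

# save_main_session is left at its default (False) here.
options = PipelineOptions(dataflow_options)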
Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK, unfortunately, there's no way I can do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions.
For now I'm writing a script and testing with local gluepyspark shell, Spark version 3.1.1-amzn-0.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

def f(p):
    pass

sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))
When I try to import this simple code in gluepyspark shell, it raises errors saying "SparkContext should only be created and accessed on the driver."
However, there are some conditions under which it works:
It works if I run the script via gluesparksubmit.
It works if I use a lambda expression instead of a function declaration.
It works if I declare the function within the REPL and pass it as an argument.
It does not work if I put both the def func(): ... declaration and the .foreachPartition(func) call in the same script (both variants are sketched below).
Moving the function declaration to another module also seems to work, but that isn't an option for me, since I need to pack everything into one job script.
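For clarity, the two variants described above look like this (same script, run in the gluepyspark shell; sc comes from the snippet earlier):

# Reportedly fails when the script is imported: a module-level named function
def f(p):
    pass

sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(f)

# Reportedly works: an inline lambda instead of the module-level function
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: None)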
Could anyone please help me understand:
why the error is thrown
why the error is NOT thrown in other cases
Complete error log: https://justpaste.it/37tj6
I'm using a Jupyter Notebook on a Spark EMR cluster and want to learn more about a certain command, but I don't know what the right technology stack is to search. Is that Spark? Python? Jupyter special syntax? PySpark?
When I try to google it, I get only a couple of results, and none of them actually include the content I quoted. It's like it ignores the %%.
What does "%%spark_sql" do, what does it originate from, and what are arguments you can pass to it like -s and -n?
An example might look like
%%spark_sql -s true
select
*
from df
These are called magic commands/functions. Try running %pinfo %%spark_sql or %pinfo2 %%spark_sql in a Jupyter cell and see if it gives you detailed information about %%spark_sql.
I am able to create a UDF and register it with Spark using the spark.udf method. However, this is per session only.
How can I register Python UDF functions automatically when the cluster starts? These functions should be available to all users. An example use case is converting time from UTC to the local time zone.
This is not possible; this is not like UDFs in Hive.
Code the UDF as part of the package/program you submit, or in the JAR included in the Spark app, if using spark-submit.
However,
spark.udf.register("...
is still required to be done as well. This applies to Databricks notebooks, etc. The UDFs need to be re-registered per Spark context/session.
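For example, the per-session registration might look roughly like this (the function name, return type, and target time zone are assumptions for illustration; spark is the notebook's session):

from datetime import timezone
from zoneinfo import ZoneInfo  # Python 3.9+
from pyspark.sql.types import TimestampType

def utc_to_local(ts):
    # Sketch: treat the incoming timestamp as UTC and return local wall-clock time.
    if ts is None:
        return None
    return (
        ts.replace(tzinfo=timezone.utc)
        .astimezone(ZoneInfo("America/New_York"))
        .replace(tzinfo=None)
    )

# Has to be repeated in every Spark session/notebook that wants the UDF in SQL:
spark.udf.register("utc_to_local", utc_to_local, TimestampType())
# spark.sql("SELECT utc_to_local(event_time_utc) FROM events")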
Actually, you can create a permanent function, but not from a notebook;
you need to create it from a JAR file:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html
CREATE [TEMPORARY] FUNCTION [db_name.]function_name AS class_name
[USING resource, ...]
resource:
: (JAR|FILE|ARCHIVE) file_uri
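Filled in, that syntax could look roughly as follows, run through spark.sql; the class name and JAR location are hypothetical, and the class would have to implement Hive's UDF interface:

# Hypothetical class name and JAR path.
spark.sql("""
    CREATE FUNCTION default.utc_to_local
    AS 'com.example.udf.UtcToLocal'
    USING JAR 'dbfs:/FileStore/jars/udfs.jar'
""")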