Apache Beam + Databricks Notebook - map function error

I am trying to run a simple pipeline using Apache Beam in a Databricks notebook, but I am unable to use any custom functions in my transforms. Here is a simple example:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def my_func(s):
    print(s)

pipeline_options = PipelineOptions([
    "--runner=DirectRunner",
])

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(['foo', 'bar', 'baz'])
        | "print result" >> beam.Map(my_func)
    )
Produces:
RuntimeError: Unable to pickle fn CallableWrapperDoFn(<function Map.<locals>.<lambda> at 0x7fb5a17a6b80>): It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
This error also occurs if I use a lambda function. Everything works as expected if I use the built-in print function. Why am I getting this error? How can I pass custom functions to my pipeline in this environment?

A reference to SparkContext seems to be getting pulled in when the lambda function gets serialized by Beam.
Instead of using beam.Map, can you try defining a Beam DoFn and using the ParDo transform?
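For example, a minimal sketch of that approach (the DoFn name is illustrative; the rest mirrors the pipeline from the question):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Wrap the per-element logic in a DoFn instead of passing a bare function to beam.Map.
class PrintFn(beam.DoFn):
    def process(self, element):
        print(element)
        yield element

pipeline_options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(["foo", "bar", "baz"])
        | "Print result" >> beam.ParDo(PrintFn())
    )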

Related

Issues Triggering Dataflow Job from Cloud Function: ModuleNotFoundError: No module named 'functions_framework'

I am writing a simple Cloud Function that fires on a file upload to GCS and triggers a Dataflow job. For the sake of simplicity, my current pipeline simply reads the file from GCS and then writes it to another bucket. The Dataflow job works well when launched on its own, but when launched from the Cloud Function it behaves differently: the function logs the file details correctly and triggers a Dataflow job, but then Dataflow fails with a "module not found" error. So while the function executes and triggers the job properly, the Dataflow job itself does not come through. Here is the code that I have:
def hello_gcs(event, context):
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    input_file = f"gs://{event['bucket']}/{event['name']}"
    output_path = 'gs://<gcs_output_path>'
    dataflow_options = ['--project=<project_name>', '--runner=DataflowRunner',
                        '--region=<region>', '--temp_location=gs://<temp_location>']
    options = PipelineOptions(dataflow_options, save_main_session=True)

    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))

    p = beam.Pipeline(options=options)
    print_files = (p
                   | beam.io.ReadFromText(input_file)
                   | beam.io.WriteToText(output_path, file_name_suffix='.txt'))
    result = p.run()
I also have a "requirements.txt" file added in the same directory as my function for the following two dependencies:
apache-beam[gcp]==2.39.0
functions-framework==3.*
I have seen in multiple comments that making a Dataflow template bypasses this issue, but I am wondering if anyone has an idea why this error is being thrown, whether it can be circumvented by modifying the current setup, and if not, how to alternatively create a template such that this input file can be fed in as a parameter?
Thank you!
This is probably a limitation of the save_main_session approach to staging dependencies. The functions-framework is not needed for Beam or Dataflow, but is just something that is loaded into the interpreter during the execution of your Cloud Function.
I suggest disabling the save_main_session option and/or using the --requirements_file or --setup_file options to provide a specification of the dependencies your pipeline will need at runtime.
Detailed documentation for dependency management is at https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
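For illustration, a minimal sketch of the suggested change, assuming the requirements.txt is staged alongside the function code (the placeholders mirror the ones in the question):

from apache_beam.options.pipeline_options import PipelineOptions

dataflow_options = [
    '--project=<project_name>',
    '--runner=DataflowRunner',
    '--region=<region>',
    '--temp_location=gs://<temp_location>',
    '--requirements_file=requirements.txt',  # stage the pipeline's runtime dependencies explicitly
]
# save_main_session is left at its default (off), so the Cloud Function's interpreter
# state (including functions-framework) is not pickled into the job.
options = PipelineOptions(dataflow_options)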

Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()

I am new to Spark. I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK, unfortunately, there's no way to do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions.
For now I'm writing a script and testing with the local gluepyspark shell, Spark version 3.1.1-amzn-0.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

def f(p):
    pass

sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))
When I import this simple script in the gluepyspark shell, it raises an error saying "SparkContext should only be created and accessed on the driver."
However, there are some conditions under which it works.
It works if I run the script via gluesparksubmit.
It works if I use lambda expression instead of function declaration.
It works if I declare a function within REPL and pass it as argument.
It does not work if I put both the function definition (def func(): ...) and the .foreachPartition(func) call in the same script.
Moving the function declaration to another module also seems to work, but that isn't an option for me, because I need to pack everything into a single job script.
Could anyone please help me understand:
why the error is thrown
why the error is NOT thrown in other cases
Complete error log: https://justpaste.it/37tj6
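For reference, a minimal sketch of the variant the list above describes as working: the per-partition logic written as an inline lambda rather than a reference to a module-level function (the lambda body here is illustrative).

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Per the observations above, this form runs without the error when the
# script is imported in the gluepyspark shell.
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: None)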

Writing a unique parquet file per window with Apache Beam Python

I am trying to stream messages from a Kafka consumer to Google Cloud Storage in 30-second windows using Apache Beam. I used beam_nuggets.io for reading from a Kafka topic. However, I wasn't able to write a unique parquet file to GCS for each window.
You can see my code below:

import apache_beam as beam
from apache_beam.transforms.trigger import AfterAny, AfterCount, AfterProcessingTime, AfterWatermark, Repeatedly
from apache_beam.portability.api.beam_runner_api_pb2 import AccumulationMode
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio
import json
from datetime import datetime
import pandas as pd
import config as conf
import apache_beam.transforms.window as window

consumer_config = {"topic": "Uswrite",
                   "bootstrap_servers": "*.*.*.*:9092",
                   "group_id": "notification_consumer_group_33"}

folder_name = datetime.now().strftime('%Y-%m-%d')

def format_result(consume_message):
    data = json.loads(consume_message[1])
    file_name = datetime.now().strftime("%Y_%m_%d-%I_%M_%S")
    df = pd.DataFrame(data).T  # , orient='index'
    df.to_parquet(f'gs://{conf.gcs}/{folder_name}/{file_name}.parquet',
                  storage_options={"token": "gcp.json"}, engine='fastparquet')
    print(consume_message)

with beam.Pipeline(options=PipelineOptions()) as p:
    consumer_message = (p
                        | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
                        | 'Windowing' >> beam.WindowInto(window.FixedWindows(30),
                                                         trigger=AfterProcessingTime(30),
                                                         allowed_lateness=900,
                                                         accumulation_mode=AccumulationMode.ACCUMULATING)
                        | 'CombineGlobally' >> beam.Map(format_result))
    # window.FixedWindows(30), trigger=beam.transforms.trigger.AfterProcessingTime(30),
    # accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING
    # allowed_lateness=100, CombineGlobally(format_result).without_defaults(), allowed_lateness=30,
Using the code above, a new parquet file is generated for each message. What I would like to do is group messages into 30-second windows and generate one parquet file per window.
I tried the different configurations below with no success:
beam.CombineGlobally(format_result).without_defaults() instead of beam.Map(format_result)
beam.ParDo(format_result)
In addition, I have a few more questions:
Even though I set "auto.offset.reset": "earliest", the Kafka consumer starts reading from the last message even if I change the consumer group, and I can't figure out why.
Also, I am puzzled by the usage of trigger, allowed_lateness, and accumulation_mode. I am not sure if I need them for this task. As you can see in the code block above, I also tried using these parameters, but it didn't help.
I searched everywhere but couldn't find a single example that explains this use case.
Here are some changes you should make to your pipeline to get this result (a sketch combining them follows the list):
Remove your trigger if you want a single output per window. Triggers are only needed for getting multiple results per window.
Add a GroupByKey or Combine operation to aggregate the elements. Without such an operation, the windowing has no effect.
I recommend using parquetio from the Beam project itself to ensure you get scalable exactly-once behavior. (See the pydoc from 2.33.0 release)
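A minimal sketch putting these pieces together, assuming the consumer_config from the question, an illustrative single-field schema, and an assumed Parse step that turns the Kafka message value into a dict matching that schema; whether the parquet sink behaves well with an unbounded Kafka source depends on the runner, so treat this as the shape of the pipeline rather than a tested solution:

import json
import apache_beam as beam
import apache_beam.transforms.window as window
import pyarrow
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio

consumer_config = {"topic": "Uswrite",
                   "bootstrap_servers": "*.*.*.*:9092",
                   "group_id": "notification_consumer_group_33"}

# Illustrative schema; replace with the real fields of your messages.
schema = pyarrow.schema([("userId", pyarrow.string())])

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
        | "Parse value" >> beam.Map(lambda kv: json.loads(kv[1]))  # assumes the value is a JSON dict
        | "Window" >> beam.WindowInto(window.FixedWindows(30))  # no trigger: one pane per window
        | "Write parquet" >> beam.io.WriteToParquet(
            "gs://<bucket>/<folder>/output",
            schema=schema,
            file_name_suffix=".parquet")
    )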
I took a look at the GroupByKey example in the Python documentation.
The messages I read from KafkaConsumer (I used kafkaio from beam_nuggets.io) are tuples, and in order to use GroupByKey, I tried to create a list in the convert_to_list function by appending the tuples I got from the Kafka consumer. However, GroupByKey still produces no output.
import apache_beam as beam
from beam_nuggets.io import kafkaio

new_list = []

def convert_to_list(consume_message):
    new_list.append(consume_message)
    return new_list

with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'consume message added list' >> beam.ParDo(convert_to_list)
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I also tried a similar pipeline, but this time I created a list of tuples with beam.Create() instead of reading from Kafka, and it works successfully. You can view this pipeline below:
import apache_beam as beam
from beam_nuggets.io import kafkaio

with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | 'Created Pipeline' >> beam.Create([(None, '{"userId": "921", "xx": "123"}'),
                                             (None, '{"userId": "92111", "yy": "123"}')])
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I assume the issue in the first approach is related to generating an external list instead of a PCollection, but I am not sure. Can you guide me on how to proceed?
Another thing I tried is to use the ReadFromKafka function from the apache_beam.io.kafka module, but this time I got the following error:
ERROR:apache_beam.utils.subprocess_server:Starting job service with ['java', '-jar', 'user_directory'/.apache_beam/cache/jars\\beam-sdks-java-io-expansion-service-2.33.0.jar', '59627']
Java version 11.0.12 is installed on my computer and the 'java' command is available.

Writing test for pyspark udf

I have an internal Python dependency that is executed inside a Spark pandas_udf. To pass parameters, we wrap this inside another function.
The code looks like this:
def wrapper_fn(df, parameters):
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def run_pandas_code(pdf):
        """ Importing some python library and using it """
        return pandas_df
    return df.groupby(<key>).apply(run_pandas_code)
I want to write a test that executes the function wrapper_fn, but when I write such a test, I get a pickle error. Can someone recommend a good way to test PySpark UDFs?
It was possible to do this eventually. The problem was a reference to a class that PySpark wasn't able to serialise.
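For illustration, a minimal pytest-style sketch of such a test, assuming a local SparkSession; the column names and parameters are illustrative, not taken from the original code:

import pandas as pd
from pyspark.sql import SparkSession

def test_wrapper_fn():
    # Local session so the test does not need a cluster.
    spark = SparkSession.builder.master("local[1]").appName("udf-test").getOrCreate()
    try:
        input_df = spark.createDataFrame(
            pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]}))
        # wrapper_fn is the function from the question; the parameters value is a placeholder.
        result = wrapper_fn(input_df, parameters={})
        assert result.count() > 0  # replace with an assertion on the expected output
    finally:
        spark.stop()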

Logging from pandas udf

I am trying to log from a pandas UDF called within a Python transform.
Because the code is called on the executor, it does not show up in the driver's logs.
I have been looking at some options on SO, but so far the closest option is this one.
Any idea on how to surface the logs in the driver logs, or in any other log files available under the build, is welcome.
import logging

logger = logging.getLogger(__name__)

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    logger.info('calling my udf')
    do_some_stuff()

results_df = my_df.groupby("Name").apply(my_udf)
As you said, the work done by the UDF is done by the executor, not the driver, and Spark captures the logging output from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture, and store it in a column to view once the build is finished:
def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
I also explain another option, if you are more familiar with pandas, here:
How to debug pandas_udfs without having to use Spark?
Edit: I have a complete answer here: In Palantir Foundry, how do I debug pyspark (or pandas) UDFs since I can't use print statements?
It is not ideal (as it stops the code) but you can do
raise Exception(<variable_name>)
inside the pandas_udf and it gives you the value of the named variable.
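As a sketch, with names adapted from the question above (schema, do_some_stuff, and my_pdf are the question's placeholders, and do_some_stuff is assumed here to return something printable):

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    intermediate = do_some_stuff(my_pdf)
    # Raising stops the job, but the exception message surfaces the value
    # in the driver's error output.
    raise Exception(intermediate)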
