Writing a unique parquet file per window with Apache Beam Python - python-3.x

I am trying to stream messages from a Kafka consumer to Google Cloud Storage in 30-second windows using Apache Beam. I used beam_nuggets.io for reading from a Kafka topic. However, I wasn't able to write a unique parquet file to GCS per window.
You can see my code below:
import apache_beam as beam
from apache_beam.transforms.trigger import AfterAny, AfterCount, AfterProcessingTime, AfterWatermark, Repeatedly
from apache_beam.portability.api.beam_runner_api_pb2 import AccumulationMode
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio
import json
from datetime import datetime
import pandas as pd
import config as conf
import apache_beam.transforms.window as window

consumer_config = {"topic": "Uswrite",
                   "bootstrap_servers": "*.*.*.*:9092",
                   "group_id": "notification_consumer_group_33"}

folder_name = datetime.now().strftime('%Y-%m-%d')

def format_result(consume_message):
    data = json.loads(consume_message[1])
    file_name = datetime.now().strftime("%Y_%m_%d-%I_%M_%S")
    df = pd.DataFrame(data).T  # , orient='index'
    df.to_parquet(f'gs://{conf.gcs}/{folder_name}/{file_name}.parquet',
                  storage_options={"token": "gcp.json"}, engine='fastparquet')
    print(consume_message)

with beam.Pipeline(options=PipelineOptions()) as p:
    consumer_message = (p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
                          | 'Windowing' >> beam.WindowInto(window.FixedWindows(30),
                                                           trigger=AfterProcessingTime(30),
                                                           allowed_lateness=900,
                                                           accumulation_mode=AccumulationMode.ACCUMULATING)
                          | 'CombineGlobally' >> beam.Map(format_result))
    # window.FixedWindows(30), trigger=beam.transforms.trigger.AfterProcessingTime(30),
    # accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING
    # allowed_lateness=100, CombineGlobally(format_result).without_defaults() allowed_lateness=30,
Using the code above, a new parquet file is generated for each message. What I would like to do is to group messages into 30-second windows and generate one parquet file per window.
I tried the configurations below with no success:
beam.CombineGlobally(format_result).without_defaults() instead of beam.Map(format_result)
beam.ParDo(format_result)
In addition, I have a few more questions:
Even though I set "auto.offset.reset": "earliest", the Kafka consumer starts reading from the latest message even if I change the consumer group, and I can't figure out why.
Also, I am puzzled by the usage of trigger, allowed_lateness, and accumulation_mode. I am not sure if I need them for this task. As you can see in the code block above, I also tried using these parameters but it didn't help.
I searched everywhere but couldn't find a single example that explains this use case.

Here are some changes you should make to your pipeline to get this result:
Remove your trigger if you want a single output per window. Triggers are only needed for getting multiple results per window.
Add a GroupByKey or Combine operation to aggregate the elements. Without such an operation, the windowing has no effect.
I recommend using parquetio from the Beam project itself to ensure you get scalable exactly-once behavior. (See the pydoc from 2.33.0 release)
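For example, a minimal sketch of that combination (my own illustration, not tested code): parse each Kafka message, window it, and let Beam's parquetio group and write the files instead of calling pandas inside a DoFn. The pyarrow schema, field names, and output path are placeholders for whatever your messages actually contain.
import json
import apache_beam as beam
import apache_beam.transforms.window as window
import pyarrow
from apache_beam.io.parquetio import WriteToParquet
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio

# Placeholder schema; list the real fields of your JSON messages here.
schema = pyarrow.schema([('userId', pyarrow.string()), ('xx', pyarrow.string())])

def parse_message(consume_message):
    # KafkaConsume emits (key, value) tuples; the value is the JSON payload.
    return json.loads(consume_message[1])

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read from Kafka' >> kafkaio.KafkaConsume(consumer_config=consumer_config)
     | 'Parse' >> beam.Map(parse_message)
     | '30s windows' >> beam.WindowInto(window.FixedWindows(30))
     | 'Write parquet' >> WriteToParquet(
           'gs://your-bucket/output/part',  # hypothetical prefix
           schema,
           file_name_suffix='.parquet'))
The intent is for WriteToParquet to do the per-window grouping itself, so you get parquet shards per window rather than one file per message; whether the file sink behaves this way in streaming mode on your runner is something to verify.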

I took a look at the GroupByKey example in the Python documentation. Messages I read from KafkaConsumer (I used kafkaio from beam_nuggets.io) have a type of tuple, and in order to use GroupByKey, I tried to create a list in the convert_to_list function by appending the tuples I got from the Kafka consumer. However, GroupByKey still produces no output.
import apache_beam as beam
from beam_nuggets.io import kafkaio

new_list = []

def convert_to_list(consume_message):
    new_list.append(consume_message)
    return new_list

with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'consume message added list' >> beam.ParDo(convert_to_list)
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I also tried a similar pipeline, but this time I created the list of tuples with beam.Create() instead of reading from Kafka, and it works successfully. You can view this pipeline below:
import apache_beam as beam
from beam_nuggets.io import kafkaio

with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | 'Created Pipeline' >> beam.Create([(None, '{"userId": "921", "xx": "123"}'),
                                             (None, '{"userId": "92111", "yy": "123"}')])
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I assume the issue in the first approach is related to building an external Python list instead of a PCollection, but I am not sure. Can you guide me on how to proceed?
Another thing I tried was the ReadFromKafka transform from the apache_beam.io.kafka module, but this time I got the following error:
ERROR:apache_beam.utils.subprocess_server:Starting job service with ['java', '-jar', 'user_directory'/.apache_beam/cache/jars\\beam-sdks-java-io-expansion-service-2.33.0.jar', '59627']
Java version 11.0.12 is installed on my computer and the 'java' command is available.

Related

Apache Beam + Databricks Notebook - map function error

I am trying to run a simple pipeline using Apache Beam on DataBricks Notebooks, but I am unable to create any custom functions. Here is a simple example:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def my_func(s):
    print(s)

pipeline_options = PipelineOptions([
    "--runner=DirectRunner",
])

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(['foo', 'bar', 'baz'])
        | "print result" >> beam.Map(my_func)
    )
Produces:
RuntimeError: Unable to pickle fn CallableWrapperDoFn(<function Map.<locals>.<lambda> at 0x7fb5a17a6b80>): It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
This error also occurs if I use a lambda function. Everything works as expected if I use the builtin print function. Why am I getting this error? How can I pass custom functions to my pipeline in this environment?
A reference to SparkContext seems to be getting pulled in when the function gets serialized by Beam.
Instead of using beam.Map, can you try defining a new Beam DoFn and using the ParDo transform?
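For example, something along these lines (my sketch of that suggestion, untested on Databricks; it mirrors the toy pipeline from the question):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class PrintFn(beam.DoFn):
    # The suggestion above: wrap the logic in a DoFn and apply it with ParDo
    # instead of passing a plain function (or lambda) to beam.Map.
    def process(self, element):
        print(element)
        yield element

pipeline_options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(['foo', 'bar', 'baz'])
        | "print result" >> beam.ParDo(PrintFn())
    )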

Issues Triggering Dataflow Job from Cloud Function: ModuleNotFoundError: No module named 'functions_framework'

I am building a simple Cloud Function that fires on a file upload into GCS and triggers a Dataflow job. For the sake of simplicity, my current pipeline simply reads the file from GCS and then writes it to another bucket. While this Dataflow job works well without the Cloud Function, the Cloud Function does something else: it logs the file details correctly and triggers a Dataflow job, but then Dataflow fails with a "module not found" error. Hence, while the function executes and triggers the job properly, the Dataflow job does not come through. Here is the code that I have:
def hello_gcs(event, context):
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    input_file = f"gs://{event['bucket']}/{event['name']}"
    output_path = 'gs://<gcs_output_path>'
    dataflow_options = ['--project=<project_name>', '--runner=DataflowRunner',
                        '--region=<region>', '--temp_location=gs://<temp_location>']
    options = PipelineOptions(dataflow_options, save_main_session=True)

    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))

    p = beam.Pipeline(options=options)
    print_files = (p | beam.io.ReadFromText(input_file)
                     | beam.io.WriteToText(output_path, file_name_suffix='.txt'))
    result = p.run()
I also have a "requirements.txt" file added in the same directory as my function for the following two dependencies:
apache-beam[gcp]==2.39.0
functions-framework==3.*
I have seen in multiple comments that making a Dataflow template bypasses this issue, but I am wondering if anyone may have an idea why this error is being thrown, if it can be circumvented through modification of the current setup, and if not, how to alternately create a template such that this input file can be fed as a parameter?
Thank you!
This is probably a limitation of the save_main_session approach to staging dependencies. The functions-framework is not needed for Beam or Dataflow, but is just something that is loaded into the interpreter during the execution of your Cloud Function.
I suggest disabling the save_main_session option and/or using the --requirements_file or --setup_file options to provide a specification of the dependencies your pipeline will need at runtime.
Detailed documentation for dependency management is at https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
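For instance (a sketch of that suggestion, not tested in Cloud Functions; the requirements file name is hypothetical):
from apache_beam.options.pipeline_options import PipelineOptions

dataflow_options = [
    '--project=<project_name>',
    '--runner=DataflowRunner',
    '--region=<region>',
    '--temp_location=gs://<temp_location>',
    # Only the packages your transforms import at runtime go in here;
    # functions-framework should not be listed.
    '--requirements_file=pipeline_requirements.txt',
]
# save_main_session is left at its default of False.
options = PipelineOptions(dataflow_options)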

Logging from pandas udf

I am trying to log from a pandas udf called within a python transform.
Because the code is being called on the executor, it does not show up in the driver's logs.
I have been looking at some options on SO, but so far the closest option is this one.
Any idea on how to surface the logs in the driver logs or any other log files available under build is welcome.
import logging
logger = logging.getLogger(__name__)

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    logger.info('calling my udf')
    do_some_stuff()

results_df = my_df.groupby("Name").apply(my_udf)
As you said, the work done by the UDF is done by the executor, not the driver, and Spark captures the logging output from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture and store it in a column to view once the build is finished:
def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
I also explain another option if you are more familiar with pandas here:
How to debug pandas_udfs without having to use Spark?
Edit: I have a complete answer here: In Palantir Foundry, how do I debug pyspark (or pandas) UDFs since I can't use print statements?
It is not ideal (as it stops the code) but you can do
raise Exception(<variable_name>)
inside the pandas_udf and it gives you the value of the named variable.
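For instance, a minimal sketch of that trick (the schema and column names are invented; the job fails on purpose):
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Hypothetical grouped-map UDF; "Name" and "value" are illustrative columns.
@pandas_udf("Name string, value double", PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    intermediate = my_pdf["value"].sum()
    # Deliberately fail so the value surfaces in the driver's error output.
    raise Exception(f"intermediate sum was {intermediate}")

results_df = my_df.groupby("Name").apply(my_udf)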

Python Streaming Dataflow "WriteToPubSub" behaviour

I am trying out a streaming Dataflow pipeline that reads from Pub/Sub and writes to another Pub/Sub topic. I am using Python version 3.7.3. The pipeline looks something like this,
lines = (pipe | "Read from PubSub" >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
              | "Transformation" >> beam.ParDo(PubSubToDict())
              | "Write to PubSub" >> beam.io.WriteToPubSub(topic=OUTPUT, with_attributes=False)
         )
The "Transformation" step is something where I need to so some custom transformation. I am ensuring that the output of this transform is bytes. Something like this,
class PubSubToDict(beam.DoFn):
    def process(self, element):
        """pubsub input is a byte string"""
        data = element.decode('utf-8')
        """do some custom transform here"""
        data = data.encode('utf-8')
        return data
Now when I publish a test message, I get an error like this,
ERROR: Data being published to Pub/Sub must be sent as a bytestring. [while running 'Write to PubSub']
I managed to solve this by returning an array instead like this,
return [data]
But I don't know the reason why this worked. So I was looking for an explanation to this.
Regards,
Prasad
It worked because ParDo lets a pipeline step return multiple output elements for a single input element, so it expects an iterable to be returned.
You could also do yield data instead of returning a list.
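In other words, either of these variants of the question's DoFn should work (my illustration, with the custom transform elided):
import apache_beam as beam

class PubSubToDict(beam.DoFn):
    def process(self, element):
        data = element.decode('utf-8')
        # ... custom transform here ...
        data = data.encode('utf-8')
        # Return an iterable of output elements:
        return [data]
        # ...or, equivalently, yield each element:
        # yield data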

How to load large multi file parquet files for tensorflow/pytorch

I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.
The files are too large to be loaded through the pyarrow.parquet functions
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()
This gives an out-of-memory error.
I have also tried using petastorm, but that doesn't work for make_reader() because it isn't of the petastorm type.
with make_batch_reader('dir') as reader:
    dataset = make_petastorm_dataset(reader)
When I used make_batch_reader() and then make_petastorm_dataset(reader), it again gave a 'zip not iterable' error or something along those lines.
I am not sure how to load the file into Python for ML training.
Some quick help would be greatly appreciated.
Thanks
Zash
For pyarrow, you can list the directory with Python, iterate over *.parquet files, open each one as pq.ParquetFile, and read it one row group at a time. This will alleviate the memory pressure, but won't be super fast without parallelization.
For petastorm, you are right to use make_batch_reader(). Indeed, the error messages are not always helpful; but you can inspect the stack trace and investigate where in petastorm code it originates from.
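A rough sketch of the pyarrow route described above (the directory path is a placeholder):
import glob
import pyarrow.parquet as pq

for path in glob.glob('dir/*.parquet'):
    pf = pq.ParquetFile(path)
    # Reading one row group at a time keeps memory bounded.
    for i in range(pf.num_row_groups):
        batch_df = pf.read_row_group(i).to_pandas()
        # Feed batch_df (or its numpy arrays) to your training loop here.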
You can load the entire dataset using dask with the code below.
You can also load only chunks of data whenever needed by computing only those rows using the index (assuming you have a meaningful index).
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob
@delayed
def load_chunk(pth):
    x = ParquetFile(pth).to_pandas()
    x = x.drop('[unwanted_columns_to_save_space]', axis=1)
    return x
files = glob.glob('./your_path/*.parquet')
ddf = dd.from_delayed([load_chunk(f) for f in files])
df = ddf.compute()
