I'm trying to read a simple BigQuery table.
This hangs on:
WARNING:root:Dataset thijs-dev:temp_dataset_b234824381e04e1324234237724b485f95c does not exist so we will create it as temporary with location=EU
For this I use the following command:
python main.py \
--runner DirectRunner \
--project thijs-dev \
--temp_location gs://thijs/tmp/ \
--job_name thijs-dev-load \
--save_main_session
And the complete Python script:
import apache_beam as beam
import logging
import argparse


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(argv=pipeline_args) as p:
        """ Read all data from source_table """
        source_data = (p | beam.io.Read(beam.io.BigQuerySource(query="select * from `thijs-dev.metathijs.thijs_locations`", use_standard_sql=True)))


if __name__ == '__main__':
    print("Start")
    logging.getLogger().setLevel(logging.INFO)
    run()
It turns out Dataflow is just extremely slow: it takes half an hour to process 26 MB of data, but it is working after all.
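As a side note, the temporary dataset in the warning is expected: when reading from a query, Beam first stages the query results in a temporary dataset before exporting them. On newer Beam versions (2.25+) the same read can be expressed with beam.io.ReadFromBigQuery; a rough sketch, where the gcs_location is assumed to reuse the --temp_location bucket from the command above:

import apache_beam as beam

with beam.Pipeline(argv=pipeline_args) as p:
    source_data = (
        p
        | beam.io.ReadFromBigQuery(
            query="select * from `thijs-dev.metathijs.thijs_locations`",
            use_standard_sql=True,
            gcs_location="gs://thijs/tmp/"))  # assumed: reuse of the temp bucket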
Related
I tried to run the simple Beam pipeline below on a Spark cluster (GCP Dataproc):
import argparse

from apache_beam import (
    CombinePerKey,
    DoFn,
    FlatMap,
    GroupByKey,
    ParDo,
    Pipeline,
    PTransform,
    WindowInto,
    WithKeys,
    io,
)
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToGCS(DoFn):
    def __init__(self, output_path):
        self.output_path = output_path

    def process(self, key_value, window=DoFn.WindowParam):
        """Write messages in a batch to Google Cloud Storage."""
        ts_format = "%H:%M"
        window_start = window.start.to_utc_datetime().strftime(ts_format)
        window_end = window.end.to_utc_datetime().strftime(ts_format)
        shard_id, batch = key_value
        filename = "-".join([self.output_path, window_start, window_end, str(shard_id)])

        with io.gcsio.GcsIO().open(filename=filename, mode="w") as f:
            for message_body, publish_time in batch:
                print(">>>>>>>>>>", message_body, publish_time, "===========")
                f.write(f"{message_body},{publish_time}\n".encode("utf-8"))


def run(input_topic, output_path, window_size=1.0, num_shards=5, pipeline_args=None):
    # Set `save_main_session` to True so DoFns can access globally imported modules.
    pipeline_options = PipelineOptions(pipeline_args, streaming=True, save_main_session=True)

    with Pipeline(options=pipeline_options) as p:
        # Read from PubSub into a PCollection.
        lines = p | io.ReadStringsFromPubSub(topic=input_topic)
        # lines | "Write to GCS" >> ParDo(WriteToGCS(output_path))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_topic",
        help="The Cloud Pub/Sub topic to read from. "
        '"projects/<PROJECT_ID>/topics/<TOPIC_ID>".',
    )
    parser.add_argument(
        "--window_size",
        type=float,
        default=1.0,
        help="Output file's window size in minutes.",
    )
    parser.add_argument(
        "--output_path",
        help="Path of the output GCS file including the prefix.",
    )
    parser.add_argument(
        "--num_shards",
        type=int,
        default=5,
        help="Number of shards to use when writing windowed elements to GCS.",
    )
    known_args, pipeline_args = parser.parse_known_args()

    run(
        known_args.input_topic,
        known_args.output_path,
        known_args.window_size,
        known_args.num_shards,
        pipeline_args,
    )
Locally with the DirectRunner it works fine: I can see that the messages are consumed from Pub/Sub and written to GCS without any issue.
But when I followed the instruction Running on Dataproc cluster (YARN backed) and submitted the job to the remote Spark cluster, I kept getting the error below:
[2022-05-07 11:56:45.801]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
orImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
Caused by: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_1, PCollection=unique_name: "42_ReadStringsFromPubSub/ReadFromPubSub/Read.None"
coder_id: "ref_Coder_BytesCoder_1"
is_bounded: UNBOUNDED
windowing_strategy_id: "ref_Windowing_Windowing_1"
}] were consumed but never produced
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
at org.apache.beam.runners.core.construction.graph.QueryablePipeline.buildNetwork(QueryablePipeline.java:234)
at org.apache.beam.runners.core.construction.graph.QueryablePipeline.<init>(QueryablePipeline.java:127)
at org.apache.beam.runners.core.construction.graph.QueryablePipeline.forPrimitivesIn(QueryablePipeline.java:90)
at org.apache.beam.runners.core.construction.graph.GreedyPipelineFuser.<init>(GreedyPipelineFuser.java:70)
at org.apache.beam.runners.core.construction.graph.GreedyPipelineFuser.fuse(GreedyPipelineFuser.java:93)
at org.apache.beam.runners.spark.SparkPipelineRunner.run(SparkPipelineRunner.java:114)
at org.apache.beam.runners.spark.SparkPipelineRunner.main(SparkPipelineRunner.java:263)
... 5 more
P.S.:
I followed the same instruction (Running on Dataproc cluster (YARN backed)) to run the apache_beam.examples.wordcount example and that works fine.
apache-beam = {extras = ["gcp"], version = "~2.38.0"}
It seems someone faced exactly the same error with the FlinkRunner as well: Error while running beam streaming pipeline (Python) with pub/sub io in embedded Flinkrunner (apache_beam [GCP])
What could be the issue, or what am I missing?
I want to read a Pub/Sub topic and write the data to Bigtable with Dataflow code written in Python. I could find sample code in Java but not in Python.
How can we assign columns in a row from Pub/Sub to different column families and write the data to Bigtable?
To write to Bigtable in a Dataflow pipeline, you'll need to create direct rows and pass them to the WriteToBigTable DoFn. Here is a brief example that just passes in the row keys and adds one cell for each key, nothing too fancy:
import datetime

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable import row


class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--bigtable-project',
            help='The Bigtable project ID, this can be different than your '
                 'Dataflow project',
            default='bigtable-project')
        parser.add_argument(
            '--bigtable-instance',
            help='The Bigtable instance ID',
            default='bigtable-instance')
        parser.add_argument(
            '--bigtable-table',
            help='The Bigtable table ID in the instance.',
            default='bigtable-table')


class CreateRowFn(beam.DoFn):
    def process(self, key):
        direct_row = row.DirectRow(row_key=key)
        direct_row.set_cell(
            "stats_summary",
            b"os_build",
            b"android",
            datetime.datetime.now())
        return [direct_row]


def run(argv=None):
    """Build and run the pipeline."""
    options = MyOptions(argv)

    with beam.Pipeline(options=options) as p:
        p | beam.Create(["phone#4c410523#20190501",
                         "phone#4c410523#20190502"]) | beam.ParDo(
            CreateRowFn()) | WriteToBigTable(
                project_id=options.bigtable_project,
                instance_id=options.bigtable_instance,
                table_id=options.bigtable_table)


if __name__ == '__main__':
    run()
I am just starting to explore this now and can link to a more polished version on GitHub once it's complete. Hope this helps you get started.
Building on top of what was proposed and adding Pub/Sub, here's a working version.
Prerequisites
GCS Bucket created (for Dataflow temp/staging files)
PubSub topic created
PubSub subscription created
BigTable instance created
BigTable table created
BigTable column family created (there is no visible error otherwise!)
Example of the latter with cbt:
cbt -instance test-instance createfamily test-table cf1
Code
Define and run the Dataflow pipeline.
# Packages
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud import pubsub_v1


# Classes
class CreateRowFn(beam.DoFn):
    def __init__(self, pipeline_options):
        self.instance_id = pipeline_options.bigtable_instance
        self.table_id = pipeline_options.bigtable_table

    def process(self, key):
        from google.cloud.bigtable import row
        import datetime

        direct_row = row.DirectRow(row_key=key)
        direct_row.set_cell(
            'cf1',
            'field1',
            'value1',
            timestamp=datetime.datetime.now())
        yield direct_row


# Options
class XyzOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--bigtable_project', default='nested')
        parser.add_argument('--bigtable_instance', default='instance')
        parser.add_argument('--bigtable_table', default='table')


# PROJECT, REGION, TEMP_LOCATION, STAGING_LOCATION, REQUIREMENTS_FILE,
# INSTANCE, TABLE and SUBSCRIPTION are assumed to be defined as constants above.
pipeline_options = XyzOptions(
    save_main_session=True, streaming=True,
    runner='DataflowRunner',
    project=PROJECT,
    region=REGION,
    temp_location=TEMP_LOCATION,
    staging_location=STAGING_LOCATION,
    requirements_file=REQUIREMENTS_FILE,
    bigtable_project=PROJECT,
    bigtable_instance=INSTANCE,
    bigtable_table=TABLE)


# Pipeline
def run(argv=None):
    with beam.Pipeline(options=pipeline_options) as p:
        input_subscription = f"projects/{PROJECT}/subscriptions/{SUBSCRIPTION}"

        _ = (p
             | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(
                 subscription=input_subscription).with_output_types(bytes)
             | 'Conversion UTF-8 bytes to string' >> beam.Map(lambda msg: msg.decode('utf-8'))
             | 'Conversion string to row object' >> beam.ParDo(CreateRowFn(pipeline_options))
             | 'Writing row object to BigTable' >> WriteToBigTable(
                 project_id=pipeline_options.bigtable_project,
                 instance_id=pipeline_options.bigtable_instance,
                 table_id=pipeline_options.bigtable_table))


if __name__ == '__main__':
    run()
Publish a message b"phone#1111" to the Pub/Sub topic (e.g. using the Python PublisherClient()).
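For reference, a minimal sketch of publishing that test message with the Pub/Sub client library; the project and topic names here are placeholders:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# Pub/Sub payloads are bytes; this key becomes the Bigtable row key.
future = publisher.publish(topic_path, b"phone#1111")
print(future.result())  # message ID once the publish succeeds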
Table content (using happybase)
b'phone#1111': {b'cf1:field1': b'value1'}
Row length: 1
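For completeness, the content above could be read back with a short happybase snippet along these lines (a sketch using the google-cloud-happybase compatibility layer; project, instance and table names are placeholders):

from google.cloud import bigtable
from google.cloud import happybase

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("test-instance")
connection = happybase.Connection(instance=instance)

table = connection.table("test-table")
row = table.row(b"phone#1111")
print(row)                      # {b'cf1:field1': b'value1'}
print("Row length:", len(row))  # Row length: 1

connection.close()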
When I go to use operators/hooks like the BigQueryHook, I see a message that these operators are deprecated and that I should use the airflow.gcp... operator version. However, when I try to use it in my DAG it fails and says there is no module named airflow.gcp. I have the most up-to-date Airflow Composer version with beta features and Python 3. Is it possible to install these operators somehow?
I am trying to run a Dataflow job in Python 3 using Beam 2.15. I have tried the virtualenv operator, but that doesn't work because it only allows Python 2.7. How can I do this?
The newest Airflow version available in Composer is either 1.10.2 or 1.10.3 (depending on the region). In those versions the operators still live in the contrib section.
Focusing on how to run Python 3 Dataflow jobs with Composer, you'd need to wait for a new version to be released. However, if you need an immediate solution, you can try to back-port the fix.
In this case I defined a DataFlow3Hook which extends the normal DataFlowHook but does not hard-code python2 in the start_python_dataflow method:
class DataFlow3Hook(DataFlowHook):
    def start_python_dataflow(
        ...
        py_interpreter: str = "python3"
    ):
        ...
        self._start_dataflow(variables, name, [py_interpreter] + py_options + [dataflow],
                             label_formatter)
Then we'll have our custom DataFlowPython3Operator calling the new hook:
class DataFlowPython3Operator(DataFlowPythonOperator):
    def execute(self, context):
        ...
        hook = DataFlow3Hook(gcp_conn_id=self.gcp_conn_id,
                             delegate_to=self.delegate_to,
                             poll_sleep=self.poll_sleep)
        ...
        hook.start_python_dataflow(
            self.job_name, formatted_options,
            self.py_file, self.py_options, py_interpreter="python3")
Finally, in our DAG we just use the new operator:
task = DataFlowPython3Operator(
    py_file='/home/airflow/gcs/data/main.py',
    task_id=JOB_NAME,
    dag=dag)
See full code here. Job runs with Python 3.6:
Environment details and dependencies used (Beam job was a minimal example):
softwareConfig:
  imageVersion: composer-1.8.0-airflow-1.10.3
  pypiPackages:
    apache-beam: ==2.15.0
    google-api-core: ==1.14.3
    google-apitools: ==0.5.28
    google-cloud-core: ==1.0.3
  pythonVersion: '3'
Let me know if that works for you. If so, I'd recommend moving the code to a plugin for code readability and to reuse it across DAGs.
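As a rough illustration of that plugin suggestion (not part of the original answer), the custom hook and operator could be registered through an Airflow plugin, assuming both classes live in the plugin module:

# plugins/dataflow_python3_plugin.py
from airflow.plugins_manager import AirflowPlugin


class DataFlowPython3Plugin(AirflowPlugin):
    name = "dataflow_python3_plugin"
    hooks = [DataFlow3Hook]
    operators = [DataFlowPython3Operator]

On Airflow 1.10 the operator would then be importable in DAGs via airflow.operators.dataflow_python3_plugin.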
As an alternative, you can use the PythonVirtualenvOperator on older Airflow versions. Given some Beam pipeline (wrapped in a function) saved as dataflow_python3.py:
def main():
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.options.pipeline_options import SetupOptions
    import argparse
    import logging

    class ETL(beam.DoFn):
        def process(self, row):
            # do data processing
            pass

    def run(argv=None):
        parser = argparse.ArgumentParser()
        parser.add_argument(
            '--input',
            dest='input',
            default='gs://bucket/input/input.txt',
            help='Input file to process.'
        )
        known_args, pipeline_args = parser.parse_known_args(argv)
        pipeline_args.extend([
            '--runner=DataflowRunner',
            '--project=project_id',
            '--region=region',
            '--staging_location=gs://bucket/staging/',
            '--temp_location=gs://bucket/temp/',
            '--job_name=job_id',
            '--setup_file=./setup.py'
        ])
        pipeline_options = PipelineOptions(pipeline_args)
        pipeline_options.view_as(SetupOptions).save_main_session = True

        with beam.Pipeline(options=pipeline_options) as p:
            rows = (p | 'read rows' >> beam.io.ReadFromText(known_args.input))
            etl = (rows | 'process data' >> beam.ParDo(ETL()))

    logging.getLogger().setLevel(logging.DEBUG)
    run()
You can run it using the following DAG file:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonVirtualenvOperator
import sys

import dataflow_python3 as py3  # import your beam pipeline file here

default_args = {
    'owner': 'John Smith',
    'depends_on_past': False,
    'start_date': datetime(2016, 1, 1),
    'email': ['email@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=1),
}

CONNECTION_ID = 'proj_id'

with DAG('Dataflow_Python3', schedule_interval='@once', template_searchpath=['/home/airflow/gcs/dags/'], max_active_runs=15, catchup=True, default_args=default_args) as dag:

    dataflow_python3 = PythonVirtualenvOperator(
        task_id='dataflow_python3',
        python_callable=py3.main,  # this is your beam pipeline callable
        requirements=['apache-beam[gcp]', 'pandas'],
        python_version=3,
        dag=dag
    )

    dataflow_python3
I have run Python 3 Beam 2.17 by using the DataflowTemplateOperator and it worked like a charm.
Use the command below to create the template:
python3 -m scriptname --runner DataflowRunner --project project_id --staging_location staging_location --temp_location temp_location --template_location template_location/script_metadata --region region --experiments use_beam_bq_sink --no_use_public_ips --subnetwork=subnetwork
scriptname would be the name of your Dataflow Python file (without the .py extension).
--template_location - the location where the Dataflow template will be created; don't add an extension like .json to it. Simply scriptname_metadata will work.
--experiments use_beam_bq_sink - this parameter is needed if your sink is BigQuery; otherwise you can remove it.
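For the parameters passed by the operator below ('input' and 'output') to reach the templated job at runtime, the pipeline script typically declares them as value providers; a minimal sketch of what that part of scriptname might contain (the option names are assumed to match the DAG below):

from apache_beam.options.pipeline_options import PipelineOptions


class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # add_value_provider_argument makes these available when the template runs.
        parser.add_value_provider_argument('--input', type=str, help='Input CSV file.')
        parser.add_value_provider_argument('--output', type=str, help='Output BigQuery table.')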
import datetime as dt
import time

from airflow.models import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

lasthour = dt.datetime.now() - dt.timedelta(hours=1)

args = {
    'owner': 'airflow',
    'start_date': lasthour,
    'depends_on_past': False,
    'dataflow_default_options': {
        'project': "project_id",
        'staging_location': "staging_location",
        'temp_location': "temp_location",
        'region': "region",
        'runner': "DataflowRunner",
        'job_name': 'job_name' + str(time.time()),
    },
}

dag = DAG(
    dag_id='employee_dataflow_dag',
    schedule_interval=None,
    default_args=args
)

Dataflow_Run = DataflowTemplateOperator(
    task_id='dataflow_pipeline',
    template='template_location/script_metadata',
    parameters={
        'input': "employee.csv",
        'output': 'project_id:dataset_id.table',
        'region': "region"
    },
    gcp_conn_id='google_cloud_default',
    poll_sleep=15,
    dag=dag
)

Dataflow_Run
I'm trying to use concurrent.futures multithreading in Python with subprocess.run to launch an external Python script. But I have some trouble with the shell=True part of subprocess.run().
Here is an example of the external code, let's call it test.py:
#! /usr/bin/env python3

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-x', '--x_nb', required=True, help='the x number')
    parser.add_argument('-y', '--y_nb', required=True, help='the y number')
    args = parser.parse_args()

    print('result is {} when {} multiplied by {}'.format(int(args.x_nb) * int(args.y_nb),
                                                         args.x_nb,
                                                         args.y_nb))
In my main python script I have:
#! /usr/bin/env python3

import subprocess
import concurrent.futures
import threading

...

args_list = []
for i in range(10):
    cmd = './test.py -x {} -y 2 '.format(i)
    args_list.append(cmd)

# just as an example, this line works fine
subprocess.run(args_list[0], shell=True)

# this multithreading is not working
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(subprocess.run, args_list)
The problem here is that I can't pass the shell=True option to the executor.map.
I have already tried without success:
args_list = []
for i in range(10):
    cmd = './test.py -x {} -y 2 '.format(i)
    args_list.append((cmd, eval('shell=True')))
or
args_list = []
for i in range(10):
    cmd = './test.py -x {} -y 2 '.format(i)
    args_list.append((cmd, 'shell=True'))
Does anyone have an idea on how to solve this problem?
I don't think the map method can call a function with keyword arguments directly, but there are two simple solutions to your issue.
Solution 1: Use a lambda to set the extra keyword argument you want
The lambda is basically a small function that calls your real function, passing the arguments through. This is a good solution if the keyword arguments are fixed.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(lambda args: subprocess.run(args, shell=True), args_list)
Solution 2: Use executor.submit to submit the functions to the executor
The submit method lets you specify args and keyword args to the target function.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for args in args_list:
        executor.submit(subprocess.run, args, shell=True)
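A closely related variant (not in the original answer) is to bind the keyword argument with functools.partial, which keeps the executor.map style of solution 1:

import functools

# partial() pre-binds shell=True, so map() only has to supply the command string.
run_in_shell = functools.partial(subprocess.run, shell=True)

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(run_in_shell, args_list)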
Can anyone provide guidance on why this simple Flask app complains that "Only one SparkContext may be running in this JVM"? I'm not attempting to load more than one context, obviously.
Code:
import flask
from pyspark import SparkContext
from operator import itemgetter

app = flask.Flask(__name__)


@app.route('/')
def homepage():
    return 'Example: /dt/140'


@app.route('/dt/<int:delaythreshold>')
def dt(delaythreshold):
    global flights_rdd
    flights_dict = \
        flights_rdd \
        .filter( lambda (day, delay): delay >= delaythreshold ) \
        .countByValue()
    sorted_flight_tuples = \
        sorted( flights_dict.items(), key=itemgetter(1), reverse=True )
    return flask.render_template('delays.html', tuples=sorted_flight_tuples[:5])


if __name__ == '__main__':
    global flights_rdd
    sc = SparkContext()
    flights_rdd = \
        sc.textFile('/tmp/flights.csv', 4) \
        .map( lambda s: s.split(',') ) \
        .map( lambda l: ( l[0][:4], int(l[1]) ) ) \
        .cache()

    app.config['DEBUG'] = True
    app.run(host='0.0.0.0')
Thanks in advance.
You probably shouldn't create "global" resources such as the SparkContext in the __main__ section.
In particular, if you run your app in debug mode, the module is instantly reloaded a second time upon start, hence the attempt to create a second SparkContext. (Add e.g. print 'creating sparkcontext' to your __main__ section before creating the SparkContext and you'll see it twice.)
Check the Flask documentation for proposals on how to cache global resources.
Following http://flask.pocoo.org/docs/0.10/appcontext/#context-usage you could e.g. retrieve the SparkContext as follows:
from flask import g

def get_flights():
    flights_rdd = getattr(g, '_flights_rdd', None)
    if flights_rdd is None:
        # create flights_rdd on the fly
        sc = g._sc = SparkContext()
        flights_rdd = \
            sc.textFile('/tmp/flights.csv', 4) \
            .map( lambda s: s.split(',') ) \
            .map( lambda l: ( l[0][:4], int(l[1]) ) ) \
            .cache()
        g._flights_rdd = flights_rdd
    return flights_rdd


@app.teardown_appcontext
def teardown_sparkcontext(exception):
    sc = getattr(g, '_sc', None)
    if sc is not None:
        sc.stop()
Then use flights_rdd = get_flights() instead of the global flights_rdd.
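For example, the /dt route from the question would then look roughly like this (a sketch reusing the names above):

@app.route('/dt/<int:delaythreshold>')
def dt(delaythreshold):
    # Fetch (or lazily create) the cached RDD instead of relying on a global.
    flights_dict = \
        get_flights() \
        .filter( lambda dd: dd[1] >= delaythreshold ) \
        .countByValue()
    sorted_flight_tuples = \
        sorted( flights_dict.items(), key=itemgetter(1), reverse=True )
    return flask.render_template('delays.html', tuples=sorted_flight_tuples[:5])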