I'm trying to parse the datetime so that I can later group by hour in structured streaming.
Currently I have code like this:
distinct_table = service_table \
    .select(psf.col('crime_id'),
            psf.col('original_crime_type_name'),
            psf.to_timestamp(psf.col('call_date_time')).alias('call_datetime'),
            psf.col('address'),
            psf.col('disposition'))
Which gives output in console:
+---------+------------------------+-------------------+--------------------+------------+
| crime_id|original_crime_type_name| call_datetime| address| disposition|
+---------+------------------------+-------------------+--------------------+------------+
|183652852| Burglary|2018-12-31 18:52:00|600 Block Of Mont...| HAN|
|183652839| Passing Call|2018-12-31 18:51:00|500 Block Of Clem...| HAN|
|183652841| 22500e|2018-12-31 18:51:00|2600 Block Of Ale...| CIT|
When I try to apply this UDF to convert the timestamp (the call_datetime column):
import pyspark.sql.functions as psf
from pyspark.sql.types import StringType
from dateutil.parser import parse as parse_date

@psf.udf(StringType())
def udf_convert_time(timestamp):
    d = parse_date(timestamp)
    return str(d.strftime('%y%m%d%H'))
I get a NoneType error:
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
process()
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 149, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 74, in <lambda>
return lambda *a: f(*a)
File "/Users/PycharmProjects/data-streaming-project/solution/streaming/data_stream.py", line 29, in udf_convert_time
d = parse_date(timestamp)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 301, in parse
res = self._parse(timestr, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 349, in _parse
l = _timelex.split(timestr)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 143, in split
return list(cls(s))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 137, in next
token = self.get_token()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 68, in get_token
nextchar = self.instream.read(1)
AttributeError: 'NoneType' object has no attribute 'read'
This is the query plan:
pyspark.sql.utils.StreamingQueryException: u'Writing job aborted.\n=== Streaming Query ===\nIdentifier: [id = 958a6a46-f718-49c4-999a-661fea2dc564, runId = fc9a7a78-c311-42b7-bbed-7718b4cc1150]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {KafkaSource[Subscribe[service-calls]]: {"service-calls":{"0":200}}}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nProject [crime_id#25, original_crime_type_name#26, call_datetime#53, address#33, disposition#32, udf_convert_time(call_datetime#53) AS parsed_time#59]\n+- Project [crime_id#25, original_crime_type_name#26, to_timestamp(\'call_date_time, None) AS call_datetime#53, address#33, disposition#32]\n +- Project [SERVICE_CALLS#23.crime_id AS crime_id#25, SERVICE_CALLS#23.original_crime_type_name AS original_crime_type_name#26, SERVICE_CALLS#23.report_date AS report_date#27, SERVICE_CALLS#23.call_date AS call_date#28, SERVICE_CALLS#23.offense_date AS offense_date#29, SERVICE_CALLS#23.call_time AS call_time#30, SERVICE_CALLS#23.call_date_time AS call_date_time#31, SERVICE_CALLS#23.disposition AS disposition#32, SERVICE_CALLS#23.address AS address#33, SERVICE_CALLS#23.city AS city#34, SERVICE_CALLS#23.state AS state#35, SERVICE_CALLS#23.agency_id AS agency_id#36, SERVICE_CALLS#23.address_type AS address_type#37, SERVICE_CALLS#23.common_location AS common_location#38]\n +- Project [jsontostructs(StructField(crime_id,StringType,true), StructField(original_crime_type_name,StringType,true), StructField(report_date,StringType,true), StructField(call_date,StringType,true), StructField(offense_date,StringType,true), StructField(call_time,StringType,true), StructField(call_date_time,StringType,true), StructField(disposition,StringType,true), StructField(address,StringType,true), StructField(city,StringType,true), StructField(state,StringType,true), StructField(agency_id,StringType,true), StructField(address_type,StringType,true), StructField(common_location,StringType,true), value#21, Some(America/Los_Angeles)) AS SERVICE_CALLS#23]\n +- Project [cast(value#8 as string) AS value#21]\n +- StreamingExecutionRelation KafkaSource[Subscribe[service-calls]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]\n'
I'm using StringType for all columns and to_timestamp for the timestamp column (which seems to work).
I verified that all the data I'm using (only about 100 rows) has values. Any idea how to debug this?
EDIT
Input is coming from Kafka; the schema is shown above in the error log (all StringType()).
It's best not to use UDFs, because they bypass the Spark Catalyst optimizer, especially when the pyspark.sql.functions module already provides the functions you need. The following code will transform your timestamp.
import pyspark.sql.functions as F
import pyspark.sql.types as T

rawData = [(183652852, "Burglary", "2018-12-31 18:52:00", "600 Block Of Mont", "HAN"),
           (183652839, "Passing Call", "2018-12-31 18:51:00", "500 Block Of Clem", "HAN"),
           (183652841, "22500e", "2018-12-31 18:51:00", "2600 Block Of Ale", "CIT")]

df = spark.createDataFrame(rawData).toDF("crime_id",
                                         "original_crime_type_name",
                                         "call_datetime",
                                         "address",
                                         "disposition")

date_format_source = "yyyy-MM-dd HH:mm:ss"
date_format_target = "yyyy-MM-dd HH"

df.select("*") \
    .withColumn("new_time_format",
                F.from_unixtime(F.unix_timestamp(F.col("call_datetime"), date_format_source),
                                date_format_target)
                .cast(T.TimestampType())) \
    .withColumn("time_string", F.date_format(F.col("new_time_format"), "yyyyMMddHH")) \
    .select("call_datetime", "new_time_format", "time_string") \
    .show(truncate=True)
+-------------------+-------------------+-----------+
| call_datetime| new_time_format|time_string|
+-------------------+-------------------+-----------+
|2018-12-31 18:52:00|2018-12-31 18:00:00| 2018123118|
|2018-12-31 18:51:00|2018-12-31 18:00:00| 2018123118|
|2018-12-31 18:51:00|2018-12-31 18:00:00| 2018123118|
+-------------------+-------------------+-----------+
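Since the original goal was to group by hour in structured streaming, here is a minimal sketch of that grouping step. It is only an assumption-based example: it presumes distinct_table is the streaming DataFrame from the question and that call_datetime is already a TimestampType column produced by to_timestamp.
import pyspark.sql.functions as psf

# Hedged sketch: hourly aggregation on the streaming DataFrame from the question.
# Assumes `distinct_table` exists and `call_datetime` is a TimestampType column.
hourly_counts = distinct_table \
    .withWatermark("call_datetime", "1 hour") \
    .groupBy(psf.window(psf.col("call_datetime"), "1 hour"),
             psf.col("original_crime_type_name")) \
    .count()

query = hourly_counts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()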
I'm trying to pass a partitioned TabularDataset into a ParallelRunStep as input, but I'm getting the error below and can't figure out why the azureml ParallelRunStep can't recognize the partitioned dataset:
UserInputNotPartitionedByGivenKeys: The input dataset 'partitioned_combined_scored_dataset_input' is not partitioned by 'model_name'.
Traceback (most recent call last):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role_process.py", line 111, in run
loop.run_until_complete(self.master_role.start())
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 303, in start
await self.wait_for_first_task()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 288, in wait_for_first_task
await self.wait_for_input_init()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 126, in wait_for_input_init
self.future_create_tasks.result()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 199, in create_tasks
raise exc
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 190, in create_tasks
for task_group in self.get_task_groups(provider.get_tasks()):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 169, in get_task_groups
for index, task in enumerate(tasks):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/partition_by_keys_provider.py", line 77, in get_tasks
raise UserInputNotPartitionedByGivenKeys(message=message, compliant_message=compliant_message)
UserInputNotPartitionedByGivenKeys: The input dataset 'partitioned_combined_scored_dataset_input' is not partitioned by 'model_name'.
ParallelRunConfig & ParallelRunStep
parallel_run_config = ParallelRunConfig(
    source_directory=source_dir_for_snapshot,
    entry_script="src/steps/script.py",
    partition_keys=["model_name"],
    error_threshold=10,
    allowed_failed_count=15,
    allowed_failed_percent=10,
    run_max_try=3,
    output_action="append_row",
    append_row_file_name="output_file.csv",
    environment=aml_run_config.environment,
    compute_target=aml_run_config.target,
    node_count=2
)

parallelrun_step = ParallelRunStep(
    name="Do Some Parallel Stuff on Each model_name",
    parallel_run_config=parallel_run_config,
    inputs=[partitioned_combined_scored_dataset],
    output=OutputFileDatasetConsumptionConfig(name='output_dataset'),
    arguments=["--score-id", score_id_pipeline_param,
               "--partitioned-combined-dataset", partitioned_combined_scored_dataset],
    allow_reuse=True
)
partitioned_combined_scored_dataset
partitioned_combined_scored_dataset = DatasetConsumptionConfig(
    name="partitioned_combined_scored_dataset_input",
    dataset=PipelineParameter(
        name="partitioned_combined_dataset",
        default_value=future_partitioned_dataset)
)
The partitioned dataset itself was previously created and uploaded using:
partitioned_dataset = TabularDatasetFactory.from_parquet_files(
        path=(Datastore.get(ws, ), f"{partitioned_combined_datasets_dir}/*.parquet")) \
    .partition_by(
        partition_keys=['model_name'],
        target=DataPath(Datastore(), 'some/path/to/partitioned')
    )
I know that TabularDataset.partition_by() uploads to a GUID folder generated by AML, so some/path/to/partitioned actually becomes some/path/to/partitioned/XXXXXXXX/{model_name}/part0.parquet for each model_name partition, according to the documentation. We've accounted for this when defining the tabular dataset passed into the PipelineParameter for partitioned_combined_scored_dataset at runtime, using:
TabularDatasetFactory.from_parquet_files(path=(Datastore(), f"{partitioned_combined_dataset_dir}/*/*/*.parquet"))
I would like to obtain row schemas in Apache Beam (Python) for use with SQL transforms. However, I ran into the issue explained below.
The schema is defined as follows:
class RowSchema(typing.NamedTuple):
    colA: str
    colB: typing.Optional[str]

coders.registry.register_coder(RowSchema, coders.RowCoder)
The following example infers the schema correctly:
with beam.Pipeline(options=pipeline_options) as p:
    pcol = (p
            | "Create" >> beam.Create(
                [
                    RowSchema(colA='a1', colB='b1'),
                    RowSchema(colA='a2', colB=None)])
                .with_output_types(RowSchema)
            | beam.Map(print)
            )
The following attempt, however, raises "ValueError: Type names and field names must be valid identifiers: 'run.<locals>.RowSchema'"
with beam.Pipeline(options=pipeline_options) as p:
    pcol = (p
            | "Create" >> beam.Create(
                [
                    {'colA': 'a1', 'colB': 'b1'},
                    {'colA': 'a2', 'colB': None}])
            | 'ToRow' >> beam.Map(
                lambda x: RowSchema(**x))
                .with_output_types(RowSchema)
            | beam.Map(print)
            )
Full stack trace:
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "home/src/main.py", line 326, in <module>
run()
File "home/src/main.py", line 267, in run
| 'ToRow' >> beam.Map(lambda x: RowSchema(**x)).with_output_types(RowSchema)
File "home/lib/python3.9/site-packages/apache_beam/transforms/core.py", line 1661, in Map
pardo = FlatMap(wrapper, *args, **kwargs)
File "home/lib/python3.9/site-packages/apache_beam/transforms/core.py", line 1606, in FlatMap
pardo = ParDo(CallableWrapperDoFn(fn), *args, **kwargs)
File "home/lib/python3.9/site-packages/apache_beam/transforms/core.py", line 1217, in __init__
super().__init__(fn, *args, **kwargs)
File "home/lib/python3.9/site-packages/apache_beam/transforms/ptransform.py", line 861, in __init__
self.fn = pickler.loads(pickler.dumps(self.fn))
File "home/lib/python3.9/site-packages/apache_beam/internal/pickler.py", line 51, in loads
return desired_pickle_lib.loads(
File "home/lib/python3.9/site-packages/apache_beam/internal/dill_pickler.py", line 289, in loads
return dill.loads(s)
File "home/lib/python3.9/site-packages/dill/_dill.py", line 275, in loads
return load(file, ignore, **kwds)
File "home/lib/python3.9/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "home/lib/python3.9/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self)
File "home/lib/python3.9/site-packages/dill/_dill.py", line 788, in _create_namedtuple
t = collections.namedtuple(name, fieldnames)
File "/usr/lib/python3.9/collections/__init__.py", line 390, in namedtuple
raise ValueError('Type names and field names must be valid '
ValueError: Type names and field names must be valid identifiers: 'run.<locals>.RowSchema'
The failed attempt works if I change the schema definition to
RowSchema = typing.NamedTuple('RowSchema', [('colA', str), ('colB', typing.Optional[str])])
The failing snippet seems to be correctly written according to some of the references below.
References:
Apache Beam infer schema using NamedTuple (Python)
https://beam.apache.org/documentation/programming-guide/#inferring-schemas
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang_sql.py
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/sql_taxi.py
Tested on Python 3.9, Beam 2.37.0, and multiple runners including DirectRunner, DataflowRunner and PortableRunner.
Solved it by simply moving the schema definition to module level, outside the run function. When the NamedTuple is defined inside run, it pickles with the qualified name 'run.<locals>.RowSchema', which collections.namedtuple rejects as an invalid type name during deserialization (hence the ValueError above).
class RowSchema(typing.NamedTuple):
    colA: str
    colB: typing.Optional[str]

coders.registry.register_coder(RowSchema, coders.RowCoder)

def run(argv=None, save_main_session=True):
    ...
    with beam.Pipeline(options=pipeline_options) as p:
        ...
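For reference, here is a minimal self-contained sketch of the working layout, combining the module-level schema with the dict-to-RowSchema Map from the failing attempt; the pipeline options from the question are omitted for brevity.
import typing

import apache_beam as beam
from apache_beam import coders


class RowSchema(typing.NamedTuple):
    colA: str
    colB: typing.Optional[str]


coders.registry.register_coder(RowSchema, coders.RowCoder)


def run(argv=None):
    # Defining RowSchema at module level lets it be pickled by reference,
    # avoiding the 'run.<locals>.RowSchema' ValueError.
    with beam.Pipeline() as p:
        (p
         | "Create" >> beam.Create([{'colA': 'a1', 'colB': 'b1'},
                                    {'colA': 'a2', 'colB': None}])
         | "ToRow" >> beam.Map(lambda x: RowSchema(**x)).with_output_types(RowSchema)
         | beam.Map(print))


if __name__ == "__main__":
    run()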
I am doing a Delta Lake merge operation using the Python API and PySpark. After the merge operation I run a compaction, but the compaction gives the following error:
Error:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 170, in load
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1212, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1199, in _get_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_collections.py", line 501, in convert
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in <listcomp>
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Code
delta_table = "delta_lake_path"
df = spark.read.csv("s3n://input_file.csv",header=True)
delta_table = DeltaTable.forPath(spark, delta_table)
delta_table.merge(df, "df.id = delta_table.id" ).whenNotMatchedInsertAll().execute()
#compaction
spark.read.format("delta").load(delta_table).repartition(1).write.option("dataChange",
"False").format("delta").mode("overwrite").save(delta_table)
Can anyone suggest why the Spark session is not able to create another Delta table instance?
I need to perform both the merge and the compaction in the same script, since I want to run the compaction only on the partitions touched by the merge. Those partitions are derived from the unique values present in the dataframe df created from input_file.csv.
I think your problem lies with the delta_table variable: at first it is a string containing the Delta Lake path, but then you overwrite it with a DeltaTable object and try to pass that into the .load() method. Separating those variables should help:
delta_table_path = "delta_lake_path"
df = spark.read.csv("s3n://input_file.csv",header=True)
delta_table = DeltaTable.forPath(spark, delta_table_path)
delta_table.merge(df, "df.id = delta_table.id" ).whenNotMatchedInsertAll().execute()
#compaction
spark.read.format("delta").load(delta_table_path).repartition(1).write.option("dataChange",
"False").format("delta").mode("overwrite").save(delta_table_path )
I'm trying to use Dask to read a Parquet table in S3 like this:
import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    }
)

result = times.groupby(['account', 'system_id'])['exec_time'].sum().nlargest(num_row) \
    .compute().reset_index().to_dict(orient='records')
I only have pyarrow and s3fs installed.
When I read it using a LocalCluster like below, it works great:
client = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
But when I read it using the real cluster, it throws this error:
client = Client('master_ip:8786')
TypeError: ('Could not serialize object of type tuple.', "(<function apply at 0x7f9f9c9942f0>, <function _apply_chunk at 0x7f9f76ed1510>, [(<function _read_pyarrow_parquet_piece at 0x7f9f76eedea0>, <dask.bytes.s3.DaskS3FileSystem object at 0x7f9f5a83edd8>, ParquetDatasetPiece('my_bucket/my_table/0a0a6e71438a43cd82985578247d5c97.parquet', row_group=None, partition_keys=[]), ['account', 'system_id', 'upload_time', 'name', 'exec_time'], [], False, <pyarrow.parquet.ParquetPartitions object at 0x7f9f5a565278>, []), 'account', 'system_id'], {'chunk': <methodcaller: sum>, 'columns': 'exec_time'})")
distributed.batched - ERROR - Error in batched write
Traceback (most recent call last):
File "/project_folder/lib64/python3.6/site-packages/distributed/batched.py", line 94, in _background_send
on_error='raise')
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 224, in write
'recipient': self._peer_addr})
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/utils.py", line 50, in to_frames
res = yield offload(_to_frames)
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/utils.py", line 43, in _to_frames
context=context))
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/core.py", line 54, in dumps
for key, value in data.items()
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/core.py", line 55, in <dictcomp>
if type(value) is Serialize}
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/serialize.py", line 164, in serialize
raise TypeError(msg, str(x)[:10000])
Do you know what could be the problem?
Thanks,
Serialisation of pyarrow objects has been problematic in pyarrow 0.13.0, and this should be fixed in the next release. Can you try downgrading your pyarrow version?
I have the code below. Essentially, I'm trying to generate some new columns from the values in existing ones, and then save the dataframe with the new columns as a table in the cluster. Sorry, I'm still new to PySpark.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DecimalType
import numpy as np
import math
df = sqlContext.sql('select * from db.mytable')
angle_av = udf(lambda (x, y): -10 if x == 0 else math.atan2(y/x)*180/np.pi, DecimalType(20,10))
df = df.withColumn('a_v_angle', angle_av(array('a_v_real', 'a_v_imag')))
df.createOrReplaceTempView('temp')
sqlContext.sql('create table new_table as select * from temp')
These operations don't produce any errors on their own. I then attempt to store the df as a table and get the following error (I'm guessing because this is when the operations are actually executed):
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
process()
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 103, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 70, in <lambda>
return lambda *a: f(*a)
File "<stdin>", line 14, in <lambda>
TypeError: unsupported operand type(s) for /: 'NoneType' and 'NoneType'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
This happens because the input values are null / None. The function should check its input and proceed accordingly, e.g.
if x == 0 or x is None:
or just
if not x:
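For example, a minimal null-safe version of the UDF from the question might look like the sketch below. DoubleType is used instead of the original DecimalType to keep the example simple, and math.atan2 takes y and x as two separate arguments.
import math

from pyspark.sql.functions import udf, array
from pyspark.sql.types import DoubleType

def angle(v):
    # v is the two-element array built from 'a_v_real' and 'a_v_imag'.
    x, y = v
    # Guard against null inputs (the cause of the NoneType error) and x == 0,
    # returning the same -10 sentinel as the original lambda.
    if x is None or y is None or x == 0:
        return -10.0
    return math.atan2(y, x) * 180 / math.pi

angle_av = udf(angle, DoubleType())
df = df.withColumn('a_v_angle', angle_av(array('a_v_real', 'a_v_imag')))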