No attribute error passing broadcast variable from PySpark to Java function - apache-spark

I have a Java class registered in PySpark, and I'm trying to pass a Broadcast variable from PySpark to a method in this class, like so:
from py4j.java_gateway import java_import
java_import(spark.sparkContext._jvm, "net.a.b.c.MyClass")
myPythonGateway = spark.sparkContext._jvm.MyClass()
with open("tests/fixtures/file.txt", "rb") as binary_file:
    data = spark.sparkContext.broadcast(binary_file.read())

myPythonGateway.setData(data)
But this is throwing:
AttributeError: 'Broadcast' object has no attribute '_get_object_id'
However, if I pass the byte[] directly, without wrapping it in broadcast(), it works fine. But I need this variable to be broadcast, as it will be used repeatedly.

According to the py4j docs, the above error will be thrown if you try to pass a Python collection to a method that expects a Java collection. The docs give the following solution:
You can explicitly convert Python collections using one of the following converters located in the py4j.java_collections module: SetConverter, MapConverter, ListConverter.
An example is provided there also.
Presumably, this error is occurring when py4j tries to convert the value attribute of the Broadcast object, so converting this value first may fix the problem, e.g.:
from py4j.java_collections import ListConverter

converted_data = ListConverter().convert(binary_file.read(), spark.sparkContext._jvm._gateway_client)
broadcast_data = spark.sparkContext.broadcast(converted_data)
myPythonGateway.setData(broadcast_data)
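For completeness, here is a minimal standalone sketch of what an explicit py4j conversion looks like. It assumes a plain py4j JavaGateway with a GatewayServer already running on the JVM side (with Spark you would reuse the existing gateway client as above), and the list contents are only illustrative:

from py4j.java_gateway import JavaGateway
from py4j.java_collections import ListConverter

gateway = JavaGateway()  # connects to a GatewayServer assumed to be running on the JVM side
java_list = ListConverter().convert([1, 2, 3], gateway._gateway_client)
print(java_list.getClass().getName())  # a java.util.List implementation that Java methods can accept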

Related

Serialize Python class using AvroProducer confluent-kafka

I am pretty new to confluent-kafka and Python, and I would like to know whether there is a way in Python to serialize a Python class to a Kafka message using an Avro schema.
I am currently using the AvroProducer provided by confluent-kafka; however, I am only able to serialize a JSON string. The downstream application is written in Java and would like to deserialize the message back into a Java object.
producer = AvroProducer(self.producer_config, default_key_schema=key_schema, default_value_schema=value_schema)
value = ast.literal_eval(str(message))
try:
    producer.produce(topic=topic, partition=partition, key=str(message['Id']),
                     value=value)
I believe this is the part that is causing the issue:
value = ast.literal_eval(str(message))
The message is originally a dictionary. I have to convert it to a JSON string in order to serialize it, which just looks really odd.
Could you advise how to serialize a Python class, instead of a JSON string, using AvroProducer? Thanks!
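For what it's worth, my understanding is that AvroProducer can serialize a plain dict directly against the registered value schema, so a hedged sketch (producer_config, key_schema, value_schema, the topic name and the record fields are all placeholders here) might look like:

from confluent_kafka.avro import AvroProducer

producer = AvroProducer(producer_config,
                        default_key_schema=key_schema,
                        default_value_schema=value_schema)

record = {"Id": "123", "stuff": "stuff2"}  # hypothetical fields matching value_schema
producer.produce(topic="my_topic", key=str(record["Id"]), value=record)
producer.flush()

A class instance could first be turned into such a dict, for example with vars(obj) or dataclasses.asdict(obj), instead of going through str() and ast.literal_eval().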

Python xml.etree.ElementTree getroot() doesn't work [duplicate]

I was using the parse function to modify an XML file and it worked, but when I tried to use .fromstring it showed an error:
AttributeError: 'Element' object has no attribute 'getroot'
Here is the relevant part of the code:
AttributeError: 'Element' object has no attribute 'getroot'
This is because the fromstring() method already returns the root element. So whatever variable stores the result of fromstring() is the root.
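A minimal sketch of the difference (the file name and XML string below are just illustrative):

import xml.etree.ElementTree as ET

tree = ET.parse("example.xml")      # parse() returns an ElementTree...
root_from_file = tree.getroot()     # ...so getroot() is needed to reach the root Element

root_from_string = ET.fromstring("<a><b/></a>")  # fromstring() returns the root Element directly
print(root_from_string.tag)         # prints "a" -- no getroot() call needed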

Want to use TriggerDagRunOperator in Airflow to trigger many sub-DAGs using only a main DAG with BashOperator (sub-DAG operator)

Unable to understand the concept of payload in Airflow with TriggerDagRunOperator. Please help me understand this term in a simple way.
The TriggerDagRunOperator triggers a DAG run for a specified dag_id. It needs a trigger_dag_id of type string and a python_callable param, which is a reference to a Python function. That function is called with the context object and a placeholder object obj that your callable fills in and returns if you want a DagRun created. This obj contains run_id and payload attributes that you can modify in your function.
The run_id should be a unique identifier for that DAG run, and the payload has to be a picklable object that will be made available to your tasks while executing that DAG run. Your function header should look like def foo(context, dag_run_obj):
Picklable simply means it can be serialized by the pickle module. For a basic understanding of this, see What can be pickled and unpickled?. The pickle protocol provides more details and shows how classes can customize the process.
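A minimal sketch against this older Airflow API (the DAG ids, the payload, and the main_dag variable are illustrative; recent Airflow versions removed python_callable from this operator):

from airflow.operators.dagrun_operator import TriggerDagRunOperator

def conditionally_trigger(context, dag_run_obj):
    # fill in a unique run_id and a picklable payload, then return the object
    # so that Airflow actually creates the DagRun
    dag_run_obj.run_id = "triggered_{}".format(context["execution_date"])
    dag_run_obj.payload = {"message": "hello from the main DAG"}
    return dag_run_obj

trigger = TriggerDagRunOperator(
    task_id="trigger_sub_dag",
    trigger_dag_id="sub_dag_id",           # hypothetical target DAG
    python_callable=conditionally_trigger,
    dag=main_dag,                          # assumes main_dag is defined elsewhere
)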
Reference: https://github.com/apache/airflow/blob/d313d8d24b1969be9154b555dd91466a2489e1c7/airflow/operators/dagrun_operator.py#L37

'TensorBoard' object has no attribute 'writer' error when using Callback.on_epoch_end()

Since Model.train_on_batch() doesn't take a callback input, I tried using Callback.on_epoch_end() in order to write my loss to TensorBoard.
However, trying to run the on_epoch_end() method results in the titular error, 'TensorBoard' object has no attribute 'writer'. Other solutions to my original problem of writing to TensorBoard involved calling the Callback.writer attribute, and running those solutions gave the same error. Also, the TensorFlow documentation for the TensorBoard class doesn't mention a writer attribute.
I'm somewhat of a novice programmer, but it seems to me that the on_epoch_end() method is also, at some point, accessing the writer attribute, and I'm confused as to why the function would use an attribute that doesn't exist.
Here's the code I'm using to create the callback:
logdir = "./logs/"
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logdir)
and this is the callback code that I try to run in my training loop:
logs = {
    'encoder': encoder_loss[0],
    'discriminator': d_loss,
    'generator': g_loss,
}
tensorboard_callback.on_epoch_end(i, logs)
where encoder_loss, d_loss, and g_loss are my scalars, and i is the batch number.
Is the error a result of some improper code on my part, or is TensorFlow trying to reference something that doesn't exist?
Also, if anyone knows another way to write to TensorBoard using Model.train_on_batch, that would also solve my problem.
Since you are using a callback without the fit method, you also need to pass your model to the TensorBoard object:
logdir = "./logs/"
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logdir)
tensorboard_callback.set_model(model=model)
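Putting the two pieces together, a hedged sketch of a training loop that logs per batch (model, batches, and the loss value are assumed to exist already; this mirrors the callback code above):

import tensorflow as tf

logdir = "./logs/"
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logdir)
tensorboard_callback.set_model(model=model)   # required when not going through fit()

for i, (x_batch, y_batch) in enumerate(batches):
    loss = model.train_on_batch(x_batch, y_batch)
    tensorboard_callback.on_epoch_end(i, {'loss': loss})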

How do I mock a file IO object so that I can override the name attribute in a unit test?

I'm trying to write a unit test that mocks opening a file and passes the handle into a function that dumps a JSON object into it. How do I create a fake IO object that mimics the behavior of an open file handle but exposes similar attributes, specifically .name?
I've read through tons of answers on here and all of them work around the problem in various ways. I've tried mock patching builtins.open, I've tried mock patching the open being called inside my module, but the main error I keep running into is that when I try to access the fake IO object's .name attribute, I get an AttributeError:
AttributeError: 'CallbackStream' object has no attribute 'name'
So here's a simple function that writes a dictionary to disk in JSON format and takes an open file handle:
def generate(data, json_file):
    # data is a dict
    logging.info(f"Writing out spec file to {json_file.name}")
    json.dump(data, json_file)
Here's what I've tried to unit test:
@patch("builtins.open", new_callable=mock_open())
def test_generate_json_returns_zero(self, mock_open):
    mocked_file = mock_open()
    mocked_file.name = "FakeFileName"
    data = {'stuff': 'stuff2'}
    generate(data, json_file=mocked_file)
However, that produces the AttributeError above, where I can't use json_file.name because it doesn't exist as an attribute. I thought setting it explicitly would work, but it didn't.
I can bypass the issue by using a temporary file, via tempfile.TemporaryFile:
def test_generate_json_returns_zero(self, mock_open):
    data = {'stuff': 'stuff2'}
    t = TemporaryFile("w")
    generate(data, json_file=t)
But that doesn't solve the actual problem, which is that I can't figure out how to mock the file handle so that I don't actually need to create a real file on disk.
I can't get past the .name not being a valid attribute. How do I mock the file object such that I could actually use the .name attribute of an IO object and still fake a json.dump() to it?
The new_callable parameter is meant to be an alternative to the Mock class, so when you write:
@patch("builtins.open", new_callable=mock_open())
it patches builtins.open by replacing it with what mock_open() returns, rather than with a MagicMock object, which is what you actually need. So change the line to simply:
@patch("builtins.open")
and it should work.
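A hedged sketch of the test with only the decorator changed as described (generate() is the function from the question, and patch comes from unittest.mock):

@patch("builtins.open")
def test_generate_json_returns_zero(self, mock_open):
    mocked_file = mock_open()             # calling the MagicMock yields a fake file handle
    mocked_file.name = "FakeFileName"     # the attribute can now be set and read normally
    data = {'stuff': 'stuff2'}
    generate(data, json_file=mocked_file)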
Your test never actually calls open, so there's no need to patch it. Just create a Mock instance with the attribute you need.
def test_generate_json_returns_zero(self):
    mocked_file = Mock()
    mocked_file.name = "FakeFileName"
    data = {'stuff': 'stuff2'}
    generate(data, json_file=mocked_file)
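If you also want to verify that json.dump actually wrote to the fake handle, a hedged follow-up assertion inside the same test could be:

    mocked_file.write.assert_called()   # json.dump serializes by calling write() on the handle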
