How to lazy load log messages in Glue python? - apache-spark

Running a glueetl AWS Glue job (GlueVersion 2.0, Python 3), I am trying to use the recommended Python lazy log-message formatting:
logger.info("Attempting to run python module 1 {entrypoint}".format(entrypoint=entrypoint))
logger.info("Attempting to run python module 2 %s", entrypoint)
This produces an error on the second line but the first line succeeds and prints the string.
2021-06-25 04:12:58,782 INFO [Thread-7] log.GlueLogger (GlueLogger.scala:info(8)): Attempting to run python module 1 main.test_program
2021-06-25 04:12:58,818 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
File "/tmp/main_etl_script.py", line 103, in <module>
spark=spark)
File "/tmp/main_etl_script.py", line 32, in main_handler
logger.info("Attempting to run python module 2 %s", entrypoint)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o85.info. Trace:
py4j.Py4JException: Method info([class java.lang.String, class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
My logger setup is simple:
import logging
...
glue_context = GlueContext(SparkContext())
logger = glue_context.get_logger()
Why does this occur?

The GlueLogger returned by glue_context.get_logger() is a py4j proxy to a Scala logger whose info method takes a single string. Passing a second positional argument makes py4j look for an info(String, String) method on the Java side, which does not exist (that is exactly what the Py4JException says), so the standard library's lazy %s style fails here:
logger.info("Attempting to run python module 2 %s", entrypoint)
Build the string on the Python side instead, for example with an f-string:
logger.info(f"Attempting to run python module 2 {entrypoint}")

Related

Method writeStream does not exist

I'm trying to run PySpark readStream/writeStream using Python multiprocessing as a Databricks job, not directly from a notebook but from a .py script saved in dbfs:. I can run the readStream/writeStream from a notebook, but when I try to do the same from the file it throws Method writeStream([]) does not exist.
Code
import multiprocessing
import time
from pyspark.sql import SparkSession

def _get_c3_value(df, epochId, v):
    print('inside get_c3')
    A = df.collect()[0][0]
    v.value = A
    print(A)

def _readStream_c3(v):
    spark = SparkSession.builder.getOrCreate()
    df = (spark.readStream.format('delta')
          .option('readChangeFeed', 'true')
          .option('startingVersion', 'latest')
          .option('_change_type', 'insert')
          .load("dbfs:/mnt/path/to/file")
          .writeStream
          .foreachBatch(lambda df, epochId: _get_c3_value(df, epochId, v))
          .outputMode('append')
          .start())
    df.awaitTermination()

if __name__ == '__main__':
    spark = (SparkSession.builder
             .appName('multi-processing')
             .master('spark://1.2.3.4:8077')
             .getOrCreate())
    for n in spark.sparkContext.getConf().getAll():
        print(n)
    v = multiprocessing.Value('d', 0.0)
    p1 = multiprocessing.Process(target=_readStream_c3, args=(v, ))
    p1.start()
    p1.join()
StdOut
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/tmp/tmp3syog799.py", line 22, in _readStream_c3
df = (spark.readStream.format('delta').option('readChangeFeed','true').option('startingVersion','latest').option('_change_type', 'insert').load("dbfs:/mnt/path/to/file").writeStream.foreachBatch(lambda df, epochId: _get_c3_value(df, epochId)).outputMode('append').start())
File "/databricks/spark/python/pyspark/sql/dataframe.py", line 272, in writeStream
return DataStreamWriter(self)
File "/databricks/spark/python/pyspark/sql/streaming.py", line 755, in __init__
self._jwrite = df._jdf.writeStream()
File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/sql/utils.py", line 117, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py", line 330, in
get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling o1311.writeStream. Trace:
py4j.Py4JException: Method writeStream([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:341)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:349)
at py4j.Gateway.invoke(Gateway.java:286)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
StdErr
ERROR: Query termination received for [id=0bf6163e-797c-4918-a057-8a90fa1f9d2d, runId=679fb0b5-ec0c-456e-939b-c0ba4c34325f], with exception: py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Object ID unknown
at py4j.Protocol.getReturnValue(Protocol.java:476)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
at com.sun.proxy.$Proxy102.call(Unknown Source)
at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$.$anonfun$callForeachBatch$1(ForeachBatchSink.scala:208)
at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$.$anonfun$callForeachBatch$1$adapted(ForeachBatchSink.scala:208)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatchLegacy(ForeachBatchSink.scala:89)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.$anonfun$addBatch$1(ForeachBatchSink.scala:68)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:680)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:719)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:239)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:386)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:186)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:336)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:717)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:301)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:299)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:717)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$5(MicroBatchExecution.scala:283)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.withSchemaEvolution(MicroBatchExecution.scala:816)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:280)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:301)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:299)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:239)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:233)
at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:359)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:250)
What am I doing wrong? I can run the streaming part fine in a notebook but not as a Python script.
Thanks in advance!

Dynamically creating a structured streaming pipeline with python

I'm trying to create a structured streaming pipeline that will read N Kafka topics, do some payload validation, explode the payload and write to:
N Kafka topics
Amazon S3
I've followed this article to generate the pipeline.
The shape of my pipeline can either be:
Subscription
|
---process---
| | | | |
N outputs
or
N Subscriptions
| | | | |
N Processes
| | | | |
N outputs
This is the code I'm using:
spark = SparkSession \
    .builder \
    .appName(f"ingest") \
    .master("local[*]") \
    .getOrCreate()

def start_job(spark, topic):
    # pipeline logic here
    ...

for topic in pipeline_config.list_topics():
    thread = threading.Thread(target=start_job, args=(spark, topic))
    thread.start()

spark.streams.awaitAnyTermination()
Whenever I run this I get java.util.ConcurrentModificationException: Another instance of this query was just started by a concurrent session:
Exception in thread Thread-2 (start_job):
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Exception in thread Thread-3 (start_job):
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 946, in run
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs) self._target(*self._args, **self._kwargs)
File "/Users/USER/git/schema-tools/pipeline2.py", line 164, in start_job
File "/Users/USER/git/schema-tools/pipeline2.py", line 164, in start_job
parsed_with_metadata \parsed_with_metadata \
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/streaming.py", line 1491, in start
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/streaming.py", line 1491, in start
return self._sq(self._jwrite.start())
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/java_gateway.py", line 1304, in __call__
return self._sq(self._jwrite.start())
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
return_value = get_return_value( File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/utils.py", line 111, in deco
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
return f(*a, **kw)
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o314.start.
: java.util.ConcurrentModificationException: Another instance of this query was just started by a concurrent session.
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:411)
at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:466)
at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:456)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:301)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
: An error occurred while calling o312.start.
: java.util.ConcurrentModificationException: Another instance of this query was just started by a concurrent session.
Is this really not possible?
Running on:
py3.10
spark 3.1.2
packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.1.2, org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2, org.apache.commons:commons-pool2:2.11.1
Mac M1 pro
PS: I can do something similar using foreachBatch, but I really don't like that approach.
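For reference, a rough sketch of the foreachBatch fan-out mentioned in the PS, assuming a single subscription whose micro-batches are written to several sinks; parsed_df, the broker address, topic, S3 path and checkpoint location are placeholders rather than values from the question:
def fan_out(batch_df, epoch_id):
    # Inside foreachBatch the micro-batch behaves like a static DataFrame,
    # so it can be written to any number of batch sinks.
    batch_df.persist()
    (batch_df.selectExpr("to_json(struct(*)) AS value")
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
        .option("topic", "output_topic")                     # placeholder
        .save())
    batch_df.write.mode("append").parquet("s3a://bucket/exploded/")  # placeholder
    batch_df.unpersist()

(parsed_df.writeStream
    .foreachBatch(fan_out)
    .option("checkpointLocation", "/tmp/checkpoints/ingest")  # placeholder
    .start())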

Validate Email using validate_email package in azure databricks for 300k records result in timeout error

I am trying to validate 300,000 email addresses using the validate_email package and write the result to a CSV in Azure Databricks, and I am getting a timeout error.
Py4JJavaError Traceback (most recent call last)
<command-365284720716518> in <module>()
----> 1 latest_dup_df.write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/MailResult/latest_dup_df_all")
/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
736 self._jwrite.save()
737 else:
--> 738 self._jwrite.save(path)
739
740 #since(1.4)
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o548.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:192)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:110)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:128)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:187)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:183)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:134)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:114)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:306)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:292)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:235)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 4.0 failed 4 times, most recent failure: Lost task 2.3 in stage 4.0 (TID 16, 10.139.64.8, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 480, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 472, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 456, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/databricks/spark/python/pyspark/serializers.py", line 149, in dump_stream
for obj in iterator:
File "/databricks/spark/python/pyspark/serializers.py", line 445, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/databricks/spark/python/pyspark/worker.py", line 87, in <lambda>
return lambda *a: f(*a)
File "/databricks/spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "<command-3884158641112366>", line 6, in <lambda>
File "/databricks/python/lib/python3.5/site-packages/validate_email.py", line 134, in validate_email
mx_hosts = get_mx_ip(hostname)
File "/databricks/python/lib/python3.5/site-packages/validate_email.py", line 102, in get_mx_ip
MX_DNS_CACHE[hostname] = DNS.mxlookup(hostname)
File "/databricks/python/lib/python3.5/site-packages/DNS/lazy.py", line 56, in mxlookup
l = dnslookup(name, qtype, timeout)
File "/databricks/python/lib/python3.5/site-packages/DNS/lazy.py", line 38, in dnslookup
result = Base.DnsRequest(name=name, qtype=qtype).req(timeout=timeout)
File "/databricks/python/lib/python3.5/site-packages/DNS/Base.py", line 324, in req
self.sendUDPRequest(server)
File "/databricks/python/lib/python3.5/site-packages/DNS/Base.py", line 377, in sendUDPRequest
raise first_socket_error
File "/databricks/python/lib/python3.5/site-packages/DNS/Base.py", line 352, in sendUDPRequest
r=self.processUDPReply()
File "/databricks/python/lib/python3.5/site-packages/DNS/Base.py", line 135, in processUDPReply
raise TimeoutError('Timeout')
DNS.Base.TimeoutError: Timeout
I am using Azure Databricks with Python 3 and the py3dns package, and I am very new to Spark and Azure Databricks. I also tried setting DNS.defaults['server']=['8.8.8.8', '8.8.4.4'], but I am still unable to resolve the problem. Below is the code I tried. Is there an efficient way to validate 300,000 (3 lakh) email addresses? It takes 7-8 hours, then the job gets aborted and I get the timeout error. I also tried with Python 2; same issue.
import DNS
DNS.defaults['server']=['8.8.8.8', '8.8.4.4']
from email_validator import validate_email, EmailNotValidError
from validate_email import validate_email
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
validate_mail_udf = udf(lambda x : validate_email(x,verify=True), BooleanType())
upd_df = upd_df.withColumn('is_mail_valid', validate_mail_udf(('mail_id')))
upd_df.write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/MailResult/")
The expected result is a new column named 'is_mail_valid' holding booleans that indicate whether each of the 300,000 addresses really exists, with the resulting DataFrame written to a CSV in Azure Databricks.
If you are using Spark 2.3 or above, you can use vectorized (pandas) UDFs, which are backed by PyArrow. See the link below for more details; a rough sketch follows.
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
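A rough sketch of what that could look like for the question's DataFrame, assuming the address column is mail_id as in the code above; this is untested, and the SMTP verification cost per address is unchanged, only the per-row serialization overhead shrinks:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType
from validate_email import validate_email

@pandas_udf(BooleanType())
def validate_mail_vec(mail_ids):
    # Called once per Arrow batch, but still one validate_email call per address.
    return mail_ids.apply(lambda m: validate_email(m, verify=True))

upd_df = upd_df.withColumn('is_mail_valid', validate_mail_vec('mail_id'))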
validate_email(x, verify=True) checks with the SMTP server whether the host exists, so you are essentially spamming the servers with 300,000 requests, which is likely to get you blocked at some stage and prevent further validation.
If your goal is to make sure those emails are valid, you can run this package with verify=False which is just going to run every email against a regex. This operation on 300k records should be really quick, no more than minutes.
If you want to verify whether an email actually exists, there is no good way to do this in bulk as email servers should be doing their best to prevent you from doing that :) validate_email takes a timeout parameter though, so you can try to increase that from default, but overall I'd advise very much against this approach.
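As a concrete illustration of the regex-only route, a minimal sketch reusing the names from the question (verify=False skips the DNS/SMTP round trips, so the DNS timeouts no longer apply):
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from validate_email import validate_email

# Syntax check only: no network calls are made per address.
syntax_only_udf = udf(lambda m: validate_email(m, verify=False), BooleanType())

upd_df = upd_df.withColumn('is_mail_valid', syntax_only_udf('mail_id'))
upd_df.write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/MailResult/")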

KinesisUtils.createStream error in Spark streaming + Kinesis

I'm trying to stream data from AWS Kinesis using Spark Streaming + Kinesis Integration
My code looks like:
sc = SparkContext('local[*]', 'app_name')
ssc = StreamingContext(sc, 10)
kinesisStream = KinesisUtils.createStream(
    ssc,
    kinesisAppName='kinesis_app_name',
    streamName='kinesis_stream_name',
    endpointUrl='https://kinesis.ap-southeast-2.amazonaws.com',
    regionName='ap-southeast-2',
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=10)
The command used to run the script is spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.2.0 script.py. I'm using Spark 2.2.0 with PySpark.
The error I got:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1035, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "kinesis_to_s3.py", line 63, in
checkpointInterval=streaming_interval)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/pyspark/streaming/kinesis.py", line 92, in createStream
stsSessionName, stsExternalId)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1133, in call
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/protocol.py", line 327, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o27.createStream
Exception in thread "Thread-2" java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more

Spark 2.0 with Zeppelin 0.6.1 - SQLContext not available

I am running Spark 2.0 and zeppelin-0.6.1-bin-all on a Linux server. The default Spark notebook runs just fine, but when I try to create and run a new notebook in pyspark using sqlContext, I get the error "py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist".
I tried running some simple code:
%pyspark
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordsDF.show()
print type(wordsDF)
wordsDF.printSchema()
I get the error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 266, in
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 259, in
exec(code)
File "", line 1, in
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", line 299, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o48.createDataFrame. Trace:
py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
When I try the same code with "sqlContext = SQLContext(sc)" it works just fine.
I have tried setting the interpreter configuration "zeppelin.spark.useHiveContext false", but it did not work.
I must obviously be missing something, since this is such a simple operation. Please advise if there is any other configuration to be set or what I am missing.
I tested the same piece of code with Zeppelin 0.6.0 and it is working fine.
SparkSession is the default entry-point for Spark 2.0.0, which is mapped to spark in Zeppelin 0.6.1 (as it is in the Spark shell). Have you tried spark.createDataFrame(...)?
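For example, the snippet from the question rewritten against the SparkSession entry point, assuming the spark variable Zeppelin exposes in the %pyspark interpreter:
%pyspark
wordsDF = spark.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat',)], ['word'])
wordsDF.show()
wordsDF.printSchema()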
