psycopg2 fails on AWS Glue on subpackage _psycopg

I am trying to get a Glue Spark job running with Python to talk to a Redshift cluster.
But I am having trouble getting psycopg2 to run ... has anybody got this working? It complains about a sub-package _psycopg.
Help please! Thanks.

AWS Glue has trouble with modules that aren't pure Python libraries. Try using pg8000 as an alternative.
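For what it's worth, a minimal connection sketch with pg8000 looks roughly like this (the cluster endpoint, database, and credentials below are placeholders, not real values):

import pg8000

# Placeholder Redshift endpoint and credentials; pg8000 is pure Python,
# so it can be shipped to Glue without compiled extensions.
conn = pg8000.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="dev",
    user="awsuser",
    password="your-password",
)

cursor = conn.cursor()
cursor.execute("SELECT current_date")
print(cursor.fetchone())
conn.close()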

Now with Glue version 2 you can pass Python libraries in as parameters to Glue jobs. I used psycopg2-binary instead of psycopg2 and it worked for me. Then in the code I did import psycopg2. The job parameter to use is:
--additional-python-modules
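For illustration, here is a rough sketch of setting that parameter when creating the job with boto3 (the job name, role ARN, script location, and pinned version are placeholders, not from the original setup):

import boto3

glue = boto3.client("glue")

# Placeholder job definition; --additional-python-modules tells Glue to
# pip-install the wheel-based psycopg2-binary before the script runs.
glue.create_job(
    Name="redshift-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/redshift_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
    DefaultArguments={"--additional-python-modules": "psycopg2-binary==2.9.9"},
)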

Related

FbProphet, Apache Spark in Colab and AWS SageMaker/Lambda

I am using Google Colab to create a model with FbProphet, and I am trying to use Apache Spark in the Colab notebook itself. Can I upload this Colab notebook to AWS SageMaker/Lambda for free (without being charged for Apache Spark, only for AWS SageMaker)?
In short, you can upload the notebook into SageMaker without any issue. A few things to keep in mind:
If you are using the pyspark library in Colab and running Spark locally, you should be able to do the same by installing the necessary pyspark libraries in SageMaker Studio kernels; here you will only pay for the underlying compute for the notebook instance (see the sketch below). If you are experimenting, then I would recommend you use https://studiolab.sagemaker.aws/ to create a free account and try things out.
If you had a separate Spark cluster set up, then you may need a similar setup in AWS using EMR so that you can connect to the cluster to execute the job.
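A minimal sketch of the first option, assuming pyspark has been pip-installed into the Studio kernel (the app name and sample data are made up):

# Spark runs locally on the notebook's own compute; no external cluster needed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("prophet-experiment")
    .getOrCreate()
)

# Quick smoke test that the local session works.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()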

How to load elasticsearch in python using sqlalchemy?

I am trying to connect to Elasticsearch using the line below in a Jupyter notebook:
engine = create_engine("elasticsearch+https://user:pwd#host:9200/")
however it gives the error:
Can't load plugin: sqlalchemy.dialects:elasticsearch.https
Can anyone please help?
TLDR; Simply install elasticsearch-dbapi:
pip install elasticsearch-dbapi
Details:
SQLAlchemy uses "dialects" to support reading from and writing to different DBMSs.
SQLAlchemy natively supports MS SQL Server, Oracle, MySQL, Postgres and SQLite as well as others. Here is the full list: https://docs.sqlalchemy.org/en/14/dialects/
Elastic is not in that list. Hence, you need to install a library that delivers the dialect for reading/writing from/to Elastic.
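Once it is installed, a connection along the lines of the question should work with the elasticsearch+https dialect, roughly like this (host, credentials, and index name are placeholders):

from sqlalchemy import create_engine, text

# elasticsearch-dbapi registers the elasticsearch+http(s) dialects with SQLAlchemy.
engine = create_engine("elasticsearch+https://user:password@host:9200/")

with engine.connect() as conn:
    # Elasticsearch SQL treats each index roughly like a table; my_index is a placeholder.
    for row in conn.execute(text("SELECT * FROM my_index LIMIT 5")):
        print(row)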

PySpark / Kafka - org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

So, I'm working on setting up a development environment for working with PySpark and Kafka. I'm working through getting things set up so I can run this tutorial in a Jupyter notebook as a 'hello world' exercise: https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
Unfortunately, I'm currently hitting the following error when I attempt to connect to the Kafka stream:
Py4JJavaError: An error occurred while calling o68.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:583)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:805)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:723)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
...
Now, some digging has told me that the most common cause of this issue is a version mismatch (in either the Spark or the Scala version in use). However, I'm able to confirm that these are aligned properly:
Spark: 3.1.1
Scala: 2.12.10
conf/spark-defaults.conf
...
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
EDIT
So, some additional observations from trying to figure this out:
It looks like this is at least partially a Jupyter notebook issue, as I can now get things working just fine via the pyspark shell.
Looks like my notebook is firing up its own instance of Spark, so maybe there's some difference in how Spark is being run there vs from a terminal window?
At a loss for how they're different though, as both environments should be using mostly default configurations.
Ok - looks like it doesn't work when invoked via the regular Python REPL either, which is leading me to think there's something different about the spark context being created by the pyspark shell and the one I'm creating in my notebook.
Ok - looks like something differs when things are run via Jupyter - hadoop.common.configuration.version has a value of 0.23.0 for the notebook instance, but 3.0.0 for the pyspark shell instance. Not sure why this might be or what it may mean yet.
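For reference, that value can be read from a running session roughly like this (via the internal _jsc handle on the active SparkSession named spark):

# Read the Hadoop configuration of whatever Spark session is active.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("hadoop.common.configuration.version"))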
What else should I check to confirm that this is set up correctly?
Ok - so it looks like the difference was that findspark was locating and using a different Spark Home directory (one that came installed with the pyspark installation via pip).
It also looks like Spark 3.1.1 for Hadoop 2.7 has issues with the Kafka client (or maybe needs to be configured differently) but Spark 3.1.1 for Hadoop 3.2 works fine.
The solution was to ensure that I explicitly chose my SPARK_HOME by passing the spark_home path to findspark.init():
findspark.init(spark_home='/path/to/desired/home')
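Putting it together in the notebook looks roughly like this (the Spark home path, broker address, and topic name are placeholders):

import findspark

# Point findspark at the Spark distribution you actually intend to use.
findspark.init(spark_home="/opt/spark-3.1.1-bin-hadoop3.2")

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-notebook-test")
    # Same coordinates as in spark-defaults.conf; must match the Spark/Scala versions.
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
    .getOrCreate()
)

# Minimal streaming read from Kafka to confirm the provider loads.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test-topic")
    .load()
)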
Things to watch out for that got me and might trip you up too:
If you've installed pyspark through pip/mambaforge, this will also deploy a second SPARK_HOME, which can create dependency/library confusion.
Many of the scripts in bin/ use SPARK_HOME to determine where to execute, so don't assume that just because you're running a script from one home, you're running Spark in that home.

Is there any way for using Spark SQL Connector in zeppelin Notebook

I want to use the Spark SQL Connector to read and write data to SQL Server. As a proof of concept I thought of using a Zeppelin notebook to perform the task.
I am able to load the dependency using the lines of code below in Zeppelin:
%spark.dep
z.load("com.microsoft.azure:azure-sqldb-spark:1.0.2")
But I am not able to use any of the import statements from that package, like:
import com.microsoft.azure.sqldb.spark.config._
import com.microsoft.azure.sqldb.spark.connect._
import com.microsoft.azure.sqldb.spark.query._
import com.microsoft.azure.sqldb.spark._
import com.microsoft.azure.sqldb.spark.bulkcopy._
I am getting an error that object azure is not a member of package com.microsoft. Does anyone have any idea why this would be happening?
Try adding this dependency in Interpreter -> Spark -> Dependencies; in that case it's not necessary to use z.load. Restart the interpreter after saving the artifact.

Running Python 3 streaming job in EMR without bootstraping

I haven't had a chance to try it yet, but I am wondering whether EMR instances have Python 3 installed out of the box. From my experiments I know they have Python 2 for sure. If there is an easy way of checking the default (installed) packages for a given AMI, that would be great to know.
According to "AMI versions supported", Python 3 is not supported yet.
