Set up Jupyter on EMR to read from Cassandra using CQL? - apache-spark

When I try to set up the Spark context in Jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'
or
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('SparkCassandraApp') \
.config('spark.cassandra.connection.host', 'x.x.x.x') \
.config('spark.cassandra.connection.port', 'xxxx') \
.config('spark.cassandra.output.consistency.level','ONE') \
.master('local[2]') \
.getOrCreate()
I still cannot connect to the Cassandra cluster with the following code:
dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()
dataFrame = dataFrame.limit(100)
dataFrame.show()
It fails with the error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at http://spark.apache.org/third-party-projects.html
A similar question was asked here: modify jupyter kernel to add cassandra connection in spark, but I do not see a valid answer there.
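A hedged sketch of one way to combine the two attempts (this assumes no SparkContext is already running in the notebook kernel, and that the spark-packages coordinate above resolves from the EMR cluster): fold the connector package and the connection settings into a single builder, so nothing depends on PYSPARK_SUBMIT_ARGS being picked up.
from pyspark.sql import SparkSession
# Sketch only: the package coordinate and the host/port placeholders are taken
# from the question; spark.jars.packages must be set before the first session
# (and its underlying SparkContext) is created.
spark = SparkSession.builder \
.appName('SparkCassandraApp') \
.config('spark.jars.packages', 'datastax:spark-cassandra-connector:2.4.0-s_2.11') \
.config('spark.cassandra.connection.host', 'x.x.x.x') \
.config('spark.cassandra.connection.port', 'xxxx') \
.master('local[2]') \
.getOrCreate()
dataFrame = spark.read.format("org.apache.spark.sql.cassandra") \
.option("keyspace", "keyspace") \
.option("table", "table") \
.load()
dataFrame.limit(100).show()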

Related

Read CSV file on Spark

I have started working with Spark and ran into a problem.
I tried reading a CSV file using the code below:
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
df.show(5)
Error:
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.OutOfMemoryError: Java heap space
I am working on Ubuntu Linux in VirtualBox (~/Serverspace).
You can try increasing the driver memory when creating the SparkSession, as below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.config("spark.driver.memory", "4g") \
.appName('read-csv') \
.getOrCreate()
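With more driver memory available, the original read can be retried on this session (the path is the one from the question):
# Retry the CSV read now that the driver has 4g of memory.
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
df.show(5)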

Not able to configure custom hive-metastore-client in spark

We are facing some challenges working with Spark and Hive.
We need to connect to the Hive metastore from Spark, and we have to use a custom hive-metastore client inside Spark.
Code snippet:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config('spark.sql.hive.metastore.version', '3.1.2') \
.config('spark.sql.hive.metastore.jars', 'hive-standalone-metastore-3.1.2-sqlquery.jar') \
.config("spark.yarn.dist.jars", "hive-exec-3.1.2.jar") \
.config('hive.metastore.uris', "thrift://localhost:9090") \
.enableHiveSupport() \
.getOrCreate()
The above code works with the built-in hive-metastore client but fails with the custom one, giving the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.databaseExists.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
To resolve this issue, we have configured the custom hive-exec jar in code and also tried to pass it on the command line:
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --jars hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --conf spark.executor.extraClassPath hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --driver-class-path hive-exec-3.1.2.jar
But the issue is still not resolved; any suggestions would help.
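Two hedged things worth checking (assumptions, not verified fixes): spark-submit treats everything after the application script as arguments to the script itself, so --jars, --conf and --driver-class-path need to come before spark-session.py; and since spark.sql.hive.metastore.jars controls the classpath used to instantiate the metastore client, the custom hive-exec jar may need to be on that classpath as well. A minimal sketch of the latter:
from pyspark.sql import SparkSession
# Sketch only: the jar file names are the ones from the question and are assumed
# to sit in the working directory; the value is a standard colon-separated
# classpath, as expected when spark.sql.hive.metastore.jars is not 'builtin'.
spark = SparkSession.builder \
.config('spark.sql.hive.metastore.version', '3.1.2') \
.config('spark.sql.hive.metastore.jars', 'hive-standalone-metastore-3.1.2-sqlquery.jar:hive-exec-3.1.2.jar') \
.config('hive.metastore.uris', 'thrift://localhost:9090') \
.enableHiveSupport() \
.getOrCreate()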

Cannot run spark-nlp due to Exception: Java gateway process exited before sending its port number

I have a working PySpark installation running through Jupyter on an Ubuntu VM.
Only one Java version is installed (openjdk 1.8.0_265), and I can run a local Spark (v2.4.4) session like this without problems:
import pyspark
from pyspark.sql import SparkSession
memory_gb = 24
conf = (
pyspark.SparkConf()
.setMaster('local[*]')
.set('spark.driver.memory', '{}g'.format(memory_gb))
)
spark = SparkSession \
.builder \
.appName("My Name") \
.config(conf=conf) \
.getOrCreate()
Now I want to use spark-nlp. I've installed it with pip install spark-nlp in the same virtual environment my PySpark is in.
However, when I try to use it, I get the error Exception: Java gateway process exited before sending its port number.
I've tried to follow the instructions in the documentation here, but without success.
So doing
spark = SparkSession \
.builder \
.appName("RevDNS Stats") \
.config(conf=conf) \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5")\
.getOrCreate()
only results in the error mentioned above.
How do I fix this?
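One hedged alternative (assuming the pip-installed sparknlp package matches the Spark 2.4 / Scala 2.11 build referenced in the question): let the package create the session itself instead of wiring up spark.jars.packages by hand.
import sparknlp
# Sketch only: sparknlp.start() builds (or reuses) a SparkSession with the
# matching spark-nlp package configured for it.
spark = sparknlp.start()
print(sparknlp.version())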

Error loading spark sql context for redshift jdbc url in glue

Hello, I am trying to fetch month-wise data from a bunch of large Redshift tables in a Glue job.
As far as I know, the Glue documentation on this is very limited.
The query works fine in SQL Workbench, which I connected using the same JDBC connection ('myjdbc_url') that is being used in Glue.
Below is what I have tried, and the error I am seeing:
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sql_context = SQLContext(sc)
df1 = sql_context.read \
.format("jdbc") \
.option("url", myjdbc_url) \
.option("query", mnth_query) \
.option("forward_spark_s3_credentials","true") \
.option("tempdir", "s3://my-bucket/sprk") \
.load()
print("Total recs for month :"+str(mnthval)+" df1 -> "+str(df1.count()))
However, it shows a driver error in the logs, as below:
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I have also tried the following, but to no avail; it ends up with a "Connection refused" error.
sql_context.read \
.format("com.databricks.spark.redshift")
.option("url", myjdbc_url) \
.option("query", mnth_query) \
.option("forward_spark_s3_credentials","true") \
.option("tempdir", "s3://my-bucket/sprk") \
.load()
What is the correct driver to use? I am using Glue, which is a managed service with a transient cluster in the background, so I am not sure what I am missing.
Please help me figure out the right driver.
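One hedged thing to try (an assumption, not a confirmed fix): with the plain jdbc source, name the Redshift driver class explicitly and make sure its jar is on the job's classpath, for example through the Glue connection attached to the job or the --extra-jars job parameter. Note that forward_spark_s3_credentials and tempdir belong to the databricks Redshift source, not to plain jdbc.
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
# Sketch only: the driver class name assumes the Amazon Redshift JDBC 4.2 driver
# jar is available to the job; myjdbc_url and mnth_query are the question's own
# variables.
sc = SparkContext.getOrCreate()
sql_context = SQLContext(sc)
df1 = sql_context.read \
.format("jdbc") \
.option("driver", "com.amazon.redshift.jdbc42.Driver") \
.option("url", myjdbc_url) \
.option("query", mnth_query) \
.load()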

Loading data from GCS using Spark Local

I am trying to read data from GCS buckets on my local machine for testing purposes. I would like to sample some of the data in the cloud.
I have downloaded the GCS Hadoop connector JAR and set up the SparkConf as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
conf = SparkConf() \
.setMaster("local[8]") \
.setAppName("Test") \
.set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
.set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/keyfile")
sc = SparkContext(conf=conf)
spark = SparkSession.builder \
.config(conf=sc.getConf()) \
.getOrCreate()
spark.read.json("gs://gcs-bucket")
I have also tried to set the conf like so:
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.json.keyfile", "path/to/keyfile")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
I am using PySpark installed via pip and running the code with the unittest module from IntelliJ. The error I get is:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.json.
: java.io.IOException: No FileSystem for scheme: gs
What should I do?
Thanks!
To solve this issue, you need to add the fs.gs.impl property in addition to the properties you have already configured:
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
