Not able to connect to thrift server from spark shell - apache-spark

I am trying to connect to spark thrift server via spark shell by using following command:
val df = spark
.read
.option("url", "jdbc:hive2://localhost:10000")
.option("dbtable", "people")
.format("jdbc")
.load
error: not found: value spark
What could be the reason?

What message do you get when you type spark into the shell? make sure you have a valid spark session.
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#2bc0b8c8
is what you should see, if you don't see that in the spark-shell then you likely didn't start the shell correctly.

Related

pyspark connection to MariaDB fails with ClassNotFoundException

I'm trying to retrieve data from MariaDB with pyspark.
I created spark_session with configuration to include jdbc jar file, but couldn't solve problem. Current code to create session looks like below.
path = "hdfs://nameservice1/user/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
# or path = "/home/PATH/TO/JDBC/mariadb-java-client-2.7.1.jar"
spark = SparkSession.config("spark.jars", path)\
.config("spark.driver.extraClassPath", path)\
.config("spark.executor.extraClassPath", path)\
.enableHiveSupport()
.getOrCreate()
Note that I've tried every case of configuration I know
(Check Permission, change directory both hdfs or local, add or remove configuration ...)
And then, code to load data is.
sql = "SOME_SQL_TO_RETRIEVE_DATA"
spark = spark.read.format('jdbc').option('dbtable', sql)
.option('url', 'jdbc:mariadb://{host}:{port}/{db}')\
.option("user", SOME_USER)
.option("password", SOME_PASSWORD)
.option("driver", 'org.mariadb.jdbc.Driver')
.load()
But it fails with java.lang.ClassNotFoundException: org.mariadb.jdbc.Driver
When I tried this with spark-submit, I saw log message.
... INFO SparkContext: Added Jar /PATH/TO/JDBC/mariadb-java-client-2.7.1.jar at spark://SOME_PATH/jars/mariadb-java-client-2.7.1.jar with timestamp SOME_TIMESTAMP
What is wrong?
For anyone who suffers from same problem.
I figured out. Spark Document says that
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
So instead setting configuration on python code, I added arguments on spark-submit following this document.
spark-submit {other arguments ...} \
--driver-class-path PATH/TO/JDBC/my-jdbc.jar \
--jars PATH/TO/JDBC/my-jdbc.jar \
MY_PYTHON_SCRIPT.py

How to connect to remote Cassandra server through pyspark for write operation?

I am trying to connect to remote Cassandra server through pyspark, but it is not performing write operation in Cassandra while running cronjob. The same code works on the server on jupyter notebook, but not through cronjob.
`os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions'
from pyspark import SparkContext
sc = SparkContext("local", "keyspace_name")
sqlContext = SQLContext(sc)
Data_to_Write.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="tablename",keyspace="keyspace_name").save()`
I see this error in the cassandra logs : ERROR [Messaging-EventLoop-3-3] 2020-08-05 09:24:36,606 OutboundConnectionInitiator.java:373 - Failed to handshake with peer xx.xxx.xxx.xxx:9042(xx.xxx.xxx.xxx:9042) org.apache.cassandra.net.Crc$InvalidCrc –

saveAsTable ends in failure in Spark-yarn cluster environment

I set up a spark-yarn cluster environment, and try spark-SQL with spark-shell:
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs://hadoop_273_namenode_ip:namenode_port/spark-archive.zip
One thing to mention is the Spark is in Windows 7. After spark-shell starts up successfully, I execute the commands as below:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df_mysql_address = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://mysql_db_ip/db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "ADDRESS").option("user", "root").option("password", "root").load()
scala> df_mysql_address.show
scala> df_mysql_address.write.format("parquet").saveAsTable("address_local")
"show" command returns result-set correctly, but the "saveAsTable" ends in failure. The error message says:
java.io.IOException: Mkdirs failed to create file:/C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse/address_local/_temporary/0/_temporary/attempt_20171018104423_0001_m_000000_0 (exists=false, cwd=file:/tmp/hadoop/nm-local-dir/usercache/hduser/appcache/application_1508319604173_0005/container_1508319604173_0005_01_000003)
I expect and guess the table is to be saved in the hadoop cluster, but you can see that the dir (C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse) is the folder in my Windows 7, not in hdfs, not even in the hadoop ubuntu machine.
How could I do? Please advise, thanks.
The way to get rid of the problem is to provide "path" option prior to "save" operation as shown below:
scala> df_mysql_address.write.option("path", "/spark-warehouse").format("parquet").saveAsTable("address_l‌​ocal")
Thanks #philantrovert.

Why does Structured Streaming fail with "java.lang.IncompatibleClassChangeError: Implementing class"?

I'd like to run a Spark application using Structured Streaming with PySpark.
I use Spark 2.2 with Kafka 0.10 version.
I fail with the following error:
java.lang.IncompatibleClassChangeError: Implementing class
spark-submit command used as below:
/bin/spark-submit \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 \
--master local[*] \
/home/umar/structured_streaming.py localhost:2181 fortesting
structured_streaming.py code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming").config("spark.driver.memory", "2g").config("spark.executor.memory", "2g").getOrCreate()
raw_DF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:2181").option("subscribe", "fortesting").load()
values = raw_DF.selectExpr("CAST(value AS STRING)").as[String]
values.writeStream.trigger(ProcessingTime("5 seconds")).outputMode("append").format("console").start().awaitTermination()
You need spark-sql-kafka for structured streaming:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Also make sure that you use the same versions of Scala (2.11 above) and Spark (2.2.0) as you use on your cluster.
Please reference This
You're using spark-streaming-kafka-0-10 which currently not support python.

Spark SQL RDD loads in pyspark but not in spark-submit: "JDBCRDD: closed connection"

I have the following simple code for loading a table from my Postgres database into an RDD.
# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("GA")#.setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# func for loading table
def get_db_rdd(table):
url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
print(url)
lower = 0
upper = 1000
ret = sqlContext \
.read \
.format("jdbc") \
.option("url", url) \
.option("dbtable", table) \
.option("partitionColumn", "id") \
.option("numPartitions", 1024) \
.option("lowerBound", lower) \
.option("upperBound", upper) \
.option("password", "password") \
.load()
ret = ret.rdd
return ret
# load table, and print results
print(get_db_rdd("mytable").collect())
I run ./bin/pyspark then paste that into the interpreter, and it prints out the data from my table as expected.
Now, if I save that code to a file named test.py then do ./bin/spark-submit test.py, it starts to run, but then I see these messages spam my console forever:
17/02/16 02:24:21 INFO Executor: Running task 45.0 in stage 0.0 (TID 45)
17/02/16 02:24:21 INFO JDBCRDD: closed connection
17/02/16 02:24:21 INFO Executor: Finished task 45.0 in stage 0.0 (TID 45). 1673 bytes result sent to driver
Edit: This is on a single machine. I haven't started any masters or slaves; spark-submit is the only command I run after system start. I tried with the master/slave setup with the same results.
My spark-env.sh file looks like this:
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=800m
export SPARK_EXECUTOR_MEMORY=800m
export SPARK_EXECUTOR_CORES=2
export SPARK_CLASSPATH=/home/ubuntu/spark/pg_driver.jar # Postgres driver I need for SQLContext
export PYTHONHASHSEED=1337 # have to make workers use same seed in Python3
It works if I spark-submit a Python file that just creates an RDD from a list or something. I only have problems when I try to use a JDBC RDD. What piece am I missing?
When using spark-submit you should supply the jar to the executors.
As mentioned in spark 2.1 JDBC documents:
To get started you will need to include the JDBC driver for you
particular database on the spark classpath. For example, to connect to
postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Note: The same should be for spark-submit command
Troubleshooting
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
This is a horrible hack. I'm not considering this the answer, but it does work.
Alright, only pyspark works? Fine, then we'll use it. Wrote this Bash script:
cat $1 | $SPARK_HOME/bin/pyspark # pipe the Python file into pyspark
I run that script in my Python script that's submitting jobs. Also, I'm including the code I use to pass arguments between the processes, in case it helps someone:
new_env = os.environ.copy()
new_env["pyspark_argument_1"] = "some param I need in my Spark script" # etc...
p = subprocess.Popen(["pyspark_wrapper.sh {}".format(py_fname)], shell=True, env=new_env)
In my Spark script:
something_passed_from_submitter = os.environ["pyspark_argument_1"]
# do stuff in Spark...
I feel like Spark is better supported and (if this is a bug) less buggy with Scala than with Python 3, so that might be the better solution for now. But my script uses some files we wrote in Python 3, so...

Resources