How to connect to remote Cassandra server through pyspark for write operation? - apache-spark

I am trying to connect to a remote Cassandra server through PySpark, but the write to Cassandra is not performed when the script runs as a cron job. The same code works on the server in a Jupyter notebook, but not through cron.
`import os
# pyspark-shell appears once, as the last token; anything after it is passed as an application argument
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] --packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.0 --conf spark.cassandra.connection.host=127.0.0.1 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions pyspark-shell'
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "keyspace_name")
sqlContext = SQLContext(sc)
# Data_to_Write is an existing DataFrame whose columns match the target table
Data_to_Write.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="tablename",keyspace="keyspace_name").save()`
I see this error in the Cassandra logs: ERROR [Messaging-EventLoop-3-3] 2020-08-05 09:24:36,606 OutboundConnectionInitiator.java:373 - Failed to handshake with peer xx.xxx.xxx.xxx:9042(xx.xxx.xxx.xxx:9042) org.apache.cassandra.net.Crc$InvalidCrc
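spark-submit treats everything after the primary resource (here pyspark-shell) as application arguments, so the order of tokens in PYSPARK_SUBMIT_ARGS matters. A minimal sketch of assembling the string programmatically so that pyspark-shell always comes last is shown below. The helper name build_submit_args is hypothetical, and the sketch assumes that no option value contains spaces. It is not a confirmed fix for the cron problem described above.

```python
import os


def build_submit_args(master, packages, conf):
    """Assemble a PYSPARK_SUBMIT_ARGS string with 'pyspark-shell' as the last token.

    Hypothetical helper; assumes option values contain no spaces.
    """
    parts = ["--master", master]
    if packages:
        # --packages takes a comma-separated list of Maven coordinates
        parts += ["--packages", ",".join(packages)]
    for key, value in conf.items():
        # every Spark property needs its own --conf flag
        parts += ["--conf", f"{key}={value}"]
    parts.append("pyspark-shell")  # primary resource goes last
    return " ".join(parts)


submit_args = build_submit_args(
    master="local[*]",
    packages=["com.datastax.spark:spark-cassandra-connector_2.12:2.5.0"],
    conf={
        "spark.cassandra.connection.host": "127.0.0.1",
        "spark.sql.extensions": "com.datastax.spark.connector.CassandraSparkExtensions",
    },
)

# Must be set before the first SparkContext is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
```

Since a cron job starts with a much smaller environment than an interactive shell or a notebook server, setting the variable inside the script itself, as above, avoids depending on the shell profile.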

Related

Connecting to Cassandra from a remote client using Spark

I have two PCs: one is an Ubuntu system running Cassandra, and the other is a Windows PC.
I have installed the same versions of Java, Spark, Python and Scala on both PCs. My goal is to read data from the Cassandra instance on the other PC using Spark in a Jupyter Notebook.
On the PC that has Cassandra, I was able to read the data by connecting to Cassandra with Spark. But when I try to connect to that Cassandra instance from the remote client using Spark, the connection fails with an error.
[Figure: representation of the system]
Commands run on the Ubuntu PC that hosts Cassandra:
~/spark/bin ./pyspark --master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executor.extraJavaOptions=-Xss512m
from pyspark.sql.functions import col
# Connection hosts plus the table and keyspace to read
hosts = {"spark.cassandra.connection.host":'10.0.0.10,10.0.0.11,10.0.0.12',"table":"table_one","keyspace":"log_keyspace"}
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
a = data_frame.filter(col("col_1")<100000).select("col_1","col_2","col_3","col_4","col_5").toPandas()
Running the code above returns the data from Cassandra, and it can be displayed.
Commands used to read the data from Cassandra from the other PC:
import os
# Every Spark property needs its own --conf, and pyspark-shell must be the last token
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executor.extraJavaOptions=-Xss512m --conf spark.cassandra.connection.host=10.0.0.10 pyspark-shell'
import findspark
findspark.init()
findspark.find()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('example')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
hosts = {"spark.cassandra.connection.host":'10.0.0.10',"table":"table_one","keyspace":"log_keyspace"}
sqlContext = SQLContext(sc)
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
Running the code above fails with the error "java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html".
What can I do to fix this error?
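A "Failed to find data source" error generally means the connector package was not on the classpath, and one thing worth ruling out is a malformed PYSPARK_SUBMIT_ARGS string. The sketch below is a small, hypothetical checker (check_submit_args is not part of PySpark) that flags two common slips: a property written without --conf, and a final token other than pyspark-shell. It assumes every -- option takes exactly one value, which holds for --master, --packages and --conf but not for flags such as --verbose. The sample string is only an illustration of malformed input.

```python
import shlex


def check_submit_args(args):
    """Return a list of problems found in a PYSPARK_SUBMIT_ARGS string.

    Hypothetical helper. Assumes every '--option' is followed by exactly
    one value (true for --master, --packages and --conf).
    """
    tokens = shlex.split(args)
    problems = []
    if not tokens or tokens[-1] != "pyspark-shell":
        problems.append("last token should be 'pyspark-shell'")
    body = tokens[:-1]  # everything before the primary resource
    i = 0
    while i < len(body):
        token = body[i]
        if token.startswith("--"):
            i += 2  # skip the option and its value
        else:
            problems.append(f"stray token {token!r} (missing --conf?)")
            i += 1
    return problems


# Illustrative malformed input: a property without --conf, and 'pyspark' at the end
sample = (
    "--master spark://10.0.0.10:7077 "
    "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 "
    "spark.cassandra.connection.host=10.0.0.10 pyspark"
)
for problem in check_submit_args(sample):
    print(problem)
```

A well-formed string yields an empty list, so the check can run as an assertion before the SparkContext is created.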

Unable to connect to Elasticsearch from Spark after creating the SparkSession; it connects if the ES conf is set before SparkSession creation

I am having trouble connecting to Elasticsearch from Spark.
The connection works if we pass the ES details through spark-submit (i.e. before the SparkSession is created), as below:
spark-submit --conf spark.es.nodes=<es_address_url> --conf spark.es.port=<port_number> --conf spark.es.net.ssl=true --conf spark.es.nodes.wan.only=true --class <> <jar_location>
and create the SparkSession in code as:
val spark: SparkSession = SparkSession.builder().appName(conf.get("spark.app.name"))
.enableHiveSupport().getOrCreate()
An exception is thrown if we instead set the Elasticsearch details on the SparkSession object in code, as below, and do not pass them through spark-submit:
spark.conf.set("spark.es.nodes", <ES_URL>)
spark.conf.set("spark.es.port", <PortNumber>)
spark.conf.set("spark.es.net.ssl", "true")
spark.conf.set("spark.es.nodes.wan.only", "true")
val esTableData = spark.read
.format("org.elasticsearch.spark.sql")
.option("pushdown", "true").option("es.ignoreNulls","true")
.load(<PathOfTheIndexToRead>)
The spark-submit command in this case:
spark-submit --class <class_name> <jar_location>
The exception is:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:348)
org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:338)
... 40 more
The same issue also happens from spark-shell:
spark-shell --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.1.0
scala> spark.conf.set("spark.es.nodes","https://<URL>/")
scala> spark.conf.set("spark.es.port", "<port_number>")
scala> spark.conf.set("spark.es.net.ssl", "true")
scala> spark.conf.set("spark.es.nodes.wan.only", "true")
scala> val DF = spark.read.format("org.elasticsearch.spark.sql").option("pushdown", "true").option("es.ignoreNulls","true").option("es.field.read.empty.as.null", "no").load(<name_of_the_index>)
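The behaviour described above suggests that the spark.es.* properties are picked up when the session starts and that later spark.conf.set calls do not reach the connector. One option, offered here as an assumption and not confirmed anywhere in this thread, is to pass the same settings directly on the reader as es.* options, which elasticsearch-hadoop accepts on DataFrame reads. The sketch below is in Python (PySpark) and only builds the options; the helper name es_reader_options, the host and the port are hypothetical placeholders.

```python
def es_reader_options(spark_conf):
    """Turn 'spark.es.*' properties into 'es.*' reader options.

    Hypothetical helper: keys that do not start with 'spark.es.' are dropped.
    """
    prefix = "spark."
    return {
        key[len(prefix):]: value
        for key, value in spark_conf.items()
        if key.startswith("spark.es.")
    }


opts = es_reader_options({
    "spark.es.nodes": "es.example.internal",  # hypothetical host
    "spark.es.port": "9200",                  # hypothetical port
    "spark.es.net.ssl": "true",
    "spark.es.nodes.wan.only": "true",
    "spark.app.name": "not-an-es-setting",    # dropped by the helper
})

# With a live SparkSession named `spark` (not created in this sketch):
# df = (spark.read.format("org.elasticsearch.spark.sql")
#       .options(**opts)
#       .option("pushdown", "true")
#       .load("<name_of_the_index>"))
```

Keeping the settings in one dictionary also makes it easy to log exactly what was handed to the connector when a connection fails.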

Spark Streaming - Netcat messages are not received in Spark streaming

I am trying to test Spark Streaming on a standalone Cloudera QuickStart VM. I started the spark-shell with the following command:
spark-shell --master yarn-client --conf spark.ui.port=23123
In the spark-shell I executed the following statements:
sc.stop()
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
val conf = new SparkConf().setAppName("Spark Streaming")
val ssc = new StreamingContext(conf,org.apache.spark.streaming.Seconds(10))
val lines = ssc.socketTextStream("localhost",44444)
lines.print
In another terminal, I started netcat with the following command:
nc -lk 44444
In the spark-shell, I then started the streaming context:
ssc.start()
Up to this point everything is fine. But the messages typed into netcat are not received by Spark Streaming, and I don't know where it is going wrong.
Try spark-shell --master local[2] --conf spark.ui.port=23123 to see if it works.
If it does, then your original setup has only one executor working: it is occupied receiving messages, so no executor is left to process them.
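The reasoning behind local[2] is that each socket receiver permanently occupies one thread, so a local master needs more threads than receivers before any batch can be processed. A minimal sketch of that rule in Python follows. The helper names local_cores and can_process are hypothetical, and for non-local masters such as yarn-client the available cores depend on the cluster configuration, which this sketch does not inspect.

```python
import os
import re


def local_cores(master):
    """Return the thread count of a local master URL, or None for non-local masters."""
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None  # e.g. yarn-client: cores come from the cluster, not the URL
    value = match.group(1)
    return (os.cpu_count() or 1) if value == "*" else int(value)


def can_process(master, receivers=1):
    """True if at least one thread is left after each receiver takes one."""
    cores = local_cores(master)
    if cores is None:
        raise ValueError(f"cannot infer cores from master {master!r}")
    return cores > receivers


print(can_process("local[2]"))  # True: one thread receives, one processes
print(can_process("local"))     # False: the only thread is taken by the receiver
```

On YARN, the equivalent check is whether the application was granted more than one executor core in total.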

saveAsTable ends in failure in Spark-yarn cluster environment

I set up a Spark-on-YARN cluster environment and am trying Spark SQL with spark-shell:
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs://hadoop_273_namenode_ip:namenode_port/spark-archive.zip
One thing to mention is that Spark runs on Windows 7. After spark-shell starts up successfully, I execute the following commands:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df_mysql_address = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://mysql_db_ip/db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "ADDRESS").option("user", "root").option("password", "root").load()
scala> df_mysql_address.show
scala> df_mysql_address.write.format("parquet").saveAsTable("address_local")
The "show" command returns the result set correctly, but "saveAsTable" fails. The error message says:
java.io.IOException: Mkdirs failed to create file:/C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse/address_local/_temporary/0/_temporary/attempt_20171018104423_0001_m_000000_0 (exists=false, cwd=file:/tmp/hadoop/nm-local-dir/usercache/hduser/appcache/application_1508319604173_0005/container_1508319604173_0005_01_000003)
I expected the table to be saved in the Hadoop cluster, but as you can see, the directory (C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse) is a folder on my Windows 7 machine, not in HDFS, and not even on the Hadoop Ubuntu machine.
What should I do? Please advise, thanks.
The way to get rid of the problem is to provide the "path" option before the "save" operation, as shown below:
scala> df_mysql_address.write.option("path", "/spark-warehouse").format("parquet").saveAsTable("address_local")
Thanks @philantrovert.
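A relative path such as /spark-warehouse is resolved against the default filesystem of the job, so a fully qualified HDFS URI makes the target explicit when the driver runs on Windows. The sketch below, in Python, builds such a URI. The helper name hdfs_table_path and the namenode address are hypothetical placeholders, and the real host and port must come from the cluster configuration.

```python
import posixpath


def hdfs_table_path(namenode, warehouse_dir, table):
    """Build a fully qualified HDFS URI for a table directory (hypothetical helper)."""
    # posixpath keeps forward slashes even when the driver runs on Windows
    return "hdfs://" + namenode + posixpath.join("/", warehouse_dir.strip("/"), table)


path = hdfs_table_path("namenode.example:8020", "/spark-warehouse", "address_local")
print(path)  # hdfs://namenode.example:8020/spark-warehouse/address_local

# With an existing DataFrame `df` in PySpark (not created in this sketch):
# df.write.option("path", path).format("parquet").saveAsTable("address_local")
```

Using posixpath instead of os.path matters here because os.path.join would insert backslashes on a Windows driver.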

Not able to connect to thrift server from spark shell

I am trying to connect to the Spark Thrift server from spark-shell using the following command:
val df = spark
.read
.option("url", "jdbc:hive2://localhost:10000")
.option("dbtable", "people")
.format("jdbc")
.load
error: not found: value spark
What could be the reason?
What message do you get when you type spark into the shell? Make sure you have a valid Spark session.
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2bc0b8c8
is what you should see. If you don't see that in the spark-shell, then you likely didn't start the shell correctly.
