How to write structured stream data to a Cassandra table using pyspark?

This is my terminal command to run the strm.py file:
$SPARK_HOME/bin/spark-submit --master local --driver-memory 4g --num-executors 2 --executor-memory 4g --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 org.apache.spark:spark-cassandra-connector_2.11:2.4.0 strm.py
Error:
Cannot load main class from JAR org.apache.spark:spark-cassandra-connector_2.11:2.4.0 with URI org.apache.spark. Please specify a class through --class.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:224)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Can anyone help me figure out what the issue is and why it won't load?

You have 2 problems:
you're submitting your application incorrectly: there is no comma between org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 and org.apache.spark:spark-cassandra-connector_2.11:2.4.0, so spark-submit treats the Cassandra connector coordinate as the application JAR instead of your Python file (see the corrected command after the code sketch below).
the current version of the Spark Cassandra Connector doesn't support direct writes of Spark Structured Streaming data; that functionality is available only in DSE Analytics. But you can work around it by using foreachBatch, something like this (not tested, the working Scala code is available here):
def foreach_batch_function(df, epoch_id):
    df.write.format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "test") \
        .option("table", "my_tables") \
        .mode('append').save()

query.writeStream.foreachBatch(foreach_batch_function).start()
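For reference, the corrected submit command might look like the following. The comma fix comes from the answer above; switching the connector's group ID to com.datastax.spark is an additional assumption on my part, since that is where the Spark Cassandra Connector is normally published:
$SPARK_HOME/bin/spark-submit --master local --driver-memory 4g --num-executors 2 --executor-memory 4g --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 strm.py
The snippet above also assumes query is the streaming DataFrame. A minimal sketch of how it might be defined (the topic name and bootstrap servers are placeholders):
# Placeholder Kafka source; adjust servers and topic to your setup.
query = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_topic") \
    .load()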

Related

PySpark driver memory exceptions while reading too many small files

PySpark version: 2.3.0, HDP 2.6.5
My source is populating a Hive table (HDFS-backed) with 826 partitions and 1,557,242 small files (~40 KB each). I know this is a highly inefficient way to store data, but I don't have control over my source.
The problem is that when I need to do a historical load and scan all the files at once, the driver runs into memory exceptions. I tried setting driver-memory to 8g and 16g, and a similar configuration for the driver memory overhead, but the problem persists.
What puzzles me is that this fails while listing the files, which I presume is just metadata. Is there an explanation why file metadata would need so much memory?
py4j.protocol.Py4JJavaError: An error occurred while calling o351.saveAsTable.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.substring(String.java:1956)
at java.net.URI$Parser.substring(URI.java:2869)
at java.net.URI$Parser.parse(URI.java:3049)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$3$$anonfun$7.apply(InMemoryFileIndex.scala:251)
It would be helpful if you could share the parameters you were passing to spark-submit.
I faced a similar issue too; adjusting the parameters made it work.
Try different configs (I can't suggest exact numbers, as it depends on the server configuration).
Mine:
spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 5g \
--executor-memory 6g \
--executor-cores 3
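If the failure is on the driver while it lists the files, a variant of the above with more driver memory and overhead may help. The values below are illustrative only and your_job.py is a placeholder; spark.driver.memoryOverhead is available from Spark 2.3 onwards:
spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 16g \
--conf spark.driver.memoryOverhead=4g \
--executor-memory 6g \
--executor-cores 3 \
your_job.py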

Driver cores must be a positive number

I have upgraded Spark from version 3.1.1 to 3.2.1.
Now all existing Spark jobs break with the following error:
Exception in thread "main" org.apache.spark.SparkException: Driver cores must be a positive number
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:634)
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:257)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:234)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:119)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:1026)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1026)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
We are using Spark in cluster mode with Apache Mesos, co-located with Cassandra.
I tried a few options, e.g.:
appl/spark/bin/spark-submit --name "Testjob" --deploy-mode cluster --master mesos://<master node>:7077 --executor-cores 4 --driver-memory 1G --driver-cores 1 --class ....
Do you have any hints or solutions for this problem?
Many thanks...
cheers
Unfortunately I think it is impossible to run Spark 3.2.x with Mesos in Cluster mode because of this feature and the way MesosClusterDispatcher works.
Basically what's happening is that the Dispatcher submits the Spark application with the --driver-cores argument as a floating-point number, and Spark (SparkSubmitArguments.scala) then reads it as a String and parses it like this:
driverCores.toInt
and of course this fails.
I proposed a quick fix for this, but in the meantime I just built the code with the change I made in the PR. I also reported this as a bug.

Fail to enable hive support in Spark submit (spark 3)

I am using Spark in an integration test suite. It has to run locally and read/write files to the local file system. I also want to read/write this data as tables.
In the first step of the suite I write some Hive tables in the database feature_store, specifying
spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse. The step completes correctly and I see the files in the folder I expect.
Afterwards I run a spark-submit step with (among others) these confs
--conf spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse --conf spark.sql.catalogImplementation=hive
and when trying to read a table previously written I get
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'feature_store' not found
However, if I try to do exactly the same thing with exactly the same configs in a spark-shell, I am able to read the data.
In the spark-submit job I use the following code to get the SparkSession:
SparkSession spark = SparkSession.active();
I have also tried to use instead
SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();
but I keep getting the same problem as above.
I have understood that the problem is related to spark-submit not picking up Hive as the
catalog implementation. In fact I see that spark.catalog is not an instance of HiveCatalogImpl during spark-submit (while it is when using spark-shell).

How to distribute JDBC jar on Cloudera cluster?

I've just installed a new Spark 2.4 from CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver in order to read data from a database from within a Jupyter notebook.
I downloaded the driver and copied it to the /jars folder on one node; however, it seems that I have to do the same on each and every host (!). Otherwise I get the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there any easy way (without writing bash scripts) to distribute the jar files together with the packages across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).
Spark has a JDBC format reader you can use.
Launch a Scala shell to confirm your MS SQL Server driver is on your classpath.
Example:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath where you initialize your session.
Example:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
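For the SQL Server driver specifically, the equivalent would be something like this (the jar path is a placeholder for wherever you put the Microsoft JDBC driver):
bin/spark-shell --driver-class-path /path/to/mssql-jdbc.jar --jars /path/to/mssql-jdbc.jar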
Connect to your MS SQL Server via Spark JDBC.
Example via PySpark (the snippets below use a PostgreSQL URL, as in the Spark docs; swap in your SQL Server URL and driver):
# option1
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

# option2
jdbcDF2 = spark.read \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
Specifics and additional ways to compose connection strings can be found here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter: if you still cannot get the above to work, try setting some environment variables as described in this post (I cannot confirm whether it works, though):
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
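One approach along those lines (not verified here) is to pass the driver jar through PYSPARK_SUBMIT_ARGS before the Jupyter kernel starts; the jar path is a placeholder:
# Hypothetical: make PySpark launched from Jupyter pick up the JDBC driver.
# PySpark expects the trailing "pyspark-shell" token in this variable.
export PYSPARK_SUBMIT_ARGS="--driver-class-path /path/to/mssql-jdbc.jar --jars /path/to/mssql-jdbc.jar pyspark-shell"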
At the end of the day, all you really need is the driver jar placed on an edge node (the client where you launch Spark) and appended to your classpath there. Then make the connection and partition the read to scale performance, since a plain JDBC read from an RDBMS is single-threaded and ends up in one partition; see the sketch below.
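A minimal sketch of such a partitioned JDBC read (the column name, bounds, and SQL Server URL below are assumptions, not taken from the question):
# Hypothetical partitioned read: splits schema.tablename into 8 parallel
# partitions by ranges of the numeric column "id".
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://dbserver:1433;databaseName=mydb") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load()
Each partition issues its own range query, so the read is no longer single-threaded.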

How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?

Several postings on Stack Overflow have responses with partial information about how to access RDD tables via Spark SQL as a JDBC distributed query engine. So I'd like to ask the following questions for complete information about how to do that:
In the Spark SQL app, do we need to use HiveContext to register tables, or can we use just SQLContext?
Where and how do we use HiveThriftServer2.startWithContext?
When we run start-thriftserver.sh as in
/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host spark-master --hiveconf hive.server2.trift.port 10001
besides specifying the jar and main class of the Spark SQL app, do we need to specify any other parameters?
Are there any other things we need to do?
Thanks.
To expose DataFrame temp tables through HiveThriftServer2.startWithContext(), you may need to write and run a simple application; you may not need to run start-thriftserver.sh.
To your questions:
HiveContext is needed; in spark-shell, sqlContext is implicitly converted to a HiveContext.
Write a simple application, for example:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver._

// sparkContext is your existing SparkContext
val hiveContext = new HiveContext(sparkContext)
hiveContext.parquetFile(path).registerTempTable("my_table1")
HiveThriftServer2.startWithContext(hiveContext)
No need to run start-thriftserver.sh; run your own application instead, e.g.:
spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar
Nothing else is needed on the server side; it should start on the default port 10000.
You can verify by connecting to the server with beeline.
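For example (host and port are the defaults assumed above):
beeline -u jdbc:hive2://localhost:10000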
In Java I was able to expose a DataFrame as a temp table and read the table content via beeline (like a regular Hive table).
I haven't posted the entire program (on the assumption that you already know how to create DataFrames):
import org.apache.spark.sql.hive.thriftserver.*;
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
DataFrame orgDf = sqlContext.createDataFrame(orgPairRdd.values(), OrgMaster.class);
orgPairRdd is a JavaPairRDD; orgPairRdd.values() contains the entire class values (rows fetched from HBase).
OrgMaster is a serializable Java bean class.
orgDf.registerTempTable("spark_org_master_table");
HiveThriftServer2.startWithContext(sqlContext);
I submitted the program locally (as the Hive Thrift server is not running on port 10000 on that machine):
hadoop_classpath=$(hadoop classpath)
HBASE_CLASSPATH=$(hbase classpath)
spark-1.5.2/bin/spark-submit --name tempSparkTable --class packageName.SparkCreateOrgMasterTableFile \
--master local[4] --num-executors 4 --executor-cores 4 --executor-memory 8G \
--conf "spark.executor.extraClassPath=${HBASE_CLASSPATH}" --conf "spark.driver.extraClassPath=${HBASE_CLASSPATH}" \
--conf "spark.executor.extraClassPath=${hadoop_classpath}" \
--jars /path/programName-SNAPSHOT-jar-with-dependencies.jar \
/path/programName-SNAPSHOT.jar
In another terminal, start beeline and point it to the Thrift service started by this Spark program:
/opt/hive/hive-1.2/bin/beeline -u jdbc:hive2://<ipaddressofMachineWhereSparkPgmRunninglocally>:10000 -n anyUsername
The show tables command will display the table that you registered in Spark.
You can also run describe; in this example:
describe spark_org_master_table;
Then you can run regular queries against this table in beeline (until you kill the Spark program).
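For example, a simple query (the limit value is arbitrary):
select * from spark_org_master_table limit 10;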
