Error while using Crealytics package to read Excel file - excel

I'm trying to read an Excel file from HDFS location using Crealytics package and keep getting an error (Caused by: java.lang.ClassNotFoundException:org.apache.spark.sql.connector.catalog.TableProvider). My code is below. Any tips? When running the below code, the spark session initiates fine and the Crealytics package loads without error. The error only comes when running the "spark.read" code. The file location I'm using is accurate.
def spark_session(spark_conf):
conf = SparkConf()
for (key, val) in spark_conf.items():
conf.set(key, val)
spark = SparkSession \
.builder \
.enableHiveSupport() \
.config(conf=conf) \
.getOrCreate()
return spark
spark_conf = {"spark.executor.memory": "16g",
"spark.yarn.executor.memoryOverhead": "3g",
"spark.dynamicAllocation.initialExecutors": 2,
"spark.driver.memory": "16g",
"spark.kryoserializer.buffer.max": "1g",
"spark.driver.cores": 32,
"spark.executor.cores": 8,
"spark.yarn.queue": "adhoc",
"spark.app.name": "CDSW_basic",
"spark.dynamicAllocation.maxExecutors": 32,
"spark.jars.packages": "com.crealytics:spark-excel_2.12:0.14.0"
}
df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader", "true") \
.load("/user/data/Block_list.xlsx")
I've also tried loading it outside of the session function with the code below yielding the same error once I try to read the file.
crealytics_driver_loc = "com.crealytics:spark-excel_2.12:0.14.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + crealytics_driver_loc + ' pyspark-shell'

Looks like I'm answering my own question. After a great deal of fiddling around, I've found that using an old version of crealytics works with my setup, though I'm uncertain why. The package that worked was version 13 ("com.crealytics:spark-excel_2.12:0.13.0"), though the newest is version 15.

Related

Pyspark session configuration overwriting

I recently come across below program , after new session values are declared in variable , why stopping spark context ?
usually , spark session will be stopped at the end but here , why stopping after new configuration declaration ?
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("MyApp") \
.config("spark.driver.host", "localhost") \
.getOrCreate()
default_conf = spark.sparkContext._conf.getAll()
print(default_conf)
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
('spark.app.name', 'Spark Updated Conf'),
('spark.executor.cores', '4'),
('spark.cores.max', '4'),
('spark.driver.memory','4g')])
spark.sparkContext.stop()
spark = SparkSession \
.builder \
.appName("MyApp") \
.config(conf=conf) \
.getOrCreate()
default_conf = spark.sparkContext._conf.get("spark.cores.max")
print("updated configs " , default_conf)
I am trying to understand
I think it's only example code to show that you cannot modify running session config. You can create it, get the config that was generated, then stop and recreate with updated values.
Otherwise, it doesn't make any sense, because you'd just use the config you needed from the beginning...
default_conf variable at the end is only the cores setting, however.

closing pydeequ callback server

I'm using pydeequ with Spark 3.0.1 to perform some constraint checks on data.
As for testing with the VerificationSuite, after calling VerificationResult.checkResultsAsDataFrame(spark, result), it seems that the callback server which gets started by pydeequ does not get terminated automatically.
If I run code containing pydeequ on an EMR cluster for example, the port 25334 seems to stay open after the spark application closes, unless I explicitly create a JavaGateway with the spark session, and call the close() method.
from pydeequ.verification import *
from pyspark.sql import SparkSession, Row
spark = (SparkSession
.builder
.config("spark.jars.packages", pydeequ.deequ_maven_coord)
.config("spark.jars.excludes", pydeequ.f2j_maven_coord)
.getOrCreate())
df = spark.sparkContext.parallelize([
Row(a="foo", b=1, c=5),
Row(a="bar", b=2, c=6),
Row(a="baz", b=None, c=None)]).toDF()
from py4j.java_gateway import JavaGateway
check = Check(spark, CheckLevel.Warning, "Review Check")
checkResult = VerificationSuite(spark) \
.onData(df) \
.addCheck(
check.hasSize(lambda x: x < 3) \
.hasMin("b", lambda x: x == 0) \
.isComplete("c") \
.isUnique("a") \
.isContainedIn("a", ["foo", "bar", "baz"]) \
.isNonNegative("b")) \
.run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)
a = JavaGateway(spark.sparkContext._gateway)
a.close()
If I don't implement the last 2 lines of code, the callback server stays open on the port.
Is there a way around this?
PyDeequ github says to use these to shutdown Spark session:
spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()

PySpark Kafka - NoClassDefFound: org/apache/commons/pool2

I am encountering problem with printing the data to console from kafka topic.
The error message I get is shown in below image.
As you can see in the above image that after batch 0 , it doesn't process further.
All this are snapshots of the error messages. I don't understand the root cause of the errors occurring. Please help me.
Following are kafka and spark version:
spark version: spark-3.1.1-bin-hadoop2.7
kafka version: kafka_2.13-2.7.0
I am using the following jars:
kafka-clients-2.7.0.jar
spark-sql-kafka-0-10_2.12-3.1.1.jar
spark-token-provider-kafka-0-10_2.12-3.1.1.jar
Here is my code:
spark = SparkSession \
.builder \
.appName("Pyspark structured streaming with kafka and cassandra") \
.master("local[*]") \
.config("spark.jars","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraLibrary","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.driver.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
#streaming dataframe that reads from kafka topic
df_kafka=spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers",kafka_bootstrap_servers)\
.option("subscribe",kafka_topic_name)\
.option("startingOffsets", "latest") \
.load()
print("Printing schema of df_kafka:")
df_kafka.printSchema()
#converting data from kafka broker to string type
df_kafka_string=df_kafka.selectExpr("CAST(value AS STRING) as value")
# schema to read json format data
ts_schema = StructType() \
.add("id_str", StringType()) \
.add("created_at", StringType()) \
.add("text", StringType())
#parse json data
df_kafka_string_parsed=df_kafka_string.select(from_json(col("value"),ts_schema).alias("twts"))
df_kafka_string_parsed_format=df_kafka_string_parsed.select("twts.*")
df_kafka_string_parsed_format.printSchema()
df=df_kafka_string_parsed_format.writeStream \
.trigger(processingTime="1 seconds") \
.outputMode("update")\
.option("truncate","false")\
.format("console")\
.start()
df.awaitTermination()
The error (NoClassDefFound, followed by the kafka010 package) is saying that spark-sql-kafka-0-10 is missing its transitive dependency on org.apache.commons:commons-pool2:2.6.2, as you can see here
You can either download that JAR as well, or you can change your code to use --packages instead of spark.jars option, and let Ivy handle downloading transitive dependencies
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache...'
spark = SparkSession.bulider...

Load/import CSV file in to mongodb using PYSPARK

I want to know how to load/import a CSV file in to mongodb using pyspark. I have a csv file named cal.csv placed in the desktop. Can somebody share the code snippet.
First read the csv as pyspark dataframe.
from pyspark import SparkConf,SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(conf = conf)
sql = SQLContext(sc)
df = sql.read.csv("cal.csv", header=True, mode="DROPMALFORMED")
Then write it to mongodb,
df.write.format('com.mongodb.spark.sql.DefaultSource').mode('append')\
.option('database',NAME).option('collection',COLLECTION_MONGODB).save()
Specify the NAME and COLLECTION_MONGODB as created by you.
Also, you need to give conf and packages alongwith spark-submit according to your version,
/bin/spark-submit --conf "spark.mongodb.inuri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME?readPreference=primaryPreferred"
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME"
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
tester.py
Specify COLLECTION_NAME and DATABASE above. tester.py assuming name of the code file. For more information, refer this.
This worked for me. database:people Collection:con
pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/people.con?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/people.con" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
from pyspark.sql import SparkSession
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/people.con") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/people.con") \
.getOrCreate()
df = spark.read.csv(path = "file:///home/user/Desktop/people.csv", header=True, inferSchema=True)
df.printSchema()
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").option("database","people").option("collection", "con").save()
Next go to mongo and check if collection is wrtten by following below steps
mongo
show dbs
use people
show collections
db.con.find().pretty()

Apache Spark 2.0 (PySpark) - DataFrame Error Multiple sources found for csv

I am trying to create a dataframe using the following code in Spark 2.0. While executing the code in Jupyter/Console, I am facing the below error. Can someone help me how to get rid of this error?
Error:
Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
at scala.sys.package$.error(package.scala:27)
Code:
from pyspark.sql import SparkSession
if __name__ == "__main__":
session = SparkSession.builder.master('local')
.appName("RealEstateSurvey").getOrCreate()
df = session \
.read \
.option("inferSchema", value = True) \
.option('header','true') \
.csv("/home/senthiljdpm/RealEstate.csv")
print("=== Print out schema ===")
session.stop()
The error is because you must have both libraries (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat and com.databricks.spark.csv.DefaultSource) in your classpath. And spark got confused which one to choose.
All you need is tell spark to use com.databricks.spark.csv.DefaultSource by defining format option as
df = session \
.read \
.format("com.databricks.spark.csv") \
.option("inferSchema", value = True) \
.option('header','true') \
.csv("/home/senthiljdpm/RealEstate.csv")
Another alternative is to use load as
df = session \
.read \
.format("com.databricks.spark.csv") \
.option("inferSchema", value = True) \
.option('header','true') \
.load("/home/senthiljdpm/RealEstate.csv")
If anyone faced a similar issue in Spark Java, it could be because you have multiple versions of the spark-sql jar in your classpath. Just FYI.
I had faced the same issue, and got fixed when changed the Hudi version used in pom.xml from 9.0 to 11.1

Resources