Trying to connect to a Greenplum database from PySpark, getting an error

I'm trying to connect to a Greenplum database using PySpark, but I get an error when executing the code below.
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "C:/Users/SKamaliyev/Documents/Drivers/db/postgresql-42.2.20.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://address:port/gpl") \
.option("dbtable", "dwh.vw_dm_subs_kpi_monthly") \
.option("user", "skamaliyev") \
.option("password", "passmy") \
.option("driver", "org.postgresql.Driver") \
.load()
An error:
An error occurred while calling o192.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) ...
How can I solve this?

You don't have the driver on the classpath. Try downloading the PostgreSQL JDBC driver and adding it to the path, for example as shown below.
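A minimal sketch of two common ways to make the driver visible to Spark; the jar path and Maven coordinates below are placeholders to adjust for your environment:

from pyspark.sql import SparkSession

# Option 1: point spark.jars at a local copy of the PostgreSQL JDBC driver jar
spark = SparkSession.builder \
    .appName("Greenplum via JDBC") \
    .config("spark.jars", "C:/path/to/postgresql-42.2.20.jar") \
    .getOrCreate()

# Option 2: let Spark resolve the driver from Maven Central instead
# spark = SparkSession.builder \
#     .appName("Greenplum via JDBC") \
#     .config("spark.jars.packages", "org.postgresql:postgresql:42.2.20") \
#     .getOrCreate()

When launching with spark-submit or pyspark, the equivalent is passing --jars /path/to/postgresql-42.2.20.jar (or --packages org.postgresql:postgresql:42.2.20) on the command line, so the driver is shipped to the executors as well.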

Related

Why can't PySpark show any data?

When I use local Spark on Windows as below, it works and I can see the result of df.count().
spark = SparkSession \
.builder \
.appName("Structured Streaming ") \
.master("local[*]") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
.option("subscribe", kafka_topic_name) \
.option("startingOffsets", "latest") \
.load()
flower_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")
flower_schema_string = "sepal_length DOUBLE,sepal_width DOUBLE,petal_length DOUBLE,petal_width DOUBLE,species STRING"
flower_df2 = flower_df1.select(from_csv(col("value"), flower_schema_string).alias("flower"), "timestamp").select("flower.*", "timestamp")
flower_df2.createOrReplaceTempView("flower_find")
song_find_text = spark.sql("SELECT * FROM flower_find")
flower_agg_write_stream = song_find_text \
.writeStream \
.option("truncate", "false") \
.format("memory") \
.outputMode("update") \
.queryName("testedTable") \
.start()
while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)
But when I use Spark on my VirtualBox Ubuntu VM, I never see any data.
These are the modifications I made when using Ubuntu's Spark (a sketch follows):
SparkSession's master URL: "spark://192.168.15.2:7077"
Inserted flower_agg_write_stream.awaitTermination() above the "while True:" loop
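A sketch of those two modifications, with the rest of the script unchanged from the Windows version above:

spark = SparkSession \
    .builder \
    .appName("Structured Streaming") \
    .master("spark://192.168.15.2:7077") \
    .getOrCreate()

# ... same readStream / writeStream code as above ...

flower_agg_write_stream.awaitTermination()  # inserted above the polling loop

while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)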
Did I do something wrong?
Added:
When I run the modified code, the following appears in the log:
...
org.apache.spark.sql.AnalysisException: Table or view not found: testedTable;
...
Unfortunately, I already tried createOrReplaceGlobalTempView(), but that doesn't work either.

Spark Structured Streaming with Confluent Cloud Kafka connectivity issue

I am writing a Spark Structured Streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for the readStream() function is shallow and doesn't say much about the optional parameters, especially the authentication mechanism. I am not sure which parameter is wrong and breaking the connectivity. Can anyone with Spark experience help me get this connection working?
Required parameters:
> Consumer({'bootstrap.servers':
> 'cluster.gcp.confluent.cloud:9092',
> 'sasl.username':'xxx',
> 'sasl.password': 'xxx',
> 'sasl.mechanisms': 'PLAIN',
> 'security.protocol': 'SASL_SSL',
> 'group.id': 'python_example_group_1',
> 'auto.offset.reset': 'earliest' })
Here is my PySpark code:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092") \
.option("subscribe", "test-topic") \
.option("kafka.sasl.mechanisms", "PLAIN")\
.option("kafka.security.protocol", "SASL_SSL")\
.option("kafka.sasl.username","xxx")\
.option("kafka.sasl.password", "xxx")\
.option("startingOffsets", "latest")\
.option("kafka.group.id", "python_example_group_1")\
.load()
display(df)
However, I keep getting an error:
kafkashaded.org.apache.kafka.common.KafkaException: Failed to
construct kafka consumer
Databricks notebook for testing:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4673082066872014/3543014086288496/1802788104169533/latest.html
Documentation
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/structured-streaming-kafka-integration.html
This error indicates that the JAAS configuration is not visible to your Kafka consumer. To solve this issue, include the JAAS configuration using the following steps:
Step 1: Create a JAAS file, e.g. /home/jass/path:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
renewTicket=true
serviceName="kafka";
};
Step 2: Pass that JAAS file path to spark-submit via the conf parameter below.
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path"
Full spark-submit command :
/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --conf spark.ui.port=4055 --files /home/jass/path,/home/bdpda/bdpda.headless.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" pysparkstructurestreaming.py
PySpark Structured Streaming sample code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time
# Spark session :
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"Clinet_id")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "password_rd") \
.option("kafka.sasl.kerberos.keytab","/home/path.keytab") \
.option("kafka.sasl.kerberos.principal","path") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
# Creating writeStream query :
query = df1.writeStream \
.option("path","target_directory") \
.format("csv") \
.option("checkpointLocation","chkpint_directory") \
.outputMode("append") \
.start()
query.awaitTermination()
You need to specify kafka.sasl.jaas.config to supply the username and password for the Confluent Cloud SASL_SSL auth method. Its value looks a bit odd, but it works.
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-43n10.us-central1.gcp.confluent.cloud:9092") \
.option("subscribe", "wallet_txn_log") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="xxx" password="xxx";""").load()
display(df)
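One note on the kafkashaded. prefix: that class name comes from the Kafka client that Databricks shades into its runtime. On a plain Apache Spark build with the spark-sql-kafka package, the usual (unshaded) login module class would be used instead; a sketch with placeholder credentials:

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "pkc-43n10.us-central1.gcp.confluent.cloud:9092") \
    .option("subscribe", "wallet_txn_log") \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config",
            # unshaded class name for non-Databricks Spark; credentials are placeholders
            'org.apache.kafka.common.security.plain.PlainLoginModule required username="xxx" password="xxx";') \
    .load()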

PySpark Structured Streaming data writing into Cassandra not populating data

I want to write spark structured streaming data into cassandra. My spark version is 2.4.0.
My input source is Kafka with JSON data. Writing to the console works fine, but when I query Cassandra in cqlsh, no records are appended to the table. Can you tell me what is wrong?
schema = StructType() \
.add("humidity", IntegerType(), True) \
.add("time", TimestampType(), True) \
.add("temperature", IntegerType(), True) \
.add("ph", IntegerType(), True) \
.add("sensor", StringType(), True) \
.add("id", StringType(), True)
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('append') \
        .option("spark.cassandra.connection.host", "cassnode1,cassnode2") \
        .options(table="sensor", keyspace="sensordb") \
        .save()
# Load json format to dataframe
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafkanode") \
.option("subscribe", "iot-data-sensor") \
.load() \
.select([
get_json_object(col("value").cast("string"), "$.{}".format(c)).alias(c)
for c in ["humidity", "time", "temperature", "ph", "sensor", "id"]])
df.writeStream \
.foreachBatch(writeToCassandra) \
.outputMode("update") \
.start()
I had the same issue in PySpark. Try the steps below.
First, validate that it is connecting to Cassandra: point it at a table that does not exist and check whether it fails with "table not found" (a batch-read sketch for this check follows the snippet below).
Then try writeStream as below (include the trigger and output mode before calling the Cassandra update):
df.writeStream \
    .trigger(processingTime="10 seconds") \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .start()
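A minimal sketch of that first check, assuming the Spark Cassandra connector is on the classpath and reusing the host/keyspace/table names from the question (point it at a non-existent table to confirm you get a "table not found" style failure rather than a silent success):

# one-off batch read just to validate connectivity to Cassandra
check_df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .option("spark.cassandra.connection.host", "cassnode1,cassnode2") \
    .options(table="sensor", keyspace="sensordb") \
    .load()
check_df.show(5)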

Spark 2.4.0 dependencies to write to AWS Redshift

I'm struggling to find the correct package dependencies and their respective versions to write to a Redshift DB with a PySpark micro-batch approach.
What are the correct dependencies to achieve this goal?
As suggested in the AWS tutorial, it is necessary to provide a JDBC driver:
wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar
After this jar has been downloaded and made available to the spark-submit command, this is how I provided the dependencies:
spark-submit --master yarn --deploy-mode cluster \
--jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar \
--packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 \
my_script.py
Finally, this is the my_script.py that I provided to spark-submit:
from pyspark.sql import SparkSession
def foreach_batch_function(df, epoch_id, table_name):
    df.write \
        .format("com.databricks.spark.redshift") \
        .option("aws_iam_role", my_role) \
        .option("url", my_redshift_url) \
        .option("user", my_redshift_user) \
        .option("password", my_redshift_password) \
        .option("dbtable", my_redshift_schema + "." + table_name) \
        .option("tempdir", "s3://my/temp/dir") \
        .mode("append") \
        .save()
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", my_aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", my_aws_secret_access_key)
my_schema = spark.read.parquet(my_schema_file_path).schema
df = spark \
.readStream \
.schema(my_schema) \
.option("maxFilesPerTrigger", 100) \
.parquet(my_source_path)
df.writeStream \
.trigger(processingTime='30 seconds') \
.foreachBatch(lambda df, epochId: foreach_batch_function(df, epochId, my_redshift_table))\
.option("checkpointLocation", my_checkpoint_location) \
.start(outputMode="update").awaitTermination()

How to access global temp view in another pyspark application?

I have a Spark shell that invokes a PySpark script and has created a global temp view.
This is what I am doing in the first Spark shell script:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark SQL Parllel load example") \
.config("spark.jars","/u/user/graghav6/sqljdbc4.jar") \
.config("spark.dynamicAllocation.enabled","true") \
.config("spark.shuffle.service.enabled","true") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.sql.shuffle.partitions","50") \
.config("hive.metastore.uris", "thrift://xxxxx:9083") \
.config("spark.sql.join.preferSortMergeJoin","true") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.enableHiveSupport() \
.getOrCreate()
# after doing some transformations I try to create a global temp view of the dataframe:
df1.createGlobalTempView("df1_global_view")
spark.stop()
exit()
This is my second spark shell script:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark SQL Parllel load example") \
.config("spark.jars","/u/user/graghav6/sqljdbc4.jar") \
.config("spark.dynamicAllocation.enabled","true") \
.config("spark.shuffle.service.enabled","true") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.sql.shuffle.partitions","50") \
.config("hive.metastore.uris", "thrift://xxxx:9083") \
.config("spark.sql.join.preferSortMergeJoin","true") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.enableHiveSupport() \
.getOrCreate()
newSparkSession = spark.newSession()
# reading data from the global temp view
data_df_save = newSparkSession.sql(""" select * from global_temp.df1_global_view""")
data_df_save.show()
newSparkSession.close()
exit()
I am getting the error below:
Stdoutput pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`df1_global_view`; line 1 pos 15;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`df1_global_view`\n"
Looks like I am missing something. How can I share the same global temp view across multiple sessions?
Am I closing the Spark session incorrectly in the first Spark shell?
I have found a couple of answers on Stack Overflow already but was not able to figure out the cause.
You're using createGlobalTempView, so it's a temporary view that won't be available after you close the application.
In other words, it will be available in another SparkSession within the same Spark application, but not in another PySpark application.
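A minimal sketch of that scope, assuming both sessions live inside one application and reusing the view name from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global temp view scope").getOrCreate()

df1 = spark.range(5)
df1.createGlobalTempView("df1_global_view")

# Works: a new session in the SAME application can read the global temp view
other_session = spark.newSession()
other_session.sql("select * from global_temp.df1_global_view").show()

# Once spark.stop() ends this application, the view is gone; a separate
# PySpark application will fail with "Table or view not found".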
