How to access global temp view in another pyspark application? - apache-spark

I have a spark shell that invokes a Python script and creates a global temp view.
This is what I am doing in the first spark-shell script:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark SQL Parllel load example") \
.config("spark.jars","/u/user/graghav6/sqljdbc4.jar") \
.config("spark.dynamicAllocation.enabled","true") \
.config("spark.shuffle.service.enabled","true") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.sql.shuffle.partitions","50") \
.config("hive.metastore.uris", "thrift://xxxxx:9083") \
.config("spark.sql.join.preferSortMergeJoin","true") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.enableHiveSupport() \
.getOrCreate()
# After doing some transformations, I try to create a global temp view of the dataframe:
df1.createGlobalTempView("df1_global_view")
spark.stop()
exit()
This is my second spark shell script:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark SQL Parllel load example") \
.config("spark.jars","/u/user/graghav6/sqljdbc4.jar") \
.config("spark.dynamicAllocation.enabled","true") \
.config("spark.shuffle.service.enabled","true") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.sql.shuffle.partitions","50") \
.config("hive.metastore.uris", "thrift://xxxx:9083") \
.config("spark.sql.join.preferSortMergeJoin","true") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.enableHiveSupport() \
.getOrCreate()
newSparkSession = spark.newSession()
# Reading data from the global temp view
data_df_save = newSparkSession.sql(""" select * from global_temp.df1_global_view""")
data_df_save.show()
newSparkSession.close()
exit()
I am getting the error below:
Stdoutput pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`df1_global_view`; line 1 pos 15;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`df1_global_view`\n"
Looks like I am missing something. How can I share the same global temp view across multiple sessions?
Am I closing the Spark session incorrectly in the first spark shell?
I have found a couple of answers on Stack Overflow already, but was not able to figure out the cause.

You're using createGlobalTempView, so it is a temporary view: it is tied to the Spark application that created it and is dropped when that application stops.
In other words, the view is available to another SparkSession within the same application, but not to another PySpark application.
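For illustration, here is a minimal sketch of the two cases (the dataframe and the table name my_db.df1_table are hypothetical stand-ins): within one application, a global temp view created by one session is visible to a new session, whereas sharing data with a second application needs persistent storage, e.g. saving the dataframe as a Hive table, which your Hive-enabled config already supports.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df1 = spark.range(5)  # hypothetical stand-in for your transformed dataframe

# Same application: the global temp view lives in the global_temp database
# and is visible to any other session created from this application.
df1.createGlobalTempView("df1_global_view")
other_session = spark.newSession()
other_session.sql("SELECT * FROM global_temp.df1_global_view").show()

# Different application: the view is dropped when this application stops,
# so persist the data instead, e.g. as a Hive table, and read it back in
# the second application with spark.table("my_db.df1_table").
df1.write.mode("overwrite").saveAsTable("my_db.df1_table")  # hypothetical table name
In your first script the view disappears as soon as spark.stop() runs, which is why the second application reports "Table or view not found".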

Related

Trying to connect to Greenplum database from PySpark getting an error

I'm trying to connect to a Greenplum database using PySpark, but I get an error when executing the code below.
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "C:/Users/SKamaliyev/Documents/Drivers/db/postgresql-42.2.20.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://address:port/gpl") \
.option("dbtable", "dwh.vw_dm_subs_kpi_monthly") \
.option("user", "skamaliyev") \
.option("password", "passmy") \
.option("driver", "org.postgresql.Driver") \
.load()
An error:
An error occurred while calling o192.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) ...
How can I solve this?
You don't have the driver available to Spark; try downloading the PostgreSQL JDBC driver and adding it to the classpath.
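If it helps, here is a minimal sketch of one way to do that, assuming you let Spark pull the PostgreSQL JDBC driver 42.2.20 from Maven Central via spark.jars.packages instead of pointing spark.jars at a local file:
from pyspark.sql import SparkSession

# Fetch the PostgreSQL JDBC driver from Maven Central so that
# org.postgresql.Driver is on both the driver and executor classpaths.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.20") \
    .getOrCreate()
Alternatively, keep the downloaded jar and pass it to spark-submit with --jars (and --driver-class-path), which should also make org.postgresql.Driver loadable.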

Why is the performance of Redis worse than Hive?

I'm using Hadoop to work on a big data project, and I use Spark to send SQL commands to Hive.
Since this process is slow, I am trying to write my data into Redis, an open-source in-memory database, and use Spark to query the data from there to speed things up.
I have deployed a Redis server in my virtual machine, and I can use a Spark session to read, write and run SQL commands against Redis with the spark-redis module:
https://github.com/RedisLabs/spark-redis
Here's my test script. I use a Spark session to read a table from Hive and write it into Redis.
from pyspark.sql import SparkSession
import time
import pandas as pd
spark = SparkSession.builder \
.appName("read_and_write") \
.config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
.enableHiveSupport() \
.getOrCreate()
# read table from hive
sparkDF = spark.sql("SELECT * FROM hive_table")
sparkDF.show()
# write table into redis
sparkDF.write.format("org.apache.spark.sql.redis") \
.option("table", "redis_table") \
.mode("overwrite") \
.save()
After the write finishes, I run two scripts to compare the speed of Redis and Hive.
This script tests Hive:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
.appName("hive_spark_test") \
.config("hive.metastore.uris", "thrift://localhost:9083") \
.config("spark.debug.maxToStringFields", "500") \
.config("spark.sql.execution.arrow.enabled", True) \
.config("spark.sql.shuffle.partitions", 20) \
.config("spark.default.parallelism", 20) \
.config("spark.storage.memoryFraction", 0.5) \
.config("spark.shuffle.memoryFraction", 0.3) \
.config("spark.shuffle.consolidateFiles", False) \
.config("spark.shuffle.sort.bypassMergeThreshold", 200) \
.config("spark.shuffle.file.buffer", "32K") \
.config("spark.reducer.maxSizeInFlight", "48M") \
.enableHiveSupport() \
.getOrCreate()
for i in range(20):
    # you can use your own sql command
    sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM hive_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
    readDF = spark.sql(sql_command)
    df_json = readDF.toJSON()
    df_collect = df_json.collect()
    res = [json.loads(i) for i in df_collect]
    print(res)
Here's the result: the duration is 0.2 s to 0.5 s after a few rounds.
This script tests Redis:
from pyspark.sql import SparkSession
import time, json
spark = SparkSession.builder \
.appName("redis_spark_test") \
.config("spark.redis.host", "localhost") \
.config("spark.redis.port", "6379") \
.config("spark.redis.max.pipeline.size", 200) \
.config("spark.redis.scan.count", 200) \
.config("spark.debug.maxToStringFields", "500") \
.config("spark.sql.execution.arrow.enabled", True) \
.config("spark.sql.shuffle.partitions", 20) \
.config("spark.default.parallelism", 20) \
.config("spark.storage.memoryFraction", 0.5) \
.config("spark.shuffle.memoryFraction", 0.3) \
.config("spark.shuffle.consolidateFiles", False) \
.config("spark.shuffle.sort.bypassMergeThreshold", 200) \
.config("spark.shuffle.file.buffer", "32K") \
.config("spark.reducer.maxSizeInFlight", "48M") \
.getOrCreate()
sql_command = """CREATE OR REPLACE TEMPORARY VIEW redis_table (
testColumn1 STRING,
testColumn2 INT,
testColumn3 STRING,
testColumn4 STRING,
date DATE,)
USING org.apache.spark.sql.redis OPTIONS (table 'redis_table')
"""
spark.sql(sql_command)
for i in range(20):
    # you can use your own sql command
    sql_command = "SELECT testColumn1, SUM(testColumn2) AS testColumn2 FROM redis_table WHERE (date BETWEEN '2022-01-01' AND '2022-03-10') GROUP BY GROUPING SETS ((testColumn1))"
    readDF = spark.sql(sql_command)
    df_json = readDF.toJSON()
    df_collect = df_json.collect()
    res = [json.loads(i) for i in df_collect]
    print(res)
Here's the result: the duration is 1 s to 2 s after a few rounds.
This result contradicts what I had read: Redis should be faster than Hive, but I get the opposite result.
I want to know the reason, and to make Redis run faster than Hive through Spark if that's possible.
Thank you.

Why can't PySpark show any data?

When I use local Spark on Windows as below, it works and I can see the result of df.count():
spark = SparkSession \
.builder \
.appName("Structured Streaming ") \
.master("local[*]") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
.option("subscribe", kafka_topic_name) \
.option("startingOffsets", "latest") \
.load()
flower_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")
flower_schema_string = "sepal_length DOUBLE,sepal_length DOUBLE,sepal_length DOUBLE,sepal_length DOUBLE,species STRING"
flower_df2 = flower_df1.select(from_csv(col("value"), flower_schema_string).alias("flower"), "timestamp").select("flower.*", "timestamp")
flower_df2.createOrReplaceTempView("flower_find")
song_find_text = spark.sql("SELECT * FROM flower_find")
flower_agg_write_stream = song_find_text \
.writeStream \
.option("truncate", "false") \
.format("memory") \
.outputMode("update") \
.queryName("testedTable") \
.start()
while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)
But when I use Spark on my VirtualBox Ubuntu machine, I never see any data.
Below are the modifications I made when using Ubuntu's Spark:
SparkSession's master URL: "spark://192.168.15.2:7077"
I inserted flower_agg_write_stream.awaitTermination() above "while True:".
Did I do something wrong?
UPDATE:
When I run the modified code, the log shows the following:
...
org.apache.spark.sql.AnalysisException: Table or view not found: testedTable;
...
Unfortunately, I have already tried createOrReplaceGlobalTempView(), but that doesn't work either.

Why am I getting java.net.SocketException while running the Spark job?

spark = SparkSession \
.builder \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.sql.streaming.schemaInference", "true") \
.config("spark.rpc.message.maxSize", "1024") \
.getOrCreate()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()

Spark 2.4.0 dependencies to write to AWS Redshift

I'm struggling to find the correct package dependencies and their respective versions to write to a Redshift DB with a PySpark micro-batch approach.
What are the correct dependencies to achieve this goal?
As suggested in the AWS tutorial, it is necessary to provide a JDBC driver:
wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar
After downloading this jar and making it available to the spark-submit command, this is how I provided the dependencies:
spark-submit --master yarn --deploy-mode cluster \
--jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar \
--packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4 \
my_script.py
Finally, this is the my_script.py that I provided to spark-submit:
from pyspark.sql import SparkSession
def foreach_batch_function(df, epoch_id, table_name):
    df.write \
        .format("com.databricks.spark.redshift") \
        .option("aws_iam_role", my_role) \
        .option("url", my_redshift_url) \
        .option("user", my_redshift_user) \
        .option("password", my_redshift_password) \
        .option("dbtable", my_redshift_schema + "." + table_name) \
        .option("tempdir", "s3://my/temp/dir") \
        .mode("append") \
        .save()
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", my_aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", my_aws_secret_access_key)
my_schema = spark.read.parquet(my_schema_file_path).schema
df = spark \
.readStream \
.schema(my_schema) \
.option("maxFilesPerTrigger", 100) \
.parquet(my_source_path)
df.writeStream \
.trigger(processingTime='30 seconds') \
.foreachBatch(lambda df, epochId: foreach_batch_function(df, epochId, my_redshift_table))\
.option("checkpointLocation", my_checkpoint_location) \
.start(outputMode="update").awaitTermination()
