PySpark session configuration overwriting - apache-spark

I recently came across the program below. After the new session values are declared in a variable, why is the Spark context stopped?
Usually a Spark session is stopped at the end of a job, but here it is stopped right after the new configuration is declared.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()

default_conf = spark.sparkContext._conf.getAll()
print(default_conf)

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
                                        ('spark.app.name', 'Spark Updated Conf'),
                                        ('spark.executor.cores', '4'),
                                        ('spark.cores.max', '4'),
                                        ('spark.driver.memory', '4g')])

spark.sparkContext.stop()

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config(conf=conf) \
    .getOrCreate()

default_conf = spark.sparkContext._conf.get("spark.cores.max")
print("updated configs ", default_conf)
I am trying to understand the reason for this pattern.

I think it's only example code to show that you cannot modify the configuration of a running session. You can create a session, read the configuration that was generated, then stop it and recreate it with updated values.
Otherwise it doesn't make much sense, because you would just use the configuration you needed from the beginning.
Note, however, that the default_conf variable at the end holds only the spark.cores.max setting.
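For comparison, here is a minimal sketch of the "use the configuration you need from the beginning" approach, reusing the setting names from the question. Keep in mind that with getOrCreate() these values are only guaranteed to take effect if no session already exists in the process.
from pyspark.sql import SparkSession

# Set every value on the builder up front, so the context never has to be
# stopped and recreated just to apply new settings.
spark = SparkSession \
    .builder \
    .appName("Spark Updated Conf") \
    .config("spark.driver.host", "localhost") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "4") \
    .config("spark.cores.max", "4") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

print(spark.sparkContext.getConf().get("spark.cores.max"))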

Related

Error while using Crealytics package to read Excel file

I'm trying to read an Excel file from an HDFS location using the Crealytics package and keep getting an error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.connector.catalog.TableProvider). My code is below. Any tips? When running the code below, the Spark session initiates fine and the Crealytics package loads without error. The error only appears when running the spark.read code. The file location I'm using is correct.
from pyspark import SparkConf
from pyspark.sql import SparkSession

def spark_session(spark_conf):
    conf = SparkConf()
    for (key, val) in spark_conf.items():
        conf.set(key, val)
    spark = SparkSession \
        .builder \
        .enableHiveSupport() \
        .config(conf=conf) \
        .getOrCreate()
    return spark

spark_conf = {"spark.executor.memory": "16g",
              "spark.yarn.executor.memoryOverhead": "3g",
              "spark.dynamicAllocation.initialExecutors": 2,
              "spark.driver.memory": "16g",
              "spark.kryoserializer.buffer.max": "1g",
              "spark.driver.cores": 32,
              "spark.executor.cores": 8,
              "spark.yarn.queue": "adhoc",
              "spark.app.name": "CDSW_basic",
              "spark.dynamicAllocation.maxExecutors": 32,
              "spark.jars.packages": "com.crealytics:spark-excel_2.12:0.14.0"
              }

spark = spark_session(spark_conf)

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .load("/user/data/Block_list.xlsx")
I've also tried loading it outside of the session function with the code below, which yields the same error once I try to read the file.
crealytics_driver_loc = "com.crealytics:spark-excel_2.12:0.14.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + crealytics_driver_loc + ' pyspark-shell'
Looks like I'm answering my own question. After a great deal of fiddling around, I've found that an older version of Crealytics works with my setup, though I'm not sure why. The package that worked was version 0.13 ("com.crealytics:spark-excel_2.12:0.13.0"), even though the newest is version 0.15.
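For reference, a minimal sketch of the change that fixed it: pin the older package in the same spark_conf dict and spark_session helper used in the question (everything else stays as above).
# Only the package coordinate changes; the rest of spark_conf is unchanged.
spark_conf["spark.jars.packages"] = "com.crealytics:spark-excel_2.12:0.13.0"

spark = spark_session(spark_conf)

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .load("/user/data/Block_list.xlsx")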

Use all workers PySpark YARN

How do I use all the workers in the cluster when I run PySpark in a notebook?
I'm running on Google Dataproc with YARN.
I use this configuration:
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAll([
    ('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
    ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11'),
    ('spark.executor.heartbeatInterval', "1000s"),
    ("spark.network.timeoutInterval", "1000s"),
    ("spark.network.timeout", "10000s"),
    ("spark.network.timeout", "1001s")
])

spark = SparkSession.builder \
    .appName('testing bq v04') \
    .config(conf=conf) \
    .getOrCreate()
But it doesn't look like it is using all the available resources.
To give some more context: the problem arises when I run the label propagation algorithm with GraphFrames:
g_df = GraphFrame(vertices_df, edges_df)
result_iteration_2 = g_df.labelPropagation(maxIter=5)
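No answer is quoted for this question, but as a rough sketch of what explicit executor sizing looks like on YARN: either let dynamic allocation scale executors, or request them explicitly in the same SparkConf used above. The numbers below are placeholders, not values from the question, and would need to match the actual worker machines.
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAll([
    # Placeholder sizing: request a fixed number of executors explicitly...
    ("spark.executor.instances", "8"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memory", "8g"),
    # ...or leave instances unset and let YARN scale them up and down:
    # ("spark.dynamicAllocation.enabled", "true"),
])

spark = SparkSession.builder \
    .appName('testing bq v04') \
    .config(conf=conf) \
    .getOrCreate()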

Apache Spark structured streaming word count example in local mode is super slow

I'm trying to run the Apache Spark structured streaming word count example in local mode and I get very high latency of 10-30 seconds. Here's the code I'm using (taken from https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html):
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

host = sys.argv[1]
port = int(sys.argv[2])

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Read lines streamed over a socket
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", host) \
    .option("port", port) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

query = wordCounts \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
The programming guide mentions that the latency should be about 100 ms, and this doesn't seem like a complicated example. Another thing worth mentioning: when I run this without any processing (just streaming the data straight to the output), I see the results immediately.
The example was run on Ubuntu 18.04 with Apache Spark 2.4.4.
Is this normal, or am I doing something wrong here?
Thanks!
Gal
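No answer is quoted for this question, but a common cause of multi-second micro-batches in this exact example is the stateful groupBy aggregation, which by default runs with spark.sql.shuffle.partitions = 200 shuffle partitions and therefore schedules 200 tiny tasks per batch on a single machine. A minimal sketch of the usual local-mode tuning (the value 1 is an assumption, not taken from the question):
from pyspark.sql import SparkSession

# Reduce the number of shuffle partitions so each micro-batch schedules far
# fewer tasks on a single local machine.
spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .config("spark.sql.shuffle.partitions", "1") \
    .getOrCreate()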

Hive - Create table LIKE doesn't work in spark-sql

In my PySpark job I'm trying to create a temp table using a LIKE clause, as below.
CREATE EXTERNAL TABLE IF NOT EXISTS stg.new_table_name LIKE stg.exiting_table_name LOCATION s3://s3-bucket/warehouse/stg/existing_table_name
My job fails with the error below:
mismatched input 'LIKE' expecting (line 1, pos 56)

== SQL ==
CREATE EXTERNAL TABLE IF NOT EXISTS stg.new_table_name LIKE
stg.exiting_table_name LOCATION
s3://s3-bucket/warehouse/stg/existing_table_name
Doesn't Spark support the LIKE clause for creating a new table from the metadata of an existing table?
My SparkSession config:
self.session = SparkSession \
    .builder \
    .appName(self.app_name) \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .config("hive.load.dynamic.partitions.thread", "10") \
    .config("hive.mv.files.thread", "30") \
    .config("fs.trash.interval", "0") \
    .enableHiveSupport() \
    .getOrCreate()
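No answer is quoted for this question, but as a hedged workaround sketch (not a quoted solution), the schema of the existing table can be copied through the DataFrame API, which sidesteps the CREATE EXTERNAL TABLE ... LIKE parse error. Note also that a LOCATION path in Spark SQL has to be a quoted string literal ('s3://...'). The sketch assumes a SparkSession named spark (the question's self.session) and reuses the table and path names from the question.
# Workaround sketch: copy the schema with an empty DataFrame and write it out
# as an unmanaged table backed by the desired S3 location.
empty_df = spark.table("stg.exiting_table_name").limit(0)

empty_df.write \
    .option("path", "s3://s3-bucket/warehouse/stg/existing_table_name") \
    .saveAsTable("stg.new_table_name")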

Load/import CSV file in to mongodb using PYSPARK

I want to know how to load/import a CSV file into MongoDB using PySpark. I have a CSV file named cal.csv on my desktop. Can somebody share a code snippet?

First, read the CSV as a PySpark DataFrame.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

df = sql.read.csv("cal.csv", header=True, mode="DROPMALFORMED")
Then write it to MongoDB:
df.write.format('com.mongodb.spark.sql.DefaultSource').mode('append') \
    .option('database', NAME).option('collection', COLLECTION_MONGODB).save()
Specify NAME and COLLECTION_MONGODB as created by you.
Also, you need to pass the conf and packages along with spark-submit, according to your version:
/bin/spark-submit \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME?readPreference=primaryPreferred" \
  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME" \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 \
  tester.py
Specify DATABASE and COLLECTION_NAME above; tester.py is assumed to be the name of your code file. For more information, see the MongoDB Spark connector documentation.
This worked for me (database: people, collection: con):
pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/people.con?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/people.con" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/people.con") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/people.con") \
    .getOrCreate()

df = my_spark.read.csv(path="file:///home/user/Desktop/people.csv", header=True, inferSchema=True)
df.printSchema()

df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append") \
    .option("database", "people").option("collection", "con").save()
Next, go to the mongo shell and check whether the collection was written by following the steps below:
mongo
show dbs
use people
show collections
db.con.find().pretty()
