Use all workers PySpark YARN - apache-spark

How do I use all the workers in the cluster when I run PySpark in a notebook?
I'm running on Google Dataproc with YARN.
I use this configuration:
import pyspark
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAll([
    ('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
    ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11'),
    ('spark.executor.heartbeatInterval', "1000s"),
    ("spark.network.timeoutInterval", "1000s"),
    ("spark.network.timeout", "10000s"),
    # Duplicate key: setAll applies entries in order, so this value overrides the 10000s above.
    ("spark.network.timeout", "1001s")
])
spark = SparkSession.builder \
    .appName('testing bq v04') \
    .config(conf=conf) \
    .getOrCreate()
But it doesn't look like it is using all the available resources.
Here is some more context. The problem arises when I run the label propagation algorithm with GraphFrames:
from graphframes import GraphFrame
g_df = GraphFrame(vertices_df, edges_df)
result_iteration_2 = g_df.labelPropagation(maxIter=5)
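One thing worth checking is how many executors the session actually requests, since the configuration above sets no executor count, cores, or memory. The following is a rough sketch only: the helper name `executor_settings` and the cluster figures (4 workers, 8 cores and 24 GB each) are hypothetical, and the memory YARN really offers per node is usually lower than the machine total. It derives values for the standard properties `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory`.

```python
def executor_settings(workers, cores_per_worker, memory_gb_per_worker,
                      cores_per_executor=4, overhead_fraction=0.10):
    """Return Spark properties that place executors on every worker.

    All inputs are hypothetical cluster figures supplied by the caller;
    nothing here is read from YARN.
    """
    # How many executors of the chosen size fit on one worker.
    executors_per_worker = max(1, cores_per_worker // cores_per_executor)
    # Split the worker's memory between its executors, then keep a
    # fraction back for off-heap overhead.
    memory_per_executor = memory_gb_per_worker / executors_per_worker
    heap_gb = int(memory_per_executor * (1 - overhead_fraction))
    return [
        ("spark.executor.instances", str(workers * executors_per_worker)),
        ("spark.executor.cores", str(cores_per_executor)),
        ("spark.executor.memory", f"{heap_gb}g"),
    ]


settings = executor_settings(workers=4, cores_per_worker=8, memory_gb_per_worker=24)
print(settings)
```

The resulting tuples could be appended to the list passed to `setAll` before the session is created. Whether this helps with `labelPropagation` also depends on how the vertex and edge DataFrames are partitioned.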

Related

Pyspark session configuration overwriting

I recently came across the program below. After the new session values are declared in a variable, why is the Spark context stopped?
Usually the Spark session is stopped at the end, so why is it stopped here, right after the new configuration is declared?
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()
default_conf = spark.sparkContext._conf.getAll()
print(default_conf)
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
                                        ('spark.app.name', 'Spark Updated Conf'),
                                        ('spark.executor.cores', '4'),
                                        ('spark.cores.max', '4'),
                                        ('spark.driver.memory', '4g')])
spark.sparkContext.stop()
spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config(conf=conf) \
    .getOrCreate()
default_conf = spark.sparkContext._conf.get("spark.cores.max")
print("updated configs ", default_conf)
I am trying to understand why.
I think it's only example code to show that you cannot modify the config of a running session. You can create the session, get the config that was generated, then stop it and recreate it with updated values.
Otherwise, it doesn't make any sense, because you'd just use the config you needed from the beginning...
Note, however, that the default_conf variable at the end holds only the spark.cores.max setting.
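The stop-and-recreate pattern described above amounts to taking the existing key/value pairs and applying overrides before a new session is built. A minimal sketch of just that merge step, in plain Python with no Spark dependency (the helper name `merge_conf` and the sample values are illustrative):

```python
def merge_conf(current_pairs, overrides):
    """Return config pairs with overrides applied on top of the current values."""
    merged = dict(current_pairs)   # existing settings, e.g. from _conf.getAll()
    merged.update(overrides)       # later values replace earlier ones
    return list(merged.items())


current = [("spark.app.name", "MyApp"), ("spark.driver.host", "localhost")]
overrides = [("spark.app.name", "Spark Updated Conf"), ("spark.executor.memory", "4g")]
new_pairs = merge_conf(current, overrides)
print(new_pairs)
```

A list like `new_pairs` could then be handed to `SparkConf().setAll(...)` and passed to the builder once the old context has been stopped.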

Unable to create Scheduler Pools in EMR using PySpark

I am fairly new to the concept of Spark schedulers and pooling and need to implement it in one of my projects. To understand the concept better, I wrote the following streaming PySpark code and ran it on my local machine:
from pyspark.sql import SparkSession
import threading
def do_job(f1, f2):
    df1 = spark.readStream.json(f1)
    df2 = spark.readStream.json(f2)
    df = df1.join(df2, "id", "inner")
    df.writeStream.format("parquet").outputMode("append") \
        .option("checkpointLocation", "checkpoint" + str(f1) + "/") \
        .option("path", "Data/Sample_Delta_Data/date=A" + str(f1)) \
        .start()
    # outputs.append(df1.join(df2, "id", "inner").count())
if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Demo") \
        .master("local[4]") \
        .config("spark.sql.autoBroadcastJoinThreshold", "50B") \
        .config("spark.scheduler.mode", "FAIR") \
        .config("spark.sql.streaming.schemaInference", "true") \
        .getOrCreate()
    file_prefix = "data_new/data/d"
    jobs = []
    outputs = []
    for i in range(0, 6):
        file1 = file_prefix + str(i + 1)
        file2 = file_prefix + str(i + 2)
        thread = threading.Thread(target=do_job, args=(file1, file2))
        jobs.append(thread)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    spark.streams.awaitAnyTermination()
    # print(outputs)
As can be seen above, I am using the FAIR scheduler option and Python's threading library in PySpark to implement pooling.
As a matter of fact, the above code creates pools on my local system, but when I run the same code on an AWS EMR cluster, no pools are created.
Am I missing something specific to AWS EMR?
Suggestions please!
Regards.
Why are you using threading in PySpark? Spark already distributes work across executor cores for you. I understand you are new to Spark but clearly not new to Python. It's possible I missed a subtlety here, as streaming isn't my wheelhouse as much as Spark in general is.
The above code will launch all the work from the driver, and I think you really should read over the Spark documentation to better understand how it handles parallelism. (You want the work done on executors to really get the power you are looking for.)
With respect, this is how you do things in Python, not in PySpark/Spark code. You are likely seeing the difference between client and cluster deployment, and that could account for the difference. (It's a typical issue that occurs when coding locally versus on the cluster.)
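For reference, Spark's fair scheduler takes named pool definitions from an XML allocation file (pointed to by `spark.scheduler.allocation.file`), and a thread selects its pool with `setLocalProperty("spark.scheduler.pool", ...)`. Neither appears in the code above. The sketch below only generates such a file's contents; the pool names `stream_a` and `stream_b`, the weights, and the helper `build_allocations` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Build fair scheduler XML from {pool_name: (mode, weight, min_share)}."""
    root = ET.Element("allocations")
    for name, (mode, weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = mode
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


xml_text = build_allocations({
    "stream_a": ("FAIR", 1, 2),
    "stream_b": ("FAIR", 2, 1),
})
print(xml_text)
```

Under these assumptions, the text would be saved to a file the driver can read, that path set as `spark.scheduler.allocation.file` when the session is built, and each thread would call `spark.sparkContext.setLocalProperty("spark.scheduler.pool", "stream_a")` before starting its query. On a cluster, the file has to exist on the machine where the driver runs, which is one place local and EMR runs can differ.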

Load/import CSV file in to mongodb using PYSPARK

I want to know how to load/import a CSV file into MongoDB using PySpark. I have a CSV file named cal.csv on the desktop. Can somebody share a code snippet?
First, read the CSV as a PySpark DataFrame.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("csv-to-mongodb")  # placeholder app name
sc = SparkContext(conf=conf)
sql = SQLContext(sc)
df = sql.read.csv("cal.csv", header=True, mode="DROPMALFORMED")
Then write it to MongoDB:
df.write.format('com.mongodb.spark.sql.DefaultSource').mode('append') \
    .option('database', NAME).option('collection', COLLECTION_MONGODB).save()
Set NAME and COLLECTION_MONGODB to the database and collection you created.
You also need to pass the conf and packages options to spark-submit, matching your versions:
/bin/spark-submit --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME?readPreference=primaryPreferred" \
    --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME" \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 \
    tester.py
Specify COLLECTION_NAME and DATABASE above. tester.py is the assumed name of the code file. For more information, refer to this.
This worked for me, with database people and collection con:
pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/people.con?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/people.con" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
from pyspark.sql import SparkSession
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/people.con") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/people.con") \
    .getOrCreate()
# Read through the session created above.
df = my_spark.read.csv(path="file:///home/user/Desktop/people.csv", header=True, inferSchema=True)
df.printSchema()
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").option("database", "people").option("collection", "con").save()
Next, open the mongo shell and check that the collection was written by following the steps below:
mongo
show dbs
use people
show collections
db.con.find().pretty()
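Both answers rely on the same URI shape, `mongodb://HOST/DATABASE.COLLECTION`, for the connector's input and output settings. As a small illustration, the sketch below assembles those two properties from their parts. The helper names `mongo_uri` and `connector_conf` are made up for this example, and the property names are the ones used with the 2.x connector in this thread.

```python
def mongo_uri(host, database, collection, read_preference=None):
    """Build a mongodb:// URI that targets one database and collection."""
    uri = f"mongodb://{host}/{database}.{collection}"
    if read_preference:
        uri += f"?readPreference={read_preference}"
    return uri


def connector_conf(host, database, collection):
    """Return the input/output URI properties used in the answers above."""
    return {
        "spark.mongodb.input.uri": mongo_uri(host, database, collection, "primaryPreferred"),
        "spark.mongodb.output.uri": mongo_uri(host, database, collection),
    }


conf = connector_conf("127.0.0.1", "people", "con")
for key, value in conf.items():
    print(f'--conf "{key}={value}"')
```

The printed lines match the `--conf` arguments shown for the pyspark command, and the same dictionary entries could be passed to `SparkSession.builder.config(key, value)` one at a time.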

Save CSV file to hbase table using Spark and Phoenix

Can someone point me to a working example of saving a CSV file to an HBase table using Spark 2.2?
Options that I tried without success (note: all of them work with Spark 1.6 for me):
phoenix-spark
hbase-spark
it.nerdammer.bigdata : spark-hbase-connector_2.10
After fixing everything else, all of them end up giving an error similar to this one: Spark HBase
Thanks
Add the parameters below to your Spark job:
spark-submit \
    --conf "spark.yarn.stagingDir=/somelocation" \
    --conf "spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/somelocation" \
    --conf "spark.hadoop.mapred.output.dir=/somelocation"
Phoenix has a Spark plugin and a JDBC thin client, both of which can connect (read/write) to HBase. Examples are at https://phoenix.apache.org/phoenix_spark.html
Option 1: Connect via the ZooKeeper URL using the Phoenix plugin
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
// sqlContext.load(...) was removed in Spark 2.0; use the DataFrameReader instead.
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181"))
  .load()
df
  .filter(df("COL1") === "test_row_1" && df("ID") === 1L)
  .select(df("ID"))
  .show
Option 2: Use the JDBC thin client provided by the Phoenix Query Server
More info at https://phoenix.apache.org/server.html
jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF

connect to mysql from spark

I am trying to follow the instructions mentioned here...
https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/
and here...
https://www.percona.com/blog/2015/10/07/using-apache-spark-mysql-data-analysis/
I am using the sequenceiq/spark Docker image.
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
cd /usr/local/spark/
./sbin/start-master.sh
./bin/spark-shell --driver-memory 1G --executor-memory 1g --executor-cores 1 --master local
This works as expected:
scala> sc.parallelize(1 to 1000).count()
But this shows an error:
val jdbcDF = spark.read.format("jdbc").options(
Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
"dbtable" -> "ontime.ontime_part",
"fetchSize" -> "10000",
"partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
)).load()
And here is the error:
<console>:25: error: not found: value spark
val jdbcDF = spark.read.format("jdbc").options(
How do I connect to MySQL from within spark shell?
With Spark 2.0.x, you can use DataFrameReader and DataFrameWriter.
Use SparkSession.read to access a DataFrameReader and Dataset.write to access a DataFrameWriter.
The examples below assume you are using spark-shell.
Read example:
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
val df=spark.read.jdbc(url,"table_name",prop)
df.show()
Read example 2 (from the Spark documentation):
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
write example
import org.apache.spark.sql.SaveMode
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
// df is a DataFrame containing the data you want to write.
df.write.mode(SaveMode.Append).jdbc(url,"table_name",prop)
Create the Spark context first.
Make sure the JDBC driver jar files are on your classpath.
If you are trying to read data over JDBC, use the DataFrame API instead of RDDs, as DataFrames have better performance. Refer to the performance comparison graph below.
Here is the syntax for reading from JDBC:
// prop is assumed to be a java.util.Properties object loaded elsewhere.
SparkConf conf = new SparkConf().setAppName("app")
    .setMaster("local[2]")
    .set("spark.serializer", prop.getProperty("spark.serializer"));
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlCtx = new SQLContext(sc);
// DataFrame is the Spark 1.x type; in Spark 2.x this would be Dataset<Row>.
DataFrame df = sqlCtx.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://1.2.3.4:3306/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "dbtable")
    .option("user", "dbuser")
    .option("password", "dbpwd")
    .load();
It looks like spark is not defined. You should use the SQLContext to connect through the JDBC driver, like this:
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read.format("jdbc").
  option("url", "jdbc:mysql://Public_IP:3306/DB_NAME").
  option("driver", "com.mysql.jdbc.Driver").
  option("dbtable", "tblage").
  option("user", "sqluser").
  option("password", "sqluser").
  load()
After that, you can use sqlcontext wherever you used spark (in spark.read and so on).
This is a common problem for those migrating to Spark 2.0.0 from earlier versions. The Spark documentation is not very good. To solve this, you have to define a SparkSession, like this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Spark SQL Example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
This solution is hidden in the Spark SQL, DataFrames and Datasets Guide. SparkSession, available from Spark 2.0 onward, is the new entry point to the DataFrame API. It incorporates both SQLContext and HiveContext and has some additional advantages, so there is no need to define either of those anymore.
Please accept this as the answer, if you find this useful.
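Separately from the missing spark value, the partitioned read in the question depends on four options being supplied together: per the Spark JDBC data source documentation, partitionColumn, lowerBound, and upperBound must all be given if any is, along with numPartitions. A small sketch of that check in plain Python follows. The helper `jdbc_options` is hypothetical, and the credentials are left out of the URL on purpose; they would normally be passed as separate user and password options.

```python
BOUND_KEYS = ("partitionColumn", "lowerBound", "upperBound")


def jdbc_options(url, table, **extra):
    """Collect JDBC reader options as strings and check the partitioning set."""
    options = {"url": url, "dbtable": table}
    options.update({key: str(value) for key, value in extra.items()})
    # If any bound-related option is present, all of them plus numPartitions
    # are required for a partitioned read.
    if any(key in options for key in BOUND_KEYS):
        required = BOUND_KEYS + ("numPartitions",)
        missing = [key for key in required if key not in options]
        if missing:
            raise ValueError("partitioned read also needs: " + ", ".join(missing))
    return options


opts = jdbc_options(
    "jdbc:mysql://1.2.3.4:3306/test",
    "ontime.ontime_part",
    fetchSize=10000,
    partitionColumn="yeard",
    lowerBound=1988,
    upperBound=2016,
    numPartitions=28,
)
print(opts)
```

In PySpark, a dictionary like this could be passed as `spark.read.format("jdbc").options(**opts).load()` once a session or SQLContext is available.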
