Driver doesn't stop on cluster mode - apache-spark

I've configured my cluster (1 master / 9 slaves).
My problem is that when I submit an application (a word count, through spark-submit with --deploy-mode cluster), the driver doesn't stop, even though there is very little data.
I submitted the application like this:
./spark-submit \
--class wordCount \
--master spark://master:6066 --deploy-mode cluster --supervise \
--executor-cores 1 --total-executor-cores 3 --executor-memory 1g \
hdfs://master:9000/user/exemple/word3.jar \
hdfs://master:9000/user/exemple/texte.txt \
hdfs://master:9000/user/exemple/result 2
That's my program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    // get threshold
    val threshold = args(1).toInt
    // read in text file and split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))
    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)
    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
    System.out.println(charCounts.collect().mkString(", "))
  }
}
Result:
Application Status (screenshot)

Related

Pyspark Job with Dataproc on GCP

I'm trying to run a PySpark job, but I keep getting a job failure for this reason:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at: https://console.cloud.google.com/dataproc/jobs/f8f8e95794e0457d80ea1b0c4df8d815?project=long-state-352923&region=us-central1 gcloud dataproc jobs wait 'f8f8e95794e0457d80ea1b0c4df8d815' --region 'us-central1' --project 'long-state-352923' ...
Here is also my code for running the job:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('spark_hdfs_to_hdfs') \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN")

MASTER_NODE_INSTANCE_NAME = "cluster-d687-m"

log_files_rdd = sc.textFile('hdfs://{}/data/logs_example/*'.format(MASTER_NODE_INSTANCE_NAME))

splitted_rdd = log_files_rdd.map(lambda x: x.split(" "))
selected_col_rdd = splitted_rdd.map(lambda x: (x[0], x[3], x[5], x[6]))

columns = ["ip", "date", "method", "url"]
logs_df = selected_col_rdd.toDF(columns)
logs_df.createOrReplaceTempView('logs_df')

sql = """
    SELECT
        url,
        count(*) as count
    FROM logs_df
    WHERE url LIKE '%/article%'
    GROUP BY url
    """
article_count_df = spark.sql(sql)
print(" ### Get only articles and blogs records ### ")
article_count_df.show(5)
I don't understand why it is failing.
Is there a problem with the code?
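The message shown only points to the Dataproc console, so the root cause can't be confirmed from the post; the full driver stack trace from the gcloud dataproc jobs wait command above would tell you the failing line. One common culprit worth ruling out (an assumption, not something the error above establishes) is an IndexError from log lines that split into fewer than seven fields. A minimal defensive variant of the parsing step, reusing the hypothetical paths and names from the code above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark_hdfs_to_hdfs').getOrCreate()
sc = spark.sparkContext

MASTER_NODE_INSTANCE_NAME = "cluster-d687-m"
log_files_rdd = sc.textFile('hdfs://{}/data/logs_example/*'.format(MASTER_NODE_INSTANCE_NAME))

splitted_rdd = log_files_rdd.map(lambda x: x.split(" "))
# Keep only lines with at least 7 space-separated fields so that x[6]
# below cannot raise IndexError on short or malformed log lines.
valid_rdd = splitted_rdd.filter(lambda x: len(x) > 6)
selected_col_rdd = valid_rdd.map(lambda x: (x[0], x[3], x[5], x[6]))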

PySpark - Error accessing broadcast variable in UDF while running in standalone cluster mode

import datetime

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

@f.pandas_udf(returnType=DoubleType())
def square(r: pd.Series) -> pd.Series:
    print('In pandas Udf square')
    offset_value = offset.value   # read the broadcast variable inside the UDF
    return (r * r) + 10

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Spark").getOrCreate()
    sc = spark.sparkContext
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    offset = sc.broadcast(10)

    x = pd.Series(range(0, 100))
    df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
    df = df.withColumn('sq', square(df.x)).withColumn('sqsq', square(f.col('sq')))

    start_time = datetime.datetime.now()
    df.show()

    offset.unpersist()
    offset.destroy()
    spark.stop()
The above code works fine if I run it with the spark-submit command in local mode:
Submit.cmd --master local[*] test.py
If I try to run the same code in standalone cluster mode, i.e.
Submit.cmd --master spark://xx.xx.0.24:7077 test.py
I get an error while accessing the broadcast variable in the UDF:
java.io.IOException: Failed to delete original file 'C:\Users\xxx\AppData\Local\Temp\spark-bf6b4553-f30f-4e4a-a7f7-ef117329985c\executor-3922c28f-ed1e-4348-baa4-4ed08e042b76\spark-b59e518c-a20a-4a11-b96b-b7657b1c79ea\broadcast6537791588721535439' after copy to 'C:\Users\xxx\AppData\Local\Temp\spark-bf6b4553-f30f-4e4a-a7f7-ef117329985c\executor-3922c28f-ed1e-4348-baa4-4ed08e042b76\blockmgr-ee27f0f0-ee8b-41ea-86d6-8f923845391e\37\broadcast_0_python'
at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:2835)
at org.apache.spark.storage.DiskStore.moveFileToBlock(DiskStore.scala:133)
at org.apache.spark.storage.BlockManager$TempFileBasedBlockStoreUpdater.saveToDiskStore(BlockManager.scala:424)
at org.apache.spark.storage.BlockManager$BlockStoreUpdater.$anonfun$save$1(BlockManager.scala:343)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
Without accessing the broadcast variable in the UDF, this code works fine.
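For what it's worth, one workaround sometimes used in this situation is to resolve the value on the driver and let the pandas UDF capture a plain Python value in its closure, so the executors never dereference a Broadcast object inside the UDF. This is only a sketch under that assumption, not a confirmed fix for the Windows temp-file error above:

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Spark").getOrCreate()

    offset_value = 10  # plain Python int, shipped to executors with the UDF closure

    @f.pandas_udf(returnType=DoubleType())
    def square(r: pd.Series) -> pd.Series:
        # no broadcast lookup here, only the captured literal
        return ((r * r) + offset_value).astype("float64")

    df = spark.createDataFrame(pd.DataFrame({"x": range(100)}))
    df.withColumn("sq", square(df.x)).show()
    spark.stop()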

Spark2 reads ORC files much slower than Spark1

I found that Spark 2 loads ORC files much more slowly than Spark 1. I then tried some methods to speed up Spark 2, but with no success. The code is shown below:
Spark 1.5
val conf = new SparkConf().setAppName("LoadOrc")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.akka.frameSize", "512")
  .set("spark.akka.timeout", "800s")
  .set("spark.storage.blockManagerHeartBeatMs", "300000")
  .set("spark.kryoserializer.buffer.max", "1024m")
  .set("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

val start = System.nanoTime()
val ret = hiveContext.read.orc(args(0)).count()
val end = System.nanoTime()

println(s"count: $ret")
println(s"Time taken: ${(end - start) / 1000 / 1000} ms")
sc.stop()
Spark UI: (Spark 1 UI screenshot)
Results:
count: 2290811187
Time taken: 401063 ms
Spark 2
val spark = SparkSession.builder()
  .appName("LoadOrc")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.akka.frameSize", "512")
  .config("spark.akka.timeout", "800s")
  .config("spark.storage.blockManagerHeartBeatMs", "300000")
  .config("spark.kryoserializer.buffer.max", "1024m")
  .config("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
  .enableHiveSupport()
  .getOrCreate()

println(spark.time(spark.read.format("org.apache.spark.sql.execution.datasources.orc")
  .load(args(0)).count()))
spark.close()
Spark UI: (Spark 2 UI screenshot)
Results:
Time taken: 1384464 ms
2290811187

How to see the dataframe in the console (equivalent of .show() for structured streaming)?

I'm trying to see what's coming in as my DataFrame.
Here is the Spark code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import logging
import time
spark = SparkSession \
.builder \
.appName("Console Example") \
.getOrCreate()
logging.info("started to listen to the host..")
lines = spark \
.readStream \
.format("socket") \
.option("host", "127.0.0.1") \
.option("port", 9999) \
.load()
data = lines.selectExpr("CAST(value AS STRING)")
query1 = data.writeStream.format("console").start()
time.sleep(10)
query1.awaitTermination()
I am getting the progress reports, but the input rows are 0 for each trigger:
2019-08-19 23:45:45 INFO MicroBatchExecution:54 - Streaming query made progress: {
"id" : "a4b26eaf-1032-4083-9e42-a9f2f0426eb7",
"runId" : "35c2b82a-191d-4998-9c98-17b24f5e3e9d",
"name" : null,
"timestamp" : "2019-08-20T06:45:45.458Z",
"batchId" : 0,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 0,
"triggerExecution" : 0
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "TextSocketSource[host: 127.0.0.1, port: 9999]",
"startOffset" : null,
"endOffset" : null,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider#5f3e6f3"
}
}
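As an aside, the same progress information can be pulled programmatically from the query handle instead of from the INFO logs; a small sketch, assuming the query1 handle from the code above and run before awaitTermination() blocks:

import json
import time

# Poll the running streaming query for its status and latest progress
# (do this before query1.awaitTermination(), which blocks the driver).
time.sleep(10)                                     # give the first micro-batch time to trigger
print(query1.status)                               # is a trigger active / waiting for data?
print(json.dumps(query1.lastProgress, indent=2))   # same JSON as above, including numInputRows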
My TCP server is sending data and I can see it in its console too, but I just want to confirm whether my Spark job is receiving anything by printing it out, and that is proving difficult.
This is my TCP server code:
import socket
import sys
import csv
import time

port = 9999
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind(('', port))
server_socket.listen(5)
connection_socket, addr = server_socket.accept()

file_path = "/Users/Downloads/youtube-new/USvideos.csv"
# count the rows in the file (not the characters of the path string)
with open(file_path, "r") as f:
    row_count = sum(1 for row in f)

with open(file_path, "r") as f:
    reader = csv.reader(f, delimiter="\t")
    while True:
        for i, line in enumerate(reader):
            try:
                print(line)
                data = line[0].encode('utf-8')
                connection_socket.send(data)
                time.sleep(2)
                if row_count == i - 1:
                    break
            except IndexError:
                print("Index error")

server_socket.close()
I can see the lines being printed, so I can at least say the server has accepted a connection at localhost:9999, which is the host and port I'm using for the Spark job as well.
This is one of the data lines:
['8mhTWqWlQzU,17.15.11,"Wearing Online Dollar Store Makeup For A Week","Safiya Nygaard",22,2017-11-11T01:19:33.000Z,"wearing online dollar store makeup for a week"|"online dollar store makeup"|"dollar store makeup"|"daiso"|"shopmissa makeup"|"shopmissa haul"|"dollar store makeup haul"|"dollar store"|"shopmissa"|"foundation"|"concealer"|"eye primer"|"eyebrow pencil"|"eyeliner"|"bronzer"|"contour"|"face powder"|"lipstick"|"$1"|"$1 makeup"|"safiya makeup"|"safiya dollar store"|"safiya nygaard"|"safiya"|"safiya and tyler",2922523,119348,1161,6736,https://i.ytimg.com/vi/8mhTWqWlQzU/default.jpg,False,False,False,"I found this online dollar store called ShopMissA that sells all their makeup products for $1 and decided I had to try it out! So I replaced my entire everyday makeup routine with $1 makeup products, including foundation, concealer, eye primer, eyebrow pencil, eyeliner, bronzer, contour, face powder, and lipstick. What do you think? Would you try this?\\n\\nThis video is NOT sponsored!\\n\\nSafiya\'s Nextbeat: https://nextbeat.co/u/safiya\\nIG: https://www.instagram.com/safiyany/\\nTwitter: https://twitter.com/safiyajn\\nFacebook: https://www.facebook.com/safnygaard/\\n\\nAssistant Editor: Claire Wiley\\n\\nMUSIC\\nMind The Gap\\nvia Audio Network\\n\\nSFX\\nvia AudioBlocks"']
Everything is inside the brackets (notice I'm actually sending data[0]).
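One detail worth double-checking here (an assumption about the cause, not something the post confirms): Spark's socket source reads newline-delimited text, and the server above sends line[0] without a trailing newline, which could leave every record unterminated and numInputRows at 0. A minimal tweak to the send step inside the server loop:

# Inside the `for i, line in enumerate(reader):` loop of the TCP server above:
# terminate each record with "\n" so the socket source can split the stream into rows.
data = (line[0] + "\n").encode("utf-8")
connection_socket.send(data)
time.sleep(2)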
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import logging
import time
spark = SparkSession \
.builder \
.appName("Console Example") \
.getOrCreate()
logging.info("started to listen to the host..")
lines = spark \
.readStream \
.format("socket") \
.option("host", "127.0.0.1") \
.option("port", 9999) \
.load()
data = lines.selectExpr("CAST(value AS STRING)")
query1 = data.writeStream.queryName("counting").format("memory").outputMode("append").start()
for x in range(5):
    spark.sql("select * from counting").show()
    time.sleep(10)
Try this; it will show you the data just as the show() method does in Spark SQL. It will print five sets of data, since we loop five times.
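If you would rather keep the console sink from the original code, a variant with truncation disabled (a sketch assuming the same data DataFrame as in the question) also makes incoming rows readable in the driver console:

# Console sink variant: print each micro-batch without truncating the long CSV lines.
query1 = (data.writeStream
          .format("console")
          .option("truncate", "false")
          .option("numRows", 20)
          .start())
query1.awaitTermination()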

Issue while running Spark application on Yarn

I have a test Spark environment (single node) running on AWS. I executed a few ad-hoc queries in the PySpark shell and everything went as expected; however, when I run the application using spark-submit, I get an error.
Below is the code:
from __future__ import print_function
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext as sql
conf = SparkConf().setAppName("myapp")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
if __name__ == "__main__":
    #inp_data = loaded data from db
    df = inp_data.select('Id','DueDate','Principal','delay','unpaid_emi','future_payment')
    filterd_unpaid_emi = df.filter(df.unpaid_emi == 1)
    par = filterd_unpaid_emi.groupBy('Id').sum('Principal').withColumnRenamed('sum(Principal)' , 'par')
    temp_df = df.filter(df.unpaid_emi == 1)
    temp_df_1 = temp_df.filter(temp_df.future_payment == 0)
    temp_df_1.registerTempTable("mytable")
    bucket_df_1 = sql("""select *, case
        when delay<0 and delay ==0 then '9999'
        when delay>0 and delay<7 then '9'
        when delay>=7 and delay<=14 then '8'
        when delay>=15 and delay<=29 then '7'
        when delay>=30 and delay<=59 then '6'
        when delay>=60 and delay<=89 then '5'
        when delay>=90 and delay<=119 then '4'
        when delay>=120 and delay<=149 then '3'
        when delay>=150 and delay<=179 then '2'
        else '1'
        end as bucket
        from mytable""")
    bucket_df_1 = bucket_df_1.select(bucket_df_1.Id,bucket_df_1.Principal,bucket_df_1.delay,bucket_df_1.unpaid_emi,bucket_df_1.future_payment,bucket_df_1.bucket.cast("int").alias('buckets'))
    min_bucket = bucket_df_1.groupBy('Id').min('buckets').withColumnRenamed('min(buckets)' , 'max_delay')
    joinedDf = par.join(min_bucket, ["Id"])
    #joinedDf.printSchema()
And below is the command to submit the application:
spark-submit \
--master yarn \
--driver-class-path /path to/mysql-connector-java-5.0.8-bin.jar \
--jars /path to/mysql-connector-java-5.0.8-bin.jar \
/path to/mycode.py
ERROR:
17/11/10 10:00:34 INFO SparkSqlParser: Parsing command: mytable
Traceback (most recent call last):
File "/path to/mycode.py", line 36, in <module>
from mytable""")
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 73, in __init__
AttributeError: 'str' object has no attribute '_jsc'
17/11/10 10:00:34 INFO SparkContext: Invoking stop() from shutdown hook
17/11/10 10:00:34 INFO SparkUI: Stopped Spark web UI at ........
I'm quite new to Spark, so can someone please tell me what mistake(s) I'm making?
Also, any feedback on improving coding style will be appreciated!
Spark Version : 2.2
You are using the imported SQLContext, aliased as sql, to query your temp table; that is just the class and is not bound to any Spark instance. Use spark.sql (from the initialized SparkSession) instead. I also changed some of your imports and code.
from __future__ import print_function
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
if __name__ == "__main__":
    # move the initializations within the main
    conf = SparkConf().setAppName("myapp")

    # create the session
    spark = SparkSession.builder.config(conf=conf) \
        .getOrCreate()

    # load your data and do what you need to do
    #inp_data = loaded data from db
    df = inp_data.select('Id','DueDate','Principal','delay','unpaid_emi','future_payment')
    filterd_unpaid_emi = df.filter(df.unpaid_emi == 1)
    par = filterd_unpaid_emi.groupBy('Id').sum('Principal').withColumnRenamed('sum(Principal)' , 'par')
    temp_df = df.filter(df.unpaid_emi == 1)
    temp_df_1 = temp_df.filter(temp_df.future_payment == 0)
    temp_df_1.registerTempTable("mytable")

    # use spark.sql to query your table
    bucket_df_1 = spark.sql("""select *, case
        when delay<0 and delay ==0 then '9999'
        when delay>0 and delay<7 then '9'
        when delay>=7 and delay<=14 then '8'
        when delay>=15 and delay<=29 then '7'
        when delay>=30 and delay<=59 then '6'
        when delay>=60 and delay<=89 then '5'
        when delay>=90 and delay<=119 then '4'
        when delay>=120 and delay<=149 then '3'
        when delay>=150 and delay<=179 then '2'
        else '1'
        end as bucket
        from mytable""")
    bucket_df_1 = bucket_df_1.select(bucket_df_1.Id,bucket_df_1.Principal,bucket_df_1.delay,bucket_df_1.unpaid_emi,bucket_df_1.future_payment,bucket_df_1.bucket.cast("int").alias('buckets'))
    min_bucket = bucket_df_1.groupBy('Id').min('buckets').withColumnRenamed('min(buckets)' , 'max_delay')
    joinedDf = par.join(min_bucket, ["Id"])
    #joinedDf.printSchema()
Hope this helps, good luck!
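One further style note on top of that answer (for Spark 2.x, and assuming the same temp_df_1 and spark as above): registerTempTable is deprecated in favor of createOrReplaceTempView, so the temp-view step can be written as:

# Spark 2.x: createOrReplaceTempView replaces the deprecated registerTempTable.
temp_df_1.createOrReplaceTempView("mytable")
# then query it with spark.sql exactly as in the answer, e.g.:
bucket_df_1 = spark.sql("select * from mytable")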
