Spark2 reads ORC files much slower than Spark1 - apache-spark

I found that Spark 2 loads ORC files much more slowly than Spark 1, and the methods I tried to speed up Spark 2 made no difference. The code is shown below:
Spark 1.5
val conf = new SparkConf().setAppName("LoadOrc")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.akka.frameSize", "512")
.set("spark.akka.timeout","800s")
.set("spark.storage.blockManagerHeartBeatMs", "300000")
.set("spark.kryoserializer.buffer.max","1024m")
.set("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
val start = System.nanoTime()
val ret = hiveContext.read.orc(args(0)).count()
val end = System.nanoTime()
println(s"count: $ret")
println(s"Time taken: ${(end - start) / 1000 / 1000} ms")
sc.stop()
Spark UI: (Spark 1 UI screenshot)
Results:
count: 2290811187
Time taken: 401063 ms
Spark 2
val spark = SparkSession.builder()
.appName("LoadOrc")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.akka.frameSize", "512")
.config("spark.akka.timeout","800s")
.config("spark.storage.blockManagerHeartBeatMs", "300000")
.config("spark.kryoserializer.buffer.max","1024m")
.config("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
.enableHiveSupport()
.getOrCreate()
println(spark.time(spark.read.format("org.apache.spark.sql.execution.datasources.orc")
  .load(args(0)).count()))
spark.close()
Spark UI: (Spark 2 UI screenshot)
Results:
Time taken: 1384464 ms
2290811187
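For comparison, here is a minimal sketch of another Spark 2 read path one could time against the code above. It is only an assumption-laden example: spark.read.orc is the built-in shorthand for the ORC data source, and the spark.sql.orc.impl / spark.sql.orc.filterPushdown settings only exist from Spark 2.3 onward, so the option names should be checked against the version in use.
// Sketch only: same count, but via the built-in ORC shorthand.
val sparkNative = SparkSession.builder()
  .appName("LoadOrcNative")
  .config("spark.sql.orc.impl", "native")            // Spark 2.3+ vectorized ORC reader
  .config("spark.sql.orc.filterPushdown", "true")    // enable ORC predicate pushdown
  .enableHiveSupport()
  .getOrCreate()

val cnt = sparkNative.time(sparkNative.read.orc(args(0)).count())
println(s"count: $cnt")
sparkNative.close()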

Related

Pyspark Job with Dataproc on GCP

I'm trying to run a PySpark job, but it keeps failing with this message:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at: https://console.cloud.google.com/dataproc/jobs/f8f8e95794e0457d80ea1b0c4df8d815?project=long-state-352923&region=us-central1 gcloud dataproc jobs wait 'f8f8e95794e0457d80ea1b0c4df8d815' --region 'us-central1' --project 'long-state-352923' ...
Here is the code I am running as the job:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('spark_hdfs_to_hdfs') \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN")

MASTER_NODE_INSTANCE_NAME = "cluster-d687-m"
log_files_rdd = sc.textFile('hdfs://{}/data/logs_example/*'.format(MASTER_NODE_INSTANCE_NAME))

splitted_rdd = log_files_rdd.map(lambda x: x.split(" "))
selected_col_rdd = splitted_rdd.map(lambda x: (x[0], x[3], x[5], x[6]))

columns = ["ip", "date", "method", "url"]
logs_df = selected_col_rdd.toDF(columns)
logs_df.createOrReplaceTempView('logs_df')

sql = """
SELECT
    url,
    count(*) as count
FROM logs_df
WHERE url LIKE '%/article%'
GROUP BY url
"""
article_count_df = spark.sql(sql)

print(" ### Get only articles and blogs records ### ")
article_count_df.show(5)
I don't understand why it's failing.
Is there a problem with the code?

Execute PySpark code from a Java/Scala application

Is there a way to execute PySpark code from a Java/Scala application on an existing SparkSession?
Specifically, given PySpark code that receives and returns a PySpark DataFrame, is there a way to submit it to the Java/Scala SparkSession and get back the output DataFrame:
String pySparkCode = "def my_func(input_df):\n" +
" from pyspark.sql.functions import *\n" +
" return input_df.selectExpr(...)\n" +
" .drop(...)\n" +
" .withColumn(...)\n"
SparkSession spark = SparkSession.builder().master("local").getOrCreate()
Dataset inputDF = spark.sql("SELECT * from my_table")
outputDf = spark.<SUBMIT_PYSPARK_METHOD>(pySparkCode, inputDF)

How to incrementally load, fit with new data, and save a pipeline model using Spark?

Any pointers on how to incrementally train and build the model, and get a prediction for a single element?
The idea is that a web application writes data to a CSV file in a shared path, and the ML application reads that data, loads the model, fits the new data, saves the model, and transforms the test data. (This is supposed to happen in a loop.)
But when loading the saved model the second time, I get the following exception (I am using a min-max scaler to normalize the data):
Exception in thread "main" java.lang.IllegalArgumentException: Output column features_intermediate already exists.
Any pointers would be much appreciated. Thank you.
object RunAppPooling {
  def main(args: Array[String]): Unit = {
    // start the spark session
    val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
      .set("spark.broadcast.compress", "false")
      .setAppName("local-spark")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val filePath = "src/main/resources/train.csv"
    val modelPath = "file:///home/vagrant/custom.model"
    val schema = StructType(
      Array(
        StructField("IDLE_COUNT", IntegerType),
        StructField("TIMEOUTS", IntegerType),
        StructField("ACTIVE_COUNT", IntegerType),
        StructField("FACTOR_LOAD", DoubleType)))

    while (true) {
      // read the raw data
      val df_raw = spark
        .read
        .option("header", "true")
        .schema(schema)
        .csv(filePath)
      df_raw.show()
      println(df_raw.count())

      // fill all na values with 0
      val df = df_raw.na.fill(0)
      df.printSchema()

      // create the feature vector
      val vectorAssembler = new VectorAssembler()
        .setInputCols(Array("IDLE_COUNT", "TIMEOUTS", "ACTIVE_COUNT"))
        .setOutputCol("features_intermediate")

      var lr1: PipelineModel = null
      try {
        lr1 = PipelineModel.load(modelPath)
      } catch {
        case ie: InvalidInputException => println(ie.getMessage)
      }

      import org.apache.spark.ml.feature.StandardScaler
      val scaler = new StandardScaler().setWithMean(true).setWithStd(true).setInputCol("features_intermediate").setOutputCol("features")

      var pipeline: Pipeline = null
      if (lr1 == null) {
        val lr =
          new LinearRegression()
            .setMaxIter(100)
            .setRegParam(0.1)
            .setElasticNetParam(0.8)
            .setLabelCol("FACTOR_LOAD") // setting label column
        // create the pipeline with the steps
        pipeline = new Pipeline().setStages(Array(vectorAssembler, scaler, lr))
      } else {
        pipeline = new Pipeline().setStages(Array(vectorAssembler, scaler, lr1))
      }

      // create the model following the pipeline steps
      val cvModel = pipeline.fit(df)
      // save the model
      cvModel.write.overwrite.save(modelPath)

      var testschema = StructType(
        Array(
          StructField("PACKAGE_KEY", StringType),
          StructField("IDLE_COUNT", IntegerType),
          StructField("TIMEOUTS", IntegerType),
          StructField("ACTIVE_COUNT", IntegerType)
        ))
      val df_raw1 = spark
        .read
        .option("header", "true")
        .schema(testschema)
        .csv("src/main/resources/test_pooling.csv")
      // fill all na values with 0
      val df1 = df_raw1.na.fill(0)

      val extracted = cvModel.transform(df1) //.toDF("prediction")
      import org.apache.spark.sql.functions._
      val test = extracted.select(mean(df("FACTOR_LOAD"))).collect()
      println(test.apply(0))
    }
  }
}
I figured out a way to at least get past the exception, though I'm not sure whether it is the right approach. When creating the pipeline after loading the model, set the stages to only the loaded model, because the model has already been defined with the respective schema. I'm not sure whether this will still normalize the new data.
pipeline = new Pipeline().setStages(Array( lr1))
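For what it's worth, one way to read that workaround, sketched under the assumption that modelPath holds the model saved from the original three-stage pipeline (assembler, scaler, regression): the loaded PipelineModel carries fitted copies of all three stages, so it can be applied to new data directly, and its fitted scaler will still normalize that data.
// Sketch only, assuming modelPath holds a model saved from the
// three-stage pipeline above (assembler, scaler, linear regression).
val loaded = PipelineModel.load(modelPath)
// transform() runs every fitted stage, including the scaler, so the new
// data is normalized with the statistics learned when the model was fit.
val scored = loaded.transform(df1)
scored.select("prediction").show(5)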

Driver doesn't stop on cluster mode

I've configured my cluster (1 master / 9 slaves).
My problem is that when I submit an application (a word count, through spark-submit with deploy-mode cluster), the driver doesn't stop, even though there is little data.
I submitted the application like this:
./spark-submit \
--class wordCount \
--master spark://master:6066 --deploy-mode cluster --supervise \
--executor-cores 1 --total-executor-cores 3 --executor-memory 1g \
hdfs://master:9000/user/exemple/word3.jar \
hdfs://master:9000/user/exemple/texte.txt
hdfs://master:9000/user/exemple/result 2
That's my program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    // get threshold
    val threshold = args(1).toInt
    // read in text file and split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))
    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)
    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
    System.out.println(charCounts.collect().mkString(", "))
  }
}
Result: (screenshot of the application status)
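One detail worth flagging, purely as an assumption about the symptom rather than a confirmed cause: the program above never calls sc.stop(), and the usual recommendation is to stop the SparkContext explicitly so the driver JVM can shut down once the job finishes. A minimal sketch of the same word count with an explicit shutdown:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    try {
      val threshold = args(1).toInt
      val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))
      val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
      val filtered = wordCounts.filter(_._2 >= threshold)
      val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
      println(charCounts.collect().mkString(", "))
    } finally {
      // stopping the context lets the driver exit cleanly when the job is done
      sc.stop()
    }
  }
}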

How to create InputDStream with offsets in PySpark (using KafkaUtils.createDirectStream)?

How to use KafkaUtils.createDirectStream with the offsets for a particular Topic in Pyspark?
If you want to create an RDD from records in a Kafka topic, use a static set of tuples.
First, make all the imports available:
from pyspark.streaming.kafka import KafkaUtils, OffsetRange
Then you create a dictionary of Kafka brokers:
kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
Then you create your offsets object
start = 0
until = 10
partition = 0
topic = 'topic'
offset = OffsetRange(topic,partition,start,until)
offsets = [offset]
Finally you create the RDD:
kafkaRDD = KafkaUtils.createRDD(sc, kafkaParams,offsets)
To create a stream with offsets, you need to do the following:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext
Then you create your StreamingContext using your SparkContext:
ssc = StreamingContext(sc, 1)
Next we set up all of our parameters
kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
start = 0
partition = 0
topic = 'topic'
Then we create our fromOffset Dictionary
topicPartion = TopicAndPartition(topic,partition)
fromOffset = {topicPartion: long(start)}
# notice that we must cast the int to long
Finally we create the Stream
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams,
                                                  fromOffsets=fromOffset)
You can do:
from pyspark.streaming.kafka import TopicAndPartition
topic = "test"
brokers = "localhost:9092"
partition = 0
start = 0
topicpartion = TopicAndPartition(topic, partition)
fromoffset = {topicpartion: int(start)}
kafkaDStream = KafkaUtils.createDirectStream(spark_streaming,[topic], \
{"metadata.broker.list": brokers}, fromOffsets = fromoffset)
Note: Spark 2.2.0, python 3.6
