I tried to execute my Scala code in the Bluemix Spark service, after running it successfully and getting the right result on my local virtual machine. When I run it in Bluemix Spark, I get no response in the notebook.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.Matrix
val input = sc.textFile("swift://notebooks.spark/pca.csv")
val header = input.first()
val inputData = input.filter(x => x != header).map(line=>line.split(','))
// Build the vectors from the parsed, header-free rows (inputData), not from the raw lines
val inputVector = inputData.map { d =>
  Vectors.dense(
    d(1).toDouble, d(2).toDouble, d(3).toDouble, d(4).toDouble, d(5).toDouble, d(6).toDouble,
    d(7).toDouble, d(8).toDouble, d(9).toDouble, d(10).toDouble, d(11).toDouble)
}
val rowMatrix = new RowMatrix(inputVector)
val pca: Matrix = rowMatrix.computePrincipalComponents(5)
When I execute input.take(2), I get the result as expected, but executing input.foreach(println) shows nothing. It's strange. How can I get the result?
I have tested it on Bluemix in a Scala notebook.
val input = sc.textFile("swift://notebooks.spark/test.csv")
input.take(1) /** shows the first line */
input.foreach(println) /** nothing is displayed */
If you want to display the content of an RDD, you can use the following code.
input.take(5).foreach(println) /** shows the first 5 lines */
input.collect().foreach(println) /** shows all lines */
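I do not know how your local VM is set up, but I think you have to distinguish between running your code locally and running it on a cluster. On a cluster, the function passed to foreach runs on the executors, so println writes to the executors' stdout rather than to the notebook on the driver; take and collect bring the data back to the driver first.
Have a look at this answer for more information: How to print the contents of RDD?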
Related
Due to a specific need, I am trying to run a Python application from inside a Scala application.
Here is the sample code of my applications.
Parent application (Scala):
import scala.sys.process._
import java.io.File
import java.nio.file.{Path => JavaPath}
import org.apache.commons.io.FileUtils
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
class SparkApplication(spark: SparkSession) {
  def run(): Unit = {
    // Build the spark-submit command line for the child Python application
    val command =
      "spark-submit" ::
      s"--keytab $keytab" ::
      "--principal my_username@DOMAIN.RU" ::
      "--master yarn" ::
      "--deploy-mode cluster" ::
      s"$pyFile" :: Nil mkString " "
    command.!!
  }
  val containerPath: JavaPath = new File(SparkFiles.getRootDirectory()).toPath.toAbsolutePath
  val pyFile: File = moveToContainer("spark.py")
  val keytab: File = moveToContainer("my_username.keytab")
  // Copy a classpath resource into the container's working directory
  def moveToContainer(fileName: String): File = {
    val fileSourceStream = getClass.getResourceAsStream(fileName)
    val file = containerPath.resolve(fileName).toFile
    FileUtils.copyInputStreamToFile(fileSourceStream, file)
    file
  }
}
Child application (spark.py):
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName('PysparkSubProcess').enableHiveSupport().getOrCreate()
    rdd = spark.sparkContext.parallelize([1,2,3,4,5,6])
    print("RDD count")
    print(rdd.count())
    df = spark.sql("select 1 as col1")
    df.write.format("parquet").saveAsTable("default.temp_pyspark_table")
So, as you can see above, I am successfully running a Scala application in cluster mode.
This Scala application runs a Python application (spark.py) from inside itself.
Also, Kerberos is configured on our cluster. That's why I use my keytab file for authorization.
But that's not enough. While the Scala application has access to Hive, Hive remains unavailable to the Python application.
That is, I can't save my test table from spark.py.
So the question is: is there any way to reuse the Scala application's authorization for the Python application, so that I don't have to worry about the keytab file and the Hive configuration for the Python application?
I have heard that there are authorization tokens that are created and stored on drivers. Is it possible to reuse such tokens?
(--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION)
Or maybe there are workarounds?
I still can't solve this problem.
I am following the book Spark: The Definitive Guide and was writing a basic program that streams data. The book says that I should use the awaitTermination() method to process the query correctly. When I run the code below, it runs indefinitely until I press Ctrl+C, and then it ends with an exception. My question is: how can I monitor the status of my streaming query so that, as soon as the streaming completes, my program exits after showing the output? In the example code below, I expected the program to end once it had read all the files and written the result to the console, but it didn't. I also tried inserting activityQuery.stop(), but that didn't work either. How can I achieve this? Any help would be appreciated.
from pyspark import SparkConf
from pyspark.sql import *
from pyspark.sql.functions import *
from time import sleep
conf = SparkConf()
spark = SparkSession.builder.config(conf=conf).appName('testapp').getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.streaming.schemaInference", "true")
static = spark.read.format("json").load("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
dataSchema = static.schema
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1).json("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
activityCounts = streaming.groupBy("gt").count()
activityQuery = activityCounts.writeStream.queryName("activity_counts").format("console").outputMode("complete").start()
activityQuery.awaitTermination()
for x in range(5):
    spark.sql("select * from activity_counts").show()
    sleep(1)
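One possible way to end such a program, sketched here as an assumption rather than as the book's method: PySpark's StreamingQuery has a processAllAvailable() method that blocks until the data currently available in the source has been processed, after which the query can be stopped explicitly. The helper name drain_and_stop is hypothetical, and the PySpark documentation describes processAllAvailable() as intended for testing, so this fits a finite set of input files better than a production stream.

```python
def drain_and_stop(query):
    """Wait until the query has processed all currently available data, then stop it.

    `query` is assumed to be a pyspark.sql.streaming.StreamingQuery
    (the object returned by writeStream...start()).
    """
    try:
        # Blocks until everything available in the source has been
        # processed and committed to the sink.
        query.processAllAvailable()
    finally:
        # Stop the query so the program can exit, even if processing failed.
        query.stop()


# Hypothetical usage with the query from the question, in place of
# activityQuery.awaitTermination():
#     drain_and_stop(activityQuery)
```

Because awaitTermination() without a timeout blocks until the query is stopped or fails, any code placed after it (such as the for loop above) is only reached once the query has ended.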
I'm using Spark to write data to HBase, but at the writing stage only one executor and one core are active.
Is there something wrong with my code, and what should I do to make it write faster?
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)
def hbaseWrite(ss: SparkSession, tableList: List[String], df: DataFrame): Unit = {
  val tableName = tableList(0)
  val rowKeyName = tableList(4)
  val rowKeyType = tableList(5)
  hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
  // Write to HBase
  val sc = ss.sparkContext
  sc.hadoopConfiguration.addResource(hbaseConf)
  val columns = df.columns
  val result = df.rdd.mapPartitions(par => {
    par.map(row => {
      var rowkey: String = ""
      if ("String".equals(rowKeyType)) {
        rowkey = row.getAs[String](rowKeyName)
      } else if ("Long".equals(rowKeyType)) {
        rowkey = row.getAs[Long](rowKeyName).toString
      }
      val put = new Put(Bytes.toBytes(rowkey))
      for (name <- columns) {
        var value = row.get(row.fieldIndex(name))
        if (value != null) {
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(name), Bytes.toBytes(value.toString))
        }
      }
      (new ImmutableBytesWritable, put)
    })
  })
  val job = Job.getInstance(sc.hadoopConfiguration)
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Result])
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You may not be able to control how many executors write to HBase in parallel.
However, you can start multiple Spark jobs from a multithreaded client program.
For example, you can have a shell script that triggers multiple spark-submit commands to induce parallelism. Each Spark job can work on its own set of data, independently of the others, and push it into HBase.
This can also be done with the Spark Java/Scala SparkLauncher API, combined with the Java concurrency API (e.g. the Executor framework).
import org.apache.spark.launcher.SparkLauncher
val sparkLauncher = new SparkLauncher
// Set Spark properties; only basic ones are shown here. They will be overridden if properties are set in the main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")
// Launch the Spark application; this returns a handle to the running application
val sparkLauncher1 = sparkLauncher.startApplication()
// Get the application id (it may still be null right after launch)
val jobAppId = sparkLauncher1.getAppId
// Print the job status (RUNNING, SUBMITTED, etc.) once per second until it reaches a final state
while (!sparkLauncher1.getState.isFinal) {
  println(sparkLauncher1.getState.toString)
  Thread.sleep(1000)
}
However, the challenge is to track each of them for failure and automatic recovery. This can be tricky, especially when partial data has already been written to HBase, i.e. when a job fails before processing the complete set of data assigned to it. You may have to clean that data out of HBase automatically before retriggering the job.
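The idea of triggering several spark-submit commands in parallel and tracking which ones failed can be sketched as a small client program. This is only an illustration under assumptions: the helper names run_command and run_in_parallel are hypothetical, and each command is expected to be a complete argument list for an independent job.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one command to completion and return (command, exit code)."""
    completed = subprocess.run(command, check=False)
    return command, completed.returncode


def run_in_parallel(commands, max_workers=4):
    """Run all commands concurrently and return the ones that failed.

    A command counts as failed when it exits with a non-zero status.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_command, commands))
    return [command for command, code in results if code != 0]


# Hypothetical usage: one spark-submit per independent slice of the data.
# The jar path, class name and slice arguments are placeholders.
#     jobs = [
#         ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
#          "--class", "MainClassName", "/path/to/jar/to/be/executed", part]
#         for part in ("slice-0", "slice-1", "slice-2")
#     ]
#     failed = run_in_parallel(jobs)
```

The returned list of failed commands is the point where cleanup of partially written HBase data and a retry could be hooked in.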
I am connecting to HBase (version 1.2) via the Phoenix (4.11) Query Server from Spark 2.2.0, but the DataFrame returns only the table structure with no rows, even though data is present in the table.
Here is the code I am using to connect to queryserver.
// jar: phoenix-4.11.0-HBase-1.2-thin-client.jar
val prop = new java.util.Properties
prop.setProperty("driver", "org.apache.phoenix.queryserver.client.Driver")
val url = "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF"
val d1 = spark.sqlContext.read.jdbc(url,"TABLE1",prop)
d1.show()
Can anyone please help me solve this issue? Thanks in advance.
If you are using Spark 2.2, a better approach would be to load the table directly via Phoenix as a DataFrame. This way you provide only the ZooKeeper URL, and you can supply a predicate so that you load only the required data rather than the entire table.
import org.apache.phoenix.spark._
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
val configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "localhost:2181");
val spark = SparkSession.builder().master("local").enableHiveSupport().getOrCreate()
val df=spark.sqlContext.phoenixTableAsDataFrame("TABLE1",Seq("COL1","COL2"),predicate = Some("\"COL1\" = 1"),conf = configuration)
Read this for more info on getting a table as an RDD and on saving DataFrames and RDDs.
I am trying to use the case class extraction feature of json4s in Spark,
i.e. calling jvalue.extract[MyCaseClass]. It works fine if I bring the JValue objects into the master and do the extraction there, but the same calls fail in the workers:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.util.{Try, Success, Failure}
val sqx = sqlContext
val data = sc.textFile(inpath).coalesce(2000)
case class PageView(
  client: Option[String]
)
def extract(json: JValue) = {
  implicit def formats = org.json4s.DefaultFormats
  Try(json.extract[PageView]).toOption
}
val json = data.map(parse(_)).sample(false, 1e-6).cache()
// count initial inputs
val raw = json.count
// count successful extractions locally -- same value as above
val loc = json.toLocalIterator.flatMap(extract).size
// distributed count -- always zero
val dist = json.flatMap(extract).count // always returns zero
// this throws "org.json4s.package$MappingException: Parsed JSON values do not match with class constructor"
json.map(x => {implicit def formats = org.json4s.DefaultFormats; x.extract[PageView]}).count
The implicit for Formats is defined locally in the extract function because DefaultFormats is not serializable, and defining it at the top level caused it to be serialized for transmission to the workers rather than constructed there. I think the problem still has something to do with the remote initialization of DefaultFormats, but I am not sure what it is.
When I call the extract method directly instead of my extract function, as in the last example, it no longer complains about serialization but just throws an error that the JSON does not match the expected structure.
How can I get the extraction to work when distributed to the workers?
Edit
@WesleyMiao has reproduced the problem and found that it is specific to spark-shell. He reports that this code works as a standalone application.
I got the same exception as you when running your code in spark-shell. However, when I turned your code into a real Spark app and submitted it to a standalone Spark cluster, I got the expected results with no exception.
Below is the code I put in a simple spark app.
val data = sc.parallelize(Seq("""{"client":"Michael"}""", """{"client":"Wesley"}"""))
val json = data.map(parse(_))
val dist = json.mapPartitions { jsons =>
  implicit val formats = org.json4s.DefaultFormats
  jsons.map(_.extract[PageView])
}
dist.collect() foreach println
When I ran it using spark-submit, I got the following result.
PageView(Some(Michael))
PageView(Some(Wesley))
I am also sure that it was not running in "local[*]" mode.
Now I suspect that the exceptions we got in spark-shell have something to do with the definition of the case class PageView in spark-shell and with how spark-shell serializes and distributes it to the executors.
As suggested here, I would move object creation into the map; i.e. I would write a function createPageViews that has extract as an internal function, and pass createPageViews to the workers.
More precisely, I would use mapPartitions instead of map, so that createPageViews (and its internal function definition) is called only once per partition rather than once per record.
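The once-per-partition pattern described above can be sketched in Python. This is an analogy under assumptions, not the json4s code itself: create_page_views mirrors the createPageViews idea, the standard-library JSONDecoder stands in for the non-serializable formats object, and the PageView dataclass mirrors the case class from the question.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class PageView:
    client: Optional[str]


def create_page_views(lines: Iterable[str]) -> Iterator[PageView]:
    """Turn the JSON lines of one partition into PageView objects."""
    # Per-partition setup: created once per call (once per partition),
    # on the worker, instead of being serialized from the driver.
    decoder = json.JSONDecoder()

    def extract(line: str) -> Optional[PageView]:
        try:
            return PageView(client=decoder.decode(line).get("client"))
        except (ValueError, AttributeError):
            # Malformed JSON, or JSON that is not an object: skip the record.
            return None

    for line in lines:
        page_view = extract(line)
        if page_view is not None:
            yield page_view
```

With PySpark, such a function would be passed whole to the workers, for example sc.textFile(inpath).mapPartitions(create_page_views), so that the setup cost is paid per partition and not per record.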