Is it possible to create a "Spark MiniCluster" entirely programmatically to run small Spark apps from inside a Scala program? I do NOT want to start the Spark shell, but instead get a "MiniCluster" entirely fabricated in the Main of my program.
You can create an application and use the local master to run Spark in local mode:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object LocalApp {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "local-app", new SparkConf())
    // Do whatever you need
    sc.stop()
  }
}
You can do exactly the same thing with any supported language.
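If you are on Spark 2.x, the same pattern works with SparkSession; a minimal sketch (the app name is arbitrary):
import org.apache.spark.sql.SparkSession

object LocalSessionApp {
  def main(args: Array[String]): Unit = {
    // Fully in-process Spark: driver and executors all run inside this JVM.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-app")
      .getOrCreate()

    // Do whatever you need (spark.sparkContext gives you the RDD API)
    spark.stop()
  }
}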
Related
I have implemented my own structured streaming data source in Spark against a proprietary vendor messaging system. It uses V2 of the structured streaming API, implementing MicroBatchReadSupport and DataSourceRegister. I modeled it largely after some examples found here, and I also followed the advice given at this Stack Overflow post. At first everything seems to start up properly when I call load on the readStream. However, when I try to direct the query to a writeStream, it tries to instantiate another MicroBatchReadSupport. This actually fails fast because I have a check in the createMicroBatchReader method to see whether a schema was provided, and if not, throw an exception. And in the case of the second call to createMicroBatchReader, a schema isn't even provided, even though the initial query did provide one. My code to start the stream (closely following examples from the Spark documentation) looks like the following:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
object Streamer {

  def main(args: Array[String]): Unit = {

    val schema = StructType(
      StructField("TradeId", LongType, nullable = true) ::
      StructField("Source", StringType, nullable = true) :: Nil
    )

    val spark = SparkSession
      .builder
      .getOrCreate()

    val ampsStream = spark.readStream
      .format("amps")
      .option("topic", "/test")
      .option("server", "SOME_URL")
      .schema(schema)
      .load()

    ampsStream.printSchema()

    val query = ampsStream.writeStream.format("console").start()
    query.awaitTermination()
  }
}
I've put breakpoints and debug statements in to test, and createMicroBatchReader gets called again right when I hit writeStream.start. As mentioned, the odd thing, too, is that the second time around the Optional variable that is passed into createMicroBatchReader is empty, whereas the first call properly has the schema. Any guidance would be greatly appreciated.
EDIT: I added some debugging statements and tested this out with the above-mentioned repo at https://github.com/hienluu/wikiedit-streaming, and I see the exact same issue when running WikiEditSourceV2Example.scala from that repo. Not sure if this is a bug, or if the author of the aforementioned repo and I are missing something.
EDIT 2: Adding the code for the amps streaming source
import java.util.Optional
import org.apache.spark.internal.Logging
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.types.StructType
class AmpsStreamingSource extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {

  override def shortName(): String = "amps"

  override def createMicroBatchReader(schema: Optional[StructType],
                                      checkpointLocation: String,
                                      options: DataSourceOptions): MicroBatchReader = {
    println("AmpsStreamingSource.createMicroBatchReader was called")
    if (schema.isPresent) new AmpsMicroBatchReader(schema.get, options)
    else throw new IllegalArgumentException("Must provide a schema for amps stream source")
  }
}
and the signature of AmpsMicroBatchReader
class AmpsMicroBatchReader(schema: StructType, options: DataSourceOptions)
extends MicroBatchReader with MessageHandler with Logging
I want to migrate my old Cassandra cluster data to a new cluster and am thinking of writing some Spark jobs to do that. Is there any way to interact with multiple Cassandra clusters from the same SparkContext, so that I can read data from one cluster and write to another using the saveToCassandra function inside the same Spark job?
val products = sc.cassandraTable("first_cluster","products").cache()
products.saveToCassandra("diff_cluster","products2")
Can we save the data into a different cluster?
Example from spark-cassandra-connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext
def twoClusterExample(sc: SparkContext) = {
  val connectorToClusterOne = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.1"))
  val connectorToClusterTwo = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.2"))

  val rddFromClusterOne = {
    // Sets connectorToClusterOne as the default connection for everything in this code block
    implicit val c = connectorToClusterOne
    sc.cassandraTable("ks", "tab")
  }

  {
    // Sets connectorToClusterTwo as the default connection for everything in this code block
    implicit val c = connectorToClusterTwo
    rddFromClusterOne.saveToCassandra("ks", "tab")
  }
}
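A hypothetical driver showing how the helper above might be wired up (assuming twoClusterExample is in scope; the app name is arbitrary):
import org.apache.spark.{SparkConf, SparkContext}

object CassandraMigration {
  def main(args: Array[String]): Unit = {
    // Hypothetical entry point: builds a context and runs the two-cluster copy above.
    val conf = new SparkConf().setAppName("cassandra-migration")
    val sc = new SparkContext(conf)
    twoClusterExample(sc)
    sc.stop()
  }
}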
I'm trying to access tables on a remote HiveServer2 from Spark using the code below:
import org.apache.spark.SparkContext, org.apache.spark.SparkConf, org.apache.spark.sql._
import com.typesafe.config._
import java.io._
import org.apache.hadoop.fs._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
object stack {

  def main(args: Array[String]) {

    val warehouseLocation = "/usr/hive/warehouse"

    System.setProperty("javax.jdo.option.ConnectionURL", "jdbc:mysql://sparkserver:3306/metastore?createDatabaseIfNotExist=true")
    System.setProperty("javax.jdo.option.ConnectionUserName", "hiveroot")
    System.setProperty("javax.jdo.option.ConnectionPassword", "hivepassword")
    System.setProperty("hive.exec.scratchdir", "/tmp/hive/${user.name}")
    System.setProperty("spark.sql.warehouse.dir", warehouseLocation)
    // System.setProperty("hive.metastore.uris", "thrift://sparkserver:9083")
    System.setProperty("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
    System.setProperty("hive.metastore.warehouse.dir", "/user/hive/warehouse")

    val spark = SparkSession.builder().master("local")
      .appName("spark remote")
      .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://sparkserver:3306/metastore?createDatabaseIfNotExist=true")
      .config("javax.jdo.option.ConnectionUserName", "hiveroot")
      .config("javax.jdo.option.ConnectionPassword", "hivepassword")
      .config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      // .config("hive.metastore.uris", "thrift://sparkserver:9083")
      .config("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
      .config("hive.querylog.location", "/tmp/hivequerylogs/${user.name}")
      .config("hive.support.concurrency", "false")
      .config("hive.server2.enable.doAs", "true")
      .config("hive.server2.authentication", "PAM")
      .config("hive.server2.custom.authentication.class", "org.apache.hive.service.auth.PamAuthenticationProvider")
      .config("hive.server2.authentication.pam.services", "sshd,sudo")
      .config("hive.stats.dbclass", "jdbc:mysql")
      .config("hive.stats.jdbcdriver", "com.mysql.jdbc.Driver")
      .config("hive.session.history.enabled", "true")
      .config("hive.metastore.schema.verification", "false")
      .config("hive.optimize.sort.dynamic.partition", "false")
      .config("hive.optimize.insert.dest.volume", "false")
      .config("datanucleus.fixedDatastore", "true")
      .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
      .config("datanucleus.autoCreateSchema", "false")
      .config("datanucleus.schema.autoCreateAll", "true")
      .config("datanucleus.schema.validateConstraints", "true")
      .config("datanucleus.schema.validateColumns", "true")
      .config("datanucleus.schema.validateTables", "true")
      .config("fs.default.name", "hdfs://sparkserver:54310")
      .config("dfs.namenode.name.dir", "/usr/local/hadoop_tmp/hdfs/namenode")
      .config("dfs.datanode.name.dir", "/usr/local/hadoop_tmp/hdfs/datanode")
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    sql("select * from sample.source").collect.foreach(println)
    sql("select * from sample.destination").collect.foreach(println)
  }
}
The connection request to the metastore is being refused by the remote Hive server.
ERROR:Failed to start hive-metastore.service: Unit hive-metastore.service not found
Thank you!
Normally you don't need to point to a remote metastore separately.
hive-site.xml has the configuration for pointing to the metastore through JDBC internally.
The same configuration can be set as follows in the program, before initializing the HiveContext. Give it a try:
System.setProperty("javax.jdo.option.ConnectionURL", "jdbc:mysql://<ip>/metastore?createDatabaseIfNotExist=true")
...("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
...("javax.jdo.option.ConnectionUserName", "mysql-user")
...("javax.jdo.option.ConnectionPassword", "mysql-passwd")
When you use .config("hive.metastore.uris", "hive2://hiveserver:9083"), hiveserver should be the remote Hive server's IP or hostname.
The conf hive.metastore.uris points to the hive-metastore service; if you are running locally (on localhost) and want a remote metastore, you need to start the hive-metastore service separately:
$HIVE_HOME/bin/hive --service metastore -p 9083
Or, by default, Hive uses a local metastore, in which case you don't need to set any value for hive.metastore.uris.
And, forgot to mention: the property you are setting always uses the thrift protocol, whether it's HiveServer1 or HiveServer2.
So, always use this:
.config("hive.metastore.uris", "thrift://hiveserver:9083")
I have an easy task: I want to read HBase data from a Kerberos-secured cluster.
So far I have tried two approaches:
sc.newAPIHadoopRDD(): here I don't know how to handle the Kerberos authentication
creating an HBase connection from the HBase API: here I don't really know how to convert the result into RDDs
Furthermore, there seem to be some HBase-Spark connectors. But somehow I didn't really manage to find them as a Maven artifact, and/or they require a fixed structure for the result (but I just need the HBase Result object, since the columns in my data are not fixed).
Do you have any examples or tutorials?
I appreciate any help and hints.
Thanks in advance!
I assume that you are using Spark + Scala + HBase.
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.security.UserGroupInformation // needed for the Kerberos login below
object SparkWithMyTable {
  def main(args: Array[String]) {

    // Initiate the Spark context with the Spark master URL. You can modify the URL per your environment.
    val sc = new SparkContext("spark://ip:port", "MyTableTest")

    val tableName = "myTable"
    val keyTabPath = "/path/to/user.keytab" // placeholder: path to your keytab file

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "list of cluster ip's")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.master", "masterIP:60000")
    conf.set("hadoop.security.authentication", "kerberos")
    conf.set("hbase.security.authentication", "kerberos")

    // Log in from the keytab before touching HBase (principal and keytab path are placeholders)
    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab("user#---", keyTabPath)

    // Add local HBase conf
    // conf.addResource(new Path("file://hbase/hbase-0.94.17/conf/hbase-site.xml"))
    conf.set(TableInputFormat.INPUT_TABLE, tableName)

    // create my table with a column family
    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      print("Creating MyTable")
      val tableDesc = new HTableDescriptor(tableName)
      tableDesc.addFamily(new HColumnDescriptor("cf1".getBytes()))
      admin.createTable(tableDesc)
    } else {
      print("Table already exists!!")
      val columnDesc = new HColumnDescriptor("cf1")
      admin.disableTable(Bytes.toBytes(tableName))
      admin.addColumn(tableName, columnDesc)
      admin.enableTable(Bytes.toBytes(tableName))
    }

    // first put some data into the table
    val myTable = new HTable(conf, tableName)
    for (i <- 0 to 5) {
      val p = new Put(new String("row" + i).getBytes())
      p.add("cf1".getBytes(), "column-1".getBytes(), new String("value " + i).getBytes())
      myTable.put(p)
    }
    myTable.flushCommits()

    // how to create the RDD
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    // get the row count
    val count = hBaseRDD.count()
    print("HBase RDD count:" + count)
    System.exit(0)
  }
}
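To work with the raw Result objects themselves (which is what the question asks for, since the columns are not fixed), a minimal sketch of mapping over the RDD, assuming the cf1 / column-1 layout used above:
// Each element of hBaseRDD is an (ImmutableBytesWritable, Result) pair.
val rows = hBaseRDD.map { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  val value  = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("column-1")))
  (rowKey, value)
}
rows.take(5).foreach(println)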
Maven Artifact
<dependency>
  <groupId>it.nerdammer.bigdata</groupId>
  <artifactId>spark-hbase-connector_2.10</artifactId>
  <!-- Version can be changed as per your Spark version; I am using Spark 1.6.x -->
  <version>1.0.3</version>
</dependency>
You can also have a look at:
Spark play with HBase's Result object: handling HBase KeyValue and ByteArray in Scala with Spark -- Real World Examples
scan-that-works-on-kerberos
HBaseScanRDDExample.scala
I am running the program below on Spark 1.3.1. Spark Streaming is watching a directory in HDFS for new files and should process them as they come in. I have read that the best way to do this is to move the files from an existing HDFS location so that the operation is atomic.
I start my streaming job, add a bunch of small files to a random HDFS directory, and then move these files from the original HDFS directory to the watched HDFS directory (all with simple shell commands). But my streaming job is not recognizing these as new files and therefore not processing them.
Currently I am using textFileStream, but I am open to using fileStream. However, I am getting errors with this: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///name/spark-streaming/data/", (p: Path) => true, false)
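As an aside, fileStream expects the new-API input format (org.apache.hadoop.mapreduce.lib.input.TextInputFormat); the program below imports the old org.apache.hadoop.mapred.TextInputFormat, which does not satisfy fileStream's type bound and may be the source of those errors. A minimal sketch under that assumption:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat // new-API format required by fileStream

// Returns a DStream of lines; the filter accepts every path, and newFilesOnly = false
// also considers files already present in the directory when the stream starts.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///name/spark-streaming/data/", (p: Path) => true, newFilesOnly = false)
  .map { case (_, text) => text.toString }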
package com.com.spark.prototype
import java.io.FileInputStream
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark._
import org.apache.spark.streaming._
import com.twitter.algebird.HyperLogLogMonoid
import org.apache.hadoop.io._
object HLLStreamingHDFSTest {

  def functionToCreateContext(): StreamingContext = {
    val conf = new SparkConf().set("spark.executor.extraClassPath", "/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/name/spark-streaming/checkpointing")

    val lines = ssc.textFileStream("hdfs:///name/spark-streaming/data/")

    val hll = new HyperLogLogMonoid(15)
    var globalHll = hll.zero

    val users = lines.map(_.toString().toCharArray.map(_.toByte))

    val approxUsers = users.mapPartitions(ids => {
      ids.map(id => hll(id))
    }).reduce(_ + _)

    approxUsers.foreachRDD(rdd => {
      if (rdd.count() != 0) {
        val partial = rdd.first()
        globalHll += partial
        println()
        println()
        println("Estimated distinct users this batch: %d".format(partial.estimatedSize.toInt))
        println("Estimated distinct users overall: %d".format(globalHll.estimatedSize.toInt))
        println()
        println("Approx distinct users this batch: %s".format(partial.approximateSize.toString))
        println("Approx distinct users overall: %s".format(globalHll.approximateSize.toString))
      }
    })

    ssc
  }

  def main(args: Array[String]): Unit = {
    val context = StreamingContext.getOrCreate("hdfs:///name/spark-streaming/checkpointing", functionToCreateContext _)
    context.start()
    context.awaitTermination()
  }
}