Seeking Help
I'm new to Spark and just want to understand this. I set up a Spark standalone master by editing spark-env.sh with the following parameters:
export SCALA_HOME=/cats/dev/scala/scala-2.12.2
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=3
export SPARK_WORKER_DIR=/cats/dev/spark-2.1.0-bin-hadoop2.7/work/sparkdata
export SPARK_MASTER_IP="192.168.1.54"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
to run multiple worker instances on the same machine.
To execute a small project I used the command:
/cats/dev/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --class "Person" --master spark://cats-All-Series:7077 /cats/sbt_projects/test/people/target/scala-2.10/people-assembly-1.0.jar
After it finished, the web UI showed the application completion time as 6 seconds, but when I ran the same code on a single machine, without a master and workers, it executed in 0.3 seconds.
Without a master, I ran it in spark-shell: I first loaded the code (scala> :load /cats/scala/db2connectivity/sparkjdbc.scala) and then executed it (scala> sparkjdbc.connectSpark).
import org.apache.spark._
import org.apache.spark.sql.{Dataset, DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import java.io.Serializable
import java.util.List
import java.util.Properties

object Person extends App {
  val conf = new SparkConf().setAppName("Simple Application")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  connectSpark()

  /* Following is the Spark JDBC code, written in Scala */
  def connectSpark(): Unit = {
    val url = "jdbc:db2://localhost:50000/sample"
    val driver = "com.ibm.db2.jcc.DB2Driver"
    val username = "db2inst1"
    val password = "db2inst1"
    val prop = new java.util.Properties
    prop.setProperty("user", username)
    prop.setProperty("password", password)
    Class.forName(driver)

    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", url)
      .option("driver", driver)
      .option("dbtable", "allbands")
      .option("user", "db2inst1")
      .option("password", "db2inst1")
      .load()

    jdbcDF.show()
    println("Done loading")
  }
}
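As a side note, on Spark 2.1 the same JDBC read is usually written against SparkSession rather than the older SQLContext. A minimal sketch, assuming the same DB2 URL, table, and credentials as above (the object name is only illustrative):

import org.apache.spark.sql.SparkSession

object PersonWithSparkSession extends App {
  // Master URL is supplied by spark-submit, so it is not hard-coded here.
  val spark = SparkSession.builder()
    .appName("Simple Application")
    .getOrCreate()

  // Same DB2 connection details as the snippet above.
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:db2://localhost:50000/sample")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "allbands")
    .option("user", "db2inst1")
    .option("password", "db2inst1")
    .load()

  jdbcDF.show()
  println("Done loading")
  spark.stop()
}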
Related
I am using Spark with Scala on my local machine (Spark version 3.0.1). I need to read data from a publicly open S3 bucket from IntelliJ IDEA. Below is my code:
package AWS

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.SparkConf

object Read extends App {
  val spark = SparkSession.builder()
    .master("local[3]")
    .appName("Accessing AWS S3")
    .getOrCreate()

  // spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "XXXXXXXXXXXXXXXXXXXXXXXXX")
  // spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "XXXXXXXXXXXXXXXXXXXXXXXXX")
  // spark.sparkContext.hadoopConfiguration.set("fs.s3n.endpoint", "s3.amazonaws.com")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "XXXXXXXXX")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
  spark.sparkContext.hadoopConfiguration.set("fs.s3n.endpoint", "s3.amazonaws.com")

  val dept_df = spark.read.format("csv").load("s3a://hr-data-lake/departments.csv")
  dept_df.printSchema()
  dept_df.show(truncate = false)
}
//s3://hr-data-lake/departments.csv
//sc.hadoopConfiguration.set("fs.s3a.access.key", "XXXXXXXXX")
//sc.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
//spark.sparkContext.set("fs.s3a.access.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
//sc.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
When I run this, it fails with:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
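This error usually means that the hadoop-aws module, which provides org.apache.hadoop.fs.s3a.S3AFileSystem, is not on the classpath. Below is a hedged sketch of the sbt dependencies that typically resolve it; the versions are assumptions and should be aligned with the Hadoop version your Spark build ships with (the default Spark 3.0.1 distribution bundles Hadoop 3.2.x):

// build.sbt (sketch) -- versions are assumptions, match them to your Spark/Hadoop build
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.0.1",
  "org.apache.hadoop" % "hadoop-aws" % "3.2.0",          // provides org.apache.hadoop.fs.s3a.S3AFileSystem
  "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.375"   // AWS SDK used by the S3A connector
)

Separately, note that the endpoint property for the S3A connector is fs.s3a.endpoint; the snippet above sets fs.s3n.endpoint, which s3a:// reads do not use.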
I am trying to integrate Spark and Neo4j. I am new to Neo4j. I have the following short Spark app
import com.typesafe.config._
import org.apache.spark.sql.SparkSession
import org.neo4j.spark._

object Neo4jStorer {
  var conf: Config = null

  def main(args: Array[String]): Unit = {
    val spark = getSparkSession()
    val sc = spark.sparkContext
    val g = Neo4jGraph.loadGraph(sc, label1 = "a", relTypes = Seq("rel"), label2 = "b")
    val vCount = g.toString
    println("Count= " + vCount)
  }

  def getSparkSession(): SparkSession = {
    SparkSession
      .builder
      .appName("SparkNeo4j")
      .config("spark.neo4j.bolt.url", "neo4j://127.0.0.1:7687")
      .config("spark.neo4j.bolt.user", "neo4j")
      .config("spark.neo4j.bolt.password", "FakePassword")
      .getOrCreate()
  }
}
I used https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/ as an example for this code, as I am using Spark 3.0. When I run it I get the following:
20/10/17 14:36:36 ERROR LoadBalancer: Failed to update routing table for database 'FakePassword'. Current routing table: Ttl 1602963396190, currentTime 1602963396527, routers AddressSet=[], writers AddressSet=[], readers AddressSet=[], database 'FakePassword'.
org.neo4j.driver.exceptions.FatalDiscoveryException: Unable to get a routing table for database 'FakePassword' because this database does not exist
If I change the password I get an authentication error, and again the incorrect password is reported as if it were a database name. I even created a database named FakePassword and still got the same error. Why is this happening, and how can I fix it?
Also, when I try to call g.vertices.count, as shown in the example I am following, I get a compilation error.
With the following code I am able to get data from a DataFrame into Neo4j, which is what I really wanted to do. It does not seem like the ideal solution, since it issues one Cypher query per row inside foreach; I am open to improvements (see the foreachPartition sketch after the snippet).
import com.typesafe.config._
import org.apache.spark.sql.SparkSession
import org.neo4j.driver.{AuthTokens, GraphDatabase, Session}
import org.neo4j.spark._

object StackoverflowAnswer {
  def main(args: Array[String]): Unit = {
    val spark = getSparkSession()
    val sc = spark.sparkContext
    import spark.implicits._

    val df = sc.parallelize(List(1, 2, 3)).toDF
    df.foreach(
      row => {
        val query = "CREATE (n:NumLable {num: " + row.get(0).toString + "})"
        Neo4jSess.session.run(query)
        ()
      }
    )
  }

  def getSparkSession(): SparkSession = {
    SparkSession
      .builder
      .appName("SparkNeo4j")
      .getOrCreate()
  }
}

object Neo4jSess {
  /**
   * Store a Neo4j session in an object so that it can be used by Spark.
   */
  var conf: Config = null
  this.conf = ConfigFactory.load().getConfig("DeltaStorer")

  val neo4jUrl: String = "bolt://127.0.0.1:7687"
  val neo4jUser: String = "neo4j"
  val neo4jPassword: String = "FakePassword"

  val driver = GraphDatabase.driver(neo4jUrl, AuthTokens.basic(neo4jUser, neo4jPassword))
  val session: Session = driver.session()
}
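One common improvement, since the per-row foreach pushes every row through a single shared session, is to work partition by partition so each partition opens its own driver and session and reuses them for all of its rows. A minimal sketch, assuming the same URL and credentials as above and not tested against the poster's setup:

import org.apache.spark.sql.SparkSession
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object ForeachPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkNeo4j").getOrCreate()
    import spark.implicits._

    val df = spark.sparkContext.parallelize(List(1, 2, 3)).toDF

    // One driver/session per partition, reused for every row in that partition,
    // created on the executor side so nothing connection-related is serialized.
    df.rdd.foreachPartition { rows =>
      val driver = GraphDatabase.driver(
        "bolt://127.0.0.1:7687", AuthTokens.basic("neo4j", "FakePassword"))
      val session = driver.session()
      try {
        rows.foreach { row =>
          session.run("CREATE (n:NumLable {num: " + row.get(0).toString + "})")
        }
      } finally {
        session.close()
        driver.close()
      }
    }

    spark.stop()
  }
}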
Please try to update spark-defaults.conf:
spark.jars.packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
spark.neo4j.url bolt://XX.XXX.X.XXX:7687
spark.neo4j.user neo4j
spark.neo4j.password test
We are trying to write a Spark Streaming application that writes to HDFS. However, whenever we write the files, lots of duplicates show up. This happens whether or not we crash the application with a kill, and with both the DStream and Structured Streaming APIs. The source is a Kafka topic. The behavior of the checkpoint directory seems very random, and I have not come across much relevant information on the issue.
The question is: can the checkpoint directory provide exactly-once behavior?
scala version: 2.11.8
spark version: 2.3.1.3.0.1.0-187
kafka version : 2.11-1.1.0
zookeeper version : 3.4.8-1
HDP : 3.1
Any help is appreciated.
Thanks,
Gautam
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object sparkStructuredDownloading {
  val kafka_brokers = "kfk01.*.com:9092,kfk02.*.com:9092,kfk03.*.com:9092"

  def main(args: Array[String]): Unit = {
    val topic = args(0).trim()
    new downloadingAnalysis(kafka_brokers, topic).process()
  }
}

class downloadingAnalysis(brokers: String, topic: String) {
  def process(): Unit = {
    val spark = SparkSession.builder()
      .appName("sparkStructuredDownloading")
      // .appName("kafka_duplicate")
      .getOrCreate()
    spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    println("Application Started")

    import spark.implicits._

    val inputDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      //.option("kafka.group.id", "testduplicate")
      .load()

    // Converting the binary Kafka value to text
    val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
    println("READ STREAM INITIATED")

    val filteredDF = personJsonDf.filter(line => new ParseLogs().validateLogLine(line.get(0).toString()))

    // The body of this UDF was elided in the original post.
    spark.sqlContext.udf.register("parseLogLine", (logLine: String) => {
      // ... log-line parsing logic elided ...
      logLine
    })

    val df1 = filteredDF.selectExpr("parseLogLine(value) as result")
    println(df1.schema)
    println("WRITE STREAM INITIATED")

    val checkpoint_loc = "/warehouse/test_duplicate/download/checkpoint1"
    // "result" in the original snippet appears to refer to df1.
    val kafkaOutput = df1.writeStream
      .outputMode("append")
      .format("orc")
      .option("path", "/warehouse/test_duplicate/download/data1")
      .option("checkpointLocation", checkpoint_loc) // defined but not wired up in the original snippet
      .option("maxRecordsPerFile", 10)
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    kafkaOutput.awaitTermination()
  }
}
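On the exactly-once question: with a file sink, Structured Streaming tracks Kafka offsets in the checkpoint directory and committed output files in the sink's _spark_metadata log, and together these are designed to give exactly-once results for the files as seen through that log. Duplicates typically show up when the checkpoint directory is deleted or changed between runs, or when the output directory is read directly without honouring _spark_metadata. A minimal, untested sketch with the checkpoint wired in (broker and topic names are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ExactlyOnceFileSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("exactlyOnceFileSink").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kfk01.example.com:9092") // placeholder broker
      .option("subscribe", "test_topic")                           // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING)")

    // The file sink records committed files under <path>/_spark_metadata and,
    // with a checkpointLocation that stays the same across restarts, replays
    // Kafka offsets without committing the same data twice to that log.
    val query = input.writeStream
      .format("orc")
      .outputMode("append")
      .option("path", "/warehouse/test_duplicate/download/data1")
      .option("checkpointLocation", "/warehouse/test_duplicate/download/checkpoint1")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    query.awaitTermination()
  }
}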
I am trying to load a DataFrame into a Hive table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._

object SparkToHive {
  def main(args: Array[String]) {
    val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
    val sparkSession = SparkSession.builder.master("local[2]").appName("Saving data into HiveTable using Spark")
      .enableHiveSupport()
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .getOrCreate()

    **import sparkSession.implicits._**

    val partfile = sparkSession.read.text("partfile").as[String]
    val partdata = partfile.map(part => part.split(","))

    case class Partclass(id: Int, name: String, salary: Int, dept: String, location: String)

    val partRDD = partdata.map(line => PartClass(line(0).toInt, line(1), line(2).toInt, line(3), line(4)))
    val partDF = partRDD.toDF()
    partDF.write.mode(SaveMode.Append).insertInto("parttab")
  }
}
I haven't executed it yet but I am getting the following error at this line:
import sparkSession.implicits._
could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[String]
How can I fix this?
Please move your case class Partclass outside of the SparkToHive object. It should be fine then.
Also, there are ** characters in your implicits import statement. Try:
import sparkSession.sqlContext.implicits._
The mistakes I made were:
The case class should be outside main and inside the object.
In the line val partfile = sparkSession.read.text("partfile").as[String], I used read.text("..") to get the file into Spark, whereas read.textFile("...") can be used instead.
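Putting those fixes together, here is a trimmed, untested sketch of the corrected program (the warehouse-related configs from the question are omitted for brevity; the file and table names are taken from the question, and the case class is moved to the top level per the first answer, though defining it inside the object but outside main works as well):

import org.apache.spark.sql.{SaveMode, SparkSession}

// Defined outside main so that Spark can derive an Encoder for it.
case class Partclass(id: Int, name: String, salary: Int, dept: String, location: String)

object SparkToHive {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local[2]")
      .appName("Saving data into HiveTable using Spark")
      .enableHiveSupport()
      .getOrCreate()

    import sparkSession.implicits._

    // read.textFile returns Dataset[String] directly, as the poster noted.
    val partDS = sparkSession.read.textFile("partfile").map { line =>
      val f = line.split(",")
      Partclass(f(0).toInt, f(1), f(2).toInt, f(3), f(4))
    }

    partDS.toDF().write.mode(SaveMode.Append).insertInto("parttab")
  }
}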
I have deployed my Spark jar on the cluster. I submit the Spark job using the spark-submit command followed by my project jar.
I have many SparkConf configurations in my project. The conf is chosen based on which class I am running, but every time I run the Spark job I get this warning:
7/01/09 07:32:51 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Query: Does this mean a SparkContext already exists and my job is picking it up?
Query: Why are some configurations not taking effect?
Code
private val conf = new SparkConf()
  .setAppName("ELSSIE_Ingest_Cassandra")
  .setMaster(sparkIp)
  .set("spark.sql.shuffle.partitions", "8")
  .set("spark.cassandra.connection.host", cassandraIp)
  .set("spark.sql.crossJoin.enabled", "true")

object SparkJob extends Enumeration {
  val Program1, Program2, Program3, Program4, Program5 = Value
}

object ElssieCoreContext {
  def getSparkSession(sparkJob: SparkJob.Value = SparkJob.RnfIngest): SparkSession = {
    val sparkSession = sparkJob match {
      case SparkJob.Program1 => {
        val updatedConf = conf.set("spark.cassandra.output.batch.size.bytes", "2048").set("spark.sql.broadcastTimeout", "2000")
        SparkSession.builder().config(updatedConf).getOrCreate()
      }
      case SparkJob.Program2 => {
        val updatedConf = conf.set("spark.sql.broadcastTimeout", "2000")
        SparkSession.builder().config(updatedConf).getOrCreate()
      }
    }
    sparkSession
  }
}
And in Program1.scala I call it with:
val spark = ElssieCoreContext.getSparkSession()
val sc = spark.sparkContext
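For context on the warning: SparkSession.builder().config(...).getOrCreate() (and SparkContext.getOrCreate underneath) returns the instance that already exists in the JVM instead of building a new one, and SparkContext-level settings passed to the builder at that point are not re-applied; runtime-modifiable SQL confs generally still are. A minimal sketch illustrating the behaviour (app names are illustrative, not from the project):

import org.apache.spark.sql.SparkSession

object GetOrCreateDemo extends App {
  // The first call actually creates the SparkContext with these settings.
  val first = SparkSession.builder()
    .master("local[2]")
    .appName("first-app")
    .getOrCreate()

  // The second call finds the existing session/context and returns it; the new
  // app name is ignored and Spark warns that some configuration may not take effect.
  val second = SparkSession.builder()
    .appName("second-app")
    .config("spark.sql.shuffle.partitions", "64") // runtime SQL confs are still applied
    .getOrCreate()

  println(first eq second)                                  // true: the same session object
  println(second.sparkContext.appName)                      // still "first-app"
  println(second.conf.get("spark.sql.shuffle.partitions"))  // "64"

  first.stop()
}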