Apache Spark write takes hours - apache-spark

I have a Spark job which performs a right join between two tables. The reading and joining are pretty fast, but when I try to insert the join results into the Cassandra DB it is very slow: it takes more than 30 minutes to insert 1000 rows, and 3 minutes to insert 9 rows. Please see my configuration below. We have 3 Cassandra/Spark nodes and Spark is installed on all of them. I'm pretty new to Spark and can't understand what is wrong. I can insert the same amount of data (more than 2000 rows) with the DSE driver in less than 1 second. I appreciate your time and help!
Spark submit :
"dse -u " + username + " -p " + password + " spark-submit --class com.SparkJoin --executor-memory=20G " +
"SparkJoinJob-1.0-SNAPSHOT.jar " + filterMap.toString() + "
Spark core version : 2.7.2
spark-cassandra-connector_2.11 : 2.3.1
spark-sql_2.11 : 2.3.1
Spark Conf
SparkConf conf = new SparkConf(true).setAppName("Appname");
conf.set("spark.cassandra.connection.host", host);
conf.set("spark.cassandra.auth.username", username);
conf.set("spark.cassandra.auth.password", password);
conf.set("spark.network.timeout", "600s");
conf.set("spark.cassandra.connection.keep_alive_ms", "25000");
conf.set("spark.cassandra.connection.timeout_ms", "5000000");
conf.set("spark.sql.broadcastTimeout", "5000000");
SparkContext sc = new SparkContext(conf);
SparkSession sparkSession = SparkSession.builder().sparkContext(sc).getOrCreate();
SQLContext sqlContext = sparkSession.sqlContext();
sqlContext.setConf("spark.cassandra.connection.host", host);
sqlContext.setConf("spark.cassandra.auth.username", username);
sqlContext.setConf("spark.cassandra.auth.password", password);
sqlContext.setConf("spark.network.timeout", "600s");
sqlContext.setConf("spark.cassandra.connection.keep_alive_ms", "2500000");
sqlContext.setConf("spark.cassandra.connection.timeout_ms", "5000000");
sqlContext.setConf("spark.sql.broadcastTimeout", "5000000");
sqlContext.setConf("spark.executor.heartbeatInterval", "5000000");
sqlContext.setConf("spark.sql.crossJoin.enabled", "true");
Left and Right Table fetch
Dataset<Row> resultsFrame = sqlContext.sql("select * from table where conditions");
return resultsFrame.map((MapFunction<Row, JavaObject>) row -> {
    // some operations here
    return obj;
}, Encoders.bean(JavaObject.class));
Join
Dataset<Row> result = RigtTableJavaRDD.join(LeftTableJavaRDD,
    LeftTableJavaRDD.col("col1").minus(RigtTableJavaRDD.col("col2"))
        .between(new BigDecimal("0").subtract(twoHundredMilliseconds),
                 new BigDecimal("0").add(twoHundredMilliseconds))
        .and(LeftTableJavaRDD.col("col5").equalTo(RigtTableJavaRDD.col("col6"))),
    "right");
Insert Result
CassandraJavaUtil.javaFunctions(resultRDD.javaRDD()).
writerBuilder("keyspace", "table", CassandraJavaUtil.mapToRow(JavaObject.class)).
saveToCassandra();

Related

Connection to remote hbase through scala spark

I am trying to connect to a remote HBase through Scala and Spark, but I have been unable to succeed.
Can anyone suggest any methods related to this?
Thanks in advance.
There are two methods to connect to HBase from Spark/Scala:
HBase REST API
Apache Phoenix -- https://phoenix.apache.org/
HBase REST API code
import org.apache.hadoop.hbase.client.{Get, Put, Result, ResultScanner, Scan}
import org.apache.hadoop.hbase.rest.client.{Client, Cluster, RemoteHTable}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConversions._

// Point the REST client at the HBase REST server (host/IP and port)
val hbaseCluster = new Cluster()
hbaseCluster.add("localhost or <IP>", <port>)
val restClient = new Client(hbaseCluster)
val table = new RemoteHTable(restClient, "STUDENT")
println("connected...")

// Write one row (on newer HBase versions use addColumn instead of add)
val p = new Put(Bytes.toBytes("row1"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("NAME"), Bytes.toBytes("raju"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("COURSE"), Bytes.toBytes("SCALA"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("YEAR"), Bytes.toBytes("2017"))
table.put(p)

// Open a scanner over the table
val scan = new Scan()
val scanner: ResultScanner = table.getScanner(scan)
println("got scanner...")

// Read the row back with a Get
val g = new Get(Bytes.toBytes("row1"))
val result = table.get(g)
val name = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("NAME")))
val course = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("COURSE")))
val year = Bytes.toString(result.getValue(Bytes.toBytes("0"), Bytes.toBytes("YEAR")))
println("row1 " + "name: " + name + " course: " + course + " year: " + year)

// Iterate over the scanner results
for (result <- scanner) {
  val userId = Bytes.toString(result.getValue("NAME".getBytes(), "ID".getBytes()))
  println("userId " + userId)
}
Apache Phoenix
Phoenix provides a Spark plugin and a JDBC connection (via the query server) as well.
spark plugin - https://phoenix.apache.org/phoenix_spark.html
JDBC Connection (query server)- https://phoenix.apache.org/server.html
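As a rough illustration of the Spark plugin route, here is a minimal sketch of loading a Phoenix table as a DataFrame, assuming the phoenix-spark dependency is on the classpath; the table name, ZooKeeper URL, and column are placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-read").getOrCreate()

// Load a Phoenix table through the phoenix-spark data source
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "STUDENT")       // placeholder table name
  .option("zkUrl", "zkhost:2181")   // placeholder ZooKeeper quorum
  .load()

df.filter(df("NAME") === "raju").show()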
I came across a similar problem last week. Eventually I got it working using the HBase Spark connector. It needs quite a bit of setup/configuration; I've documented my steps in the link below:
Setup Apache Zeppelin with Spark and HBase
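For orientation, here is a minimal sketch of what reading a table through the hbase-spark connector's HBaseContext can look like; the table name is a placeholder and the dependency/configuration details are covered in the linked write-up:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-spark-read"))
val hbaseConf = HBaseConfiguration.create() // picks up hbase-site.xml from the classpath

// HBaseContext distributes the HBase configuration to the executors
val hbaseContext = new HBaseContext(sc, hbaseConf)

// Full-table scan returned as an RDD of (row key, Result) pairs
val rdd = hbaseContext.hbaseRDD(TableName.valueOf("STUDENT"), new Scan())
println("rows: " + rdd.count())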

Spark sql querying a Hive table from workers

I am trying to query a Hive table from a map operation in Spark, but when it runs the query the execution freezes.
This is my test code
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes");
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp");
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") //This works
println(res.first())
val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084"))//Some existing NPIs
val rows = NPIs.mapPartitions { partition =>
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
  partition.map { code =>
    val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '" + code + "'") //The program stops here
    res.first()
  }
}
rows.collect().foreach(println)
It loads the data from a CSV, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the master it works perfectly, but if I try to do that in a map operation the execution freezes.
It does not generate any error; it just keeps running without doing anything.
The Spark UI shows this situation.
Actually, I am not sure whether I can query a table in a distributed way at all; I cannot find it in the documentation.
Any suggestion?
Thanks.

Spark query appends keyspace to the spark temp table

I have a cassandraSQLContext where I do this:
cassandraSqlContext.setKeyspace("test");
because if I don't, it complains about the default keyspace not being set.
Now when I run this piece of code:
def insertIntoCassandra(siteMetaData: MetaData, dataFrame: DataFrame): Unit = {
  System.out.println(dataFrame.show())
  val tableName = siteMetaData.getTableName.toLowerCase()
  dataFrame.registerTempTable("spark_" + tableName)
  System.out.println("Registered the spark table to spark_" + tableName)
  val columns = columnMap.get(siteMetaData.getTableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName, columns)
  System.out.println("Query: " + query)
  cassandraSqlContext.sql(query)
  System.out.println("Query executed")
}
It gives me this error log:
Registered the spark table to spark_test
Query: INSERT INTO TABLE test.tablename SELECT **the columns here** FROM spark_tablename
17/02/28 04:15:53 ERROR JobScheduler: Error running job streaming job 1488255351000 ms.0
java.util.concurrent.ExecutionException: java.io.IOException: Couldn't find test.tablename or any similarly named keyspace and table pairs
What I don't understand is why cassandraSQLContext isn't executing the printed-out query, and why it appends the keyspace to the Spark temp table.
public String buildInsertQuery(String activeReplicaKeySpace, String tableName, String columns){
String sql = "INSERT INTO TABLE " + activeReplicaKeySpace + "." + tableName +
" SELECT " + columns + " FROM spark_" + tableName;
return sql;
}
So the problem was that I was using two different instances of cassandraSQLContext. In one of the methods, I was instantiating a new cassandraSQLContext, which was conflicting with the cassandraSQLContext being passed to the insertIntoCassandra method.
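As a loose sketch of that fix, assuming the MetaData, columnMap and cassandraQueryBuilder names from the snippet above: build the CassandraSQLContext once and pass the same instance into insertIntoCassandra (and use it to create the DataFrame), rather than instantiating a second one elsewhere.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Created once, e.g. at application start-up, from the existing SparkContext `sc`
val cassandraSqlContext = new CassandraSQLContext(sc)
cassandraSqlContext.setKeyspace("test")

def insertIntoCassandra(ctx: CassandraSQLContext, siteMetaData: MetaData, dataFrame: DataFrame): Unit = {
  val tableName = siteMetaData.getTableName.toLowerCase()
  // Assumes dataFrame was created through the same shared context,
  // so the registered temp table is visible to ctx.sql below
  dataFrame.registerTempTable("spark_" + tableName)
  val columns = columnMap.get(siteMetaData.getTableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName, columns)
  ctx.sql(query)
}

// Call site passes the single shared instance
insertIntoCassandra(cassandraSqlContext, siteMetaData, dataFrame)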

Reading data from Hbase using Get command in Spark

I want to read data from an HBase table using the Get command, since I also have the key of the row. I want to do that in my Spark Streaming application. Is there any source code someone can share?
You can use Spark's newAPIHadoopRDD to read the HBase table, which returns an RDD.
For example:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("Hbase").setMaster("local")
val sc = new SparkContext(sparkConf)

// Point the HBase configuration at the cluster and the table to read
val conf = HBaseConfiguration.create()
val tableName = "table"
conf.set("hbase.master", "localhost:60000")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)

// Read the table as an RDD of (row key, Result) pairs
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + rdd.count())
sc.stop()
Or you can use a Spark HBase connector like the Hortonworks HBase connector (SHC).
https://github.com/hortonworks-spark/shc
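For a rough idea of the SHC route, here is a minimal sketch of mapping an HBase table with a catalog and filtering by row key; the table, column family, and column names are placeholders, and the exact setup is described in the SHC README:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("shc-read").getOrCreate()

// Catalog describing the HBase table layout (placeholder names)
val catalog = """{
  "table":{"namespace":"default", "name":"STUDENT"},
  "rowkey":"key",
  "columns":{
    "id":{"cf":"rowkey", "col":"key", "type":"string"},
    "name":{"cf":"0", "col":"NAME", "type":"string"}
  }
}"""

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Filters on the row-key column can be pushed down by the connector
df.filter(df("id") === "row1").show()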
You can also use the Spark-Phoenix API.
https://phoenix.apache.org/phoenix_spark.html

Issue while storing data from Spark-Streaming to Cassandra

My Spark Streaming context reads a stream from RabbitMQ with an interval of 30 seconds. I want to modify the values of a few columns of the corresponding rows that already exist in Cassandra and then store the data back to Cassandra. For that I need to check whether the row for the particular primary key exists in Cassandra or not; if yes, fetch it and do the necessary operation. The problem is that I create the StreamingContext on the driver while the actions get performed on the workers, so they cannot get the StreamingContext object, the reason being that it isn't serialized and sent to the workers, and I get this error:
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext. I also know that we cannot access the StreamingContext inside foreachRDD. But how do I achieve the same functionality here without getting a serialization error?
I have looked at a few examples here but they didn't help.
Here is the snippet of the code :
val ssc = new StreamingContext(sparkConf, Seconds(30))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
receiverStream.start()
val lines = receiverStream.map(EventData.fromString(_))

lines.foreachRDD { x =>
  if (x.toLocalIterator.nonEmpty) {
    x.foreachPartition { it =>
      for (tuple <- it) {
        val cookieid = tuple.cookieid
        val sessionid = tuple.sessionid
        val logdate = tuple.logdate
        // This call to the StreamingContext on a worker is what triggers the error
        val EventRows = ssc.cassandraTable("SparkTest", CassandraTable).select("*")
          .where("cookieid = '" + cookieid + "' and logdate = '" + logdate + "' and sessionid = '" + sessionid + "'")
        // some logic to check whether the row exists or not for this cookieid
      }
    }
  }
}
The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this you could use foreachPartition or mapPartitions.
Otherwise, do this within the function that gets passed around:
CassandraConnector(SparkWriter.conf).withSessionDo { session =>
  // ...
  session.executeAsync(<CQL Statement>)
}
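To make that concrete, here is a hedged sketch of how this pattern could slot into the foreachRDD loop from the question; the keyspace, table, and column names are taken from the snippet above, and the CQL itself is only a placeholder:
import com.datastax.spark.connector.cql.CassandraConnector

lines.foreachRDD { rdd =>
  // CassandraConnector is serializable, so it can be built on the driver
  // and used safely inside foreachPartition on the workers
  val connector = CassandraConnector(rdd.sparkContext.getConf)
  rdd.foreachPartition { it =>
    connector.withSessionDo { session =>
      for (tuple <- it) {
        // Placeholder CQL: look up the existing row for this primary key
        val existing = session.execute(
          "SELECT * FROM SparkTest." + CassandraTable +
          " WHERE cookieid = '" + tuple.cookieid + "' and logdate = '" + tuple.logdate +
          "' and sessionid = '" + tuple.sessionid + "'")
        // decide whether the row exists and write the updated row back
      }
    }
  }
}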
And in the SparkConf you need to provide the Cassandra details:
val conf = new SparkConf()
.setAppName("test")
.set("spark.ui.enabled", "true")
.set("spark.executor.memory", "8g")
// .set("spark.executor.core", "4")
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "/ephemeral/spark-events")
//to avoid disk space issues - default is /tmp
.set("spark.local.dir", "/ephemeral/spark-scratch")
.set("spark.cleaner.ttl", "10000")
.set("spark.cassandra.connection.host", cassandraip)
.setMaster("spark://10.255.49.238:7077")
The Java CSVParser is a library class that is not serializable, so Spark cannot ship it to (possibly different) nodes if you call map or foreach on the RDD. One workaround is using mapPartitions, in which case one full partition is processed on one Spark node, so the parser does not need to be serialized for each call. Example:
def pLines(lines: Iterator[String]) = {
  val parser = new CSVParser() // cannot be serialized; would fail if using rdd.map(pLines)
  lines.map(x => parseCSVLine(x, parser.parseLine))
}

val rdd_inital_parse = rdd.mapPartitions(pLines)
Try with x.sparkContext.cassandraTable() instead of ssc.cassandraTable() and see if it helps
