Spark query appends keyspace to the spark temp table - apache-spark

I have a cassandraSQLContext where I do this:
cassandraSqlContext.setKeyspace("test");
Because if I don't, it complains that I haven't set a default keyspace.
Now when I run this piece of code:
def insertIntoCassandra(siteMetaData: MetaData, dataFrame: DataFrame): Unit = {
  System.out.println(dataFrame.show())
  val tableName = siteMetaData.getTableName.toLowerCase()
  dataFrame.registerTempTable("spark_" + tableName)
  System.out.println("Registered the spark table to spark_" + tableName)
  val columns = columnMap.get(siteMetaData.getTableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName, columns)
  System.out.println("Query: " + query)
  cassandraSqlContext.sql(query)
  System.out.println("Query executed")
}
It gives me this error log:
Registered the spark table to spark_test
Query: INSERT INTO TABLE test.tablename SELECT **the columns here** FROM spark_tablename
17/02/28 04:15:53 ERROR JobScheduler: Error running job streaming job 1488255351000 ms.0
java.util.concurrent.ExecutionException: java.io.IOException: Couldn't find test.tablename or any similarly named keyspace and table pairs
What I don't understand is why cassandraSQLContext isn't executing the printed query, and why it appends the keyspace to the Spark temp table.
public String buildInsertQuery(String activeReplicaKeySpace, String tableName, String columns) {
    String sql = "INSERT INTO TABLE " + activeReplicaKeySpace + "." + tableName +
                 " SELECT " + columns + " FROM spark_" + tableName;
    return sql;
}

So the problem was that I was using two different instances of cassandraSQLContext. In one of the methods, I was instantiating a new cassandraSQLContext, which conflicted with the cassandraSQLContext being passed to the insertIntoCassandra method.
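For reference, a minimal sketch of that fix in Scala (MetaData, cassandraQueryBuilder, columnMap and sparkContext are assumed from the surrounding application; CassandraSQLContext comes from the spark-cassandra-connector):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Create one CassandraSQLContext at startup and pass that same instance around,
// so the temp table registered here is visible to the sql() call below.
val cassandraSqlContext = new CassandraSQLContext(sparkContext)
cassandraSqlContext.setKeyspace("test")

def insertIntoCassandra(siteMetaData: MetaData, dataFrame: DataFrame): Unit = {
  val tableName = siteMetaData.getTableName.toLowerCase()
  dataFrame.registerTempTable("spark_" + tableName)
  val columns = columnMap.get(siteMetaData.getTableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName, columns)
  cassandraSqlContext.sql(query) // same instance that registered the temp table
}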

Related

Apache Spark write takes hours

I have a Spark job that does a right join on two tables. The reading and joining are pretty fast, but when I try to insert the join results into the Cassandra DB it is very slow: it takes more than 30 minutes to insert 1000 rows, and 3 minutes to insert a 9-row record. Please see my configuration below. We have 3 Cassandra and Spark nodes, and Spark is installed on all of them. I'm pretty new to Spark and can't understand what is wrong. I can insert the same amount of data with the DSE driver in less than 1 second (more than 2000 rows). I appreciate your time and help!
Spark submit:
"dse -u " + username + " -p " + password + " spark-submit --class com.SparkJoin --executor-memory=20G " +
"SparkJoinJob-1.0-SNAPSHOT.jar " + filterMap.toString() + "
Spark core version: 2.7.2
spark-cassandra-connector_2.11: 2.3.1
spark-sql_2.11: 2.3.1
Spark Conf
SparkConf conf = new SparkConf(true).setAppName("Appname");
conf.set("spark.cassandra.connection.host", host);
conf.set("spark.cassandra.auth.username", username);
conf.set("spark.cassandra.auth.password", password);
conf.set("spark.network.timeout", "600s");
conf.set("spark.cassandra.connection.keep_alive_ms", "25000");
conf.set("spark.cassandra.connection.timeout_ms", "5000000");
conf.set("spark.sql.broadcastTimeout", "5000000");
SparkContext sc = new SparkContext(conf);
SparkSession sparkSession = SparkSession.builder().sparkContext(sc).getOrCreate();
SQLContext sqlContext = sparkSession.sqlContext();
sqlContext.setConf("spark.cassandra.connection.host", host);
sqlContext.setConf("spark.cassandra.auth.username", username);
sqlContext.setConf("spark.cassandra.auth.password", password);
sqlContext.setConf("spark.network.timeout", "600s");
sqlContext.setConf("spark.cassandra.connection.keep_alive_ms", "2500000");
sqlContext.setConf("spark.cassandra.connection.timeout_ms", "5000000");
sqlContext.setConf("spark.sql.broadcastTimeout", "5000000");
sqlContext.setConf("spark.executor.heartbeatInterval", "5000000");
sqlContext.setConf("spark.sql.crossJoin.enabled", "true");
Left and right table fetch:
Dataset<Row> resultsFrame = sqlContext.sql("select * from table where conditions");
return resultsFrame.map((MapFunction<Row, JavaObject>) row -> {
// some operations here
return obj;
}, Encoders.bean(JavaObject.class)
);
Join
Dataset<Row> result = RigtTableJavaRDD.join(LeftTableJavaRDD,
(LeftTableJavaRDD.col("col1").minus(RigtTableJavaRDD.col("col2"))).
between(new BigDecimal("0").subtract(twoHundredMilliseconds), new BigDecimal("0").add(twoHundredMilliseconds))
.and(LeftTableJavaRDD.col("col5").equalTo(RigtTableJavaRDD.col("col6")))
, "right");
Insert Result
CassandraJavaUtil.javaFunctions(resultRDD.javaRDD()).
writerBuilder("keyspace", "table", CassandraJavaUtil.mapToRow(JavaObject.class)).
saveToCassandra();
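Two things are usually worth checking in a setup like this (a speculative sketch in Scala, not a confirmed fix; the option values below are examples only):

// 1. The non-equi join condition (between(...) on col1 - col2) can force a
//    nested-loop join; the physical plan shows whether that is the case.
result.explain()

// 2. If the Cassandra write itself is the bottleneck, the spark-cassandra-connector
//    exposes output tuning options that can be set on the SparkConf.
conf.set("spark.cassandra.output.concurrent.writes", "8")
conf.set("spark.cassandra.output.batch.size.rows", "auto")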

How does Spark work with a JDBC connection?

I am new to Spark and I am trying to work on a spark-jdbc program to count the number of rows in a database.
I have come up with this code:
import java.io.FileInputStream
import java.util.Properties

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PartitionRetrieval {
  var conf = new SparkConf().setAppName("Spark-JDBC")
  val log = LogManager.getLogger("Spark-JDBC Program")
  Logger.getLogger("org").setLevel(Level.ERROR)
  val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
  val properties = new Properties()
  properties.load(new FileInputStream(conFile))
  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val driverClass = properties.getProperty("gpDriverClass")
  val tableName = "source.bank_accounts"
  try {
    Class.forName(driverClass).newInstance()
  } catch {
    case cnf: ClassNotFoundException =>
      log.error("Driver class: " + driverClass + " not found")
      System.exit(1)
    case e: Exception =>
      log.error("Exception: " + e.printStackTrace())
      System.exit(1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
    val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
      .option("dbtable", tableName)
      .option("user", devUserName)
      .option("password", devPassword).load()
    val rc = gpTable.filter(gpTable("source_system_name") === "ORACLE").count()
    println("gpTable Count: " + rc)
  }
}
So far, this code is working. But I have 2 conceptual doubts about this.
In Java, we create a connection class and use that connection to query multiple tables, closing it once our requirement is met. But Spark appears to work in a different way.
If I have to query 10 tables in a database, should I use this line 10 times with different table names in it:
val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable",tableName)
.option("user",devUserName)
.option("password",devPassword).load()
The table currently used here has 2000 rows in total, and I can use the filter/select/aggregate functions on it as needed.
But in production there are tables with millions of rows, and if I put one of those huge tables in the above statement, even though our requirement filters it later, wouldn't it create a huge DataFrame first?
Could anyone give me some insight into the doubts I mentioned above?
Pass an SQL query to it first; this is known as pushdown to the database.
E.g.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample where k = 1) e", connectionProperties)
You can use an s"""...""" interpolated string to substitute host variables for the k = 1, or build your own SQL string and reuse it as you suggest, but if you don't, the world will still exist.
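A minimal sketch of that pushdown with an interpolated string (the URL, credentials, table and predicate below are placeholders, not values from the question):

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-pushdown").getOrCreate()

val jdbcUrl = "jdbc:postgresql://dbhost:5432/sampledb" // placeholder URL
val connectionProperties = new Properties()
connectionProperties.put("user", "devUserName")
connectionProperties.put("password", "devPassword")

val k = 1
// The subquery runs in the database, so only the matching rows are read into Spark.
val pushdownQuery = s"(select k, v from sample where k = $k) e"
val dataframe_mysql = spark.read.jdbc(jdbcUrl, pushdownQuery, connectionProperties)
dataframe_mysql.show()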

Spark sql querying a Hive table from workers

I am trying to query a Hive table from a map operation in Spark, but when it runs the query the execution freezes.
This is my test code
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes");
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp");
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") //This works
println(res.first())
val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084"))//Some existing NPIs
val rows = NPIs.mapPartitions { partition =>
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
  partition.map { code =>
    val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '" + code + "'") //The program stops here
    res.first()
  }
}
rows.collect().foreach(println)
It loads the data from a CSV, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the master it works perfectly, but if I try to do that in a map operation the execution freezes.
It does not generate any error; it just keeps running without doing anything.
The Spark UI shows this situation
Actually, I am not sure whether I can query a table in a distributed way; I cannot find it in the documentation.
Any suggestion?
Thanks.
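One thing to note: SparkSession and DataFrame/SQL operations cannot be used inside RDD transformations that run on the executors, which matches the freeze inside mapPartitions. A minimal sketch of the usual alternative, expressing the lookup as a join planned on the driver (table and column names follow the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// The NPI codes to look up, as a small DataFrame instead of an RDD.
val npis = List("1679576722", "1588667638", "1306849450", "1932102084").toDF("NPI")

// Join the code list against the Hive table; Spark distributes the work itself.
val rows = spark.table("npicodes").join(npis, "NPI")
rows.show()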

How to convert a Cassandra ResultSet to a Spark DataFrame?

I would normally load data from Cassandra into Apache Spark this way using Java:
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
"WHERE CAST(store_id as string) = '" + storeId + "'");
But imagine I have a sharder and I need to load several partition keys into this DataFrame. I could use WHERE IN (...) in my query and again use the cassandraSql method. But I am a bit reluctant to use WHERE IN due to the infamous problem of having a single point of failure in the coordinator node. This is explained here:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Is there a way to use several queries but load them into a single DataFrame?
One way to do this would be to run individual queries and unionAll/union the resulting DataFrames/RDDs.
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId1 + "'");
DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId2 + "'");
DataFrame allCustomers = customersOne.unionAll(customersTwo);
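And a sketch of how the same idea generalizes to any number of shards (in Scala for brevity; storeIds is a hypothetical list of shard keys and sparkContext is assumed to exist):

import org.apache.spark.sql.cassandra.CassandraSQLContext

val sqlContext = new CassandraSQLContext(sparkContext)
sqlContext.setKeyspace("midatabase")

val storeIds = Seq("1", "2", "3") // hypothetical shard keys
val perShard = storeIds.map { storeId =>
  sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    s"WHERE CAST(store_id as string) = '$storeId'")
}
// Each query targets a single partition; the results are then combined.
val allCustomers = perShard.reduce(_ unionAll _)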

cassandra uuid in SELECT statements with Spark SQL

I have the following table in cassandra (v. 2.2.3)
cqlsh> DESCRIBE TABLE historian.timelines;
CREATE TABLE historian.timelines (
assetid uuid,
tslice int,
...
value map<text, text>,
PRIMARY KEY ((assetid, tslice), ...)
) WITH CLUSTERING ORDER BY (deviceid ASC, paramid ASC, fts DESC)
...
;
And I want to extract the data through Apache Spark (v. 1.5.0) via the following java snippet (using the cassandra spark connector v. 1.5.0 and cassandra driver core v. 2.2.0 RC3):
// Initialize Spark SQL Context
CassandraSQLContext sqlContext = new CassandraSQLContext(jsc.sc());
sqlContext.setKeyspace(keyspace);
DataFrame df = sqlContext.sql("SELECT * FROM " + tableName +
" WHERE assetid = '085eb9c6-8a16-11e5-af63-feff819cdc9f' LIMIT 2");
df.show();
At this point I get the following error accessing show method above:
cannot resolve '(assetid = cast(085eb9c6-8a16-11e5-af63-feff819cdc9f as double))' due to data type mismatch:
differing types in '(assetid = cast(085eb9c6-8a16-11e5-af63-feff819cdc9f as double))' (uuid and double).;
So it seems that Spark SQL is not interpreting the assetid input as a UUID. What can I do to handle the Cassandra UUID type in Spark SQL queries?
Thanks!
Indeed, your query parameter is a String, not a UUID; simply convert the query parameter like this:
import java.util.UUID;
DataFrame df = sqlContext.sql("SELECT * FROM " + tableName +
" WHERE assetid = "+ UUID.fromString("085eb9c6-8a16-11e5-af63-feff819cdc9f") +" LIMIT 2");
