Error in Caching a Table in SparkSQL - apache-spark

I am trying to cache a table available in Hive (using spark-shell). Given below is my code:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.cacheTable("sparkdb.firsttable")
and I am getting the exception below:
org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:112)
The table firsttable is available in the database sparkdb (in Hive). The issue seems to be with providing the database name. How do I achieve this?
PS: A HiveQL query like the one shown below works without any issues:
scala> hiveContext.sql("select * from sparkdb.firsttable")
Below are the results from a few other method calls:
scala> hiveContext.tables("sparkdb")
res14: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]
scala> hiveContext.tables("sparkdb.firsttable")
res15: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]

Aha! I was right; this seems to be SPARK-8105. So, for now, your best bet is to do the select * and cache that.
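A minimal sketch of that workaround (the temporary table name below is just illustrative):
val df = hiveContext.sql("select * from sparkdb.firsttable")
df.cache()                                   // cache the DataFrame directly, or:
df.registerTempTable("firsttable_cached")    // register under an unqualified, illustrative name
hiveContext.cacheTable("firsttable_cached")  // cacheTable then works without a database prefix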

Related

Spark SQL 'SHOW TABLE EXTENDED FROM db LIKE table' gives a different result from Hive

spark.sql("SHOW TABLE EXTENDED IN DB LIKE 'TABLE'")
Beeline >>SHOW TABLE EXTENDED IN DB LIKE 'TABLE';
Both queries return different results. If I run the same query in Spark, it gives a different result than Hive: the format and lastUpdatedTime are missing in Spark SQL.
If anyone has an idea, please let me know how to see the lastUpdatedTime of a Hive table from Spark SQL.
Try this -
scala> val df = spark.sql(s"describe extended ${db}.${table_name}").select("data_type").where("col_name == 'Table Properties'")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [data_type: string]
scala> df.map(r => r.getString(0).split(",")(1).trim).collect
res39: Array[String] = Array(last_modified_time=1539848078)
scala> df.map(r => r.getString(0).split(",")(1).trim.split("=")(1)).collect.mkString
res41: String = 1539848078

RDD subtract doesn't work for user defined types [duplicate]

This question already has an answer here: Case class equality in Apache Spark (1 answer). Closed 5 years ago.
I tried this simple example
scala> rdd2.collect
res45: Array[Person] = Array(Person(Mary,28,New York), Person(Bill,17,Philadelphia), Person(Craig,34,Philadelphia), Person(Leah,26,Rochester))
scala> rdd3.collect
res44: Array[Person] = Array(Person(Mary,28,New York), Person(Bill,17,Philadelphia), Person(Craig,35,Philadelphia), Person(Leah,26,Rochester))
scala> rdd2.subtract(rdd3).collect
res46: Array[Person] = Array(Person(Mary,28,New York), Person(Leah,26,Rochester), Person(Bill,17,Philadelphia), Person(Craig,34,Philadelphia))
I expect rdd2.subtract(rdd3).collect to contain only Person(Craig,34,Philadelphia), but I get all of rdd2 as my output. Can anyone please explain this?
This is one of the known issues with the Scala REPL, where equality checks on case classes defined in the REPL don't work properly. Try the following to fix it. The issue occurs only in the REPL and goes away when you run the application via spark-submit.
This issue is explained in detail in this ticket.
scala> :paste -raw // make sure you are using Scala 2.11 for the raw option to work.
// Entering paste mode (ctrl-D to finish)
package mytest;
case class Person(name: String, age: Int, city: String);
// Exiting paste mode, now interpreting.
scala> import mytest.Person
scala> val rdd2 = sc.parallelize(Seq(Person("Mary",28,"New York"), Person("Bill",17,"Philadelphia"), Person("Craig",34,"Philadelphia"), Person("Leah",26,"Rochester")))
rdd2: org.apache.spark.rdd.RDD[mytest.Person] = ParallelCollectionRDD[6] at parallelize at <console>:25
scala> val rdd3 = sc.parallelize(Seq(Person("Mary",28,"New York"), Person("Bill",17,"Philadelphia"), Person("Craig",35,"Philadelphia"), Person("Leah",26,"Rochester")))
rdd3: org.apache.spark.rdd.RDD[mytest.Person] = ParallelCollectionRDD[7] at parallelize at <console>:25
scala> rdd2.subtract(rdd3).collect
res1: Array[mytest.Person] = Array(Person(Craig,34,Philadelphia))

While joining two dataframes in Spark, getting empty result

I am trying to join two dataframes in Spark that come from a Cassandra database.
val table1=cc.sql("select * from test123").as("table1")
val table2=cc.sql("select * from test1234").as("table2")
table1.join(table2, table1("table1.id") === table2("table2.id1"), "inner")
.select("table1.name", "table2.name1")
The result I am getting is empty.
You can try the pure SQL way if you are unsure of the join syntax here.
table1.registerTempTable("tbl1")
table2.registerTempTable("tbl2")
val table3 = sqlContext.sql("SELECT tbl1.name, tbl2.name1 FROM tbl1 INNER JOIN tbl2 ON tbl1.id = tbl2.id1")
Also, you should check whether table1 and table2 really have matching ids to join on in the first place; see the sketch below.
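For example, a quick sanity-check sketch (the column names id and id1 are taken from the question):
// Count ids present in both frames; 0 would explain the empty join result.
val matching = table1.select("id").intersect(table2.select("id1").withColumnRenamed("id1", "id"))
println(s"Matching ids: ${matching.count()}")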
Update:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Ideally, yes, csc should also work.
You should refer to http://spark.apache.org/docs/latest/sql-programming-guide.html
First union both data frames and then register the result as a temp table; a sketch of this is below.
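If that is the intent, a rough sketch (this only applies when both frames share the same schema; the temp-table name is illustrative):
val combined = table1.unionAll(table2)    // rows from both frames
combined.registerTempTable("combined")    // illustrative temp-table name
val result = sqlContext.sql("select * from combined")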

writetime of cassandra row in spark

I'm using Spark with Cassandra, and I want to select the writeTime of my rows from my Cassandra table. This is my request:
val lines = sc.cassandraTable[(String, String, String, Long)](CASSANDRA_SCHEMA, table).select("a", "b", "c", "writeTime(d)").count()
but it displays this error:
java.io.IOException: Column channal not found in table test.mytable
I've also tried this request:
val lines = sc.cassandraTable[(String, String, String, Long)](CASSANDRA_SCHEMA, table).select("a", "b", "c", WRITETIME("d")).count()
but it displays this error:
<console>:25: error: not found: value WRITETIME
How can I get the writeTime of my rows?
Thanks.
Edit: This has been fixed in the 1.2 release of the connector
Currently the Connector doesn't support passing through CQL functions when reading from Cassandra. I've taken note of this and will start up a ticket for implementing this functionality.
https://datastax-oss.atlassian.net/browse/SPARKC-55
For a workaround, you can always use the direct connector within your operations, like in:
import com.datastax.spark.connector.cql.CassandraConnector
val cc = CassandraConnector(sc.getConf)
val select = s"SELECT WRITETIME(userId) FROM cctest.users where userid=?"
val ids = sc.parallelize(1 to 10)
import scala.collection.JavaConversions._   // lets flatMap consume the Java ResultSet
ids.flatMap(id =>
  cc.withSessionDo(session =>
    session.execute(select, id.toInt: java.lang.Integer)))
Code modified from
Filter from Cassandra table by RDD values
In cassandra-spark-connector 1.2, you can get TTL and write time by writing:
sc.cassandraTable(...).select("column1", WriteTime("column2"), TTL("column3"))
Take a look at this ticket.
For usage, take a look at integration tests here.

How to implement "Cross Join" in Spark?

We plan to move Apache Pig code to the new Spark platform.
Pig has a "Bag/Tuple/Field" concept and behaves similarly to a relational database. Pig provides support for CROSS/INNER/OUTER joins.
For CROSS JOIN, we can use alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
But as we move to the Spark platform I couldn't find any counterpart in the Spark API. Do you have any idea?
It is oneRDD.cartesian(anotherRDD).
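A short sketch of the RDD-level cartesian product:
val left = sc.parallelize(Seq(1, 2, 3))
val right = sc.parallelize(Seq("a", "b"))
val crossed = left.cartesian(right)   // RDD[(Int, String)] with every (left, right) pair
crossed.count()                       // 6 = 3 * 2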
Here is the recommended version for Spark 2.x Datasets and DataFrames:
scala> val ds1 = spark.range(10)
ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> ds1.cache.count
res1: Long = 10
scala> val ds2 = spark.range(10)
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> ds2.cache.count
res2: Long = 10
scala> val crossDS1DS2 = ds1.crossJoin(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]
scala> crossDS1DS2.count
res3: Long = 100
Alternatively it is possible to use the traditional JOIN syntax with no join condition. Use this configuration option to avoid the error that follows.
spark.conf.set("spark.sql.crossJoin.enabled", true)
Error when that configuration is omitted (using the "join" syntax specifically):
scala> val crossDS1DS2 = ds1.join(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]
scala> crossDS1DS2.count
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
...
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
Related: spark.sql.crossJoin.enabled for Spark 2.x
