Spark - get tables from database

I have to perform an operation on all the tables from a given database(s), so I am using the following code.
However, it gives me views as well; is there a way I can filter only tables?
Code:
def getTables(databaseName: String)(implicit spark: SparkSession): Array[String] = {
val tables = spark.sql(s"show tables from ${databaseName}").collect().map(_(1).asInstanceOf[String])
logger.debug(s"${tables.mkString(",")} found")
tables
}
Also, `show views` gives an error:
scala> spark.sql("show views from gshah03;").show
org.apache.spark.sql.catalyst.parser.ParseException:
missing 'FUNCTIONS' at 'from'(line 1, pos 11)
== SQL ==
show views from gshah03;
-----------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
... 49 elided

Try this:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalog.{Database, Table}

val df = spark.range(1, 5)
df.createOrReplaceTempView("df_view")
println(spark.catalog.currentDatabase)
val db: Database = spark.catalog.getDatabase(spark.catalog.currentDatabase)
val tables: Dataset[Table] = spark.catalog.listTables(db.name)
tables.show(false)
/**
* default
* +-------+--------+-----------+---------+-----------+
* |name |database|description|tableType|isTemporary|
* +-------+--------+-----------+---------+-----------+
* |df_view|null |null |TEMPORARY|true |
* +-------+--------+-----------+---------+-----------+
*/
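Coming back to the original question (filtering out views), the tableType column shown above is what you can filter on. A rough sketch, assuming the Spark 2.x catalog API where tableType is one of MANAGED, EXTERNAL or VIEW (getTablesOnly is just an illustrative name):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def getTablesOnly(databaseName: String)(implicit spark: SparkSession): Array[String] =
  spark.catalog.listTables(databaseName)
    .filter(col("tableType").isin("MANAGED", "EXTERNAL")) // drops VIEW and TEMPORARY entries
    .collect()
    .map(_.name)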

Related

How do I print out a spark.sql object?

I have a spark.sql object that includes a couple of variables.
import com.github.nscala_time.time.Imports.LocalDate
val first_date = new LocalDate(2020, 4, 1)
val second_date = new LocalDate(2020, 4, 7)
val mydf = spark.sql(s"""
select *
from tempView
where timestamp between '{0}' and '{1}'
""".format(start_date.toString, end_date.toString))
I want to print out mydf because I ran mydf.count and got 0 as the outcome.
I ran mydf and got back mydf: org.apache.spark.sql.DataFrame = [column: type]
I also tried println(mydf) and it didn't return the query.
There is this related question, but it does not have the answer.
How can I print out the query?
The easiest way would be to store your query in a variable and then print the variable to see the query.
Then use the variable in spark.sql.
Example:
In Spark Scala:
val start_date = "2020-01-01"
val end_date = "2020-02-02"
val query = s"""select * from tempView where timestamp between '${start_date}' and '${end_date}'"""
println(query)
//select * from tempView where timestamp between '2020-01-01' and '2020-02-02'
spark.sql(query)
In PySpark:
start_date = "2020-01-01"
end_date = "2020-02-02"
query = "select * from tempView where timestamp between '{0}' and '{1}'".format(start_date, end_date)
print(query)
#select * from tempView where timestamp between '2020-01-01' and '2020-02-02'
#use the same query in spark.sql
spark.sql(query)
Here it is in PySpark.
start_date="2020-01-01"
end_date="2020-02-02"
q="select * from tempView where timestamp between'{0}' and '{1}'".format(start_date,end_date)
print(q)
Here is an online running version: https://repl.it/repls/FeistyVigorousSpyware

How to convert a sql to spark dataset?

I have a val test = sql("Select * from table1") which returns a DataFrame. I want to convert it to a Dataset, which is not working.
test.toDS is throwing an error.
Please provide more detail about the error.
If you want to convert a DataFrame to a Dataset, use the code below:
case class MyClass(field1: Int, field2: Long) // for example
import org.apache.spark.sql.Dataset
import spark.implicits._ // provides the encoders needed by .as[MyClass]
val df = sql("Select * from table1")
val ds: Dataset[MyClass] = df.as[MyClass]
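Once the conversion works, you get typed operations on the Dataset. A small illustrative follow-up (the field names are just the example ones from above, and the query's columns must line up with the case class fields):
val ds: Dataset[MyClass] = sql("Select field1, field2 from table1").as[MyClass]
val firstFields: Dataset[Int] = ds.map(_.field1) // compile-time checked field access; encoders come from spark.implicits._ above
firstFields.show()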

How to pass dataframe in ISIN operator in spark dataframe

I want to pass a dataframe which has a set of values to a new query, but it fails.
1) Here I am selecting a particular column so that I can pass it to ISIN in the next query:
scala> val managerIdDf=finalEmployeesDf.filter($"manager_id"!==0).select($"manager_id").distinct
managerIdDf: org.apache.spark.sql.DataFrame = [manager_id: bigint]
2) My sample data:
scala> managerIdDf.show
+----------+
|manager_id|
+----------+
| 67832|
| 65646|
| 5646|
| 67858|
| 69062|
| 68319|
| 66928|
+----------+
3) When I execute the final query, it fails:
scala> finalEmployeesDf.filter($"emp_id".isin(managerIdDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.DataFrame [manager_id: bigint]
I also tried converting to a List and a Seq, but that generates an error as well. For example, when I convert to a Seq and re-run the query, it throws the error below:
scala> val seqDf=managerIdDf.collect.toSeq
seqDf: Seq[org.apache.spark.sql.Row] = WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
scala> finalEmployeesDf.filter($"emp_id".isin(seqDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class scala.collection.mutable.WrappedArray$ofRef WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
I also referred to this post, but in vain. I am trying this type of query to solve subqueries on Spark DataFrames. Can anyone help?
An alternative approach uses DataFrames, temp views and free-form Spark SQL - don't worry about the logic; it is just a convention and an alternative to your initial approach that should equally suffice:
import spark.implicits._ // for toDF and the $-column syntax
val df2 = Seq(
("Peter", "Doe", Seq(("New York", "A000000"), ("Warsaw", null))),
("Bob", "Smith", Seq(("Berlin", null))),
("John", "Jones", Seq(("Paris", null)))
).toDF("firstname", "lastname", "cities")
df2.createOrReplaceTempView("persons")
val res = spark.sql("""select *
from persons
where firstname
not in (select firstname
from persons
where lastname <> 'Doe')""")
res.show
or
val list = List("Bob", "Daisy", "Peter")
val res2 = spark.sql("select firstname, lastname from persons")
.filter($"firstname".isin(list:_*))
res2.show
or
val query = s"select * from persons where firstname in (${list.map ( x => "'" + x + "'").mkString(",") })"
val res3 = spark.sql(query)
res3.show
or
df2.filter($"firstname".isin(list: _*)).show
or
val list2 = df2.select($"firstname").rdd.map(r => r(0).asInstanceOf[String]).collect.toList
df2.filter($"firstname".isin(list2: _*)).show
In your case specifically:
val seqDf = managerIdDf.rdd.map(r => r(0).asInstanceOf[Long]).collect.toList
finalEmployeesDf.filter($"emp_id".isin(seqDf: _*)).select("*").show
Yes, you cannot pass a DataFrame in isin. isin requires some values that it will filter against.
If you want an example, you can check my answer here
As per the question update, you can make the following change:
.isin(seqDf)
to
.isin(seqDf: _*)
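If collecting the ids to the driver is a concern, a left semi join expresses the same intent entirely on the executors. A sketch, using the column names from the question:
// keeps rows of finalEmployeesDf whose emp_id appears in managerIdDf, without a collect
val managersOnly = finalEmployeesDf.join(
  managerIdDf,
  finalEmployeesDf("emp_id") === managerIdDf("manager_id"),
  "left_semi"
)
managersOnly.show()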

How to access an array using foreach in spark?

I have data like below :
tab1,c1|c2|c3
tab2,d1|d2|d3|d4|d5
tab3,e1|e2|e3|e4
I need to convert it to the following in Spark:
select c1,c2,c3 from tab1;
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
So far I am able to get this:
d.foreach(f=>{println("select"+" "+f+" from"+";")})
select tab3,e1,e2,e3,e4 from;
select tab1,c1,c2,c3 from;
select tab2,d1,d2,d3,d4,d5 from;
Can anyone suggest?
I'm not seeing where spark fits in your question. What does the variable 'd' represent?
Here is my guess at something that may be helpful.
from pyspark.sql.types import *
from pyspark.sql.functions import *
mySchema = StructType([
    StructField("table_name", StringType()),
    StructField("column_name", ArrayType(StringType()))
])
df = spark.createDataFrame([
("tab1",["c1","c2","c3"]),
("tab2",["d1","d2","d3","d4","d5"]),
("tab3",["e1","e2","e3","e4"])
],
schema = mySchema
)
df.selectExpr('concat("select ", concat_ws(",", column_name), " from ", table_name, ";") as select_string').show(3, False)
Output:
+--------------------------------+
|select_string |
+--------------------------------+
|select c1,c2,c3 from tab1; |
|select d1,d2,d3,d4,d5 from tab2;|
|select e1,e2,e3,e4 from tab3; |
+--------------------------------+
You can also use a map operation on an RDD.
Assuming you have an RDD of Strings like:
val rdd = spark.sparkContext.parallelize(Seq(("tab1,c1|c2|c3"), ("tab2,d1|d2|d3|d4|d5"), ("tab3,e1|e2|e3|e4")))
with this operation:
val select = rdd.map(str=> {
val separated = str.split(",", -1)
val table = separated(0)
val cols = separated(1).split("\\|", -1).mkString(",")
"select " + cols + " from " + table + ";"
})
you will get the expected result:
select.foreach(println(_))
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
select c1,c2,c3 from tab1;
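If you also want to run the generated statements (assuming the referenced tables are registered as tables or temp views in the session), you can feed the strings back into spark.sql. A minimal sketch:
select.collect().foreach { stmt =>
  // drop the trailing ';' - the Spark SQL parser does not expect it
  spark.sql(stmt.stripSuffix(";")).show()
}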

Is Datastax UUIDs a wrapper for java.util.UUID [duplicate]

Below is the code block and the error received.
Creating temporary views:
sqlcontext.sql("""CREATE TEMPORARY VIEW temp_pay_txn_stage
USING org.apache.spark.sql.cassandra
OPTIONS (
table "t_pay_txn_stage",
keyspace "ks_pay",
cluster "Test Cluster",
pushdown "true"
)""".stripMargin)
sqlcontext.sql("""CREATE TEMPORARY VIEW temp_pay_txn_source
USING org.apache.spark.sql.cassandra
OPTIONS (
table "t_pay_txn_source",
keyspace "ks_pay",
cluster "Test Cluster",
pushdown "true"
)""".stripMargin)
Querying the views as below to get the new records from stage that are not present in source:
Scala> val df_newrecords = sqlcontext.sql("""Select UUID(),
| |stage.order_id,
| |stage.order_description,
| |stage.transaction_id,
| |stage.pre_transaction_freeze_balance,
| |stage.post_transaction_freeze_balance,
| |toTimestamp(now()),
| |NULL,
| |1 from temp_pay_txn_stage stage left join temp_pay_txn_source source on stage.order_id=source.order_id and stage.transaction_id=source.transaction_id where
| |source.order_id is null and source.transaction_id is null""")`
org.apache.spark.sql.AnalysisException: Undefined function: 'uuid()'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
I am trying to get the UUIDs generated, but I am getting this error.
Here is a simple example of how you can generate a timeuuid:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
val sqlcontext = new SQLContext(sc)
import sqlcontext.implicits._
//Import UUIDs, which contains the method timeBased()
import com.datastax.driver.core.utils.UUIDs
//user-defined function timeUUID which will return a time-based uuid
val timeUUID = udf(() => UUIDs.timeBased().toString)
//sample query to test, you can change it to yours
val df_newrecords = sqlcontext.sql("SELECT 1 as data UNION SELECT 2 as data").withColumn("time_uuid", timeUUID())
//print all the rows
df_newrecords.collect().foreach(println)
Output :
[1,9a81b3c0-170b-11e7-98bf-9bb55f3128dd]
[2,9a831350-170b-11e7-98bf-9bb55f3128dd]
Source : https://stackoverflow.com/a/37232099/2320144
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/utils/UUIDs.html#timeBased--
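As a side note, Spark 2.3 and later also ship a built-in uuid() SQL expression, so a UUID column can be added without a UDF; note it produces a random (version 4) UUID, not a Cassandra-style timeuuid. A minimal sketch:
import org.apache.spark.sql.functions.expr
val df_with_uuid = sqlcontext.sql("SELECT 1 as data UNION SELECT 2 as data")
  .withColumn("id", expr("uuid()")) // random UUID per row, not time-based
df_with_uuid.show(false)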
