Spark SQL - load data with JDBC using SQL statement, not table name - apache-spark

I think I am missing something but can't figure out what.
I want to load data using SQLContext and JDBC with a particular SQL statement, like:
select top 1000 text from table1 with (nolock)
where threadid in (
select distinct id from table2 with (nolock)
where flag=2 and date >= '1/1/2015' and userid in (1, 2, 3)
)
Which method of SQLContext should I use? The examples I've seen always specify a table name and lower and upper bounds.
Thanks in advance.

You should pass a valid subquery as the dbtable argument. For example, in Scala:
val query = """(SELECT TOP 1000
-- and the rest of your query
-- ...
) AS tmp -- alias is mandatory*"""
val url: String = ???
val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map("url" -> url, "dbtable" -> query))
  .load()
* Hive Language Manual SubQueries: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
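Applied to the query from the question, a minimal sketch (assuming a SQL Server source and a sqlContext already in scope; the JDBC URL is just a placeholder):
// The original query, wrapped in parentheses and given the mandatory alias,
// is passed through the "dbtable" option.
val url = "jdbc:sqlserver://<host>:1433;databaseName=<db>;user=<user>;password=<password>"  // placeholder URL
val query =
  """(SELECT TOP 1000 text
    |   FROM table1 WITH (NOLOCK)
    |   WHERE threadid IN (
    |     SELECT DISTINCT id FROM table2 WITH (NOLOCK)
    |     WHERE flag = 2 AND date >= '1/1/2015' AND userid IN (1, 2, 3)
    |   )) AS tmp""".stripMargin
val jdbcDF = sqlContext.read
  .format("jdbc")
  .options(Map("url" -> url, "dbtable" -> query))
  .load()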

val url = "jdbc:postgresql://localhost/scala_db?user=scala_user"
Class.forName(driver)
val connection = DriverManager.getConnection(url)
val df2 = spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", "(select id,last_name from emps) e")
.option("user", "scala_user")
.load()
The key is "(select id,last_name from emps) e", here you can write a subquery in place of table_name.

Related

Query table names

I have been given access to a database. I am querying the data from a Spark cluster. How do I check all the databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
  "user": jdbcUsername,
  "password": jdbcPassword,
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.
In SQL Server itself:
select *
from sys.tables
I'm not sure whether you use a synonym or not as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with it. There are different ways of writing this: the tables approach or your own SQL query approach. The above is the tables approach; here is an example of the SQL query approach:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
That is the Scala version, I just realized. For pyspark specifically, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It is the same approach; the documentation example happens to use postgres:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
JDBC loading and saving can be achieved via either the load/save or jdbc methods, see the guide.
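Sticking with the Scala form used earlier in this answer, a rough sketch that lists databases as well as tables; sys.databases and INFORMATION_SCHEMA.TABLES are standard SQL Server catalog views, and jdbcUrl, jdbcUsername and jdbcPassword are assumed to be defined as in the question:
import java.util.Properties

val props = new Properties()
props.put("user", jdbcUsername)
props.put("password", jdbcPassword)
props.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Databases visible to this login on the server.
val databases = spark.read.jdbc(jdbcUrl, "sys.databases", props)
databases.select("name").show(false)

// Tables in the current database (an alternative to sys.tables).
val tables = spark.read.jdbc(jdbcUrl, "INFORMATION_SCHEMA.TABLES", props)
tables.select("TABLE_SCHEMA", "TABLE_NAME").show(false)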

Is there a way to define "partitionColumn" in option("partitionColumn", "colname") in Spark JDBC if the column is of datatype String?

I am trying to load data from an RDBMS into a Hive table on HDFS. I am reading the RDBMS table in the below way:
val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "select * from dev.userlocations")
  .option("user", usrname)
  .option("password", pwd)
  .option("numPartitions", 20)
  .load()
I can see in the executor logs that the option("numPartitions", 20) is not taking effect and the entire data is dumped into a single executor.
Now there are options to provide the partition column, lower bound & upper bound as below:
val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "select * from dev.userlocations")
  .option("user", usrname)
  .option("password", pwd)
  .option("partitionColumn", "columnName")
  .option("lowerBound", "x")
  .option("upperBound", "y")
  .option("numPartitions", 20)
  .load()
The above only works if the partition column is of a numeric datatype. The table I am reading is partitioned on a column, location; it is about 5 GB overall and has 20 different partitions based on that column, i.e. 20 distinct locations. Is there any way I can read the table in partitions based on the table's partition column, location?
Could anyone let me know if it can be implemented at all?
You can use the predicates option for this. It takes an array of strings, and each item in the array is a condition for partitioning the source table; the total number of partitions is determined by those conditions.
val preds = Array[String]("location = 'LOC1'", "location = 'LOC2' OR location = 'LOC3'")
val df = spark.read.jdbc(
  url = databaseUrl,
  table = tableName,
  predicates = preds,
  connectionProperties = properties
)
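If the distinct locations are not known up front, one possible sketch (this goes beyond the answer above; dev.userlocations and the variable names are taken from the question and answer) is to read them first and build one predicate per value:
// Read the distinct locations once, then build one predicate per value.
// Assumes databaseUrl and properties as above, and that location values contain no quotes.
val locations = spark.read
  .jdbc(databaseUrl, "(select distinct location from dev.userlocations) t", properties)
  .collect()
  .map(_.getString(0))

val preds = locations.map(loc => s"location = '$loc'")

val mydata = spark.read.jdbc(
  url = databaseUrl,
  table = "dev.userlocations",
  predicates = preds,
  connectionProperties = properties
)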

Join is not working for Spark when using Drill [duplicate]

This question was closed as a duplicate; its body and answers are identical to the first question above.

Query to create a dropdown list from two tables having same column name in spark

I have two tables, student and employee, that have the same column, location. I want to write a query in Spark SQL to create a dropdown list for the location column which should include records from both tables.
You should read the location column from both tables separately and then merge them using union.
val employees:DataFrame = ... //read from employees table
val students:DataFrame = ... //read from students table
val locations:DataFrame = employees.select("location").union(students.select("location")).dropDuplicates
Reading from JDBC is done like this (taken from the Spark documentation):
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
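Putting the two pieces together, a rough sketch; the table names employee and student and the URL/credentials are assumptions based on the question:
// Read just the location column from each table over JDBC, then union and de-duplicate.
val employees = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "employee")
  .option("user", "username")
  .option("password", "password")
  .load()

val students = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "student")
  .option("user", "username")
  .option("password", "password")
  .load()

// Distinct locations from both tables, suitable for populating a dropdown.
val locations = employees.select("location")
  .union(students.select("location"))
  .dropDuplicates()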

Running custom Apache Phoenix SQL query in PySpark

Could someone provide an example, using pyspark, of how to run a custom Apache Phoenix SQL query and store the result of that query in an RDD or DataFrame? Note: I am looking for a custom query, not an entire table to be read into an RDD.
From Phoenix Documentation, to load an entire table I can use this:
table = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "<TABLENAME>") \
    .option("zkUrl", "<hostname>:<port>") \
    .load()
I want to know what the corresponding equivalent is for a custom SQL query, something like:
sqlResult = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("sql", "select * from <TABLENAME> where <CONDITION>") \
    .option("zkUrl", "<HOSTNAME>:<PORT>") \
    .load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc') \
    .options(driver="org.apache.phoenix.jdbc.PhoenixDriver",
             url='jdbc:phoenix:<HOSTNAME>:<PORT>',
             dbtable=sql) \
    .load()
df.show()
However, it should be noted that if there are column aliases in the SQL statement, the .show() call will throw an exception (it will work if you use .select() to pick the columns that are not aliased); this is a possible bug in Phoenix.
Here you need to use .sql to work with custom queries. Here is the syntax:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
With Spark 2, I didn't have a problem with the .show() function, and I did not need to use .select() to print all the values of a DataFrame coming from Phoenix.
Just make sure your SQL query is inside parentheses; see my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
  .option("useUnicode", "true")
  .option("continueBatchOnError", "true")
  .option("dbtable", sql)
  .load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID| LAST_NAME|STATUS|
+---------+--------------------+------+
| 1005| PerDiem|Active|
| 1008|NAMEEEEEEEEEEEEEE...|Active|
| 1009| Admission|Active|
| 1010| Facility|Active|
| 1011| MeUP|Active|
+---------+--------------------+------+
