Query table names - apache-spark

I have been given access to a database. I am querying the data from a spark cluster. How do I check all the databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
    "user" : jdbcUsername,
    "password" : jdbcPassword,
    "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.

In SQL Server itself:
select *
from sys.tables
I'm not sure whether or not you use a synonym as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with it. There are different ways of writing this: you can go down the table approach or the own-SQL-query approach. The above is the table approach; here is an example of the SQL query approach:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
I just realized this is the Scala version.
I need pyspark specifically.
See: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It is just the same approach; the example below, in this case for Postgres, comes from that guide:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
JDBC loading and saving can be achieved via either the load/save or jdbc methods, see the guide.
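For the pyspark case asked about here, a minimal sketch that puts the pieces together: it lists the tables (and databases) visible to your login by reading SQL Server's sys.tables and sys.databases catalog views through the same JDBC connection, reusing the jdbcUrl and connectionProperties defined in the question. Whether rows come back depends on the metadata visibility your login has been granted.

# List all user tables the login can see, via the sys.tables catalog view
tables_df = spark.read.jdbc(
    url=jdbcUrl,
    table="(SELECT s.name AS schema_name, t.name AS table_name FROM sys.tables t JOIN sys.schemas s ON t.schema_id = s.schema_id) tbls",
    properties=connectionProperties
)
tables_df.show(truncate=False)

# Databases visible to the login can be listed the same way from sys.databases
dbs_df = spark.read.jdbc(
    url=jdbcUrl,
    table="(SELECT name FROM sys.databases) dbs",
    properties=connectionProperties
)
dbs_df.show(truncate=False)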

Related

Spark: connect two database tables to produce a third dataset

DataFrameLoadedFromLeftDatabase = the data loaded using DataFrameReader from the first database, say LeftDB.
I need to
iterate through each row in this dataframe,
connect to a second database say RightDB,
find some matching record from RightDB,
and do some business logic
This is an iterative operation, so it is not simply a matter of doing a JOIN between LeftDB and RightDB to derive some new fields, creating a new DataFrame targetDF, and writing it into a third database, say ThirdDB, using DataFrameWriter.
I know that I can use
val targetDF = DataFrameLoadedFromLeftDatabase.mapPartitions(
  partition => {
    val rightDBconnection = new DbConnection // establish a connection to RightDB
    val result = partition.map(record => {
      readMatchingFromRightDBandDoBusinessLogicTransformationAndReturnAList(record, rightDBconnection)
    }).toList
    rightDBconnection.close()
    result.iterator
  }
).toDF()

targetDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "table3")
  .option("user", "username")
  .option("password", "password")
  .save()
I am wondering whether Apache Spark is suitable for this type of chatty data processing application.
I am also wondering whether iterating through each record of RightDB will be too chatty in this approach.
I am looking forward to advice on improving this design to make use of Spark's capabilities. I also want to make sure the processing does not cause too many shuffle operations, for performance reasons.
Ref: Related SO Post
In this kind of situation we always prefer spark.sql. Basically, define two different DataFrames, join them based on a query, and then apply your business logic afterwards.
For example:
import org.apache.spark.sql.{DataFrame, SparkSession}

// Add your columns here
case class MyResult(ID: String, NAME: String)

// Create a SparkSession
val spark = SparkSession.builder()
  .appName("Join Tables and Add Prefix to ID Column")
  .config("spark.master", "local[*]")
  .getOrCreate()

import spark.implicits._ // needed for .as[MyResult] and the typed map below

// Read the first table from DB1
val firstTable: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB1")
  .option("dbtable", "FIRST_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .load()
firstTable.createOrReplaceTempView("firstTable")

// Read the second table from DB2
val secondTable: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB2")
  .option("dbtable", "SECOND_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .load()
secondTable.createOrReplaceTempView("secondTable")

// Apply your filtering here
val result: DataFrame = spark.sql("SELECT f.*, s.* FROM firstTable AS f LEFT JOIN secondTable AS s ON f.ID = s.ID")

val finalData = result.as[MyResult]
  .map { record =>
    // Do your business logic
    businessLogic(record)
  }

// Write the result to the third table in DB3
finalData.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB3")
  .option("dbtable", "THIRD_TABLE")
  .option("user", "your_username")
  .option("password", "your_password")
  .save()
If your tables are big, you can execute a query and read its result directly. Doing so lets you reduce your input size by filtering on dates etc.:
val myQuery = """
(select * from table
where // do your filetering here
) foo
"""
val df = sqlContext.format("jdbc").
option("url", "jdbc:postgresql://localhost/DB").
.option("user", "your_username")
.option("password", "your_password")
.option("dbtable", myQuery)
.load()
Other than this, it is hard to do record-specific operations directly via Spark. You have to maintain your client connections etc. as custom logic. Spark is designed to read/write huge amounts of data; it creates pipelines for this purpose, and simple operations are an overhead for it. Always do your API calls (or single DB calls) in your map functions; if you use a cache layer in there, it can be life-saving in terms of performance. Always try to use a connection pool in your custom database calls, otherwise Spark will try to execute all of the mapping operations with different connections, which may create pressure on your database and cause connection failures.
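To make the per-partition connection advice concrete, here is a minimal pyspark sketch of the pattern the question's mapPartitions code was aiming for: one connection is opened per partition and reused for every record in that partition. The psycopg2 client, the right_table table, and the ID/NAME/id columns are illustrative assumptions, not part of the answer above; in a real job you would swap the bare connection for a connection pool as recommended.

import psycopg2  # assumed Postgres client; any DB driver with a similar API would do

def enrich_partition(rows):
    # One connection per partition, reused for every record in this partition
    conn = psycopg2.connect(host="rightdb-host", dbname="RightDB",
                            user="your_username", password="your_password")
    cur = conn.cursor()
    try:
        for row in rows:
            # Hypothetical lookup against RightDB for the current record
            cur.execute("SELECT status FROM right_table WHERE id = %s", (row["ID"],))
            match = cur.fetchone()
            # Hypothetical business logic: attach the matched status (or None)
            yield (row["ID"], row["NAME"], match[0] if match else None)
    finally:
        cur.close()
        conn.close()

# DataFrameLoadedFromLeftDatabase is the DataFrame read from LeftDB
target_df = (DataFrameLoadedFromLeftDatabase
             .rdd
             .mapPartitions(enrich_partition)
             .toDF(["ID", "NAME", "STATUS"]))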
I can think of a lot of improvements, but in general all of them are going to depend on having the data pre-distributed in HDFS, HBase, a Hive database, MongoDB, ...
I mean: you are thinking with a "relational data with distributed processing" mindset ... I thought we were already beyond that XD

In Spark, when reading from SQL Server, how do you add binary parameters?

I want to use Spark to read data from a table in Azure SQL. However, I don't want the entire table, so I have used the "query" option so I can filter down to what is needed. However, I can't find a way to pass a binary (SQL Server's RowVersion) parameter into the query. How can this be done?
df = spark.read.format("jdbc") \
.option("url", "jdbc:sqlserver://serverName.database.windows.net;databaseName=databaseName") \
.option("query", "SELECT * FROM dbo.tableName WHERE RowVersion > ?") \
.option("accesstoken", access_token) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
I reproduced the same thing in my environment and created a sample table named states with a rowversion column named RV in Azure SQL.
Now I can read the rowversion data in Azure Databricks. Please follow the code below:
Hostname = "<server_name>.database.windows.net"
Database = "<data_base_name"
Port = "1433"
username = "username"
password = "pass"
Url = "jdbc:sqlserver://{0}:{1};database={2}".format(Hostname,Port,Database)
connProp = {
"user" : username,
"password" : password,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
query = "(Select rv from states ) states"
df = spark.read.jdbc(url=Url, table=query, properties=connProp)
display(df)
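Note that the answer above reads the whole rowversion column rather than binding the ? placeholder; Spark's JDBC "query"/"dbtable" options do not support JDBC parameter binding. One workaround, sketched below under the assumption that you track the last-seen rowversion yourself (the last_rv bytes and dbo.tableName are placeholders), is to inline the value into the query text as a SQL Server binary literal (0x...):

# last_rv: the last rowversion you processed, as raw bytes (placeholder value here)
last_rv = bytes.fromhex("00000000000007d0")

# SQL Server accepts binary literals in 0x... form, so the rowversion can be
# inlined into the query text instead of being bound as a JDBC parameter
rv_literal = "0x" + last_rv.hex().upper()
query = "SELECT * FROM dbo.tableName WHERE RowVersion > {0}".format(rv_literal)

df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://serverName.database.windows.net;databaseName=databaseName") \
    .option("query", query) \
    .option("accesstoken", access_token) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()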

Running custom Apache Phoenix SQL query in PySpark

Could someone provide an example, using pyspark, of how to run a custom Apache Phoenix SQL query and store the result of that query in an RDD or DF? Note: I am looking for a custom query, not an entire table to be read into an RDD.
From Phoenix Documentation, to load an entire table I can use this:
table = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "<TABLENAME>") \
    .option("zkUrl", "<hostname>:<port>") \
    .load()
I want to know what the corresponding equivalent for a custom SQL query would be, something like:
sqlResult = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("sql", "select * from <TABLENAME> where <CONDITION>") \
    .option("zkUrl", "<HOSTNAME>:<PORT>") \
    .load()
Thanks.
This can be done using Phoenix as a JDBC data source as given below:
sql = '(select COL1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE'
df = sqlContext.read.format('jdbc')\
.options(driver="org.apache.phoenix.jdbc.PhoenixDriver", url='jdbc:phoenix:<HOSTNAME>:<PORT>', dbtable=sql).load()
df.show()
However, it should be noted that if there are column aliases in the SQL statement, then .show() will throw an exception (it will work if you use .select() to pick the columns that are not aliased); this is possibly a bug in Phoenix.
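As a hypothetical illustration of that workaround: if the query were '(select COL1 as C1, COL2 from TABLE where COL3 = 5) as TEMP_TABLE', then df.show() might fail, but selecting only a non-aliased column still works:

# Work around the alias issue by selecting only columns that were not aliased
df.select("COL2").show()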
Here you need to use .sql to work with custom queries. Here is the syntax:
dataframe = sqlContext.sql("select * from <table> where <condition>")
dataframe.show()
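A caveat to this .sql approach: Spark SQL can only query tables it already knows about, so the Phoenix table first has to be registered, for example as a temp view loaded via the phoenix-spark connector shown earlier. A minimal sketch, with <TABLENAME>, <hostname>:<port> and the view name as placeholders:

# Load the Phoenix table with the connector, then register it so Spark SQL can see it
table = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "<TABLENAME>") \
    .option("zkUrl", "<hostname>:<port>") \
    .load()
table.createOrReplaceTempView("my_phoenix_table")  # Spark 2+; use registerTempTable on 1.x

# The custom query then runs in Spark SQL over the registered view
dataframe = sqlContext.sql("select * from my_phoenix_table where <condition>")
dataframe.show()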
With Spark 2 I didn't have a problem with the .show() function, and I did not use .select() to print all the values of the DataFrame coming from Phoenix.
So make sure that your SQL query is inside parentheses; look at my example:
val sql = " (SELECT P.PERSON_ID as PERSON_ID, P.LAST_NAME as LAST_NAME, C.STATUS as STATUS FROM PERSON P INNER JOIN CLIENT C ON C.CLIENT_ID = P.PERSON_ID) "
val dft = dfPerson.sparkSession.read.format("jdbc")
.option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
.option("url", "jdbc:phoenix:<HOSTNAME>:<PORT>")
.option("useUnicode", "true")
.option("continueBatchOnError", "true")
.option("dbtable", sql)
.load()
dft.show();
It shows me:
+---------+--------------------+------+
|PERSON_ID| LAST_NAME|STATUS|
+---------+--------------------+------+
| 1005| PerDiem|Active|
| 1008|NAMEEEEEEEEEEEEEE...|Active|
| 1009| Admission|Active|
| 1010| Facility|Active|
| 1011| MeUP|Active|
+---------+--------------------+------+

spark sql load all data into memory without using where clause

So I have a really huge table containing billions of rows. I tried the Spark DataFrame API to load the data; here is my code:
sql = "select * from mytable where day = 2016-11-25 and hour = 10"
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load(sql)
df.show()
I ran the SQL in MySQL and it returns about 100 rows, but the above did not work in Spark SQL; it hit an OOM error. It seems that Spark SQL loads all the data into memory without using the where clause. How can I make Spark SQL use the where clause?
I have solved the problem. The Spark doc gives the answer:
spark doc
So the key is to change the "dbtable" option and make your SQL a subquery. The correct version is:
# 1. write your query sql as a subquery
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) t1"
# 2. pass the subquery through the "dbtable" option and call .load() with no argument
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", sql) \
    .load()
df.show()
sql = "(select * from mytable where day = 2016-11-25 and hour = 10) as t1"
df = sqlContext.read
.format("jdbc")
.option("driver", driver)
.option("url", url)
.option("user", user)
.option("password", password)
.option("dbtable", sql)
.load(sql)
df.show()
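As a side note, Spark's JDBC source can also push simple filters down to the database, so an alternative sketch is to load the table via the dbtable option and filter with .where(); for straightforward comparisons like these, Spark adds the filter to the WHERE clause of the query it sends, so only the matching rows are fetched (the column names below are the ones from the question):

# Load the table reference (nothing is read yet - Spark is lazy) and filter;
# simple filters like these are pushed down into the generated JDBC query
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "mytable") \
    .load() \
    .where("day = '2016-11-25' and hour = 10")
df.show()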

Spark SQL - load data with JDBC using SQL statement, not table name

I think I am missing something but can't figure out what.
I want to load data using SQLContext and JDBC, using a particular SQL statement,
like
select top 1000 text from table1 with (nolock)
where threadid in (
select distinct id from table2 with (nolock)
where flag=2 and date >= '1/1/2015' and userid in (1, 2, 3)
)
Which method of SQLContext should I use? The examples I saw always specify a table name and lower and upper bounds.
Thanks in advance.
You should pass a valid subquery as a dbtable argument. For example in Scala:
val query = """(SELECT TOP 1000
-- and the rest of your query
-- ...
) AS tmp -- alias is mandatory*"""
val url: String = ???
val jdbcDF = sqlContext.read.format("jdbc")
.options(Map("url" -> url, "dbtable" -> query))
.load()
* Hive Language Manual SubQueries: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
val url = "jdbc:postgresql://localhost/scala_db?user=scala_user"
Class.forName(driver)
val connection = DriverManager.getConnection(url)
val df2 = spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", "(select id,last_name from emps) e")
.option("user", "scala_user")
.load()
The key is "(select id,last_name from emps) e": here you can write a subquery in place of the table name.
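For completeness: on Spark 2.4+ the JDBC source also has a query option, so in pyspark the SQL statement can be passed directly instead of being wrapped in an aliased subquery (the URL and user below are the placeholders from the example above):

# Spark 2.4+: the "query" option takes the SQL text directly; no "(...) alias" wrapper is needed
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost/scala_db") \
    .option("query", "select id, last_name from emps") \
    .option("user", "scala_user") \
    .load()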

Resources