How do I connect to Teradata from pyspark? - apache-spark

I am trying to connect to Teradata from PySpark. Could you please tell me how I can do that? I tried looking online but couldn't find anything.
I have the following JARs:
tdgssconfig-15.10.00.14.jar
teradata-connector-1.4.1.jar

You need the JDBC driver JAR (get the one for the Teradata version that you are connecting to), and then use something like this:
sc.addJar("yourDriver.jar")
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:teradata://<server_name>, TMODE=TERA, user=my_user, password=*****",
  "dbtable" -> "schema.table_name",
  "driver" -> "com.teradata.jdbc.TeraDriver"))

Step 1: Find the appropriate JDBC driver for the version of Teradata you are using:
https://downloads.teradata.com/download/connectivity/jdbc-driver
Step 2: Go through the JDBC data source guide here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Sample code:
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "<jdbc_connection_string>")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
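To make the driver visible to PySpark in the first place, the JARs from Step 1 also have to be passed to Spark. A minimal sketch; the file paths are placeholders for whatever your Step 1 download actually contains:

from pyspark.sql import SparkSession

# Hypothetical JAR paths; point these at the driver JARs downloaded in Step 1.
# They can equally be passed on the command line via pyspark --jars or
# spark-submit --jars.
spark = (SparkSession.builder
    .appName("teradata-example")
    .config("spark.jars", "/path/to/terajdbc.jar,/path/to/tdgssconfig.jar")
    .getOrCreate())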

Related

ClickHouse housepower driver with spark

I'm new to Stack and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7); I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the Yandex driver this is impossible, because JDBC doesn't support Arrays, right? ;)
So I wanted to run the Spark session with the housepower JAR clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added Array handling - correct me if I'm wrong.
And I want to run a query against ClickHouse via JDBC.
spark = SparkSession\
    .builder\
    .enableHiveSupport()\
    .appName('custom-events-test')\
    .config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar")\
    .getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
    .format("jdbc") \
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
    .option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
    .option("query", query) \
    .option("user", ch_user) \
    .option("password", ch_pass) \
    .load()
But there I get an error:
[housepower_error: error screenshot from the original post]
BUT when I run the above query with the Yandex driver ru.yandex.clickhouse.ClickHouseDriver, everything works fine (even with the housepower JAR attached).
The error also appears when I try to select a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
And how do I insert a DataFrame with an Array column, via Spark JDBC, into a ClickHouse table that also has an Array(String) column? I was looking everywhere but couldn't find a solution...
Thank you in advance!
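No answer was posted here, but for the write direction the generic Spark JDBC write path would look roughly like the sketch below. Whether the housepower driver really maps a Spark ArrayType column to ClickHouse Array(String) is exactly the open question, so the target table name and the driver behaviour are assumptions, not confirmed facts:

# Hypothetical sketch of the generic Spark JDBC write path; "audience.target_table"
# is a made-up destination, and ch_host/ch_db/ch_user/ch_pass are assumed to be
# defined as in the read example above.
(df.write
    .format("jdbc")
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver")
    .option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}")
    .option("dbtable", "audience.target_table")
    .option("user", ch_user)
    .option("password", ch_pass)
    .mode("append")
    .save())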

Error reading data from DB2 table in Spark

I'm trying to read data from a DB2 table using Spark JDBC, but I am getting the error below.
Exception in thread "main" com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=*;SELECT * FROM select;UNION, DRIVER=4.19.26
Code:
val df = spark.sqlContext.read.format("jdbc")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("url", "jdbc:db2://server/database")
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "select * from schema.table limit 10")
  .load()
Any help is appreciated. Thank you.
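No answer is shown here, but the error message itself points at the cause: Spark's JDBC source interpolates whatever you pass as dbtable after a FROM clause (hence the SELECT * FROM select fragment in SQLERRMC), so dbtable must be a table name or a parenthesized subquery with an alias, not a bare SELECT. A hedged PySpark sketch of the two usual workarounds, reusing the connection options from the question; the FETCH FIRST 10 ROWS ONLY clause is an assumption about the DB2 dialect, used in place of LIMIT:

# Option 1: wrap the statement in a parenthesized subquery with an alias.
df = (spark.read.format("jdbc")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("url", "jdbc:db2://server/database")
    .option("user", "user")
    .option("password", "password")
    .option("dbtable", "(select * from schema.table fetch first 10 rows only) as tmp")
    .load())

# Option 2 (Spark 2.4+): pass the statement through the "query" option instead.
df = (spark.read.format("jdbc")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("url", "jdbc:db2://server/database")
    .option("user", "user")
    .option("password", "password")
    .option("query", "select * from schema.table fetch first 10 rows only")
    .load())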

Query table names

I have been given access to a database. I am querying the data from a Spark cluster. How do I check which databases/tables I have access to?
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user='{3}';password='{4}'".format(jdbcHostname, jdbcPort, jdbcDatabase, jdbcUsername, jdbcPassword)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=jdbcUrl, properties=connectionProperties)
Access to the database has been authenticated.
In SQL Server itself:
select *
from sys.tables
I am not sure whether you use a synonym as the way into the sys schema.
val tables = spark.read.jdbc(jdbc_url, "sys.tables", connectionProperties)
tables.select(...
If you have a synonym, replace sys.tables with it. There are different ways of writing this: the tables approach above, or your own SQL query. Here is an example of the SQL query approach:
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample order by k DESC) e", connectionProperties)
This is the Scala version, I just realized.
For PySpark specifically, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It is just the same approach; the example there happens to use Postgres:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
JDBC loading and saving can be achieved via either the load/save or jdbc methods, see the guide.
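For the SQL Server case in the question, the same pattern can be pointed directly at the catalog views. A minimal PySpark sketch, assuming jdbcUrl and connectionProperties are defined exactly as in the question:

# List the tables and databases visible to this login via SQL Server's catalog views.
tables = spark.read.jdbc(url=jdbcUrl, table="sys.tables",
                         properties=connectionProperties)
tables.select("name").show()

databases = spark.read.jdbc(url=jdbcUrl, table="sys.databases",
                            properties=connectionProperties)
databases.select("name").show()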

Spark reading table after connection to HiverServer2 only gives schema not data

I am trying to connect to a remote Hive cluster, and with the following code I get the table data as expected:
val spark = SparkSession
  .builder()
  .appName("adhocattempts")
  .config("hive.metastore.uris", "thrift://<remote-host>:9083")
  .enableHiveSupport()
  .getOrCreate()
val seqdf = spark.sql("select * from anon_seq")
seqdf.show()
However, when I try to do this via HiveServer2, I get no data in my DataFrame. The table is backed by a SequenceFile. Is that the issue, since I am actually trying to read it via JDBC?
val sparkJdbc = SparkSession.builder.appName("SparkHiveJob").getOrCreate
val sc = sparkJdbc.sparkContext
val sqlContext = sparkJdbc.sqlContext
val driverName = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driverName)
val df = sparkJdbc.read
  .format("jdbc")
  .option("url", "jdbc:hive2://<remote-host>:10000/default")
  .option("dbtable", "anon_seq")
  .load()
df.show()
Can someone help me understand the purpose of using HiveServer2 with jdbc and relevant drivers in Spark2?

connect to mysql from spark

I am trying to follow the instructions mentioned here...
https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/
and here...
https://www.percona.com/blog/2015/10/07/using-apache-spark-mysql-data-analysis/
I am using the sequenceiq/spark Docker image.
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
cd /usr/local/spark/
./sbin/start-master.sh
./bin/spark-shell --driver-memory 1G --executor-memory 1g --executor-cores 1 --master local
This works as expected:
scala> sc.parallelize(1 to 1000).count()
But this shows an error:
val jdbcDF = spark.read.format("jdbc").options(
Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
"dbtable" -> "ontime.ontime_part",
"fetchSize" -> "10000",
"partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
)).load()
And here is the error:
<console>:25: error: not found: value spark
val jdbcDF = spark.read.format("jdbc").options(
How do I connect to MySQL from within spark shell?
With Spark 2.0.x, you can use DataFrameReader and DataFrameWriter.
Use SparkSession.read to access a DataFrameReader and Dataset.write to access a DataFrameWriter.
Suppose you are using spark-shell.
read example
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
val df=spark.read.jdbc(url,"table_name",prop)
df.show()
read example 2
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:port/db_name")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
(adapted from the Spark documentation)
write example
import org.apache.spark.sql.SaveMode
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
//df is a dataframe contains the data which you want to write.
df.write.mode(SaveMode.Append).jdbc(url,"table_name",prop)
Create the Spark context first.
Make sure you have the JDBC JAR files attached to your classpath.
If you are trying to read data over JDBC, use the DataFrame API instead of RDDs, as DataFrames have better performance.
Here is the syntax for reading from JDBC:
SparkConf conf = new SparkConf()
  .setAppName("app")
  .setMaster("local[2]")
  .set("spark.serializer", prop.getProperty("spark.serializer")); // prop: your own Properties with Spark settings
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlCtx = new SQLContext(sc);
DataFrame df = sqlCtx.read()
  .format("jdbc")
  .option("url", "jdbc:mysql://1.2.3.4:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "dbtable")
  .option("user", "dbuser")
  .option("password", "dbpwd")
  .load();
It looks like spark is not defined. You should use the SQLContext to connect to the driver, like this:
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read.format("jdbc")
  .option("url", "jdbc:mysql://Public_IP:3306/DB_NAME")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "tblage")
  .option("user", "sqluser")
  .option("password", "sqluser")
  .load()
Later you can use sqlcontext wherever you used spark (spark.read etc.).
This is a common problem for those migrating to Spark 2.0.0 from earlier versions. The Spark documentation is not very good. To solve this, you have to define a SparkSession, like this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Spark SQL Example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
This solution is explained in the Spark SQL, DataFrames and Datasets Guide. SparkSession is the new entry point to the DataFrame API; it incorporates both SQLContext and HiveContext and has some additional advantages, so there is no need to define either of those anymore.
Please accept this as the answer, if you find this useful.
