Access temporary spark dataframe in spark.read teradata sql query - apache-spark

Suppose you have a dataframe df, registered as a temporary view in spark.
Is it possible to access the dataframe in a spark.read teradata sql command?
I.e.
df = spark.createDataFrame([(1, 5), (2, 6),], ['id', 'val'])
df.createOrReplaceTempView('df')
td_query = """(
SELECT * FROM EX_SCHEMA.EX_TABLE AS EX_TABLE LEFT JOIN df ON (EX_TABLE.ID = df.ID)
) qry
"""
df_td = spark.read.format('jdbc').option('driver', 'com.teradata.jdbc.TeraDriver')\
.option('url', url).option('user', user_td).option('password', password_td)\
.option('dbtable', td_query).load()
I know that the left join could easily be done in Spark afterwards; this is just an example to illustrate the underlying question.
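As noted, the join itself can be done in Spark. The snippet above cannot work as written because the query passed via the dbtable option is executed by Teradata, which knows nothing about Spark temporary views, so the join effectively has to happen on the Spark side. A minimal sketch of that workaround (reusing the url, user_td and password_td variables and the column names from above) is to pull only the Teradata side through JDBC and join in Spark:
td_query = "(SELECT * FROM EX_SCHEMA.EX_TABLE) qry"
df_td = spark.read.format('jdbc').option('driver', 'com.teradata.jdbc.TeraDriver')\
    .option('url', url).option('user', user_td).option('password', password_td)\
    .option('dbtable', td_query).load()
# The join now happens in Spark, where the DataFrame / temp view is visible
df_joined = df_td.join(df, df_td['ID'] == df['id'], 'left')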

Related

How to join a hive table with a pandas dataframe in pyspark?

I have a hive table db.hive_table and a pandas dataframe df. I wish to use pyspark.SparkSession.builder.enableHiveSupport().getOrCreate().sql to join them. How can I do this?
Convert both to PySpark DataFrames and then join them.
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
# Convert hive table to df - sqlContext is of type HiveContext
df_hive = sqlContext.table(tablename)
Join the two dfs.
joined_df = df_sp.alias('df_sp').join(df_hive.alias('df_hive'), df_sp.id == df_hive.id) \
    .select('df_sp.*', 'df_hive.*')
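Since the question specifically mentions using spark.sql, an alternative sketch (assuming both tables have an id column) is to register both DataFrames as temp views and join them in SQL:
df_sp.createOrReplaceTempView('pandas_df')
df_hive.createOrReplaceTempView('hive_df')
joined_df = spark_session.sql('SELECT * FROM hive_df h JOIN pandas_df p ON h.id = p.id')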

Create table in hive through spark

I am trying to connect to Hive through Spark using the code below, but I am unable to do so. The code fails with NoSuchDatabaseException: Database 'raw' not found. I have a database named 'raw' in Hive. What am I missing here?
val spark = SparkSession
.builder()
.appName("Connecting to hive")
.config("hive.metastore.uris", "thrift://myserver.domain.local:9083")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")
frame.show()
frame.write.mode("overwrite").saveAsTable("raw.temp1")
Output for spark.sql("SHOW DATABASES")

Spark Data frame Join: Non matching Records from first Dataframe

Hi all, I have two DataFrames and I'm applying a join condition on them.
After the join I want all the rows from the first DataFrame whose name, id, code and lastname do not match the second DataFrame. I have written the code below.
val df3=df1.join(df2,df1("name") !== df2("name_2") &&
df1("id") !== df2("id_2") &&
df1("code") !== df2("code_2") &&
df1("lastname") !== df2("lastname_2"),"inner")
.drop(df2("id_2"))
.drop(df2("name_2"))
.drop(df2("code_2"))
.drop(df2("lastname_2"))
Expected result:
DF1
id,name,code,lastname
1,A,001,p1
2,B,002,p2
3,C,003,p3
DF2
id_2,name_2,code_2,lastname_2
1,A,001,p1
2,B,002,p4
4,D,004,p4
DF3
id,name,code,lastname
3,C,003,p3
Can someone please tell me whether this is the correct way to do this, or should I use a SQL query with 'NOT IN'? I am new to Spark and am using the DataFrame methods for the first time,
so I am not sure whether this approach is correct.
I recommend using the Spark API to work with the data:
val df1 =
Seq((1, "20181231"), (2, "20190102"), (3, "20190103"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
val df2 =
Seq((1, "20181231"), (2, "20190102"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
Option 1. You can get all rows that are not included in the other DataFrame:
val df3 = df1.except(df2)
Option 2. You can use specific fields to do an anti join, for example on 'id':
val df3 = df1.as("table1").join(df2.as("table2"), $"table1.id" === $"table2.id", "leftanti")
df3.show()

SparkSession: using the SQL API to query Cassandra

In Python, using SparkSession I can load a Cassandra keyspace and table like:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local") \
.appName("TestApp") \
.getOrCreate()
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testdb", table="test")
df.collect()
How can I use the SQL API instead? Something like:
SELECT * FROM testdb.test
Try registering a temp view in Spark and running queries against it, as in the following snippet:
df.createOrReplaceTempView("my_table")
df2 = spark.sql("SELECT * FROM my_table")
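Putting the two pieces together, a minimal end-to-end sketch (keyspace and table names taken from the question, and assuming the spark-cassandra-connector package is on the classpath) could look like:
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="testdb", table="test") \
    .load()
# Register the Cassandra-backed DataFrame under a name usable in SQL
df.createOrReplaceTempView("test")
result = spark.sql("SELECT * FROM test")
result.show()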

Pyspark: Transforming PythonRDD to Dataframe

Could someone guide me on converting a PythonRDD to a DataFrame?
As per my understanding, reading a file should create a DataFrame, but in my case it has created a PythonRDD. I am finding it hard to convert the PythonRDD to a DataFrame; I could not find createDataFrame() or toDF().
Please find below my code to read a tab-separated text file:
rdd1 = sparkCxt.textFile(setting.REFRESH_HDFS_DIR + "/Refresh")
rdd2 = rdd1.map(lambda row: unicode(row).lower().strip()\
if type(row) == unicode else row)
Now, I would like to convert this PythonRDD to a DataFrame.
I want to convert it to a DataFrame in order to map a schema, so that I can do further processing at the column level.
Also, please suggest if you think there is a better approach.
Please reply if more details are required.
Thank you.
Spark DataFrames can be created directly from a text file, but you should use sqlContext instead of sc (SparkContext), since sqlContext is an entry point for working with DataFrames.
df = sqlContext.read.text('path/to/my/file')
This will create a DataFrame with a single column named value. You can use a UDF (or the built-in split function) to break it into the required columns.
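For instance, a sketch using the built-in split function on a tab-separated file (the column names here are placeholders):
from pyspark.sql.functions import split

df = sqlContext.read.text('path/to/my/file')
parts = split(df['value'], '\t')
df2 = df.select(parts.getItem(0).alias('col1'),
                parts.getItem(1).alias('col2'),
                parts.getItem(2).alias('col3'))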
Another approach would be to read the text files to an RDD, split it into columns using map, reduce, filter and other operations, and then convert the final RDD to a DataFrame.
For example, let's say we have an RDD named my_rdd with the following structure:
[(1, 'Alice', 23), (2, 'Bob', 25)]
We can easily convert it to a DataFrame:
df = sqlContext.createDataFrame(my_rdd, ['id', 'name', 'age'])
where id, name and age are names for our columns.
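Applied to the RDD from the question, a sketch could look like this (the column names are placeholders, since the real schema isn't shown):
# Split each tab-separated line into fields, then build a DataFrame
rdd3 = rdd2.map(lambda line: line.split('\t'))
df = sqlContext.createDataFrame(rdd3, ['col1', 'col2', 'col3'])
df.printSchema()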
You can try using toPandas(), although you should be cautious when doing so: converting an RDD to a pandas DataFrame means bringing all the data into the driver's memory, which might cause an OOM error if your distributed data is large.
I would use the spark-csv package (spark-csv on GitHub) and import directly into a DataFrame after defining the schema.
For example:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
customSchema = StructType([ \
StructField("year", IntegerType(), True), \
StructField("make", StringType(), True), \
StructField("model", StringType(), True), \
StructField("comment", StringType(), True), \
StructField("blank", StringType(), True)])
df = sqlContext.read \
.format('com.databricks.spark.csv') \
.options(header='true') \
.load('cars.csv', schema = customSchema)
This defaults to a comma for the delimiter, but you can change that to a tab with something like:
df = sqlContext.read \
.format('com.databricks.spark.csv') \
.options(header='true', delimiter='\t') \
.load('cars.csv', schema = customSchema)
Note that it is possible to infer the schema using another option, but this does require reading the entire file prior to loading the dataframe.
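For reference, a sketch of that schema-inference variant (using the spark-csv inferSchema option, which triggers the extra pass over the file):
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true', delimiter='\t') \
    .load('cars.csv')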
