SparkSession: using the SQL API to query Cassandra - apache-spark

In Python, using SparkSession I can load a Cassandra keyspace and table like:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("TestApp") \
    .getOrCreate()
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testdb", table="test")
df.collect()
How can I use the SQL API instead? Something like:
SELECT * FROM testdb.test

Try registering the DataFrame as a temporary view in Spark and running SQL queries against it, as in the following snippet:
df.createOrReplaceTempView("my_table")
df2 = spark.sql("SELECT * FROM my_table")
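Alternatively, you can stay entirely in the SQL API by registering the Cassandra table as a temporary view directly from SQL. A minimal sketch, assuming a Spark 2.x session with the spark-cassandra-connector on the classpath and the keyspace/table names from the question:
# Register the Cassandra table as a temporary view straight from SQL, then query it.
spark.sql("""
    CREATE TEMPORARY VIEW test
    USING org.apache.spark.sql.cassandra
    OPTIONS (table "test", keyspace "testdb")
""")
spark.sql("SELECT * FROM test").show()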

Related

Create table in Hive through Spark

I am trying to connect to Hive through Spark using the code below, but I am unable to do so. The code fails with NoSuchDatabaseException: Database 'raw' not found. I do have a database named 'raw' in Hive. What am I missing here?
val spark = SparkSession
  .builder()
  .appName("Connecting to hive")
  .config("hive.metastore.uris", "thrift://myserver.domain.local:9083")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
import spark.sql
val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")
frame.show()
frame.write.mode("overwrite").saveAsTable("raw.temp1")
Output for spark.sql("SHOW DATABASES")
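It may be worth checking whether the session is actually talking to the remote Hive metastore, or has silently fallen back to Spark's built-in catalog (in which case only the default database exists and 'raw' would never be found). A minimal PySpark sketch of that check under the same configuration; the equivalent calls work from spark-shell:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connecting to hive") \
    .config("hive.metastore.uris", "thrift://myserver.domain.local:9083") \
    .enableHiveSupport() \
    .getOrCreate()

# "hive" means Hive support is active; "in-memory" means the session is using
# the built-in catalog and will not see databases from the remote metastore.
print(spark.conf.get("spark.sql.catalogImplementation"))

# List the databases this session can actually see; 'raw' must appear here
# before saveAsTable("raw.temp1") can succeed.
for db in spark.catalog.listDatabases():
    print(db.name)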

Reading Excel (.xlsx) file in pyspark

I am trying to read a .xlsx file from a local path in PySpark.
I've written the below code:
from pyspark.shell import sqlContext
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local') \
    .appName('Planning') \
    .enableHiveSupport() \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()
df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()
Error:
TypeError: 'DataFrameReader' object is not callable
You can use pandas to read the .xlsx file and then convert it to a Spark DataFrame.
from pyspark.sql import SparkSession
import pandas
spark = SparkSession.builder.appName("Test").getOrCreate()
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf)
df.show()
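If the column types that pandas infers are not what you want, you can pass an explicit schema when converting to a Spark DataFrame. A small sketch; the column names and types below are made up for illustration:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pandas

# Hypothetical columns for illustration; replace with the real ones from your sheet.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("quantity", IntegerType(), True),
])
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf, schema=schema)
df.show()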
You could use the crealytics spark-excel package.
You need to add it to Spark, either via its Maven coordinates or when starting the Spark shell, as below.
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
For Databricks users: add it as a library by navigating to Cluster -> 'clusterName' -> Libraries -> Install New and providing 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Sheet1'!") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(r"C:\P_DATA\tyco_93_A.xlsx")
More options are available on the GitHub page below:
https://github.com/crealytics/spark-excel
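If you are on plain PySpark rather than Databricks, the same package can be pulled in when starting the PySpark shell, mirroring the spark-shell command above:
$SPARK_HOME/bin/pyspark --packages com.crealytics:spark-excel_2.12:0.13.1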

Pyspark And Cassandra - Extracting Data Into RDD as Fields from Map Field

I have a Cassandra table with a map field whose data looks as follows:
test_id test_map
1 {tran_id=99, tran_type=sample}
I am attempting to add the map's values to the existing RDD I am pulling this data into, as new fields on the exact same key, so that it looks as follows:
test_id test_map tran_id tran_type
1 {tran_id=99, tran_type=sample} 99 sample
I'm able to pull the fields fine using the Spark context, but I can't find a good way to transform the map field into new columns as shown above.
Sample Code:
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx pyspark-shell'
sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)
def test_df(keys_space_name, table_name):
    table_df = sqlContext.read \
        .format("org.apache.spark.sql.cassandra") \
        .options(table=table_name, keyspace=keys_space_name) \
        .load()
    return table_df
df_test = test_df("test", "test")
Then, to query the data, I use Spark SQL in this format:
df_test.registerTempTable("dftest")
df = sqlContext.sql(
    """
    select * from dftest
    """)

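One way to get the map entries out as columns, staying in the DataFrame API instead of dropping to an RDD, is to pull individual keys with getItem. A minimal sketch against the df_test defined above, assuming test_map is a MapType column with string keys:
from pyspark.sql.functions import col

# Pull individual keys out of the map column into their own columns.
df_flat = df_test \
    .withColumn("tran_id", col("test_map").getItem("tran_id")) \
    .withColumn("tran_type", col("test_map").getItem("tran_type"))
df_flat.show()
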
Spark 2.0 - Databricks xml reader Input path does not exist

I am trying to use the Databricks XML file reader API.
Sample code:
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///C:/TestData")
  .getOrCreate();
//val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")
df.show()
If I give the file path directly, it looks for a warehouse directory, so I set the spark.sql.warehouse.dir option, but now it throws 'Input path does not exist'.
It is actually looking under the project root directory. Why is it looking in the project root directory?
Finally it's working. We need to specify the warehouse directory as well as pass the absolute file path in the load method. I am not sure what the warehouse directory is used for.
The main point is that we don't need to give C:, as mentioned in the other Stack Overflow answer.
working code:
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///TestData/")
  .getOrCreate();
//val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("file:///TestData/books.xml")
df.show()
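The 'Input path does not exist' behaviour is most likely just relative-path resolution: a bare path like books.xml is resolved against the process working directory, which is the project root when running from an IDE, and the warehouse directory has nothing to do with reads. A small PySpark sketch of one way to make the load independent of the working directory (the same idea applies in Scala):
import os

# Build an absolute file:/// URI so the load does not depend on where the
# application happens to be launched from.
xml_path = "file:///" + os.path.abspath("books.xml").replace("\\", "/")

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "book") \
    .load(xml_path)
df.show()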

Why is there a 10x difference in Spark Python UDF execution time between partition strategies?

I see a huge (10x~100x) execution time difference between two jobs whose only difference is the partition strategy, and I want to know why :)
Observation:
Strategy 1, repartitioning by a partition number (which equalizes the records per partition), runs 10~100x slower than strategy 2, repartitioning by the column phone_country_code.
From the Spark history UI, the only other difference is that strategy 1 has a slightly larger (10~20%) shuffle read size.
My environment:
Spark 1.6.1 on EMR 4.7
Python 2.7
submit job using pyspark
Spark Job:
python udf to parse phone number for time zone info
read data from redshift via spark-redshift and write back
code sample:
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import DateType, TimestampType, StringType
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, udf
conf = SparkConf().setAppName("extract_local_time")
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)
sc.addPyFile("s3://xxx/xxx.zip")
def local_time(phone_number, datetime_org):
    from util import phonenumber_util
    local_time = phonenumber_util.convert_to_local_datetime_by_phone_number(
        phone_number,
        datetime_org)
    return local_time.replace(tzinfo=None)
local_time_func = udf(local_time, TimestampType())
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://xxx") \
    .option("query", "select * from xxx") \
    .option("tempdir", "s3n://xxx") \
    .load()
# df = df.repartition(12*10) # partition strategy 1
df = df.repartition('phone_country_code') # partition strategy 2
df2 = df.withColumn("datetime_local", local_time_func(col("phone_number"), col("datetime")))
df2.registerTempTable("xxx")
sql_context.sql("SELECT * FROM xxx") \
    .write.format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://xxx") \
    .option("tempdir", "s3n://xxx") \
    .option("dbtable", "xxx") \
    .mode("overwrite") \
    .save()
data sample:
phone_number, phone_country_code
55-82981399971, 55
1-7073492922, 1
90-5395889859, 90
My guess:
Some JVM-to-Python-level optimization of the UDF that depends on how records are distributed across partitions?
Thanks for any further suggestions :)
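Since the two strategies differ only in how records end up spread across partitions, it may help to measure that distribution directly before guessing further. A minimal sketch, using the Spark 1.6-style API from the question, that counts records per partition after each repartition call:
# Count records per partition without collecting the data itself; heavy skew
# under one strategy shows up here as a few very large counts.
def partition_sizes(dataframe):
    return dataframe.rdd \
        .mapPartitions(lambda rows: [sum(1 for _ in rows)]) \
        .collect()

print(partition_sizes(df.repartition(12 * 10)))               # strategy 1
print(partition_sizes(df.repartition('phone_country_code')))  # strategy 2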
