Use WHERE or FILTER when creating TempView - Cassandra

Is it possible to use where or filter when creating a Spark SQL TempView?
I have a Cassandra table words with
word | count
------------
apples | 20
banana | 10
I tried
%spark
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options( Map ("keyspace"-> "temp", "table"->"words" ))
.where($"count" > 10)
.load()
.createOrReplaceTempView("high_counted")
or
%spark
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options( Map ("keyspace"-> "temp", "table"->"words" ))
.where("count > 10")
.load()
.createOrReplaceTempView("high_counted")

You cannot apply a WHERE or FILTER until you .load() the table, as @undefined_variable suggested.
Try:
%spark
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options( Map ("keyspace"-> "temp", "table"->"words" ))
.load()
.where($"count" > 10)
.createOrReplaceTempView("high_counted")
Alternatively, you can do a free form query as documented here.
Spark evaluates statements lazily, and the statement above is a transformation, so the filter becomes part of the query plan rather than requiring the whole table to be loaded first (in case you are thinking we need to filter before we load).
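Below is a minimal sketch of the working order, assuming the same Zeppelin/Cassandra setup as in the question; explain() can be used to confirm that the filter is part of the query plan (and, where the connector supports it, pushed down) before any action materializes data:
%spark
val highCounted = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "temp", "table" -> "words"))
  .load()
  .where($"count" > 10)

// Inspect the plan: the Filter (or PushedFilters) entry shows the predicate.
highCounted.explain()

highCounted.createOrReplaceTempView("high_counted")
sqlContext.sql("SELECT * FROM high_counted").show()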

Related

Why does Spark SQL require double percent symbols for like conditions?

When I am using a like condition in Spark SQL, it seems that it requires the use of 2 percent symbols %%.
However, I could not find any documentation on this in the Spark SQL docs. I am curious as to why my set-up might be causing this requirement.
https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-like.html
Example data
product_table
id       | product_type               | region | location | measurement
---------|----------------------------|--------|----------|------------
43635665 | ORANGE - Blood Orange      | EU     | FRA      | 30.5
78960788 | APPLE GrannySmith          | NA     | USA      | 16.0
12312343 | APPLE [Organic Washington] | NA     | CAN      | 7.1
67867634 | ORANGE, NavelOrange        | NA     | MEX      | 88.4
import pyspark
from pyspark.sql import functions as F

APP_NAME = "Product: Fruit Template"
SPARK_CONF = [
    ("spark.dynamicAllocation.maxExecutors", "5"),
    ("spark.executor.memory", "10g"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memoryOverhead", "2000"),
]

spark_conf = pyspark.SparkConf()
spark_conf.setAppName(APP_NAME)
spark_conf.setAll(SPARK_CONF)
sc = pyspark.SparkContext(conf=spark_conf)
spark = pyspark.sql.SparkSession(sc)

def sql(query):
    return spark.sql(query)

df = sql("""
    SELECT *
    FROM product_table
""")
this returns data
df.filter(F.col("product_type").like("ORANGE%%")).show()
whereas this returns an empty dataframe
df.filter(F.col("product_type").like("ORANGE%")).show()
It may be worth noting that the same issue happens when the LIKE condition is used in the SQL string.
this returns data
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%%'
""")
df_new.show()
whereas this returns an empty dataframe
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%'
""")
df_new.show()
I am using PySpark version 2.3.2.
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

conf = (SparkConf()
        .set("spark.executor.instances", "24")
        .set("spark.executor.cores", "5")
        .set("spark.executor.memory", "33g")
        .set("spark.driver.memory", "55g")
        .set("spark.driver.maxResultSize", "10g")
        .set("spark.sql.catalogImplementation", "hive")
        .set("mapreduce.fileoutputcommitter.algorithm.version", "2")
        )
spark = (
    SparkSession.builder.appName("default")
    .enableHiveSupport()
    .config(conf=conf)
    .getOrCreate()
)

df = spark.createDataFrame(
    [('43635665', 'ORANGE - Blood Orange'),
     ('78960788', 'APPLE GrannySmith'),
     ('12312343', 'APPLE [Organic Washington]'),
     ('67867634', 'ORANGE, NavelOrange')],
    ['id', 'product_type'])
df.createOrReplaceTempView("product_table")

def sql(query):
    print(query)
    return spark.sql(query)

df2 = sql("""
    SELECT *
    FROM product_table
""")
df2.filter(F.col("product_type").like("ORANGE%")).show(truncate=False)

How to extract urls from HYPERLINKS in excel file when reading into scala spark dataframe

I have an Excel file with Column A containing HYPERLINKS like this:
=HYPERLINK("https://google.com","View Link")
I can load the Excel file into a Scala Spark DataFrame using the com.crealytics.spark.excel library, but it only loads the 'View Link' text, which does NOT contain the URL.
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object Tut {
  def main(args: Array[String]): Unit = {
    println("started")
    val spark = SparkSession
      .builder()
      .appName("MySpark")
      .config("spark.master", "local")
      .getOrCreate()

    val customSchema = StructType(Array(
      StructField("A", StringType, nullable = false),
      StructField("B", IntegerType, nullable = false)))

    val df = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true").schema(customSchema)
      .option("dataAddress", "A1")
      .load("/MY_PATH/src/main/resources/SampFile.xlsx")

    df.printSchema()
    df.show()
  }
}
My goal is to load the entire content of the HYPERLINK as a string:
=HYPERLINK("https://google.com","View Link")
and then extract the url
https://google.com.
Do you know if there is a way to do this using com.crealytics.spark.excel library or any other spark library? Thanks in advance!
About the other question linked in the comments: there they try to read the column as BinaryType and cast it directly to StringType. That is not possible out of the box (even in plain Scala), because you need to know how to interpret the bytes to produce a human-readable string, for instance which encoding was used.
So we need a custom approach. I used a sample in-code DataFrame, and this approach worked:
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("ddd".getBytes, 1)
| ).toDF("A", "B")
df: org.apache.spark.sql.DataFrame = [A: binary, B: int]
scala> val btos: Array[Byte] => String = bytes => new String(bytes) // short for bytes to string
btos: Array[Byte] => String = $Lambda$2322/665683021@738f6e44
scala> spark.udf.register("btos", btos)
res0: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2322/665683021@738f6e44,StringType,List(Some(class[value[0]: binary])),Some(btos),true,true)
scala> import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.functions.expr
scala> df.withColumn("C", expr("btos(A)")).show
+----------+---+---+
| A| B| C|
+----------+---+---+
|[64 64 64]| 1|ddd|
+----------+---+---+
Hope this works for you.
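As a follow-up on the "and then extract the url" part of the question (a sketch, not from the original answer): once the full =HYPERLINK(...) formula is available as a string column, the URL can be pulled out with a regular expression. The DataFrame dfWithFormula and its formula column below are hypothetical names standing in for however the raw formula text ends up being loaded:
import spark.implicits._
import org.apache.spark.sql.functions.{col, regexp_extract}

// Hypothetical sample: the raw formula text already read into a string column.
val dfWithFormula = Seq(
  """=HYPERLINK("https://google.com","View Link")"""
).toDF("formula")

val withUrl = dfWithFormula.withColumn(
  "url",
  regexp_extract(col("formula"), """HYPERLINK\("([^"]+)""", 1)  // capture the quoted URL
)
withUrl.select("url").show(false)  // https://google.com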

Structured streaming debugging input

Is there a way for me to print out the incoming data? For example, I have a readStream on a folder looking for JSON files, but there seems to be an issue, as I am seeing nulls in the aggregation output.
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("id", LongType, false) ::
  StructField("sid", IntegerType, true) ::
  StructField("data", ArrayType(IntegerType, false), true) :: Nil)

val lines = spark.
  readStream.
  schema(schema).
  json("in/*.json")

val top1 = lines.groupBy("id").count()

val query = top1.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()
To inspect the data you can add a queryName to the write stream and use the memory sink; the streaming results are then kept in an in-memory table under that name, which you can query.
In your example:
val query = top1.writeStream
  .outputMode("complete")
  .queryName("xyz")
  .format("memory")
  .start()
Run this and you can display the data with a SQL query
%sql select * from xyz
or you can create a DataFrame
val df = spark.sql("select * from xyz")
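Separately, as a quick sanity check (a sketch, not part of the answer above): since nulls are showing up in the aggregation, it is worth reading the same folder with the batch reader and the same schema. If the parsed columns come back null there as well, the declared schema does not match the JSON files.
// Batch-read the same folder with the same schema to inspect parsing.
val staticCheck = spark.read
  .schema(schema)
  .json("in/*.json")

staticCheck.printSchema()
staticCheck.show(5, truncate = false)  // null columns here usually indicate a schema mismatch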

Join files in Apache Spark

I have a file, code_count.csv, like this:
code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004
Another file, details.csv, like this:
code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu
I want the total sum for each code, but in the final output I want the exp_code, like this:
Aerogon international,5
Bloomberg Xtern,9
Classic Divide,4
Here is my code
var countData = sc.textFile("C:/path/to/code_count.csv")
var countDataKV = countData.filter(!_.startsWith("code")).map(x => x.split(",")).map(x => (x(0), x(1).toInt)) // skip the header, key by code
var sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)
gives
Array[(String, Int)] = Array((AE,5), (BX,9))
Here sum is RDD[(String, Int)]. I am kind of confused about how to pull the exp_code from the other file. Please guide.
You need to calculate the sum after grouping by code and then join with the other DataFrame. Below is a similar example.
import spark.implicits._
import org.apache.spark.sql.functions.sum
val df1 = spark.sparkContext.parallelize(Seq(("AE",2,2008), ("AE",3,2008), ("BX",1,2005), ("CD",4,2004), ("HU",1,2003), ("BX",8,2004)))
.toDF("code","count","year")
val df2 = spark.sparkContext.parallelize(Seq(("AE","Aerogon international"),
("BX","Bloomberg Xtern"), ("CD","Classic Divide"), ("HU","Honololu"))).toDF("code","exp_code")
val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))
val finalDF = sumdf1.join(df2, "code").drop("code")
finalDF.show()
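If you would rather stay with the RDD API used in the question, here is a minimal sketch reusing the question's sum (RDD[(String, Int)]) and sc, plus the details.csv layout shown above:
// Sketch only: join the per-code totals with the code -> exp_code lookup.
val detailsData = sc.textFile("C:/path/to/details.csv")
val detailsKV = detailsData
  .filter(!_.startsWith("code"))      // drop the header line
  .map(_.split(","))
  .map(x => (x(0), x(1)))             // (code, exp_code)

val result = sum.join(detailsKV)      // (code, (total, exp_code))
  .map { case (_, (total, expCode)) => (expCode, total) }

result.collect().foreach(println)     // e.g. (Aerogon international,5)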
If you are using Spark 2.0 or later you can use the following code directly; CSV support (com.databricks.spark.csv) is built into Spark as of 2.0.
val codeDF = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("hdfs://pathTo/code_count.csv")
val detailsDF = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("hdfs://pathTo/details.csv")
import org.apache.spark.sql.functions._
val resDF = codeDF.join(detailsDF, codeDF.col("code") === detailsDF.col("code"))
  .groupBy(codeDF.col("code"), detailsDF.col("exp_code"))
  .agg(sum("count").alias("cnt"))
output:
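For the sample files above, resDF should contain something along these lines (derived from the data shown; row order may vary):
resDF.show(truncate = false)
// +----+---------------------+---+
// |code|exp_code             |cnt|
// +----+---------------------+---+
// |AE  |Aerogon international|5  |
// |BX  |Bloomberg Xtern      |9  |
// |CD  |Classic Divide       |4  |
// |HU  |Honololu             |1  |
// +----+---------------------+---+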
If you are using Spark 1.6 or earlier, you can use the following code.
You can follow this link to use com.databricks.spark.csv:
https://github.com/databricks/spark-csv
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
import hiveContext.implicits._
val codeDF = hiveContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("delimiter",",")
.load("hdfs://pathTo/code_count.csv")
val detailsDF = hiveContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter",",")
.load("hdfs://pathTo/details.csv")
import org.apache.spark.sql.functions._
val resDF = codeDF.join(detailsDF, codeDF.col("code") === detailsDF.col("code"))
  .groupBy(codeDF.col("code"), detailsDF.col("exp_code"))
  .agg(sum("count").alias("cnt"))

Spark: Why does Python read input files twice but Scala doesn't when creating a DataFrame?

I'm using Spark v1.5.2. I wrote a program in Python and I don't understand why it reads the input files twice. The same program written in Scala only reads the input files once.
I use an accumulator to count the number of times that map() is called. From the accumulator value, I infer the number of times the input file is read.
The input file contains 3 lines of text.
Python:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import *

def createTuple(record):  # used with map()
    global map_acc
    map_acc += 1
    return (record[0], record[1].strip())

sc = SparkContext(appName='Spark test app')  # appName is shown in the YARN UI
sqlContext = SQLContext(sc)
map_acc = sc.accumulator(0)
lines = sc.textFile("examples/src/main/resources/people.txt")
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple)  #.cache()
fieldNames = 'name age'
fields = [StructField(field_name, StringType(), True) for field_name in fieldNames.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame(people_rdd, schema)
print 'record count DF:', df.count()
print 'map_acc:', map_acc.value
#people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 6 ##### why 6 instead of 3??
Scala:
import org.apache.spark._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object SimpleApp {
  def main(args: Array[String]) {
    def createTuple(record: Array[String], map_acc: Accumulator[Int]) = { // used with map()
      map_acc += 1
      Row(record(0), record(1).trim)
    }

    val conf = new SparkConf().setAppName("Scala Test App")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val map_acc = sc.accumulator(0)
    val lines = sc.textFile("examples/src/main/resources/people.txt")
    val people_rdd = lines.map(_.split(",")).map(createTuple(_, map_acc))
    val fieldNames = "name age"
    val schema = StructType(
      fieldNames.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val df = sqlContext.createDataFrame(people_rdd, schema)
    println("record count DF: " + df.count)
    println("map_acc: " + map_acc.value)
  }
}
$ spark-submit --class SimpleApp --master local[1] test.jar 2> err
record count DF: 3
map_acc: 3
If I remove the comments from the Python program and cache the RDD, then the input files are not read twice. However, I don't think I should have to cache the RDD, right? In the Scala version I don't need to cache the RDD.
people_rdd = lines.map(lambda l: l.split(",")).map(createTuple).cache()
...
people_rdd.unpersist()
$ spark-submit --master local[1] test.py 2> err
record count DF: 3
map_acc: 3
$ hdfs dfs -cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19
It happens because in 1.5 createDataFrame eagerly validates provided schema on a few elements:
elif isinstance(schema, StructType):
    # take the first few rows to verify schema
    rows = rdd.take(10)
    for row in rows:
        _verify_type(row, schema)
In contrast, current versions validate the schema for all elements, but it is done lazily, so you would not see the same behavior. For example, this would fail immediately in 1.5:
from pyspark.sql.types import *
rdd = sc.parallelize([("foo", )])
schema = StructType([StructField("foo", IntegerType(), False)])
sqlContext.createDataFrame(rdd, schema)
but the 2.0 equivalent would fail only when you try to evaluate the DataFrame.
In general you shouldn't expect that Python and Scala code will behave the same way unless you strictly limit yourself to interactions with the SQL API. PySpark:
- implements almost all RDD methods natively, so the same chain of transformations can result in a different DAG;
- interactions with the Java API may require an eager evaluation to provide type information for Java classes.

Resources