compute string length in Spark SQL DSL - apache-spark

Edit: this is an old question concerning Spark 1.2
I've been trying to compute the length of a string column in a SchemaRDD on the fly, for orderBy purposes. I am learning Spark SQL, so my question is strictly about the DSL or the SQL interface that Spark SQL exposes, and about their limitations.
My first attempt has been to use the integrated relational queries, for instance
notes.select('note).orderBy(length('note))
with no luck at the compilation:
error: not found: value length
(Which makes me wonder where to find what "Expression" this DSL can actually resolve. For instance, it resolves "+" for column additions.)
Then I tried
sql("SELECT note, length(note) as len FROM notes")
This fails with
java.util.NoSuchElementException: key not found: length
(Then I reread this (I'm running 1.2.0)
http://spark.apache.org/docs/1.2.0/sql-programming-guide.html#supported-hive-features
and wonder in what sense Spark SQL supports the listed Hive features.)
Questions: is the length operator really supported in Expressions and/or in SQL statements? If yes, what is the syntax? (bonus: is there a specific documentation about what is resolved in Spark SQL Expressions, and what would be the syntax in general?)
Thanks!

Try this in Spark Shell:
case class Note(id: Int, text: String)
val notes = List(Note(1, "One"), Note(2, "Two"), Note(3, "Three"))
val notesRdd = sc.parallelize(notes)
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
import hc.createSchemaRDD
notesRdd.registerTempTable("note")
hc.sql("select id, text, length(text) from note").foreach(println)
It works on my setup (out-of-the-box Spark 1.2.1 with Hadoop 2.4):
[2,Two,3]
[1,One,3]
[3,Three,5]

It now exists!
Your spark.sql("SELECT note, LENGTH(note) as len FROM notes") should work.
I'm running Spark 2.2.0; I just tried it and it worked.
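For reference, the same ordering can also be expressed in the DataFrame DSL via the length function from org.apache.spark.sql.functions (available since Spark 1.5). A minimal sketch, assuming notes is a DataFrame (the SchemaRDD successor) with a string column note:
import org.apache.spark.sql.functions.{col, length}

// order rows by the length of the "note" column
notes.select(col("note")).orderBy(length(col("note"))).show()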

Related

Why are raw strings not supported in Spark SQL when running in EMR?

In Pyspark, using a Spark SQL function such as regexp_extract or regexp_replace, raw strings (string literals prefixed with r) are supported when running locally, but not when running on EMR.
A simple example to reproduce the issue is:
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
which will run successfully on Pyspark 3.3.0 locally but raises the parsing exception:
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported
when executed on EMR. Looking at the session configuration options for Spark SQL, there don't appear to be any options that would change how raw strings are parsed; the closest option is spark.sql.parser.quotedRegexColumnNames.
Anecdotally, I remember having a conversation with a colleague a few years ago who said something about AWS having an internal custom Spark for running on EMR, but I have not found any documentation to corroborate that. Also, even if that were the case, I imagine they would maintain support for critical features like this.
There could also be a Spark configuration option which I missed during my investigation.
For anyone who may have some deeper insight or recognizes the issue, why does this discrepancy exist?
Thank you in advance!
Related posts:
SparkSQL Regular Expression: Cannot remove backslash from text (Developer tried a recommended solution for regex problem using raw strings, but failed with the parsing error on EMR - example code snippet based on this question)
regexp extract pyspark sql: ParseException Literals of type 'R' are currently not supported ("If I try this code with Pyspark locally it works (Pyspark version 3.3.0), but when I run this code in an EMR job it fails, I'm using emr-6.6.0 as application")

In Spark 2.4, doesn't Spark JDBC allow specifying a built-in function as the partitionColumn?

I am trying to upgrade from Spark 2.2.1 to 2.4.0.
In Spark 2.2, the following worked fine.
val query = "(select id, myPartitionColumnString from myTable) query"
val splitColumn = "CHECKSUM(myPartitionColumnString)"
spark.read.jdbc(jdbcUrl, query, splitColumn, lowerBound, upperBound, numPartitions, connectionProperties)
But in Spark 2.4, it causes an error like this:
User-defined partition column CHECKSUM(myPartitionColumnString) not found in the JDBC relation: struct<id: int, myPartitionColumnString: string>
I'm sure CHECKSUM is defined.
They removed it during the introduction of the "pass direct SQL query" functionality; the breaking change was introduced in 2.4.0. It was more of a hack anyway, and there's no way to achieve this now. You can still use it in 2.3, though.
PS: if somebody finds another way to achieve the same behaviour, please contact me; I'm very interested.
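One idea that may be worth trying (an untested sketch, not something confirmed by the answer above): compute the expression inside the subquery and give it an alias, so that Spark 2.4 sees it as an ordinary column of the JDBC relation, then partition on that alias instead of the raw expression (part_col below is a hypothetical name):
// expose the checksum as a named column of the subquery
val query = "(select id, myPartitionColumnString, CHECKSUM(myPartitionColumnString) as part_col from myTable) query"
val splitColumn = "part_col"  // partition on the aliased column, not the expression
spark.read.jdbc(jdbcUrl, query, splitColumn, lowerBound, upperBound, numPartitions, connectionProperties)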

How to work with PySpark, SparkSQL and Cassandra?

I am a bit confused by the different actors in this story: PySpark, SparkSQL, Cassandra and the pyspark-cassandra connector.
As I understand Spark evolved quite a bit and SparkSQL is now a key component (with the 'dataframes'). Apparently there is absolutely no reason to work without SparkSQL, especially if connecting to Cassandra.
So my question is: what components are needed, and how do I connect them together in the simplest way possible?
With spark-shell in Scala I could do simply
./bin/spark-shell --jars spark-cassandra-connector-java-assembly-1.6.0-M1-SNAPSHOT.jar
and then
import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("mykeyspace")
val dataframe = cc.sql("SELECT count(*) FROM mytable group by beamstamp")
How can I do that with pyspark?
Here are a couple of subquestions along with partial answers I have collected (correct if I'm wrong).
Is pyspark-cassandra needed? (I don't think so; I don't understand what it was doing in the first place.)
Do I need to use pyspark or could I use my regular jupyter notebook and import the necessary things myself?
Pyspark should be started with the spark-cassandra-connector package as described in the Spark Cassandra Connector python docs.
./bin/pyspark \
  --packages com.datastax.spark:spark-cassandra-connector_$SPARK_SCALA_VERSION:$SPARK_VERSION
With this loaded you will be able to use any of the DataFrame operations already present in Spark on C* DataFrames (see the connector documentation for more details on the options for working with C* DataFrames).
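You will typically also need to tell the connector where Cassandra lives, for example by adding (the host shown is just a placeholder):
  --conf spark.cassandra.connection.host=127.0.0.1
to the pyspark command above.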
To set this up to run with jupyter notebook just set up your env with the following properties.
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
And calling pyspark will start up a notebook correctly configured.
There is no need to use pyspark-cassandra unless you are interested in working with RDDs in Python, which has a few performance pitfalls.
In Python, the connector exposes the DataFrame API. As long as spark-cassandra-connector is available and SparkConf contains the required configuration, there is no need for additional packages. You can simply specify the format and options:
df = (sqlContext
    .read
    .format("org.apache.spark.sql.cassandra")
    .options(table="mytable", keyspace="mykeyspace")
    .load())
If you want to use plain SQL you can register the DataFrame as follows:
df.registerTempTable("mytable")
## Optionally cache
sqlContext.cacheTable("mytable")
sqlContext.sql("SELECT count(*) FROM mytable group by beamstamp")
Advanced features of the connector, like CassandraRDD, are not exposed to Python, so if you need something beyond DataFrame capabilities, pyspark-cassandra may prove useful.

What is the difference between Apache Spark SQLContext vs HiveContext?

What are the differences between Apache Spark SQLContext and HiveContext?
Some sources say that since HiveContext is a superset of SQLContext, developers should always use HiveContext, which has more features than SQLContext. But the current APIs of the two contexts are mostly the same.
In what scenarios is SQLContext or HiveContext more useful?
Is HiveContext only more useful when working with Hive?
Or is SQLContext all that is needed to implement a Big Data app using Apache Spark?
Spark 2.0+
Spark 2.0 provides native window functions (SPARK-8641), along with additional improvements in parsing and much better SQL 2003 compliance, so it is significantly less dependent on Hive for core functionality; because of that, HiveContext (SparkSession with Hive support) seems to be slightly less important.
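For illustration, a minimal Spark 2.0+ sketch (the app name is just a placeholder); SparkSession with enableHiveSupport() plays the role of the old HiveContext, and dropping that call gives behaviour closer to the plain SQLContext:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .enableHiveSupport()  // HiveQL specifics, Hive UDFs, Hive metastore tables
  .getOrCreate()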
Spark < 2.0
Obviously, if you want to work with Hive you have to use HiveContext. Beyond that, the biggest difference for now (Spark 1.5) is support for window functions and the ability to access Hive UDFs.
Generally speaking, window functions are a pretty cool feature and can be used to solve quite complex problems in a concise way without going back and forth between RDDs and DataFrames. Performance is still far from optimal, especially without a PARTITION BY clause, but that is really nothing Spark-specific.
Regarding Hive UDFs, this is not a serious issue now, but before Spark 1.5 many SQL functions were expressed using Hive UDFs and required HiveContext to work.
HiveContext also provides a more robust SQL parser. See for example: py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statetment
Finally, HiveContext is required to start the Thrift server.
The biggest problem with HiveContext is that it comes with large dependencies.
When programming against Spark SQL we have two entry points depending on
whether we need Hive support. The recommended entry point is the HiveContext to
provide access to HiveQL and other Hive-dependent functionality. The more basic
SQLContext provides a subset of the Spark SQL support that does not depend on
Hive.
- The separation exists for users who might have conflicts with including all of the Hive dependencies.
- Additional features of HiveContext which are not found in SQLContext include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables.
- Using a HiveContext does not require an existing Hive setup.
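As a minimal sketch of these two entry points in Spark < 2.0 (assuming a spark-shell session where sc already exists):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)  // adds the HiveQL parser, Hive UDFs, Hive tables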
HiveContext is still a superset of SQLContext; it has certain extra capabilities, such as reading the configuration from hive-site.xml. If you use Hive, use HiveContext; otherwise simply use SQLContext.

Spark SQL Stackoverflow

I'm a newbie to Spark and Spark SQL, and I was trying to run the example from the Spark SQL website: just a simple SQL query after loading the schema and data from a directory of JSON files, like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val path = "/home/shaza90/Desktop/tweets_1428981780000"
val tweet = sqlContext.jsonFile(path).cache()
tweet.registerTempTable("tweet")
tweet.printSchema() //This one works fine
val texts = sqlContext.sql("SELECT tweet.text FROM tweet").collect().foreach(println)
The exception that I'm getting is this one:
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
Update
I'm able to execute select * from tweet but whenever I use a column name instead of * I get the error.
Any Advice?
This is SPARK-5009 and has been fixed in Apache Spark 1.3.0.
The issue was that to recognize keywords (like SELECT) with any case, all possible uppercase/lowercase combinations (like seLeCT) were generated in a recursive function. This recursion would lead to the StackOverflowError you're seeing, if the keyword was long enough and the stack size small enough. (This suggests that if upgrading to Apache Spark 1.3.0 or later is not an option, you can use -Xss to increase the JVM stack size as a workaround.)
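For example, when launching the shell the driver stack size can be raised with something along these lines (the 4m value is only illustrative):
./bin/spark-shell --driver-java-options "-Xss4m"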
