Spark DataFrame column name case sensitivity in Spark SQL and spark-submit - apache-spark

When I query DataFrames in spark-shell (version 1.6), the column names are case insensitive.
In spark-shell:
val a = sqlContext.read.parquet("<my-location>")
a.filter($"name" <=> "andrew").count()
a.filter($"NamE" <=> "andrew").count()
Both of the above give me the right count.
But when I build this into a jar and run it via spark-submit, the code below fails saying NamE does not exist, since the underlying Parquet data was saved with the column named "name".
Fails:
a.filter($"NamE" <=> "andrew").count()
Pass:
a.filter($"name" <=> "andrew").count()
Am I missing something here? Is there a way I can make it case insensitive?
I know I can use a select before filtering and alias all columns to lowercase, but I wanted to know why it behaves differently.

It's a bit tricky: the plain answer is that you think you're using the same SQLContext in both cases when, actually, you're not. In spark-shell, a SQLContext is created for you, but it's actually a HiveContext:
scala> sqlContext.getClass
res3: Class[_ <: org.apache.spark.sql.SQLContext] = class org.apache.spark.sql.hive.HiveContext
and in your spark-submit job, you probably use a plain SQLContext. According to @LostInOverflow's link, Hive is case insensitive, while Parquet is not, so my guess is the following: by using a HiveContext you're probably using some code associated with Hive to read your Parquet data. Hive being case insensitive, it works fine. With a plain SQLContext, it doesn't, which is the expected behavior.
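If that is the cause, one way to check is to build the same kind of context in your spark-submit job. A minimal Spark 1.6 sketch, assuming a standalone main object and reusing the placeholder path from the question:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CaseInsensitiveRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("case-insensitive-read"))
    // Use a HiveContext, as spark-shell does, instead of a plain SQLContext
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val a = sqlContext.read.parquet("<my-location>")
    // Should now resolve the column the same way spark-shell does
    println(a.filter($"NamE" <=> "andrew").count())
  }
}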

The part you're missing:
... is case insensitive, while Parquet is not
You can try:
val b = df.toDF(df.columns.map(_.toLowerCase): _*)
b.filter(...)

Try controlling the case sensitivity explicitly through the sqlContext.
Turn off case sensitivity using the statement below and check whether it helps.
sqlContext.sql("set spark.sql.caseSensitive=false")

Related

What changes are required when moving a simple synapsesql implementation from Spark 2.4.8 to Spark 3.1.2?

I have a simple implementation of the .write.synapsesql() method (code shown below) that works in Spark 2.4.8 but not in Spark 3.1.2 (documentation/example here). The data in use is a simple notebook-created foobar-type table. Searching online for key phrases from and about the error did not turn up any new information for me.
What is the cause of the error in 3.1.2?
Spark 2.4.8 version (behaves as desired):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)
Spark 3.1.2 version (the extra callback argument is the same as in the documentation; it can also be left out with a similar result):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
Some(callBackFunctionToReceivePostWriteMetrics))
The resulting error (only in 3.1.2) is:
WriteFailureCause -> java.lang.IllegalArgumentException: Failed to derive `https` scheme based staging location URL for SQL COPY-INTO}
As the documentation from the question states, ensure that you are setting the options correctly with
val writeOptionsWithAADAuth:Map[String, String] = Map(Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")
and including the options in your .write statement like so:
df.write.options(writeOptionsWithAADAuth).synapsesql(...)
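Putting the two pieces together, a sketch of the full Spark 3.1.2 write, reusing the placeholder server, container, and table names from the snippets above:
val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")

val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write
  .options(writeOptionsWithAADAuth)
  .synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
    Some(callBackFunctionToReceivePostWriteMetrics))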

The first entry point to Spark SQL

I'm having trouble finding the first line executed in the Spark source code after I run "spark.sql(SQL_QUERY).explain()".
Does anyone have any idea which module/package I could start to look into?
Thanks.
First of all, you need to create a SparkSession (or sqlContext) and register a temporary table from a DataFrame, then query the temporary table like this:
results = spark.sql("SELECT * FROM people")
names = results.rdd.map(lambda p: p.name)
So I guess the first line is this one :
https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L642
But many lines have already been "executed" by that point, specifically to create the SparkSession.
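For reference, the method at that line is short; from memory, the 2.4.x body looks roughly like the sketch below (check the linked file for the exact code). The parser produces an unresolved logical plan and Dataset.ofRows hands it to the analyzer, so those are the next places to look:
// org.apache.spark.sql.SparkSession, Spark 2.4.x (approximate)
def sql(sqlText: String): DataFrame = {
  Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}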

PySpark throwing ParseException for syntactically correct Hive query

I have a DDL query that works fine within beeline, but when I try to run the same query within a SparkSession it throws a ParseException.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")
# Create Spark Session
sparkSession = (SparkSession\
.builder\
.appName('test_case')\
.enableHiveSupport()\
.getOrCreate())
sparkSession.sql("CREATE EXTERNAL TABLE B LIKE A")
PySpark exception:
pyspark.sql.utils.ParseException: u"\nmismatched input 'LIKE' expecting <EOF>(line 1, pos 53)\n\n== SQL ==\nCREATE EXTERNAL TABLE B LIKE A\n-----------------------------------------------------^^^\n"
How can I make the HiveQL statement work within PySpark?
The problem seems to be that the query is parsed as a Spark SQL query and not as a HiveQL query, even though I have enableHiveSupport activated for the SparkSession.
spark.sql() uses the Spark SQL parser by default. To enable HiveQL syntax, I believe you need to give it a hint about your intent via a comment. (In fairness, I don't think this is well documented; I've only been able to find a tangential reference to this being a thing here, and only in the Scala version of the example.)
For example, I'm able to get my command to parse by writing:
%sql
-- `USING HIVE`
CREATE TABLE narf LIKE poit
Now, I don't have Hive Support enabled on my session, so my query fails... but it does parse!
Edit: Since your SQL statement is in a Python string, you can use a multi-line string to use the single-line comment syntax, like this:
sparkSession.sql("""
-- `USING HIVE`
CREATE EXTERNAL TABLE B LIKE A
""")
There's also a delimited comment syntax in SQL, e.g.
sparkSession.sql("/* `USING HIVE` */ CREATE EXTERNAL TABLE B LIKE A")
which may work just as well.

Unable to append "Quotes" in write for dataframe

I am trying to save a DataFrame as .csv in Spark. All fields are required to be bounded by quotes. Currently, the fields in the file are not enclosed in quotes.
I am using Spark 2.1.0
Code :
DataOutputResult.write.format("com.databricks.spark.csv").
option("header", true).
option("inferSchema", false).
option("quoteMode", "ALL").
mode("overwrite").
save(Dataoutputfolder)
Output format (actual):
Name, Id,Age,Gender
XXX,1,23,Male
Output format (required):
"Name", "Id" ," Age" ,"Gender"
"XXX","1","23","Male"
Options I have tried so far:
quoteMode and quote in the write options, but with no success.
("quote", "all"), replace quoteMode with quote
or play with concat or concat_wsdirectly on df columns and save without quote - mode
import org.apache.spark.sql.functions.{concat, lit}
val quote = lit("\"")
val newDF = df.select(concat(quote, $"Name", quote), concat(quote, $"Age", quote))
or create your own UDF to add the desired behaviour; please find more examples in Concatenate columns in apache spark dataframe
Unable to add this as a comment to the above answer, so posting as an answer.
In Spark 2.3.1, use quoteAll:
df1.write.format("csv")
.option("header", true)
.option("quoteAll","true")
.save(Dataoutputfolder)
Also, to add to the comment by @Karol Sudol (great answer, btw): .option("quote", "\u0000") will work only if one is using PySpark with Python 3, whose default encoding is 'utf-8'. A few people reported that the option did not work, because they must have been using PySpark with Python 2, whose default encoding is 'ascii', hence the error "java.lang.RuntimeException: quote cannot be more than one character".

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to run Spark SQL queries on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though the field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
What I did was upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering approach works. I still can't explain why 1.3.0 has problems with it, though.
