I'm having trouble with Spark SQL. I tried to import a CSV file into a Spark SQL table. My columns are separated by semicolons, and I have tried to split them using the sep option, but the columns still are not separated properly.
Is this how Spark SQL works, or is there a difference between conventional Spark SQL and the one in Databricks? I am new to Spark SQL; it is a whole new environment compared to the original SQL language, so please pardon my limited knowledge of it.
USE CarSalesP1935727;
CREATE TABLE IF NOT EXISTS Products
USING CSV
OPTIONS (path "/FileStore/tables/Products.csv", header "true", inferSchema
"true", sep ";");
SELECT * FROM Products LIMIT 10
I'm not sure what the problem is; this works well for me -
Please note that my environment is not Databricks.
val path = getClass.getResource("/csv/test2.txt").getPath
println(path)
/**
* file data
* -----------
* id;sequence1;sequence2
* 1;657985;657985
* 2;689654;685485
*/
spark.sql(
s"""
|CREATE TABLE IF NOT EXISTS Products
|USING CSV
|OPTIONS (path "$path", header "true", inferSchema
|"true", sep ";")
""".stripMargin)
spark.sql("select * from Products").show(false)
/**
* +---+---------+---------+
* |id |sequence1|sequence2|
* +---+---------+---------+
* |1 |657985 |657985 |
* |2 |689654 |685485 |
* +---+---------+---------+
*/
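For reference, the same options can also be passed through the DataFrame reader API instead of a CREATE TABLE statement (a minimal sketch, assuming the same semicolon-separated file):
// equivalent read using the DataFrame reader; sep/header/inferSchema
// mirror the options in the OPTIONS clause above
val products = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", ";")
  .csv(path)
products.show(false)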
I have a table with a single column and a single row containing a where clause.
from pyspark.sql.types import *
where_clause_df=spark.createDataFrame([('A > 1',)],schema=StructType([StructField("a_where", StringType(), nullable=True)]))
where_clause_df.createOrReplaceTempView("where_clause")
spark.sql("select * from where_clause").show()
+-------+
|a_where|
+-------+
| A > 1|
+-------+
With another table,
sample_df=spark.createDataFrame([(1,)],schema=StructType([StructField("A", IntegerType(), nullable=True)]))
sample_df.createOrReplaceTempView("sample")
spark.sql("select * from sample").show()
I want to apply this a_where condition to the table sample. Something like:
spark.sql("""
select * from sample where (select a_where from where_clause)
""").show()
Is this possible with Spark SQL?
tl;dr Use collect on the where_clause table.
Think of the data as something that lives (almost) always on the executors, from which you are not allowed to execute queries. That's by design.
Since you want to execute queries, you need everything available on the driver, so you have to bring this extra metadata for your queries (like where clauses) to the driver. Bingo! That's exactly what collect does.
Mind, though, that the data you "download" to the driver using collect has to fit within the memory available to that single driver process (which is likely the case here).
You are trying to extract the where-clause string from your temp view inside the query itself, hence the error.
You can modify your code slightly to achieve this:
where_clause_df = spark.createDataFrame([('A > 1',)], schema=StructType([StructField("a_where", StringType(), nullable=True)]))
where_clause_df.createOrReplaceTempView("where_clause")
spark.sql("select * from where_clause").show()
sample_df = spark.createDataFrame([(1,)], schema=StructType([StructField("A", IntegerType(), nullable=True)]))
sample_df.createOrReplaceTempView("sample")
spark.sql("select * from sample").show()
# where_clause now contains the string 'A > 1'
where_clause = spark.sql("select a_where from where_clause").collect()[0][0]
query = f"""
select *
from sample
where {where_clause}
"""
spark.sql(query).show()
+---+
| A|
+---+
+---+
Further, if there are multiple conditions, you can iterate over them and modify the query in each iteration to extract the results, as in the sketch below.
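For illustration, a minimal Scala sketch of that iteration (assuming the where_clause table holds one condition string per row and you want each result set separately):
// collect all condition strings to the driver, then run one query per condition
val conditions = spark.sql("select a_where from where_clause")
  .collect()
  .map(_.getString(0))
conditions.foreach { cond =>
  // each condition becomes the WHERE clause of its own query
  spark.sql(s"select * from sample where $cond").show()
}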
I am trying to provide a broadcast hint to the table which is smaller in size, but the physical plan still shows SortMergeJoin.
spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()
Output: the physical plan shows a SortMergeJoin.
Note:
The tables are only a few KBs in size (test data).
The joining column 'serial_id' is not a partition column.
Using AWS Glue Catalog as the metastore.
Spark version: 2.4.4.
I have tried the BROADCASTJOIN and MAPJOIN hints as well.
When I use created_date (a partition column) instead of serial_id as the joining condition, it shows a broadcast join -
spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.created_date = c.created_date').explain()
Output: the physical plan shows a BroadcastHashJoin.
Why is Spark behaving strangely with AWS Glue Catalog as my metastore?
In the BROADCAST hint you need to pass the alias of the table (since you use aliases in your SQL statement).
Try /*+ BROADCAST(c) */ instead of /*+ BROADCAST(pratik_test_temp.crosswalk2016) */:
spark.sql('select /*+ BROADCAST(c) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()
I have to perform an operation on all the tables from the given database(s), so I am using the following code.
However, it returns views as well; is there a way to get only tables?
Code:
def getTables(databaseName: String)(implicit spark: SparkSession): Array[String] = {
  // "show tables" returns (database, tableName, isTemporary); keep the table name
  val tables = spark.sql(s"show tables from ${databaseName}").collect().map(_(1).asInstanceOf[String])
  logger.debug(s"${tables.mkString(",")} found")
  tables
}
Also, `show views` throws an error:
scala> spark.sql("show views from gshah03;").show
org.apache.spark.sql.catalyst.parser.ParseException:
missing 'FUNCTIONS' at 'from'(line 1, pos 11)
== SQL ==
show views from gshah03;
-----------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
... 49 elided
Try this:
import org.apache.spark.sql.catalog.{Database, Table}
import org.apache.spark.sql.Dataset

val df = spark.range(1, 5)
df.createOrReplaceTempView("df_view")
println(spark.catalog.currentDatabase)
val db: Database = spark.catalog.getDatabase(spark.catalog.currentDatabase)
val tables: Dataset[Table] = spark.catalog.listTables(db.name)
tables.show(false)
/**
* default
* +-------+--------+-----------+---------+-----------+
* |name |database|description|tableType|isTemporary|
* +-------+--------+-----------+---------+-----------+
* |df_view|null |null |TEMPORARY|true |
* +-------+--------+-----------+---------+-----------+
*/
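To keep only real tables and drop the views, you can filter on the tableType field of the returned Table objects (a sketch; the exact type strings, such as MANAGED, EXTERNAL, VIEW, depend on your metastore):
// collect the catalog entries and keep only those that are not views
val onlyTables: Array[String] = tables
  .collect()
  .filter(t => t.tableType != null && t.tableType != "VIEW" && !t.isTemporary)
  .map(_.name)
println(onlyTables.mkString(","))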
I have some simple code:
test("Dataset as method") {
val spark = SparkSession.builder().master("local").appName("Dataset as method").getOrCreate()
import spark.implicits._
//xyz is an alias of ds1
val ds1 = Seq("1", "2").toDS().as("xyz")
//xyz can be used to refer to the value column
ds1.select($"xyz.value").show(truncate = false)
//ERROR here, no table or view named xyz
spark.sql("select * from xyz").show(truncate = false)
}
It looks to me like xyz is a table name, but the SQL select * from xyz raises an error complaining that xyz doesn't exist.
So I want to ask: what does the as method really mean, and how should I use the alias (xyz in my case)?
.as(), when used on a Dataset (as in your case), is a function that creates an alias for the Dataset, as you can see in the API doc:
/**
* Returns a new Dataset with an alias set.
*
* @group typedrel
* @since 1.6.0
*/
def as(alias: String): Dataset[T] = withTypedPlan {
SubqueryAlias(alias, logicalPlan)
}
which can be used only in the function APIs such as select, join, filter, etc. The alias cannot be used in SQL queries.
It is more evident if you create a two-column dataset and use the alias as you did:
val ds1 = Seq(("1", "2"),("3", "4")).toDS().as("xyz")
Now you can use select to select only one column using the alias:
ds1.select($"xyz._1").show(truncate = false)
which should give you
+---+
|_1 |
+---+
|1 |
|3 |
+---+
The use of the as alias is more evident when you join two datasets that have the same column names, where you can write the join condition using the aliases, as in the sketch below.
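For example, a minimal sketch of that join case (assuming a second dataset ds2 with the same column names as ds1):
val ds2 = Seq(("1", "a"), ("3", "b")).toDS().as("abc")
// the aliases disambiguate the otherwise identical column name _1 in the join condition
ds1.join(ds2, $"xyz._1" === $"abc._1").show(false)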
But to use the alias in SQL queries you will have to register the table:
ds1.registerTempTable("xyz")
spark.sql("select * from xyz").show(truncate = false)
which should give you the correct result
+---+---+
|_1 |_2 |
+---+---+
|1 |2 |
|3 |4 |
+---+---+
Or, even better, do it the new way:
ds1.createOrReplaceTempView("xyz")
Spark SQL can query a CSV file directly. See the example below.
val df = spark.sql("SELECT * FROM csv.`csv/file/path/in/hdfs`")
However, how can we let Spark know that there's a header line in the CSV file?
You can use a view:
spark.sql("""CREATE TEMPORARY VIEW df
USING csv
OPTIONS (header "true", path "csv/file/path/in/hdfs")""")
spark.sql("""SELECT * FROM df""")