I am trying to query a Cassandra database from the Spark SQL shell.
Query:
select * from keyspace.tablename
where user_id = e3a119e0-8744-11e5-a557-e789fe3b4cc1;
Error: java.lang.RuntimeException: [1.88] failure: ``union'' expected but identifier e5 found
Also tried:
user_id = UUID.fromString('e3a119e0-8744-11e5-a557-e789fe3b4cc1')
user_id = 'e3a119e0-8744-11e5-a557-e789fe3b4cc1'
token(user_id) = token('e3a119e0-8744-11e5-a557-e789fe3b4cc1')
I am not sure how I can query data on a timeuuid column.
TimeUUID is not supported as a type in Spark SQL, so you can only do direct string comparisons. Represent the TIMEUUID as a string:
select * from keyspace.tablename where user_id = "e3a119e0-8744-11e5-a557-e789fe3b4cc1"
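The same comparison works through the DataFrame API. A minimal sketch, assuming the Spark Cassandra connector is on the classpath (keyspace and tablename are the placeholders from the question):

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="keyspace", table="tablename")
      .load())
# The TIMEUUID surfaces as a string, so compare it as one:
df.filter(df.user_id == "e3a119e0-8744-11e5-a557-e789fe3b4cc1").show()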
I have the following SQL statement where I am reading the database to get the records for one day. Here is what I tried in the pgAdmin console:
SELECT * FROM public.orders WHERE createdat >= now()::date AND type='t_order'
I want to convert this to psycopg2 syntax, but it throws an error:
Database connection failed due to invalid input syntax for type timestamp: "now()::date"
Here is what I am doing:
query = f"SELECT * FROM {table} WHERE (createdat>=%s AND type=%s)"
cur.execute(query, ("now()::date", "t_order"))
records = cur.fetchall()
Any help is deeply appreciated.
Do not use f-strings to build SQL. Use proper parameter passing.
now()::date is better expressed as current_date. See Current Date/Time.
You want:
query = "SELECT * FROM public.orders WHERE (createdat>=current_date AND type=%s)"
cur.execute(query, ["t_order"])
If you need dynamic identifiers (table/column names), then:
from psycopg2 import sql
query = sql.SQL("SELECT * FROM {} WHERE (createdat>=current_date AND type=%s)").format(sql.Identifier(table))
cur.execute(query, ["t_order"])
For more information see sql.
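Putting it together, a minimal end-to-end sketch (the connection string and the orders table name are placeholders, not from the question):

import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
with conn, conn.cursor() as cur:
    query = sql.SQL(
        "SELECT * FROM {} WHERE (createdat >= current_date AND type = %s)"
    ).format(sql.Identifier("orders"))
    cur.execute(query, ["t_order"])
    records = cur.fetchall()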
I'm trying to run a PySpark application like this:
from pyspark.sql import SparkSession

with SparkSession.builder.appName("Spark App").getOrCreate() as spark:
    dataframe_mysql = spark.read.format('jdbc').options(
        url="jdbc:mysql://.../...",
        driver='com.mysql.cj.jdbc.Driver',
        dbtable='my_table',
        user=...,
        password=...,
        partitionColumn='id',
        lowerBound=0,
        upperBound=10000000,
        numPartitions=11,
        fetchsize=1000000,
        isolationLevel='NONE'
    ).load()
    dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")
    dataframe_mysql.write.parquet('...')
I found that Spark didn't load any data from MySQL until the write executed. This means Spark let the database take care of filtering the data, so the SQL the database received may look like:
select * from my_table where id > ... and id < ... and date > '2022-01-01'
My table is too big and there is no index on the date column, so the database couldn't handle the filtering. How can I load the data into Spark's memory before filtering? I would like the query sent to the database to be:
select * from my_table where id > ... and id < ...
According to #samkart's comment, setting pushDownPredicate to False could solve this problem.
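A minimal sketch of the fix, assuming Spark 2.4+ where the JDBC source accepts the pushDownPredicate option (with it disabled, only the id-range partition predicates reach MySQL and the date filter is evaluated by Spark):

dataframe_mysql = spark.read.format('jdbc').options(
    url="jdbc:mysql://.../...",
    driver='com.mysql.cj.jdbc.Driver',
    dbtable='my_table',
    partitionColumn='id',
    lowerBound=0,
    upperBound=10000000,
    numPartitions=11,
    pushDownPredicate="false"  # disable filter pushdown to the database
).load()
# Now evaluated in Spark, after the rows have been fetched:
dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")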
I'm doing an ETL task of transforming queries from one SQL dialect to another. The old DB uses T-SQL, the new one HiveQL.
SELECT CAST(CONCAT(FMH.FLUIDMODELID,'_',RESERVOIR,'_',PRESSUREPSIA) AS NVARCHAR(255)) AS FACT_RRFP_INJ_PRESS_R_PHK
, FMH.FluidModelID ,FMH.FluidModelName ,[AnalysisDate]
FROM dbo.LZ_RRFP_FluidModelInj fmi
LEFT JOIN DBO.LZ_RRFP_FluidModelHeader fmh ON fmi.FluidModelIDFK = fmh.FluidModelID
LEFT JOIN LZ_RRFP_FluidModelAss fma on fma.InjectionFluidModelIDFK = fmi.FluidModelIDFK
WHERE FMA.RESERVOIR IN (SELECT RESERVOIR_CD FROM ATT_RESERVOIR)
The error is:
org.apache.spark.sql.catalyst.parser.ParseException:
DataType nvarchar(255) is not supported.
How do I convert the NVARCHAR?
Hive uses UTF-8 in STRINGs and VARCHARs, so you are fine using VARCHAR or STRING instead of NVARCHAR.
VARCHAR in Hive is the same as STRING plus length validation. As #NickW mentioned in the comments, you can drop the CAST entirely: if you insert the result into a table with a VARCHAR(255) column, it will behave the same without the CAST.
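For example, a sketch of the rewritten query submitted through spark.sql(). The substantive change is NVARCHAR(255) to STRING; the T-SQL bracket quoting on [AnalysisDate] and the dbo schema prefix also have to go (assuming the tables were migrated without a schema prefix):

df = spark.sql("""
    SELECT CAST(CONCAT(FMH.FLUIDMODELID, '_', RESERVOIR, '_', PRESSUREPSIA) AS STRING)
               AS FACT_RRFP_INJ_PRESS_R_PHK,
           FMH.FluidModelID, FMH.FluidModelName, AnalysisDate
    FROM LZ_RRFP_FluidModelInj fmi
    LEFT JOIN LZ_RRFP_FluidModelHeader fmh ON fmi.FluidModelIDFK = fmh.FluidModelID
    LEFT JOIN LZ_RRFP_FluidModelAss fma ON fma.InjectionFluidModelIDFK = fmi.FluidModelIDFK
    WHERE fma.RESERVOIR IN (SELECT RESERVOIR_CD FROM ATT_RESERVOIR)
""")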
I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query but the datetime format
is not right. So I tried this approach
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is not pushed down anymore.
Can someone clarify why this is the case ?
In general, Spark only pushes down filters it can translate into the JDBC source's SQL dialect; an expression built from Spark functions such as date_format is evaluated in Spark after the rows are loaded, which is why your second filter is not pushed down. When loading data from a database table, if you want the database to process the query and return only a few result rows, you can provide a 'query' instead of the 'table' and get just its result back as a DataFrame. This way you leverage the database engine to process the query and return only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in the FROM clause of a SQL query. Note that an alias must be provided in the query.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
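Applied to the timestamp filter from the question, the pushdown query could look like this. A sketch: it assumes last_datetime is already a string in a format the database's timestamp parser accepts, and that the value is trusted (the JDBC reader has no parameter binding, so it must be interpolated):

# The whole predicate lives inside the subquery, so the database evaluates it.
pushdown_query = f"(select * from tablename where ts > '{last_datetime}') t_alias"
df_new_data = spark.read.jdbc(url=data_base_url, table=pushdown_query,
                              properties=properties)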
I have a Spark SQL 2.1.1 job on a YARN cluster in cluster mode where I want to create an empty external Hive table (partitions with locations will be added in a later step).
CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
When I run the job I get the error:
CREATE EXTERNAL TABLE must be accompanied by LOCATION
But when I run the same query in the Hive Editor in Hue, it runs just fine. I tried to find an answer in the Spark SQL 2.1.1 documentation but came up empty.
Does anyone know why Spark SQL is more strict on queries?
TL;DR EXTERNAL with no LOCATION is not allowed.
The definitive answer is in Spark SQL's grammar definition file SqlBase.g4.
You can find the definition of CREATE EXTERNAL TABLE as createTableHeader:
CREATE TEMPORARY? EXTERNAL? TABLE (IF NOT EXISTS)? tableIdentifier
This definition is used in the supported SQL statements.
Unless I'm mistaken, locationSpec is optional according to the ANTLR grammar. The code may decide otherwise, and it seems it does.
scala> spark.version
res4: String = 2.3.0-SNAPSHOT
scala> val q = "CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"
scala> sql(q)
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: CREATE EXTERNAL TABLE must be accompanied by LOCATION(line 1, pos 0)
== SQL ==
CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
^^^
at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1.apply(SparkSqlParser.scala:1096)
at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1.apply(SparkSqlParser.scala:1064)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitCreateHiveTable(SparkSqlParser.scala:1064)
at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitCreateHiveTable(SparkSqlParser.scala:55)
at org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateHiveTableContext.accept(SqlBaseParser.java:1124)
at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
... 48 elided
The default SparkSqlParser (with astBuilder as SparkSqlAstBuilder) has the following assertion that leads to the exception:
if (external && location.isEmpty) {
  operationNotAllowed("CREATE EXTERNAL TABLE must be accompanied by LOCATION", ctx)
}
I'd consider reporting an issue in Spark's JIRA if you think the case should be allowed. See SPARK-2825 for a strong argument in favor of supporting it:
CREATE EXTERNAL TABLE already works as far as I know and should have the same semantics as Hive.
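Until then, the practical workaround is to supply a LOCATION up front. A minimal sketch (the path is a placeholder):

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS new_table (
        id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP)
    PARTITIONED BY (year INT, month INT, day INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'hdfs:///placeholder/path'  -- placeholder, point at the real directory
""")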