Is it possible to insert a value into a BYTES column using SQL INSERT? - google-cloud-spanner

Here's my schema:
CREATE TABLE Library (
Index INT64 NOT NULL,
Data BYTES(MAX) NOT NULL,
) PRIMARY KEY(Index);
Is it possible to insert into the Data column using an SQL INSERT statement? I tried base64-encoding the data and passing it as a string, hoping Spanner would be smart enough to detect base64, but no luck. Am I out of luck? Will I need to write an app using the Spanner client library?
Thanks for any input/advice!

You can do this by using an INSERT INTO (..) SELECT .. in combination with, for example, the FROM_BASE64 function. I'm not sure exactly which SQL client you are using in this case, but I just tried the following example using the latest version of DBeaver:
INSERT INTO Library (Index, Data)
SELECT 1, FROM_BASE64('Zm9v'); -- 'Zm9v' is valid base64 (the bytes of 'foo')
The latest version of DBeaver has built-in support for Cloud Spanner using the open source Cloud Spanner JDBC driver.
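If you do end up writing a small app with the client library instead, here is a minimal sketch using the Python client; the instance and database IDs are placeholders, and the payload is passed as raw bytes rather than base64:
from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

# Placeholder instance and database IDs; adjust to your environment.
client = spanner.Client()
database = client.instance("my-instance").database("my-database")

def insert_row(transaction):
    # BYTES columns accept raw Python bytes; no base64 round-trip is needed here.
    transaction.execute_update(
        "INSERT INTO Library (Index, Data) VALUES (@idx, @data)",
        params={"idx": 1, "data": b"raw binary payload"},
        param_types={"idx": param_types.INT64, "data": param_types.BYTES},
    )

database.run_in_transaction(insert_row)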

Related

Example for CREATE TABLE on TRINO using HUDI

I am using Spark Structured Streaming (3.1.1) to read data from Kafka and use HUDI (0.8.0) as the storage system on S3, partitioning the data by date. (no problems with this section)
I am looking to use Trino (355) to be able to query that data. As a precursor, I've already placed the hudi-presto-bundle-0.8.0.jar in /data/trino/hive/
I created a table with the following schema
CREATE TABLE table_new (
columns, dt
) WITH (
partitioned_by = ARRAY['dt'],
external_location = 's3a://bucket/location/',
format = 'parquet'
);
Even after calling the procedure below, Trino is unable to discover any partitions:
CALL system.sync_partition_metadata('schema', 'table_new', 'ALL')
My assessment is that I am unable to create a table under Trino using Hudi, largely because I am not able to pass the right values under the WITH options.
I am also unable to find a CREATE TABLE example for Hudi in the documentation.
I would really appreciate it if anyone could give me an example, or point me in the right direction in case I've missed anything.
Really appreciate the help
Small Update:
Tried Adding
connector = 'hudi'
but this throws the error:
Catalog 'hive' does not support table property 'connector'
Have you tried the approaches described in the references below?
Reference: https://hudi.apache.org/docs/next/querying_data/#trino
https://hudi.apache.org/docs/query_engine_setup/#PrestoDB
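If it helps to script the partition sync and a quick smoke test, below is a rough sketch using the trino Python client; the host, port, and user are assumptions, and the schema/table names are the question's placeholders:
import trino

# Placeholder connection details for the Trino coordinator.
conn = trino.dbapi.connect(
    host="trino-coordinator", port=8080, user="etl", catalog="hive", schema="schema"
)
cur = conn.cursor()

# Re-run the partition discovery from above, then check what Trino can see.
cur.execute("CALL system.sync_partition_metadata('schema', 'table_new', 'ALL')")
cur.fetchall()

cur.execute("SELECT dt, count(*) FROM table_new GROUP BY dt")
for row in cur.fetchall():
    print(row)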

How do I load gzipped JSON data into a table using Spark SQL's CREATE TABLE query

I want to connect Apache Superset with Apache Spark (I have Spark 3.1.2) and query the data in Superset's SQL Lab using Apache Spark SQL.
On Spark's master, I started the Thrift server using this command: spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Then I added the Spark cluster as a database in Superset using the SQLAlchemy URI hive://hive@spark:10000/. I am able to access the Spark cluster from Superset.
I can load JSON data as a table using this SQL:
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json"
and I am able to query data using simple SQL statements like SELECT * FROM test_table LIMIT 10.
But the problem is that the JSON data is compressed as gzipped files.
So I tried
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json.gz"
but it did not work. I want to know how to load gzipped JSON data into a table.
Compressed JSON storage
If you have large JSON text, you can explicitly compress it using the built-in COMPRESS function. In the following example, the compressed JSON content is stored as binary data, and a computed column decompresses the JSON back to the original text using the DECOMPRESS function:
CREATE TABLE Person
( _id int identity constraint PK_JSON_ID primary key,
data varbinary(max),
value AS CAST(DECOMPRESS(data) AS nvarchar(max))
)
INSERT INTO Person(data)
VALUES (COMPRESS(@json))
COMPRESS and DECOMPRESS functions use standard GZip compression.
Another example:
CREATE EXTENSION json_fdw;
postgres=# CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
postgres=# CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT,
"review.date" DATE,
"review.rating" INTEGER,
"product.id" CHAR(10),
"product.group" TEXT,
"product.title" TEXT,
"product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');
Note: This example was taken from https://www.citusdata.com/blog/2013/05/30/run-sql-on-json-files-without-any-data-loads
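Note that neither of the two snippets above is Spark SQL. For the Spark Thrift Server setup in the question, one hedged alternative is to let Spark's JSON reader handle the gzip codec (it is inferred from the .gz extension) and persist the result as a table; the path and table name are just the question's placeholders:
# Spark's text-based sources decompress .gz input transparently,
# so the gzipped JSON can be read directly...
df = spark.read.json("/path/to/data.json.gz")

# ...and then saved as a managed table, assuming this session shares the
# metastore that the Thrift server / Superset is pointed at.
df.write.mode("overwrite").saveAsTable("test_table")

spark.sql("SELECT * FROM test_table LIMIT 10").show()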

How to query INFORMATION_SCHEMA view using spark bq connector?

I'm trying to identify partitions which got updated from a BQ table using the below query:
select * from PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS where
table_name='TABLE-NAME' and
extract(date from last_modified_time)='TODAY-DATE'
This is working fine from the BQ console. However when I use the same query from spark-bq connector it's failing.
spark.read.format("bigquery").load("PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS")
Error:
"Invalid project ID PROJECT-ID. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash."
I tried multiple combinations, like adding ` around PROJECT-ID, but the API is still throwing a 400 error.
What is the right way to query the INFORMATION_SCHEMA from spark-bq connector?
Setting the project ID as parentProject solves the issue:
spark.read \
    .format("bigquery") \
    .option('parentProject', project_id) \
    .load("PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS")
INFORMATION_SCHEMA is not a standard dataset in BigQuery, and as such is not available via the BigQuery Storage API used by the spark-bigquery connector. However, you can query it and load the data into a dataframe in the following manner (Scala, using the connector's SQL-query mode):
spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")
val tablesDF = spark.read.format("bigquery").load("select * from `<projectId>.<dataset>.__TABLES__`")
The same approach works from PySpark for the INFORMATION_SCHEMA views, for example:
def read_information_schema_tables(spark, project_id, dataset):
    table = "INFORMATION_SCHEMA.TABLES"
    sql = f"""SELECT *
    FROM {project_id}.{dataset}.{table}
    """
    return (
        spark
        .read
        .format('bigquery')
        .load(sql)
    )
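For the PARTITIONS view the question actually targets, a hedged sketch that combines the SQL-query mode with the parentProject option might look like this (PROJECT-ID, DATASET, and TABLE-NAME are the question's placeholders; the materialization dataset is an assumption):
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "DATASET")  # scratch dataset used to materialize query results

partitions_df = (
    spark.read.format("bigquery")
    .option("parentProject", "PROJECT-ID")  # billing project; placeholder from the question
    .load("""
        SELECT *
        FROM `PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS`
        WHERE table_name = 'TABLE-NAME'
    """)  # add the last_modified_time filter from the question as needed
)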

Update table from PySpark using JDBC

I have a small log dataframe which has metadata regarding the ETL performed within a given notebook; the notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job',mode='append')
This works swimmingly; however, as my Data Factory is invoking a stored procedure,
I need to work in a query like:
query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, it only seems to mention using SELECT statements with the query parameter:
Target Database is an Azure SQL Database.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Just to add: this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with dataframes. You can only append or replace the entire table.
You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (How to install PYODBC in Databricks), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
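As a rough sketch of the pyodbc route (the driver name, server, database, and credentials are placeholders, and the SET clause mirrors the intent of the UPDATE in the question):
import pyodbc

# Placeholder connection details for the Azure SQL Database target.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword"
)

date = "2021-01-01"  # placeholder for the ingestion date computed in the notebook
cursor = conn.cursor()
cursor.execute(
    """
    UPDATE [internal].[Job]
    SET [MaxIngestionDate] = ?,
        [DataLakeMetadataRaw] = NULL,
        [DataLakeMetadataCurated] = NULL
    WHERE [IsRunning] = 1
      AND [FinishDateTime] IS NULL
    """,
    date,
)
conn.commit()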

Unable to read column types from amazon redshift using psycopg2

I'm trying to access the types of columns in a table in redshift using psycopg2.
I'm doing this by running a simple query on pg_table_def, as follows:
SELECT * FROM pg_table_def;
This returns the traceback:
psycopg2.NotSupportedError: Column "schemaname" has unsupported type "name"
So it seems like the types of the columns that store schema (and other similar information on further queries) are not supported by psycopg2.
Has anyone run into this issue or a similar one and is aware of a workaround? My primary goal in this is to be able to return the types of columns in the table. For the purposes of what I'm doing, I can't use another postgresql adapter.
Using:
python- 3.6.2
psycopg2- 2.7.4
pandas- 0.17.1
You could do something like the below, and return the result back to the calling service.
cur.execute("select * from pg_table_def where tablename='sales'")
results = cur.fetchall()
for row in results:
    print("ColumnName=>" + row[2] + ",DataType=>" + row[3] + ",encoding=>" + row[4])
Not sure about the exception; if all the permissions are fine, then it should work and print something like the below.
ColumnName=>salesid,DataType=>integer,encoding=>lzo
ColumnName=>commission,DataType=>numeric(8,2),encoding=>lzo
ColumnName=>saledate,DataType=>date,encoding=>lzo
ColumnName=>description,DataType=>character varying(255),encoding=>lzo
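If the failure really is the client rejecting Redshift's internal name type, one hedged workaround to try is casting the system columns to varchar inside the query itself, reusing the cursor from above:
# Casting the name-typed columns (schemaname, tablename, "column") to varchar
# keeps the same information while avoiding the unsupported type on the client side.
cur.execute("""
    SELECT schemaname::varchar,
           tablename::varchar,
           "column"::varchar,
           type::varchar,
           encoding::varchar
    FROM pg_table_def
    WHERE tablename = 'sales'
""")
for schemaname, tablename, column, col_type, col_encoding in cur.fetchall():
    print("ColumnName=>" + column + ",DataType=>" + col_type + ",encoding=>" + col_encoding)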