How to extract JSON data from Zeppelin SQL - apache-spark

I am querying the test_tbl table on Zeppelin.
The table structure looks like this:
%sql
desc stg.test_tbl
col_name | data_type | comment
id | string |
title | string |
tags | string |
The tags column holds JSON stored as a string, for example:
{"name":[{"family": null, "first": "nelson"}, {"pos_code":{"house":"tlv", "id":"A12YR"}}]}
I want to expose the JSON data as columns, so my query is:
select *, tag.*
from stg.test_tbl as t
lateral view explode(t.tags.name) name as name
lateral view explode(name.pos_code) pos_code as pos_code
but when I run the query, it returns:
Can't extract value from tags#3423: need struct type but got string; line 3 pos 21
set zeppelin.spark.sql.stacktrace = true to see full stacktrace
Should I query it as a string in the WHERE clause?

Answered it myself: since the column holds JSON as a string, I can use get_json_object on it.
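For example, a minimal sketch of pulling fields out of the tags string with get_json_object (the path expressions are assumptions based on the sample JSON above):
%sql
select get_json_object(t.tags, '$.name[0].first') as first_name,
       get_json_object(t.tags, '$.name[1].pos_code.house') as house
from stg.test_tbl as t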
Also, if the JSON contains an array, like the one below:
{"name":[{"family": null, "first": "nelson"}, {"pos_code":{"house":"tlv", "id":"A12YR"}}]}
Then I can query using the key:
select * from stg.test_tbl as t
where t.pos_code[0].house = "tlv"

Related

How to run CQL in Zeppelin by taking input in user input format?

I was trying to run a CQL query that takes user input through dynamic forms in the Zeppelin tool:
%cassandra
SELECT ${Select Fields Type=uuid ,uuid | created_by | email_verify| username} FROM
${Select Table=keyspace.table_name}
${WHERE email_verify="true" } ${ORDER BY='updated_date' }LIMIT ${limit = 10};
While running this query I was getting this error:
line 4:0 mismatched input 'true' expecting EOF
(SELECT uuid FROM keyspace.table_name ["true"]...)
You need to move WHERE and ORDER BY out of the dynamic form declaration.
The input field declaration looks like this: ${field_name=default_value}. In your case, instead of a WHERE clause, you've declared a field named WHERE email_verify.
It should be as follows (not tested):
%cassandra
SELECT ${Select Fields Type=uuid ,uuid | created_by | email_verify| username} FROM
${Select Table=keyspace.table_name}
WHERE ${where_cond=email_verify='true'} ORDER BY ${order_by='updated_date'} LIMIT ${limit = 10};
Update:
Here is a working example for a table with the following structure:
CREATE TABLE test.scala_test2 (
id int,
c int,
t text,
tm timestamp,
PRIMARY KEY (id, c)
) WITH CLUSTERING ORDER BY (c ASC)
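A sketch of how the dynamic-form paragraph might look against this table (the field names, defaults, and the WHERE/ORDER BY values here are illustrative assumptions, not taken from the original answer):
%cassandra
SELECT ${Select Fields=id,id|c|t|tm} FROM test.scala_test2
WHERE ${where_cond=id = 1} ORDER BY ${order_by=c DESC} LIMIT ${limit = 10};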

Is it possible to create a PERSISTED column that's made up of an array of specific JSON values and if so how?

Is it possible to create a PERSISTED column that's made up of an array of specific JSON values and if so how?
Simple Example (json column named data):
{ name: "Jerry", age: 91, mother: "Janet", father: "Eustace" }
Persisted Column Hopeful (assuming json column is called 'data'):
ALTER TABLE tablename ADD parents [ data::$mother, data::$father ] AS PERSISTED JSON;
Expected Output
| data (json) | parents (persisted json) |
| -------------------------------------------------------------- | ------------------------- |
| { name: "Jerry", age: 91, mother: "Janet", father: "Eustace" } | [ "Janet", "Eustace" ] |
| { name: "Eustace", age: 106, mother: "Jane" } | [ "Jane" ] |
| { name: "Jim", age: 54, mother: "Rachael", father: "Dom" } | [ "Rachael", "Dom ] |
| -------------------------------------------------------------- | ------------------------- |
The above doesn't work, but hopefully it conveys what I'm trying to accomplish.
There is no PERSISTED ARRAY data type for columns, but there is a JSON column type that can store arrays.
For example:
-- The existing table
create table tablename (
id int primary key AUTO_INCREMENT
);
-- Add the new JSON column
ALTER TABLE tablename ADD column parents JSON;
-- Insert data into the table
INSERT INTO tablename (parents) VALUES
('[ "Janet", "Eustace" ]'),
('[ "Jane" ]');
-- Select table based on matches in the JSON column
select *
from tablename
where JSON_ARRAY_CONTAINS_STRING(parents, 'Jane');
-- Change data in the JSON column
update tablename
set parents = JSON_ARRAY_PUSH_STRING(parents, 'Jon')
where JSON_ARRAY_CONTAINS_STRING(parents, 'Jane');
-- Show changed data
select *
from tablename
where JSON_ARRAY_CONTAINS_STRING(parents, 'Jane');
Check out more examples of pushing and selecting JSON data in the docs at https://docs.memsql.com/v7.0/concepts/json-guide/
Here is a sample table definition where I do something similar with customer and event:
CREATE TABLE `eventsext2` (
`data` JSON COLLATE utf8_bin DEFAULT NULL,
`memsql_insert_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`customer` as data::$custID PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
`event` as data::$event PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
customerevent as concat(data::$custID,", ",data::$event) persisted text,
`generator` as data::$genID PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
`latitude` as (substr(data::$longlat from (instr(data::$longlat,'|')+1))) PERSISTED decimal(21,18),
`longitude` as (substr(data::$longlat from 1 for (instr(data::$longlat,'|')-1))) PERSISTED decimal(21,18),
`location` as concat('POINT(',latitude,' ',longitude,')') PERSISTED geographypoint,
KEY `memsql_insert_time` (`memsql_insert_time`)
/*!90618 , SHARD KEY () */
) /*!90623 AUTOSTATS_CARDINALITY_MODE=OFF, AUTOSTATS_HISTOGRAM_MODE=OFF */ /*!90623 SQL_MODE='STRICT_ALL_TABLES' */;
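Applying the same computed persisted-column pattern to the original question, something like the sketch below might work (untested; it assumes both the mother and father keys are present, since concat returns NULL if any argument is NULL):
-- Sketch (untested): build the parents array as a JSON string from data::$mother / data::$father
ALTER TABLE tablename ADD COLUMN parents AS
  concat('["', data::$mother, '","', data::$father, '"]') PERSISTED JSON;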
Though not your question, denormalizing this table into two tables might be a good choice:
create table parents (
id int primary key auto_increment,
tablenameid int not null,
name varchar(20),
type int not null -- 1=Father, 2=Mother, ideally a foreign key to another table
);
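A sketch of how that denormalized design could then be queried (assuming the main table keeps its id primary key):
-- find rows whose parents include 'Jane' via the child table
select t.id, p.name, p.type
from tablename t
join parents p on p.tablenameid = t.id
where p.name = 'Jane';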

Why does AWS Athena return the "string" datatype for all table fields on "show create table" or describe table commands?

Why does AWS Athena return the "string" data type for all of a table's fields in the output of "show create table" or when describing tables?
For example, the table t_mus_albums:
albumid (bigint)
title (string)
artistid (bigint)
When running
show create table t_mus_albums;
I get:
CREATE EXTERNAL TABLE `t_mus_albums`(
`albumid` string COMMENT 'from deserializer',
`title` string COMMENT 'from deserializer',
`artistid` string COMMENT 'from deserializer')
I think you might be doing something wrong, or the data may not have been correctly formatted when the table was generated automatically.
Here are the systematic steps to solve your problem.
Assume that your data is in the format below.
ID,Code,City,State
41,5,"Youngstown", OH
42,52,"Yankton", SD
46,35,"Yakima", WA
42,16,"Worcester", MA
43,37,"Wisconsin Dells", WI
36,5,"Winston-Salem", NC
Then your CREATE TABLE statement will look something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS example.tbl_datatype (
`id` int,
`code` int,
`city` string,
`state` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://example-bucket/location/a/'
TBLPROPERTIES ('has_encrypted_data'='false');
Then, run the query to describe the table:
SHOW CREATE TABLE tbl_datatype;
It will give you output something like this:
CREATE EXTERNAL TABLE `tbl_datatype`(
`id` int,
`code` int,
`city` string,
`state` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example-bucket/location/a/';
Hope it helps!
This is because you are using the CSV SerDe and not, for example, the text SerDe.
The CSV SerDe supports only the string data type, so all columns end up with that type.
From https://docs.aws.amazon.com/athena/latest/ug/csv.html
The OpenCSV SerDe [...] Converts all column type values to STRING.
The documentation outlines some conditions under which the table schema could be different than all strings ("For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them"), but apparently this was not effective in your case.
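If you stay with the OpenCSV SerDe, a common workaround is to cast in the query itself; a sketch against the album table from the question:
SELECT CAST(albumid AS BIGINT) AS albumid,
       title,
       CAST(artistid AS BIGINT) AS artistid
FROM t_mus_albums;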

How to cast varchar to MAP(VARCHAR,VARCHAR) in presto

I have a table in Presto with one column, named "mappings", that holds key-value pairs as a string:
select mappings from hello;
Ex: {"foo": "baar", "foo1": "bar1" }
I want to cast the "mappings" column into a MAP,
like select CAST("mappings" as MAP) from hello;
This throws an error in Presto. How can I translate it to a map?
There is no canonical string representation for a MAP in Presto, so there's no way to cast it directly to MAP(VARCHAR, VARCHAR). But, if your string contains a JSON map, you can use the json_parse function to convert the string into a value of JSON type and then convert that to a SQL MAP via a cast.
Example:
WITH
data(c) AS (
VALUES '{"foo": "baar", "foo1": "bar1"}'
),
parsed AS (
SELECT cast(json_parse(c) as map(varchar, varchar)) AS m
FROM data
)
SELECT m['foo'], m['foo1']
FROM parsed
produces:
_col0 | _col1
-------+-------
baar | bar1
select cast( json_parse(mappings) as MAP(VARCHAR,VARCHAR)) from hello1;

Using datastax driver, how can I read a non-primitive column as a JSON string?

I have several non-primitive columns in my cassandra tables.
Some of them are user-defined types (UDTs).
While querying them through the DataStax driver, I want to convert such UDTs into JSON values.
More specifically, I want to get a JSON string for the value object below:
Row row = itr.next();
ColumnDefinitions cds = row.getColumnDefinitions();
cds.asList().forEach((ColumnDefinitions.Definition cd) -> {
    String name = cd.getName();
    Object value = row.getObject(name);
});
I have gone through http://docs.datastax.com/en/developer/java-driver/3.1/manual/custom_codecs/
But I do not want to add a codec for every UDT I have.
Looking at Using Datastax Java Driver to query a row as a JSON, I tried this too:
row.getString(name)
But it gives me an error: Codec not found for requested operation: [set<date> <-> java.lang.String]
Can the driver somehow return me direct JSON without explicit meddling with codecs and all?
Use the toJson function
Starting with version 2.2 of Apache Cassandra™, the toJson function returns a JSON representation of an object.
Example :
CREATE TYPE fullname (
firstname text,
lastname text
);
CREATE TABLE users (
id uuid PRIMARY KEY,
direct_reports set<frozen<fullname>>,
name frozen<fullname>
);
Now you can select direct_reports and name as JSON:
SELECT id,toJson(direct_reports) as direct_reports,toJson(name) as name FROM users ;
Here I have aliased toJson(direct_reports) as direct_reports and toJson(name) as name, so you can use row.getString("direct_reports") and row.getString("name") to get the direct_reports JSON and name JSON.
Output :
id | 62c36092-82a1-3a00-93d1-46196ee77204
direct_reports | [{"firstname": "Naoko", "lastname": "Murai"}, {"firstname": "Sompom", "lastname": "Peh"}]
name | {"firstname": "Marie-Claude", "lastname": "Josset"}
