Extract array of arrays in Presto - struct

I have a table in Athena (Presto) with just one column, named individuals, and this is the type of the column:
array(row(individual_id varchar, ids array(row(type varchar, value varchar, score integer))))
I want to extract the values from inside ids and return them as a new array. For example, given:
[{individual_id=B1Q, ids=[{type=H, value=efd3, score=1}, {type=K, value=NpS, score=1}]}, {individual_id=D6n, ids=[{type=A, value=178, score=6}, {type=K, value=NuHV, score=8}]}]
and I want to return
ids
[efd3, NpS, 178, NuHV]
I tried multiple solutions like
select * from "test"
CROSS JOIN UNNEST(individuals.ids.value) AS t(i)
but they always return
Expression individuals is not of type ROW

select
array_agg(ids.value)
from test
cross join unnest(test.individuals) t(ind)
cross join unnest(ind.ids) t2(ids)
result:
[efd3, NpS, 178, NuHV]
That will return all the id values as one row, which may or may not be what you want.
If you want an array of values per individual_id instead:
select
ind.individual_id,
array_agg(ids.value)
from test
cross join unnest(test.individuals) t(ind)
cross join unnest(ind.ids) t2(ids)
group by
ind.individual_id
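If you'd rather avoid the joins entirely, the same flattening can be done with Presto's lambda functions; a minimal sketch, assuming your Athena engine supports transform and flatten (recent engine versions do):
select
flatten(transform(individuals, ind -> transform(ind.ids, i -> i.value))) as ids
from test
The inner transform pulls the value out of each ids row, producing an array of arrays, and flatten concatenates them into the single array you asked for.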

Related

How to make my identity column consecutive on delta table in Azure Databricks?

I am trying to create a delta table with a consecutive identity column. The goal is for our clients to see if there is some data they did not receive from us.
It looks like the generated identity column is not consecutive, which makes the "INCREMENT BY 1" quite misleading.
store_visitor_type_name = ["apple","peach","banana","mango","ananas"]
card_type_name = ["door","desk","light","coach","sink"]
store_visitor_type_desc = ["monday","tuesday","wednesday","thursday","friday"]
colnames = ["column2","column3","column4"]
data_frame = spark.createDataFrame(zip(store_visitor_type_name,card_type_name,store_visitor_type_desc),colnames)
data_frame.createOrReplaceTempView('vw_increment')
data_frame.display()
%sql
CREATE or REPLACE TABLE TEST(
`column1SK` BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
,`column2` STRING
,`column3` STRING
,`column4` STRING
,`inserted_timestamp` TIMESTAMP
,`modified_timestamp` TIMESTAMP
)
USING delta
LOCATION '/mnt/Marketing/Sales';
MERGE INTO TEST as target
USING vw_increment as source
ON target.`column2` = source.`column2`
WHEN MATCHED
AND (target.`column3` <> source.`column3`
OR target.`column4` <> source.`column4`)
THEN
UPDATE SET
`column3` = source.`column3`
,`column4` = source.`column4`
,`modified_timestamp` = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (
`column2`
,`column3`
,`column4`
,`modified_timestamp`
,`inserted_timestamp`
) VALUES (
source.`column2`
,source.`column3`
,source.`column4`
,current_timestamp()
,current_timestamp()
)
I'm getting the following results. You can see this is not sequential. What is also very confusing is that it does not start at 1, even though that is explicitly specified in the query.
I can see in the documentation (https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters) :
The automatically assigned values start with start and increment by
step. Assigned values are unique but are not guaranteed to be
contiguous. Both parameters are optional, and the default value is 1.
step cannot be 0.
Is there a workaround to make this identity column consecutive?
I guess I could add another column and do a ROW_NUMBER operation after the MERGE, but that looks expensive.
You can use PySpark to achieve the requirement instead of the row_number() function.
I read the TEST table as a Spark dataframe and converted it to a pandas-on-Spark dataframe. In the pandas dataframe I created a new index column using reset_index(), then converted it back to a Spark dataframe. Finally, I added 1 to the index column values, since the index starts at 0.
df = spark.sql("select * from test")
pdf = df.to_pandas_on_spark()
# create the new index column
pdf.reset_index(inplace=True)
final_df = pdf.to_spark()
# since the index starts from 0, add 1 to it
final_df.withColumn('index', final_df['index'] + 1).show()
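For comparison, the ROW_NUMBER approach mentioned in the question would look roughly like this sketch (assuming the TEST table above, and ordering by the gappy identity column to preserve insert order):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.sql("select * from TEST")
# renumber consecutively; the window has no partition, so numbering is global
df = df.withColumn("consecutive_id", F.row_number().over(Window.orderBy("column1SK")))
df.show()
Note that a global (unpartitioned) window forces all rows through a single task, which is why the question calls it expensive.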

Invalid Column In SQL Query

select
university_cars_video_kroenke.dbo.car_customer.cus_first,
university_cars_video_kroenke.dbo.car_customer.cus_last,
(
select COUNT(university_cars_video_kroenke.dbo.car_customer.cus_id)
from university_cars_video_kroenke.dbo.car_purchases
where university_cars_video_kroenke.dbo.car_customer.cus_id = university_cars_video_kroenke.dbo.car_purchases.cus_id
)
from university_cars_video_kroenke.dbo.car_customer
(edited for clarity)
select
customer.cus_first,
customer.cus_last,
(select
COUNT(customer.cus_id)
from purchases
where customer.cus_id = purchases.cus_id )
from customer
My error message is
Msg 8120, Level 16, State 1, Line 4 Column
'university_cars_video_kroenke.dbo.car_customer.cus_first'
is invalid in the select list because it is not contained
in either an aggregate function or the GROUP BY clause
I just want a count of records where the cus_id is the same in both tables.
Something like the following should work.
SELECT
A.cus_id,
COUNT(A.cus_id)
FROM
university_cars_video_kroenke.dbo.car_customer AS A
INNER JOIN university_cars_video_kroenke.dbo.car_purchases AS B
ON A.cus_id = B.cus_id
GROUP BY
A.cus_id
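Incidentally, the Msg 8120 error in your original query comes from putting the outer table's column (customer.cus_id) inside COUNT() in the subquery: an aggregate over an outer column turns the outer query into a grouped query, so cus_first and cus_last then need a GROUP BY. If you want the names as in your original query, a sketch that counts the purchases table instead avoids grouping altogether (purchase_count is just an illustrative alias):
SELECT
c.cus_first,
c.cus_last,
(SELECT COUNT(*)
FROM university_cars_video_kroenke.dbo.car_purchases AS p
WHERE p.cus_id = c.cus_id) AS purchase_count
FROM university_cars_video_kroenke.dbo.car_customer AS c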

AWS Athena working with nested arrays, trying to search for a field within the array

I have a sql query:
SELECT id_str, entities.hashtags
FROM tweets, unnest(entities.hashtags) as t(hashtag)
WHERE cardinality(entities.hashtags)=2 and id_str='1248585590573948928'
limit 5
which returns:
id_str hashtags
1248585590573948928 [{text=LUCAS, indices=[75, 81]}, {text=WayV, indices=[83, 88]}]
1248585590573948928 [{text=LUCAS, indices=[75, 81]}, {text=WayV, indices=[83, 88]}]
The unnesting has returned the row twice (it was originally one row) because there are 2 objects in this array.
The next part I wanted to add to the SQL query was
hashtag['text'] as htag
in the existing select, which should still return 2 rows, but this time with LUCAS and WayV in separate rows of the same column, named htag.
But I get this error - any idea what I am doing wrong?
Your query has the following error(s):
SYNTAX_ERROR: line 1:8: '[]' cannot be applied to row(text varchar,indices array(bigint)), varchar(4)
I assume it is because I have another array within this array.. ?
Thanks in advance
I'm not entirely sure where you're adding the hashtag['text'] expression, so I can't say with confidence what your problem is, but I have two suggestions for you to try:
The error says that hashtag is of type row(text varchar, …), which suggests that hashtag.text should work.
If that doesn't work, you can try using element_at e.g. element_at(hashtag, 'text').
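Applied to the query in the question, the first suggestion would look like this (a sketch against the same tweets table):
SELECT id_str, hashtag.text AS htag
FROM tweets, unnest(entities.hashtags) as t(hashtag)
WHERE cardinality(entities.hashtags)=2 and id_str='1248585590573948928'
limit 5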
I came across this issue as well, and since there is no solution provided I'd like to chip in:
After you unnest an array, you can address the result with a . reference instead of ['']:
WITH dataset AS (
SELECT ARRAY[
CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
] AS users
)
SELECT
user,
user.name
FROM dataset
cross join unnest (users) as t(user)

Cast some columns and select all columns without explicitly writing column names

I want to cast some columns and then select all others
id, name, property, description = column("id"), column("name"), column("property"), column("description")
select([cast(id, String).label('id'), cast(property, String).label('property'), name, description]).select_from(events_table)
Is there any way to cast some columns and select all of them without mentioning all the column names?
I tried
select([cast(id, String).label('id'), cast(property, String).label('property')], '*').select_from(events_table)
py_.transform(return_obj, lambda acc, element: acc.append(dict(element)), [])
But I get two extra columns (7 in total) for the casts, and I can't convert the rows to dictionaries; it throws a KeyError.
I'm using FastAPI, SQLAlchemy and databases (async).
Thanks
Pretty sure you can do
select_columns = []
for field in events_table.columns.keys():
    select_columns.append(getattr(events_table.c, field))

select(select_columns).select_from(events_table)
to select all fields from that table. You can also keep a list of the fields you actually want to select instead of events_table.columns.keys(), like
select_these = ["id", "name", "property", "description"]
select_columns = []
for field in select_these:
    select_columns.append(getattr(events_table.c, field))

select(select_columns).select_from(events_table)
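To combine this with the casting from the question, a sketch that casts id and property and passes the remaining columns through untouched (cast_map is just an illustrative name; events_table is the table from the question):
from sqlalchemy import String, cast, select

# columns to cast, keyed by column name
cast_map = {"id": String, "property": String}

select_columns = [
    cast(col, cast_map[col.name]).label(col.name) if col.name in cast_map else col
    for col in events_table.c
]
select(select_columns).select_from(events_table)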

Oracle spatial data operator - SDO_nn - Not getting any results for sdo_num_res = 1

I am using SDO_NN operator to find the nearest hydrant next to a building.
Building:
CREATE TABLE "BUILDINGS"
(
"NAME" VARCHAR2(40),
"SHAPE" "SDO_GEOMETRY")
Hydrant:
CREATE TABLE "HYDRANTS"
( "NAME" VARCHAR2(10),
"POINT" "SDO_POINT_TYPE"
);
I have set up spatial indexes properly for buildings.shape, and I run this query to get the nearest hydrant to the building 'Motel':
select b1.name as name, h.point.x as x, h.point.y as y from buildings b1, hydrants h where b1.name ='Motel' and
SDO_nn( b1.shape, MDSYS.SDO_GEOMETRY(2003,NULL, NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),
SDO_ORDINATE_ARRAY( h.point.x,h.point.y)), 'sdo_num_res=1')= 'TRUE';
Here's the problem:
When I set the parameter sdo_num_res=1, I get zero tuples.
And when I set sdo_num_res=2, I get one tuple.
What is the reason for this weird behavior?
Note: I am getting zero rows only when building.name = 'Motel'; for all other tuples I get 1 row when sdo_num_res=1.
Edit:
Insert queries
Insert into buildings (NAME,SHAPE) values ('Motel',MDSYS.SDO_GEOMETRY(2003,NULL,NULL,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447)));
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
To perform spatial comparisons between a point and a polygon, the SDO_GEOMETRY is defined with SDO_GTYPE=2001 and the center set to the SDO_POINT_TYPE we want to compare:
MDSYS.SDO_GEOMETRY(2001, NULL, SDO_POINT_TYPE(-79, 37, NULL), NULL, NULL)
First of all, your query does not do what you say it does: it actually returns the nearest building called "Motel" from any of your hydrants. To do what you want (i.e. the opposite) you need to reverse the order of the arguments to SDO_NN: all spatial operators search the first argument, using the value of the second argument.
Then the insert into your HYDRANTS table is wrong:
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
The SDO_POINT_TYPE object is not designed to be used that way: it is only used inside the SDO_GEOMETRY type. The proper way is this:
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
And of course you need to change your table definition accordingly.
Then your building is also incorrectly created: a polygon must always close, i.e. the last point must be the same as the first point. So the proper shape should be like this:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
Here is the full example:
Create the tables:
create table buildings (
name varchar2(40) primary key,
shape sdo_geometry
);
create table hydrants(
name varchar2(10) primary key,
point sdo_geometry
);
Populate the tables:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
commit;
Confirm that the geometries are all correct:
select name, sdo_geom.validate_geometry_with_context (point, 0.05) from hydrants;
select name, sdo_geom.validate_geometry_with_context (shape, 0.05) from buildings;
Setup spatial metadata and create spatial indexes:
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'BUILDINGS',
'SHAPE',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index buildings_sx on buildings (shape)
indextype is mdsys.spatial_index;
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'HYDRANTS',
'POINT',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index hydrants_sx on hydrants (point)
indextype is mdsys.spatial_index;
Now try the properly written query:
select h.name, h.point.sdo_point.x as x, h.point.sdo_point.y as y
from buildings b, hydrants h
where b.name ='Motel'
and sdo_nn(h.point, b.shape, 'sdo_num_res=1')= 'TRUE';
which returns:
NAME X Y
---------------- ---------- ----------
p57 589 448
1 row selected.
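If you also want to know how far away the hydrant is, SDO_NN can be paired with the ancillary SDO_NN_DISTANCE operator; a sketch on the same tables (the trailing 1 in sdo_nn ties the call to sdo_nn_distance(1)):
select h.name,
h.point.sdo_point.x as x,
h.point.sdo_point.y as y,
sdo_nn_distance(1) as distance
from buildings b, hydrants h
where b.name ='Motel'
and sdo_nn(h.point, b.shape, 'sdo_num_res=1', 1) = 'TRUE';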
