ksqlDB collect_list and collect_set do not support multiple columns, structs, or maps

I am currently building a POC for my company on Confluent Cloud. The current version of ksqlDB does not yet support collect_list/collect_set on multiple columns, a struct, or a map, so I am trying to think of a workaround.
I am consuming SQL CDC streams and trying to compose a nested object model with parent-child relationships, without having to build a self-hosted Java UDF or Kafka Streams app.
The streams demo_games, demo_players and demo_teams should in the end yield the following model, written to a Kafka topic:
{
  teamId: bigint,
  teamName: string,
  teamPlayers: [
    {
      playerid: bigint,
      playername: string
    }
  ],
  teamGames: [
    {
      gameid: bigint,
      gamename: string
    }
  ]
}
Let's start with a smidge of code to illustrate what I am trying to achieve.
CREATE STREAM DEMO_GAMES( GAMEID BIGINT KEY, TEAMID BIGINT, GAMENAME STRING )
WITH (KAFKA_TOPIC='DEMO.GAMES',VALUE_FORMAT='JSON', PARTITIONS=1);
INSERT INTO DEMO_GAMES( GAMEID, TEAMID, GAMENAME) VALUES (1,1,'SUNDAY');
INSERT INTO DEMO_GAMES( GAMEID, TEAMID, GAMENAME) VALUES (2,1,'MONDAY');
INSERT INTO DEMO_GAMES( GAMEID, TEAMID, GAMENAME) VALUES (3,1,'FRIDAY');
CREATE STREAM DEMO_PLAYERS( PLAYERID BIGINT KEY, TEAMID BIGINT, PLAYERNAME STRING )
WITH (KAFKA_TOPIC='DEMO.PLAYERS',VALUE_FORMAT='JSON', PARTITIONS=1);
INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (1,1,'PLAYER 1');
INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (2,1,'PLAYER 2');
INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (3,1,'PLAYER 3');
INSERT INTO DEMO_PLAYERS( PLAYERID, TEAMID, PLAYERNAME) VALUES (4,1,'PLAYER 4');
CREATE STREAM DEMO_TEAMS( TEAMID BIGINT KEY,TEAMNAME STRING )
WITH (KAFKA_TOPIC='DEMO.TEAMS',VALUE_FORMAT='JSON', PARTITIONS=1);
INSERT INTO DEMO_TEAMS( TEAMID, TEAMNAME) VALUES (1,'THE TEAM');
-- create a few persistent queries...
create stream demo_team_players as
  select
    teamid,
    playerid,
    struct(playerid := playerid,
           playername := playername) `model`
  from DEMO_PLAYERS
  emit changes;

create stream demo_team_games as
  select
    teamid,
    gameid,
    struct(gameid := gameid,
           gamename := gamename) `model`
  from DEMO_GAMES
  emit changes;
The two persistent queries above wrap the data I want to include in collect_list into a struct. So now I can execute the following query.
select teamid, transform(collect_list( cast(`model` as string)), t=>t) as teamplayers from DEMO_TEAM_GAMES group by teamid emit changes;
This yields:
{
  "TEAMID": 1,
  "TEAMPLAYERS": [
    "Struct{GAMEID=1,GAMENAME=SUNDAY}",
    "Struct{GAMEID=2,GAMENAME=MONDAY}",
    "Struct{GAMEID=3,GAMENAME=FRIDAY}"
  ]
}
My question is this: is there a way to take a "serialized" STRUCT string and convert it back to a STRUCT within the TRANSFORM lambda?
I also tried building a dynamic JSON string as a parameter to the collect_list function and then building a struct back up with EXTRACT_JSON_FIELD. This seems very brittle, and to get the JSON string I am forced to cast all values to string.
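For illustration, a rough sketch of that JSON-string approach (not my exact query; the hand-built escaping is exactly what makes it brittle):
-- Rough sketch only: hand-build a JSON string per row, collect the strings,
-- then pull individual fields back out with EXTRACT_JSON_FIELD.
select teamid,
       transform(
         collect_list(
           concat('{"playerid":', cast(playerid as string),
                  ',"playername":"', playername, '"}')
         ),
         s => extract_json_field(s, '$.playername')
       ) as playernames
from DEMO_PLAYERS
group by teamid
emit changes;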

I am also doing a POC for my current company and ran into the same issue you describe here. I think it's a critical feature for ksqlDB. There is a pull request for this feature in the ksql repository that has already been merged, so you can wait for the next release or build the image yourself.
https://github.com/confluentinc/ksql/pull/8877
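Once a release containing that change is available, the original goal should presumably reduce to something like the following (a sketch based on the merged PR, not verified against a released version):
-- Collect whole structs per team directly; assumes COLLECT_LIST accepts STRUCT arguments.
select teamid,
       collect_list(struct(playerid := playerid, playername := playername)) as teamplayers
from DEMO_PLAYERS
group by teamid
emit changes;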

I eventually realized grouping in KSQL is not the same as grouping in T-SQL. If you somehow managed to group the relationships, you would end up with ALL related data, not just the most recent, because you can only group on a stream. But I did manage to whip up this gem: you basically force AS_MAP to serialize by casting it to a string, strip the surrounding braces, and then pass the result to SPLIT_TO_MAP.
set 'auto.offset.reset' = 'earliest';

create table EVENTTABLE
WITH (KAFKA_TOPIC='EventsGroupedByTeamTable', VALUE_FORMAT='AVRO', PARTITIONS=1)
as
select HOMETEAMID,
       ARRAY_LENGTH(COLLECT_SET(HOMETEAMID)) AS COUNT,
       TRANSFORM(
         COLLECT_SET(
           CAST(AS_MAP(
             ARRAY['AWAYTEAMNAME', 'HOMETEAMNAME'],
             ARRAY[CAST(AWAYTEAMNAME AS STRING), CAST(HOMETEAMNAME AS STRING)]
           ) AS STRING)
         ),
         P => SPLIT_TO_MAP(REPLACE(REPLACE(P, '{', ''), '}', ''), ',', '=')
       ) AS TEAMSCHEDULES
from EVENTSTREAM
GROUP BY HOMETEAMID
emit changes;

Related

How do I post data from req.body into a CQL UDT column using the Node.js driver?

I am new to Cassandra and need your help.
After creating a collection table using the CQL console, I am able to create new records and read them, but the POST operation using cassandra-driver in Node.js is not working; it only works when I use the CQL console.
I created this table:
CREATE TYPE event_info (
type text,
pagePath text,
ts text,
actionName text
);
CREATE TABLE journey_info_5 (
id uuid PRIMARY KEY,
user_id text,
session_start_ts timestamp,
event FROZEN<event_info>
);
Code for the POST operation:
export const pushEvent = async (req, res) => {
  const pushEventQuery = `INSERT INTO user_journey.userjourney (id, user_id, session_start_ts, events)
    VALUES ( ${types.TimeUuid.now()}, ${req.body.user_id}, ${types.TimeUuid.now()},
    { ${req.body.type}, ${req.body.pagePath}, ${req.body.ts}, ${req.body.actionName} } )`;
  try {
    await client.execute(pushEventQuery);
    res.status(201).json("new record added successfully");
  } catch (error) {
    res.status(404).send({ message: error });
    console.log(error);
  }
}
It is giving errors. How can I get data from the user and post it into this collection?
Please help me if you have any ideas.
The issue is that your CQL statement is invalid. The format for inserting values in a user-defined type (UDT) column is:
{ fieldname1: 'value1', fieldname2: 'value2', ... }
Note that the column names in your schema don't match up with the CQL statement in your code so I'm reposting the schema here for clarity:
CREATE TYPE community.event_info (
type text,
pagepath text,
ts text,
actionname text
)
CREATE TABLE community.journey_info_5 (
id uuid PRIMARY KEY,
event frozen<event_info>,
session_start_ts timestamp,
user_id text
)
Here's the CQL statement I used to insert a UDT into the table (formatted for readability):
INSERT INTO journey_info_5 (id, user_id, session_start_ts, event)
VALUES (
now(),
'thierry',
totimestamp(now()),
{
type: 'type1',
pagePath: 'pagePath1',
ts: 'ts1',
actionName: 'actionName1'
}
);
For reference, see Inserting or updating data into a UDT column. Cheers!

How to insert new rows to a junction table Postgres

I have a many-to-many relationship set up between services and service_categories. Each has a table, and there is a third table to handle the relationship (a junction table) called service_service_categories. I have created them like this:
CREATE TABLE services(
service_id SERIAL,
name VARCHAR(255),
summary VARCHAR(255),
profileImage VARCHAR(255),
userAgeGroup VARCHAR(255),
userType TEXT,
additionalNeeds TEXT[],
experience TEXT,
location POINT,
price NUMERIC,
PRIMARY KEY (service_id),
UNIQUE (name)
);
CREATE TABLE service_categories(
service_category_id SERIAL,
name TEXT,
description VARCHAR(255),
PRIMARY KEY (service_category_id),
UNIQUE (name)
);
CREATE TABLE service_service_categories(
service_id INT NOT NULL,
service_category_id INT NOT NULL,
PRIMARY KEY (service_id, service_category_id),
FOREIGN KEY (service_id) REFERENCES services(service_id) ON UPDATE CASCADE,
FOREIGN KEY (service_category_id) REFERENCES service_categories(service_category_id) ON UPDATE CASCADE
);
Now, in my application I would like to add a service_category to a service (from a select list, for example) at the same time as I create or update a service. In my Node.js code I have this post route set up:
// Create a service
router.post('/', async( req, res) => {
try {
console.log(req.body);
const { name, summary } = req.body;
const newService = await pool.query(
'INSERT INTO services(name,summary) VALUES($1,$2) RETURNING *',
[name, summary]
);
res.json(newService);
} catch (err) {
console.log(err.message);
}
})
How should I change this code to also add a row to the service_service_categories table, given that the new service has not been created yet and so has no serial ID?
If anyone could talk me through the approach for this I would be grateful.
Thanks.
You can do this in the database by adding a trigger on the services table that fires on row insert and inserts a row into service_service_categories. The "NEW" keyword in the trigger function represents the row that was just inserted, so you can access its serial ID value.
https://www.postgresqltutorial.com/postgresql-triggers/
Something like this:
CREATE TRIGGER insert_new_service_trigger
AFTER INSERT
ON services
FOR EACH ROW
EXECUTE PROCEDURE insert_new_service();
Then your trigger function looks something like this (noting that the trigger function needs to be created before the trigger itself):
CREATE OR REPLACE FUNCTION insert_new_service()
RETURNS TRIGGER
LANGUAGE PLPGSQL
AS
$$
BEGIN
-- check to see if service_id has been created
IF NEW.service_id NOT IN (SELECT service_id FROM service_service_categories) THEN
INSERT INTO service_service_categories(service_id)
VALUES(NEW.service_id);
END IF;
RETURN NEW;
END;
$$;
However, in your example data structure it doesn't seem like there's a good way to link the service_categories.service_category_id serial value to this new row, so you may need to change it a bit to accommodate that.
I managed to get it working, to a point, with multiple inserts and by changing the schema a bit on the services table. In the services table I added a column, category_id INT:
ALTER TABLE services
ADD COLUMN category_id INT;
Then in my node query I did this and it worked:
const newService = await pool.query(
`
with ins1 AS
(
INSERT INTO services (name,summary,category_id)
VALUES ($1,$2,$3) RETURNING service_id, category_id
),
ins2 AS
(
INSERT INTO service_service_categories (service_id,service_category_id) SELECT service_id, category_id FROM ins1
)
select * from ins1
`,
[name, summary, category_id]
);
Ideally I want to have multiple categories, so the category_id column on the services table would become category_ids INT[], an array of ids.
How would I turn the second insert into a foreach over the integers in the array, so that it creates a new service_service_categories row for each id in the array?
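One possible sketch (my addition for illustration, assuming the category ids are bound as an integer array parameter such as $3::int[]) would be to drop the extra column on services and cross join the returning CTE with unnest():
-- Sketch only: $1 = name, $2 = summary, $3 = int[] of category ids.
WITH ins1 AS (
  INSERT INTO services (name, summary)
  VALUES ($1, $2)
  RETURNING service_id
)
INSERT INTO service_service_categories (service_id, service_category_id)
SELECT ins1.service_id, cats.category_id
FROM ins1
CROSS JOIN unnest($3::int[]) AS cats(category_id);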

Proper Sequelize flow to avoid duplicate rows?

I am using Sequelize in my Node.js server. I am ending up with validation errors because my code tries to write the record twice, instead of creating it once and then updating it once it already exists in the DB (PostgreSQL).
This is the flow I use when the request runs:
const latitude = req.body.latitude;
var metrics = await models.user_car_metrics.findOne({ where: { user_id: userId, car_id: carId } })
if (metrics) {
metrics.latitude = latitude;
.....
} else {
metrics = models.user_car_metrics.build({
user_id: userId,
car_id: carId,
latitude: latitude
....
});
}
var savedMetrics = await metrics.save();
return res.status(201).json(savedMetrics);
At times, if the client calls the endpoint twice or more in quick succession, the code above tries to save two new rows in user_car_metrics with the same user_id and car_id (both FKs to the user and car tables).
I have a constraint:
ALTER TABLE user_car_metrics DROP CONSTRAINT IF EXISTS user_id_car_id_unique, ADD CONSTRAINT user_id_car_id_unique UNIQUE (car_id, user_id);
Point is, there can only be one entry for a given user_id and car_id pair.
Because of that, I started seeing validation issues. After looking into it and adding logs, I realized the code above adds duplicates to the table (without the constraint). If the constraint is there, I get validation errors when the code tries to insert the duplicate record.
The question is: how do I avoid this problem? How do I structure the code so that it won't try to create duplicate records? Is there a way to serialize this?
If you have a unique constraint then you can use upsert to either insert or update the record depending on whether you have a record with the same primary key value or column values that are in the unique constraint.
await models.user_car_metrics.upsert({
user_id: userId,
car_id: carId,
latitude: latitude
....
})
See upsert
PostgreSQL - Implemented with ON CONFLICT DO UPDATE. If update data contains PK field, then PK is selected as the default conflict key. Otherwise, first unique constraint/index will be selected, which can satisfy conflict key requirements.
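Conceptually, on PostgreSQL that upsert corresponds to a single INSERT ... ON CONFLICT DO UPDATE statement. A hand-written equivalent for this table might look roughly like the following (illustrative only, not necessarily the exact SQL Sequelize emits):
-- Assumes the UNIQUE (car_id, user_id) constraint from the question exists.
INSERT INTO user_car_metrics (user_id, car_id, latitude)
VALUES ($1, $2, $3)
ON CONFLICT (car_id, user_id)
DO UPDATE SET latitude = EXCLUDED.latitude;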

How to structure nested arrays with postgresql

I'm making a simple multiplayer game using Postgres as a database (and Node as the BE, if that helps). I made the table users, which contains all of the user accounts, and a table equipped, which contains all of the equipped items a user has. users has a one-to-many relationship with equipped.
I'm running into the situation where I need the data from both tables structured like so:
[
  {
    user_id: 1,
    user_data...
    equipped: [
      { user_id: 1, item_data... },
      ...
    ],
  },
  {
    user_id: 2,
    user_data...
    equipped: [
      { user_id: 2, item_data... },
      ...
    ],
  },
]
Is there a way to get this data in a single query? Is it a good idea to get it in a single query?
EDIT: Here are my schemas:
CREATE TABLE IF NOT EXISTS users (
user_id SERIAL PRIMARY KEY,
username VARCHAR(100) UNIQUE NOT NULL,
password VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
created_on TIMESTAMP NOT NULL DEFAULT NOW(),
last_login TIMESTAMP,
authenticated BOOLEAN NOT NULL DEFAULT FALSE,
reset_password_hash UUID
);
CREATE TABLE IF NOT EXISTS equipment (
equipment_id SERIAL PRIMARY KEY NOT NULL,
inventory_id INTEGER NOT NULL REFERENCES inventory (inventory_id) ON DELETE CASCADE,
user_id INTEGER NOT NULL REFERENCES users (user_id) ON DELETE CASCADE,
slot equipment_slot NOT NULL,
created_on TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT only_one_item_per_slot UNIQUE (user_id, slot)
);
Okay, so what I was looking for was PostgreSQL's JSON aggregation, but I didn't know what to search for.
Based on my very limited SQL experience, the "classic" way to handle this would just be to do a simple JOIN query on the database, like so:
SELECT users.username, equipment.slot, equipment.inventory_id
FROM users
LEFT JOIN equipment ON users.user_id = equipment.user_id;
This is nice and simple, but I would need to merge these rows into the nested structure on my server before sending them off.
Thankfully, Postgres lets you aggregate rows into a JSON array, which is exactly what I needed (thanks @j-spratt). My final* query looks like:
SELECT users.username,
json_agg(json_build_object('slot', equipment.slot, 'inventory_id', equipment.inventory_id))
FROM users
LEFT JOIN equipment ON users.user_id = equipment.user_id
GROUP BY users.username;
Which returns in exactly the format I was looking for.
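A common refinement of the same json_agg approach (my addition here, not in the original query) is to alias the aggregate as equipped and guard against users with no equipment rows, which would otherwise come back as [null] from the LEFT JOIN:
-- Sketch: same join and aggregation, with a named column and an empty-array fallback.
SELECT users.user_id,
       users.username,
       COALESCE(
         json_agg(
           json_build_object('slot', equipment.slot, 'inventory_id', equipment.inventory_id)
         ) FILTER (WHERE equipment.equipment_id IS NOT NULL),
         '[]'::json
       ) AS equipped
FROM users
LEFT JOIN equipment ON users.user_id = equipment.user_id
GROUP BY users.user_id, users.username;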

BigQuery UDF - Accessing leaf value of a RECORD nested on 2 levels

I'm trying to pass a nested RECORD to my passthrough UDF, which performs some actions on logMessage and then returns a string. However, I'm unable to find the correct leaf that contains the logMessage. I couldn't find an example that deals with multiple levels of nesting. Do I need to do something else with the nested record to be able to access the logMessage string two levels deep? I suspect the answer must be pretty straightforward, but since my query executes and just returns "null" for each record (probably because I'm emitting a nonexistent leaf or missing some logic), I don't really know how to debug this.
Data schema:
[{"name":"proto","mode":"repeated","type":"RECORD",
"fields":
[
{"name":"line","mode":"repeated","type":"RECORD",
"fields":
[
{"name": "logMessage","type": "STRING"}
]
}
]
}]
Here's my SQL:
SELECT
url
FROM (passthrough(
SELECT
proto.line.logMessage
FROM
[mydata]
))
My UDF (I'm emitting the value right back at the moment - returns "null" for each record):
function passthrough(row, emit) {
emit({url: row.proto.line.logMessage});
}
bigquery.defineFunction(
'passthrough',
['proto.line.logMessage'],
[{'name': 'url', 'type': 'string'}],
passthrough
);
You're using repeated records, and repeated fields are represented as arrays in JS. So you probably need something like this:
function passthrough(row, emit) {
emit({url: row.proto[0].line[0].logMessage});
}
If you want to debug your UDF outside BigQuery, try using this test tool:
http://storage.googleapis.com/bigquery-udf-test-tool/testtool.html
You can generate input data for your UDF that matches the exact structure of your data by clicking on the "Preview" button in the BQ web UI and then clicking on "JSON" to get a copy-pastable JSON representation of your data.
I think the example below pretty much resembles your case:
SELECT body
FROM JS(
( // input table
SELECT payload.comment.body
FROM [publicdata:samples.github_nested]
WHERE actor = 'shiftkey'
AND payload.comment.body IS NOT NULL
),
payload.comment.body, // input columns
"[ // output schema
{'name': 'body', 'type': 'STRING'
}
]",
"function(row, emit) { // function
emit({body: row.payload.comment.body});
}"
)
