Renaming a table & keeping connections to existing partitions in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
If I rename a table, do the existing partitions attached to that table remain as they are after the rename?

Yes.
yugabyte=# \dt
List of relations
Schema | Name | Type | Owner
--------+-----------------------+-------+----------
public | order_changes | table | yugabyte
public | order_changes_2019_02 | table | yugabyte
public | order_changes_2019_03 | table | yugabyte
public | order_changes_2020_11 | table | yugabyte
public | order_changes_2020_12 | table | yugabyte
public | order_changes_2021_01 | table | yugabyte
public | people | table | yugabyte
public | people1 | table | yugabyte
public | user_audit | table | yugabyte
public | user_credentials | table | yugabyte
public | user_profile | table | yugabyte
public | user_svc_account | table | yugabyte
(12 rows)
yugabyte=# alter table order_changes RENAME TO oc;
ALTER TABLE
yugabyte=# \dS+ oc
Table "public.oc"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------+------+-----------+----------+---------+----------+--------------+-------------
change_date | date | | | | plain | |
type | text | | | | extended | |
description | text | | | | extended | |
Partition key: RANGE (change_date)
Partitions: order_changes_2019_02 FOR VALUES FROM ('2019-02-01') TO ('2019-03-01'),
order_changes_2019_03 FOR VALUES FROM ('2019-03-01') TO ('2019-04-01'),
order_changes_2020_11 FOR VALUES FROM ('2020-11-01') TO ('2020-12-01'),
order_changes_2020_12 FOR VALUES FROM ('2020-12-01') TO ('2021-01-01'),
order_changes_2021_01 FOR VALUES FROM ('2021-01-01') TO ('2021-02-01')
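The attachment of the partitions to the renamed parent can also be confirmed from the catalog (a minimal sketch using the standard pg_inherits catalog):
SELECT inhrelid::regclass AS partition, inhparent::regclass AS parent
FROM pg_inherits
WHERE inhparent = 'oc'::regclass;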
Postgres, and therefore YugabyteDB, doesn't actually use the name of an object internally; it uses the object's OID (object ID).
That means you can rename an object without causing any harm, because the name is simply an entry in the catalog, and the object is identified by its OID.
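You can see this for yourself in the pg_class catalog (a minimal sketch, using the table from the example above); the OID stays the same across the rename, only relname changes:
SELECT oid, relname FROM pg_class WHERE relname = 'order_changes';  -- note the oid
ALTER TABLE order_changes RENAME TO oc;
SELECT oid, relname FROM pg_class WHERE relname = 'oc';             -- same oid, new relname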
This has other side effects as well: if you create a table, run a statement like 'select count(*) from table', drop the table, create a new table with the same name, and run the exact same statement, you will get two records in pg_stat_statements with identical SQL text. This seems odd from the perspective of databases where the SQL area is shared; in Postgres, only pg_stat_statements is shared, and there is no SQL cache.
pg_stat_statements does not identify a statement by its SQL text; it works from the query tree (an internal representation of the SQL) and symbolizes the tree, which makes it appear like SQL again. The query tree uses OIDs, and therefore, for pg_stat_statements, the two identical SQL texts above are different query trees, because the OIDs of the tables are different.
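As an illustrative sketch (assuming the pg_stat_statements extension is installed, and a hypothetical table t was dropped and recreated between the two runs), the same text shows up twice with different query IDs:
SELECT queryid, query, calls
FROM pg_stat_statements
WHERE query LIKE 'select count(*) from t%';
-- two rows with identical query text but different queryid values, one per incarnation of t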

Related

Efficiently update rows of a postgres table from another table in another database based on a condition in a common column

I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same as the ones in the column key of df1; it's only a renaming of the primary key column of df1).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that row[2] and row[3] in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key), because their create_date values have been modified (by the given users).
I would like to update rows (i.e. in concrete terms; create_date and update_date values) of df1 where update_date in df2 is more recent than its original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine

connector = psycopg2.connect(**database_parameters_dictionary)
engine = create_engine('postgresql+psycopg2://', creator=connector)

df1.update(df2)  # 1) maybe there is something better to do here?

with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace",  # 2) maybe there is also something better to do here?
        index=True
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use .update() method on df1 using df2 as an argument, together with
2) replacing the whole table inside the .to_sql() method, which means "drop+recreate".
As the tables are really large (more than 500'000 entries), I have the feeling that this will need a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate some custom SQL queries to compare the dates for each row and only take the ones that have really changed? But here again, I have the intuition that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do that? (It would have been easier in pure SQL if the two tables were on the same host/database, but it's unfortunately not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
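For illustration, each statement you would build and issue could have roughly this shape (a sketch only; the table name is the one passed to .to_sql() above, and the values come from one of the changed rows in df2):
-- issue all of these against database A in a single transaction
UPDATE database_table_name
SET create_date = '1992-12-24',
    update_date = '2020-10-11'
WHERE key = 57249;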
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would update all rows with any differing fields, not just when updated_date > prev updated_date. But it's a moot point if updated_date in df2 is only ever more recent than those in df1.

Order by in materialized view doesn't sort the results

I have a table with a structure like this:
CREATE TABLE kaefko.se_vi_f55dfeebae00d2b3 (
    value text PRIMARY KEY,
    id text,
    popularity bigint);
With data that looks like this:
value | id | popularity
--------+------------------+------------
rally | 4eff16cb91f96cd6 | 2
reddit | 11aa39686ed66ba5 | 3
red | 552d7e95af481415 | 1
really | 756bfa499965863c | 1
right | c5850c6b08f7966b | 1
redis | 7f1d251f399442d7 | 1
And I've created a materialized view that should sort these values by the popularity from the biggest to the smallest ones:
CREATE MATERIALIZED VIEW kaefko.se_vi_f55dfeebae00d2b3_by_popularity AS
    SELECT *
    FROM kaefko.se_vi_f55dfeebae00d2b3
    WHERE popularity IS NOT null
    PRIMARY KEY (value, popularity)
    WITH CLUSTERING ORDER BY (popularity DESC);
But the data in the materialized view looks like this:
value | popularity | id
--------+------------+------------------
rally | 2 | 4eff16cb91f96cd6
reddit | 3 | 11aa39686ed66ba5
really | 1 | 756bfa499965863c
right | 1 | c5850c6b08f7966b
redis | 1 | 7f1d251f399442d7
As you can see there are two main issues:
Data is not sorted as defined in the materialized view
There is just a part of all data in the materialized view
I'm not very experienced in Cassandra and I've already spent hours trying to find the reason why this happens, to no avail. Could somebody please help me? Thank you <3
I'm using ScyllaDB 4.1.9-0 and cqlsh shows this:
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Alex's comment is 100% correct, the order is within the partition.
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
This means that the ordering by popularity is descending only among rows where the 'value' field is the same. If I were to alter the data you used, as an example of what this looks like, you would get the following:
value | popularity | id
--------+------------+------------------
rally | 3 | 4eff16cb91f96cd6
rally | 2 | 11aa39686ed66ba5
really | 3 | 756bfa499965863c
really | 2 | c5850c6b08f7966b
really | 1 | 7f1d251f399442d7
The order is on a per partition key basis, not globally ordered.
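To make that concrete, here is a small sketch against the materialized view defined above: the descending order by popularity is only guaranteed when the query is restricted to a single partition.
SELECT value, popularity, id
FROM kaefko.se_vi_f55dfeebae00d2b3_by_popularity
WHERE value = 'rally';
-- rows of the single partition 'rally' come back ordered by popularity DESC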

follower/following in cassandra

We are designing a Twitter-like follower/following feature in Cassandra, and found something similar here: https://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376/13-Data_Model_simplified_13
so I think ItemLike is a table?
itemid1=>(userid1, userid2...) is a row in the table?
what do you think is the create table of this ItemLike table?
Yes, ItemLike is a table.
The schema of the ItemLike table will be like:
CREATE TABLE itemlike(
    itemid bigint,
    userid bigint,
    timeuuid timeuuid,
    PRIMARY KEY(itemid, userid)
);
The picture in the slide shows the internal structure of the above table.
Let's insert some data:
itemid | userid | timeuuid
--------+--------+--------------------------------------
2 | 100 | f172e3c0-67a6-11e7-8e08-371a840aa4bb
2 | 103 | eaf31240-67a6-11e7-8e08-371a840aa4bb
1 | 100 | d92f7e90-67a6-11e7-8e08-371a840aa4bb
Internally, Cassandra will store the data like below:
|---+--------------------------------------+--------------------------------------|
|   | 100:timeuuid                         | 103:timeuuid                         |
|   +--------------------------------------+--------------------------------------|
| 2 | f172e3c0-67a6-11e7-8e08-371a840aa4bb | eaf31240-67a6-11e7-8e08-371a840aa4bb |
|---+--------------------------------------+--------------------------------------|

|---+--------------------------------------|
|   | 100:timeuuid                         |
|   +--------------------------------------|
| 1 | d92f7e90-67a6-11e7-8e08-371a840aa4bb |
|---+--------------------------------------|
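As a usage sketch against the table above, all likes for one item live in a single partition, so fetching them is a single-partition read:
SELECT userid, timeuuid FROM itemlike WHERE itemid = 2;
-- returns userid 100 and 103 for item 2, in clustering (userid) order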

Waterline.js Joins/Populate with existing database

I have an existing postgres database which I am using to build a sails.js driven website, utilising waterline for ORM.
I'm fine with using my database in its existing form for everything other than population i.e. joining tables.
When working with a non-production database I'm comfortable with how waterline can produce join tables for me, but I'm really unsure how to bypass this to work with the current tables I have and their foreign key relationships. To give an idea of the types of tables I would typically have I've shown an example below:
| Intel | |
|-------------|--------|
| Column | Type |
| id | int PK |
| alliance_id | int FK |
| planet_id | int FK |
| dist | int |
| bg | string |
| amps | int |
| Alliance | |
|----------|--------|
| Column | Type |
| id | int PK |
| name | string |
| score | int |
| value | int |
| size | int |
| Planet | |
|-----------|--------|
| Column | Type |
| id | int PK |
| rulerName | string |
| score | int |
| value | int |
| size | int |
So in the above tables I would typically be able to join Intel --> Alliance and Intel --> Planet and access the data across each of these.
What would I need in my waterline model of Intel, Alliance, Planet to access this easily?
I'd love to do a:
Intel.find({alliance.name= 'test'})
or
Intel.find().populate('planet')
and then somehow be able to access intel.planet.score or intel.alliance.name etc
Thanks for any help. I can add more information if required just let me know in the comments.
First create models for all of your database tables, as mentioned here.
You can then populate the models and return join results.

Cassandra: Searching for NULL values

I have a table MACRecord in Cassandra as follows :
CREATE TABLE has.macrecord (
    macadd text PRIMARY KEY,
    position int,
    record int,
    rssi1 float,
    rssi2 float,
    rssi3 float,
    rssi4 float,
    rssi5 float,
    timestamp timestamp
)
I have 5 different nodes, each updating a row based on its title, i.e. node 1 just updates rssi1, node 2 just updates rssi2, etc. This evidently creates null values for the other columns.
I cannot seem to find a query which will give me only those rows which are not null. Specifically, I have referred to this post.
I want to be able to query, for example, like SELECT * FROM MACRecord WHERE RSSI1 != NULL as in MySQL. However, it seems both null values and comparison operators such as != are not supported in CQL.
Is there an alternative to putting NULL values or a special flag? I am inserting floats, so unlike strings I cannot insert something like ''. What is a possible workaround for this problem?
Edit :
My data model in MYSQL was like this :
+-----------+--------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+-------------------+-----------------------------+
| MACAdd | varchar(17) | YES | UNI | NULL | |
| Timestamp | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| Record | smallint(6) | YES | | NULL | |
| RSSI1 | decimal(5,2) | YES | | NULL | |
| RSSI2 | decimal(5,2) | YES | | NULL | |
| RSSI3 | decimal(5,2) | YES | | NULL | |
| RSSI4 | decimal(5,2) | YES | | NULL | |
| RSSI5 | decimal(5,2) | YES | | NULL | |
| Position | smallint(6) | YES | | NULL | |
+-----------+--------------+------+-----+-------------------+-----------------------------+
Each node (1-5) was querying from MySQL based on its number, for example node 1: "SELECT * FROM MACRecord WHERE RSSI1 is not NULL"
I updated my data model in cassandra as follows so that rssi1-rssi5 are now VARCHAR types.
CREATE TABLE has.macrecord (
    macadd text PRIMARY KEY,
    position int,
    record int,
    rssi1 text,
    rssi2 text,
    rssi3 text,
    rssi4 text,
    rssi5 text,
    timestamp timestamp
)
I was thinking that each node would initially insert the string 'NULL' for a record, and when actual rssi data arrives it would just replace the 'NULL' string. This would avoid having tombstones, and it would more or less make clear to the user that those values are not valid pieces of data, since they are flagged 'NULL'.
However, I am still puzzled as to how I will retrieve results like I did in MySQL. There is no != operator in Cassandra. How can I write a query which will give me a result set, for example, like "SELECT * FROM HAS.MACRecord WHERE RSSI1 != 'NULL'"?
You can only select rows in CQL based on the PRIMARY KEY fields, which by definition cannot be null. This also applies to secondary indexes. So I don't think Cassandra will be able to do the filtering you want on the data fields. You could select on some other criteria and then write your client to ignore rows that had null values.
Or you could create a different table for each rssiX value, so that none of them would be null.
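A sketch of that per-column layout (the table name macrecord_rssi1 is hypothetical) could look like this, so each node only ever writes non-null values to its own table:
CREATE TABLE has.macrecord_rssi1 (
    macadd text PRIMARY KEY,
    rssi1 float,
    timestamp timestamp
);
-- node 1 writes only to this table, so every row it reads back has a real rssi1 value
SELECT macadd, rssi1 FROM has.macrecord_rssi1;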
If you are only interested in some kind of aggregation, then the null values are treated as zero. So you could do something like this:
SELECT sum(rssi1) FROM has.macrecord WHERE macadd='someadd';
The sum() function is available in Cassandra 2.2.
You might also be able to do some kind of trick with a user defined function/aggregate, but I think it would be simpler to have multiple tables.
