This question is essentially the same as in this post, SQL Select only rows with Max Value on a Column, except in CQL. I'm working with Cassandra 3.10 so GROUP BY is supported, but HAVING and JOIN are not.
As in the linked question, I need to find, for each id, the row (including the content column) with max(rev). In fact, the actual problem I'm trying to solve is to find max(rev) grouping by two identifiers, id1 and id2, so ordering by id doesn't work here either.
+------+-------+-------+--------------------------------------+
| id1  | rev   | id2   | content                              |
+------+-------+-------+--------------------------------------+
| 1    | 1     | 1     | ...                                  |
| 1    | 2     | 1     | ...                                  |
| 2    | 1     | 2     | ...                                  |
| 1    | 3     | 3     | ...                                  |
+------+-------+-------+--------------------------------------+
The SQL solutions I had for this were:
SELECT id1, id2, rev, content FROM table
GROUP BY id1, id2 HAVING rev = MAX(rev);
And
SELECT id1, id2, rev, content FROM table
WHERE rev IN
(SELECT MAX(rev) FROM table GROUP BY id1, id2)
(The second works assuming rev is unique.)
Without HAVING or JOIN, what would be a viable approach in CQL on Cassandra 3.10?
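One viable approach, sketched here with the Python driver (the keyspace, table name, and column types are assumptions): model rev as a clustering column in descending order, so the max(rev) row is the first row of each (id1, id2) partition, and use PER PARTITION LIMIT 1, which is available since Cassandra 3.6.
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# (id1, id2) form the partition key; rev is a clustering column in
# descending order, so the newest revision is first in each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS revisions (
        id1 int,
        id2 int,
        rev int,
        content text,
        PRIMARY KEY ((id1, id2), rev)
    ) WITH CLUSTERING ORDER BY (rev DESC)
""")

# PER PARTITION LIMIT 1 keeps only the first row of each partition,
# i.e. the max(rev) row per (id1, id2). Note that without a WHERE
# clause this scans the whole table.
for row in session.execute(
        "SELECT id1, id2, rev, content FROM revisions PER PARTITION LIMIT 1"):
    print(row.id1, row.id2, row.rev, row.content)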
I have been trying to answer this question, given the following data:
+---------+---------+-----------+---------+
| Column1 | Column2 | Column3 | Column4 |
+---------+---------+-----------+---------+
| 1 | happy | 1-veggies | GHF |
| 1 | sad | 1-veggies | HGF |
| 2 | angry | 1-veggies | GHG |
| 2 | sad | 1-veggies | FGH |
| 3 | sad | 1-veggies | HGF |
| 4 | moody | 2-meat | FFF |
| 4 | sad | 2-meat | HGF |
| 5 | excited | 2-meat | HGF |
+---------+---------+-----------+---------+
OP was asking for a way of finding how many records matched 'sad' and '1-veggies' and also had another record with the same value in Column 1 and a code of GHF or FGH in Column 4. The first two rows qualify as such a pair, but the fourth row does not, because (if I understand correctly) it has the correct code but in the same record as the one matching 'sad' and '1-veggies'. The count should therefore be one.
I think the answer would have been fairly standard if this had been a SQL question - you would do a self-join with an equality on the first column and an inequality on the row number. In SQL it would look something like this:
create table Veggies
(
    num integer,
    emotion varchar(10),
    food varchar(10),
    code varchar(10),
    seq integer
);

insert into Veggies
values
(1, 'happy',   '1-veggies', 'GHF', 1),
(1, 'sad',     '1-veggies', 'HGF', 2),
(2, 'angry',   '1-veggies', 'GHG', 3),
(2, 'sad',     '1-veggies', 'FGH', 4),
(3, 'sad',     '1-veggies', 'HGF', 5),
(4, 'moody',   '2-meat',    'FFF', 6),
(4, 'sad',     '2-meat',    'HGF', 7),
(5, 'excited', '2-meat',    'HGF', 8);
with t1 (num,seq)
as
(
select num,seq
from veggies
where emotion='sad' and food='1-veggies'
),
t2 (num,seq)
as
(
select num,seq
from veggies
where code='GHF' or code='FGH'
)
select *
from t1 inner join t2 on t1.num=t2.num and t1.seq<>t2.seq
I thought it might be possible to do the same thing (join on first column equal but row number unequal) in Power Query, but I have worked through the steps of getting the two queries with row numbers and am stuck at the join step.
I don't see any way of expressing an inequality, and the documentation seems unhelpful. Does anyone have any inside knowledge on how to do this?
So although it looks as though you can't translate the SQL in the question directly into Power Query and replicate this in a single step
select *
from t1 inner join t2 on t1.num=t2.num and t1.seq<>t2.seq
you can split it into two steps, as suggested by @Ron Rosenfeld.
To recap, the initial steps which hopefully were fairly straightforward were:
Establish a connection to the data as Table 1
Add an index column
Duplicate the table and call it Table 2
Filter Table 1 by 'sad' and '1-veggies'
Filter Table 2 by 'GHF' or 'FGH'
Now join Table 2 to Table 1 using an inner join on Column 1, and then exclude the rows that were in Table 1 using a left anti join on the index column.
This leaves one row as required.
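Not Power Query, but as a sanity check of the logic, the same two-step shape (inner join on Column1, then exclude self-matches on the index column) can be sketched in pandas, with the added index column playing the role of seq from the SQL version:
import pandas as pd

# Sample data from the question; reset_index() adds the row-number column.
data = pd.DataFrame({
    "Column1": [1, 1, 2, 2, 3, 4, 4, 5],
    "Column2": ["happy", "sad", "angry", "sad", "sad", "moody", "sad", "excited"],
    "Column3": ["1-veggies"] * 5 + ["2-meat"] * 3,
    "Column4": ["GHF", "HGF", "GHG", "FGH", "HGF", "FFF", "HGF", "HGF"],
}).reset_index()

t1 = data[(data["Column2"] == "sad") & (data["Column3"] == "1-veggies")]
t2 = data[data["Column4"].isin(["GHF", "FGH"])]

# Inner join on Column1, then drop self-matches on the index column.
joined = t1.merge(t2, on="Column1", suffixes=("_1", "_2"))
result = joined[joined["index_1"] != joined["index_2"]]
print(len(result))  # 1, as required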
I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same as those in the column key in df1; it's only a renaming of df1's primary key column).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-11-21 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that row[2] and row[3] in df2 have more recent update_date values (2020-10-11 and 2019-11-21 respectively) than their counterparts in df1 (where id = key), because their create_date values have been modified (by the given users).
I would like to update the rows of df1 (concretely, the create_date and update_date values) where update_date in df2 is more recent than the original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine
def connect():
    # create_engine's `creator` argument expects a callable that
    # returns a new DBAPI connection, not a connection object itself
    return psycopg2.connect(**database_parameters_dictionary)

engine = create_engine('postgresql+psycopg2://', creator=connect)
df1.update(df2) # 1) maybe there is something better to do here?
with engine.connect() as connection:
df1.to_sql(
name="database_table_name",
con=connection,
schema="public",
if_exists="replace", # 2) maybe there is also something better to do here?
index=True
)
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists : {'fail', 'replace', 'append'}, default 'fail'
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, together with
2) replacing the whole table inside the .to_sql() method, which means "drop + recreate".
As the tables are really large (more than 500,000 entries), I have the feeling that this does a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate custom SQL queries that compare the dates for each row and take only the ones that have really changed? Here again, I suspect that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do this? (It would have been easier in pure SQL if the two tables were on the same host/database, but unfortunately that's not the case.)
Pandas can't do partial updates of a table, no. There is a long-standing open issue about supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
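A minimal sketch of that idea, assuming key is unique, the date columns compare cleanly (ISO strings or datetimes), and the target table is the database_table_name used in the question's .to_sql() call:
from sqlalchemy import create_engine, text

# Engine for database A (credentials from the question).
engine_a = create_engine(
    "postgresql+psycopg2://user_a:secret_a@hostname_a:5432/database_a"
)

# Align df2's primary key name with df1's, then keep only the rows
# whose update_date is strictly newer than the current df1 value.
merged = df2.rename(columns={"id": "key"}).merge(
    df1[["key", "update_date"]], on="key", suffixes=("", "_old")
)
changed = merged[merged["update_date"] > merged["update_date_old"]]

# Issue one UPDATE per changed row, all inside a single transaction.
with engine_a.begin() as conn:
    for row in changed.itertuples(index=False):
        conn.execute(
            text(
                "UPDATE database_table_name "
                "SET create_date = :cd, update_date = :ud "
                "WHERE key = :pk"
            ),
            {"cd": row.create_date, "ud": row.update_date, "pk": row.key},
        )
Since key is the primary key, each UPDATE is a cheap indexed point write, so the cost is proportional to the number of changed rows rather than to the 500,000-row table.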
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would update all rows with any differing fields, not just those where update_date in df2 is more recent. But it's a moot point if update_date in df2 is only ever more recent than in df1.
Say I have the following spark dataframe:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 2 | 1 |
| 3 | 1 |
| 4 | NULL |
| 5 | 4 |
| 6 | NULL |
| 7 | 6 |
| 8 | 3 |
This dataframe represents a tree structure consisting of several disjoint trees. Now, say that we have a list of nodes [8, 7], and we want to get a dataframe containing just the nodes that are the roots of the trees containing the nodes in the list. The output would look like:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 6 | NULL |
What would be the best (fastest) way to do this with spark queries and pyspark?
If I were doing this in plain SQL I would just do something like this:
DECLARE @num int;

CREATE TABLE #Tmp (
    Node_id int,
    Parent_id int
);

-- seed with the child nodes of interest
INSERT INTO #Tmp
SELECT Node_id, Parent_id FROM Nodes WHERE Node_id IN (8, 7);

SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL;

WHILE @num > 0
BEGIN
    -- climb one level: add the parent of every non-root row
    INSERT INTO #Tmp
    SELECT p.Node_id, p.Parent_id
    FROM #Tmp t
    INNER JOIN Nodes p ON t.Parent_id = p.Node_id;

    -- drop the rows that have just been replaced by their parents
    DELETE FROM #Tmp
    WHERE Parent_id IN (SELECT Node_id FROM #Tmp);

    SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL;
END

SELECT Node_id FROM #Tmp WHERE Parent_id IS NULL;
Just wanted to know if there's a more spark-centric way of doing this using pyspark, beyond the obvious method of simply looping over the dataframe using python.
parent_nodes = spark.sql("select Parent_id from table_name where Node_id in (8, 7)").distinct()
You can join the above dataframe with the table to get the rows for those parent nodes as well, and repeat the process until only root nodes remain.
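A PySpark sketch of that repeated join (table and column names follow the question; the loop runs in the driver, but each step is a distributed join):
from pyspark.sql import functions as F

nodes = spark.table("table_name")  # the Node_id / Parent_id dataframe

# Start from the nodes of interest.
current = nodes.filter(F.col("Node_id").isin(8, 7))

# Climb one level per iteration until every remaining row is a root.
while current.filter(F.col("Parent_id").isNotNull()).count() > 0:
    parents = (
        current.filter(F.col("Parent_id").isNotNull())
        .select(F.col("Parent_id").alias("Node_id"))
        .join(nodes, on="Node_id")
    )
    roots = current.filter(F.col("Parent_id").isNull())
    current = roots.unionByName(parents).distinct()

current.show()  # rows for Node_id 1 and 6
The number of passes is bounded by the depth of the deepest tree, not by the number of nodes, which should be far cheaper than looping over the dataframe row by row in Python.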
I want to be able to combine two columns from a table into one column, and then to be able to get the actual values of the foreign keys. I can do these things individually, but not together.
Following the answer below I was able to combine the two columns into one using the first sql statement below.
How to combine 2 columns into a new one in sqlite
The combining process is shown below:
+---+---+
|HT | AT|
+---+---+
|1 | 2 |
|5 | 7 |
|9 | 5 |
+---+---+
into one column as shown:
+---+
|HT |
+---+
| 1 |
| 5 |
| 9 |
| 2 |
| 7 |
| 5 |
+---+
The second SQL statement shows the actual value of each foreign key corresponding to each foreign key id. The foreign key table:
+-----+------------------------+
|T_id | TN |
+-----+------------------------+
| 1   | 'Dallas Cowboys'       |
| 2 | 'Chicago Bears' |
| 5 | 'New England Patriots' |
| 7 | 'New York Giants' |
| 9 | 'New York Jets' |
+-----+------------------------+
sql = "SELECT * FROM (SELECT M.HT FROM M UNION SELECT M.AT FROM Match)t"
The second sql statement lets me get the foreign key values for each value in M.HT.
sql = "SELECT M.HT, T.TN FROM M INNER JOIN T ON M.HT = T.Tid WHERE strftime('%Y-%m-%d', M.ST) BETWEEN \'2015-08-01\' AND \'2016-06-30\' AND M.Comp = 6 ORDER BY M.ST"
Result of second SQL statement:
+-----+------------------------+
| HT | TN |
+-----+------------------------+
| 1   | 'Dallas Cowboys'       |
| 5 | 'New England Patriots' |
| 9 | 'New York Jets' |
+-----+------------------------+
But try as I might I have not been able to combine these queries!
I believe the following will work (assuming that the tables are Match and T, and barring the WHERE and ORDER BY clauses for brevity/ease) :-
SELECT DISTINCT(m.ht), t.tn
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses using Match as m only has columns ht and at */
WHERE strftime('%Y-%m-%d', Match.ST)
BETWEEN '2015-08-01' AND '2016-06-30' AND Match.Comp = 6
ORDER BY Match.ST
;
Note only tested without the WHERE and ORDER BY clause.
That is using :-
DROP TABLE IF EXISTS Match;
DROP TABLE IF EXISTS T;
CREATE TABLE IF NOT EXISTS Match (ht INTEGER, at INTEGER, st TEXT DEFAULT (datetime('now')));
CREATE TABLE IF NOT EXISTS t (tid INTEGER PRIMARY KEY, tn TEXT);
INSERT INTO T (tn) VALUES('Cows'),('Bears'),('a'),('b'),('Pats'),('c'),('Giants'),('d'),('Jets');
INSERT INTO Match (ht,at) VALUES (1,2),(5,7),(9,5);
/* Directly without the Common Table Expression */
SELECT
DISTINCT(m.ht), t.tn,
Match.st /*<<<<< Added to show results of obtaining other values from Matches >>>>> */
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses here using Match */
;
Noting that limited data (just the one extra column) was used for brevity
Results in the five distinct team ids (1, 2, 5, 7, 9), each paired with its team name and the st value.
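For completeness, a sketch of running the combined query from Python's sqlite3 module with parameter binding instead of escaped literals (the database file name is hypothetical; the ST and Comp columns come from the question's real schema rather than the cut-down test tables):
import sqlite3

conn = sqlite3.connect("football.db")  # hypothetical database file

sql = """
SELECT DISTINCT m.ht, t.tn
FROM (SELECT Match.HT AS ht FROM Match
      UNION
      SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
WHERE strftime('%Y-%m-%d', Match.ST) BETWEEN ? AND ?
  AND Match.Comp = ?
ORDER BY Match.ST
"""

for ht, tn in conn.execute(sql, ("2015-08-01", "2016-06-30", 6)):
    print(ht, tn)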
I'm selecting data from a Cassandra database using a query. It works fine, but how do I get the data back in the same order as the ids I gave in the IN clause?
I have created a table with this data:
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
I am trying to select data using:
SELECT *
FROM malleshdmy
WHERE id IN ( 11,10,5)
But it returns the data in the same order as it is stored:
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
Please help me with this issue. I want the data in the order 11, 10, 5.
If id is the partition key, then it's impossible - data is sorted only by the clustering columns within a partition, and rows for different partition keys can be returned in arbitrary order (though sorted inside each partition).
You need to sort the data yourself on the client side.
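For example, a minimal client-side reorder with the Python driver (cluster/session setup assumed):
wanted = [11, 10, 5]

rows = session.execute(
    "SELECT id, n, p, q FROM malleshdmy WHERE id IN (11, 10, 5)"
)

# Reorder on the client, by position in the wanted list.
for row in sorted(rows, key=lambda r: wanted.index(r.id)):
    print(row.id, row.n, row.p, row.q)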
Since id is your partition key, your data is actually being sorted by the token of id, not the values themselves:
cqlsh:testid> SELECT id,n,p,q,token(id) FROM malleshdmy;
id | n | p | q | system.token(id)
----+---+---+------+----------------------
5 | 1 | 2 | 4 | -7509452495886106294
10 | 2 | 4 | 3 | -6715243485458697746
11 | 1 | 2 | null | -4156302194539278891
Because of this, you don't have any control over how the partition key is sorted.
In order to sort your data by id, you need to make id a clustering column rather than a partition key. Your data will still need a partition key, however, and this will always be sorted by token.
If you decide to make id a clustering column, you will need to specify that you want descending order in the table's clustering order clause:
CREATE TABLE clusterTable (
    partition type,   //partition key, with a type to be specified
    id INT,
    n INT,
    p INT,
    q INT,
    PRIMARY KEY ((partition), id)
) WITH CLUSTERING ORDER BY (id DESC);
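With that layout, a query within a single partition can then ask for the order explicitly (a hypothetical example via the Python driver, assuming an int partition column):
# The ORDER BY is technically redundant given the clustering order,
# but it makes the intent explicit.
rows = session.execute(
    "SELECT id, n, p, q FROM clusterTable "
    "WHERE partition = %s ORDER BY id DESC",
    (1,),
)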
This link is very helpful in discussing how ordering works in Cassandra: https://www.datastax.com/dev/blog/we-shall-have-order