Python SQLite 3 query multiple query - python-3.x

I'm having trouble building a query for Python SQLite3 that does the following:
Count a word that appears in a column; if the word appears more than once for the same ID, count it only once.
I've attached a picture to illustrate my table format.
I tried this, but the result still counts duplicate values with the same ID:
"SELECT id, value, count(value) FROM table WHERE type like '%hi%' GROUP BY value ORDER BY COUNT(*)<1 DESC"
The result needs to be like:

Hi, everything you need can be achieved with a GROUP BY clause.
This should help:
SELECT
id
,value
,1 AS cnt
FROM table
GROUP BY id, value
ORDER BY id

What you're looking for is the DISTINCT clause, or GROUP BY as mentioned by Peter.
For GROUP BY, use this syntax:
SELECT
id
,value
,1 AS cnt
FROM table
GROUP BY id, value
For DISTINCT, use this one:
SELECT DISTINCT
id
,value
,1 AS cnt
FROM table
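To try both ideas from Python, here is a minimal sketch using the standard sqlite3 module. The original picture is not available, so the table name `words` and the sample rows are assumptions made up for illustration. It runs the DISTINCT query from the answers and, as one possible reading of "count the word once per ID", a COUNT(DISTINCT id) per value.

```python
import sqlite3

# Hypothetical sample data: the table name "words" and these rows are
# assumptions, since the picture from the question is not available.
ROWS = [
    (1, "hi"), (1, "hi"), (1, "hello"),
    (2, "hi"),
    (3, "hello"), (3, "hello"),
    (4, "hi"), (4, "hi"),
]


def distinct_pairs(conn):
    """Return each (id, value) pair once, with a constant count of 1."""
    return conn.execute(
        "SELECT DISTINCT id, value, 1 AS cnt FROM words ORDER BY id, value"
    ).fetchall()


def ids_per_value(conn):
    """Count how many different ids contain each value."""
    return conn.execute(
        "SELECT value, COUNT(DISTINCT id) AS cnt "
        "FROM words GROUP BY value ORDER BY value"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?)", ROWS)

pairs = distinct_pairs(conn)
totals = ids_per_value(conn)
print(pairs)
print(totals)
```

With these rows, "hi" is stored five times but is counted for three ids only, because repeats within the same id collapse.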

Related

Getting records based on latest date [duplicate]

I'm quite new to SQL and I'm trying to filter the latest date record (DateTime column) for each unique ID present in the table.
Sample data: there are 2 unique IDs (16512) and (76513).
DateTime                    | ID    | Notes
----------------------------+-------+--------------------------
2021-03-26T10:39:54.9770238 | 16512 | Still a work in Progress
2021-04-29T12:46:12.8277807 | 16512 | Still working on it
2021-03-21T10:39:54.9770238 | 76513 | Still a work in Progress
2021-04-20T12:46:12.8277800 | 76513 | Still working on project
Desired result (get last row of each ID based on the DateTime column):
DateTime                    | ID    | Notes
----------------------------+-------+--------------------------
2021-04-29T12:46:12.8277807 | 16512 | Still working on it
2021-04-20T12:46:12.8277800 | 76513 | Still working on project
My query:
SELECT MAX(DateTime), ID
FROM Table1
GROUP BY DateTime, ID
Thanks in advance for your help.
SELECT max(DateTime), ID
FROM Table1
GROUP BY ID
You can use row_number here:
with d as (
    select *, row_number() over (partition by ID order by DateTime desc) as rn
    from Table1
)
select DateTime, ID, Notes
from d
where rn = 1;
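As a quick check of the row_number approach, here is a small sketch that loads the question's sample rows into an in-memory SQLite database from Python. SQLite is only a stand-in for whatever engine you actually use, and it assumes SQLite 3.25 or newer (the first version with window functions). The column types are my assumption.

```python
import sqlite3

# Rows copied from the question's sample data. Storing DateTime as ISO-8601
# text is an assumption; such strings sort chronologically as plain text.
ROWS = [
    ("2021-03-26T10:39:54.9770238", 16512, "Still a work in Progress"),
    ("2021-04-29T12:46:12.8277807", 16512, "Still working on it"),
    ("2021-03-21T10:39:54.9770238", 76513, "Still a work in Progress"),
    ("2021-04-20T12:46:12.8277800", 76513, "Still working on project"),
]

# Number the rows of each ID from newest to oldest, then keep row 1.
LATEST_PER_ID = """
WITH d AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DateTime DESC) AS rn
    FROM Table1
)
SELECT DateTime, ID, Notes
FROM d
WHERE rn = 1
ORDER BY ID
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (DateTime TEXT, ID INTEGER, Notes TEXT)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", ROWS)

latest = conn.execute(LATEST_PER_ID).fetchall()
for row in latest:
    print(row)
```

Unlike the plain MAX(DateTime) ... GROUP BY ID query, this keeps the Notes value that belongs to the latest row.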
You didn't state a particular database, but if you are using Postgres you can use its DISTINCT ON. It is often the fastest solution when the groups are not too big (in your case, a group is the set of tasks that share the same id).
Here's an example. Please note I've excluded your notes column for brevity but it will work if you include it and will give you the output you desire above.
create temporary table tasks (
id int,
created_at date
);
insert into tasks(id, created_at) values
(16512, '2021-03-26'),
(16512, '2021-04-29'),
(76513, '2021-03-21'),
(76513, '2021-04-20')
;
select
distinct on (id)
id,
created_at
from tasks
order by id, created_at desc
/*
id | created_at
-------+------------
16512 | 2021-04-29
76513 | 2021-04-20
*/
The row_number approach mentioned above is one way to solve your problem. You tagged databricks in your question, so let me show you another option that you can implement in Spark SQL, using the last function from the pool of aggregate functions.
According to the Spark documentation:
last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
Note that:
The function is non-deterministic because its results depends on the order of the rows which may be non-deterministic after a shuffle.
In your example:
%sql
WITH cte AS (
SELECT *
FROM my_table
ORDER BY DateTime asc
)
SELECT Id, last(DateTime) AS DateTime, last(Notes) as Notes
FROM cte
GROUP BY Id
Similarly, you can use first function to obtain the first record in a sorted dataset.
Check if that works for you.

Correct way to get the last value for a field in Apache Spark or Databricks Using SQL (Correct behavior of last and last_value)?

What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like it should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
person,
last(team)
from
(select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
select
patient_id,
org_patient_id,
last_value(data_lot) data_lot
from
(select * from my_table order by data_lot)
where 1=1
and org = 'my_org'
group by 1,2
order by 1,2
)
where data_lot in ('2021-01','2021-02')
;
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown when I try to run this:
select
patient_id,
last_value(collect_set(data_lot)) data_lot
from
covid.demo
group by patient_id
;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
The posts shown below discuss how to get max values, which is not the same as the last item in a list ordered by a different field. I want the last team a player joined: the player may have joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, and I'm looking for the Yankees. These posts also get to the solution procedurally using Python/R; I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
row_number() over (order by provided_date, data_lot) as row_num,
demo.*
from demo
You can assign row numbers based on an ordering on data_lot if you want to get its last value:
select count(distinct patient_id) from (
select * from (
select *,
row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
from my_table
where org = 'my_org'
)
where rn = 1
)
where data_lot in ('2021-01','2021-02');
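The same pattern applied to the team example from the question can be sketched as follows. SQLite (3.25 or newer) is used here only as a stand-in for Spark SQL, and the person names and join dates are invented for illustration; only the table name person_team and the team names come from the question.

```python
import sqlite3

# Hypothetical rows: names and dates are assumptions made for this sketch.
JOINS = [
    ("alice", "Reds", "2019-01-01"),
    ("alice", "A's", "2019-06-01"),
    ("alice", "Zebras", "2020-02-01"),
    ("alice", "Yankees", "2021-03-01"),
    ("bob", "Zebras", "2020-05-01"),
    ("bob", "Reds", "2020-09-01"),
]

# The ordering lives inside the window definition, so the result does not
# depend on how the rows happen to be stored or scanned.
LAST_TEAM = """
SELECT person, team
FROM (
    SELECT person, team,
           ROW_NUMBER() OVER (
               PARTITION BY person ORDER BY date_joined DESC
           ) AS rn
    FROM person_team
) AS ranked
WHERE rn = 1
ORDER BY person
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_team (person TEXT, team TEXT, date_joined TEXT)")
# Insert in reverse order on purpose: storage order should not matter.
conn.executemany("INSERT INTO person_team VALUES (?, ?, ?)", reversed(JOINS))

last_team = dict(conn.execute(LAST_TEAM).fetchall())
print(last_team)
```

Here alice ends up with the Yankees rather than the alphabetical maximum (Zebras), which is the difference between "last by date" and "max value" raised in the question.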

WHERE variable = ( subquery ) in OpenSQL

I'm trying to retrieve rows from a table where a subquery matches a variable. However, it seems as if the WHERE clause only lets me compare fields of the selected tables against a constant, variable or subquery.
I would expect to write something like this:
DATA(lv_expected_lines) = 5.
SELECT partner contract_account
INTO TABLE lt_bp_ca
FROM table1 AS tab1
WHERE lv_expected_lines = (
SELECT COUNT(*)
FROM table2
WHERE partner = tab1~partner
AND contract_account = tab1~contract_account ).
But obviously this select treats my local variable as a field name and it gives me the error "Unknown column name "lv_expected_lines" until runtime, you cannot specify a field list."
But in standard SQL this is perfectly possible:
SELECT PARTNER, CONTRACT_ACCOUNT
FROM TABLE1 AS TAB1
WHERE 5 = (
SELECT COUNT(*)
FROM TABLE2
WHERE PARTNER = TAB1.PARTNER
AND CONTRACT_ACCOUNT = TAB1.CONTRACT_ACCOUNT );
So how can I replicate this logic in RSQL / Open SQL?
If there's no way I'll probably just write native SQL and be done with it.
The program below might lead you to an Open SQL solution. It uses the SAP demo tables to determine the plane types that are used on a specific number of flights.
REPORT zgertest_sub_query.
DATA: lt_planetypes TYPE STANDARD TABLE OF s_planetpp.
PARAMETERS: p_numf TYPE i DEFAULT 62.
START-OF-SELECTION.
SELECT planetype
INTO TABLE lt_planetypes
FROM sflight
GROUP BY planetype
HAVING COUNT( * ) EQ p_numf.
LOOP AT lt_planetypes INTO DATA(planetype).
WRITE: / planetype.
ENDLOOP.
It only works if you don't need to read fields from TAB1. If you do you will have to gather these with other selects while looping at your results.
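To compare the two query shapes outside of ABAP, here is a rough sketch in Python with sqlite3. It is not Open SQL. The table and column names follow the question, while the `item` column and all rows are hypothetical. It runs the correlated subquery from the "standard SQL" version next to the GROUP BY ... HAVING form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (partner TEXT, contract_account TEXT);
CREATE TABLE table2 (partner TEXT, contract_account TEXT, item INTEGER);
""")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("P1", "C1"), ("P2", "C2"), ("P3", "C3")],
)
# Hypothetical data: P1/C1 has 5 rows in table2, P2/C2 has 2, P3/C3 has none.
rows2 = [("P1", "C1", i) for i in range(5)] + [("P2", "C2", i) for i in range(2)]
conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)", rows2)


def by_correlated_subquery(conn, expected_lines):
    """Compare a bound value against a correlated COUNT(*) subquery."""
    return conn.execute(
        """
        SELECT tab1.partner, tab1.contract_account
        FROM table1 AS tab1
        WHERE ? = (
            SELECT COUNT(*)
            FROM table2 AS tab2
            WHERE tab2.partner = tab1.partner
              AND tab2.contract_account = tab1.contract_account
        )
        ORDER BY tab1.partner
        """,
        (expected_lines,),
    ).fetchall()


def by_group_having(conn, expected_lines):
    """Group table2 by the key columns and filter on the group size."""
    return conn.execute(
        """
        SELECT partner, contract_account
        FROM table2
        GROUP BY partner, contract_account
        HAVING COUNT(*) = ?
        ORDER BY partner
        """,
        (expected_lines,),
    ).fetchall()


print(by_correlated_subquery(conn, 5), by_group_having(conn, 5))
print(by_correlated_subquery(conn, 0), by_group_having(conn, 0))
```

For a count of 5 both forms return the same key pair. For a count of 0 only the correlated form returns P3/C3, because the HAVING form never sees keys that have no rows in table2.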
For those dudes who found this question in 2020, I can report that this construction has been supported since ABAP 7.50. No workarounds are needed:
SELECT kunnr, vkorg
FROM vbak AS v
WHERE 5 = ( SELECT COUNT(*)
FROM vbap
WHERE kunnr = v~kunnr
AND vkorg = v~vkorg )
INTO TABLE @DATA(customers).
This selects all customers who made 5 sales orders within some sales organization.
In ABAP there is no way to write the query as in native SQL.
I would advise against using native SQL; instead, give the SELECT/ENDSELECT statement a try.
DATA: ls_table1 TYPE table1,
      lt_table1 TYPE TABLE OF table1,
      lv_count  TYPE i.
SELECT partner contract_account
  INTO CORRESPONDING FIELDS OF ls_table1
  FROM table1.
  " Count the matching rows in table2 for the current table1 row
  SELECT COUNT(*)
    INTO lv_count
    FROM table2
    WHERE partner = ls_table1-partner
      AND contract_account = ls_table1-contract_account.
  " Skip to the next row unless exactly 5 rows matched
  CHECK lv_count EQ 5.
  APPEND ls_table1 TO lt_table1.
ENDSELECT.
Here you append to lt_table1 only those rows for which the count in table2 equals 5.
Hope it helps.

Cassandra CQL: Filter the rows between a range of values

The structure of my column family is something like
CREATE TABLE product (
id UUID PRIMARY KEY,
product_name text,
product_code text,
status text,//in stock, out of stock
mfg_date timestamp,
exp_date timestamp
);
Secondary Index is created on status, mfg_date, product_code and exp_date fields.
I want to select the list of products whose status is IS (In Stock) and the manufactured date is between timestamp xxxx to xxxx.
So I tried the following query.
SELECT * FROM product where status='IS' and mfg_date>= xxxxxxxxx and mfg_date<= xxxxxxxxxx LIMIT 50 ALLOW FILTERING;
It throws an error like: No indexed columns present in by-columns clause with "equals" operator.
Is there anything I need to change in the structure? Please help me out. Thanks in Advance.
Cassandra does not support >=, so you have to adjust the values and use only > (greater than) and < (less than) when executing the query.
You should have at least one "equals" operator on one of the indexed or primary key column fields in your where clause, i.e. "mfg_date = xxxxx"

how to join two or more tables and result set having all distinct values

I have some 20 Excel files containing data. All the tables have the same columns (id, name, age, location, etc.). Each file has distinct data, but I don't know whether data in one file is repeated in another file. So I want to join all the files, and the result set should contain distinct values. Please help me out with this problem as soon as possible. I want the result set to be stored in an Access database.
I would recommend either linking the sheets in Access or importing the sheets as tables.
Then, from there, use a DISTINCT select over the tables/sheets to determine the required keys, and select only the records you need.
In SQL, you can use JOIN or NATURAL JOIN to join tables. I would look into NATURAL JOIN since you said all tables have the same values.
After that you can use DISTINCT to get distinct values.
I'm not sure if this is what you're looking for though: your question asks about excel but you've tagged it with SQL.
If you can use all the tables in one query, you can use a union to get the distinct rows:
select id, name, age, location from Table1
union
select id, name, age, location from Table2
union
select id, name, age, location from Table3
union
...
You can insert the records directly from the result:
insert into ResultTable
select id, name, age, location from Table1
union
....
If you can only select from one table at a time, you can skip the insert of rows that are already in the table:
insert into ResultTable
select t.id, t.name, t.age, t.location from Table1 as t
left join ResultTable as r on r.id = t.id
where r.id is null
(Assuming that id is a unique field identifying the record.)
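Both variants can be tried out with a small Python sketch. It uses an in-memory SQLite database as a stand-in for Access, with only two source tables and made-up rows; the table and column names follow the queries above.

```python
import sqlite3

COLUMNS = "id, name, age, location"

conn = sqlite3.connect(":memory:")
for table in ("Table1", "Table2", "ResultTable"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER, name TEXT, age INTEGER, location TEXT)"
    )

# Hypothetical rows; id 2 appears in both source tables.
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(1, "Ann", 30, "Oslo"), (2, "Ben", 41, "Rome")],
)
conn.executemany(
    "INSERT INTO Table2 VALUES (?, ?, ?, ?)",
    [(2, "Ben", 41, "Rome"), (3, "Cy", 25, "Lima")],
)

# Variant 1: UNION removes rows that are identical in every column.
union_rows = conn.execute(
    f"SELECT {COLUMNS} FROM Table1 UNION SELECT {COLUMNS} FROM Table2 ORDER BY id"
).fetchall()


def merge_into_result(conn, source):
    """Variant 2: insert only rows whose id is not yet in ResultTable."""
    conn.execute(
        f"""
        INSERT INTO ResultTable ({COLUMNS})
        SELECT t.id, t.name, t.age, t.location
        FROM {source} AS t
        LEFT JOIN ResultTable AS r ON r.id = t.id
        WHERE r.id IS NULL
        """
    )


for source in ("Table1", "Table2"):
    merge_into_result(conn, source)

merged_rows = conn.execute(
    f"SELECT {COLUMNS} FROM ResultTable ORDER BY id"
).fetchall()
print(union_rows)
print(merged_rows)
```

Note the different notions of "duplicate": UNION compares whole rows, while the one-table-at-a-time variant compares only id, so the two agree only when id really identifies a record.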
It seems the unique set of data you want is this:
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
...but that you then want to arbitrarily apply a sequence of integers as id (rather than using the id values from the Excel tables).
Because Access Database Engine does not support common table expressions and Excel does not support VIEWs, you will have to repeat that UNION query as derived tables (hopefully the optimizer will recognize the repeat?) e.g. using a correlated subquery to get the row number:
SELECT (
SELECT COUNT(*) + 1
FROM (
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
) AS DT1
WHERE DT1.name < DT2.name
) AS id,
DT2.name, DT2.loc
FROM (
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T2
UNION
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T2
) AS DT2;
Note: "i want the result set to be stored in an access database"
Then maybe you should migrate the Excel data into a staging table in your Access database and do the data scrubbing from there. At least you could put that derived table into a VIEW :)
A join combines two tables by matching the values in corresponding columns. As a result, you get a merged table that consists of the first table plus the matched rows copied from the second table. You can use the DIGBD add-in for Excel.
