If I have a table test with values like:
id | value
----------------
1 | ABC 1-2-3
2 | AB 1-2-3-4-5
3 | ABC 1
4 | ABC 1-2
5 | ABC
and the input string I'm trying to match is ABC 1-2-3-4-5, then the closest substring match (if I could call it that) should be ABC 1-2-3. Row #2 should not match because it doesn't have the "ABC". I've only been able to search when the input string is shorter than the stored values, but not when it's longer, e.g.
select * from test where value ilike 'ABC 1-2%';
but this also does not give me one exact record, only all rows starting with ABC 1-2. How do I construct the proper SQL statement to solve this?
You may be interested in the pg_trgm extension:
create extension if not exists pg_trgm;
Standard similarities for your data are as follows:
select *, similarity(value, 'ABC 1-2-3-4-5')
from test
order by 3 desc;
id | value | similarity
----+--------------+------------
2 | AB 1-2-3-4-5 | 0.8
1 | ABC 1-2-3 | 0.714286
4 | ABC 1-2 | 0.571429
3 | ABC 1 | 0.428571
5 | ABC | 0.285714
(5 rows)
However, you can always add additional criteria in the WHERE clause:
select *, similarity(value, 'ABC 1-2-3-4-5')
from test
where value ilike 'abc%'
order by 3 desc;
id | value | similarity
----+-----------+------------
1 | ABC 1-2-3 | 0.714286
4 | ABC 1-2 | 0.571429
3 | ABC 1 | 0.428571
5 | ABC | 0.285714
(4 rows)
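pg_trgm also provides the <-> distance operator (defined as 1 - similarity), so if you only want the single closest row you can sort by distance and take the first hit; a minimal sketch combining it with the prefix filter:
select *
from test
where value ilike 'abc%'
order by value <-> 'ABC 1-2-3-4-5'
limit 1;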
Reverse the comparison:
select * from test
where 'ABC 1-2-3-4-5' ilike value || '%'
order by length(value) desc
The best (i.e. longest) matches will be returned first.
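If you only need the single closest match, a LIMIT 1 on top of that ordering keeps just the longest matching prefix:
select * from test
where 'ABC 1-2-3-4-5' ilike value || '%'
order by length(value) desc
limit 1;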
Suppose I have a DataFrame with the following rows:
ID Name Age
ABC-123 XYZ 22
ABC-345 LMK 12
ABC-123-1 MNO 22
After applying a filter on column ID, I need only the first two rows to be returned, like:
ID Name Age
ABC-123 XYZ 22
ABC-345 LMK 12
As you can see, all rows that don't match the pattern are excluded from the final result; every row matching a pattern like ABC-123 should be returned.
Note: the suffix number can be anything, so I think it should be done with a regex that checks the string pattern.
import pandas as pd
df = pd.DataFrame(dict(id=['ABC-123','ABC-345','ABC-123-1'], age=[22,12,22]))
| | id | age |
|---:|:----------|------:|
| 0 | ABC-123 | 22 |
| 1 | ABC-345 | 12 |
| 2 | ABC-123-1 | 22 |
df.query('id.str.len() <= 7')
| | id | age |
|---:|:--------|------:|
| 0 | ABC-123 | 22 |
| 1 | ABC-345 | 12 |
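Since the note above says the suffix number can be anything, a length check is brittle; an anchored regex is more robust. A sketch using str.match (the LETTERS-DIGITS pattern is an assumption about the ID format):
import pandas as pd

df = pd.DataFrame(dict(id=['ABC-123', 'ABC-345', 'ABC-123-1'], age=[22, 12, 22]))

# keep only IDs of the form LETTERS-DIGITS; the trailing $ rejects 'ABC-123-1'
mask = df['id'].str.match(r'^[A-Z]+-\d+$')
print(df[mask])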
I have created a table called Employees with 3 columns, namely Names, Exp and Samples, and I am trying to insert values individually as below:
test = Company()
test.details()
DML = '''INSERT INTO Employees (Names,Exp) VALUES(?,?)'''
data = list(zip_longest(test.names,test.exps,fillvalue= None))
self.curr.executemany(DML,data)
self.curr.execute('''UPDATE Employees SET Samples = 'sample1' ''')
self.conn.commit()
Company is a class created in another Python file, and its details() method is called in the current file. The Names and Exp column values are stored as lists in test.names and test.exps respectively. When executed, the values are inserted properly, but the third column, Samples, varies between insertions, hence I need to populate it separately. When I execute the above:
# In the db
Names | Exp | Samples
John | 2 | sample1
Cena | 4 | sample1
Tom | 6 | sample1
Since names and exps are lists, I change the list values in the other file for each insertion, and that works as expected. For the above, John, Cena and Tom are the first list values of test.names; similarly, 2, 4 and 6 are the first list values of test.exps.
Each time I insert the values for a particular list, the sample value should stay the same within that batch, as in:
# In the db First insertion
Names | Exp | Samples
John | 2 | sample1
Cena | 4 | sample1
Tom | 6 | sample1
# Second insertion (expected)
Names | Exp | Samples
John | 2 | sample1
Cena | 4 | sample1
Tom | 6 | sample1
Meg | 3 | sample2
Cena | 4 | sample2
Renu | 6 | sample2
But on each insertion, while Names and Exp come out right, the existing Samples values get replaced by the new one, as in:
# Second insertion (actual)
Names | Exp | Samples
John | 2 | sample2
Cena | 4 | sample2
Tom | 6 | sample2
Meg | 3 | sample2
Cena | 4 | sample2
Renu | 6 | sample2
I also tried self.curr.execute('''UPDATE Employees SET Samples = 'sample1' WHERE Samples = NULL ''') but no luck. Is there any way I can update the sample values without affecting the old ones?
PS: I do not want to delete or replace the old values.
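The WHERE Samples = NULL attempt fails because in SQL a comparison with NULL is never true; NULL has to be tested with IS NULL. A minimal sketch assuming sqlite3 (the file name and the standalone connection are assumptions, not the asker's class):
import sqlite3
from itertools import zip_longest

conn = sqlite3.connect('employees.db')
curr = conn.cursor()

names, exps = ['Meg', 'Cena', 'Renu'], [3, 4, 6]
data = list(zip_longest(names, exps, fillvalue=None))
curr.executemany('INSERT INTO Employees (Names, Exp) VALUES (?, ?)', data)

# only the rows just inserted still have Samples = NULL, so this UPDATE
# no longer overwrites the Samples of earlier insertions
curr.execute("UPDATE Employees SET Samples = 'sample2' WHERE Samples IS NULL")
conn.commit()
Alternatively, pass the sample value as a third column in the INSERT itself and skip the UPDATE entirely.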
I need to deal with the following, and after searching I wasn't able to find exactly what I'm looking for:
Let's say I have a column which may or may not have an alphanumeric string
SKU
-----
12345ABC
12345-Abc
12345-Ab23
12345
Which I would like to break into
SKU | BATCH
------------------
12345 | ABC
12345 | Abc
12345 | Ab23
12345 | NULL
using PostgreSQL 9.4+. I've tried the string and substring methods but I'm not getting the results I'm looking for... any ideas?
You can use the substring function:
with a (SKU) as (values('12345ABC'), ('12345-Abc'), ('12345-Ab23'), ('12345'))
select substring(sku from '^\d+') as sku,
       substring(sku from '[a-zA-Z][a-zA-Z0-9]*$') as batch
from a;
  sku  | batch
-------+-------
 12345 | ABC
 12345 | Abc
 12345 | Ab23
 12345 |
(4 rows)
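To split the values in an actual table rather than a CTE, the same expressions can drive an UPDATE; a sketch, assuming the table is named products and a new batch column is wanted (both names are assumptions):
alter table products add column batch text;
update products
set batch = substring(sku from '[a-zA-Z][a-zA-Z0-9]*$'),
    sku   = substring(sku from '^\d+');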
You can also use regexp_matches; note the trailing ? makes the second capture group optional, so a purely numeric SKU yields NULL for the batch part:
with a (SKU) as (values('12345ABC'), ('12345-Abc'), ('12345-Ab23'), ('12345'))
select res[1], res[2]
from (
SELECT regexp_matches(SKU, '(\d+)[^[:alnum:]]*([[:alnum:]]+)?') res
FROM a
) y;
I have the following tables:
Orders:
OrderID|Cost|Quarter|User
-------------------------
1 | 10 | 1 | 1
2 | 15 | 1 | 2
3 | 3 | 2 | 1
4 | 5 | 3 | 3
5 | 8 | 4 | 2
6 | 9 | 2 | 3
7 | 6 | 3 | 3
Goals:
UserID|Goal|Quarter
-------------------
1 | 20 | 1
1 | 15 | 2
2 | 12 | 2
2 | 15 | 3
3 | 5 | 3
3 | 7 | 4
Users:
UserID|Name
-----------
1 | John
2 | Bob
3 | Homer
What I'm trying to do is to sum up all the orders one user had and divide that by the sum of his goals; then sum up all orders across users, divide that by the sum of all goals, and add this overall result to each user's previous result.
The result should be:
UserID|Name |Goal|CostSum|Percentage|Sum all
---------------------------------------------------
1     |John | 35 | 13    | 0.37     | 1.13
2     |Bob  | 27 | 23    | 0.85     | 1.61
3     |Homer| 12 | 20    | 1.67     | 2.43
the calculation is as follows:
CostSum: 10+3 = 13
Goal: 20+15 = 35
Percentage: CostSum/Goal = 13/35 = 0.37
Sum_all: 10+15+3+5+8+9+6 = 56
Goal_all: 20+15+12+15+5+7 = 74
Percentage_all: Sum_all/Goal_all = 56/74 = 0.76
Result: Percentage + Percentage_all = 0.37+0.76 = 1.13 for John
1.61 for Bob
2.43 for Homer
My main problem is the last step: I can't get it to add the overall percentage, because the pivot's row filter is always applied to the result, making it wrong.
To do this you're going to need to create some measures.
(I will assume you've already set your pivot table to be in tabular layout with subtotals switched off - this allows you to set UserID and Name next to each other in the row labels section.)
First, let's be sure you've set up your relationships correctly: Users should relate to both fact tables, Users[UserID] to Orders[User] and Users[UserID] to Goals[UserID].
I believe you already have the first 5 columns set up in your pivot table, so we need to create measures for CostSumAll, GoalSumAll, PercentageAll and Result.
The key to making this work is to ensure PowerPivot ignores the row label filter for your CostSumAll and GoalSumAll measures. The ALL() function acts as an override filter when used in CALCULATE() - you just have to specify which filters you want to ignore. In this case, UserID and Name.
CostSumAll:
=CALCULATE(SUM(Orders[Cost]),ALL(Users[UserID]),ALL(Users[Name]))
GoalSumAll:
=CALCULATE(SUM(Goals[Goal]),ALL(Users[UserID]),ALL(Users[Name]))
PercentageAll:
=Orders[CostSumAll]/Orders[GoalSumAll]
Result:
=Orders[Percentage]+Orders[PercentageAll]
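For completeness, the per-user measures assumed to exist already would presumably look like this (a sketch; the names CostSum, GoalSum and Percentage are assumptions):
CostSum:
=SUM(Orders[Cost])
GoalSum:
=SUM(Goals[Goal])
Percentage:
=Orders[CostSum]/Orders[GoalSum]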
I have an Excel file with 4 fields: a, b, c, key.
I need to check in a QlikView script that for each combination of a, b, c there is only one key.
The rows that have different keys should be the result.
For example, this is an incorrect situation that I need to catch:
key | c | b | a
111 | test3 | test2 | test1
222 | test3 | test2 | test1
Does anyone have an idea how this can be done in QlikView?
thanks,
Lena.
Interesting problem. I suggest treating columns c + b + a as a composite key and counting the number of unique values in the field key for each composite key. Here is one way to do that (QlikView script):
DATA:
LOAD key, c, b, a
FROM some_file.xls;
LEFT JOIN(DATA)
LOAD c, b, a, COUNT(DISTINCT key) AS key_count
RESIDENT DATA
GROUP BY c, b, a;
Your data model now has a 5th column named key_count. You can now use key_count in a chart or list box, or another LOAD statement with a WHERE clause, to filter the rows that have 2 or more values in the field key. To expand on your sample data:
key | c     | b     | a     | key_count
111 | test3 | test2 | test1 | 2
222 | test3 | test2 | test1 | 2
333 | test4 | test3 | test2 | 1
444 | test5 | test4 | test3 | 1
In a list box or LOAD statement, you can now easily find the rows where key_count > 1, as in the sketch below. I hope this helps!
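For example (the table name BAD_KEYS is an assumption):
// keep only composite keys that map to more than one key value
BAD_KEYS:
NOCONCATENATE LOAD key, c, b, a, key_count
RESIDENT DATA
WHERE key_count > 1;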