How to split name data in the same column in PostgreSQL? - string

I am new to PostgreSQL and am using pgAdmin 4 on a Mac. I have one column of imported data that holds a mix of values: some usernames, sometimes only a last name, and mostly a first and last name together.
What I care about most is being able to query and count the most frequent names in the column. I can tell from the results whether a name is a first or a last name. Listing the first 50 should do it. Please include the specific code, including how to address the table and column.
I have played with this, but need more:
select surname, count(*) from atreedata
group by surname
order by count(*) desc limit 40;
This works great when a row holds only one name! I need the most common names listed by name and count, even when a row holds several names.
Common Column Example:
John Smith
jsmith3
Stacey123
Bob Smith
Jones

So, if I understand correctly, you just need to find the most frequent words in the surname column.
There's a built-in function, regexp_split_to_table, that splits a string into words and returns one row per word. So:
select surname_word, count(*) as surname_word_count
from (
select regexp_split_to_table(surname, E'\\s+') as surname_word
from atreedata
) as surname_words
group by surname_word
order by surname_word_count desc
limit 40;
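To sanity-check the result outside the database, the same split-and-count logic can be sketched in Python. This is only an illustration using the sample values from the question; count_name_words is a hypothetical helper name, and the strip() call is an addition that the SQL above does not perform.

```python
import re
from collections import Counter


def count_name_words(values, limit=40):
    """Split each value on whitespace and count how often each word occurs."""
    counts = Counter()
    for value in values:
        # Splitting on \s+ corresponds to regexp_split_to_table(surname, E'\\s+').
        # strip() is an extra step here to avoid empty words at the edges.
        words = re.split(r"\s+", value.strip())
        counts.update(word for word in words if word)
    # most_common() orders by descending count, like ORDER BY ... DESC LIMIT.
    return counts.most_common(limit)


sample = ["John Smith", "jsmith3", "Stacey123", "Bob Smith", "Jones"]
top = count_name_words(sample)
```

As in the SQL, the count is case-sensitive, so "smith" and "Smith" would be counted separately unless the values are lower-cased first.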

Related

Spark - partitioning/bucketing of n-tables with overlapping but not identical ids

I'm currently trying to optimize a query on two rather large tables, which look like this:
Table 1: id column - alphanumeric, about 300 million unique ids, more than 1 billion rows overall
Table 2: id column - identical semantics, about 200 million unique ids, more than 1 billion rows overall
Let's say that on a given day, 17.03., I want to join those two tables on id.
Table 1 is the left side, table 2 is the right side. I get about 90% matches, meaning table 2 contains about 90% of the ids present in table 1.
One week later, table 1 has not changed (it could have, but to keep the explanation simple, assume it didn't), while table 2 was updated and now contains more records. I do the join again, and some of the formerly missing ids now show up, so I get about 95% matches.
In general, table1.id has some matches with table2.id at a given time, and these may change from day to day.
I now want to optimize this join and came across the bucketing feature. Is this possible?
Example:
1st join: id "ABC123" is present in table1, not in table2. ABC123 gets sorted into a certain bucket, e.g. "1".
2nd join (a week later): id "ABC123" now shows up in table2. How can I ensure that it lands in the bucket of table 2 that is co-located with the matching bucket of table 1?
Or do I have a general misunderstanding of how this works?
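One way to picture the example: with hash-based bucketing, the bucket number is computed only from the id value and the number of buckets, so it does not depend on which table a row belongs to or when the row arrives. A minimal Python sketch of that idea follows, assuming both tables are bucketed on id with the same bucket count. The bucket_for helper and the use of MD5 are illustrative stand-ins, not Spark's actual hash function.

```python
import hashlib

NUM_BUCKETS = 8  # assumed value; both tables would need the same bucket count


def bucket_for(record_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an id to a bucket using a deterministic hash (illustrative only)."""
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then take it modulo the bucket count.
    return int.from_bytes(digest[:8], "big") % num_buckets


# Week 1: "ABC123" exists only in table 1.
bucket_week1_table1 = bucket_for("ABC123")
# Week 2: "ABC123" shows up in table 2; the same function gives the same bucket.
bucket_week2_table2 = bucket_for("ABC123")
```

Under these assumptions the id ends up in the same bucket number on both sides, regardless of when it first appears in table 2.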

When and why are Google Cloud Spanner table and column names case-sensitive?

Spanner documentation says:
Table and column names:
Can be between 1-128 characters long. Must start with an uppercase or lowercase letter.
Can contain uppercase and lowercase letters, numbers, and underscores, but not hyphens.
Are case-insensitive. For example, you cannot create tables named mytable and MyTable in the same database or columns names mycolumn and MyColumn in the same table.
https://cloud.google.com/spanner/docs/data-definition-language#table_statements
Given that, I have no idea what this means:
Table names are usually case insensitive, but may be case sensitive when querying a database that uses case sensitive table names.
https://cloud.google.com/spanner/docs/lexical#case-sensitivity
In fact, it seems that table names are case-sensitive: for example, queries fail if we don't match the case shown in the UI.
This seems to be an error in the documentation. Table names are case-insensitive in Cloud Spanner. I'll follow up with the docs team.
Edit: Updated docs https://cloud.google.com/spanner/docs/data-definition-language#naming_conventions
I'll add a couple of examples so we can see the difference.
Table names are case-sensitive. In this example it does not matter, because there is only one table:
Example 1:
SELECT *
FROM Roster
WHERE LastName = @myparam
returns all rows where LastName is equal to the value of query parameter myparam.
Example 2 matters more, because it compares two tables (the same applies to other queries that reference several tables):
SELECT id, name
FROM Table1
EXCEPT DISTINCT
SELECT id, name
FROM Table2
It returns every row that is in Table1 but not in Table2.

Sum Indirect Table

I have a series of tables that break down the users per product usage. I have also created a summary table. In the summary table I want to sum the total number of users for each product, using the product name to dynamically look up its designated table. The product tables are named after the product, except that spaces have been replaced with "_". For example, DynCRMSrvcPrvdr ALNG LicSAPk MVL SAL will have the table name DynCRMSrvcPrvdr_ALNG_LicSAPk_MVL_SAL.
For some reason, when I try to reference the table and sum the column 'Count', I get an error:
=SUM(INDIRECT(SUBSTITUTE(E4," ","_")&"["&Count&"]"))
The column header name "Count" should be inside the quotation marks as well, so that it is part of the text passed to INDIRECT. Please try the corrected formula:
=SUM(INDIRECT(SUBSTITUTE(E4," ","_")&"[Count]"))
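To see what text the corrected formula hands to INDIRECT, the string-building step can be sketched in Python. This only mimics SUBSTITUTE and the & concatenation; structured_reference is a hypothetical helper name, and the product name is the one from the question.

```python
def structured_reference(product_name: str, column: str = "Count") -> str:
    """Build the structured reference text, e.g. 'Table_Name[Count]'."""
    # Same effect as SUBSTITUTE(E4, " ", "_") in the formula.
    table_name = product_name.replace(" ", "_")
    # Same effect as appending the literal "[Count]" with &.
    return f"{table_name}[{column}]"


ref = structured_reference("DynCRMSrvcPrvdr ALNG LicSAPk MVL SAL")
```

The result is a single piece of text naming both the table and the column, which is the form INDIRECT needs before SUM can add up the values.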

Concatenate all values for a column as one string for each row in select

I have been looking into this issue for some time and need some help. I have looked into sub-queries and CTEs with no luck.
I want to select all of the data in Table A. I also want to add some columns from Table B, so an inner join will work initially.
However, from Table B I want to build a string that concatenates all the different categories for each person, and list it as a column in my select.
For example, one person might be in both the IT and Construction categories, so I would like a string that says "Construction, IT" listed for that person, and likewise for everyone else.
I have looked into COALESCE, which gets me part of the way. Any advice is most appreciated.
Table A
Name
Surname
Person Ref
Table B
Person Ref
Person Category
Person Gender
These are example tables. I can't post the ones I'm working with for security reasons, but hopefully you get my point.
Martin
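To pin down the expected output, here is a minimal Python sketch of the grouping described above. It is not a SQL solution; the sample rows and the categories_by_person name are made up for illustration, with each row standing for a (Person Ref, Person Category) pair from Table B.

```python
from collections import defaultdict


def categories_by_person(rows):
    """Return {person_ref: 'Category1, Category2, ...'} with categories sorted."""
    grouped = defaultdict(set)
    for person_ref, category in rows:
        # A set drops duplicate categories for the same person.
        grouped[person_ref].add(category)
    return {ref: ", ".join(sorted(cats)) for ref, cats in grouped.items()}


# Hypothetical Table B rows: (Person Ref, Person Category)
table_b = [(1, "IT"), (1, "Construction"), (2, "Finance")]
result = categories_by_person(table_b)
```

Each person ends up with one string, which is the value that would be joined back to Table A on Person Ref.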

Cassandra: How to check if a column value is already present

I am using Cassandra. There is a column name (of type text) that stores, say, all usernames.
name
------
bob
Bob
bobby
mike
michael
micky
BOB
I have 2 questions:
1. I have to select all usernames that start with 'bo'. I know there is no LIKE equivalent in Cassandra, but is there any way to achieve that? (An additional column is an option, but is there something else?)
2. There are 3 entries: bob, Bob and BOB. Is there any way to fetch all 3 rows if I pass where name='bob'? I need to fetch the names case-insensitively.
Thanks in advance.
Let's start with the second question first. If you want to support case-insensitive queries, you should store a second, upper-case copy of the text data you want to search for in another column. Then by querying by that column you'll be able to do case-insensitive requests.
Going back to searches for bo*. The best way to do that is to use a schema that allows you to leverage clustering columns (columns 2 and higher of the primary key) for range searches. Here is an example:
CREATE TABLE t1 (region INT, name TEXT, PRIMARY KEY (region, name));
In particular, if you make name the second column of the key, you will be able to perform searches such as
SELECT * FROM t1 WHERE name >= 'bo' and name < 'bp' ALLOW FILTERING;
which will return the results you're looking for. This only works for trailing wildcards: the leading characters have to be constant for range queries to work. Again, if you want to do case-insensitive searches, make the upper-case copy of the column the second part of the primary key and query by it:
SELECT * FROM t1 WHERE name_upper >= 'BO' and name_upper < 'BP' ALLOW FILTERING;
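The range bounds in these queries ('bo' to 'bp', 'BO' to 'BP') come from incrementing the last character of the prefix. A minimal Python sketch of that rule, combined with the upper-case copy for case-insensitive matching, follows. The helper names are hypothetical, and Python compares strings by code point, which is assumed here to match the column ordering for plain ASCII names.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix: str) -> str:
    """Return a string that sorts after every string starting with prefix."""
    # Increment the last character: 'bo' -> 'bp'. Assumes a non-empty prefix.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_search(sorted_names, prefix):
    """Return all entries in a sorted list that fall in [prefix, upper bound)."""
    lo = bisect_left(sorted_names, prefix)
    hi = bisect_left(sorted_names, prefix_upper_bound(prefix))
    return sorted_names[lo:hi]


names = ["bob", "Bob", "bobby", "mike", "michael", "micky", "BOB"]
# The sorted upper-case copies play the role of the name_upper clustering column.
upper_sorted = sorted(name.upper() for name in names)
matches = prefix_search(upper_sorted, "BO")
```

Because the search runs on the upper-case copies, the prefix has to be upper-cased as well before it is used as the lower bound.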
