I want to extract the data in a column before the '-' symbol. I could do this easily with T-SQL, but I am getting errors when I do the same in Azure Databricks.
I also want to check whether the column contains that symbol at all; where it does not, I don't want to extract anything.
In T-SQL I could write:
SELECT EmailAddress
,SUBSTRING(emailaddress, 0, charindex('-', emailaddress, 0))
FROM [dbo].[DimEmployee]
How do I get the same result in Databricks, please?
To find records containing the '-' symbol, you can use pyspark.sql.Column.contains:
Column.contains(other)
Contains the other element. Returns a boolean Column based on a string match.
regexp_extract_all function
Extracts all of the strings in str that match the regexp expression and correspond to the regex group index.
regexp_extract_all(str, regexp [, idx] )
E.g.
SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
[100, 300]
Refer - https://docs.databricks.com/sql/language-manual/functions/regexp_extract_all.html#regexp_extract_all-function-databricks-sql
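Putting the two together for the original question, here is a minimal Databricks SQL sketch. The table and column names come from the question; the CASE guard that skips rows without a '-' and the regexp pattern are my own assumptions:
SELECT emailaddress,
       CASE
         WHEN emailaddress LIKE '%-%'                     -- only extract when a '-' is present
         THEN regexp_extract(emailaddress, '^(.*?)-', 1)  -- capture group 1: everything before the first '-'
       END AS name_before_dash
FROM DimEmployee;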
I have a column from which I have to extract a string and then format it to US currency format with 2 decimal places.
For example :
Column value : {tag}0000020000890|
From this, I have to match the tag and extract 20000890, and format it to 200,008.90
I have extracted the part with the code below:
LTRIM(REGEXP_SUBSTR('match pattern', 1, 1, 'i', 1), '0')
Where match pattern is '\{tag\}(.*?)\|'
With this, I am able to extract 20000890
Then I tried the to_char and to_number functions below on top of it to format it as comma-separated currency with 2 decimal places.
to_char(ltrim(Regexp_substr('match pattern',1,1,'i',1),'0'), '99G999G999D99')
But this throws the error below:
Sql error -20447, sqlstate 22007 sqlerrmc 99G999G999D99
Sysibm.Varchar-format
Then I tried,
to_char(to_number(ltrim(Regexp_substr('match pattern',1,1,'i',1),'0')), '99G999G999D99')
But this also throws an error:
Sql error -20476, sqlstate 22018 sqlerrmc DECFLOAT_FORMAT; 99G999G999D99
I'm not sure what causes this error.
The format that you are trying to use is supported starting from Db2 V11.5 only.
TO_CHAR V11.5
TO_CHAR V11.1
Compare the 'Table 2. Format elements for decimal floating-point to varchar' table in both links.
Moreover, you must cast the string to a numeric value in the first parameter of TO_CHAR:
SELECT TO_CHAR(DECFLOAT(REGEXP_SUBSTR(V, '\{tag\}(.*?)\|', 1, 1, 'i', 1)), '99,999,999.99')
FROM (VALUES '{tag}0000020000890|') T(V);
Take a look at VARCHAR_FORMAT. It is the function TO_CHAR is mapped to. The group separator is not G, but "," or ".". Basically, you have to replace your formatting string 99G999G999D99 with something like 99,999,999.99.
The Db2 documentation has more examples on that.
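For illustration, a minimal sketch of that substitution, assuming Db2 11.5+ and treating the last two digits of the extracted value as implied decimals (the division by 100 is my assumption, not something stated in the question):
SELECT VARCHAR_FORMAT(DECFLOAT('20000890') / 100, '99,999,999.99') AS formatted
FROM SYSIBM.SYSDUMMY1;
-- returns 200,008.90 (possibly padded with leading blanks)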
I hope someone can help me and explain this query:
why does the first query return results but the second does not?
EDIT:
first query:
select name from Items where name like '%abc%'
second Query:
select name from Items where name like substring('''%abc%''',1,10)
Why does the first return results while the second returns nothing, even though
substring('''%abc%''',1,10) = '%abc%'?
If there is logic behind that, is there another approach to do something like the second query?
My purpose is to transform a string like '''abc''' to 'abc' in order to use it in a LIKE statement.
You can concatenate strings to form your LIKE string. To trim the first 3 and last 3 characters from a string, use the SUBSTRING and LEN functions. The following example assumes your match string is called @input and starts and ends with 3 quote marks that need to be removed to find a match:
select name from Items where name like '%' + SUBSTRING(@input, 4, LEN(@input) - 6) + '%'
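A runnable sketch of that approach; the variable, its sample value, and the contents of the Items table are illustrative assumptions:
DECLARE @input varchar(30) = REPLICATE('''', 3) + 'abc' + REPLICATE('''', 3);  -- stored value: '''abc'''
SELECT name
FROM Items
WHERE name LIKE '%' + SUBSTRING(@input, 4, LEN(@input) - 6) + '%';  -- strips 3 characters from each end, leaving abc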
I have a string field with mostly numeric values like 13.4, but some have 13.4%. I am trying to use the following expression to remove the % symbols and retain just the numeric values to convert the field to integer.
Here is what I have so far in the expression definition of Cognos 8 Report Studio:
IF(POSITION('%' IN [FIELD1]) = NULL) THEN
/*** this captures rows with valid data **/
([FIELD1])
ELSE
/** trying to remove the % sign from rows with data like this 13.4% **/
(SUBSTRING([FIELD1], 1, POSITION('%' IN [FIELD1])))
Any hints/help is much appreciated.
An easy way to do this is to use the trim() function. The following will remove any trailing % characters:
TRIM(trailing '%',[FIELD1])
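For example, applied to the sample value from the question, the expression below returns '13.4'; values without a trailing % pass through unchanged:
TRIM(trailing '%', '13.4%')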
The approach you are using is feasible. However, the syntax you are using is not compatible with the version of Report Studio that I'm familiar with. Below you will find an updated expression which works for me.
IF ( POSITION( '%'; [FIELD1]) = 0) THEN
( [FIELD1] )
ELSE
( SUBSTRING( [FIELD1]; 1; POSITION( '%'; [FIELD1]) - 1 ) )
Since character positions in strings are 1-based in Cognos, it's important to subtract 1 from the position returned by POSITION(). Otherwise you would only cut off the characters after the percent sign.
Another note: what you are doing here is data cleansing. It's usually more advantageous to push these chores down to a lower level of the data retrieval chain, e.g. the data warehouse or at least the Framework Manager model, so that at the reporting level you can use this field directly as a numeric field.
I am trying to find whether a string exists in a word and extract it. I have used the instr() function, but this works like the LIKE function: if part or the whole word exists, it returns it.
Here I want to get the string 'Services' out. It works, but if I change 'Services' to 'Service' it still works. I don't want that. If 'Service' is entered it should return null, not 'Services'.
Modified:
What I am trying to do here is abbreviate certain parts of the company name.
This is what my database table looks like :
Word | Abb
---------+-----
Company | com
Limited | ltd
Service | serv
Services | servs
Here is the code:
Declare
  Cursor Words Is
    SELECT word, abb
    FROM abbWords;
  processingWord VARCHAR2(50);
  abbreviatedName VARCHAR2(120);
  wordPosition NUMBER;
  fullName VARCHAR2(120) := 'A.D Company Services Limited';
BEGIN
  FOR eachWord IN Words LOOP
    --find the position of the word in the name
    wordPosition := INSTR(fullName, eachWord.word);
    --extract the word from the full name that matches the database
    processingWord := SUBSTR(fullName, INSTR(fullName, eachWord.word), LENGTH(eachWord.word));
    --only process words that exist in the name
    IF wordPosition > 0 THEN
      abbreviatedName := REPLACE(fullName, eachWord.word, eachWord.abb);
    END IF;
  END LOOP;
END;
So if the user enters 'Service' I don't want 'Services' to be returned. By this I mean the word position should be 0 if the word 'Service' is not found, instead of returning the position of the word 'Services'.
One way of doing it:
DECODE(INSTR('A.D Company Services Limited','Services'),
0,
NULL,
SUBSTR('A.D Company Services Limited',
INSTR('A.D Company Services Limited','Services'),
length('Services')))
INSTR() will return 0 if text is not found. DECODE() will evaluate the first argument, compare to the second, if match, return third argument, if not, return fourth argument. (sqlfiddle link)
Arguably not the most elegant way, but matches your requirement.
I think you're over-complicating this. You can do everything with regular expressions. For instance, given the following table:
create table names ( name varchar2(100));
insert into names values ('A.D Company Services Limited');
insert into names values ('A.D Company Service Limited');
This query will only return the name 'A.D Company Services Limited'.
select *
from names
where regexp_like( name
, '(^|[[:space:]])services($|[[:space:]])'
, 'i' )
This means: match the beginning of the string, ^, or a space, followed by services, followed by the end of the string, $, or a space. This is what differentiates regular expressions from using instr etc. You can make your matches easily conditional on other factors.
However, though this seems to be your question I don't think this is what you're trying to do. You're trying to replace the string 'serv' in your wider string without replacing 'services' or 'service'. For this you need to use regexp_replace().
If I add the following row to the table:
insert into names values ('A.D Company Serv Limited');
and run this query:
select regexp_replace( name
, '(^|[[:space:]])serv($|[[:space:]])'
, ' Services '
, 1, 0, 'i' )
from names
The only thing that will change is ' Serv ', which in this newest line, will be replaced with ' Services '. Note the spaces; as you don't want to replace 'Services' with 'ServServices' these are very important.
Here's a little SQL Fiddle to demonstrate.
Another alternative is to use something like:
select replace(name,' serv ', ' Services ')
from names;
This will replace only the word 'Serv' situated between 2 spaces.
Thank you,
Alex.
INSTR returns a number: the index of the first occurrence of the matching string. You should use regexp_substr instead (10g+):
SQL> select regexp_substr('A.D Company Services Limited', 'Services') match,
2 regexp_substr('A.D Company Service Limited', 'Services') unmatch
3 from dual;
MATCH UNMATCH
-------- -------
Services
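If the whole-word requirement matters with this approach as well (so that searching for 'Service' does not pick up 'Services'), the anchored pattern from the regexp_like answer above can be reused with regexp_substr. This is a sketch combining the two answers, not something from the original posts:
select regexp_substr('A.D Company Services Limited',
                     '(^|[[:space:]])Service($|[[:space:]])', 1, 1, 'i') as whole_word_match
from dual;
-- returns NULL: 'Service' does not occur as a whole word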
I am rolling up a huge table by counts into a new table, where I want to change all the empty strings to NULL, and typecast some columns as well. I read through some of the posts and I could not find a query, which would let me do it across all the columns in a single query, without using multiple statements.
Let me know if it is possible to iterate across all columns and replace cells containing empty strings with NULL.
Ref: How to convert empty spaces into null values, using SQL Server?
To my knowledge there is no built-in function to replace empty strings across all columns of a table. You can write a plpgsql function to take care of that.
The following function replaces empty strings in all basic character-type columns of a given table with NULL. You can then cast to integer if the remaining strings are valid number literals.
CREATE OR REPLACE FUNCTION f_empty_text_to_null(_tbl regclass, OUT updated_rows int)
LANGUAGE plpgsql AS
$func$
DECLARE
_typ CONSTANT regtype[] := '{text, bpchar, varchar}'; -- ARRAY of all basic character types
_sql text;
BEGIN
SELECT INTO _sql -- build SQL command
'UPDATE ' || _tbl
|| E'\nSET ' || string_agg(format('%1$s = NULLIF(%1$s, '''')', col), E'\n ,')
|| E'\nWHERE ' || string_agg(col || ' = ''''', ' OR ')
FROM (
SELECT quote_ident(attname) AS col
FROM pg_attribute
WHERE attrelid = _tbl -- valid, visible, legal table name
AND attnum >= 1 -- exclude tableoid & friends
AND NOT attisdropped -- exclude dropped columns
AND NOT attnotnull -- exclude columns defined NOT NULL!
AND atttypid = ANY(_typ) -- only character types
ORDER BY attnum
) sub;
-- RAISE NOTICE '%', _sql; -- test?
-- Execute
IF _sql IS NULL THEN
updated_rows := 0; -- nothing to update
ELSE
EXECUTE _sql;
GET DIAGNOSTICS updated_rows = ROW_COUNT; -- Report number of affected rows
END IF;
END
$func$;
Call:
SELECT f_empty_text_to_null('mytable');
SELECT f_empty_text_to_null('myschema.mytable');
To also get the column name updated_rows in the result:
SELECT * FROM f_empty_text_to_null('mytable');
db<>fiddle here
Old sqlfiddle
Major points
Table name has to be valid and visible and the calling user must have all necessary privileges. If any of these conditions are not met, the function will do nothing - i.e. nothing can be destroyed, either. I cast to the object identifier type regclass to make sure of it.
The table name can be supplied as is ('mytable'), then the search_path decides. Or schema-qualified to pick a certain schema ('myschema.mytable').
Query the system catalog to get all (character-type) columns of the table. The provided function uses these basic character types: text, bpchar, varchar. Only relevant columns are processed.
Use quote_ident() or format() to sanitize column names and safeguard against SQLi.
The updated version uses the basic SQL aggregate function string_agg() to build the command string without looping, which is simpler and faster. And more elegant. :)
Has to use dynamic SQL with EXECUTE.
The updated version excludes columns defined NOT NULL and only updates each row once in a single statement, which is much faster for tables with multiple character-type columns.
Should work with any modern version of PostgreSQL. Tested with Postgres 9.1, 9.3, 9.5 and 13.
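As a follow-up to the cast-to-integer remark above, a minimal sketch of converting a cleaned-up column afterwards (the table and column names are made up for illustration):
ALTER TABLE mytable
  ALTER COLUMN amount TYPE integer
  USING amount::integer;  -- NULLs stay NULL; all remaining strings must be valid integer literals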