How to tokenize a string and assign tokens to columns in Teradata?

I have multiple strings of the form {key1=value, key2=value2, key3=value3 ...} with a known, fixed set of keys; only the values change between records. I would like to tokenize each string using the space as the delimiter, strip off the key names, and assign each value to a column in sequential order. Is this something I can do in-database in Teradata 15?

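Starting with TD14, there is the NVP function for extracting data from name-value pairs. The third argument lists the delimiters that separate the pairs and the fourth is the delimiter between name and value, e.g.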
NVP(col, 'key1', '{ ,\ }', '=')
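
To make the intended result concrete, here is a minimal Python sketch (not Teradata code) of the same name-value-pair extraction: strip the braces, split on the comma-space pair delimiter, split each pair on '=', and return the values in a fixed key order. The function name extract_values and the sample record are illustrative assumptions, not part of the question or of NVP itself.

```python
def extract_values(record, keys):
    """Return the values for `keys`, in order, from a '{k1=v1, k2=v2}' string."""
    # Drop surrounding whitespace and the enclosing braces.
    body = record.strip().strip("{}")
    pairs = {}
    for item in body.split(", "):
        name, sep, value = item.partition("=")
        if sep:  # skip fragments that are not name=value pairs
            pairs[name.strip()] = value.strip()
    # A missing key becomes None, comparable to a NULL column.
    return [pairs.get(key) for key in keys]


row = extract_values("{key1=value, key2=value2, key3=value3}",
                     ["key1", "key2", "key3"])
print(row)  # ['value', 'value2', 'value3']
```

Each position in the returned list corresponds to one target column, which mirrors calling NVP once per key.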

Related

How to convert a column of string array to array format and coalesce the first non null value in the dataset

I have a dataset that consists of two columns. The "Values" column contains a list/array stored as a string, and the column data type is char. I need to coalesce the first non-null value into a new column, since other rows also contain null values. I am new to SAS. Could you please help me with a solution?
So you have a long string with comma separated values? You can use SCAN() to select one item from the list. Since your list has extra [ and ] you can just include those extra characters in the set of delimiter characters for SCAN().
data want;
  set have;
  /* first item, treating [ , and ] as delimiters */
  first = scan(values,1,'[,]');
run;
If the values can include the delimiter use the 'q' modifier. That will ignore delimiters that are inside of quoted strings. If you want to remove the quotes from the result use the DEQUOTE() function.
data want;
  set have;
  /* 'q' ignores delimiters inside quotes; dequote() strips the quotes */
  first = dequote(scan(values,1,'[,]','q'));
run;
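
As a rough illustration of what SCAN() does with the delimiter set '[,]', here is a minimal Python sketch (not SAS). The helper name scan and the sample strings are assumptions. It only mimics the default behavior of treating runs of delimiters as one and returning a blank when the requested item does not exist; it does not implement the 'q' modifier or negative positions.

```python
import re


def scan(text, n, delimiters):
    """Return the n-th (1-based) item of `text`, split on any delimiter character."""
    # Build a character class from the delimiter set, e.g. '[,]' -> [\[,\]]
    pattern = "[" + re.escape(delimiters) + "]"
    # Empty pieces are dropped, so adjacent delimiters act as a single one.
    items = [piece for piece in re.split(pattern, text) if piece]
    return items[n - 1] if 0 < n <= len(items) else ""


print(scan("[10,20,30]", 1, "[,]"))  # 10
```

Because the brackets are part of the delimiter set, no separate step is needed to strip them from the first or last item.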

Azure SQL: join of 2 tables with 2 unicode fields returns empty when matching records exist

I have a table with a few key columns defined as nvarchar(80), i.e. Unicode.
I can list the full dataset with a SELECT * statement (Table1) and can confirm that the values I need to filter on are there.
However, I can't get any results from that table when I filter rows using an alphabetic character as input on any column.
The columns in Table1 store values in Cyrillic characters.
I know it must have to do with character encoding: what I see in the result list is not what I use as input characters.
The Unicode nvarchar type should resolve this character type mismatch automatically.
What do you suggest I do in order to get results?
Thank you very much.
Paulo

Power Query: How to delete duplicate characters from a string (eg. xzxxxzzzzxzzzzx-> leave only xz)?

I have a huge table in Power Query with text in cells that consist of multiple 'x's and 'z's. I want to deduplicate values so I have one x and one z only.
For example:
xzzzxxxzxz -> xz
zzzzzzzzzz -> z
The table is very big, so I don't want to create additional columns. Can you please help?
You can convert the string to a list of characters, make the list distinct (remove duplicates), sort (if desired), and then transform back to text.
= Table.TransformColumns(#"Previous Step", {{"ColumnName",
    // split into characters, drop duplicates, sort, then join back into text
    each Text.Combine( List.Sort( List.Distinct( Text.ToList(_) ) ) ),
    type text}})
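
The same distinct-then-sort-then-join idea can be checked quickly outside Power Query. This is a minimal Python sketch of the logic only; the function name dedupe_chars is an assumption, and the inputs are the two examples from the question.

```python
def dedupe_chars(text):
    """Keep one copy of each character and return them in sorted order."""
    # set() removes duplicates, sorted() orders them, join() rebuilds the text.
    return "".join(sorted(set(text)))


print(dedupe_chars("xzzzxxxzxz"))  # xz
print(dedupe_chars("zzzzzzzzzz"))  # z
```

As in the M expression, the sort is optional; without it the order of the surviving characters would depend on how duplicates are removed.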

Need initial N characters of column in Postgres where N is unknown

I have one column in my table in Postgres let's say employeeId. We do some modification based on the employee type and store it in DB. Basically, we append strings from these 4 strings ('ACR','AC','DCR','DC'). Now we can have any combination of these 4 strings appended after employeeId. For example, EMPIDACRDC, EMPIDDCDCRAC etc. These are valid combinations. I need to retrieve EMPID from this. EMPID length is not fixed. The column is of varying length type. How can this be done in Postgres?
I am not entirely sure I understand the question, but regexp_replace() seems to do the trick:
with sample (employeeid) as (
  values
    ('1ACR'),
    ('2ACRDCR'),
    ('100DCRAC')
)
-- the group makes .*$ apply to every code, not only to DC
select employeeid,
       regexp_replace(employeeid, '(ACR|AC|DCR|DC).*$', '', 'i') as clean_id
from sample
returns:
employeeid | clean_id
-----------+---------
1ACR | 1
2ACRDCR | 2
100DCRAC | 100
The regular expression says "any characters after any of those strings, up to the end of the string", and that match is then replaced with nothing. However, this won't work if the actual empid contains any of the appended codes.
It would be much cleaner to store this information in two columns: one for the empid and one for those "codes".
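
The behavior and the limitation described above can be tried out with a small Python sketch. It uses a grouped variant of the pattern, (ACR|AC|DCR|DC).*$, which is an assumption about the intent ("first code through the end of the string"). The name clean_id mirrors the column alias in the query, and the last sample value is made up to show the failure case.

```python
import re

# First occurrence of any code, plus everything after it, case-insensitive.
SUFFIX = re.compile(r"(ACR|AC|DCR|DC).*$", re.IGNORECASE)


def clean_id(employee_id):
    """Strip the appended codes from an employee id."""
    return SUFFIX.sub("", employee_id)


for value in ["1ACR", "2ACRDCR", "100DCRAC", "EMPIDACRDC"]:
    print(value, "->", clean_id(value))

# Limitation: an id that itself contains a code is cut too early.
print(clean_id("ZACK7DC"))  # Z
```

The last call shows why a separate column for the codes is the more robust design: the pattern cannot tell an "AC" inside the id from an appended one.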

Spark Dataframe Pivot on Word in String

Basically, I have a dataframe column (String type) that contains English sentences. My goal is to create a pivot table (grouped by user IDs) that has words as columns and counts as entries. The problem is that if you do something like
myDataframe.groupBy(col("user")).pivot(col("sentences")).count()
where "sentences" is the name of the column containing the English sentences, you will be counting the sentences rather than the individual words. Is there any way to count the individual words in the sentences and not just the sentences themselves? Whitespace tokenization is fine.
You have to tokenize and explode first:
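import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.functions.{col, explode}

new Tokenizer()  // lowercases the text, then splits on whitespace
  .setInputCol("sentences")
  .setOutputCol("tokens")
  .transform(df)  // df is the input DataFrame (myDataframe in the question)
  .withColumn("token", explode(col("tokens")))  // one row per word
  .groupBy(col("user")).pivot(col("token")).count()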
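
To see what the tokenize, explode, and pivot steps produce, here is a minimal plain-Python sketch of the same counting logic, without Spark. The function name word_count_pivot and the sample rows are assumptions. Unlike the Spark pivot, which leaves missing combinations null, this sketch fills them with 0.

```python
from collections import Counter, defaultdict


def word_count_pivot(rows):
    """rows: iterable of (user, sentence). Returns (words, {user: counts per word})."""
    counts = defaultdict(Counter)
    for user, sentence in rows:
        # Lowercase and split on whitespace, then count each word for this user.
        counts[user].update(sentence.lower().split())
    # The pivot columns are the distinct words across all users.
    words = sorted({word for counter in counts.values() for word in counter})
    table = {user: [counter[word] for word in words]
             for user, counter in counts.items()}
    return words, table


rows = [("u1", "the cat sat"), ("u1", "the dog"), ("u2", "a cat")]
words, table = word_count_pivot(rows)
print(words)        # ['a', 'cat', 'dog', 'sat', 'the']
print(table["u1"])  # [0, 1, 1, 1, 2]
print(table["u2"])  # [1, 1, 0, 0, 0]
```

Counting per user before building the columns corresponds to exploding the tokens first: the pivot then operates on one word per row rather than on whole sentences.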
