Use str_to_map in BigQuery - struct

I have a function str_to_map() in Hive that I need to convert to BigQuery. Since BigQuery has no MAP type, I want to find another way to get a map-like format and then extract the values by key name.
Example:
Select str_to_map('cars:0,kids:143,cats:1,lost:0,win:1,chances:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0,missed:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0',',',':')
If I call the key 'cars' I get the value '0'.
If I call the key 'chances' I should get '0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0'
It's necessary for me to have a type like the 'map' type (key-value).
Thank you 😀

Google provides some useful UDFs for BigQuery in the bigquery-utils repository.
Don't reinvent the wheel.
So, I brought in two of those UDFs to answer this question.
1. get_value(k STRING, arr ANY TYPE)
Given a key and a list of key-value maps in the form [{'key': 'a', 'value': 'aaa'}], returns the SCALAR type value.
2. cw_map_parse(m string, pd string, kvd string)
Converts a string to a map.
With these, you can write a query like below:
SELECT get_value('kids', cw_map_parse(str, ',', ':')) kids,
get_value('chances', cw_map_parse(str, ',', ':')) chances,
FROM UNNEST(['cars:0,kids:143,cats:1,lost:0,win:1,chances:0,missed:0']) str;
+------+---------+
| kids | chances |
+------+---------+
| 143  | 0       |
+------+---------+
But due to the requirements below, the cw_map_parse implementation needs to be customized a little bit.
If I call the key 'cars' I get the value '0'. If I call the key 'chances' I should get '0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0'
Below is a query with customized UDFs. str_to_map() is a customized version of cw_map_parse().
CREATE TEMP FUNCTION str_to_map(m string, pd string, kvd string)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  ARRAY(
    SELECT AS STRUCT kv[SAFE_OFFSET(0)] AS key, kv[SAFE_OFFSET(1)] AS value
    FROM (
      SELECT SPLIT(REGEXP_REPLACE(kv, r'^(.*?)' || kvd, r'\1|'), '|') AS kv
      FROM UNNEST(SPLIT(m, pd)) AS kv
    )
  )
);
CREATE TEMP FUNCTION get_value(get_key STRING, arr ANY TYPE) AS (
  (SELECT value FROM UNNEST(arr) WHERE key = get_key)
);
SELECT get_value('cars', map) cars,
       get_value('kids', map) kids,
       get_value('chances', map) chances,
       get_value('missed', map) missed,
FROM UNNEST(['cars:0,kids:143,cats:1,lost:0,win:1,chances:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0,missed:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0']) str,
     UNNEST([STRUCT(str_to_map(str, ',', ':') AS map)]);
+------+------+--------------------------------------+--------------------------------------+
| cars | kids | chances                              | missed                               |
+------+------+--------------------------------------+--------------------------------------+
| 0    | 143  | 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0  | 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0  |
+------+------+--------------------------------------+--------------------------------------+

Another super simple option for that particular case
select
json_value(json, '$.cars') cars,
json_value(json, '$.kids') kids,
json_value(json, '$.cats') cats,
json_value(json, '$.lost') lost,
json_value(json, '$.win') win,
json_value(json, '$.chances') chances,
json_value(json, '$.missed') missed
from your_table,
unnest([format('{%s}', regexp_replace(str, r'([^:,]+):([\d:]*\d)', r'"\1":"\2"'))]) json
with each key exposed as its own output column (cars, kids, cats, lost, win, chances, missed).

Related

Use DataFrame column value as input to select expression

I have a series of expressions used to map raw JSON data to normalized column data. I'm trying to think of a way to efficiently apply this to every row as there are multiple schemas to consider.
Right now, I have one massive CASE statement (built dynamically) that gets interpreted to SQL like this:
SELECT
CASE
WHEN schema = 'A' THEN CONCAT(get_json_object(payload, '$.FirstName'), ' ', get_json_object(payload, '$.LastName'))
WHEN schema = 'B' THEN get_json_object(payload, '$.Name')
END as name,
CASE
WHEN schema = 'A' THEN get_json_object(payload, '$.Telephone')
WHEN schema = 'B' THEN get_json_object(payload, '$.PhoneNumber')
END as phone_number
This works; I just worry about performance as the number of schemas and columns increases. I want to see if there's another way, and here is my idea.
I have a DataFrame expressions_df of valid SparkSQL expressions.
| schema | column       | column_expression                                                                             |
| ------ | ------------ | --------------------------------------------------------------------------------------------- |
| A      | name         | CONCAT(get_json_object(payload, '$.FirstName'), ' ', get_json_object(payload, '$.LastName'))   |
| A      | phone_number | get_json_object(payload, '$.Telephone')                                                         |
| B      | name         | get_json_object(payload, '$.Name')                                                              |
| B      | phone_number | get_json_object(payload, '$.PhoneNumber')                                                       |
This DataFrame is used as a lookup table of sorts against a DataFrame raw_df:
| schema | payload                                                                |
| ------ | ---------------------------------------------------------------------- |
| A      | {"FirstName": "John", "LastName": "Doe", "Telephone": "123-456-7890"}  |
| B      | {"Name": "Jane Doe", "PhoneNumber": "123-567-1234"}                    |
I'd like to do something like this where column_expression is passed to F.expr and used to interpret the SQL and return the appropriate value.
from pyspark.sql import functions as F

(
    raw_df
    .join(expressions_df, 'schema')
    .select(
        F.expr(column_expression)
    )
    .dropDuplicates()
)
The desired end result would be something like this, so that no matter what the original schema is, the data is transformed to the same standard using the expressions shown in the SQL or in expressions_df.
| name | phone_number |
| -------- | ------------ |
| John Doe | 123-456-7890 |
| Jane Doe | 123-567-1234 |
You can't directly use a DataFrame column value as an expression with the expr function. You'll have to collect all the expressions into a Python object in order to be able to pass them as parameters to expr.
Here's one way to do it: the expressions are collected into a dict, then for each schema a different select expression is applied. Finally, union all the DataFrames to get the desired output:
from collections import defaultdict
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

# group the expressions by schema
exprs = defaultdict(list)
for r in expressions_df.collect():
    exprs[r.schema].append(F.expr(r.column_expression).alias(r.column))

schemas = [r.schema for r in raw_df.select("schema").distinct().collect()]

# apply the per-schema select expressions, then union the results
final_df = reduce(DataFrame.union, [raw_df.filter(f"schema='{s}'").select(*exprs[s]) for s in schemas])
final_df.show()
#+--------+------------+
#| name|phone_number|
#+--------+------------+
#|Jane Doe|123-567-1234|
#|John Doe|123-456-7890|
#+--------+------------+

Multiple string substitutions in PostgreSQL

I have a column with abbreviations separated by spaces like this
'BG MSG'
Also, there's another table with substitutions
target replacement
----------------------
'BG', 'Brick Galvan'
'MSG', 'Mosaic Galvan'
The goal is to apply all the substitutions to the abbreviations to obtain something like
'Brick Galvan Mosaic Galvan' from 'BG MSG'
I know I could do
replace( replace('BG MSG', 'BG', 'Brick Galvan'), 'MSG', 'Mosaic Galvan')
But imagine there are hundreds of substitutions, and they can change from one day to the next. The resulting query will be hideous to maintain.
I mean, I could do a code generator that will create the query with all the nested replaces, but I'm looking for something more elegant and postgres-native.
I've found solutions like this one: How to replace multiple special characters in Postgres 9.5, but they seem to work only for single characters.
Let's say your tables look like this:
create table my_table(id serial primary key, abbrevs text);
insert into my_table (abbrevs) values
('BG MSG');
create table substitutions(target text, replacement text);
insert into substitutions values
('BG', 'Brick Galvan'),
('MSG', 'Mosaic Galvan');
You can get each abbreviation as a separate row:
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
id | abbrev
----+--------
1 | BG
1 | MSG
(2 rows)
and use them to join the substitution table and get full names:
select id, string_agg(replacement, ' ') as full_names
from (
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
) t
join substitutions on abbrev = target
group by id
id | full_names
----+----------------------------
1 | Brick Galvan Mosaic Galvan
(1 row)
Db<>fiddle.
A nested replace approach would work, but it is quite ugly, right?
SELECT REPLACE(REPLACE(REPLACE(REPLACE(…
Even after careful formatting to make it look readable, the best you can get is the following:
SELECT
  REPLACE(
    REPLACE(
      REPLACE(
        REPLACE(...
On the other hand, you might just use the LATERAL JOIN solution, which uses more characters but is definitely more readable.
-- Input: BG, MSG
-- Output: Brick Galvan, Mosaic Galvan
SELECT msg.Materials
FROM (SELECT 'BG, MSG' AS Materials) mt
INNER JOIN LATERAL (SELECT REPLACE(mt.Materials::text, 'BG', 'Brick Galvan') AS Materials) bg ON true
INNER JOIN LATERAL (SELECT REPLACE(bg.Materials::text, 'MSG', 'Mosaic Galvan') AS Materials) msg ON true;

Is it legit to store CQL tuples with null components in Cassandra 3.x

I have to store a protocol buffer structure in Cassandra 3.x. It is defined in a .proto file as:
message Attribute
{
    required string key = 1;
    oneof value {
        int64 integerValue = 2;
        float floatValue = 3;
        string stringValue = 4;
    }
}
To store multiple Attributes I was thinking about this CQL definition:
CREATE TABLE ... attributes map<text, tuple<int, float, text>> ...
and in each tuple two of the three components would actually be null. I haven't tested this syntax yet, but are there any downsides to using this approach? Maybe there is a better way, e.g. User Defined Types?
Let's try this out. I'll start with a simple table containing a valuemap column of type map<text, tuple<int,float,text>>, as you have above:
CREATE TABLE tupleTest (
    key text,
    value text,
    valuemap map<text, FROZEN<tuple<int,float,text>>>,
    PRIMARY KEY (key));
I'll INSERT some data:
INSERT INTO tupletest (key,value,valuemap) VALUES ('1','A',{'a':(0,0.0,'hi')});
INSERT INTO tupletest (key,value,valuemap) VALUES ('2','B',{'b':(0,null,'hi')});
INSERT INTO tupletest (key,value,valuemap) VALUES ('3','C',{'c':(null,null,'hi')});
And then I'll SELECT it, just to see:
aploetz#cqlsh:stackoverflow> SELECT * FROM tupletest ;
key | value | valuemap
-----+-------+---------------------------
3 | C | {'c': (None, None, 'hi')}
2 | B | {'b': (0, None, 'hi')}
1 | A | {'a': (0, 0, 'hi')}
(3 rows)
The main apprehension about explicitly INSERTing NULL values into Cassandra is that in "normal" columns they actually create tombstones. But since we are not setting an entire column to NULL, merely an element in a tuple (nested inside a map), this is not the case. In fact, they show up as None. And when I view the underlying SSTables, I also do not see evidence that a tombstone has been written.
Normally, I'd say that explicitly INSERTing a NULL into Cassandra is a terrible, terrible idea. But in this case, it shouldn't cause you any issues. Now, as to whether or not this is considered "legit" or good practice... well, my data modeling senses do not approve. I would find another way to represent the absence of a value in a tuple type, as someone (the developer who follows you) could see this and interpret it as being "ok" to explicitly INSERT NULLs into other column values.

How can I handle custom NULL string with Spark DataFrame

I have a data file that looks like this:
// data.txt
1 2016-01-01
2 \N
3 2016-03-01
I used \N to represent a null value for some reason. (It's not a special character; it's a string consisting of two characters: \ and N.)
I want to create a DataFrame like below:
case class Data(
  val id : Int,
  val date : java.time.LocalDate)

val df = sc.textFile("data.txt")
  .map(_.split("\t"))
  .map(p => Data(
    p(0).toInt,
    _helper(p(1))
  ))
  .toDF()
My question is: how can I write the helper method?
def _helper(s : String) = s match {
case "\\N" => null, // type error
case _ => LocalDate.parse(s, dateFormat)
}
This is where an Option type will come in handy.
I changed the custom null value to make the case more explicit, but the same approach works in your case. My data is in a .txt file like so:
Ryan,11
Bob,22
Kevin,23
Asop,-nnn-
Notice the -nnn- is my custom null. I use a slightly different case class:
case class DataSet(name: String, age: Option[Int])
And write a pattern matching function to capture the nuances of the situation:
def customNull(col: String): Option[Int] = col match {
  case "-nnn-" => None
  case _ => Some(Integer.parseInt(col))
}
From here it should work as expected when you combine the two:
val df = sc.textFile("./data.txt")
.map(_.split(","))
.map(p=>DataSet(p(0), customNull(p(1))))
.toDF()
When I do a df.show() I get the following:
+-----+----+
| name| age|
+-----+----+
| Ryan| 11|
| Bob| 22|
|Kevin| 23|
| Asop|null|
+-----+----+
Treating the ages as strings first gets around the problem. It might not be the fastest way to parse values like this. Ideally, you could also use an Either, but that can also get complex.
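Applied back to the date column in the question, the same Option-based pattern could look roughly like the sketch below. This is only a sketch under a couple of assumptions: parseDate is a made-up helper name, the dates are ISO formatted as in the sample file, and java.sql.Date is used instead of java.time.LocalDate because older Spark versions have no built-in encoder for the latter.

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// date is an Option so that "\N" can become a missing value (None -> null in the DataFrame)
case class Data(id: Int, date: Option[java.sql.Date])

// assumption: the sample file uses ISO dates (yyyy-MM-dd)
val dateFormat = DateTimeFormatter.ISO_LOCAL_DATE

// hypothetical helper standing in for the _helper from the question
def parseDate(s: String): Option[java.sql.Date] = s match {
  case "\\N" => None  // the literal two-character string \ and N
  case _ => Some(java.sql.Date.valueOf(LocalDate.parse(s, dateFormat)))
}

val df = sc.textFile("data.txt")
  .map(_.split("\t"))
  .map(p => Data(p(0).toInt, parseDate(p(1))))
  .toDF()

Rows containing \N then end up with a null date column instead of failing to parse.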

Accessing nested data in Spark

I have a collection of nested case classes. I've got a job that generates a dataset using these case classes, and writes the output to parquet.
I was pretty annoyed to discover that I have to manually do a load of faffing around to load and convert this data back to case classes to work with it in subsequent jobs. Anyway, that's what I'm now trying to do.
My case classes are like:
case class Person(userId: String, tech: Option[Tech])
case class Tech(browsers: Seq[Browser], platforms: Seq[Platform])
case class Browser(family: String, version: Int)
So I'm loading my parquet data. I can get the tech data as a Row with:
val df = sqlContext.load("part-r-00716.gz.parquet")
val x = df.head
val tech = x.getStruct(x.fieldIndex("tech"))
But now I can't find how to actually iterate over the browsers. If I try val browsers = tech.getStruct(tech.fieldIndex("browsers")) I get an exception:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to org.apache.spark.sql.Row
How can I iterate over my nested browser data using Spark 1.5.2?
Update
In fact, my case classes contain optional values, so Browser actually is:
case class Browser(family: String,
                   major: Option[String] = None,
                   minor: Option[String] = None,
                   patch: Option[String] = None,
                   language: String,
                   timesSeen: Long = 1,
                   firstSeenAt: Long,
                   lastSeenAt: Long)
I also have something similar for Os:
case class Os(family: String,
              major: Option[String] = None,
              minor: Option[String] = None,
              patch: Option[String] = None,
              patchMinor: Option[String],
              override val timesSeen: Long = 1,
              override val firstSeenAt: Long,
              override val lastSeenAt: Long)
And so Tech is really:
case class Technographic(browsers: Seq[Browser],
                         devices: Seq[Device],
                         oss: Seq[Os])
Now, given the fact that some values are optional, I need a solution that will allow me to reconstruct my case classes correctly. The current solution doesn't support None values, so for example given the input data:
Tech(browsers=Seq(
  Browser(family=Some("IE"), major=Some(7), language=Some("en"), timesSeen=3),
  Browser(family=None, major=None, language=Some("en-us"), timesSeen=1),
  Browser(family=Some("Firefox"), major=None, language=None, timesSeen=1)
))
I need it to load the data as follows:
family=IE, major=7, language=en, timesSeen=3,
family=None, major=None, language=en-us, timesSeen=1,
family=Firefox, major=None, language=None, timesSeen=1
Because the current solution doesn't support None values, it in fact has an arbitrary number of values per list item, i.e.:
browsers.family = ["IE", "Firefox"]
browsers.major = [7]
browsers.language = ["en", "en-us"]
timesSeen = [3, 1, 1]
As you can see, there's no way of converting the final data (returned by spark) into the case classes that generated it.
How can I work around this insanity?
Some examples:
// needed for the col() and explode() functions used below
import org.apache.spark.sql.functions.{col, explode}

// Select two columns
df.select("userId", "tech.browsers").show()
// Select the nested values only
df.select("tech.browsers").show(truncate = false)
+-------------------------+
|browsers |
+-------------------------+
|[[Firefox,4], [Chrome,2]]|
|[[Firefox,4], [Chrome,2]]|
|[[IE,25]] |
|[] |
|null |
+-------------------------+
// Extract the family (nested value)
// This way you can iterate over the persons, and get their browsers
// Family values are nested
df.select("tech.browsers.family").show()
+-----------------+
| family|
+-----------------+
|[Firefox, Chrome]|
|[Firefox, Chrome]|
| [IE]|
| []|
| null|
+-----------------+
// Normalize the family: One row for each family
// Then you can iterate over all families
// Family values are un-nested, empty values/null/None are handled by explode()
df.select(explode(col("tech.browsers.family")).alias("family")).show()
+-------+
| family|
+-------+
|Firefox|
| Chrome|
|Firefox|
| Chrome|
| IE|
+-------+
Based on the last example:
val families = df.select(explode(col("tech.browsers.family")))
  .map(r => r.getString(0)).distinct().collect().toList
println(families)
gives the unique list of browsers in a "normal" local Scala list:
List(IE, Firefox, Chrome)
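A note on the ClassCastException from the original question: tech.browsers is an array of structs, so on a Row it comes back as a Seq[Row] rather than a single Row, which is why getStruct fails there. Below is a rough sketch of iterating it directly, using the simplified two-field Browser from the top of the question; for the Option[...] fields in the update, each getAs would be wrapped in Option(...) so that nulls become None.

import org.apache.spark.sql.Row

val x = df.head
val tech = x.getStruct(x.fieldIndex("tech"))

// browsers is an ARRAY<STRUCT<...>>, so read it as a Seq of Rows rather than a single Row
val browserRows: Seq[Row] = tech.getSeq[Row](tech.fieldIndex("browsers"))

// rebuild the simple case class Browser(family: String, version: Int) from each nested Row
val browsers = browserRows.map { b =>
  Browser(
    b.getAs[String]("family"),
    b.getAs[Int]("version")
  )
}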
