U-SQL Error in Naming the Column - azure

I have a JSON array in which the order of the fields is not fixed,
i.e. I can have [A, B, C] or [B, C, A].
A, B and C are all JSON objects of the form {Name: x, Value: y}.
I use U-SQL to extract the JSON (without knowing the order of the fields) and write it to a CSV, for which I need column names:
@output =
SELECT
A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
B["Value"] ?? "0" AS "System_" + B["Name"],
System_da
So I am trying to use the "Name" field in the JSON as the column name.
But I am getting an error at the ### marker above:
Message
syntax error. Expected one of: FROM ',' EXCEPT GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE ';' ')'
Resolution
Correct the script syntax, using expected token(s) as a guide.
Description
Invalid syntax found in the script.
Details
at token '(', line 74
near the ###:
**************
I am not allowed to set the column name dynamically, and doing so is an absolute necessity for my use case.
Input: [A, B, C,], [C, B, A]
Output: A.name B.name C.name
Row 1's values
Row 2's values

This
@output =
SELECT
A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
B["Value"] ?? "0" AS "System_" + B["Name"],
System_da
is not a valid SELECT clause (neither in U-SQL nor any other SQL dialect I am aware of).
What is the JSON Array? Is it a key/value pair? Or positional? Or a single value in the array that you want to have a marker for whether it is present in the array?
From your example, it seems that you want something like:
Input:
[["A","B","C"],["C","D","B"]]
Output:
A B C D
true true true false
false true true true
If that is the case, I would write it as:
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@input =
SELECT "[[\"A\", \"B\", \"C\"],[\"C\", \"D\", \"B\"]]" AS json
FROM (VALUES (1)) AS T(x);
@data =
SELECT JsonFunctions.JsonTuple(arrstring) AS a
FROM @input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json).Values) AS T(arrstring);
@data =
SELECT a.Contains("A") AS A, a.Contains("B") AS B, a.Contains("C") AS C, a.Contains("D") AS D
FROM (SELECT a.Values AS a FROM @data) AS t;
OUTPUT @data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);
If you need something more dynamic, either use the resulting SqlArray or SqlMap or use the above approach to generate the script.
However, I wonder why you would model your information this way in the first place. I would recommend finding a more appropriate way to mark the presence of the value in the JSON.
UPDATE: I missed your comment that the inner array members are objects with two key-value pairs, where one is always called Name (for the property name) and the other is always called Value (for the property value). So here is the answer for that case.
First: Modelling key value pairs in JSON using {"Name": "propname", "Value" : "value"} is a complete misuse of the flexible modelling capabilities of JSON and should not be done. Use {"propname" : "value"} instead if you can.
So changing the input, the following will give you the pivoted values. Note that you will need to know the values ahead of time and there are several options on how to do the pivot. I do it in the statement where I create the new SqlMap instance to reduce the over-modelling, and then in the next SELECT where I get the values from the map.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@input =
SELECT "[[{\"Name\":\"A\", \"Value\": 1}, {\"Name\": \"B\", \"Value\": 2}, {\"Name\": \"C\", \"Value\":3 }], [{\"Name\":\"C\", \"Value\": 4}, {\"Name\":\"D\", \"Value\": 5}, {\"Name\":\"B\", \"Value\": 6}]]" AS json
FROM (VALUES (1)) AS T(x);
@data =
SELECT JsonFunctions.JsonTuple(arrstring) AS a
FROM @input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json)) AS T(rowid, arrstring);
@data =
SELECT new SqlMap<string, string>(
a.Values.Select((kvp) =>
new KeyValuePair<string, string>(
JsonFunctions.JsonTuple(kvp)["Name"]
, JsonFunctions.JsonTuple(kvp)["Value"])
)) AS kvp
FROM @data;
@data =
SELECT kvp["A"] AS A,
kvp["B"] AS B,
kvp["C"] AS C,
kvp["D"] AS D
FROM @data;
OUTPUT @data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);
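Because the property names have to be known ahead of time, one way to follow the "generate the script" suggestion is to build the final SELECT from a list of names collected beforehand. The following is only a rough sketch in Python: the helper name build_pivot_select, the default rowset names, and the restriction to plain identifiers are assumptions made for illustration and are not part of the answer above.

```python
import re


def build_pivot_select(names, source="@data", target="@data"):
    """Return a U-SQL SELECT that reads one column per name from the kvp map."""
    for name in names:
        # Only plain identifiers are accepted; anything else would need quoting.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"not a plain identifier: {name!r}")
    columns = ",\n".join(f'       kvp["{n}"] AS {n}' for n in names)
    return f"{target} =\nSELECT\n{columns}\nFROM {source};"


script = build_pivot_select(["A", "B", "C", "D"])
print(script)
```

The generated text can then be pasted into (or concatenated with) the rest of the script before it is submitted.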


Problem with selecting json object in Pyspark, which may sometime have Null values

I have this big nested json object from which I need to make a Dataframe. One of the inner json elements sometimes come as empty and sometimes it comes with some values in it.
I am giving a simple example here:
When it is filled:
{"student_address": {"Door Number":"1234",
"Place":"xxxx",
"Zip code":"12345"}}
When it is empty:
{"student_address":""}
So, in the final DataFrame I have all the three columns Door Number, Place and Zip code. When the address is empty, I should put Null values in the respective columns and should fill them when there is data.
The code I tried:
test = test.withColumn("place",when(col("student_address") == "", lit(None)).otherwise(col("student_address.place")))\
.withColumn("door_num",when(col("student_address") == "",lit(None)).otherwise(col("student_address.door_num")))\
.withColumn("zip_code",when(col("student_address") == "", lit(None)).otherwise(col("student_address.zip_code")))
So, I am trying to check whether the value is empty or not.
This is the error I am getting:
AnalysisException: Can't extract value from student_address#34: need struct type but got string
I do not understand why PySpark checks the expression in otherwise when the condition in when is already met. (When I tried simple values in otherwise instead of a JSON path, it worked.)
I am struggling to understand what is happening here and would like to know if there is any simple way to do this.
val addressSchema = StructType(StructField("Place", StringType, false) :: Nil) // Add more fields
val schema = StructType(StructField("address", addressSchema, true) :: Nil) // the point is that address is nullable
val df = spark.read.schema(schema).json("example.json")
Alternatively, use get_json_object: pass the student_address column as the first argument and a JSON path to the field you want as the second, and name the resulting column after that field.
df.withColumn("place",when(df.student_address== "", lit(None)).otherwise(get_json_object(col("student_address"),"$.student_address.place")))
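The AnalysisException is raised while Spark analyzes the whole expression, before any row is evaluated, so the otherwise branch has to type-check even for rows where the when condition holds. One way around this, sketched below under assumptions, is to normalize each record before Spark reads it so that student_address is always an object. The helper name normalize_record and the field tuple are hypothetical; the field names are taken from the sample JSON in the question.

```python
import json

# Field names as they appear in the sample JSON from the question.
ADDRESS_FIELDS = ("Door Number", "Place", "Zip code")


def normalize_record(line):
    """Parse one JSON line and make student_address a dict with all fields."""
    record = json.loads(line)
    address = record.get("student_address")
    if not isinstance(address, dict):
        # An empty string (or any non-object value) becomes an all-None address.
        address = {}
    record["student_address"] = {f: address.get(f) for f in ADDRESS_FIELDS}
    return record


filled = normalize_record(
    '{"student_address": {"Door Number": "1234", "Place": "xxxx", "Zip code": "12345"}}'
)
empty = normalize_record('{"student_address": ""}')
print(filled["student_address"]["Place"])  # xxxx
print(empty["student_address"]["Place"])   # None
```

The normalized records could then be written back out as JSON lines and read with an explicit schema, so that every row has the same struct type.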

Select Query containing tuple With mixed single as well as double quotes

I have a PostgreSQL select query whose IN list is built from a tuple containing values in both single and double quotes. When I pass this tuple to the select query, it generates an error stating that a specific value is not present in the database.
I have tried converting the list of values to a JSON list with double quotes, but that doesn't help either.
list = ['mango', 'apple', "chikoo's", 'banana', "jackfruit's"]
query = """select category from "unique_shelf" where "Unique_Shelf_Names" in {}""" .format(list)
ERROR: column "chikoo's" doesn't exist
In fact, chikoo's does exist,
but because of the double quotes the query is not fetching the value.
Firstly, please don't use list as a variable name: list is the name of a built-in type, and you don't want to shadow it.
Secondly, using "" around tables and columns is bad practice, use ` instead.
Thirdly, when you format an array, it outputs as
select category from `unique_shelf`
where `Unique_Shelf_Names` in (['mango', 'apple', "chikoo's", 'banana', "jackfruit's"])
Which is not a valid SQL syntax.
You can join all values with a comma
>>> print("""select category from `unique_shelf` where `Unique_Shelf_Names` in ({})""".format(','.join(l)))
select category from `unique_shelf`
where `Unique_Shelf_Names` in (mango,apple,chikoo's,banana,jackfruit's)
The issue here is that the values inside the in bracket are not quoted. We can do that by formatting them beforehand using double quotes(")
l = ['mango', 'apple', "chikoo's", 'banana', "jackfruit's"]
list_with_quotes = ['"{}"'.format(x) for x in l]
query = """
select category from `unique_shelf`
where `Unique_Shelf_Names` in ({})""" .format(','.join(list_with_quotes))
This will give you an output of
select category from `unique_shelf`
where `Unique_Shelf_Names` in ("mango","apple","chikoo's","banana","jackfruit's")
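Note that in PostgreSQL a double-quoted token is read as an identifier, which is why the original error message talks about a column named chikoo's. An alternative that avoids quoting the values by hand is to let the driver bind them as parameters. The sketch below uses Python's built-in sqlite3 module only so that it runs anywhere; the table contents are invented for the demo, and with psycopg2 the placeholder would be %s rather than ?.

```python
import sqlite3

names = ['mango', 'apple', "chikoo's", 'banana', "jackfruit's"]

# In-memory stand-in for the real table; the rows are made up for the demo.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE unique_shelf (category TEXT, "Unique_Shelf_Names" TEXT)')
conn.executemany(
    "INSERT INTO unique_shelf VALUES (?, ?)",
    [("fruit", "chikoo's"), ("fruit", "mango"), ("veg", "carrot")],
)

# One placeholder per value; the driver takes care of quoting and escaping.
placeholders = ", ".join("?" for _ in names)
query = (
    'SELECT category, "Unique_Shelf_Names" FROM unique_shelf '
    f'WHERE "Unique_Shelf_Names" IN ({placeholders}) '
    'ORDER BY "Unique_Shelf_Names"'
)
rows = conn.execute(query, names).fetchall()
print(rows)  # [('fruit', "chikoo's"), ('fruit', 'mango')]
```

Only the placeholders are formatted into the SQL text; the values themselves never become part of the query string.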

Getting metadata in plain SQL statement in Slick 3.1.x

In the following plain SQL statement in Slick I know beforehand that it will return a list of (String, String)
sql"""select c.name, s.name
from coffees c, suppliers s
where c.price < $price and s.id = c.sup_id""".as[(String, String)]
But what if I don't know the column types? Can I analyze the metadata and retrieve the values? In JDBC I could use getInt(n) and getString(n), is there anything similar in Slick?
You can use tsql (Type-Checked SQL Statements):
tsql"""select c.name, s.name
from coffees c, suppliers s
where c.price < $price and s.id = c.sup_id"""
This produces a DBIOAction of the correct type, here a DBIO[Seq[(String, String)]] (depending on the column types), without requiring a call to .as.
Note: I've found it a little flakey (to the point of being unusable) with option types, so beware if your columns can be null (since null: String).
This requires a little wiring up: you need @StaticDatabaseConfig (e.g. on your DAO), because these types are checked against the database at compile time:
// annotate the object
@StaticDatabaseConfig("file:src/main/resources/application.conf#tsql")
...
val dc = DatabaseConfig.forAnnotation[JdbcProfile]
import dc.driver.api._
val db = dc.db
// to pull out a Future[Seq[(String, String)]]
// use db.run(tsql"...")
// to pull out a Future[Option[(String, String)]]
// use db.run(tsql"...".headOption)
// etc.

Google BigQuery Replace function for string type

I am trying to replace certain customer names in my data.
Using the REPLACE function in Google BigQuery SQL, I was able to transform one particular substring into another:
Replace(CustomerName, 'ABC', 'XYZ')
However, I have a couple more that I would need to use the replace function such that
Replace(CustomerName, 'PLO', 'Rustic')
Replace(CustomerName, 'Kix', 'BowWow')
and so on.
I've tried doing
Replace(CustomerName, 'ABC', 'XYZ') OR Replace(CustomerName, 'PLO', 'Rustic') OR Replace(CustomerName, 'Kix', 'BowWow')
but that got me an error message.
I've also tried
Replace(CustomerName, 'ABC', 'XYZ') AND Replace(CustomerName, 'PLO', 'Rustic') AND Replace(CustomerName, 'Kix', 'BowWow')
but that also got me an error message.
I am able to just use "case when statement" and then hardcode each one, but I'm wondering if there is a better/faster way to just use replace statement instead.
Thanks for your help.
The CASE WHEN option is pretty reasonable. Another option is to chain them together:
REPLACE(
REPLACE(
REPLACE(
CustomerName,
'ABC',
'XYZ'),
'PLO',
'Rustic'),
'Kix',
'BowWow')
Which one you pick really depends on the exact scenario. The chained REPLACE calls are probably faster, but they could overlap in weird ways (e.g., if the output to one replacement matches the input to a subsequent one). The CASE WHEN approach avoids that issue, but it's probably more expensive because you need to do one operation to find the substring and another to actually replace it.
Note that when you're using AND or OR, you're trying to combine the string output of REPLACE as if it were a boolean, which is why it's failing.
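The overlap problem is easier to see outside SQL. The sketch below is plain Python, not BigQuery, and the function names are made up for illustration: the first function mimics nested REPLACE calls, the second replaces every match in a single pass so that the output of one replacement is never fed into the next.

```python
import re


def chained_replace(text, pairs):
    """Apply each (old, new) pair in order, like nested REPLACE calls."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


def single_pass_replace(text, pairs):
    """Replace every match exactly once, so outputs are never re-matched."""
    mapping = dict(pairs)
    # Longer keys first so that a shorter key cannot shadow a longer one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


pairs = [("ABC", "XYZ"), ("XYZ", "QWE")]
print(chained_replace("ABC and XYZ", pairs))      # QWE and QWE
print(single_pass_replace("ABC and XYZ", pairs))  # XYZ and QWE
```

With the chained version, ABC first becomes XYZ and is then rewritten again to QWE, which is the "weird overlap" described above.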
When you have a large number of replacements, chaining REPLACE calls becomes impractical and tedious manual work.
The query below addresses this (assuming you maintain a lookup table with pairs: Word, Replacement):
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
You can test it using below example
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM (
SELECT CustomerName FROM
(SELECT '1234ABC567' AS CustomerName),
(SELECT '12 34 PLO 56' AS CustomerName),
(SELECT 'Kix' AS CustomerName),
(SELECT '98 ABC PLO Kix ABC 76 XYZ 54' AS CustomerName),
(SELECT 'ABCQweKIX' AS CustomerName)
) YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM (
SELECT Word, Replacement FROM
(SELECT 'XYZ' AS Word, 'QWE' AS Replacement),
(SELECT 'ABC' AS Word, 'XYZ' AS Replacement),
(SELECT 'PLO' AS Word, 'Rustic' AS Replacement),
(SELECT 'Kix' AS Word, 'BowWow' AS Replacement)
)
) ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
Please note: there is still an issue if the result of one replacement matches the input of a subsequent replacement.
I believe there are multiple ways to tackle this problem, and it depends on the size of your dataset, practicality of simply making a guiding table by hand and uploading it to BigQuery, and the granularity of the data you want to replace.
If your values are very granular, you can create a table with "from" and "to" values on different columns, and join that table with your main table, and retrieve those values very cleanly.
# Replace the support_table table with your actual table
WITH support_table AS (
SELECT "ABC" AS OldValue, "XYZ" AS NewValue
)
SELECT main_table.OldValue, support_table.NewValue FROM main_table
JOIN support_table ON main_table.OldValue = support_table.OldValue
Now, if you want to replace a big list of different values with something, you can use REGEXP_REPLACE with a string containing all possible values.
If you have a very big list of items, you can use STRING_AGG on a table with all the values you want to replace, or skip the STRING_AGG step and create said string by hand.
Both of the snippets below result in "item1|item2|item3". Choose which is faster for you to do.
# Replace the values_to_replace table with your actual table
WITH values_to_replace AS (
SELECT "item1" AS ColumnWithItemsToReplace
UNION ALL
SELECT "item2"
UNION ALL
SELECT "item3"
)
SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM values_to_replace;
SELECT r"item1|item2|item3"
STRING_AGG will retrieve all the values from a table or query and concatenate them using a separator of choice. If you use the pipe separator, you will be able to create a string like "item1|item2|item3|..."
For a regular expression, the pipe counts as "or", which means that the regex will interpret the string as "item1 or item2 or item3". Thus, if you pass that generated string to REGEXP_REPLACE as the values to be replaced, it will be considered valid.
Example code below:
REGEXP_REPLACE(
column_to_replace
,(SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM `YourTable`)
,"Replacer"
)
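One caveat with a pattern built by joining raw values: any value containing a regex metacharacter (a dot, a plus sign, a parenthesis) changes what the pattern matches. The Python sketch below only illustrates that pitfall and is not BigQuery code; the helper name build_alternation and the sample items are assumptions.

```python
import re


def build_alternation(items):
    """Join items with '|' after escaping regex metacharacters."""
    return "|".join(re.escape(item) for item in items)


items = ["item1", "item2", "a.b"]
pattern = build_alternation(items)
print(pattern)                                         # item1|item2|a\.b
print(re.sub(pattern, "Replacer", "item1, axb, a.b"))  # Replacer, axb, Replacer
```

Without the escaping step, the unescaped dot in a.b would also match axb.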
Hope it helps.

Oracle spatial data operator - SDO_nn - Not getting any results for sdo_num_res = 1

I am using SDO_NN operator to find the nearest hydrant next to a building.
Building:
CREATE TABLE "BUILDINGS"
(
"NAME" VARCHAR2(40),
"SHAPE" "SDO_GEOMETRY")
Hydrant:
CREATE TABLE "HYDRANTS"
( "NAME" VARCHAR2(10),
"POINT" "SDO_POINT_TYPE"
);
I have setup spatial indexes properly for buildings.shape and I run the query to get the nearest hydrant to the building 'Motel'
select b1.name as name, h.point.x as x, h.point.y as y from buildings b1, hydrants h where b1.name ='Motel' and
SDO_nn( b1.shape, MDSYS.SDO_GEOMETRY(2003,NULL, NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),
SDO_ORDINATE_ARRAY( h.point.x,h.point.y)), 'sdo_num_res=1')= 'TRUE';
Here's the problem:
When I set the parameter sdo_num_res=1, I get zero tuples.
And when I make sdo_num_res=2, I get one tuple.
What is the reason for the weird behavior ?
Note: I am getting zero rows only when building.name= 'Motel', for all other tuples I am getting 1 row when sdo_num_res = 1
Edit:
Insert queries
Insert into buildings (NAME,SHAPE) values ('Motel',MDSYS.SDO_GEOMETRY(2003,NULL,NULL,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447)));
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
To compare a point with a polygon, define the point as an SDO_GEOMETRY with SDO_GTYPE=2001 and the SDO_POINT attribute set to the SDO_POINT_TYPE you want to compare:
MDSYS.SDO_GEOMETRY(2001, NULL, SDO_POINT_TYPE(-79, 37, NULL), NULL, NULL)
First of all, your query does not do what you say it does: it actually returns the nearest building called "Motel" from any of your hydrants. To do what you want (i.e. the opposite) you need to reverse the order of the arguments to SDO_NN: all spatial operators search the first argument, using the value of the second argument.
Then the insert into your HYDRANTS table is wrong:
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
The SDO_POINT_TYPE object is not designed to be used that way: it is only used inside the SDO_GEOMETRY type. The proper way is this:
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
And of course you need to change your table definition accordingly.
Then your building is also incorrectly created: a polygon must always close, i.e. the last point must be the same as the first point. So the proper shape should be like this:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
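The closing rule can be checked before the ordinates are sent to the database. The following is a small sketch in Python; the helper name close_ring is hypothetical and the function only deals with a flat list of 2D ordinates like the one passed to SDO_ORDINATE_ARRAY above.

```python
def close_ring(ordinates):
    """Return a copy of a flat [x1, y1, x2, y2, ...] list whose last point equals the first."""
    if len(ordinates) < 6 or len(ordinates) % 2 != 0:
        raise ValueError("expected at least three x,y pairs")
    first, last = ordinates[:2], ordinates[-2:]
    if first == last:
        # Already closed: return an unchanged copy.
        return list(ordinates)
    # Repeat the first point at the end to close the ring.
    return list(ordinates) + list(first)


motel = [564, 425, 585, 436, 573, 458, 552, 447]
print(close_ring(motel))
# [564, 425, 585, 436, 573, 458, 552, 447, 564, 425]
```

The result matches the ordinate list used in the corrected insert statement.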
Here is the full example:
Create the tables:
create table buildings (
name varchar2(40) primary key,
shape sdo_geometry
);
create table hydrants(
name varchar2(10) primary key,
point sdo_geometry
);
Populate the tables:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
commit;
Confirm that the geometries are all correct:
select name, sdo_geom.validate_geometry_with_context (point, 0.05) from hydrants;
select name, sdo_geom.validate_geometry_with_context (shape, 0.05) from buildings;
Setup spatial metadata and create spatial indexes:
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'BUILDINGS',
'SHAPE',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index buildings_sx on buildings (shape)
indextype is mdsys.spatial_index;
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'HYDRANTS',
'POINT',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index hydrants_sx on hydrants (point)
indextype is mdsys.spatial_index;
Now try the properly written query:
select h.name, h.point.sdo_point.x as x, h.point.sdo_point.y as y
from buildings b, hydrants h
where b.name ='Motel'
and sdo_nn(h.point, b.shape, 'sdo_num_res=1')= 'TRUE';
which returns:
NAME X Y
---------------- ---------- ----------
p57 589 448
1 row selected.
