Autoloader: empty structs - databricks

Any way around this:
I'm running structured streaming. My readStream config is:
cloudFiles_cfg = {
    "cloudFiles.subscriptionId": self.subscription_id,
    "cloudFiles.tenantId": self.tenant_id,
    "cloudFiles.clientId": self.client_id,
    "cloudFiles.clientSecret": self.client_secret,
    "cloudFiles.resourceGroup": self.hub_rg,
    "cloudFiles.connectionString": self.hub_datalake_conn,
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.queueName": f"{queue_name}",
    "cloudFiles.inferColumnTypes": "true",
    "cloudFiles.schemaLocation": checkpoint,
    "cloudFiles.schemaEvolutionMode": "addNewColumns"
}
I have writeStream set with the mergeSchema option set to true. When I read in a JSON file, the fields with data in them are inferred and included in the inferred schema, but the empty JSON objects (e.g. "field": {}) are not added (expected?). However, I would have expected the schemaEvolutionMode to add those columns and the write to be covered by my mergeSchema option. Instead, I get the UnknownFieldException error described in the docs here.
I wondered whether, if I ran it a second time, the columns would be added to the schema files in the checkpoint location, but they are not. A second schema file is created, but it does not include the new columns that were found; it just repeats the original schema.
If I remove inferColumnTypes it's OK, but everything is imported as string, whereas I would really like to keep the inferred types and have any blank fields just be structs.

Related

Transforming large array of objects to csv using json2csv

I need to transform a large array of JSON (that can have over 100k positions) into a CSV.
This array is created directly in the application; it's not the result of an uploaded file.
Looking at the documentation, I've thought about using parser, but it says:
For that reason, it is rarely a good idea to use it unless your data is very small or your application doesn't do anything else.
Because the data is not small and my app will do other things besides creating the CSV, I don't think it'll be the best approach, but I may be misunderstanding the documentation.
Is it possible to use the other options (async parser or transform) with already-created data (and not a stream of data)?
FYI: it's a Nest application, but I'm using this Node.js lib.
Update: I tried it with an array of over 300k positions, and it went smoothly.
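A minimal sketch of that kind of in-memory usage with json2csv's synchronous Parser (assuming the v5 API; the field names and the data variable are illustrative, not the real payload) would be:
const { Parser } = require('json2csv');
// field names are illustrative; replace them with the actual column names
const parser = new Parser({ fields: ['id', 'name', 'value'] });
// data is the already-created array of objects, no stream involved
const csv = parser.parse(data);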
Why do you need any external modules?
Converting JSON into a javascript array of javascript objects is a piece of cake with the native JSON.parse() function.
const fs = require('fs').promises; // promise-based fs API (the awaits below assume this runs inside an async function)
let jsontxt = await fs.readFile('mythings.json', 'utf8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw "Oooops, stranger things happen!"
And, then, converting a javascript array into a CSV is very straightforward.
The most obvious and absurd case is just mapping every element of the array into a string that is the JSON representation of the object element. You end up with a useless CSV with a single column containing every element of your original array. And then joining the resulting strings array into a single string, separated by newlines \n. It's good for nothing but, heck, it's a CSV!
let csvtxt = mythings.map(JSON.stringify).join("\n");
await fs.writeFile("mythings.csv",csvtxt,"utf8");
Now, you can feel that you are almost there. Replace the useless mapping function with your own
let csvtxt = mythings.map(mapElementToColumns).join("\n");
and choose a good mapping between the fields of the objects of your array, and the columns of your csv.
function mapElementToColumns(element) {
    return `${JSON.stringify(element.id)},${JSON.stringify(element.name)},${JSON.stringify(element.value)}`;
}
or, in a more thorough way
function mapElementToColumns(fieldNames) {
    return function (element) {
        // stringify each requested field, falling back to an empty quoted value when it is missing
        let fields = fieldNames.map(n => element[n] !== undefined ? JSON.stringify(element[n]) : '""');
        return fields.join(',');
    };
}
that you may invoke in your map
mythings.map(mapElementToColumns(["id","name","value"])).join("\n");
Finally, you might decide to use an automated for "all fields in all objects" approach; which requires that all the objects in the original array maintain a similar fields schema.
You extract all the fields of the first object of the array, and use them as the header row of the csv and as the template for extracting the rest of the elements.
let fieldnames = Object.keys(mythings[0]);
and then use this field names array as parameter of your map function
let csvtxt= mythings.map(mapElementToColumns(fieldnames)).join("\n");
and, also, prepending them as the CSV header
csvtxt = fieldnames.join(',') + "\n" + csvtxt;
Putting all the pieces together...
function mapElementToColumns(fieldNames) {
    return function (element) {
        let fields = fieldNames.map(n => element[n] !== undefined ? JSON.stringify(element[n]) : '""');
        return fields.join(',');
    };
}

const fs = require('fs').promises;
let jsontxt = await fs.readFile('mythings.json', 'utf8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw "Oooops, stranger things happen!";
let fieldnames = Object.keys(mythings[0]);
let csvtxt = mythings.map(mapElementToColumns(fieldnames)).join("\n");
csvtxt = fieldnames.join(',') + "\n" + csvtxt; // prepend the header row
await fs.writeFile("mythings.csv", csvtxt, "utf8");
And that's it. Pretty neat, huh?

Atomic way of inserting a row if not exist in bigtable

We would like to insert a row into Bigtable only if it does not already exist. Our idea is to use the CheckAndMutateRow API with an onNoMatch insert. We are using the Node.js SDK; the idea would be to do the following (it seems to work, but we don't know about the atomicity of the operation):
const row = table.row('phone#4c410523#20190501');
const filter = [];
const config = {
    onNoMatch: [
        {
            method: 'insert',
            data: {
                stats_summary: {
                    os_name: 'android',
                    timestamp,
                },
            },
        },
    ],
};
await row.filter(filter, config);
CheckAndMutateRow is atomic. Based on the API definition:
Mutations are applied atomically and in order, meaning that earlier mutations can be masked / negated by later ones. Cells already present in the row are left unchanged unless explicitly changed by a mutation.
After committing the accumulated mutations, resets the local mutations.
MutateRow does an upsert. So if you give it a rowkey, column name and timestamp it will create a new cell if it doesn't exist, or overwrite it otherwise. You can achieve this behavior with a "simple write".
Conditional writes are used, e.g., if you want to check a value before you overwrite it. Let's say you want to set column A to X only if column B is Y, or overwrite column A only if column A's current value is Z.
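To illustrate (a rough sketch, not a definitive implementation): with the Node.js client, a simple upsert write and a conditional write could look roughly like the following. The stats_summary family and row key are borrowed from the question, os_build and its value are made-up illustrative names, and the exact filter shape should be double-checked against the @google-cloud/bigtable version in use.
// Simple write (upsert): creates the cell if it doesn't exist, overwrites it otherwise
await table.insert({
    key: 'phone#4c410523#20190501',
    data: {
        stats_summary: {
            os_name: 'android',
        },
    },
});

// Conditional write: mutate column A only when column B matches a value
const condRow = table.row('phone#4c410523#20190501');
const condFilter = [
    { family: 'stats_summary' },
    { column: 'os_name' },
    { value: 'android' }, // "only if column B is Y"
];
const condConfig = {
    onMatch: [
        {
            method: 'insert', // a set-cell mutation, i.e. "set column A to X"
            data: {
                stats_summary: {
                    os_build: 'PQ2A.190405.003', // illustrative value
                },
            },
        },
    ],
};
await condRow.filter(condFilter, condConfig);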

bigQuery: PartialFailureError on table insert

I'm trying to insert a data row into a BigQuery table as follows:
await bigqueryClient
    .dataset(DATASET_ID)
    .table(TABLE_ID)
    .insert(row);
But I get a PartialFailureError when deploying the cloud function.
The table schema has a name (string) field and a campaigns (record/repeated) field, which I created manually from the console:
hotel_name      STRING   NULLABLE
campaigns       RECORD   REPEATED
    campaign_id     STRING   NULLABLE
    platform_id     NUMERIC  NULLABLE
    platform_name   STRING   NULLABLE
    reporting_id    STRING   NULLABLE
And the data I'm inserting is an object like this:
const row = {
    hotel_name: hotel_name, // string
    campaigns: {
        id: item.id, // string
        platform_id: item.platform_id, // int
        platform_name: item.platform_name, // string
        reporting_id: item.reporting_id, // string
    },
};
The errors logged don't give much clue about the issue.
These errors suck. The actual info about what went wrong can be found in the errors property on the PartialFailureError. In https://www.atdatabases.org we reformat the error to make this easier using: https://github.com/ForbesLindesay/atdatabases/blob/0e1a033264aac33deaff2ab753796049e623ab86/packages/bigquery/src/implementation/BigQueryDriver.ts#L211
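For example, a rough way to surface those details yourself (reusing the bigqueryClient call from the question) is to catch the error and dump its errors property:
try {
    await bigqueryClient
        .dataset(DATASET_ID)
        .table(TABLE_ID)
        .insert(row);
} catch (err) {
    // PartialFailureError carries the per-row insert failures in err.errors
    if (err && err.errors) {
        console.error(JSON.stringify(err.errors, null, 2));
    }
    throw err;
}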
According to my test, it seems there are two errors here. The first is that you have campaign_id in the schema but id in the JSON.
The second is related to the format of REPEATED mode data in JSON. The documentation mentions the following:
Notice that the addresses column contains an array of values (indicated by [ ]). The multiple addresses in the array are the repeated data. The multiple fields within each address are the nested data.
It's not stated so directly in the mentioned document (it can probably be found somewhere else), but when you use REPEATED mode you should use brackets [].
I tested it briefly on my side and it seems it should work like this:
const row = {
    hotel_name: hotel_name, // string
    campaigns: [
        {
            campaign_id: item.id, // string
            platform_id: item.platform_id, // int
            platform_name: item.platform_name, // string
            reporting_id: item.reporting_id, // string
        },
    ],
};

How to keep remark when update type of a column in liquibase?

I'm trying to update the type of a column using modifyDataType in the Groovy DSL. After I run liquibase update via the liquibase-maven-plugin (version 3.8.9), I found that the column's remark disappeared.
Here is my code:
changeSet(author: "root", id: "20201218-modify-data-type") {
modifyDataType(columnName: "description", newDataType: "text", tableName: "t_user")
rollback {
modifyDataType(columnName: "description", newDataType: "varchar(2000)", tableName: "t_user")
}
}
I cannot add a remarks parameter to modifyDataType(), because 'remarks' is an invalid property for 'modifyDataType' changes.
The way MySQL implements its ALTER TABLE MODIFY COLUMN is as if it drops the old column and adds a new one. This loses not just remarks, but also constraints, primary keys, and anything else related to the "old" column.
We've not added all those fields to modifyDataType, since there is really no end to what we'd have to include there. Instead, if you modify a data type on MySQL, you need to re-set the remarks, primary keys, etc. in separate calls.
There is a warning we output for MySQL about losing primary key/constraint info, but we only show it for non-varchar data types to keep the number of warnings down. Since the problem affects things like remarks, which can be on varchars, it's probably worth removing the non-varchar part of that check.

Accessing and looping though nested Json in Groovy

So I have been stuck with this problem for quite some time now.
I'm working with some data provided as JSON and retrieved via WSLITE using Groovy.
So far, handling the JSON structure and finding data was no problem, since the webservice response provided by WSLITE
is returned as an instance of JsonSlurper.
I now came across some cases where the structure of the JSON response varies depending on what I query for.
Structure looks like this:
{
    "result": [
        ...
        "someId": [...]
        "somethingElse": [...]
        ...
        "variants": {
            "0123": {
                "description": "somethingsomething",
                ...,
                ...,
                "subVariant": {
                    "1234001": {
                        name: "somename",
                        ...
                    }
                }
            },
            "4567": {
                "description": "somethingsomething",
                ...,
                ...,
                "subVariant": {
                    "4567001": {
                        name: "somename"
                        ...
                        ...
                    }
                }
            }
        }
    ]
}
As you can see, the variants node is an object holding another nested object, which holds even more nested objects.
The problem here is that the variants node sometimes holds more than one object, but always at least one. Same for the subVariant node.
Furthermore, I do not know the names of the (numeric) nodes in advance, hence I have to build the path dynamically.
Since JSON is handled as maps, lists, and arrays in Groovy, I thought I would be able to do this:
def js = new JsonSlurper().parseText(jsonResponse)
println js.result.variants[0].subVariant[0].name // js.result.variants.getClass() prints class java.util.ArrayList
but that returns null
while accessing the data statically like this:
println js.result.variants."0123".subVariant."1234001".name
works just fine.
Just printing the variants like this
println js.result.variants
also works fine. It prints the first (and only) content of the whole variants tree.
My question is: why is that?
It kind of seems like I cannot address nested objects by index... am I right? How else would I go about that?
Thanks in advance.
Well, the reason you can't access it using [0] is simply that "variants": { ... isn't an array; an array would look like this: "variants": [ ...
So your variants is an object with a variable number of attributes, which is basically a named map.
However, if you don't know the entry names, you can iterate through them using .each, so you could do something like js.result.variants.each { key, val -> println val.description }
And something similar for the subVariant.
