Kudu nested field - apache-spark

I have a question about Kudu and nested fields.
I receive JSON from Kafka like this:
{
  "ts": 32,
  "status": "success",
  "uid": "3232",
  "url": "http://some_url",
  "syncpixel": "http://some_url",
  "dfp": {
    "DFP_UABrowser": "Chrome 61",
    "DFP_UAOperatingSystem": "Windows 7 ver.7.0",
    "JavascriptDisplayData_Screen_W_x_H": "1440 x 900",
    "Native_client": true
  }
}
The dfp field is a nested object, and I want to insert this object into Kudu through Flume.
I know that Kudu does not support nested fields, but it does support binary columns.
What do I need to do?
Should I convert the dfp field to binary format and read it back with, for example, Scala/Spark?
Or should I flatten the JSON? (In many cases that is not the best option, e.g. streaming product purchases with a product id, name and other attributes, or product views on a page.)

If you use Spark/Scala streaming, this will not be an issue as long as your cluster is set up properly.
Read the entire JSON with Spark and use the "explode" function to flatten it.
This will make life easier.
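To make the flattening step concrete, here is a minimal Scala sketch assuming the JSON layout from the question; the input path, the output column names, and the batch (non-streaming) read are assumptions for brevity, and the write to Kudu would follow with whichever integration you use (Flume or the kudu-spark connector).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object FlattenDfp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("flatten-dfp").getOrCreate()

    // Schema matching the JSON from the question.
    val schema = new StructType()
      .add("ts", LongType)
      .add("status", StringType)
      .add("uid", StringType)
      .add("url", StringType)
      .add("syncpixel", StringType)
      .add("dfp", new StructType()
        .add("DFP_UABrowser", StringType)
        .add("DFP_UAOperatingSystem", StringType)
        .add("JavascriptDisplayData_Screen_W_x_H", StringType)
        .add("Native_client", BooleanType))

    // Hypothetical input path; with Kafka this would be the parsed message value.
    val raw = spark.read.schema(schema).json("/path/to/events")

    // Promote the nested dfp fields to top-level columns so each row fits a flat Kudu table.
    val flat = raw.select(
      col("ts"), col("status"), col("uid"), col("url"), col("syncpixel"),
      col("dfp.DFP_UABrowser").alias("dfp_ua_browser"),
      col("dfp.DFP_UAOperatingSystem").alias("dfp_ua_operating_system"),
      col("dfp.JavascriptDisplayData_Screen_W_x_H").alias("dfp_screen_w_x_h"),
      col("dfp.Native_client").alias("dfp_native_client"))

    flat.show(truncate = false)
  }
}

For array-valued fields (the product-list case mentioned in the question), explode(col(...)) from org.apache.spark.sql.functions on a hypothetical products array column would instead produce one flat row per array element.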

Related

JMeter: Connect to PostgreSQL in JSR223 using Groovy and then compare values from multiple tables in DB with API response

Sorry for the long post, but I really need some guidance here. I need to compare values from an API response with values from multiple tables in the DB.
Currently, I am doing it as follows:
Use a JDBC Connection Configuration to connect to the Postgres DB and then use the JDBC Sampler to execute queries. I use it three times to query 3 different tables and store the data in variables (let's call them DBVariables). Please see this image for the current JMeter setup: https://i.stack.imgur.com/GZJyF.png
In a JSR223 Assertion, I have written code that takes data from various DBVariables and compares it against the API response.
However, my issue is that the API response can contain an array of records with nested arrays inside each (please see the API response sample below), and these array elements can be sorted in any order. This is where I have issues.
I was wondering what would be the most efficient way of writing this JSR223 Assertion to validate that all data elements returned by the API are the same as what is in the DB.
I am very new to Groovy, but I think if I can query the DB inside the JSR223 Assertion (instead of using the JDBC Sampler), then the comparison can be done by storing the API response in a map and the DB response in another map, sorting both, and comparing the items.
My questions are:
How can I connect to PostgreSQL using Groovy and then execute query statements? I have not done that before and was hoping someone could provide sample code.
How can I store the API response and DB responses in maps, then sort and compare them in Groovy?
The API response is of the following type:
{
  "data": {
    "response": {
      "employeeList": [
        {
          "employeeNumber": "11102",
          "addressList": [
            {
              "addrType": "Home",
              "street_1": "123 Any street"
            },
            {
              "addrType": "Alternate",
              "street_1": "123 Any street"
            }
          ],
          "departmentList": [
            {
              "deptName": "IT"
            },
            {
              "deptName": "Finance"
            },
            {
              "deptName": "IT"
            }
          ]
        },
        {
          "employeeNumber": "11103",
          "addressList": [
            {
              "addrType": "Home",
              "street_1": "123 Any street"
            },
            {
              "addrType": "Alternate",
              "street_1": "123 Any street"
            }
          ],
          "departmentList": [
            {
              "deptName": "IT"
            },
            {
              "deptName": "Finance"
            },
            {
              "deptName": "IT"
            }
          ]
        }
      ]
    }
  }
}
Have you seen the Working with a relational database chapter of the Groovy documentation? Alternatively, you can obtain a Connection instance from the JDBC Connection Configuration element like:
def connection = org.apache.jmeter.protocol.jdbc.config.DataSourceElement.getConnection('your-pool-name')
With regard to "sort": there is the DefaultGroovyMethods class, which provides a sort() method for any "sortable" entity. With regard to "compare": we don't know what the object from the database looks like, so we cannot provide a comprehensive solution.
Maybe an easier option would be converting the response from the JDBC Sampler to JSON using JsonBuilder; once you have two JSON structures, use a library like JSONassert, which doesn't care about order or depth.
You haven't asked, but if you're "very new to Groovy", maybe it's worth extracting individual values from the API using the JSON Extractor, doing the same for the database with the JDBC elements, and comparing individual JMeter Variables using a Response Assertion.
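For the order-insensitive comparison suggested above, here is a minimal sketch using JSONassert's LENIENT mode. It is written in Scala purely for illustration; inside JMeter the same two calls would sit in a Groovy JSR223 Assertion, and both JSON strings below are hypothetical stand-ins.

import org.skyscreamer.jsonassert.{JSONAssert, JSONCompareMode}

object CompareJson {
  def main(args: Array[String]): Unit = {
    // dbJson would be built from the JDBC results (e.g. with JsonBuilder);
    // apiJson is the raw API response body.
    val dbJson  = """{"departmentList":[{"deptName":"IT"},{"deptName":"Finance"}]}"""
    val apiJson = """{"departmentList":[{"deptName":"Finance"},{"deptName":"IT"}]}"""

    // LENIENT mode ignores array ordering, so differently sorted lists still match;
    // an AssertionError is thrown if the documents differ beyond ordering.
    JSONAssert.assertEquals(dbJson, apiJson, JSONCompareMode.LENIENT)
  }
}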

U-SQL: How to skip files from analysis based on content

I have a lot of files, each containing a set of JSON objects like this:
{ "Id": "1", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "2", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "3", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
etc.
Each file contains information about a specific type of session. In this case they are sessions from a web app, but they could also be sessions from a desktop app, in which case the value for Origin is "DesktopClient" instead of "WebClient".
For analysis purposes, say I am only interested in DesktopClient sessions.
All files representing a session are stored in Azure Blob Storage like this:
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef77.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef78.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef79.json
Is it possible to skip files whose first line already makes it clear that they are not DesktopClient session files, as in my example? I think it would save a lot of query resources if files that I know do not contain the right session type could be skipped, since they can be quite big.
At the moment my query reads the data like this:
@RawExtract = EXTRACT [RawString] string
FROM @"wasb://plancare-events-blobs@centrallogging/2017/07/20/{*}.json"
USING Extractors.Text(delimiter:'\b', quoting : false);
@ParsedJSONLines = SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple([RawString]) AS JSONLine
FROM @RawExtract;
...
Or should I create my own version of Extractors.Text, and if so, how should I do that?
To answer some questions that popped up in the comments to the question first:
At this point we do not provide access to the Blob Store metadata. That means you need to express any metadata either as part of the data in the file or as part of the file name (or path).
Depending on the cost of extraction and the sizes of the files, you can either extract all the rows and then filter out the rows whose beginning does not fit your criteria. That will extract all files and all rows from all files, but it does not need a custom extractor.
Alternatively, write a custom extractor that checks only the files that are appropriate (that may be useful if the first solution does not give you the performance you need and you can determine the conditions efficiently inside the extractor). Several example extractors can be found at http://usql.io in the examples directory (including an example JSON extractor).

JSON object selector to describe the criteria for querying documents in Azure Cosmos/ Document DB

I am using a JavaScript Azure Function with a binding to Cosmos DB (Document DB) to query documents within a collection. I would like the SELECT query to be formed based on a JSON object that comes in the request body. IBM Cloudant provides a feature whereby you can pass a JSON object (selector) that describes the criteria for selecting documents. How do I achieve the same in Azure?
The JSON selector looks like this-
{
  "selector": {
    "id": {
      "$gt": 0
    },
    "USERS": {
      "username": "Jack",
      "department": "HR"
    }
  }
}
The sql-from-mongo npm package converts expressions like these into CosmosDB SQL. The module can easily be loaded in your Azure Function, or it can even be loaded into a sproc with a little bit of manipulation.
Full disclosure: I'm the author of the npm package.

How to store a time value in MongoDB using SailsJS v0.11.3?

I'm working with SailsJS and MongoDB, and I have an API with two models. I want to store Date and Time separately, but as the official documentation says, there is no Time attribute type, just Date and DateTime. So I'm using DateTime to store time values.
I send the values in the request to be stored in the database. I have no problem storing dates; I just send a value like:
2015-12-16
and that's it, it is stored in the table with no problem. But when I want to store a time value like:
19:23:12
it doesn't work. The error is:
{
  "error": "E_VALIDATION",
  "status": 400,
  "summary": "1 attribute is invalid",
  "model": "ReTime",
  "invalidAttributes": {
    "value": [
      {
        "rule": "datetime",
        "message": "`undefined` should be a datetime (instead of \"19:23:12\", which is a string)"
      }
    ]
  }
}
So, any idea how to send the time value to be stored in a DateTime attribute?
I have also tried sending it in different formats, like:
0000-00-00T19:23:12.000Z
0000-00-00T19:23:12
T19:23:12.000Z
19:23:12.000Z
19:23:12
But none of them works.
I was also thinking of storing both values (Date and Time) as plain text, i.e. as String attributes. But I need to run some queries, and I don't know whether that would affect performance with the Waterline ORM.
Any kind of help will come in handy.
Thanks a lot!
You have two options here. The best would be to simply store the date and time together as a datetime and parse them into separate values when you need to use them (e.g. send 2015-12-16T19:23:12.000Z and split the date and time parts when reading). MongoDB will store this in BSON format, which is the most efficient method.
Otherwise, you could use a string and create a custom validation rule as described in the Sails documentation.

Populate Azure Data Factory dataset from query

I cannot find an answer via Google, MSDN (or other Microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using a Copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic that alters the data, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "tableName" property.
If there is a way, what would that method be?
The input is an on-premises SQL Server; the output is an Azure SQL database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to use only the table name in the dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
There's a way to move data from on-premises SQL to Azure SQL using Data Factory.
You can use the Copy Activity; check this code sample for your case specifically: GitHub link to the ADF Activity source.
Basically, you need to create a Copy Activity whose typeProperties contain a SqlSource and a SqlSink, looking like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you can use not only selects from tables or views; [Table-Valued Functions] will work as well.
