U-SQL: How to skip files from analysis based on content - azure

I have a lot of files, each containing a set of JSON objects like this:
{ "Id": "1", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "2", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "3", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
etc.
Each file contains information about a specific type of session. In this case they are sessions from a Web App, but they could also be sessions from a Desktop App. In that case the value for Origin is "DesktopClient" instead of "WebClient".
For analysis purposes say I am only interested in DesktopClient sessions.
All files representing a session are stored in Azure Blob Storage like this:
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef77.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef78.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef79.json
Is it possible to skip files whose first line already makes it clear that it is not a DesktopClient session file, as in my example? I think it would save a lot of query resources if files that I know do not contain the right session type can be skipped, since they can be quite big.
At the moment my query reads the data like this:
@RawExtract = EXTRACT [RawString] string
FROM @"wasb://plancare-events-blobs@centrallogging/2017/07/20/{*}.json"
USING Extractors.Text(delimiter:'\b', quoting : false);
@ParsedJSONLines = SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple([RawString]) AS JSONLine
FROM @RawExtract;
...
Or should I create my own version of Extractors.Text, and if so, how should I do that?

To answer some questions that popped up in the comments to the question first:
At this point we do not provide access to the Blob Store metadata. That means that you need to express any metadata either as part of the data in the file or as part of the file name (or path).
Depending on the cost of extraction and the sizes of the files, you can either extract all the rows and then filter out the rows whose beginning does not fit your criteria. That will read all files and all of their rows, but does not need a custom extractor.
Alternatively, write a custom extractor that checks for only the files that are appropriate (that may be useful if the first solution does not give you the performance you need and you can determine the conditions efficiently inside the extractor). Several example extractors can be found at http://usql.io in the example directory (including an example JSON extractor).

Related

Azure Data Factory REST API return invalid JSON file with pagination

I'm building a pipeline which copies a response from an API into a file in my storage account. There is also an element of pagination. However, that works like a charm and I get all my data from all the pages.
My result is something like this:
{"data": {
"id": "Something",
"value": "Some other thing"
}}
The problem is that the copy function just appends each response to the file, thereby making it invalid JSON, which is a big problem further down the line. The final output would look like:
{"data": {
"id": "22222",
"value": "Some other thing"
}}
{"data": {
"id": "33333",
"value": "Some other thing"
}}
I have tried everything I could think of and could Google my way to, but nothing changes how the data is appended to the file, and I'm stuck with an invalid JSON file :(
As a backup plan, I'll just make a loop and create a JSON file for each page. But that seems a bit janky and really slow.
Anyone got an idea or have a solution for my problem?
When you copy data from a REST API to blob storage, it will copy the data in the form of a set of objects by default.
Example:
sample data
{ "time": "2015-04-29T07:12:20.9100000Z", "callingimsi": "466920403025604"}
sink data
{"time":"2015-04-29T07:12:20.9100000Z","callingimsi":"466920403025604"}
{"time":"2015-04-29T07:13:21.0220000Z","callingimsi":"466922202613463"}
{"time":"2015-04-29T07:13:21.4370000Z","callingimsi":"466923101048691"}
This is not valid JSON.
To work around this, set the File pattern in the sink activity settings to Array of objects; this will write an array containing all the objects.
Output:
[{"time":"2015-04-29T07:12:20.9100000Z","callingimsi":"466920403025604"},{"time":"2015-04-29T07:13:21.0220000Z","callingimsi":"466922202613463"},{"time":"2015-04-29T07:13:21.4370000Z","callingimsi":"466923101048691"}]
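If some files have already been written in the concatenated form, you could also repair them after the fact instead of re-running the copy. This is only a rough Python sketch with hypothetical file paths, not a Data Factory feature:
import json

# Hypothetical paths: the file written by the copy activity as back-to-back
# JSON objects, and the repaired file containing one valid JSON array.
source_path = "raw_output.json"
target_path = "repaired_output.json"

with open(source_path, "r", encoding="utf-8") as src:
    text = src.read()

# raw_decode parses one JSON object at a time, so it also handles objects
# that span several lines.
decoder = json.JSONDecoder()
objects = []
pos = 0
while pos < len(text):
    # Skip whitespace between objects.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        break
    obj, pos = decoder.raw_decode(text, pos)
    objects.append(obj)

with open(target_path, "w", encoding="utf-8") as dst:
    json.dump(objects, dst, indent=2)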

How to assign tags to every file and then filter by these tags in GCS?

I'm using Google Cloud Storage to store 100-150 files, mostly pictures (<30 MB) and a few videos (<100 MB). The number of files should remain around what it is right now, so I'm not that worried about scalability. All of the files are in the same storage bucket.
My goal is to assign at least one tag (string) and a number of upvotes (integer) to every file.
I want to be able to:
Filter all of the files by their tag(s) and only retrieve the ones with said tag.
A file may have several tags, such as "halloween", "christmas" and "easter", and must be retrieved if any of its tags is specified.
Then, sort the filtered files by their number of upvotes (descending order).
In other words, the end goal is to get a sorted list of the filtered files' URLs (to show on a website similar to Imgur).
Right now, I've come up with two (potential) ways to do that:
CUSTOM METADATA KEY
When adding a file to the bucket, add a custom key for the tags and another one for the number of upvotes. Then, when I want to retrieve all files with the "halloween" tag, I can:
Loop through every file and get its metadata (doc)
Filter and sort all of the files based on the metadata's keys
Get the URL of each file
Seems like this could cause some performance issues, but would be an easy structure.
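A rough Python sketch of this custom-metadata approach (it assumes the tags are stored as a comma-separated string under a hypothetical "tags" metadata key and the upvotes under an "upvotes" key):
from google.cloud import storage

def urls_with_tag(bucket_name: str, tag: str) -> list:
    """Return the public URLs of files carrying the tag, sorted by upvotes (descending)."""
    client = storage.Client()
    matches = []
    for blob in client.list_blobs(bucket_name):
        metadata = blob.metadata or {}                 # custom metadata key/value pairs
        tags = metadata.get("tags", "").split(",")
        if tag in tags:
            upvotes = int(metadata.get("upvotes", "0"))
            matches.append((upvotes, blob.public_url))
    matches.sort(reverse=True)                         # highest upvotes first
    return [url for _, url in matches]
For 100-150 files this single pass over the listing should be fast enough.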
INDEX FILE
Create an index file (in JSON). When adding a file to the bucket, also add the file's info to the index...
Eg:
{
"files": [
{
"filename": "file1",
"path": "/path/to/file1",
"tags": ["tag1", "tag2", "tag3"],
"upvotes": 10
},
{
"filename": "file2",
"path": "/path/to/file2",
"tags": ["tag4", "tag5"],
"upvotes": 5
},
{
"filename": "file3",
"path": "/path/to/file3",
"tags": [],
"upvotes": 0
}
]
}
Then, when I want to retrieve all files with the "halloween" tag, I can:
Parse the JSON file to get all the files with the "halloween" tag
Sort based on the upvotes key
Get the URL of each file
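A similar sketch for the index-file approach, assuming a hypothetical index.json object at the root of the same bucket:
import json
from google.cloud import storage

def urls_with_tag(bucket_name: str, tag: str) -> list:
    """Read the index file, filter by tag, and sort by upvotes (descending)."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Download and parse the whole index in one request.
    index = json.loads(bucket.blob("index.json").download_as_text())
    matching = [f for f in index["files"] if tag in f["tags"]]
    matching.sort(key=lambda f: f["upvotes"], reverse=True)
    return [f["path"] for f in matching]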
I think this would be better performance-wise? I'm leaning towards this one right now...
But I don't love the idea of having an external index file.
Are both methods possible to implement? Is there a better way to do this?
Disclaimer: I'm pretty new to databases (especially GCS).
If not, which one would have the best performance? Doesn't need to be blazing fast, it's a school project.
Thanks!

Querying mongoDB using pymongo (completely new to mongo/pymongo)

If this question seems too trivial, then please let me know in the comments and I will do further research on how to solve it.
I have a collection called products where I store details of a particular product from different retailers. The schema of a document looks like this -
{
"_id": "uuid of a product",
"created_at": "timestamp",
"offers": [{
"retailer_id": 123,
"product_url": "url - of -a - product.com",
"price": "1"
},
{
"retailer_id": 456,
"product_url": "url - of -a - product.com",
"price": "1"
}
]
}
The _id of a product is system generated. Consider a product like 'iPhone X': this will be a single document with URLs and prices from multiple retailers like Amazon, eBay, etc.
Now if a new URL comes into the system, I need to query whether this URL already exists in our database. The obvious way to do this is to iterate over every offer of every product document and see if the product_url field matches the input URL. That would require loading all the documents into memory and iterating through the offers of every product one by one. Now my question arises:
Is there a simpler method to achieve this? Using pymongo?
Or does my database schema need to be changed, since this basic check_if_product_url_exists() is too complex?
MongoDB provides searching within arrays using dot notation.
So your query would be:
db.collection.find({'offers.product_url': 'url-of-a-product.com'})
The same syntax works in MongoDB shell or pymongo.
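For example, a minimal pymongo sketch of the lookup (the connection string and database name are placeholders), plus an index on the array field so the check does not scan the whole collection:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
products = client["mydb"]["products"]               # placeholder database name

# A multikey index on the embedded field keeps the lookup fast as the collection grows.
products.create_index("offers.product_url")

def check_if_product_url_exists(url: str) -> bool:
    # Dot notation reaches into every element of the offers array,
    # so no client-side iteration is needed.
    return products.count_documents({"offers.product_url": url}, limit=1) > 0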

Case insensitive search in arrays for CosmosDB / DocumentDB

Let's say I have these documents in my CosmosDB (DocumentDB API, .NET SDK):
{
// partition key of the collection
"userId" : "0000-0000-0000-0000",
"emailAddresses": [
"someaddress#somedomain.com", "Another.Address#someotherdomain.com"
]
// some more fields
}
I now need to find out if I have a document for a given email address. However, I need the query to be case insensitive.
There are ways to search case-insensitively on a field (they do a full scan, however):
How to do a Case Insensitive search on Azure DocumentDb?
select * from json j where LOWER(j.name) = 'timbaktu'
e => e.Id.ToLower() == key.ToLower()
These do not work for arrays. Is there an alternative way? A user defined function looks like it could help.
I am mainly looking for a temporary low-effort solution to support the scenario (I have multiple collections like this). I probably need to switch to a data structure like this at some point:
{
"userId" : "0000-0000-0000-0000",
// Option A
"emailAddresses": [
{
"displayName": "someaddress#somedomain.com",
"normalizedName" : "someaddress#somedomain.com"
},
{
"displayName": "Another.Address#someotherdomain.com",
"normalizedName" : "another.address#someotherdomain.com"
}
],
// Option B
"emailAddressesNormalized": {
"someaddress#somedomain.com", "another.address#someotherdomain.com"
}
}
Unfortunately, my production database already contains documents that would need to be updated to support the new structure.
My production collections contain only 100s of these items, so I am even tempted to just get all items and do the comparison in memory on the client.
If performance matters then you should consider one of the normalization solutions you have proposed yourself in the question. Then you could index the normalized field and get results without doing a full scan.
If for some reason you really don't want to retouch the documents, then perhaps the feature you are missing is a simple join?
An example query which will do a case-insensitive search within the array, with a scan:
SELECT c FROM c
JOIN email IN c.emailAddresses
WHERE LOWER(email) = LOWER('ANOTHER.ADDRESS@someotherdomain.com')
You can find more examples about joining from Getting started with SQL commands in Cosmos DB.
Note that the WHERE criterion in the given example cannot use an index, so consider using it only alongside another, more selective (indexed) criterion.
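For reference, the same JOIN query issued from code could look roughly like the sketch below; it uses the Python SDK (azure-cosmos) with placeholder endpoint, key and names, and the SQL text is the same one you would pass to the .NET SDK:
from azure.cosmos import CosmosClient

# Placeholder connection details and names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<collection>")

query = (
    "SELECT VALUE c FROM c "
    "JOIN email IN c.emailAddresses "
    "WHERE LOWER(email) = LOWER(@email)"
)
results = container.query_items(
    query=query,
    parameters=[{"name": "@email", "value": "ANOTHER.ADDRESS@someotherdomain.com"}],
    enable_cross_partition_query=True,  # the userId partition key is not part of the filter
)
for doc in results:
    print(doc["userId"])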

How to backup/dump structure of graphs in arangoDB

Is there a way to dump the graph structure of an ArangoDB database? arangodump unfortunately just dumps the data of edge and document collections.
According to the documentation, in order to dump structural information for all collections (including system collections), you run the following:
arangodump --dump-data false --include-system-collections true --output-directory "dump"
If you do not want the system collections to be included then don't provide the argument (it defaults to false) or provide a false value.
How the structure and the data of collections are dumped is described below (from the documentation):
Structural information for a collection will be saved in files with name pattern <collection-name>.structure.json. Each structure file contains a JSON object with these attributes:
parameters: contains the collection properties
indexes: contains the collection indexes
Document data for a collection will be saved in files with name pattern <collection-name>.data.json. Each line in a data file is a document insertion/update or deletion marker, along with some metadata.
For testing I often want to extract a subgraph with a known structure. I use that to test my queries against. The method is not pretty but it might address your question. I blogged about it here.
Although @Raf's answer is accepted, --dump-data false will only give structure files for all the collections, and the data won't be there. Including --include-system-collections true would give the _graphs system collection's structure, which won't have information pertaining to the creation/structure of individual graphs.
To get the graph creation data as well, the right command is as follows:
arangodump --server.database <DB_NAME> --include-system-collections true --output-directory <YOUR_DIRECTORY>
We would be interested in the file named _graphs_<long_id>.data.json, which has the data format below.
{
"type": 2300,
"data":
{
"_id": "_graphs/social",
"_key": "social",
"_rev": "_WaJxhIO--_",
"edgeDefinitions": [
{
"collection": "relation",
"from": ["female", "male"],
"to": ["female", "male"]
}
],
"numberOfShards": 1,
"orphanCollections": [],
"replicationFactor": 1
}
}
Hope this helps other users with the same requirement!
Currently ArangoDB manages graphs via documents in the system collection _graphs.
One document equals one graph. It contains the graph name, the involved vertex collections, and the edge definitions that configure the directions of the edge collections.
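So if you only need the graph definitions rather than a full dump, you can also read them straight from that collection. A minimal sketch using the python-arango driver (connection details are placeholders):
from arango import ArangoClient  # python-arango

# Placeholder connection details.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("<DB_NAME>", username="root", password="<PASSWORD>")

# Each document in _graphs describes one named graph: its edge
# definitions (directions) and its orphan vertex collections.
for graph in db.collection("_graphs").all():
    print(graph["_key"], graph["edgeDefinitions"], graph["orphanCollections"])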
