Can someone help me analyse the data size for Cosmos DB?
I uploaded my data from a JSON file.
Single region.
I use the Cosmos DB SQL API for this database.
There are 95,969 rows/documents.
This is what a document looks like, about 704 bytes; the only field that varies in size is "CityName": "Carleton Place".
However, the JSON data file is 26.7 MB, while Cosmos DB reports the data size as 64 MB.
How come it is inflated by about 32 MB?
The index size looks OK at 15.45 MB, since I have spatial points.
{
    "agegroup": 2,
    "locationgeometry": {
        "type": "Point",
        "coordinates": [ 45.14478, -76.14443 ]
    },
    "ProvinceAbbr": "ON",
    "age": 34,
    "LHIN_LocationID": 11,
    "Latitude": 45.14478,
    "Longitude": -76.14443,
    "PostalCode": "K7C 1X2",
    "CityName": "Carleton Place",
    "CityType": "D",
    "ProvinceName": "Ontario",
    "id": "3e496a96-db77-4535-b73b-5ab317b44231",
    "_rid": "sGVsAMC4X4ICAAAAAAAAAA==",
    "_self": "dbs/sGVsAA==/colls/sGVsAMC4X4I=/docs/sGVsAMC4X4ICAAAAAAAAAA==/",
    "_etag": "\"0000cd97-0000-0200-0000-5d586f650000\"",
    "_attachments": "attachments/",
    "_ts": 1566076773
}
The 26.7 MB JSON file was created from MS SQL.
The original MS SQL table stores 18.94 MB of data with a 2.5 MB index.
I have a SQL API Cosmos DB container with 6 logical partitions, and each partition is about 15 MB (that is, million bytes, not MiB = 1024*1024 bytes).
That is about 85 MB in total.
Using the DTUI tool to export the container to a JSON dump, the text dump file is 48.74 MB.
Part of the overhead comes from Cosmos DB's internal fields, which are stored in Cosmos DB but are not part of the user data and thus not part of the export. Example fields (you can see them in Data Explorer):
_rid
_self
_etag
_attachments
_ts
There are other overheads that are not visible in Data Explorer.
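As a rough back-of-envelope check for the original question (a sketch only: the ~250 bytes of system properties per document is an estimate taken from the sample document above, not a documented figure):

using System;

// Back-of-envelope estimate only. The ~250 bytes of system properties per
// document is an assumption based on the sample document shown earlier
// (_rid, _self, _etag, _attachments, _ts), not a documented figure.
const int documentCount = 95969;
const int systemPropertyBytes = 250;
double overheadMb = documentCount * (double)systemPropertyBytes / (1024 * 1024);
Console.WriteLine($"Estimated system-property overhead: {overheadMb:F1} MB"); // ~22.9 MB

With 95,969 documents that is roughly 23 MB, which already accounts for a large part of the reported inflation; the remainder is internal storage overhead.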
Anyway, you should not be too concerned about the size, as most of your cost will typically come from RUs (usage or provisioned).
Hope this helps!
Related
In Cosmos DB I have two containers:
Subscriptions
UserSubscriptions
I want to move items from the Subscriptions container to the UserSubscriptions container, but while moving, each Subscriptions document should be converted into the UserSubscriptions document structure.
Is there a way to achieve this?
Any comments or suggestions would be greatly appreciated.
Subscriptions JSON sample:
{
    "id": "d18e4605-8bac-4bae-95d7-990703033a50",
    "isActive": true,
    "distributionUserIds": [
        "27039a11-9ace-4748-a65e-f463a9e7ef11"
    ],
    "PayerNumber": "0000000005",
    "AccountNumber": "0000000005",
    "_rid": "4XYVAOblmAIBAAAAAAAAAA==",
    "_self": "dbs/4XYVAA==/colls/4XYVAOblmAI=/docs/4XYVAOblmAIBAAAAAAAAAA==/",
    "_etag": "\"00000000-0000-0000-91a9-b9eef09501d6\"",
    "_attachments": "attachments/",
    "_ts": 1600866120
}
UserSubscriptions JSON sample:
{
    "id": "d18e4605-8bac-4bae-95d7-990703033a50",
    "UUID": "27039a11-9ace-4748-a65e-f463a9e7ef11",
    "Type": "Accounts",
    "Payers": [
        {
            "PayerNumber": "0000000005",
            "Accounts": [
                "0000000005"
            ]
        }
    ],
    "_rid": "C7VEAMxad9UkAAAAAAAAAA==",
    "_self": "dbs/C7VEAA==/colls/C7VEAMxad9U=/docs/C7VEAMxad9UkAAAAAAAAAA==/",
    "_etag": "\"00000000-0000-0000-2d10-3d41429f01d7\"",
    "_attachments": "attachments/",
    "_ts": 1617952580
}
Yes, as you said, the migration tool can't update the JSON while importing.
There are several ways to do this:
You can use the Change Feed to get all documents in the Subscriptions collection and change the document structure in code (see the sketch after this list).
You can transform the document structure using Data Flow in Azure Data Factory.
A Logic App can also achieve your requirement.
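For the first option, here is a minimal sketch using the .NET SDK v3 (Microsoft.Azure.Cosmos). The connection string and database name are placeholders, and it assumes one distribution user and one account number per Subscriptions document, as in your samples:

using Microsoft.Azure.Cosmos;
using Newtonsoft.Json.Linq;

// Placeholders: replace the connection string and database name with your own.
var client = new CosmosClient("<cosmos-connection-string>");
var source = client.GetContainer("<database>", "Subscriptions");
var target = client.GetContainer("<database>", "UserSubscriptions");

// Read every Subscriptions document and upsert a reshaped UserSubscriptions document.
// The SDK extracts the partition key value from the document itself.
var feed = source.GetItemQueryIterator<JObject>("SELECT * FROM c");
while (feed.HasMoreResults)
{
    foreach (JObject sub in await feed.ReadNextAsync())
    {
        var userSub = new JObject
        {
            ["id"] = sub["id"],
            ["UUID"] = sub["distributionUserIds"]?[0],
            ["Type"] = "Accounts",
            ["Payers"] = new JArray(
                new JObject
                {
                    ["PayerNumber"] = sub["PayerNumber"],
                    ["Accounts"] = new JArray(sub["AccountNumber"])
                })
        };
        await target.UpsertItemAsync(userSub);
    }
}

A Change Feed processor would give you the same reshaping logic with incremental processing instead of a one-off query, which is handy if Subscriptions keeps receiving writes during the migration.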
Steps to use Data Factory for copying:
Create a pipeline.
Create linked services for both Cosmos DB accounts.
Define the structure.
Start the pipeline; to avoid duplicate rows, use a pre-copy script (SQL to filter). You can configure more options, such as the batch size.
Set up a trigger.
I am collecting IoT data in Azure Cosmos DB. I know the Cosmos DB SQL API is auto-indexed by path. I have around 150 sensors in each document, and most of my SQL queries look like this (DeviceId is already the partition key):
SELECT c.sensorVariable FROM c WHERE c.DeviceId = 'dev1' AND c.time = date1
A sample document:
{ "DeviceId": "dev1", "time": 123333, "sensor1": 20, "sensor2": 40 }
I fetch data from various sensors, but all my queries depend on DeviceId and time (which is a Unix timestamp).
Is it possible to index the data on DeviceId and time and exclude the other keys, which are also under the same root path /?
By default the collection comes with this indexing policy:
"includedPaths": [
    {
        "path": "/*",
        "indexes": [
            {
                "kind": "Range",
                "dataType": "Number",
                "precision": -1
            },
            {
                "kind": "Range",
                "dataType": "String",
                "precision": -1
            },
            {
                "kind": "Spatial",
                "dataType": "Point"
            }
        ]
    }
],
For dataType String, shouldn't it use a Hash kind of index rather than a Range index? And what is this precision of -1?
In the Azure Cosmos DB documentation examples I have seen a precision of 3 for strings, and I did not understand why.
If I have 100 devices pushing data every second, what type of indexing is better?
Is it possible to index data on DeviceId and time and exclude other keys, which are also in the same path?
Yes. You can customize your indexing policy with IncludedPaths and ExcludedPaths.
For example:
var excluded = new DocumentCollection { Id = "excludedPathCollection" };
excluded.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/*" });
excluded.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/nonIndexedContent/*" });
await client.CreateDocumentCollectionAsync(UriFactory.CreateDatabaseUri("db"), excluded);
Please refer to the documentation here for more details.
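For your specific case, a sketch along the same lines: the collection id "telemetry" is a placeholder, the paths assume the DeviceId and time property names from your sample document, and it uses the same client and legacy index-kind syntax as the snippet above (needs the Microsoft.Azure.Documents and System.Collections.ObjectModel namespaces).

// Index only /DeviceId and /time, exclude everything else.
var collection = new DocumentCollection { Id = "telemetry" };
collection.IndexingPolicy.IncludedPaths.Add(new IncludedPath
{
    Path = "/DeviceId/?",
    Indexes = new Collection<Index> { new RangeIndex(DataType.String) { Precision = -1 } }
});
collection.IndexingPolicy.IncludedPaths.Add(new IncludedPath
{
    Path = "/time/?",
    Indexes = new Collection<Index> { new RangeIndex(DataType.Number) { Precision = -1 } }
});
collection.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/*" });
await client.CreateDocumentCollectionAsync(UriFactory.CreateDatabaseUri("db"), collection);

With a policy like this, queries filtering on DeviceId and time are still served from the index, while writes don't pay the cost of indexing the 150 sensor properties.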
What is this precision of -1?
In the Azure Cosmos DB documentation examples I have seen a precision of 3 for strings, and I did not understand why.
Based on Index data types, kinds, and precisions:
For a Hash index, this varies from 1 to 8 for both strings and numbers. The default is 3. For a Range index, this value can be -1 (maximum precision). It can vary from between 1 and 100 (maximum precision) for string or number values.
You can use this statement to guide your choice.
If I have 100 devices pushing data every second, what type of indexing is better?
It's hard to say which indexing mode is the best choice. It should be considered together with your consistency level and your read and write performance requirements. You can refer to this paragraph.
I'm using Azure Data Factory to periodically import data from MySQL to Azure SQL Data Warehouse.
The data goes through a staging blob storage on an Azure storage account, but when I run the pipeline it fails because it can't separate the blob text back to columns. Each row that the pipeline tries to insert into the destination becomes a long string which contains all the column values delimited by a "⯑" character.
I used Data Factory before, without trying the incremental mechanism, and it worked fine. I don't see a reason it would cause such a behavior, but I'm probably missing something.
I'm attaching the JSON that describes the pipeline with some minor naming changes, please let me know if you see anything that can explain this.
Thanks!
EDIT: Adding exception message:
Failed execution Database operation failed. Error message from
database execution :
ErrorCode=FailedDbOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error
happened when loading data into SQL Data
Warehouse.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=Query
aborted-- the maximum reject threshold (0 rows) was reached while
reading from an external source: 1 rows rejected out of total 1 rows
processed.
(/f4ae80d1-4560-4af9-9e74-05de941725ac/Data.8665812f-fba1-407a-9e04-2ee5f3ca5a7e.txt)
Column ordinal: 27, Expected data type: VARCHAR(45) collate SQL_Latin1_General_CP1_CI_AS, Offending value:* ROW OF VALUES
* (Tokenization failed), Error: Not enough columns in this
line.,},],'.
{
    "name": "CopyPipeline-move_incremental_test",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "$$Text.Format('select * from [table] where InsertTime >= \\'{0:yyyy-MM-dd HH:mm}\\' AND InsertTime < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "SqlDWSink",
                        "sqlWriterCleanupScript": "$$Text.Format('delete [schema].[table] where [InsertTime] >= \\'{0:yyyy-MM-dd HH:mm}\\' AND [InsertTime] <\\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)",
                        "allowPolyBase": true,
                        "polyBaseSettings": {
                            "rejectType": "Value",
                            "rejectValue": 0,
                            "useTypeDefault": true
                        },
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    },
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": "column1:column1,column2:column2,column3:column3"
                    },
                    "enableStaging": true,
                    "stagingSettings": {
                        "linkedServiceName": "StagingStorage-somename",
                        "path": "somepath"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset-input"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset-output"
                    }
                ],
                "policy": {
                    "timeout": "1.00:00:00",
                    "concurrency": 10,
                    "style": "StartOfInterval",
                    "retry": 3,
                    "longRetry": 0,
                    "longRetryInterval": "00:00:00"
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "Activity-0-_Custom query_->[schema]_[table]"
            }
        ],
        "start": "2017-06-01T05:29:12.567Z",
        "end": "2099-12-30T22:00:00Z",
        "isPaused": false,
        "hubName": "datafactory_hub",
        "pipelineMode": "Scheduled"
    }
}
It sounds like what you're doing is right, but the data is poorly formed (a common problem, often non-UTF-8 encoding), so ADF can't parse the structure as you require. When I encounter this I often have to add a custom activity to the pipeline that cleans and prepares the data so it can then be used in a structured way by downstream activities. Unfortunately this is a big overhead in the development of the solution and will require you to write a C# class to deal with the data transformation.
Also remember that ADF has no compute of its own; it only invokes other services, so you'll also need an Azure Batch service to execute the compiled code.
Sadly there is no magic fix here. ADF is great for extracting and loading perfectly structured data, but in the real world we need other services to do the transforming or cleaning, meaning we need a pipeline that can do ETL, or as I prefer, ECTL.
Here's a link on creating ADF custom activities to get you started: https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
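If you go down the custom-activity route, the shell of the C# class looks roughly like this (this is the ADF v1 .NET custom activity interface; the class name and the cleaning logic are placeholders):

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

public class CleanStagingDataActivity : IDotNetActivity
{
    public IDictionary<string, string> Execute(
        IEnumerable<LinkedService> linkedServices,
        IEnumerable<Dataset> datasets,
        Activity activity,
        IActivityLogger logger)
    {
        logger.Write("Cleaning staged data before the downstream copy runs.");

        // Placeholder: read the staged blob(s), fix the encoding and re-delimit the
        // rows so PolyBase can parse them, then write the cleaned file back.

        return new Dictionary<string, string>();
    }
}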
Hope this helps.
I've been struggling with the same message, sort of, when importing from Azure SQL DB to Azure SQL Data Warehouse using Data Factory v2 with staging (which implies PolyBase). I've learned that PolyBase will fail with error messages related to incorrect data types etc. The message I received is very similar to the one mentioned here, even though I'm not using PolyBase directly from SQL, but via Data Factory.
Anyway, the solution for me was to avoid NULL values for columns of decimal or numeric type, e.g. ISNULL(mynumericCol, 0) AS mynumericCol.
When parsing exported Application Insights telemetry from Blob storage, the request data looks something like this:
{
    "request": [
        {
            "id": "3Pc0MZMBJgQ=",
            "name": "POST Blah",
            "count": 6,
            "responseCode": 201,
            "success": true,
            "url": "https://example.com/api/blah",
            "durationMetric": {
                "value": 66359508.0,
                "count": 6.0,
                "min": 11059918.0,
                "max": 11059918.0,
                "stdDev": 0.0,
                "sampledValue": 11059918.0
            },
            ...
        }
    ],
    ...
}
I am looking for the duration of the request, but I see that I am presented with a durationMetric object.
According to the documentation the request[0].durationMetric.value field is described as
Time from request arriving to response. 1e7 == 1s
But if I query this using Analytics, the values don't match up to this field:
They do, however, match up to the min, max and sampledValue fields.
Which field should I use? And what does that "value": 66359508.0 value represent in the above example?
It doesn't match because you're seeing sampled data (meaning this event represents sampled data from multiple requests). I'd recommend starting with https://azure.microsoft.com/en-us/documentation/articles/app-insights-sampling/ to understand how sampling works.
In this case, the "matching" value would come from durationMetric.sampledValue (notice that value == count * sampledValue).
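In the sample above: 6 × 11,059,918 = 66,359,508, which is exactly the value field, and 11,059,918 / 1e7 ≈ 1.1 seconds is the duration of a single sampled request.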
It's hard to compare exactly what you're seeing because you don't show the Kusto query you're using, but you do need to be aware of sampling when writing AI Analytics queries. See https://azure.microsoft.com/en-us/documentation/articles/app-insights-analytics-tour/#counting-sampled-data for more details on the latter.
I set up blob indexing and full-text searching for Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.
Some of my documents are failing in the indexer, returning the following error:
Field 'content' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields.
The particular pdf that is producing this error is 3.68 MB, and contains a variety of content (text, tables, images, etc).
The index and indexer are set up exactly as described in that article, with the addition of some file type restrictions.
Index:
{
    "name": "my-index",
    "fields": [{
        "name": "id",
        "type": "Edm.String",
        "key": true,
        "searchable": false
    }, {
        "name": "content",
        "type": "Edm.String",
        "searchable": true
    }]
}
Indexer:
{
    "name": "my-indexer",
    "dataSourceName": "my-data-source",
    "targetIndexName": "my-index",
    "schedule": {
        "interval": "PT2H"
    },
    "parameters": {
        "maxFailedItems": 10,
        "configuration": {
            "indexedFileNameExtensions": ".pdf,.doc,.docx,.xls,.xlsx,.ppt,.pptx,.html,.xml,.eml,.msg,.txt,.text"
        }
    }
}
I tried searching through their docs and some other related articles, but I couldn't really find any information. I'm guessing this is because this feature is still in preview.
There's a limit on the size of a single term in the search index; it also happens to be 32 KB. If the content field in your search index is marked as filterable, facetable or sortable then you'll hit this limit (regardless of whether the field is marked as searchable or not). Typically for large searchable content you want to enable searchable and sometimes retrievable, but not the rest. That way you won't hit limits on content length from the index side.
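For example, a sketch of just the content field from the index above, keeping searchable on and explicitly turning the other attributes off (as far as I recall, these attributes default to true in the REST API when omitted, so setting them explicitly is the safest option):

{
    "name": "content",
    "type": "Edm.String",
    "searchable": true,
    "retrievable": true,
    "filterable": false,
    "sortable": false,
    "facetable": false
}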
Please see this answer for more context as well.