How to find duplicate documents in Cosmos DB - Azure

I noticed a huge amount of data written to Cosmos DB by a Stream Analytics job on a particular day.
The job was not supposed to write that many documents in a day, so I need to check whether documents were duplicated on that day.
Is there any query or other way to find duplicate records in Cosmos DB?

It is possible if you know which properties define a duplicate.
We had a nasty production issue that caused many duplicate records as well.
When we contacted MS Support to help us identify the duplicate documents, they gave us the following query.
Bear in mind: in our case, properties A and B together define uniqueness, so if two documents have the same values for A and B, they are duplicates.
You can then use the output of this query to, for example, delete the older documents but keep the most recent one (based on _ts).
SELECT d.A, d.B FROM
(SELECT c.A, c.B, COUNT(c._ts) AS counts FROM c
GROUP BY c.A, c.B) AS d
WHERE d.counts > 1
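If you want to go one step further and actually remove the older documents in each group, here is a minimal sketch using the azure-cosmos Python SDK. The account, database, and container names are placeholders, and it assumes A is the partition key; treat it as an untested outline, not a drop-in script.

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# Find the (A, B) pairs that occur more than once.
pairs = container.query_items(
    query="""SELECT d.A, d.B FROM
             (SELECT c.A, c.B, COUNT(c._ts) AS counts FROM c
              GROUP BY c.A, c.B) AS d
             WHERE d.counts > 1""",
    enable_cross_partition_query=True)

for pair in pairs:
    # Fetch every document in the group, newest first, and delete all but the first.
    docs = list(container.query_items(
        query="SELECT c.id, c.A FROM c WHERE c.A = @a AND c.B = @b ORDER BY c._ts DESC",
        parameters=[{"name": "@a", "value": pair["A"]},
                    {"name": "@b", "value": pair["B"]}],
        enable_cross_partition_query=True))
    for doc in docs[1:]:
        container.delete_item(item=doc["id"], partition_key=doc["A"])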

Is there any query or other way to find duplicate records in Cosmos DB?
The quick answer is yes. Use the DISTINCT keyword in the Cosmos DB SQL query, and filter on _ts (the system-generated Unix timestamp: https://learn.microsoft.com/en-us/azure/cosmos-db/databases-containers-items#properties-of-an-item).
Something like:
SELECT DISTINCT c.X, c.Y, c.Z ... (all properties you want to check) FROM c WHERE c._ts >= <start of day> AND c._ts < <end of day>
Since _ts is a timestamp in seconds, "a particular day" has to be expressed as a range rather than an equality check.
Then you could delete the duplicate data using this bulk delete library: https://github.com/Azure/azure-cosmosdb-bulkexecutor-dotnet-getting-started/tree/master/BulkDeleteSample.
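Since those bounds are just Unix timestamps, they are easy to compute. A small Python sketch (the date is only an example):

from datetime import datetime, timezone

# _ts is seconds since the Unix epoch, so one UTC day is a half-open range.
start = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp())
end = start + 86400  # start of the next day
query = f"SELECT DISTINCT c.X, c.Y, c.Z FROM c WHERE c._ts >= {start} AND c._ts < {end}"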

Related

Best practices to store every minute and select from database latest 24h data only?

The task is to permanently record new data to a database every minute and then, occasionally, to read only latest 24h data, using Python.
The only approach I know:
create a script A that inserts one new row per minute into a MariaDB table, with a timestamp as a field value
create a script B that reads from the table, using WHERE on the timestamp values
The problem is that there are two restrictions:
it is not allowed to have more than 10,000 rows in one database table
it is not allowed to delete any rows
How to fulfill the task and meet both restrictions? Are there best practices?
Thanks!
You can create a new table every X days, when the current one is full. Name each table after its first timestamp value. (At one row per minute, a table collects 1,440 rows per day, so the 10,000-row limit is reached in just under 7 days; X must be 6 or less.)
With this solution you need to create your B script this way:
List all tables
Find the tables you are looking for
Write your SQL query over all these tables using UNION ALL (see the sketch below)
You can do it in a single SQL query for performance, or in a script using multiple queries for simplicity.
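A minimal sketch of script B in Python, assuming the mysql-connector-python package, tables named like data_<unix timestamp of their first row>, and a ts column storing Unix seconds (all of these names are assumptions):

import time
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="metrics")
cur = conn.cursor()

# 1. List all tables that follow the naming scheme, oldest first.
cur.execute("SHOW TABLES LIKE 'data_%'")
tables = sorted((t for (t,) in cur.fetchall()),
                key=lambda t: int(t.split("_")[1]))

# 2. Keep the tables that can contain rows from the last 24 hours:
#    every table starting after the cutoff, plus the one just before it.
cutoff = time.time() - 24 * 3600
recent = [t for t in tables if int(t.split("_")[1]) >= cutoff]
if len(recent) < len(tables):
    recent.insert(0, tables[len(tables) - len(recent) - 1])

# 3. UNION ALL over the selected tables, filtering on the timestamp column.
#    Table names come straight from SHOW TABLES, so interpolating them is safe here.
union = " UNION ALL ".join(f"SELECT ts, value FROM {t}" for t in recent)
cur.execute(f"SELECT * FROM ({union}) AS u WHERE u.ts >= %s", (cutoff,))
rows = cur.fetchall()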

What is the best way to move out-of-order Access records into the proper order by using a locked ID field?

I have roughly 1,500 records in an Access database. I have an ID field that acts as the primary key, and as such it cannot be manually changed. After looking through the original Excel sheet these records were kept in, I noticed that a few records in Excel were missing from the Access database. After going through all of them, I added the three missing records into Access.
This database stores records in date order, grouped by manufacturer. For example, records from Manufacturer1 collected during week 1 of June '16 are all located together, and records from Manufacturer2 collected during week 2 of June '16 are stored directly afterwards. This is important for us because the data in this database often needs to be looked at visually, so keeping things in date order is essential. There is also a macro that exports the data to an Excel sheet and formats it to be easier to read; it exports the records in the order in which they are stored (by the ID field). This is a problem because the three missing records are from years past, so now they sit in the middle of records from 2018, and the IDs they were assigned upon entry keep them in that location.
Is there a way to reliably insert these records into the database at the location where they should be, such as shifting the other records' ID values down by 3 to make room for the missing records? I know I could probably have the macro that exports to Excel move those three records to the desired location, but I'd rather have a less hacky solution that would also work if a similar problem happens again.
The order of data in a database is of no interest to the database - it's the relation between data that matters.
To always view your data in the order you want use the ORDER BY clause in an SQL statement. Generally you can add data to the underlying table directly through the query - unless you've got many-to-one type queries where your update would need to affect more than one record.
SELECT FieldName1, FieldName2, ...
FROM MyDataTable
ORDER BY Manufacturer, [Date]
(Date is a reserved word in Access, hence the brackets.)
Edit: Even here you'll be adding new records to the bottom of the dataset, but refreshing the query will move the records into the correct order.

Azure query using the select

I am trying to write a query in Azure that gets the entity with a given partition key and row key, based on Date.
My entities have these properties:
PartitionKey, RowKey, Date, additional info.
I am looking for a query using TableService so that
I always get the latest one (using Date)
How can I get the query? (I am using Node and Azure.)
TableQuery
.select()
.from('myusertables')
.where('PartitionKey eq ?', '545455');
How do I write the table query?
To answer your question, check out this previously answered question: How to select only the records with the highest date in LINQ
However, you may be facing a design issue. Performing the operation you are trying to do requires pulling all the entities from the underlying Azure table, which will get slower over time as entities are added. So you may want to reconsider your design and possibly change the way you use your PartitionKey and RowKey. You could also store the latest entity in a separate table, so that only one entity needs to be found, transforming your scan/filter into a seek operation. Food for thought...
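One common way to make "give me the latest" a seek instead of a scan is to encode an inverted timestamp in the RowKey, so the newest entity always sorts first in its partition. A rough sketch in Python with the azure-data-tables package (the question uses Node, but the pattern is the same; the table name and values are made up):

import time
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection string>", "myusertables")

# On write: an inverted, zero-padded timestamp sorts newest-first.
now = int(time.time())
table.create_entity({
    "PartitionKey": "545455",
    "RowKey": str(10**10 - now).zfill(10),
    "Date": now,
})

# On read: the first entity in the partition is the most recent one.
latest = next(iter(table.query_entities("PartitionKey eq '545455'")))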

Azure - Querying 200 million entities

I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. do I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?
Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
Table storage (actually Azure Storage in general: tables, blobs, queues) has a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1,000 rows returned; to see all rows, you'd need additional fetches). Adding a row key to narrow the search would, indeed, yield a single transaction, and so would inserting a new row. You can also batch up multiple row inserts within a single partition and save them in a single transaction.
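As a rough illustration of that last point, here is how a batched insert within one partition might look with the azure-data-tables Python package (the table name and entities are made up):

from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection string>", "numbers")

# All operations in one transaction must share the same PartitionKey.
operations = [
    ("create", {"PartitionKey": "5551234567", "RowKey": str(i), "Info": "..."})
    for i in range(3)
]
table.submit_transaction(operations)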
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this msdn blog post.
The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.
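To make that concrete, here is a minimal sketch of an accept-or-reject check using the azure-data-tables Python package. It leans on the fact that an insert with an existing key fails atomically; the table name and key scheme are assumptions:

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection string>", "numbers")

def try_accept(phone: str, row_key: str) -> bool:
    """Insert the entity; reject it if the key is already taken."""
    try:
        table.create_entity({"PartitionKey": phone, "RowKey": row_key})
        return True
    except ResourceExistsError:
        return False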

How can I delete records from a table that meet certain criteria

Rookie question, I know.
I have a table with about 10 fields; one of them is a category field. I need this field to exist because of the multiple types of categories. However, one category in this field is wrong and is duplicating results.
So can I delete all records in the table that have "Type320" in the CatDescription field, and how? I want to keep everything else as it is in this table; I just need to get rid of the records that have that value in that one field.
Thanks very much!
EDIT: Thanks for the answer; I did not know how to do this, so this is very helpful.
However, this is more complicated than I thought. The raw data I am supplied carries these duplicate records (they are only duplicates in certain circumstances, but they are easy to isolate). This raw data is given to me on a monthly basis in several spreadsheet forms.
It all relates to these ID numbers and has about 10 fields (xls columns). As I said before, one of these is the Category Description field (sorry, this is not a lookup). In certain places a record automatically duplicates itself on output, because in the database this comes from, it has to have this sub-category for one particular "type".
So every time there is a duplication, every single bit of information in all fields is exactly the same, with the exception of the CatDescription (one is Type320, and the duplicated record's type is "Type321"). However, there are some instances where Type321 is valid on its own (in which case there is no matching data row with a Type320 CatDescription). By matching I mean all data in all fields of a particular record.
The rule is absolute: if all fields (the data within) of a record with a Type320 CatDescription match all fields (the data within) of a record with a Type321 CatDescription, then I can delete the record containing the Type321 CatDescription. This is safe because this is the only situation where the duplication occurs; normally not all of this should match.
This leaves all unique records with Type320 and Type321 data (that do not match exactly) in place, just as it should. This makes sense to me (and hopefully to you too :/), but can it be done, and how?
Thanks, because this is way over my head. I would rather know how to do it in Access, but an xls solution is equally appreciated. Heck, I would do it in ppt if it would get the job done! :)
I would try one of these two queries:
DELETE FROM table WHERE CatDescription LIKE '%Type320%';
DELETE FROM table WHERE CatDescription LIKE '*Type320*';
That is because the Access database engine could be using * (ANSI-89 Query Mode, e.g. DAO) instead of % (ANSI-92 Query Mode, e.g. OLE DB/ADO) for the wildcards.
Alternatively, this regardless of ANSI Query Mode:
DELETE FROM table WHERE CatDescription ALIKE '%Type320%';
Note the Access database engine's ALIKE keyword is not officially supported.
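For the edited requirement (delete a Type321 record only when a Type320 record matches it in every other field), something along these lines could work. It is an untested sketch, and Field1/Field2 stand in for your real column names; extend the AND list with one comparison per remaining field:
DELETE t.*
FROM table AS t
WHERE t.CatDescription = 'Type321'
AND EXISTS (
    SELECT 1 FROM table AS s
    WHERE s.CatDescription = 'Type320'
    AND s.Field1 = t.Field1
    AND s.Field2 = t.Field2
);
If Access balks at the correlated delete, run the same EXISTS condition in a SELECT query first to review the matching rows before removing them. One caveat: Null never equals Null in SQL, so records with empty fields will not be treated as matches unless you wrap the comparisons in Nz() (available when the query runs inside Access).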
Does the CatDescription field look up to another table? Is it a query of those tables that creates what you call duplicate results?
If so, be careful about blaming the table that has CatDescription. Check the look-up table to see if Type320 is found there in duplicate.
If you don't have the problem isolated correctly, then you're likely to delete good records while not fixing the problem.
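A quick way to run that check, as a sketch (LookupTable stands in for whatever table CatDescription actually looks up to):
SELECT CatDescription, COUNT(*) AS HowMany
FROM LookupTable
GROUP BY CatDescription
HAVING COUNT(*) > 1;
Any row this returns is a value that exists in the lookup table more than once.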
