I was recently asked by one of my customers whether there is a way to clean out records that have the "DeletedDatabaseRecord" flag set.
They are in the process of implementing a new base company and have gone through several rounds of importing and deleting key records, which has left quite a few of these flagged records that they would prefer not to carry over to their actual live company.
Looking through the system, I didn't see a built-in method to clear these records out.
Is there a way of purging these records that is part of the system, whether through the ERP Configuration tools, stored procedures, or the interface itself?
Jeff,
No, there is no special functionality to remove records flagged as DeletedDatabaseRecord, but you may always use a simple SQL script to loop over all the tables that have this column and remove from each of them the records that have it set to 1.
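For example, something along these lines would do it (here driven from a small script rather than pure T-SQL). This is only a rough sketch: it assumes a SQL Server back end, the Node mssql package, and that every table exposes the flag under the same column name, so run it against a backup copy of the database first.

```typescript
import sql from "mssql";

// Rough sketch: find every table that has a DeletedDatabaseRecord column
// and purge the rows where the flag is set to 1. Test on a backup first.
async function purgeDeletedRecords(connectionString: string): Promise<void> {
  const pool = await sql.connect(connectionString);
  try {
    // List all tables that contain the DeletedDatabaseRecord column.
    const tables = await pool.request().query(
      `SELECT TABLE_SCHEMA, TABLE_NAME
         FROM INFORMATION_SCHEMA.COLUMNS
        WHERE COLUMN_NAME = 'DeletedDatabaseRecord'`
    );

    for (const row of tables.recordset) {
      const table = `[${row.TABLE_SCHEMA}].[${row.TABLE_NAME}]`;
      const result = await pool
        .request()
        .query(`DELETE FROM ${table} WHERE DeletedDatabaseRecord = 1`);
      console.log(`${table}: removed ${result.rowsAffected[0]} rows`);
    }
  } finally {
    await pool.close();
  }
}
```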
In a Node.js application I have to maintain a "who was online in the last N minutes" state. Since there are potentially thousands of online users, for performance reasons I decided not to update my PostgreSQL user table for this task.
I chose to use Redis to manage the online status. It's very easy and efficient.
But now I want to run complex queries against the user table, sorted by online status.
I was thinking of creating an online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
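Roughly, what I have in mind looks like the sketch below (simplified; it assumes ioredis and node-postgres, and the key and table names are just placeholders).

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const pg = new Pool();
const ONLINE_WINDOW_MS = 5 * 60 * 1000; // "online in the last N minutes"

// Called on every user request: record the last-seen time in a sorted set.
export async function touchUser(userId: number): Promise<void> {
  await redis.zadd("online_users", Date.now(), String(userId));
}

// Run once a minute: snapshot the currently-online ids into Postgres.
export async function snapshotOnlineUsers(): Promise<void> {
  const cutoff = Date.now() - ONLINE_WINDOW_MS;
  await redis.zremrangebyscore("online_users", 0, cutoff); // drop stale entries
  const ids = await redis.zrange("online_users", 0, -1);

  const client = await pg.connect();
  try {
    await client.query("BEGIN");
    await client.query("TRUNCATE online_snapshot"); // placeholder table name
    if (ids.length > 0) {
      await client.query(
        "INSERT INTO online_snapshot (user_id) SELECT unnest($1::bigint[])",
        [ids.map(Number)]
      );
    }
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```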
After the table is refilled, will the next query referencing the online table take a big hit from recreating or reloading its indexes?
Does anyone know a better solution?
I had to solve almost this exact same issue, but I took a different approach because I didn't like the issues caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (ZeroMQ in my case, but any queueing system should work, or a stream-processing facility like Amazon Kinesis, the alternative I looked at). I then inserted the data in batches into a second table (not the users table). I don't delete or update that table; only inserts and queries are allowed.
Doing things this way preserved the ability to do joins between the last-online data and the users table without bogging down the database or creating many updates on the users table. It has the side effect of giving us a lot of other useful data.
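As a rough illustration of the shape of this (not my actual code; it assumes node-postgres, and the presence_log table and its columns are placeholders):

```typescript
import { Pool } from "pg";

const pg = new Pool();

interface PresenceEvent {
  userId: number;
  seenAt: Date;
}

// Consume a batch of presence events from the queue and append them to a
// dedicated, insert-only table. The users table itself is never touched.
export async function flushPresenceBatch(events: PresenceEvent[]): Promise<void> {
  if (events.length === 0) return;
  await pg.query(
    `INSERT INTO presence_log (user_id, seen_at)
     SELECT * FROM unnest($1::bigint[], $2::timestamptz[])`,
    [events.map(e => e.userId), events.map(e => e.seenAt)]
  );
}

// Example query joining the append-only table with users:
// latest activity per user in the last 5 minutes, newest first.
export async function recentlyOnlineUsers() {
  const { rows } = await pg.query(
    `SELECT u.id, u.name, MAX(p.seen_at) AS last_seen
       FROM users u
       JOIN presence_log p ON p.user_id = u.id
      WHERE p.seen_at > now() - interval '5 minutes'
      GROUP BY u.id, u.name
      ORDER BY last_seen DESC`
  );
  return rows;
}
```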
One thing to note, which I have thought about when considering other solutions to this problem, is that your users table is transactional data (OLTP), while the latest-online information is really analytics data (OLAP). So if you have a data warehouse, data lake, big data platform, or whatever term of the week you want to use for storing and querying this type of data, that may be a better solution.
We are looking for a technology stack that will meet the following criteria.
We will have around 10 million customers.
Each customer will have around 20 MB+ of data.
Each customer's data will be updated every day.
We need to store the data for more than six months.
We may need to query the data at any time within that six-month span.
Currently we are thinking of using Cassandra, but since the maximum storage per Cassandra node should be kept under roughly 3 TB, we are looking for other alternatives to use with or without Cassandra.
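For a rough sense of scale, a back-of-the-envelope estimate (assuming the 20 MB is the total footprint per customer, and ignoring replication, compression and historical growth):

```latex
10^{7}\ \text{customers} \times 20\ \text{MB} \approx 200\ \text{TB of live data},
\qquad
\frac{200\ \text{TB}}{3\ \text{TB per node}} \approx 67\ \text{nodes before replication}.
```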
Well, I don't know if my suggestion applies to your case. We had a similar situation with one of our products: a blob field was created to store binary data, such as PDF documents, which made the database grow considerably.
The solution we chose was to create a second database as a repository for records older than one year. On the application server there's a service running which:
1) Copies records older than one year, from specific tables, to this second database;
2) Deletes those records from the main database once we have a copy on the other side;
3) Directs queries that need data older than one year to this second database.
Sure, we had to make some changes in the code to adapt to this setup, but it has been running well so far.
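In rough form, the service does something like the sketch below (simplified for illustration; it assumes PostgreSQL with node-postgres, the table names are placeholders, and a real version needs batching and error handling).

```typescript
import { Pool } from "pg";

const mainDb = new Pool({ connectionString: process.env.MAIN_DB_URL });
const archiveDb = new Pool({ connectionString: process.env.ARCHIVE_DB_URL });

// Move documents older than one year from the main database to the archive.
// Only delete from the main database after the copy has succeeded.
export async function archiveOldDocuments(): Promise<void> {
  const { rows } = await mainDb.query(
    `SELECT id, created_at, payload
       FROM documents
      WHERE created_at < now() - interval '1 year'
      LIMIT 1000`                      // work in batches
  );

  for (const row of rows) {
    await archiveDb.query(
      `INSERT INTO documents (id, created_at, payload)
       VALUES ($1, $2, $3)
       ON CONFLICT (id) DO NOTHING`,   // safe to re-run
      [row.id, row.created_at, row.payload]
    );
    await mainDb.query(`DELETE FROM documents WHERE id = $1`, [row.id]);
  }
}
```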
You can try ScyllaDB. It's a C++ reimplementation of Cassandra at 10x the speed. Scylla supports 10 TB/node, and there are examples of larger amounts per node. Full disclosure: I work there, but I am speaking from experience.
You can definitely consider storing just the metadata in the database and the blobs on separate nodes outside of it, but that is more complex, and Scylla can store it all together. A similar system is already in production, and we hope that user will eventually open-source it.
We are doing a CRM data migration in order to keep two CRM systems in sync, and to remove historical data from the primary CRM. The target CRM was created using the source as its base. While we migrate the data, we keep the records' GUIDs the same in order to maintain data integrity. This approach expects that, in the target system, each GUID is still available to assign to the new record. No new records are created directly in the target system except emails, and those are very low in number.
Apart from that, there are ways in which the system creates its own GUIDs. For example, when we move a newly created entity to the target using a Solution, the platform will not keep the GUIDs of the entity and its attributes and will generate its own, since we have no control over this. Some records that are created internally by the platform are also assigned new GUIDs. So if we do not have control over GUID creation in the target system (although the number is very small), I fear the situation where the source system has a GUID that the target has already consumed, and the data migration will then fail with errors.
My question: is there any possibility that the above can happen? Because if it happens to us, the whole migration solution will lose its value.
SQL Server's NEWID() generates a 128-bit ID. All IDs generated on the same machine are guaranteed to be unique, but because yours have been generated across multiple machines, there is no such guarantee.
That being said, from this source on GUIDs:
...for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
So the answer is: yes, there is a chance of collision, but it is so astronomically low that most consider the answer to effectively be no.
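To see where that quoted figure comes from: a version 4 UUID has 122 random bits, so by the standard birthday approximation the probability of at least one collision among n randomly generated IDs is roughly

```latex
p \approx \frac{n^{2}}{2 \cdot 2^{122}}
\quad\Rightarrow\quad
n \approx \sqrt{2 \cdot 2^{122} \cdot 10^{-9}} \approx 1.03 \times 10^{14}
```

which gives about 103 trillion IDs for a one-in-a-billion chance of a duplicate, matching the quote above.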
I am trying to add lucene.net to a project where searching is getting more complicated, but the data is transactional (the tables are modified frequently, e.g. new rows are inserted or the fields used in the Lucene index are changed).
Is lucene.net a good fit for searching here?
How can I find the modified fields and update the Lucene index that has already been created? And when the index contains documents that have been deleted from the table, how can I remove them from the index?
Right now, while the page loads, I:
remove index entries that are no longer present in the table, based on a unique field;
insert a document if it does not exist in the index, otherwise update every indexed document that matches the table's unique field.
Because of all this removing/inserting/updating of the index, the page takes longer than normal to load.
How should I proceed?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely thread-safe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing hundreds of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are are to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also thread-safe, so you can also keep an instance of an IS through which all your searches go. Then recreate it periodically (e.g. with a timer) and swap it in using Interlocked.Exchange.
I previously created a small framework to isolate this from the app and make it reusable.
Caveat
Having said that... hosting this inside IIS does raise some problems. IIS will occasionally restart your app. It will also (by default) start the new instance before stopping the existing one, then swap them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So you are actually better off creating a separate service to host your Lucene bits. A simple self-hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk I/O load). It also means the service can be accessed from other parts of your system, tested separately, etc.
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service in-house and in C#, it stays simple: you have control over the capabilities, and the shape of the API can be tuned to your needs.
No "right" answer. You'll have to make choices based on your situation.
You should preferably index according to some schedule (periodically). The easiest approach is to keep the date of the last indexing run, then query for all the changes since then and index new records, update changed ones and remove deleted ones. To keep track of entries removed from the database, you will need a log of deleted records with the date each was removed. You can then query by that date to find what needs to be removed from the Lucene index.
Now simply run that job every 2 minutes or so.
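As an illustration of the flow only, here is a sketch of that job (written in TypeScript with node-postgres; the table names are placeholders, and the pushToIndex/removeFromIndex callbacks stand in for whatever hosts your Lucene index; the same shape applies in C#):

```typescript
import { Pool } from "pg";

const db = new Pool();

// Run every couple of minutes: index everything that changed since the last
// run, and remove everything logged as deleted since then.
export async function runIndexJob(
  lastIndexedAt: Date,
  pushToIndex: (rows: Record<string, unknown>[]) => Promise<void>,
  removeFromIndex: (keys: string[]) => Promise<void>
): Promise<Date> {
  const now = new Date();

  // New or changed rows since the last run (placeholder table name).
  const changed = await db.query(
    `SELECT * FROM products WHERE updated_at > $1`,
    [lastIndexedAt]
  );
  await pushToIndex(changed.rows);

  // Deletions are found via the deletion log, not the live table.
  const deleted = await db.query(
    `SELECT record_key FROM deleted_records WHERE deleted_at > $1`,
    [lastIndexedAt]
  );
  await removeFromIndex(deleted.rows.map(r => r.record_key));

  return now; // persist this as the new "last indexed" timestamp
}
```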
That said, Lucene.net is not really suited for web applications; you should consider using ElasticSearch, Solr or AzureSearch instead: basically, a server that can handle load and multithreading better.
I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time, or better, is triggered before the read operations start.
I'm doing all my processing from Node.js, so one option I see is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I couldn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match, which selects the documents from the last month, followed by an $out. The $out operator specifies that the results of the aggregation are not sent to the application/shell but are instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. This also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those that can't use indexes. And since your data won't change between aggregations, your reports won't have inconsistencies caused by data changing while they run.
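From Node.js, that looks roughly like the sketch below (using the official mongodb driver; the database, collection and field names are placeholders):

```typescript
import { MongoClient } from "mongodb";

// Copy last month's logs into a separate reporting collection using
// $match + $out, then run the heavy aggregations against that copy.
async function buildReportingCollection(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const logs = client.db("mydb").collection("logs");

    const oneMonthAgo = new Date();
    oneMonthAgo.setMonth(oneMonthAgo.getMonth() - 1);

    // $out replaces the target collection with the matched documents.
    // toArray() is only there to force the cursor to execute.
    await logs
      .aggregate([
        { $match: { createdAt: { $gte: oneMonthAgo } } },
        { $out: "logs_report" },
      ])
      .toArray();

    // Heavy reporting queries now hit logs_report, not the live collection.
  } finally {
    await client.close();
  }
}
```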
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend building a proper replica-set (consisting of a primary, a secondary and an arbiter) and leaving replication active at all times. Not only will that make sure your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.