How to delete and create graphs in titan/cassandra? - groovy

I am just wondering how to delete an entire graph from titan/cassandra?
I've set up Titan with Elasticsearch, Cassandra and Rexster from the website. I created a simple graph called "graph". The problem I encountered is that I did not realize that
Once an index has been created for a key, it can never be removed.
So when experimenting I made a lot of random indexes, good and bad. I tried renaming the graph in rexster-cassandra-es.xml, thinking it would lose the reference, but it just continued under a different Rexster path.
Also, how do you create new databases? When I started Rexster it just created a database named "graph" for me, and I noticed that it lets you choose which database to use.
Thank you very much.

You can do a few things, but the easiest might be to simply shut down Cassandra/Elasticsearch/Rexster and delete the data directories. Restart it all and you get a fresh start. You could also drop the "titan" keyspace in Cassandra and delete the indices in Elasticsearch; the instructions for those approaches are in the respective documentation for each component.
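If you would rather script the keyspace/index route than delete directories by hand, here is a minimal sketch using the DataStax Java driver and Elasticsearch's REST API. It assumes a local cluster and that both the keyspace and the Elasticsearch index are named "titan"; check your configuration first.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class WipeTitan {
        public static void main(String[] args) throws Exception {
            // Drop Titan's keyspace in Cassandra (keyspace name "titan" is assumed).
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                session.execute("DROP KEYSPACE IF EXISTS titan;");
            }

            // Delete the matching Elasticsearch index over its REST API
            // (index name "titan" is an assumption; check _cat/indices first).
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://localhost:9200/titan").openConnection();
            conn.setRequestMethod("DELETE");
            System.out.println("Elasticsearch responded: " + conn.getResponseCode());
            conn.disconnect();
        }
    }

Shut Rexster/Titan down before running something like this, so nothing recreates the keyspace while you are wiping it.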

Related

How to optimize knex migration?

I'm working on a project that has been using bookshelfjs (with the knexjs migration system) since its beginning (a year and a half ago).
We now have a little under 80 migrations, and it's starting to take a long time (more than 2 minutes) to run them all. We deploy using continuous integration, so the migrations have to run in both the test process and the deployment process.
I'd like to know how to optimize that. Is it possible to start from a clean state? I don't care about losing rollback possibilities. The project is much more mature now and we don't need to iterate much on the data structure anymore.
Is there any best practice? I'm coming from the Doctrine (PHP) world and it's really different.
Thanks for your advice!
Create a database dump from your current database state.
Always use that dump to initialize new databases for tests.
Run migrations on top of the already-initialized database.
That way the migration system applies only newly added migrations on top of the existing initial dump.
When using knex.schema.createTable to create a table with foreign keys referencing another table, and later running knex migrate:latest, the referenced table has to be created before the table that uses its keys. For example, if table1 has a foreign key key1 referencing table2, you can make sure table2 is processed first by prefixing numbers to the migration file names, so your migrations folder contains 1table2.js and 2table1.js. This looks hacky and not pretty, but it works!

Updating Lucene index frequently causing performance degradation

I am trying to add Lucene.net to my project, where searching is getting more complicated. However, the tables are modified frequently (new data is inserted, or fields that are used in the Lucene index are changed).
Is it a good idea to use Lucene.net for searching here?
How can I find modified fields and update the corresponding entries in the Lucene index that has already been created? The Lucene index also contains documents that have been deleted from the table; how can I remove those from the index?
While loading, right now:
I remove index entries that are no longer present in the table, matched on a unique field
I insert an entry if it does not exist, otherwise I update all entries matching the table's unique field
Loading the page is taking more time than normal because of these remove/insert/update index calls.
How can I proceed with it?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely thread-safe. NEVER open multiple IWs on the same folder. (A minimal sketch of this pattern follows this list.)
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are are to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near-real-time) pattern. NEVER create a separate IndexReader on an index folder that is also held open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also thread-safe, so you can keep a single IS through which all your searches go. Recreate it periodically (e.g. with a timer), then swap it in using Interlocked.Exchange.
I previously created a small framework to isolate this from the app and make it reusable.
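The answer targets Lucene.net, whose API mirrors Java Lucene almost name-for-name, so here is a minimal sketch of the singleton-writer/NRT-searcher pattern written against the Java Lucene API. The class and field names are mine, and production code would use Lucene's SearcherManager, which reference-counts readers instead of closing the old one eagerly.

    import java.nio.file.Paths;
    import java.util.concurrent.atomic.AtomicReference;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public final class SearchService {
        // Exactly one IndexWriter per index folder, process-wide.
        // IndexWriter is thread-safe; share it, never duplicate it.
        private final IndexWriter writer;
        // Atomically swapped searcher: the Java analogue of the
        // Interlocked.Exchange swap described above.
        private final AtomicReference<IndexSearcher> searcher = new AtomicReference<>();

        public SearchService(String indexPath) throws Exception {
            writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)),
                                     new IndexWriterConfig(new StandardAnalyzer()));
            refreshSearcher();
        }

        // updateDocument is an atomic delete-then-add keyed on the unique term.
        public void upsert(String id, Document doc) throws Exception {
            writer.updateDocument(new Term("id", id), doc);
        }

        public void delete(String id) throws Exception {
            writer.deleteDocuments(new Term("id", id));
        }

        // NRT reader taken FROM the writer, never a fresh IndexReader
        // opened on the same folder. Call this from a timer.
        public void refreshSearcher() throws Exception {
            IndexSearcher fresh = new IndexSearcher(DirectoryReader.open(writer));
            IndexSearcher old = searcher.getAndSet(fresh);
            if (old != null) {
                old.getIndexReader().close(); // eager close; fine for a sketch
            }
        }

        public IndexSearcher currentSearcher() {
            return searcher.get();
        }
    }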
Caveat
Having said that... hosting this inside IIS does raise some problems. IIS will occasionally restart your app. It will also (by default) start the new instance before stopping the existing one, then swap them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So, you are actually better off creating a separate service to host your Lucene bits. A self-hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk IO load). It also means the service can be accessed from other parts of your system, tested separately, etc.
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service in-house and in C#, it stays simple, you retain control over the capabilities, and the shape of the API can be tuned to your needs.
No "right" answer. You'll have to make choices based on your situation.
You should preferably index on a schedule (periodically). The easiest approach is to keep the date of the last index run and then query for all the changes since then: index new records, update changed ones, and remove deleted ones. To keep track of removed entries you will need a log of deleted records in the database, with the date each was removed. You can then query using that date for what needs to be removed from the Lucene index.
Now simply run that job every 2 minutes or so.
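As a rough illustration of that job, here is a hedged JDBC sketch in the same Java register as the earlier pattern. The items and deleted_items tables and their columns are invented for the example, and SearchService refers to the sketch above.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class IndexSyncJob {
        // Run every couple of minutes; lastRun is the previous run's start time.
        public void sync(Connection db, SearchService search, Timestamp lastRun) throws Exception {
            // 1. Upsert rows changed since the last run (schema names are illustrative).
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT id, title FROM items WHERE updated_at > ?")) {
                ps.setTimestamp(1, lastRun);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        Document doc = new Document();
                        doc.add(new StringField("id", rs.getString("id"), Field.Store.YES));
                        doc.add(new TextField("title", rs.getString("title"), Field.Store.YES));
                        search.upsert(rs.getString("id"), doc);
                    }
                }
            }
            // 2. Remove documents whose rows were logged as deleted since the last run.
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT id FROM deleted_items WHERE deleted_at > ?")) {
                ps.setTimestamp(1, lastRun);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        search.delete(rs.getString("id"));
                    }
                }
            }
            search.refreshSearcher(); // make the changes visible to queries
        }
    }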
That said, Lucene.net is not really suited for web applications; you should consider using ElasticSearch, SOLR or AzureSearch, basically servers that can handle load and multithreading better.

How do production Cassandra DBAs do table changes & additions?

I am interested in how production Cassandra DBAs' processes change when using Cassandra and performing many releases over a year. During those releases, columns in tables change frequently, and so does the number of Cassandra tables, as new features and queries are supported.
In the relational DB, in production, you create the 'view' and BOOM you get the data already there - loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND have to write/run a script to copy all the required data into that table? Can a production level Cassandra DBA provide some pointers on their processes?
We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current formats. When changes are made, I update that file. This lets other developers know what the current tables look like, without having to use SSH or DevCenter. This also has the added advantage of giving us a file that allows us to restore our schema with a single command.
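For reference, that single restore command can be cqlsh -f schema.cql. If you would rather replay the file from application tooling, a rough sketch with the DataStax Java driver could look like the following; the file name, contact point, and the naive semicolon split are all assumptions.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SchemaRestore {
        public static void main(String[] args) throws Exception {
            // Naive split on ';' only works if the schema file has no
            // semicolons inside strings; fine for plain CREATE statements.
            String cql = new String(Files.readAllBytes(Paths.get("schema.cql")));
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                for (String stmt : cql.split(";")) {
                    if (!stmt.trim().isEmpty()) {
                        session.execute(stmt.trim());
                    }
                }
            }
        }
    }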
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.

Duplicate Cassandra node from one node to two nodes without repair?

Is there an easy way to move from having a single node to two nodes that both have a full copy of the data. In the past when I added a node, I had to wait for the second node to take half of the data and then I had to muck around to make them full replicas again.
Yes, you can copy the SSTables over directly; that way, when the node starts up, it can access its data without needing streaming. This is how it should work: go to the data folder on your old node and copy the directories matching the keyspaces you want to migrate, as well as the schema directories inside system. Copy these into your new node's data directory (creating it if necessary). Then configure the new node and have it join the cluster as usual. I believe you'll have access to your data at that point without further migrations.
See a blog post on this topic for more details, especially on using a more refined approach with the help of the sstableloader tool.

Cassandra Database Problem

I am using the Cassandra database for a large-scale application, and I am new to it. I have a database schema for a particular keyspace, for which I created columns using the Cassandra Command Line Interface (CLI). When I copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column; I get a message that zero rows are present. But the files are there, with names like XXXX-Data.db, XXXX-Filter.db, and XXXX-Index.db. Can anyone tell me how to access the columns for existing datasets?
(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) if you didn't also copy the schema definition it will ignore data files for unknown column families.
For what you are trying to achieve, it is probably better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations
