I'm having a few issues with designing a database in Azure at the moment, down to the following:
SQL2012 Auto-Increment keys can/will jump by 1000 fairly regularly, related to the new "feature" of SQL 2012, as documented here (link). This has been closed on "By Design" in Connect (link)
The recommendation is to use either a startup flag to avoid this behaviour (cannot do with Azure), or to use Sequences to generate incrementing numbers instead.
However, SEQUENCE is unsupported in Azure DB. It in fact alternates between being an active issue and a "won't fix" issue on Connect (link)
So, my question is how to actually go about having a field auto-increment by 1 on insert to an Azure DB table, whilst avoiding large gaps.
I did think about using a Trigger instead, and then using that to find the existing Max value. Didn't seem clean. I also thought it would cause concurrency issues.
I'm happy to have a surrogate key here, but without Sequences I am wondering what the recommended route would be to actually generate the value for the surrogate at insert time.
Any advice appreciated.
Edit: Please note I am using the older "Web/Business" type of DB rather than the new tiers; I don't know if that will make a difference to any answers.
The reason your feedback has been marked as it was is because auto-increment by 1 becomes a very hard problem to solve at a database layer at scale. If you "shard" your database (split it to cope with load) how would each shard know which number to increment to? This should be extracted from the database and built into your application logic.
Without knowing the background requirement to having an incremental number it's hard to advise of the right solution. Is it for uniqueness or for a sequence? If it's for a sequence does it really matter that it isn't by 1 each time as long as the number is (a) unique and (b) increases on each insert? Could you make the setting of this sequence an offline process? That is - use insert date / time via a batch process to assign a sequence number?
If it's for uniqueness use a GUID or similar.
Related
I have a kind of requirement but not able to figure out how can I solve it. I have datasets in below format
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparison, aggregation and queries over multiple row like
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123&GradeA
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
Also need to further do compute. What format should I chose for dataset
and what database/tools should I use?
I don't Relational Database is useful here. I am not able to solve it with Solr/Elastic if you have any ideas, please give a brief.Or any other tool Spark, hadoop, cassandra any heads?
I am trying out things but any help is appreciated.
Choosing the right technology is highly dependent on things related to your SLA. things like how much can your query have latency? what are your query types? is your data categorized as big data or not? Is data updateable? Do we expect late events? Do we need historical data in the future or we can use techniques like rollup? and things like that. To clarify my answer, probably by using window functions you can solve your problems. For example, you can store your data on any of the tools you mentioned and by using the Presto SQL engine you can query and get your desired result. But not all of them are optimal. Furthermore, usually, these kinds of problems can not be solved with a single tool. A set of tools can cover all requirements.
tl;dr. In the below text we don't find a solution. It introduces a way to think about data modeling and choosing tools.
Let me take try to model the problem to choose a single tool. I assume your data is not updatable, you need a low latency response time, we don't expect any late event and we face a large volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you wanna query on a particular ID), so solutions like parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. Both the first and second requirements and the last requirement, count on ID as an identifier part and it seems there is nothing like join and global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the cluster part; For each ID, events are ordered based on the time.
The third requirement is a bit vague. You wanna result on all data? or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the mentioned notes, it seems we should choose a tool that has good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are things that currently I can remember them. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Based on the storing manner (segment files) that Druid follows, by choosing the right data model, you can have analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. But by using rollup, we lose raw data and we need them.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kind of filter aggregations, we can have a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine we can answer the first and second questions with ES, but for now, I can't make a query in my mind. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can answer our requirements, but there is a lot of 'if's on the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
I am migrating legacy application from Oracle to CosmosDB (MongoDB 3.6 API). At current stage of the project and also with respect to current goals refactoring ID from sequence numbering to GUIDs cannot be done. I am aware of many reasons why this is bad design but it is what it is - I need sequence generation :(
I am trying to find a solution that will be reliable. What came to my mind is to create a collection sequences and inside of it keep documents such is seq_foo and seq_bar and so on. Upon each insert I would first do findAndModify and then insert with custom id.
What is the question I currently struggle finding answer: is findAndModify atomic if used on single document with eventual consistency?
In case it is not do I need to use multiple DB Accounts with different consistency levels or is there another solution for this problem?
I'm currently playing with couchDB a bit and have the following scenario:
I'm implementing an issue tracker. Requirement is that each issue document has (besides it's document _id) a unique numerical sequential number in order to refer to it in a more appropriate way.
My first approach was to have a view which simply returns the count of unique issue documents currently stored. Increment that value on the client side by 1, assign it to my new issue and insert that.
Turned out to be a bad idea, when inserting multiple issues with ajax calls or having multiple clients adding issues at the same time. In latter case is wouldn't be even possible without communication between clients.
Ideally I want the sequential number to be generated on couch, which is afaik not possible due to conflicting states in distributed systems.
Is there any good pattern one could use (maybe on the client side) to approach this? I feel like this is a standard kind of use case (thinking of invoice numbers, etc).
Thanks in advance!
You could use a separate document which is empty, though it only consists of the id and rev. The rev prefix is always an integer, so you could use it as your auto incrementing number.
Just make a POST to your document, this will increase the rev and return it. Then you can use this generated value for your purpose.
Alternative way:
Create a separate document, consisting of value and lock. Then execute something like: "IF lock == true THEN return ELSE set lock = true AND increase value by 1", then do a GET to retrieve the new value and finally set lock = false.
I agree with you that using a view that gives you a document count is not a great idea. And it is the reason that couchdb uses a uuid's instead.
I'm not aware of a sequential id feature in couchdb, but think it's quite easy to write. I'd consider either:
An RPC (eg. with RabbitMQ) call to a single service to avoid concurrency issues. You can then store the latest number in a dedicated document on a specific non distributed couchdb or somewhere else. This may not scale particularly well, but you're writing a heck of an issue tracking system before this becomes an issue.
If you can allow missing numbers, set the uuid algorithm on your couch to sequential and you are at least good until the first buffer overflow. See more info at: http://couchdb.readthedocs.org/en/latest/config/misc.html#uuids-configuration
I'm making the switch to MongoDB from MySQL. A familiar architecture to me for a very basic users table would have auto-incrementing of the uid. See Mongo's own documentation for this use case.
I'm wondering whether this is the best architectural decision. From a UX standpoint, I like having UIDs as external references, for example in shorter URLs: http://example.com/users/12345
Is there a third way? Someone in IRC Freenode's #mongodb suggested creating a range of IDs and caching them. I'm unsure of how to actually implement that, or whether there's another route I can go. I don't necessarily even need the _id itself to be incremented this way. As long as the users all have a unique numerical uid within the document, I would be happy.
I strongly disagree with author of selected answer that No auto-increment id in MongoDB and there are good reasons. We don't know reasons why 10gen didn't encourage usage of auto-incremented IDs. It's speculation. I think 10gen made this choice because it's just easier to ensure uniqueness of 12-byte IDs in clustered environment. It's default solution that fits most newcomers therefore increases product adoption which is good for 10gen's business.
Now let me tell everyone about my experience with ObjectIds in commercial environment.
I'm building social network. We have roughly 6M users and each user has roughly 20 friends.
Now imagine we have a collection which stores relationship between users (who follows who). It looks like this
_id : ObjectId
user_id : ObjectId
followee_id : ObjectId
on which we have unique composite index {user_id, followee_id}. We can estimate size of this index to be 12*2*6M*20 = 2GB. Now that's index for fast look-up of people I follow. For fast look-up of people that follow me I need reverse index. That's another 2GB.
And this is just the beginning. I have to carry these IDs everywhere. We have activity cluster where we store your News Feed. That's every event you or your friends do. Imagine how much space it takes.
And finally one of our engineers made an unconscious decision and decided to store references as strings that represent ObjectId which doubles its size.
What happens if an index does not fit into RAM? Nothing good, says 10gen:
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
That means reads are slow. Lock contention goes up. Writes gets slower as well. Seeing lock contention in 80%-nish is no longer shock to me.
Before you know it you ended up with 460GB cluster which you have to split to shards and which is quite hard to manipulate.
Facebook uses 64-bit long as user id :) There is a reason for that. You can generate sequential IDs
using 10gen's advice.
using mysql as storage of counters (if you concerned about speed take a look at handlersocket)
using ID generating service you built or using something like Snowflake by Twitter.
So here is my general advice to everyone. Please please make your data as small as possible. When you grow it will save you lots of sleepless nights.
Josh,
No auto-increment id in MongoDB and there are good reasons.
I would say go with ObjectIds which are unique in the cluster.
You can add auto increment by a sequence collection and using findAndModify to get the next id to use. This will definitely add complexities to your application and may also affect the ability to shard your database.
As long as you can guarantee that your generated ids will be unique, you will be fine.
But the headache will be there.
You can look at this post for more info about this question in the dedicated google group for MongoDB:
http://groups.google.com/group/mongodb-user/browse_thread/thread/f57b712b2aae6f0b/b4315285e689b9a7?lnk=gst&q=projapati#b4315285e689b9a7
Hope this helps.
Thanks
So, there's a fundamental problem with "auto-increment" IDs. When you have 10 different servers (shards in MongoDB), who picks the next ID?
If you want a single set of auto-incrementing IDs, you have to have a single authority for picking those IDs. In MySQL, this is generally pretty easy as you just have one server accepting writes. But big deployments of MongoDB are running sharding which doesn't have this "central authority".
MongoDB, uses 12-byte ObjectIds so that each server can create new documents uniquely without relying on a single authority.
So here's the big question: "can you afford to have a single authority"?
If so, then you can use findAndModify to keep track of the "last highest ID" and then you can insert with that.
That's the process described in your link. The obvious weakness here is that you technically have to do two writes for each insert. This may not scale very well, you probably want to avoid it on data with a high insertion rate. It may work for users, it probably won't work for tracking clicks.
There is nothing like an auto-increment in MongoDB but you may store your own counters in a dedicated collection and $inc the related value of counter as needed. Since $inc is an atomic operation you won't see duplicates.
The default Mongo ObjectId -- the one used in the _id field -- is incrementing.
Mongo uses a timestamp ( seconds since the Unix epoch) as the first 4-byte portion of its 4-3-2-3 composition, very similar (if not exactly) the same composition as a Version 1 UUID. And that ObjectId is generated at time of insert (if no other type of _id is provided by the user/client)
Thus the ObjectId is ordinal in nature; further, the default sort is based on this incrementing timestamp.
One might consider it an updated version of the auto-incrementing (index++) ids used in many dbms.
In my travels in Oracle, the 'stragg' function, or 'String Aggregator' was life-saving when I had to create dynamic SQL queries on the fly.
You can read up about it here: http://www.oratechinfo.co.uk/delimited_lists_to_collections.html
The basic use of it was:
select stragg(fruit) from food;
fruit
-----------
apple,pear,banana,strawberry
1 row(s) returned
So simple to use, concatenating chr(13) turned it into a long list, and selecting information from system tables gave a 5 minute solution to dynamically generated SQL, e.g. auditing triggers.
Now I've been charged with transferring oracle functionality related to auditing into Sybase, and a function similar to Stragg would be ideal for this purpose.
E.g.
select #my_table = 'table_of_fruit'
select 'insert into '+#mytable+'_copy (' +char(10)
+ stragg(c.name) +char(10)
+ 'select '
+ stragg('inserted.'+c.name) + char(10)
+ 'from '+#mytable
from syscolumns c
where objectid(#mytable) = c.id
------------------------------------------
insert into table_of_fruit_copy
(fruit, sweetness, price)
select fruit, sweetness,price
from inserted
Done. Simple.
Except I don't know how to get a string-aggregation function working in Sybase.
Does anyone know of an attempt to do this kind of thing, or code that could work the same as stragg that could be used in this way?
The alternative at the moment is printing code based on complex cursors and such (sample LOC: 500), or select statements combining static strings and columns from user tables (sample LOC: 200). Stragg would severely reduce the complexity of this code, and would be a great deal of help in the future (sample LOC: who knows, maybe 50?)
p.s. I'm calling these selects through a shell script then piping them to file, then running the file through iSQL. Not the nicest solution, but it's better than the alternatives.
There are three separate answers
Question
You have made comments about simplicity, which need to be addressed before we get to the solution.
It is a common requirement to be able to take a delimited list of values, say A,B,C,D, and treat this data like it was a set of rows in a table, or vice versa
This one of the Top Ten Worst Programming Practices I read about recently.
In general, Sybase types tend to be somewhat more academically and Relationally qualified than Oracle types, so we simply do not do that sort of thing in SybaseLand or DB2Land.
In 20 years of working with Sybase, I have had to code that as part of my project just once, and that was for non-technical Auditor who loaded the result set into MS Access.
On the other hand, I have had to code that at least 12 times, when producing text files for importation into Oracle databases (fulfilling external requirements is outside my project, but I satisfy any such requirement free). Obviously the target databases were sub-standard and non-relational (loading a column with more than one datum breaks 1NF, and creates Update Anomalies), which is typical of what Oracle types have to do to get some speed.
Therefore, no, it is not simplicity, at least in the sense of that principle. It is by definition, complexity.
Your reference to "arrays" is incorrect. All commercial dbms handle arrays, according to the ISO/IEC/ANSI SQL (STRAGGR and LIST operators are non-standard SQL, therefore not SQL). Sybase is very strong in processing arrays. If it was an array, you would not need special hand coding to handle it (and you do, as per your question). This is not an array, there is no definition to the cells. This is a single concatenated scalar string.
Pivoting is an entirely different process, which uses set-processing; it does not require row-processing. (I understand on good authority, that Oracle is hopeless at scalar subqueries, and thus Oracle people are used to writing them as [very inefficient] joins or inline views, and then filtering: all that can be elevated to set-processing via scalar subqueries, and it will perform much faster. Particularly your Pivots.)
Even the author in your link posts as follows. Please familiarise yourself with the caveats:
It's as simple as this: If you want to have a system with no logical limitation in the number of data elements passed to a given process, then forget the following mechanisms! They are simply the wrong way to approach the problem.
Therefore, know whatever you are doing is sub-standard, non-relational, and limited; and go ahead with your eyes open. No use pretending that: it will not break; it is not limited; it is an "array"; or that Sybase doesn't have a neat little function that Oracle has. Any professional will see through all that. And if the string length is exceeded, for God's sake send some indicator back to the caller ("!Exceeded" in the string) identifying that condition.
Essentially you are turning the set-processing engine on its head, and forcing it into row-processing mode, so it will be very slow. A WHILE loop is distinctly faster than a cursor, but both are in the same class, row-processors.
The alternative at the moment is printing code based on complex cursors and such
What 200 or 500 LoC ? It is possible I am missing something, but my code is the same few lines of code identified under "Using a Table Function" in your link. Maximum 20, if you count nice formatting; the loop; initialisation; error handling. There is nothing "complex" about it. Do the exact reverse to cancatenate a single string from multiple rows. We use stored procedures for this (which oracle does not have, really, PL/SQL is a different animal). If you have ASE 15.0.2 or greater, you can use a User Defined Function, which you can then use in place of a column. Stored procs are better for true arrays.
the concatenation operator in Sybase is the plus sign. For reversal (decomposing the CSV string) you need CHARINDEX and SUBSTRING functions
You may need the Function Reference Manual, if for nothing else, to avoid writing code where we have functions.
Likewise, we do not have a RANK() function. We are quite happy with the 4 lines of code requires for the subquery. It is only required for Oracle because subqueries are crippled.
Ok, I have answered your question, Now to address the approaches.
You will be aware that code using Oracle Extensions to the SQL standard will need to be changed.
Sybase is way more automated than Oracle; if you familiarise yourself with its feature set, in many instances, you can get the same result (as you did in Oracle) without writing any code. Writing code-for-code blocks is the chain gang, rock-breaking method of building roads, in the context of bulldozers. Even if your company had good reason to use that method, you need to the aware that features work quite differently, eg. triggers, which is why I am posting so much detail.
Another issue that will annoy you is that Oracle isn't really ANSI SQL compliant (stretches the definitions in many places, in order to appear to be compliant), and Sybase, given its customer base, is rigidly SQL compliant. So in addition to the same function working differently, or in a different deployment, you need to be aware that code changes may be required to elevate Oracle code to ANSI compliance levels, just to execute on an ANSI SQL compliant platform.
I am not sure if you are trying to write code for the content of a trigger, or if you are trying to capture the changes to a database. I will provide both answers.
Auditing
Capture Changes to Database
We have an very robust, fast and configurable Auditing subsystem, fit for high volumes and banking level auditing requirements. Get your DBA to setup the sybaudit (separate) database, and to configure exactly what changes need to be captured. This facility will perform much faster than any code you or I can write in a trigger (as much as 100 times faster than your row-by-processing required for the above, as it is executed within the engine, within your executing thread). And of course the setup time is a fraction of your coding time.
Triggers
Again, I am not sure exactly what you are trying to achieve, but assuming you want to copy every insert to some table to a COPY of that table (inside the Trigger), that example code you have provided will not work (and I am not counting syntax issues).
Speaking to your example, you need to do way more work, to deal with the different datatypes; column sizes; precisions; scale; etc. And perhaps the UPDATE() function to identify which columns have changed (for an UPDATE trigger of course). If all you are trying to do is convert the various datatypes to strings, check the CONVERT() function.
Triggers are transactional.
Never place row-processing code in a Trigger (it will strangle the table)
You can't place Dynamic SQL in a Trigger.
But in Sybase even that is not necessary. Refer to the User Guide, chapter 19 is devoted to Triggers, with several variations, and examples. Inside the trigger, you should be able to simply:
INSERT table_copy
SELECT column_list -- never use * unless you want the db fixed in cement
FROM inserted
If you are trying to copy the inserts to all tables into one Audit table, then beware. Then I understand your example a little bit more. You will be forcing a highly Symmetric Muli-Threading server (oracle is not a server in the architecture sense) into single-threading through your table. Auditing is multi-threaded.
Last, the use of manual methods of any kind is not required, so if you could expand a bit more on your PS, what the requirement you are trying to fulfil is, I can identify the programmatic method for you. It appears you are trying to use the PL/SQL approach (which is very limited).
Just use the LIST() function. It's a direct replacement for stragg() function. Example:
SELECT LIST(state, ', ') FROM cities
Result:
name
CA, CA, MA, NY