To AutoNumber Or Not To AutoNumber
Posted By : todd sharp Posted At : February 12, 2007 7:45 AM Posted In: SQL, Project Learn
29
I'm in the midst of building my database structure for 'Project Learn' (see this post if you're not sure what that's about) and have come across the age old question: should I use auto numbered integers for my primary keys or should I create the id using createUUID() in CF and pass it in? I seem to go back and forth on this topic so I'll put my pro's and con's out here and leave it up to my readers for further discussion.
Pros to Auto Number:
1. Easy. Inserts will always handle the creation of the auto numbered id field.
2. Integers are more easily indexed by SQL server (performance related).
Cons to Auto Number:
1. Less control over data. I know this is kinda a 'control freak' thing, but I like to have control over the creation of the primary key so that I can pass that key back to my caller (or another function when cascading inserts such as a one to many type insert).
2. Related to Con #1 - have to rely on 'select @@identity' to return the most recently created id - which can be unreliable in a high traffic db. This topic has been debated many times and will likely continue to be. There are other ways to retrieve the id I believe - but I don't know much on this topic (anyone?).
Pros to createUUID():
1. As I stated above, I control the creation of the id and have no question that it is unique and the correct key.
Cons to createUUID():
1. No built in datatype in SQL for CF's UUID. Rob Gonda recently posted a way to create such a datatype so that you can validate this type so there's a workaround available for this con. 2. Non-integers are less easily indexed by SQL server.
So that's my dilemma. I'm also considering a mix of auto numbered id's and UUID pk's (using the UUID's where I need the granular control over the id number and the auto number where that's less important). I'd like to hear everyones opinion on this one. I'm really hoping that this series will be interactive and that as many people as possible feel comfortable joining in the learning and discussions. Feel free to tell me your thoughts.



SQL Server:
SELECT IDENT_CURRENT ('mytablename')
When I have to share data with another database, however, the advantages of UUID or GUID are clear over AutoNumber.
I also have seen a combination approach work well. Instead of having some tables as AutoNumber and others as UUID (which I have done, but meant sometimes I had to look to see which a table was), make them all have both.
You can have the primary key field an AutoNumber, but have another store a UUID. You can still use the UUID field to identify the record. I didn't think up this approach, but it certainly seems to have some advantages.
The flip side is that it's more difficult to do quick record lookups in your isql query tools.
It creates a little more data, but unless the dataset is very large, the impact is negligible.
Gus
You can still use integers without using an identity by using SELECT MAX(id) and using this value in your insert. However you may need to bump up your isolation level - I think to repeatable read (someone correct me if I'm wrong) - to prevent another transaction from committing an insert at the same time, therefore causing a duplicate PK violation.
I don't think you need a look-up table. As Gus says, no benefit for a 1-1 relationship. It also makes it really easy to select by either value and the insert is really easy because you insert the UUID and let the database handle the AutoNumber automatically.
As to the choice of SCOPE_IDENTITY or IDENT_CURRENT, SCOPE_IDENTITY will return the last AutoNumber for any table in the same scope. IDENT_CURRENT will return the last identity for the specified table. To me, this makes it a safer choice, especially if you combine it with locking for the target table.
http://www.sqlteam.com/item.asp?ItemID=283
:)
Personally I use GUIDs. Years ago on slow hard drives and using SQL Server 6.0, GUID could really slow down a database. Now serial ATA drives, 4 GHZ processors and SQL Server 200 or 2005, there really is no difference.
My only recommendation to you is that you setup a job to tune you indexes once a week to maintain the performance of your database. I could bet that a lot of the people that say using guids are slow is because they have never tuned their indexes before in their lives.
In conclusion, although I've been doing DBA work for the last 5 years and I'm sure everyone who's commenting here has had a lot of experience with databases, I would take the advice you get from these comments with a grain of salt. I would go ask a Microsoft Certified DBA with years of experience what their professional opinion is.
Please don't take offense to this, but your logic behind having IDENT_CURRENT being a safer choice than SCOPE_IDENTITY is wrong. I would go read the BOLs about each choice since you have been misinformed.
You are correct that IDENT_CURRENT return the last identity insert for a particular table, but it is for all sessions. So if you have 3 sessions actively inserting into the table, you could get another sessions identity. While locking will prevent this, this will also have a huge impact on performance if there is a lot of inserts into the table. There is no row locking in SQL Server like in Oracle, so locking an entire table is pretty expensive. This is the reason why Microsoft created the SCOPE_IDENTITY function.
I like the use of both as well. I use the autonumber internally for defining relationship, and then use the UUID as the lookup key. This way you only need to expose the UUID to your web pages and internally use the autonumber key in your queries.
I think it might be considered bad design for a DB, but it certainly helps keep my code design easier to manage (of course, I think the same could be said both ways).
What I'm thinking is to use auto number for less manipulated tables - like a table of gender (bad example but you get the point). record 1 would be male and record 2 would be female.
then in the users table the primary key would be a uuid since it is likely to be created via a web form - and i like the control of creating my id when creating via the web. No autonumber for the user table, but the gender value would be stored in the user table.
As I said it's a poor example but I think it gets the point across. Autonumber (only) for some tables and UUID (only) for other tables.
Issues?
No offense taken and I am happy to change what I am doing if it is wrong, but I did look at the BOL.
Specifically,
http://msdn2.microsoft.com/en-us/library/ms190315....
http://msdn2.microsoft.com/en-us/library/ms175098....
Do you have another reference that suggests the use of SCOPE_IDENTITY over IDENT_CURRENT?
My concern may rest from a need to return the value from a separate cfquery call.
Todd,
I have used the approach that you are suggesting before. It worked pretty well, but I had several tables for which arguments could be made for using identity or uuid, so I sometimes have to look to see which I did.
The first link sums it up nicely. Specifically:
"IDENT_CURRENT is not limited by scope and session; it is limited to a specified table. IDENT_CURRENT returns the value generated for a specific table in any session and any scope."
And
"SCOPE_IDENTITY returns values inserted only within the current scope"
Essentially it boils down to this:
IDENT_CURRENT is potentially prone to a race condition where it returns the value from an an insert to the table form a session that isn't yours.
@@IDENTITY is problematic primarily if you are using triggers, where your insert may trigger an second insert in which case @@IDENTITY actually returns the value from the trggered insert, not the initial insert.
SCOPE_IDENTITY is the most specific - it returns the IDENTITY value for the specific table, scope and session from which it is called, and so is generally the safest.
In a nutshell, the wider the scope of where the value may come from, the lower your confidence is that you have the "right" value.
In no case should these be called from separate queries from the one doing the insert - it's much harder to be sure of a predictable result.
Overall, I MUCH prefer Oracle's sequence method for generating ID's, since you essentially grab the value first (at which point the value becomes unavailable for any other inserts), and then apply it to subsequent INSERTS as needed, it's much safer, and more clear what value is being assigned where.
I'm a bit wary of UUID's, partially because I can't imagine they index as efficiently as integer values, and also because I often find it useful to be able to look at a table at get an idea of the order which data was inserted.
This sentence is incorrect: "SCOPE_IDENTITY is the most specific - it returns the IDENTITY value for the specific table, scope and session from which it is called, and so is generally the safest.".
That is the comment misconception about SCOPE_IDENTITY. To quote the BOL (emphasis is mine):
"SCOPE_IDENTITY returns the last identity value generated for *any table* in the current session and the current scope."
To me, that is the essential weakness of that choice. In my systems, ColdFusion is the only user accessing the database and so IDENT_CURRENT seems the safest choice.
However, by being scope limited, scope_identity is still more narrowly targeted than ident_current.
Here's the definition of scope:
"The scope concept is important and is defined as a module of work contained within a stored procedure, trigger, function, or batch".
This essentially limits it to the cfquery you are running right then.
ident_current, while limited to the table you define, doesn't limit it to the scope, therefore if a number of users of your CF app hit the same INSERT query simultaneously, they are all hitting the same table, but in different scopes. You can conceivably have a race condition and get unpredictable results.
The issue isn't whether CF is the only "user" of your database, but if there is only one user of your CF app at any given time. If the latter is the case, then you're fine with how you are doing it, otherwise, you might want to reconsider using scope_identity instead.
Of course I have more than one user.
"This essentially limits it to the cfquery you are running right then. " - not true. I just tested it.
I just ran an insert query and then selected SCOPE_IDENTITY() in a separate cfquery tag and it got the value despite having been in another cfquery block.
To test further, I ran the insert query on one page and then ran the query with SCOPE_IDENTITY() on another page and it still got the correct value. That means that the scope extends past one cfquery - making it (in my opinion) less specific than IDENT_CURRENT().
Again, I really am willing to change but my evidence so far still suggests to me that IDENT_CURRENT() is the safer choice (especially since I can lock a table to ensure a correct result).
The safest way to deal with it is to encapsulate your inserts in stored procedures, where SCOPE_IDENTITY will always be properly limited to the currently executing procedure.
Incidentally, refreshing the SCOPE_IDENTITY page still return the correct result several minutes later (this on my local systems so no other inserts have happened since).
This still suggests to me that IDENT_CURRENT is the safest approach if one doesn't want to use stored procs for every insert.
One place where autonumber is not viable is with database replication. Typical scenario would be multiple regional databases that handle local publishing and then need to sync via replication.
You may want to add a 3rd option, which is let the SQL server generate the UUID/GUID for you.
Facts:
* It's easier to index integers than UUID/GUID
* You cannot use integers if you're looking to scale.
You can have a standby/failover system replicating tables with integer primary keys
You cannot have multiple active clusters if you don't use Universal/Globally unique identifiers
* It's easier to guess an integer than a UUID/GUID
* If you're looking to support multiple databases, you probably do not want to rely on database generated UUID/GUID
* Using UUID gives the application (cf) more control. You can generate a UUID, assign it to a bean,
allow the users to create a new record, hit the back button, re-post the same form, identify the same primary key
and update the record instead of adding a new one.
* UUID/GUID are more portable. Example would be to move CMS pages from dev to an active production environment.
Arguably, it is not ideal to move dev data, but it happens all the time, and when moving it to an existing production environment
you would run into conflicting integer keys.
* Generating UUID in ColdFusion is a very performance heavy operation. In high traffic sites, generating many UUIDs per second
may have a vast performance impact.
* MSSQL can easily generate a GUID.
* If you're worry about indexing, MSSQL 2005 introduced a new function NEWSEQUENTIALID(), which will create
sequential GUIDs for better indexing performance.
* In stricter OO programming, you would probably create a bean to hold your data prior to inserting it into the DB
and you would need an ID for this bean. The only way to ensure that it's not in use is to use a UUID
Now that you know all the facts, you should be able to decide what's best for your particular scenario.
Cheers.
You can create this function in MSSQL
CREATE FUNCTION dbo.newUUID(@GUID varchar(36))
RETURNS varchar(35)
AS
BEGIN
RETURN left(@GUID, 23) + right(@GUID,12)
END
And set your default values to: dbo.newUUID(newid())
One that hasn't been mentioned so far (although the Oracle sequencing sounds similar) is manually created
database-wide 'unique' integers. This can give you the best of both integer and UUID worlds:
1) Create a table with a column "nextID" and enter the number 1 in a single row.
2) Create a function getNewID() as UDF or CFC method: it queries and selects the nextID value,
immediately increments it within a cflock/db transaction and returns the ID it selected
3) Whenever you add a row to any table just use call getNewID() as you would createUUID()
Pros:
* The ID is unique across the DB so you can refer to data rows (items/objects) whichever table they're stored in. This is an advantage over UUID if you *want* users to be able to type the IDs (and/or you want to keep them short for URLs), obviously not appropriate for items that are better obscured. Couple this with URL_rewriting and an object type lookup table, and you've got a simple mechanism for locating any page with just a number, regardless of which table the primary key content is stored in or the logic required to generate it.
* Gives you the same control/access as UUID, so you can use these IDs for objects that haven't been
saved to the DB yet. Once generated it won't be created again *within the DB*
* Performance benefit of sequenced integers vs UUIDs
* Portable in the sense that auto-numbering will not be an issue.
* inTRA DB merge-friendly: ie you could potentially move rows from one table to another within the same DB (unlike auto-numbers)
Cons:
* Not intER-DB merge-friendly (ie between different DBs), same as integers in general (and auto-number in particular).
* Non-obscure
* Small one-off manual set up as compared with no setup for UUID.
I've used this successfully in apps for years, but I also use UUIDs and auto-numbers where appropriate.
As Rob says: look at the options, understand the implications in each case and make your choice.