To AutoNumber Or Not To AutoNumber

Posted By : todd sharp Posted At : February 12, 2007 7:45 AM Posted In: SQL, Project Learn

29

I'm in the midst of building my database structure for 'Project Learn' (see this post if you're not sure what that's about) and have come across the age old question: should I use auto numbered integers for my primary keys or should I create the id using createUUID() in CF and pass it in? I seem to go back and forth on this topic so I'll put my pro's and con's out here and leave it up to my readers for further discussion.

Pros to Auto Number:

1. Easy. Inserts will always handle the creation of the auto numbered id field.

2. Integers are more easily indexed by SQL server (performance related).

Cons to Auto Number:

1. Less control over data. I know this is kinda a 'control freak' thing, but I like to have control over the creation of the primary key so that I can pass that key back to my caller (or another function when cascading inserts such as a one to many type insert).

2. Related to Con #1 - have to rely on 'select @@identity' to return the most recently created id - which can be unreliable in a high traffic db. This topic has been debated many times and will likely continue to be. There are other ways to retrieve the id I believe - but I don't know much on this topic (anyone?).

Pros to createUUID():

1. As I stated above, I control the creation of the id and have no question that it is unique and the correct key.

Cons to createUUID():

1. No built in datatype in SQL for CF's UUID. Rob Gonda recently posted a way to create such a datatype so that you can validate this type so there's a workaround available for this con. 2. Non-integers are less easily indexed by SQL server.

So that's my dilemma. I'm also considering a mix of auto numbered id's and UUID pk's (using the UUID's where I need the granular control over the id number and the auto number where that's less important). I'd like to hear everyones opinion on this one. I'm really hoping that this series will be interactive and that as many people as possible feel comfortable joining in the learning and discussions. Feel free to tell me your thoughts.

Related Blog Entries

Comments (29)

Steve Bryant's Gravatar I prefer AutoNumber overall. Most databases handle this well (Oracle uses an odd approach for it though). Each database provides a pretty good way of getting it back as well.

SQL Server:
SELECT   IDENT_CURRENT ('mytablename')

When I have to share data with another database, however, the advantages of UUID or GUID are clear over AutoNumber.

I also have seen a combination approach work well. Instead of having some tables as AutoNumber and others as UUID (which I have done, but meant sometimes I had to look to see which a table was), make them all have both.

You can have the primary key field an AutoNumber, but have another store a UUID. You can still use the UUID field to identify the record. I didn't think up this approach, but it certainly seems to have some advantages.

Rich's Gravatar I think another pro for UUID's is they prevent users from "hacking" your URLs....for example, trying to get another company's data by changing the company_id on the URL. Of course, you should take pains to block this anyway, but UUIDs provide a nice level of obfuscation.

The flip side is that it's more difficult to do quick record lookups in your isql query tools.

Gus's Gravatar I use both, a UUID or GUID that can be passed via a browser, and an autonumber int that can be used internal to the application.

It creates a little more data, but unless the dataset is very large, the impact is negligible.

Gus

todd sharp's Gravatar So Gus/Steve - you would recommend using _both_ for a single row? I never thought of that. I was thinking more specific to the table - for example a lookup table that would map to user data I would use auto number - but the user table would use a UUID. Thoughts?

Rob Pilic's Gravatar If you are using MS SQL Server you can also use SELECT SCOPE_IDENTITY() to retrieve the last identity value created in the current "scope", i.e. stored procedure or batch. That said, the more I deal with identity columns the less I like them. If you think you will *ever* need to merge your data into another db, do yourself a favor and do not use identities.

You can still use integers without using an identity by using SELECT MAX(id) and using this value in your insert. However you may need to bump up your isolation level - I think to repeatable read (someone correct me if I'm wrong) - to prevent another transaction from committing an insert at the same time, therefore causing a duplicate PK violation.

Gus's Gravatar You won't gain anything by having a lookup table in this case as it is always a 1 to 1 relationship. I just store the UUID as part of the record.

Scott Stroz's Gravatar I prefer to use SCOPE_IDENTITY() in SQL Server to return the most recently inserted ID.

Steve Bryant's Gravatar I typically just stick with AutoNumber personally, but if you want to gain the level of control that you mention then, yes, I think having both fields in one table offers a good advantage.

I don't think you need a look-up table. As Gus says, no benefit for a 1-1 relationship. It also makes it really easy to select by either value and the insert is really easy because you insert the UUID and let the database handle the AutoNumber automatically.

As to the choice of SCOPE_IDENTITY or IDENT_CURRENT, SCOPE_IDENTITY will return the last AutoNumber for any table in the same scope. IDENT_CURRENT will return the last identity for the specified table. To me, this makes it a safer choice, especially if you combine it with locking for the target table.

tony petruzzi's Gravatar http://p2p.wrox.com/topic.asp?TOPIC_ID=7447
http://www.sqlteam.com/item.asp?ItemID=283

:)

Personally I use GUIDs. Years ago on slow hard drives and using SQL Server 6.0, GUID could really slow down a database. Now serial ATA drives, 4 GHZ processors and SQL Server 200 or 2005, there really is no difference.

My only recommendation to you is that you setup a job to tune you indexes once a week to maintain the performance of your database. I could bet that a lot of the people that say using guids are slow is because they have never tuned their indexes before in their lives.

In conclusion, although I've been doing DBA work for the last 5 years and I'm sure everyone who's commenting here has had a lot of experience with databases, I would take the advice you get from these comments with a grain of salt. I would go ask a Microsoft Certified DBA with years of experience what their professional opinion is.

tony petruzzi's Gravatar @steve,

Please don't take offense to this, but your logic behind having IDENT_CURRENT being a safer choice than SCOPE_IDENTITY is wrong. I would go read the BOLs about each choice since you have been misinformed.

You are correct that IDENT_CURRENT return the last identity insert for a particular table, but it is for all sessions. So if you have 3 sessions actively inserting into the table, you could get another sessions identity. While locking will prevent this, this will also have a huge impact on performance if there is a lot of inserts into the table. There is no row locking in SQL Server like in Oracle, so locking an entire table is pretty expensive. This is the reason why Microsoft created the SCOPE_IDENTITY function.

Dan G. Switzer, II's Gravatar @Todd:

I like the use of both as well. I use the autonumber internally for defining relationship, and then use the UUID as the lookup key. This way you only need to expose the UUID to your web pages and internally use the autonumber key in your queries.

Sammy Larbi's Gravatar I'm like you Todd - I've gone back and forth on the issue. However, I think I've finally settled on integer autonumber. I just find it much easier to work with. I don't care about url hacking because in any case that I should care about it, I cover it with code. Also, in the case that I might need to merge databases (which happens from time to time), I don't find it too much of a pain to deal with.

I think it might be considered bad design for a DB, but it certainly helps keep my code design easier to manage (of course, I think the same could be said both ways).

todd sharp's Gravatar Dan - _THATS_ what i'm talking about ;) - anyone else see a prob with that?

todd sharp's Gravatar Actually maybe it's not - I may have misread.

What I'm thinking is to use auto number for less manipulated tables - like a table of gender (bad example but you get the point). record 1 would be male and record 2 would be female.

then in the users table the primary key would be a uuid since it is likely to be created via a web form - and i like the control of creating my id when creating via the web. No autonumber for the user table, but the gender value would be stored in the user table.

As I said it's a poor example but I think it gets the point across. Autonumber (only) for some tables and UUID (only) for other tables.

Issues?

Wayne Graham's Gravatar I tend toward the UUID, but it really comes from the fact that I had a CS Professor who really pressed the fact that data types in themselves are important pieces of metadata. Using numeric data types as primary keys suggests that you can perform meaningful arithmetic operations on them, but adding two primary keys together really doesn't tell you anything (well, in most cases it doesn't).

Peter Bell's Gravatar One other consideration. If you're ever going to move to a versioning system, an autoID will be defining a version rather than a content item so you still need a separate ID or UUID for distinct content items.

Steve Bryant's Gravatar Tony,

No offense taken and I am happy to change what I am doing if it is wrong, but I did look at the BOL.

Specifically,

http://msdn2.microsoft.com/en-us/library/ms190315....
http://msdn2.microsoft.com/en-us/library/ms175098....

Do you have another reference that suggests the use of SCOPE_IDENTITY over IDENT_CURRENT?

My concern may rest from a need to return the value from a separate cfquery call.

Todd,

I have used the approach that you are suggesting before. It worked pretty well, but I had several tables for which arguments could be made for using identity or uuid, so I sometimes have to look to see which I did.

Doug's Gravatar Steve,
The first link sums it up nicely. Specifically:
"IDENT_CURRENT is not limited by scope and session; it is limited to a specified table. IDENT_CURRENT returns the value generated for a specific table in any session and any scope."

And

"SCOPE_IDENTITY returns values inserted only within the current scope"

Essentially it boils down to this:
IDENT_CURRENT is potentially prone to a race condition where it returns the value from an an insert to the table form a session that isn't yours.

@@IDENTITY is problematic primarily if you are using triggers, where your insert may trigger an second insert in which case @@IDENTITY actually returns the value from the trggered insert, not the initial insert.

SCOPE_IDENTITY is the most specific - it returns the IDENTITY value for the specific table, scope and session from which it is called, and so is generally the safest.

In a nutshell, the wider the scope of where the value may come from, the lower your confidence is that you have the "right" value.

In no case should these be called from separate queries from the one doing the insert - it's much harder to be sure of a predictable result.

Overall, I MUCH prefer Oracle's sequence method for generating ID's, since you essentially grab the value first (at which point the value becomes unavailable for any other inserts), and then apply it to subsequent INSERTS as needed, it's much safer, and more clear what value is being assigned where.

I'm a bit wary of UUID's, partially because I can't imagine they index as efficiently as integer values, and also because I often find it useful to be able to look at a table at get an idea of the order which data was inserted.

Steve Bryant's Gravatar Doug,

This sentence is incorrect: "SCOPE_IDENTITY is the most specific - it returns the IDENTITY value for the specific table, scope and session from which it is called, and so is generally the safest.".

That is the comment misconception about SCOPE_IDENTITY. To quote the BOL (emphasis is mine):

"SCOPE_IDENTITY returns the last identity value generated for *any table* in the current session and the current scope."

To me, that is the essential weakness of that choice. In my systems, ColdFusion is the only user accessing the database and so IDENT_CURRENT seems the safest choice.

Doug's Gravatar You are correct - scope_identity is not limited to the current table.

However, by being scope limited, scope_identity is still more narrowly targeted than ident_current.

Here's the definition of scope:
"The scope concept is important and is defined as a module of work contained within a stored procedure, trigger, function, or batch".

This essentially limits it to the cfquery you are running right then.

ident_current, while limited to the table you define, doesn't limit it to the scope, therefore if a number of users of your CF app hit the same INSERT query simultaneously, they are all hitting the same table, but in different scopes. You can conceivably have a race condition and get unpredictable results.

The issue isn't whether CF is the only "user" of your database, but if there is only one user of your CF app at any given time. If the latter is the case, then you're fine with how you are doing it, otherwise, you might want to reconsider using scope_identity instead.

Steve Bryant's Gravatar Doug,

Of course I have more than one user.

"This essentially limits it to the cfquery you are running right then. " - not true. I just tested it.

I just ran an insert query and then selected SCOPE_IDENTITY() in a separate cfquery tag and it got the value despite having been in another cfquery block.

To test further, I ran the insert query on one page and then ran the query with SCOPE_IDENTITY() on another page and it still got the correct value. That means that the scope extends past one cfquery - making it (in my opinion) less specific than IDENT_CURRENT().

Again, I really am willing to change but my evidence so far still suggests to me that IDENT_CURRENT() is the safer choice (especially since I can lock a table to ensure a correct result).

Roland Collins's Gravatar The behavior you're seeing with SCOPE_IDENTITY is because CF caches connections.

The safest way to deal with it is to encapsulate your inserts in stored procedures, where SCOPE_IDENTITY will always be properly limited to the currently executing procedure.

Steve Bryant's Gravatar Right. Which is why I mentioned ColdFusion as one user earlier.

Incidentally, refreshing the SCOPE_IDENTITY page still return the correct result several minutes later (this on my local systems so no other inserts have happened since).

This still suggests to me that IDENT_CURRENT is the safest approach if one doesn't want to use stored procs for every insert.

Johan's Gravatar I prefer autonumber since it is simple and works well with maintaining realtionships. I also double up with a UUID which allows me to uniquely identify a record across all tables (for example I might be listing groups and users and want to highlight the most recently edited group, if there is a group with the same autonumber value as a user I could not uniquely identify the group vs the user).

One place where autonumber is not viable is with database replication. Typical scenario would be multiple regional databases that handle local publishing and then need to sync via replication.

Rob Gonda's Gravatar The globally accepted answer is: it depends. Like for any other variable in the programming world, there is no universal solution.
You may want to add a 3rd option, which is let the SQL server generate the UUID/GUID for you.
Facts:
* It's easier to index integers than UUID/GUID
* You cannot use integers if you're looking to scale.
You can have a standby/failover system replicating tables with integer primary keys
You cannot have multiple active clusters if you don't use Universal/Globally unique identifiers
* It's easier to guess an integer than a UUID/GUID
* If you're looking to support multiple databases, you probably do not want to rely on database generated UUID/GUID
* Using UUID gives the application (cf) more control. You can generate a UUID, assign it to a bean,
allow the users to create a new record, hit the back button, re-post the same form, identify the same primary key
and update the record instead of adding a new one.
* UUID/GUID are more portable. Example would be to move CMS pages from dev to an active production environment.
Arguably, it is not ideal to move dev data, but it happens all the time, and when moving it to an existing production environment
you would run into conflicting integer keys.
* Generating UUID in ColdFusion is a very performance heavy operation. In high traffic sites, generating many UUIDs per second
may have a vast performance impact.
* MSSQL can easily generate a GUID.
* If you're worry about indexing, MSSQL 2005 introduced a new function NEWSEQUENTIALID(), which will create
sequential GUIDs for better indexing performance.
* In stricter OO programming, you would probably create a bean to hold your data prior to inserting it into the DB
and you would need an ID for this bean. The only way to ensure that it's not in use is to use a UUID

Now that you know all the facts, you should be able to decide what's best for your particular scenario.

Cheers.

todd sharp's Gravatar Beautifully outlined Rob - and in fact you bring up some points that are perfectly valid in my scenario. I think is this case I'm going to go with UUID's - since the particular application is in fact a simple CMS. I think the performance hits taken by using UUIDs will be minimal in this application but the benefits/requirements of OO dictate the use of UUIDs.

todd sharp's Gravatar Of course, the next question is - without creating a simple tool in CF (which wouldn't be all that bad), how would one generate UUIDs for default table entries that need to be populated before the front end is built?

Rob Gonda's Gravatar Allaire/Macromedia/Adobe did not invent the UUID. Most databases are perfectly capable of creating one, and in MSSQL's case, a similar one.
You can create this function in MSSQL
CREATE FUNCTION dbo.newUUID(@GUID varchar(36))
RETURNS varchar(35)
AS
BEGIN
RETURN left(@GUID, 23) + right(@GUID,12)
END

And set your default values to: dbo.newUUID(newid())

Julian Halliwell's Gravatar I agree with the general point that the best ID-generating method will depend on the context.

One that hasn't been mentioned so far (although the Oracle sequencing sounds similar) is manually created

database-wide 'unique' integers. This can give you the best of both integer and UUID worlds:

1) Create a table with a column "nextID" and enter the number 1 in a single row.
2) Create a function getNewID() as UDF or CFC method: it queries and selects the nextID value,

immediately increments it within a cflock/db transaction and returns the ID it selected
3) Whenever you add a row to any table just use call getNewID() as you would createUUID()

Pros:
* The ID is unique across the DB so you can refer to data rows (items/objects) whichever table they're stored in. This is an advantage over UUID if you *want* users to be able to type the IDs (and/or you want to keep them short for URLs), obviously not appropriate for items that are better obscured. Couple this with URL_rewriting and an object type lookup table, and you've got a simple mechanism for locating any page with just a number, regardless of which table the primary key content is stored in or the logic required to generate it.

* Gives you the same control/access as UUID, so you can use these IDs for objects that haven't been

saved to the DB yet. Once generated it won't be created again *within the DB*

* Performance benefit of sequenced integers vs UUIDs

* Portable in the sense that auto-numbering will not be an issue.

* inTRA DB merge-friendly: ie you could potentially move rows from one table to another within the same DB (unlike auto-numbers)

Cons:
* Not intER-DB merge-friendly (ie between different DBs), same as integers in general (and auto-number in particular).
* Non-obscure
* Small one-off manual set up as compared with no setup for UUID.

I've used this successfully in apps for years, but I also use UUIDs and auto-numbers where appropriate.

As Rob says: look at the options, understand the implications in each case and make your choice.