Sunday, July 10, 2005

Tags get no respect

The lowly tag started the folksonomy revolution. One would have thought we'd be setting it up on a pedestal. From a 'buzz' point of view we do. 'Tagosphere this', 'Blogosphere that', ... buzz, buzz, buzzzzzz .....

But look under the covers. Are Tags first class objects? Is there a 'Tag' table along with the Item (photo, url ...) and User tables? Maybe. Maybe not. Some have them, some don't.

Why do we want a whole table for a tag which is just a 'string' attribute of an item?

It's a good question. It's tempting to 'get going' and avoid 'analysis paralysis' by just adding a 'tag' column to a table here and there in the schema. Or better still - a 'tags' (plural) column in which we stuff a 'comma separated' list. This creates a number of issues which are prevalent in contemporary tag applications.

One of the first and least obvious issues that arises out of 'dissing' tags in this way is that users are forced to use ' ' (space) as a separator for tags. Why? Because the comma is already used up in the taglist stuffed in a single column in the dark corner of a table somewhere.

This snowballs into the issue that I can't use a multi-word phrase as a tag. Unless I invent my own personal word separator, say '_'(underscore) or '/' or '+' etc. all of which are just fine from a single user point of view but suboptimal from a social point of view. Multi-word tags are pretty much invisible for social searching as I don't know and don't want to deal with everyone's separator.

People argue that we shouldn't enforce a convention - this is the wild and woolly world of folksonomy after all. But we aren't enforcing a convention - we want to use a 'comma' as a separator. This is already in use in natural language. Except that practitioners have abused the comma for a separator in a list of tags stuffed into a column. That in turn has arisen from not having, or seeing the fundamental need for, a Tag table. So now we are stuck with this highly hobbled form of tagging for no good reason.
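To make the comma point concrete, here is a minimal sketch of what the input side could look like once the comma is free to be the separator. The function name parse_tags and the choice to treat tags that differ only in case as duplicates are my own assumptions for illustration, not a description of any existing application.

```python
def parse_tags(raw: str) -> list:
    """Split user input on commas, keeping multi-word tags intact.

    Assumption for this sketch: tags that differ only in case count as
    duplicates, and the first spelling seen is the one that is kept.
    """
    seen = set()
    tags = []
    for piece in raw.split(","):
        # Trim the ends and collapse internal runs of whitespace.
        tag = " ".join(piece.split())
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            tags.append(tag)
    return tags


print(parse_tags("san francisco, road trip,  photos ,road trip"))
# ['san francisco', 'road trip', 'photos']
```

Each parsed tag would then become its own row rather than part of a packed string, so nobody has to invent a private word separator.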

Folks, if you are creating a folksonomy application from scratch today and you don't have a separate Tag table then you are, by a series of unforeseen consequences, hobbling your users and reducing the ultimate effectiveness of your application.

If 'User centered design' is what it's all about then please, please deal with this issue.

No amount of AJAX and Ruby on Rails (I love both of them by the way) will solve the problem created by a suboptimal Tag data model, blocking users from using commas as separators AND preventing them from using multi-word tags.

(taking a deep breath ..... thinking of a happy place .... palm trees, pineapples, white sands, blue waves ..... relaxing all muscles one by one ..... Ohhhhh Kayyy... there we go ...)

Ok, let us assume for a second that this 'taglist-comma-blocking-forced-to-use-danged-space-separator' issue doesn't exist. There's a whole other set of reasons why a Tag table is needed. As we see in practice, tags have different uses - action/todo, content-type, annotation .... and a single opaque string is already being implicitly overloaded.

Tags also have a 'language of origin' - a tag in a single charset, say 'ISO Latin', may mean different things in different countries in Europe.

Finally, when tags are stuffed into columns as attributes, it's much harder to get an idea of the number of unique tags used. The same string is saved again and again; perhaps the unique occurrences are counted, maybe not. Here we have to create a compensating application structure to 'uniqueify' the tags and count the unique tag occurrences, if we do that at all.

So when a folksonomy application without a Tag table says '1 million tags' these are not 1 million unique tags. The number of unique tags is likely a couple of orders of magnitude smaller. I am curious to know how reports on tag usage, growth etc. are created when there's no way to inherently track unique tags as first class objects in the database. There's probably a lot of summarisation, matching and sorting done in the application layer that's best done by the database. All of this is done on the tag and any tag-lists and tag-substructure that may have been created.
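As a rough sketch of letting the database do this counting, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The table names (tag, tagging), the helper apply_tag and the sample data are hypothetical, chosen only to show the difference between 'tags applied' and 'unique tags'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE tagging (
        user_id INTEGER NOT NULL,
        item_id INTEGER NOT NULL,
        tag_id  INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (user_id, item_id, tag_id)
    );
""")


def apply_tag(user_id, item_id, name):
    """Record that a user tagged an item; the tag string is stored once."""
    conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
    (tag_id,) = conn.execute(
        "SELECT id FROM tag WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO tagging VALUES (?, ?, ?)",
        (user_id, item_id, tag_id),
    )


# Made-up sample data: four taggings that use only two distinct tags.
for user_id, item_id, name in [
    (1, 10, "road trip"),
    (2, 10, "road trip"),
    (2, 11, "road trip"),
    (1, 11, "photos"),
]:
    apply_tag(user_id, item_id, name)

unique_tags = conn.execute("SELECT COUNT(*) FROM tag").fetchone()[0]
applications = conn.execute("SELECT COUNT(*) FROM tagging").fetchone()[0]
per_tag = conn.execute("""
    SELECT t.name, COUNT(*) AS n
    FROM tag t JOIN tagging g ON g.tag_id = t.id
    GROUP BY t.id, t.name
    ORDER BY n DESC, t.name
""").fetchall()

print(unique_tags, applications)  # 2 4
print(per_tag)                    # [('road trip', 3), ('photos', 1)]
```

The '1 million tags' headline figure corresponds to the row count of the tagging table here; the number of unique tags is simply the row count of the tag table.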

Adding fine structure to a string and then creating an API layer to pack and unpack this private structure creates something akin to a 'foreign growth' on the data model. It's a mini-database within a database with its own syntax, logic and query language. It's as if some part of the database has unilaterally declared independence and is setting sail on its own.

It separates data related knowledge into two disjoint areas, one, the 'mini-database + API' which holds 'type' and other information and meaning, and two, the underlying database itself. For large scale applications that are meant to grow fast and furiously, this asymmetric complexity creates an area in the design where 80% of the effort could possibly be needed in future to solve 20% of the problems.

So, after all this verbiage, what am I saying?

Quite simply, tags are first class entities in the folksonomy data model. They have a number of important attributes such as 'type', 'lang', 'count', 'created', 'lastused' etc. From a Data 2.0 approach, a Tag is a slowly changing dimension and Tags should be saved in a Tag dimension table. Aside from avoiding the kinds of problems mentioned, this model allows evolution in the role of the Tag in the folksonomy data model without creating massive ripples in the application structure.
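Here is one way such a Tag table might look, again as a hedged sketch using Python's built-in sqlite3 module. The column names come from the attribute list above; everything else (the user, item and user_item_tag tables, the use_tag helper, the default values and the sample rows) is an assumption made for illustration. The 'gift' example leans on the fact that the same string means 'present' in English and 'poison' in German.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE item (id INTEGER PRIMARY KEY, url  TEXT NOT NULL);
    CREATE TABLE tag (
        id       INTEGER PRIMARY KEY,
        name     TEXT    NOT NULL,
        type     TEXT    NOT NULL DEFAULT 'annotation',
        lang     TEXT    NOT NULL DEFAULT 'en',
        count    INTEGER NOT NULL DEFAULT 0,
        created  TEXT    NOT NULL,
        lastused TEXT    NOT NULL,
        UNIQUE (name, lang)   -- same string, different language: different tag
    );
    -- User, Tag and Item meet symmetrically in one relation.
    CREATE TABLE user_item_tag (
        user_id INTEGER NOT NULL REFERENCES user(id),
        item_id INTEGER NOT NULL REFERENCES item(id),
        tag_id  INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (user_id, item_id, tag_id)
    );
""")


def use_tag(user_id, item_id, name, lang="en", type_="annotation"):
    """Attach a tag to an item for a user, creating the tag row if needed."""
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        "SELECT id FROM tag WHERE name = ? AND lang = ?", (name, lang)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO tag (name, type, lang, count, created, lastused) "
            "VALUES (?, ?, ?, 0, ?, ?)",
            (name, type_, lang, now, now),
        )
        tag_id = cur.lastrowid
    else:
        tag_id = row[0]
    cur = conn.execute(
        "INSERT OR IGNORE INTO user_item_tag VALUES (?, ?, ?)",
        (user_id, item_id, tag_id),
    )
    if cur.rowcount == 1:  # only count taggings that were actually new
        conn.execute(
            "UPDATE tag SET count = count + 1, lastused = ? WHERE id = ?",
            (now, tag_id),
        )
    return tag_id


conn.executemany("INSERT INTO user VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [(10, "http://example.com/a"), (11, "http://example.com/b")],
)

use_tag(1, 10, "gift", lang="en")
use_tag(2, 11, "gift", lang="en")
use_tag(2, 10, "gift", lang="de")
use_tag(2, 10, "gift", lang="de")  # repeat tagging, ignored

tag_rows = conn.execute(
    "SELECT name, lang, count FROM tag ORDER BY lang"
).fetchall()
users_with_tag = [r[0] for r in conn.execute("""
    SELECT DISTINCT u.name
    FROM user u
    JOIN user_item_tag x ON x.user_id = u.id
    JOIN tag t ON t.id = x.tag_id
    WHERE t.name = ? AND t.lang = ?
    ORDER BY u.name
""", ("gift", "en"))]

print(tag_rows)        # [('gift', 'de', 1), ('gift', 'en', 2)]
print(users_with_tag)  # ['alice', 'bob']
```

Adding another attribute later means adding a column to the tag table, not inventing more syntax inside a packed string.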

It's best to give tags the respect they deserve and put them in a table by themselves. The logical symmetry between User, Tag, Item IMO should be reflected in a symmetric underlying data model. If not, we have a fundamental impedance mismatch between the data and the application logic. And the mismatch is usually created by disrespecting the thing that started it all - the tag.

8 Comments:

Blogger Philipp Keller said...

Hmm.. interesting!
You are right in saying that "tags get no respect". I think that a lot of people are in a "waiting" position. Jonathan Snook for instance has chosen the denormalized approach.
I think this is because he just wants to search by tag. And this is ok with 10k entries.. I think in a year or so, when the "tagging scene" has built some cool tag-tools (such as clustering for instance! :-), people will start to normalize.

But you are right, they should do it now..

6:02 AM  
Blogger Kevin said...

Well *that* was a long post :)

Do you feel that there's a big opposition to using a dedicated tag table? I don't think so.

1. It allows you to have a TAG_ID on columns that need tags. This way you can get a really efficient integer index (not a string index) which consumes a lot less memory and disk.

2. You can have extra metadata such as language.

3. You can avoid table bloat. It would suck to have multiple strings stored in the table.

6:39 PM  
Blogger Fidel said...

Wouldn't putting tags in their own table just be common sense? Personally I think a convention is needed. Commas here, spaces there. It makes users confused, and in the end makes more work when a user has to go over their cloud and get rid of erroneous tags like "this,that,other" when what was needed was this, that, and the other. See!? We already use commas as list separators anyway. Spaces are fine as word separators, but make horrible tag separators.

4:41 PM  
Anonymous Erik Haugo said...

Nitin-

It's four months after you made this post...

Are there now any examples of bookmark databases or open source products that incorporate the well-defined tag databases you described in your post?

Have you seen any innovations resulting from them?

I've been interested in how the use of tags might evolve in a wiki environment. I'd like to see the wiki "application platforms" like Jotspot, Socialtext, Backpackit or Confluence incorporate the efficient, robust tag databases you describe.

Have you seen any wikis that work with separate tag databases?

Erik.Haugo (at Gmail)

11:51 PM  
Blogger Vivek Krishna said...

Nice post. I tried doing a comma-separated string kind of thing for user tags, but then how do you efficiently compute 'all users having tag t'? I didn't use a string but used a tag_id .. but even then this means that you just have to examine each user row to figure out this set.

2:15 AM  
Blogger Vivek Krishna said...

I guess I reached a similar conclusion, so I just had to use a tag table to get more useful relations between tags.

2:16 AM  
Anonymous Anonymous said...

Interesting post, but I think these problems are a side effect of the use of relational databases.

Ideally, you would like to be able to tag anything. Bookmarks are a natural choice, but if you have for example a list of companies, it would be cool to be able to tag them too.

But that would mean that you have to create a join table for every table you might want to tag, which is IMHO not very practical. (You also have to use different SQL queries to extract the objects, which is actually the same operation for every table).

I am experimenting with RDF, and have found that these problems totally vanish. You don't have to create any tables, the only things that exist are triples (subject, predicate, object). So to tag a book you only have to add the triple ({book}, 'has_tag', {tag}). It doesn't matter if you tag a book or a company, the structure is the same.

This way it is also easy to use tags with spaces. (Not that I think that would be a great idea...)

9:58 AM  
Anonymous Anonymous said...

Kevin -- a 4 byte integer to a string is not that much more efficient. Your typical tag is going to be about 6 bytes plus a length byte (or null terminator). If you have a tag table without a separate compressed text table, then you will store 8 bytes for each record to reference the tag. If you have a tag table with an index to compressed strings, you will have 4 bytes per entry, plus a record per string and at least one index record per string. Say the string table is really compact and you need 32 bytes for the record and its index. Then if the string is used once, you use 36 bytes versus 8 bytes. If the string is used twice, you use 40 bytes versus 16 bytes. So, with string compression you use 32+4N bytes and without string compression you use 8N bytes. 8N > 32+4N when 8N-4N > 32 when 4N > 32 when N > 8.

So, you'll start to save space when your average tag is used more than 8 times. Eventually, for that particular column, you can save half of your disk storage. But that column is a relatively small part of a larger record, so "a lot less disk space" isn't necessarily true.

9:27 PM  
