Monday, June 06, 2005

Slicing and Dicing Data 2.0 (Part 2)

Foundation Data Model for Folksonomy Navigation.


Underneath the trinity of User, Tag, Item is a rich set of relationships that, together with the trinity, form the foundation data model of Data 2.0. Before we get into the experimental data and numbers in future posts, let's explore these relationships and the implications for data architecture.

Users have Items ( URL's, photo's, chunks of text, ... ) but a User could exist in the database without any Items.
On the other hand an Item will end up in the database only if associated with at least one User ( the one who tagged it in the first place). A User then has maybe-none-and-possibly-many Items and an Item has a-least-1-and-possibly-many Users who tagged it.

Similarly, an Item can exist with no Tags but a Tag exists in the database only when attached to some Item. Finally, a User may have tagged nothing so has no Tags but a Tag may exist in the database only if created by some User.

Putting all these together we have a 3-cycle of relationships

Users (1-many) <---------> (0-Many) Items
Tags (0-Many) <----------> (1-Many) Items
Tags (0-many) <---------> (1-Many) Users

Using a diagram, instead of overloaded ASCII, we see some useful symmetries (more on the symmetries later).



ER diagram of many-many relationships between User, Tag and Item entities

We see there are many-to-many relationships between between each pair of entities. So far we haven't created any tables in a database - we are dealing only at the level of the Logical Model. To create a database from this we need to deal with the many-to-many relationships which cannot be represented efficiently in relational database tables without some refactoring.

The standard idiom for representing or "resolving" a many-to-many relationship in a relational database is the association table which associates the key columns for the two entities in a single row.

For example we want to record the Items a User has saved so far we and we also want to record all the Users that have saved the Item, in an efficient way. An association table, typically called UserItems, would be created to resolve the many-to-many relationships between Users and Items would be created. It contains rows containing (UserId, Itemid).

Similarly we can create ItemTags to see the relationship between Tags and Items and UserTags for Users and Tags.

We now have the following dual trinity of primary tables and association tables.

ER diagram of pairwise associations between User, Tag and Item entitites

So we've replaced each many to many relationship with an association table, that allows us to slice the data by each pair of dimensions.

Finally ( from our discussion of the previous post "Slicing and Dicing Data 2.0 (Part 1)" ) we add the fact table for Users, Tags and Items which is a three way association table which we represent so

ER diagram of three-way association between User, Tag and Item entitites

At this point you might be shaking your head saying "Cute pictures but what has this got to do with me, or anything for that matter ?"

If you're interested in scalable architectures for folksonomy applications then let me just whet your appetite by saying that the two way association tables are used for answering questions about pairs of entities e.g. "What are all the Items for this User?" Or "what are all the Tags for this Item?" And this will scale to millions of Users better than any other schema.

Or using a second order (depth 2) query - given a User and all that User's Items, who are all the other Users who have saved the same Items? Now before the ho-hums drown us, notice that this can all be done with simple SQL SELECT's to a single table. When these tables have say, 25 million or 250 million rows, not having SQL table joins matters.

Ok, so you're a wiz and you've known that all along. Then consider the symmetry in this diagram - where Users, Tags and Items have a co-equal relationship with each other. It means that if you use this data model, then at the core you can write a small(er) number of SQL queries that you can re-use across the application by replacing one pair of tables with another. And when you write a query to slice by one particular pair of dimensions using one particular set of restrictions in the "WHERE" clause, you can quickly do the same across the other two pairs of dimensions. This means using this data model isn't just good for data organization, it's also good for the organization of your application code.

Consider the fact that most folksonomy apps today don't deal with Tags at the same level as Users and Items and you see the asymmetry there and the potential power of exploiting symmetry.

Answering questions about Tags requires a whole different set of acrobatics in application code when the Tags are stored as a comma separated list in a column in a table. Very difficult to slice by that dimension when it's hidden away. Also, doesn't allow you to use the power of the database. And, any optimizations you do for scaling in the User and Items dimensions won't apply to Tags or worse still may pull in the opposite direction from optimizations for the Tag dimension.
So there's a lot of goodies in these pretty diagrams, and we haven't even scratched the surface yet.

Finally, when we combine all these into one diagram we have the central Mandala of Data 2.0 worth meditating on, until we reach FolkSatori. Or is it TagZen ?

Combined ER diagram of three-way association between User, Tag and Item entitites and pairwise two way associations.

12 Comments:

Blogger Jeremy Dunck said...

"Answering questions about Tags requires a whole different set of acrobatics in application code when the Tags are stored as a comma separated list in a column in a table."

Well, yes, but what tagging software is actually guilty of that crime?

Strawman?

11:46 PM  
Blogger Nitin said...

Hi Jeremy,

Comma-separated list of tags are not uncommon. Cannot mention names as some discussions took place in confidence. The real question to ask is "do you have a 'tag' table ? And the answer is often 'no' for reasons of denormalization. That approach to denormalization is likley to prove problematic for many reasons ......

Nitin

7:30 PM  
Blogger Nitin said...

Jeremy,

Take a look at Philipp Keller's site especially this collection of tagschemas.

You'll see what I mean.

Nitin

7:48 PM  
Anonymous Anonymous said...

Can you maybe post some CREATE TABLE for this setup? I don't see the revolutionary new two-common-tag-intersection-killer.

7:51 AM  
Anonymous kellan said...

Not sure if I'm reading the diagram correctly but are you proposing having users<->tags and items<->tags tables as well as the users<->tags<->items table?

If so, is it for any other reason then as an optimization at read time (at the cost of more inserts for every tagging operation) Maybe that is what you're trying to get at in the paragraph about 25mill records, but it isn't clear.

And if so, shouldn't you call this out explicitly?

9:13 AM  
Anonymous Piet Delport said...

I don't mean to quench the obvious enthusiasm that this post was written with, but aren't the "usertag", "useritem", and "itemtag" tables just crude, slow emulations of equivalent indexes on the central ("usertagitem") table? It should be possible to implement all the queries you mention (and more), faster and more efficiently by defining the appropriate indexes directly on the central table.

Also, isn't the "trinity" missing the time dimension? What about queries like "all sites tagged by some user last month"? This is something that separate association tables can't really help with, but that the central table (and its indexes), can easily and efficiently be extended with.

Something like this is recommended reading if you haven't worked with SQL indexing much, and want to learn what it's really capable of.

9:55 AM  
Blogger Nitin said...

Hi Piet,

Thanks for the comments. No "squelching" experienced :-). This is in response to your comment and the previous one from Kellan.

Yes theoretically the two-way association tables are redundant, once the three way association table is in place. However, in practice, the three-way association table will be a choke-point due to contention.
Read queries that involve only two out of the three entities say users and items should be offloaded to the user_item table, which would also be indexed and isn't a slow emulation of anything.

The whole principle behind using data warehousing techniques is "trade space for time". So, yes there are redundant tables, but they are there for a purpose.

Now about the "time" dimension, my feeling is that a separate set of facts relating to physical dimensions should be created so we would have timestamp, geocode as attributes (columns) but I am not sure we want to add them to the user-item-tag table always. A subset of applications consider physical data as fundamental and for them these are important. But these are not specific to relations in folksonomy, they are of interest in many non-folksonomy applications as well. For instance timestamps are probably meaningful in del.icio.us URL items, but most user interactions/navigation is via "user, item, tag" dimensions.

So my current thinking is that I don't consider physical co-ordinates in the folksonomy "trinity".

From your reference link I gather you are a Postgres aficionado. You might want to take a look at Greenplum They are doing some very interesting things with MPP extensions to Postgres. They are a past and current client of mine.

Thanks again for stopping by and taking the time.

1:38 PM  
Anonymous Piet Delport said...

How can contention be a problem? A two-column index takes load off the central three-way table in exactly the same way that a separate two-way table does.

Explained another way: any index you can define on the separate two-way tables can just as easily be defined directly on the central table, and they reduce contention to the central table in exactly the same way.

In fact, the only effect that the separate two-way tables have on the system, at all, is slightly reduced performance (due to the increased data redundancy and indirection), decreased flexibility, and an over-complicated update procedure (you have to keep the tables in sync yourself).

If you still don't believe me, try it yourself: replace the two-way tables with appropriate two-column indexes, and run benchmarks. You should find that it runs just as fast, or faster, and scales under heavy load just as well, or better.

But the real advantage, of course, isn't just speed, but increased application simplicity, and greater flexibility in the queries you can perform efficiently.

8:59 AM  
Blogger Nitin said...

Hi Piet,

Thanks again for your comments.

The way I see it contention is reduced in the following way -

Queries that access only two of (user,item,tag) will be coded, in the application, to use the corresponding two column table. So fewer disk accesses for *data* on the three column table.

Only queries that use all three of (user,item,tag) are coded to access the three column table. Far fewer disk hits on the three column table.

Doesn't this reduce contention on the three column table?

It is not immediately clear to me how a two column index on the same table helps to reduce contention - eventually all queries have to hit one table get the data, yes? Maybe I am missing something.

Disk accesses will be serialized for the same table - doesn't matter how many indexes you have - assuming of course that the tables are all too large to be stored in memory all at once - in that case these techniques are superfluous.

It's not the separate index that is reducing contention, its the separate data in the separate tables, spreading out disk accesses to separate tables.

Eventually, the two column tables + the three column table should be put on separate disks and controllers as they grow larger.

That's when you can see the major benefit.

12:04 PM  
Anonymous Anonymous said...

Why would a query touch the redundant tables AT ALL? It doesn't make sense.

Queries should use an index, not a table, there is no information in the two-way table that is not in the index, regardless if it's built from a two-way or three-way table.

The only way your optimization would make sense is if the RDBMS actually did the equivalent of "SELECT item_id, user_id from item_user WHERE item_id = 2 AND user_id = 6".

Why would disk access be serialized on a table? (i assume you mean in the RDBMS, the OS kernel will serialize everything no matter what you do).

This is all assuming everything is on the same disk of course. Your solution might make sense as a way of achieving concurrency on the disk level, assuming the RDBM's can put different tables on different volumes, but not do the same with indices.

RAID-1 sound like a better solution in that case, since that allows you to make concurrent queries on the same index.

4:56 AM  
Anonymous pitibalrog said...

Question for people who want just the three-way table:

The user<->item table is important because you can't see an association between an item and a user without tag in the three-way table.

For example, you can bookmark an url without choosing tags. How do you handle this if you just have the three-way table?

4:49 PM  
Blogger namtupdj said...

I will assume that usertag, useritem, usertagitem, and itemtag all being association tables contain only the userid, tagid, and itemid attributes, and that those attributes are arbitrary unique keys (say a sequence).

Then, the ultra-normal form ("Mandala of Data 2.0") SHOULD be compressible in one of two ways (and still "model" normalized information).

(1) Some vendor will REAL set / relational completeness chops providing a mechanism to create a partial index on usertagitem such that the restriction creating the index is for DISTINCT (or just "FIRST") pairs of, for instance, useritem (after all, a partial index IS just another set that should be createable from a relational restriction!). THEN, that index could take the place of the useritem table, AND one would have one less table to cross-maintain (not to mention storage savings on HUGE folksonomy tables). Yes, the partial index would be "worthless" to access (more than one; again perhaps the "first") rows in usertagitem; BUT it could act as an index-only table for useritem (here, I am assuming one also has a usertagitem index to afford complete row access / coverage). In general, partial indices in this way could allow such "compression" while still maintaining the normalized view when one considers both the usertagitem table and such indices.

(2) Since in bookmarking a user assigns some tag(s) to an item all at once, one should be able to have usertagitem as the ONLY association table by creating all relevant rows for usertagitem at once. Even without a partial index facility, one should be able to create non-unique indices that will serve as the usertag, useritem, and itemtag tables. Queries seeking something like - "show me all tags for a user" SHOULD do an index only access to do so (given a "smart" optimizer). Yes, there could very well be duplicates. A DISTINCT query would fix that ... or messier, a cursor delivering only one row. Better still, a REALLY smart optimizer might "know" via a DISTINCT query that only the first pair of usertag for a particular tag query needs to be returned. IF one knows or anticipates that queries like "show me all the items for some user(s) with some tag(s)" will be needed, then this solution is probably better. Of course, the normalized view of each two-item association in this solution is now dependent on (can only be constructed / exposed via) query restrictions.

In either of these solutions because of (assuming) separate index only access, there is no more of a choke-point than there is in also having the two item association tables and their indices (modern optimizers should "cope" with that - choose the right index and access ONLY that index to satisfy the query). Moreover, there need be no restriction (constraint) that prevents a tag from being NULL if a user choses to not assign one to an item. Indeed, it could be useful for analysis (even by the user?) to be able to find out which of items have NOT been assigned a tag, which should be easier with the three item association table!

3:44 PM  

Post a Comment

<< Home