Slicing and Dicing Data 2.0 (Part 2)
Foundation Data Model for Folksonomy Navigation.
Underneath the trinity of User, Tag, Item is a rich set of relationships that, together with the trinity, form the foundation data model of Data 2.0. Before we get into the experimental data and numbers in future posts, let's explore these relationships and the implications for data architecture.
Users have Items ( URL's, photo's, chunks of text, ... ) but a User could exist in the database without any Items.
On the other hand an Item will end up in the database only if associated with at least one User ( the one who tagged it in the first place). A User then has maybe-none-and-possibly-many Items and an Item has a-least-1-and-possibly-many Users who tagged it.
Similarly, an Item can exist with no Tags but a Tag exists in the database only when attached to some Item. Finally, a User may have tagged nothing so has no Tags but a Tag may exist in the database only if created by some User.
Putting all these together we have a 3-cycle of relationships
Users (1-many) <---------> (0-Many) Items
Tags (0-Many) <----------> (1-Many) Items
Tags (0-many) <---------> (1-Many) Users
Using a diagram, instead of overloaded ASCII, we see some useful symmetries (more on the symmetries later).
We see there are many-to-many relationships between between each pair of entities. So far we haven't created any tables in a database - we are dealing only at the level of the Logical Model. To create a database from this we need to deal with the many-to-many relationships which cannot be represented efficiently in relational database tables without some refactoring.
The standard idiom for representing or "resolving" a many-to-many relationship in a relational database is the association table which associates the key columns for the two entities in a single row.
For example we want to record the Items a User has saved so far we and we also want to record all the Users that have saved the Item, in an efficient way. An association table, typically called UserItems, would be created to resolve the many-to-many relationships between Users and Items would be created. It contains rows containing (UserId, Itemid).
Similarly we can create ItemTags to see the relationship between Tags and Items and UserTags for Users and Tags.
We now have the following dual trinity of primary tables and association tables.
So we've replaced each many to many relationship with an association table, that allows us to slice the data by each pair of dimensions.
Finally ( from our discussion of the previous post "Slicing and Dicing Data 2.0 (Part 1)" ) we add the fact table for Users, Tags and Items which is a three way association table which we represent so
At this point you might be shaking your head saying "Cute pictures but what has this got to do with me, or anything for that matter ?"
If you're interested in scalable architectures for folksonomy applications then let me just whet your appetite by saying that the two way association tables are used for answering questions about pairs of entities e.g. "What are all the Items for this User?" Or "what are all the Tags for this Item?" And this will scale to millions of Users better than any other schema.
Or using a second order (depth 2) query - given a User and all that User's Items, who are all the other Users who have saved the same Items? Now before the ho-hums drown us, notice that this can all be done with simple SQL SELECT's to a single table. When these tables have say, 25 million or 250 million rows, not having SQL table joins matters.
Ok, so you're a wiz and you've known that all along. Then consider the symmetry in this diagram - where Users, Tags and Items have a co-equal relationship with each other. It means that if you use this data model, then at the core you can write a small(er) number of SQL queries that you can re-use across the application by replacing one pair of tables with another. And when you write a query to slice by one particular pair of dimensions using one particular set of restrictions in the "WHERE" clause, you can quickly do the same across the other two pairs of dimensions. This means using this data model isn't just good for data organization, it's also good for the organization of your application code.
Consider the fact that most folksonomy apps today don't deal with Tags at the same level as Users and Items and you see the asymmetry there and the potential power of exploiting symmetry.
Answering questions about Tags requires a whole different set of acrobatics in application code when the Tags are stored as a comma separated list in a column in a table. Very difficult to slice by that dimension when it's hidden away. Also, doesn't allow you to use the power of the database. And, any optimizations you do for scaling in the User and Items dimensions won't apply to Tags or worse still may pull in the opposite direction from optimizations for the Tag dimension.
So there's a lot of goodies in these pretty diagrams, and we haven't even scratched the surface yet.
Finally, when we combine all these into one diagram we have the central Mandala of Data 2.0 worth meditating on, until we reach FolkSatori. Or is it TagZen ?