Sunday, June 05, 2005

Slicing and Dicing Data 2.0 (Part 1)

The facts about dimensions.


While I am working on more hard data and data models, I thought it might be useful to put thoughts to "paper".

The key insight, IMHO, of Data 2.0 is the recognition that tagschemas represent data warehouses of tagged data. Hence analysis of tagschemas must include dimensional modeling. Before we get into that, let's take an informal look at the three main dimensions along which we view folksonomy applications - namely "Users", "Tags" and "Items". Folksonomy applications provide a domain-specific focus (photos, events, goals, URLs) and give users access to data that varies along these three dimensions. This is exactly the kind of variation that dimensional analysis deals with.

Dimensional data models are to data warehousing as normalized data models are to transactional applications.

Dimensional modeling originated within the discipline of data warehousing proper, but is very likely to spread virally all over the brave new world of Data 2.0 soon. To get to the point, dimensional analysis of data focuses on cross-cutting concerns in data and isolates each such concern in a dimension. A dimension is roughly similar to an 'aspect' for data, in that it cuts across the whole app. But it is more than an aspect in that it defines a primary independent axis along which data varies. So fundamental to Data 2.0 is a far deeper appreciation of dimensional modeling of data, not just normalized transactional modeling. This is the first thing we need to assimilate from data warehousing.

The next thing we need to import from data warehousing is something that goes hand in hand with dimensional modeling, the concept of a "fact table". This is a table that associates the dimensions involved in an atomic event - typically the act of a user tagging an item - and records each such event as a row. A tagschema fact table will contain a row for each occurrence of the event "user X puts tag Y on item Z", for all users X who are actively tagging items, all tags Y used by user X, and all items Z that user X tagged.

So if user 'jill' uses the tags 'physics', 'math', 'biology' and 'chemistry' and tags 4 articles from a website discussing science, we may get a set of rows in the fact table of the form


User     Tag          Item
----------------------------------------------------------------
'jill'   'math'       'http://scifoo.com/math-article'
'jill'   'math'       'http://scifoo.com/physics-article'
'jill'   'physics'    'http://scifoo.com/physics-article'
'jill'   'math'       'http://scifoo.com/chem-article'
'jill'   'chemistry'  'http://scifoo.com/chem-article'
'jill'   'biology'    'http://scifoo.com/bio-article'


User 'joe' will have a similar set of rows for all his tagged items, user 'jane' for hers, etc.
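To make this concrete, here is a minimal sketch of such a fact table using Python's built-in sqlite3 module. The table name `tag_facts` and the column names are my own assumptions for illustration, not a prescribed schema; the rows are jill's rows from the listing above.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one column per dimension, one row per tagging event.
conn.execute("""
    CREATE TABLE tag_facts (
        user_name TEXT NOT NULL,
        tag       TEXT NOT NULL,
        item_url  TEXT NOT NULL,
        PRIMARY KEY (user_name, tag, item_url)  -- each event is recorded once
    )
""")

# jill's tagging events, copied from the listing above.
jill_rows = [
    ("jill", "math",      "http://scifoo.com/math-article"),
    ("jill", "math",      "http://scifoo.com/physics-article"),
    ("jill", "physics",   "http://scifoo.com/physics-article"),
    ("jill", "math",      "http://scifoo.com/chem-article"),
    ("jill", "chemistry", "http://scifoo.com/chem-article"),
    ("jill", "biology",   "http://scifoo.com/bio-article"),
]
conn.executemany("INSERT INTO tag_facts VALUES (?, ?, ?)", jill_rows)

# Count the recorded facts and the distinct items they touch.
fact_count = conn.execute("SELECT COUNT(*) FROM tag_facts").fetchone()[0]
item_count = conn.execute(
    "SELECT COUNT(DISTINCT item_url) FROM tag_facts"
).fetchone()[0]
print(fact_count, item_count)
```

The composite primary key is one possible design choice: it treats "user X puts tag Y on item Z" as something that can only happen once per (X, Y, Z) combination.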

The fact table is a multi-way association table associating available dimensions and recording the values of each dimension at the point of association. Fact tables allow a number of interesting queries to be run against tagschemas, probably with the best scalability a relational database can offer. I am massively summarising here, and this doesn't capture all the reasons why this works.

Other examples? In tagschemas, "User", "Tag" and "Item" represent obvious dimensions. These are clearly cross-cutting concerns of all tagapps. What else? Obviously, "Time" and "Location". Even without "Tag" and "Item", "User", "Time" and "Location" form another basis for a dimensional model containing atomic facts about who was where at what time. Or with "Item", "Time" and "Location" we record data such as when and where a photo was taken.
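As a rough sketch of the "Item", "Time", "Location" case, the same idea might look like this in Python with sqlite3. Everything here is assumed for illustration: the table name `photo_facts`, the column names, the helper `photos_taken`, and the three sample rows (which use made-up example.com URLs and places).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical fact table over the Item, Time and Location dimensions.
conn.execute("""
    CREATE TABLE photo_facts (
        item_url TEXT NOT NULL,
        taken_at TEXT NOT NULL,  -- ISO 8601 date, so text order matches time order
        location TEXT NOT NULL
    )
""")

# Made-up sample facts, purely for illustration.
photo_rows = [
    ("http://example.com/photos/1", "2005-05-01", "Paris"),
    ("http://example.com/photos/2", "2005-05-03", "Paris"),
    ("http://example.com/photos/3", "2005-06-02", "Oslo"),
]
conn.executemany("INSERT INTO photo_facts VALUES (?, ?, ?)", photo_rows)


def photos_taken(conn, location, start, end):
    """Return item URLs for photos taken at `location` between two dates (inclusive)."""
    rows = conn.execute(
        "SELECT item_url FROM photo_facts "
        "WHERE location = ? AND taken_at BETWEEN ? AND ? "
        "ORDER BY taken_at",
        (location, start, end),
    )
    return [row[0] for row in rows]


# Fix a value on the Location dimension and a range on the Time dimension.
may_in_paris = photos_taken(conn, "Paris", "2005-05-01", "2005-05-31")
print(may_in_paris)
```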

Once you have fact tables for your folksonomy data, you can start answering all kinds of interesting questions using simple SQL selects without multi-table joins. Or, as the data warehousing folks like to call it, "slice and dice by dimension".
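Here is a small sketch of what such single-table selects could look like, again using Python's sqlite3. The `tag_facts` table, its column names and the helper functions are assumptions of mine; jill's rows come from the listing earlier in the post, while the two rows for 'joe' are invented so that the "User" dimension has more than one value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tag_facts (user_name TEXT, tag TEXT, item_url TEXT)"
)

facts = [
    # jill's rows from the example earlier in the post
    ("jill", "math",      "http://scifoo.com/math-article"),
    ("jill", "math",      "http://scifoo.com/physics-article"),
    ("jill", "physics",   "http://scifoo.com/physics-article"),
    ("jill", "math",      "http://scifoo.com/chem-article"),
    ("jill", "chemistry", "http://scifoo.com/chem-article"),
    ("jill", "biology",   "http://scifoo.com/bio-article"),
    # hypothetical rows for joe, made up for illustration
    ("joe",  "math",      "http://scifoo.com/math-article"),
    ("joe",  "physics",   "http://scifoo.com/physics-article"),
]
conn.executemany("INSERT INTO tag_facts VALUES (?, ?, ?)", facts)


def tag_counts(conn):
    """Dice by the Tag dimension: how often is each tag applied?"""
    return conn.execute(
        "SELECT tag, COUNT(*) FROM tag_facts "
        "GROUP BY tag ORDER BY COUNT(*) DESC, tag"
    ).fetchall()


def items_for_tag(conn, tag):
    """Slice on one Tag value: which items carry this tag, for any user?"""
    rows = conn.execute(
        "SELECT DISTINCT item_url FROM tag_facts WHERE tag = ? ORDER BY item_url",
        (tag,),
    )
    return [row[0] for row in rows]


def users_per_item(conn):
    """Dice by the Item dimension: how many distinct users tagged each item?"""
    return dict(conn.execute(
        "SELECT item_url, COUNT(DISTINCT user_name) FROM tag_facts GROUP BY item_url"
    ).fetchall())


print(tag_counts(conn))
print(items_for_tag(conn, "math"))
print(users_per_item(conn))
```

Each query touches only the fact table; no join against separate user, tag or item tables is needed because the dimension values are stored in the fact rows themselves.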

I have started running some experiments using three-way fact tables for users, tags and items, and even with only a thousand users the results include some interesting numbers. I'll be posting them here in the coming days/week(s), so stay tuned.

1 Comments:

Blogger Gaurav said...

Even though I am too late in posting a reply, I would like to know what the results of your experiments were.

And is there any better schema possible for storing tag apps than using our old dimensional models like snowflakes?

5:07 AM  
