Sunday, May 08, 2005

Web 2.0 needs Data 2.0

Tim O'Reilly, talking about Web 2.0 at the MySQL User Conference 2005, said "The future belongs to data". This confirmed and verbalised something I had been formulating with my gut for the last couple of months.

A major characteristic of Web 2.0 apps is folksonomy - the loose classification of content by mnemonic keywords or "tags". Today we see these tag-based apps - "tagapps" - packaged as web applications. Tomorrow they'll be everywhere: on your laptop, iPod, camera, phone, point-of-sale terminal ... grabbing, synthesising, suggesting, storing, retrieving, correlating tags. Private tags on private data on private storage devices; shared tags on shared data in shared dataspaces; public tags on public data in massive public web tagapps.

News Flash: Massive amounts of data headed your way. ("Doh!" you say ... but wait....).

Crawling ticker subtext: We may need new thinking to be able to manage massive amounts of tagged data on the web and off.

We'll need ways to efficiently manage and use massive amounts of content and tags in tagapp databases - tagschemas. They'll need to be stored and accessed just as easily on the web as off. Along with the emergence of AJAX and "user-centered design", we need a clearer idea of how to model and manage this data. We need a whole new set of models and methodologies to complement the existing ones based on normalized entity-relationship models. We'll need to borrow from another data management discipline - data warehousing - and meld it into normalized data models to create effective tagschemas. You'll have your own personal data warehouse, if you will: portions of it on your iPod, portions on your laptop, and still other, more public, portions in the massive tag repositories on the web that have yet to be built. We need to evolve all this rapidly over the next few years.

We need, in effect, Data 2.0.

One thing appears certain - Web 2.0 with Data 1.0 will grind to a halt before it leaves the station. The sheer volume of data is already bringing normalized databases to their knees.

One example of Data 1.0 is the approach to performance tuning tagapps for massive scaling. The usual response to web app database performance issues so far has been to partition the app over multiple servers, add more hardware, add caching and make various minor tweaks to the schema. That is, most of the effort is put into improving the physical infrastructure.

It's the schema, though, i.e. the data model, that will place a hard limit on how much you can improve performance using other methods. Not only that, but the performance degradation due to schema issues will grow as the 3rd or 4th power of the data size. Infrastructure improvements will typically give a performance improvement of a factor between 1 and 10. Most of the time it is between 1 and 2.

More on this math in another post. Hint: multi-table SQL joins on tables with millions of rows in them.

Tagapp databases - tagschemas - have some common characteristics because their data is based on users tagging content. To wit, the last three words in the previous sentence immediately imply that "user", "tag" and "item" will be part of every such schema. These will be either entities (tables) or attributes (columns in a table). The act of a user tagging an item will certainly need to be recorded in the database as an association between the user, the tag and the item, perhaps with date/time info and other metadata. There will also be groups and group memberships that define social interactions, along with public/private visibility attributes. This will be true of all tagschemas, no matter what content is tagged.
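
To make the user-tag-item abstraction concrete, here is a minimal sketch of one possible normalized layout, written in Python against an in-memory SQLite database. Every table, column and function name is my own assumption for illustration; none of it is taken from any real tagapp. Groups and group memberships are left out to keep it short, and visibility is reduced to a single public/private flag on the tagging itself.

```python
import sqlite3

# Hypothetical names throughout: only the concepts (user, tag, item,
# tagging association, visibility) come from the discussion above.
SCHEMA = """
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name  TEXT NOT NULL UNIQUE);
CREATE TABLE tags  (tag_id  INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
CREATE TABLE items (item_id INTEGER PRIMARY KEY, url   TEXT NOT NULL UNIQUE);
-- One row per act of tagging: who applied which tag to which item, and when.
CREATE TABLE taggings (
    user_id   INTEGER NOT NULL REFERENCES users(user_id),
    tag_id    INTEGER NOT NULL REFERENCES tags(tag_id),
    item_id   INTEGER NOT NULL REFERENCES items(item_id),
    tagged_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    is_public INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, tag_id, item_id)
);
"""


def build_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def get_id(conn, table, key_col, value):
    """Return the id of `value` in `table`, inserting the row if it is new.

    `table` and `key_col` are fixed strings supplied by this module,
    never user input, so formatting them into the SQL is safe here.
    """
    conn.execute(f"INSERT OR IGNORE INTO {table} ({key_col}) VALUES (?)", (value,))
    row = conn.execute(
        f"SELECT rowid FROM {table} WHERE {key_col} = ?", (value,)
    ).fetchone()
    return row[0]


def tag_item(conn, user, tag, url, is_public=True):
    """Record that `user` applied `tag` to the item at `url`."""
    conn.execute(
        "INSERT OR IGNORE INTO taggings (user_id, tag_id, item_id, is_public) "
        "VALUES (?, ?, ?, ?)",
        (
            get_id(conn, "users", "name", user),
            get_id(conn, "tags", "label", tag),
            get_id(conn, "items", "url", url),
            int(is_public),
        ),
    )


def items_for_tag(conn, tag):
    """Return the URLs publicly tagged with `tag`, in alphabetical order."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.url
        FROM taggings t
        JOIN tags  g ON g.tag_id  = t.tag_id
        JOIN items i ON i.item_id = t.item_id
        WHERE g.label = ? AND t.is_public = 1
        ORDER BY i.url
        """,
        (tag,),
    ).fetchall()
    return [r[0] for r in rows]
```

Even the simplest question here ("which items carry this tag?") already joins three tables, which is the kind of query the hint above about multi-table joins is pointing at.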

So while Flickr, del.icio.us and Technorati may seem to be different because of the different content types they store, the big picture, the abstract or logical data model, is remarkably similar at the core. That is not to say that their MySQL databases would look similar if we were to look under the HTML covers. They may each have implemented the "user-tag-item-group-permission" abstraction in different physical ways. Aside from the content types they store, what also differs is how their users query the data. Optimizing physical design to improve query performance pulls that design in different directions, which is another reason the physical designs could differ.
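
As a rough illustration of how a query pattern can pull the physical design, the sketch below stores the same tagging facts twice: once as detail rows and once as a redundant summary table of per-tag counts, in the spirit of a data warehouse aggregate. It is Python over in-memory SQLite, all names are hypothetical, and it is not a description of how any of the sites above actually work.

```python
import sqlite3


def build_db():
    """Create a detail table plus a redundant per-tag summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE taggings (
            user_name TEXT NOT NULL,
            tag       TEXT NOT NULL,
            item_url  TEXT NOT NULL,
            PRIMARY KEY (user_name, tag, item_url)
        );
        -- Denormalized summary: derivable from taggings, kept for fast reads.
        CREATE TABLE tag_counts (
            tag  TEXT PRIMARY KEY,
            uses INTEGER NOT NULL
        );
        """
    )
    return conn


def record_tagging(conn, user_name, tag, item_url):
    """Insert a tagging and keep the summary in step, in one transaction."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO taggings VALUES (?, ?, ?)",
            (user_name, tag, item_url),
        )
        if cur.rowcount == 1:  # skip the counter update for duplicate taggings
            conn.execute("INSERT OR IGNORE INTO tag_counts VALUES (?, 0)", (tag,))
            conn.execute("UPDATE tag_counts SET uses = uses + 1 WHERE tag = ?", (tag,))


def popular_from_detail(conn, limit=3):
    """Most-used tags computed by aggregating every detail row."""
    return conn.execute(
        "SELECT tag, COUNT(*) AS uses FROM taggings "
        "GROUP BY tag ORDER BY uses DESC, tag LIMIT ?",
        (limit,),
    ).fetchall()


def popular_from_summary(conn, limit=3):
    """Most-used tags read straight from the precomputed summary."""
    return conn.execute(
        "SELECT tag, uses FROM tag_counts ORDER BY uses DESC, tag LIMIT ?",
        (limit,),
    ).fetchall()
```

Both functions return the same answer; the difference is where the work happens. The summary moves cost from read time to write time, a trade a site dominated by "popular tags" pages might accept and a site dominated by per-user lookups might not.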

There's a lot more that can be said about what's common and what's different in these early Web 2.0 tagschemas. There's a lot of valuable knowledge on the way to Data 2.0, waiting to be discovered and invented.

Hence this blog.

This blog is a conscious effort to stir the pot on this problem, to stimulate discussion and debate and to evolve a set of best practices for tagapp database schemas, through the collective wisdom of the 'net.

I'll get things rolling soon by pointing to a number of existing designs and suggesting some lines of thinking outside the box.

I'd like to invite comments on the relevance of this effort and solicit tips and techniques that you'd like to share with the rest of us. If you'd like to participate in some of the performance testing efforts please say so as well.

The future belongs to data, Data 2.0.


Tags: folksonomy data tagschema

4 Comments:

Blogger Bud said...

Could not agree more with the sentiments you express here. I wanted to point you to a couple of links and other folks who are talking about these topics:

http://thecommunityengine.com/home/archives/2005/04/xfolk_schema_fo.html

Also, look at

http://theryanking.com

http://jluster.org

http://tantek.com

I have published an xhtml microformat for folksonomy that is in version 0.3. It is at:

http://thecommunityengine.com/home/archives/2005/04/xfolk_03_xhtml.html

It has been implemented as a drupal module. See:

http://hybernaut.com/xfolk-001

You can see all of my xFolk writings, very related to your ideas at:

http://thecommunityengine.com/home/archives/xfolk/index.html

I am in the process of producing xFolk 0.4. I would appreciate any feedback.

12:30 PM  
Blogger Nitin said...

bud,

Thanks much for the info and links to a number of fascinating discussions.

While there's a lot in common with what I wrote re: Data 2.0, the big difference technically is in the use of xml for metadata management as opposed to relational databases esp. MySQL.

My comments were not meant to suggest that relational databases cannot handle folksonomy metadata. Rather, the data models i.e. entities and relationships used will cause massive performance issues. There are ways of handling these large volumes of data in the relational database world, esp. using techniques from data warehousing which is where I am going with the Data 2.0 idea.

While relational-database-based folksonomy schemas will need serious design up front, my belief is that they will scale well when that proactive approach is taken. A large corpus of XML files with metadata may not scale as well, especially with 100's of millions of entries - my gut feeling is the filesystem will be one bottleneck and the XML parsing overhead the other. But that's just a guess.

xFolk sounds very interesting and I intend to track it - will add comments at some point when I have something intelligent to say.

Thanks again for the links.

Nitin

10:45 AM  
Blogger Dave said...

great stuff nitin. nice to meet you last night at tag tuesday.

7:48 AM  
Anonymous Alex James said...

I have a post on much the same topic here http://www.base4.net/Blog.aspx?ID=36

The essence of my post is that we need to upgrade the idea of the Foreign Key, by way of the analogy that a Foreign Key is really just a hyperlink.

6:54 PM  
