Wednesday, June 08, 2005

Fundamental Scalability Problem of Data 2.0

As the web has grown in use and entered the mainstream, server infrastructures have been subjected to ever increasing load from users. The well known "10 Thousand Client" problem C10K was formulated quite a while ago for web servers and the solutions have received some attention in architecture primarily in asynchronous request handling as a method for managing massive concurrency.

Users of Web 2.0 can now generate, store and manipulate voluminous streams of tagged digital data, or "tagstreams". These tagstreams constitute the input sources of Data 2.0. And they create a corresponding scalability problem at the database end that needs to be recognized and attacked by evolving new strategies for this problem.

The corresponding scalability problem for Data 2.0 is the

"1 Billion Row problem".

How do we manage databases, for folksonomy apps, which contain tables approaching 1 billion rows in size?

The 1 Billion Row Problem (or "R1B problem" for short), is not "doom and gloom" thinking or fanciful exaggeration. 1 Billion rows is not as wild as it may seem at first and not that far off in the future either. Let's go to the chalkboard.

Here's the result of a simple experiment.

I created a database based on the Foundation Data Model for Folksonomy Navigation, with 1000 users, 1000 tags (random strings), and 10,000 tagged items. Samples were generated with a Python script using the Python Standard Library random.expovariate() function tweaked a little to account for the errors due to rounding to integers. The version of Python used was 2.3.5 and this was on MacOS 10.4.

A user in this model has an average of 50 items and each item an average of 5 tags, both averages based on an exponential distribution.

With these and generating the association and fact tables by using the Python script, I had, with Number of Users = 1000, Tags = 1000, Items = 10,000


Table # of Rows
-----------------------------
UserItem ~ 56,000
ItemTag ~ 55,000
UserTag ~ 230,000
UserTagItem ~ 310,000

If we were to scale this up linearly (questionable, but stick with me for a bit) and assume 1 million users, 1 million tags and 10 million items we would multiply everything by 1000 so the UserTags and UserTagItem tables would be ~250,000 x 1000 which is 250 million rows. Whether this scales up linearly or not is TBD in the next few experiments but the order of magnitude (quarter of a billion) suggests we are not wildly off the mark and in fact may be underestimating the problem.

Yup, data volumes in Data 2.0 will be a bear. And not your average bear either.

3 Comments:

Anonymous randomdude said...

Read your posts with interest. Note sure i agree with some of your assumptions but still. Don't know if you spotted it, but you can reduce the scalability burden slightly using thinking about tagging behaviour. i.e. many users' tags for items will overlap. Put another way most people will tag a picture of a car as "car".

With this knowledge you see that the useritemtag table will contain a lot of redundant info and a better solution would be a user-itemtag table. ie. a table joining user_id & itemtag_id.

Extending this you can further decompose the useritemtag table into 2 more tables: item-usertag, tag-useritem. I'd call these normalised 2nd order joins - usertagitem being 2.5 order?

3:58 AM  
Anonymous randomdude said...

Knocked up a quick MySQL command of my previous comment. It shall be known as the Zurt Tag Schema or ZTS.


CREATE TABLE `users` (
`id` int(10) unsigned NOT NULL auto_increment,
`name` varchar(50) NOT NULL default '',
`email` varchar(50) NOT NULL default '',
PRIMARY KEY (`id`),
KEY `name` (`name`)
) TYPE=MyISAM AUTO_INCREMENT=100;


CREATE TABLE `tags` (
`id` int(10) unsigned NOT NULL auto_increment,
`tag` varchar(50) NOT NULL default '',
PRIMARY KEY (`id`),
KEY `tag` (`name`)
) TYPE=MyISAM AUTO_INCREMENT=100 ;


CREATE TABLE `items` (
`id` int(10) unsigned NOT NULL auto_increment,
`item` varchar(50) NOT NULL default '',
PRIMARY KEY (`id`),
KEY `item` (`name`)
) TYPE=MyISAM AUTO_INCREMENT=100 ;


CREATE TABLE `usertags` (
`id` int(10) unsigned NOT NULL auto_increment,
`user_id` int(10) unsigned NOT NULL default '0',
`tag_id` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `tag_id` (`tag_id`)
) TYPE=MyISAM;

CREATE TABLE `useritems` (
`id` int(10) unsigned NOT NULL auto_increment,
`user_id` int(10) unsigned NOT NULL default '0',
`item_id` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `item_id` (`item_id`)
) TYPE=MyISAM;

CREATE TABLE `itemtags` (
`id` int(10) unsigned NOT NULL auto_increment,
`item_id` int(10) unsigned NOT NULL default '0',
`tag_id` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `item_id` (`item_id`),
KEY `tag_id` (`tag_id`)
) TYPE=MyISAM;


CREATE TABLE `user-itemtags` (
`id` int(10) unsigned NOT NULL auto_increment,
`user_id` int(10) unsigned NOT NULL default '0',
`itemtag_id` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `itemtag_id` (`itemtag_id`)
) TYPE=MyISAM;

4:27 AM  
Blogger Nitin said...

Hey,
Thanks for taking the time and thinking through some alternatives.
user_itemtags, certainly sounds promising....
One thing to note, though, is that in the fact table approach we are not trying to optimize out redundancies, in fact we are trading diskspace (redundant data) for time (faster lookups).
So with user_itemtags will the real world use cases actually need to do joins between user_itemtags and itemtags - it does look like that.
And does the "car tagged with car" use case completely describe a user's tags - probably not.
So when I need to get all a users tgas I will now have to do a bunch of queries across a bunch of tables.
The lock contention and the join overhead will be non-trivial.
The basic principle in the fact table approach is localize the data in one table so you can answer most questions with simple selects.
The issues of scalability are in reducing query cost, not in reducing space cost - as I said we are in fact trading space to gain query time.

11:01 AM  

Post a Comment

<< Home