All My Items Are Belong To Us
We preview now, an important topic to be dealt with in more detail later. This is about a vital element of building communities based on folksonomies - sharing items and making sets of items visible to groups of Users.
Users publish items (photos, URL's ...) and may wish to restrict visibility to a defined group of Users. So another set of questions that needs to be answered with queries to the folksonomy database, relate to visibility of items.
e.g.
* What items belonging to Jim are visible to me?
* Who are all the users who can see the specific item I published, and made visible to a number of groups? Or even more complicated,
* Given a set of items I have published what is the set of users that can see all of them? Any of them?
These kind of queries relate Users in two different roles, publishers and viewers. So queries that answer these questions involve a special case of the SQL JOIN called a SELF-JOIN where a table is joined to itself. As mentioned before, with a User table with a million rows, we don't want to be doing these kind of queries. Fortunately, the fact table appproach we saw before is also very useful here. More diagrams coming but here's a preview.
Create a table PublisherItemViewers or if you wish VisibilityFacts. It has three columns 'publisher', 'item' and 'viewer'. The 'publisher' column contains userids of Users who have published Items. The 'item' column contains itemids of Items published and the 'viewer' columns contains userid's of Users who are allowed to view this Item.
Thus each row (publisher,item,viewer) records an atomic fact about a user X having published an item Y which is viewable by another user, Z. For a given User and a given Item, it will possibly be visible to a number of Users. So there will be many such rows in VisibilityFacts with identical value in the first two columns, one row for each user that can see this item, with userid of viewer in the third column.
Now we can answer all the questions mentioned above and more, using simple SELECT statements on the VisibilityFacts table.
A quick numerical calculation. Given 1 million users with say 50 items each with each item published to an avg. of 20 users, we have a 1 Billion rows in this table.
Another Data 2.0 scalability challenge.
Users publish items (photos, URL's ...) and may wish to restrict visibility to a defined group of Users. So another set of questions that needs to be answered with queries to the folksonomy database, relate to visibility of items.
e.g.
* What items belonging to Jim are visible to me?
* Who are all the users who can see the specific item I published, and made visible to a number of groups? Or even more complicated,
* Given a set of items I have published what is the set of users that can see all of them? Any of them?
These kind of queries relate Users in two different roles, publishers and viewers. So queries that answer these questions involve a special case of the SQL JOIN called a SELF-JOIN where a table is joined to itself. As mentioned before, with a User table with a million rows, we don't want to be doing these kind of queries. Fortunately, the fact table appproach we saw before is also very useful here. More diagrams coming but here's a preview.
Create a table PublisherItemViewers or if you wish VisibilityFacts. It has three columns 'publisher', 'item' and 'viewer'. The 'publisher' column contains userids of Users who have published Items. The 'item' column contains itemids of Items published and the 'viewer' columns contains userid's of Users who are allowed to view this Item.
Thus each row (publisher,item,viewer) records an atomic fact about a user X having published an item Y which is viewable by another user, Z. For a given User and a given Item, it will possibly be visible to a number of Users. So there will be many such rows in VisibilityFacts with identical value in the first two columns, one row for each user that can see this item, with userid of viewer in the third column.
Now we can answer all the questions mentioned above and more, using simple SELECT statements on the VisibilityFacts table.
A quick numerical calculation. Given 1 million users with say 50 items each with each item published to an avg. of 20 users, we have a 1 Billion rows in this table.
Another Data 2.0 scalability challenge.



7 Comments:
Nitin, pray tell what your suggested "solutions" to the scalability problems are. I'm peeing in my pants as i await with bated breath.
I myself am programming a "freetagging" opensource project that allows users to tag anything they want. It has been my intention to allow the databse to massively scale, yet still provide the user/multiple users to perform statistical queries (in particular, counts) quickly.
Not having taht much DB design background, I learnt of the problem of the super poor scaleability of fully normalized tables the hard way.
My code was performing beautifully up 10k objects to be tagged, 6k tags, and 5k users.
I then decided to fill everything up to about a million each.
Needless tosay, the multitude of joins between all my tables killed me.
15 second long queries just won't do.
I am in the midst of attempting to reduce the number of joins to ZERO by including a universal "FACT" table.
However, because I have many other classification fields such as "labels/categories" "groups" "Permissions" etc, my FACT table is
going to be huge. With each extra column i add in, it my scale another 20-100 times. I face a HUGEEEEEEEE scaleablity problem. I might hit 1m rows in the fact table FAR before i even hit 10000 users.
I was thinking of having separate fact tables that store only the columns of interest in a query. But given the multitude of possible statistical queries, this might mean a lot of FACT tables, which would then take its toll on the insert speed.
not to mention i'd probably have tons of indexes for each fact table to prevent table scans, further increasing insert times.
Another big problem of scaleability issue with the massive fact tables, is that COUNTS themselves (even on a single table w/o joins) will take a while if the table is huge enough.
I am even considering having "count" fields /(temporary?) tables that manually get increased /decreased whenever a transaction occurs.
This way, by indexing the count field , i can get a an extremely quick idea of the top 50 tags by a simple "SELECT tagcount from counttable limit 0,50 order by count desc".
Currently the other option is waiting for the db to do a table scan through the 1bn rows, and do a manual count on whichever field i grouped by.
I seriously am very curious what sites like delicious, furl, spurl, use for their database schema. Delicious has 5 servers for MYSQL and 2 for apache. With my current db schema, my p4 2.53 2gb box can't even do a single query in less than 10 seconds due to the multitude of joins btw million row tables.
Do you think its something more than just FACT/Summary tables? Joshua did mention ideas such as individual tables for each user and things like that.
Would it be possible for me to email you? I have further issues i'd like to discuss.
Many thanks, and great job with this site.
Data 2.0's scalablity issue needs to be solved.
Another couple of thoughts:
Would using different DBs help?
What about a lower level db like berkeley db, which is nothing more than key value mappings.
Berkely db supposedly support HUGE HUGE HUGE tables (which may help for the scalablity issues), and allow for many concurrent reads/writes.
Afaik, amazon and google use it for their user profiles. How well bdb would suit folksonomy type apps is another qn though. I once read that the page that first appears when u login to Amazon hits the BDB with 800 queries.
I do understand that the ability of a db to handle many simple reads quickly does not necessaraily make it the best option.
Another solution could be www.kowari.org. I myself do not know how well this would scale. Apparently joshua of delicious says it is non performant.
Other ideas include more radical solutions such as 1 table per tag/user/URL/etc, although come to think of it, this would probably also be considered a type of FACT table.
oh and also add to the couple of thoughts:
possibility of a distributed db schema. eg, dividing by geographical location, division of fact tables by alphabetical/numerical/chronological order (A-G, H-N, O-Z, or May 2005, June 2005, or UseriD 1-100000, 100001 to 200000, etc).
This might help a little, but there also disadvantages when u try to do stastical queries that base calculations off of the entire database of records.
Then you are back to unions or joins, and a HUGE indexing problem due to the multitude of tables.
Hi,
Very interesting real world experience. I've created an email list for discussion of these issues.
Please see
Tagdb . Hope to see you there.
Hi,
Bad link in previous comment pl. use this one
Tagdb .
I couldn't find a better way to contact you, so I apologize that this may be off topic:
Are you aware of any PIMs that employ tagging of things like, text files?
I'd really like something that runs on my pda and syncs with a desktop client = )
-Dan
I don't SQL can solve these problem to be honest. Every model runs into the same "intersecting two huge sets" and "full table scans for counts" problems.
Post a Comment
<< Home