Thursday, May 08, 2008

Guten Tag. Und Willkommen in Tag Schema

This blog was originally created to discuss database designs underlying the emerging tagging or folksonomy applications (Flickr, del.icio.us ...) at the leading edge of Web 2.0. Over the next year or so the frenetic activity in that area stabilised and the focus moved to the so-called 'social network' applications. While those were not the subject of the original folksonomy discussions, they underlined some important and related database issues.

One such issue is that of the database schemas induced by the need to model 'friend' relationships. As it turns out these are just as interesting as the folksonomy schemas. In parallel, other issues emerged - "data in the cloud" and "data portability". The former is about the move away from centralized, relational databases based on SQL and the latter is about data ownership issues created by having personal digital assets distributed all over the Internet captive inside Web 2.0 applications.

So the current trends on the net have major implications for the underlying data structures. As the application architectures change and as new disruptive ones emerge, the underlying data layers experience corresponding tectonic shifts. And it is the new agenda of this blog to track all these data-related issues as they emerge. This is much broader than the original narrow issue of folksonomy database design but in retrospect it is a natural evolution. The only problem that remains is the name 'tagschema', which seems so narrowly focused on tagging database schemas.

Luckily at a recent MySQL event the solution emerged in a conversation with Kaj Arno of MySQL, now Sun. "Ahhh ..." he said, looking at the word 'tagschema' on my name tag. "That sounds like a German daily TV show, 'Guten Tag und Willkommen in Tag Schema' - that means 'Good day and welcome to Tag Schema - schema of the day'". I thought nothing more of it, but the phrase 'Guten Tag und Willkommen in Tag Schema' kept playing in my mind. Later, when the question of a name for the new blog came back to mind, I realized that Tag Schema could mean 'schema of the day' or 'current schema' or 'current trends in schema' or, more generally, 'discussions of the underlying structure of the day'. That is generally where we will be going with this blog: exploring new data structures and technologies as they emerge.

So thanks very much, Kaj Arno, for that moment of zen serendipity.

And so 'Guten Tag. Und Willkommen in Tag Schema'

How many times do I have to tell you? ........ Don’t …. Repeat ….. Yourself.

It is 2008 – do you know where your avatar is? I don’t. I have copied it into so many different web apps I have no idea where it’s been. And I just got a request to update my address book from yet another address book provider with an address of mine copied 5 years ago.

The social web has become a giant cookie monster – growling “Gimme Copy, Gimme Copy, yeah yeah yeah … mmmmmm Copy”

The DRY (Don’t Repeat Yourself) principle has been touted by designers of Rails and other modern web app frameworks. It is ironic then that these frameworks have been used to build a whole generation (Web 2.0) of apps that force the user to make copies of data again and again into each web app. In the data world the DRY principle reads "Don't Make Copies" (DoMaCo).

How many times do I have to tell you – Don’t …. Repeat ….. Yourself. Dear web app builders – you created the Internet Copy Monster – you need to help stamp it out. But how? Read on.



Fig 1. Yesterday's web app architecture – forced violation of DRY principle at the data level.

Most Web 2.0 apps do not expose REST URIs for every data element. This means users can’t access their data freely, which means they can’t reuse their data in other web apps. The real added value of a successful social web app lies in the community interactions and the UI elegance that enables community interactions, not in the data management layer.

For example the popularity of a service such as Flickr is primarily due to their innovations around tagging and “interestingness” and the very active community, not because of their massive data storage facilities or their disk farms, which are a cost center.

That means Flickr would still be Flickr even if the data layer in Fig 1 were not owned by Flickr. Think about that for a minute and apply that across the social web. Note also that Flickr allows you to embed Flickr photos into other applications – so Flickr photos can become definitive instances of your photo data. Flickr exposes data pointers – in effect they have become a next generation Internet data layer for photos.

This leads to the possibility of a general purpose data layer - not part of the web app but part of the Internet infrastructure - a data layer which contains user digital assets and all social data. This would be provided by a new class of service provider, the “data service provider”, who would give you URIs to all your content, give you full control of and access to your data, and would be a for-fee service.

Web apps would only point to data in this data layer and not be part of the huge Internet Copy Monster.

Now consider tomorrow’s web application architecture which is already in place in parts. I call this the Yinas approach – YINAS being a recursive acronym for Yinas Is Not A Silo.



Fig 2. Yinas. A web app architecture that doesn’t violate the DRY principle at the data level and respects user data rights.

This may seem like it needs a massive redesign of all web apps, but it doesn’t. It would just require a uniform approach to data embedding in web apps, most of which is already in place. The needed work is already done for most content except text, avatars and structured content such as address books. We already embed photos, video, and audio via URIs to remote content hosting services. We just need to extend this uniformly to all content types, not just image, audio and video, and we need to use it as a pervasive design principle across the web.
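To make the "pointers, not copies" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `DATA_LAYER` dictionary stands in for a data service provider, the `data.example` URI is made up, and a real app would dereference the URI over HTTP rather than look it up in memory.

```python
from dataclasses import dataclass

# Hypothetical data layer: one definitive copy of each asset, keyed by URI.
DATA_LAYER = {
    "https://data.example/alice/avatar": "avatar-v1.png",
}

@dataclass
class Profile:
    """A web app's view of a user: it stores a pointer, never the bytes."""
    user: str
    avatar_uri: str

    def avatar(self) -> str:
        # Dereference at render time so every app sees the current asset.
        return DATA_LAYER[self.avatar_uri]

# Two unrelated apps hold the same pointer instead of two copies.
photo_app = Profile("alice", "https://data.example/alice/avatar")
chat_app = Profile("alice", "https://data.example/alice/avatar")

# The user updates the asset once, in the data layer.
DATA_LAYER["https://data.example/alice/avatar"] = "avatar-v2.png"
```

After the single update, both `photo_app.avatar()` and `chat_app.avatar()` return the new asset, with no stale copy left behind in either app.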

In summary – let’s recognize “Don’t Make Copies” as a useful design principle for web app data and let’s consume pointers instead of copies.

Let's stamp out the Internet Copy Monster. Let's stamp out unnecessary repetition. Shall we? Shall we?

P.S.

And please forward a permalink to your friends, not a copy ;-)

Friday, February 01, 2008

Why Data Portability is a non-solution to a non-problem

I have written a draft on Backpack
Note: As of Feb 6 2008 the draft is now a post on GigaOm

Please leave comments there - comments on this post here are now closed.

Wednesday, July 25, 2007

Some thoughts on data rights

Tim O'Reilly was talking about data and data access in his keynote at OSCON 2007 today (Wed Jul 25th 2007). I thought I'd post some thoughts I have been chewing on for a while, even while I am still in the keynote. These issues have technical and philosophical implications. They are not about tags per se but do apply very strongly to data currently captive in contemporary folksonomy applications as well as other Web 2.0 applications. Comments and criticism invited.

A manifesto for data rights in a globally networked world

(Draft 1 Jul 25th 2007) (cc) Published under Creative Commons "Attribution No Derivatives" Licence

We consider the following to be axiomatic and universal

  1. Data is a first class citizen of the network.

  2. Data must not be held captive in an application or locked in proprietary application-specific file formats.

  3. Data must be readable and exportable directly, programmatically, completely without restriction and stored in open, non-proprietary formats.

    1. Programmatic data access must allow FULL export and read capability independent of what the human UI allows
    2. Arbitrary restrictions must not be placed on data access by the application controlling the data, whether due to unintentional limitations of the application architecture or due to intentional design.


  4. Every unit of data must be independently addressable via a URI

    1. On the Internet, data should be accessible via REST based architectures


  5. Every unit of data must be capable of having an associated access policy, separately from other such units of data

    1. Each data unit must be able to have a possibly different access control policy
    2. The default access control policy of a data unit created by an individual must be "private"
    3. Policy change must be under the free control of the individual.
    4. Policy change must be under the control only of the individual.


  6. Data is property. Hence data access and ownership must be subject to rights strongly similar to or identical to physical property rights.

    1. No application, service, organization or other entity may require data exposure or implicit surrender of data ownership as a price of use or access to some facility

    2. Data exposure must be separately negotiated and be freely negotiable without coercion, according to the needs of the individual.
    3. "Website shrink wrapped licenses" are not considered to be a meaningful negotiation in this context.

    4. Data about an individual belongs to that individual and only to that individual, who may choose to share the data subject to their needs and no one else's

    5. Data does not belong to the incidental keepers of data representations (internet service providers, medical service providers, financial service providers, state and federal govt agencies)
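Axioms 4 and 5 are the most directly implementable, so here is a minimal Python sketch of them. The class name `DataUnit`, the policy values "private" and "public", and the example URI are my own illustrative assumptions, not part of the manifesto.

```python
from dataclasses import dataclass

@dataclass
class DataUnit:
    """One independently addressable unit of data (axiom 4)."""
    uri: str
    owner: str
    policy: str = "private"  # axiom 5.2: the default policy is "private"

    def set_policy(self, requester: str, policy: str) -> None:
        # Axioms 5.3 and 5.4: only the owning individual may change the policy.
        if requester != self.owner:
            raise PermissionError("only the owner may change the policy")
        self.policy = policy

    def readable_by(self, requester: str) -> bool:
        # The owner can always read; others only when the unit is public.
        return requester == self.owner or self.policy == "public"

# Each unit carries its own policy, separately from other units (axiom 5.1).
photo = DataUnit("https://data.example/alice/photo/1", owner="alice")
address = DataUnit("https://data.example/alice/address", owner="alice")
photo.set_policy("alice", "public")
```

Here `photo` becomes readable by anyone while `address` stays private, and an attempt by anyone other than `alice` to change either policy raises `PermissionError`.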

Sunday, October 01, 2006

Putting the "folk" back in folksonomy

Or ... The fat belly and recommendation systems



Since the beginning of Web 2.0 time, "folksonomy" has been synonymous with tagging. It's time to fill out the picture. As readers of this blog know, folksonomy involves tags, tagged-items, and tagger-users. This post digs deeper re: the role of users in the "holy trinity" of user-tag-item. And examines the relationship of users to recommendation systems, ... and to the "fat belly".

Yes, that does sound like a whole lot of ground to cover but

a) I have been gone for a while so need to catch up in a hurry - what can I say?
b) It's not that much ground to cover when we see the interesting relationships
c) The notation described in the previous post makes it possible to cover a lot of ground without too much verbiage.

So without further ado, here goes.

In a typical folksonomy system we have users attaching tags to items. As the system evolves we have, given an item 'i', the sets: -
T(i), the tags associated with i and
U(i) the users who use the item i.

Typical folksonomy apps have focused on navigating the various relationships with a focus on T(i). Recommendation systems that suggest 'related items' are also most often based on T(i), as follows. Given an item we find all tag related items via I(T(i)). Then we use some algorithm to trim this down to the "best" 5 or 10 by some definition of "best". Then we use these as recommendations. Given a user of item i, these are the recommended other items, or 'related items' based on tags.

For the rest of this discussion, we denote this set of recommendations as Rt(i), i.e. given an item i, the recommended other items based on tags.

Consider now, the other way to get related items, i.e. user-related items.
This is the famous "users who bought this item also bought ...." approach that we know and love.

Given an item i we get U(i) all the users of i, and then I(U(i)), all the items used by those users. Again we use some way to trim this down to the best 5 to 10 or so and recommend these. Given a user of item i, these are the recommended other items, based on users.
We denote this set of recommendations as Ru(i), i.e. the recommendations based on users of item i.
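As a rough sketch of the raw material for Ru(i), the following Python computes U(i) and then I(U(i)) as a multiset, counting how often each other item turns up. The toy data and the function names are invented for illustration; a real system would read these from the database.

```python
from collections import Counter

# Hypothetical toy data: user -> set of items that user has.
items_of_user = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b"},
    "cy": {"a", "d"},
}

def users_of_item(i):
    """U(i): the users who use item i."""
    return {u for u, items in items_of_user.items() if i in items}

def user_related_counts(i):
    """I(U(i)) as a multiset: how often each other item co-occurs with i."""
    counts = Counter()
    for u in users_of_item(i):
        counts.update(items_of_user[u] - {i})  # leave out item i itself
    return counts
```

For item "a" all three users qualify, so "b" is counted twice and "c" and "d" once each. Trimming these counts down to a handful of recommendations is the separate filtering step.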

Now comes the interesting part derived from work done at Odeo and Greenplum over the last year or so. Experiments suggest the following two major results, which need much more qualification by further work and study. This is only an indicator of interesting areas for research, not a formal proof of anything.

a) Empirical results suggest that for even a small set of users Ru(i) gives better recommendations than Rt(i), i.e. using user-related items gives better recommendations than using tag-related items.

b) Empirical results suggest that the "algorithm" we use to go from I(U(i)) to Ru(i) makes a lot of difference to the relevance and 'interestingness' of recommendations.

Ok, b) was really cryptic so we'll take the rest of this post to unpack it into useful results and pretty pictures.

Step by step,

I(U(i)) is the raw set of user related items for item i (people who bought item i also bought a whole ton of other shtuff namely I(U(i)) )

But that is too huge a set to use as recommendations - it could have anywhere from tens to tens of thousands of items depending on what data we are operating on. So we need to trim this down with a filter that filters out and keeps the best recommendations.

So I(U(i)) ---> Filter ---> Ru(i), i.e. after filtering the raw set of user-related items we get user-related recommendations.

Now we need to decide how to filter. Let's do the simple thing first.

First we sort the collection I(U(i)) by count, i.e. how many times does some item turn up in this collection.

The temptation is to take the top 10 items by count and use these as the recommendation. This is what I did in practice and found that the recommendations that are generated are only mildly customized, i.e. they are interesting in general but not necessarily interesting to me. Most of the time they are almost identical to the "most popular" items on the front page.

Why is this?

Because I *took* the most popular ones by count, I sampled the head of the distribution and didn't get anything new.

So then I decided to go the other way - I looked at the lower end of the counts and picked reco's from there, i.e. the proverbial "long tail". Now I got some strange and freaky recommendations - if you had subscribed to the Catholic podcast on Odeo you would have been recommended the Open Source Sex podcast. Not quite what we have in mind, when we say "recommendations".

This led me by accident to explore the remaining area of the range of counts, the middle, named the "fat belly" by Robert Young in a recent post on GigaOm.

Here is where things got very, very interesting in the recommendations generated. For example, Evan Williams, who has an interest in modern furniture, got a recommendation for a podcast related to furniture although none of his current subscriptions had anything to do with furniture!

This was very exciting and stimulated further exploration which confirmed that the best recommendations came from the fat belly.


So

I(U(i)) ----> Sort by count, filter from the head ----> "popular (i.e. obvious) "

I(U(i)) ----> Sort by count, filter from the long tail ----> "freaky (i.e. too different)"

I(U(i)) ----> Sort by count, filter from the fat belly ----> "relevant and interesting"
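The three filters can be sketched in a few lines of Python. Note the big assumption: I have not said where the head ends and the tail begins, so the 20% cut-offs below (`head` and `tail`) are placeholders to be tuned against real data, and the function name is made up.

```python
from collections import Counter

def split_by_count(counts, head=0.2, tail=0.2):
    """Rank items by descending count and cut the ranking into
    (head, belly, tail). The fractions are tunable assumptions."""
    # Sort by count, breaking ties by item name so the result is stable.
    ranked = [item for item, _ in
              sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    n = len(ranked)
    h = int(n * head)   # number of items treated as "popular"
    t = int(n * tail)   # number of items treated as "freaky"
    return ranked[:h], ranked[h:n - t], ranked[n - t:]

# Hypothetical co-occurrence counts for ten podcasts p0..p9.
counts = Counter({f"p{k}": 10 - k for k in range(10)})
head, belly, tail = split_by_count(counts)
```

With these ten items the head is the two highest counts, the tail the two lowest, and the belly the six in between. Recommendations would then be drawn from `belly`.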


Recommendation systems and the power law curve


Now the other interesting observation was that using similar techniques on I(T(i)) did not give such crisp recommendations, where I(T(i)) are all the tag-related items for a given item. i.e. collections of tags are not as useful as collections of users in creating a recommendation engine.

Why might this be and how do we understand it from first principles? Here's my little theory.

Let's think about this in terms of gestures, primary and secondary gestures. Users express interest in an item by various gestures. One of them is tagging an item, but prior to tagging an item is the act of focusing on an item and picking it out of the vast universe of items.
This primary selection process appears to be a far more powerful indication of interest than the secondary act of tagging or describing the already selected item. Hence, I hypothesize, a recommendation system based on user-related items is more crisp than one based on tag-related items.

The bigger picture here suggests that the user or people dimension in folksonomy is just as or more interesting than just the tag dimension. We need to look more deeply at the "folk" and not just the "..sonomy".

(This subject was discussed in a talk I gave at FooCamp, where some very smart people were present, including Hal Varian of Google, DeWitt Clinton ex of Amazon, Luke Lonergan CTO of Greenplum, Mary Hodder of Dabble, James Levine of SimplyHired, and Todd "the SEO Guy" .... who participated in a very energetic discussion and helped me refine these ideas. Thanks for that, guys.)

Sunday, October 02, 2005

Many dimensions of "related"ness in folksonomy

When one asks the question "What do we mean by 'related tags'?" the response usually is "here's how I do related tags ..." and a SQL query is presented. SQL is a perfectly adequate language for querying tabular data, but not a particularly useful one for representing the abstractions that we want to talk about in exploring "relatedness" between users, tags and items.

We want to evolve a simple notation for discussing user-item-tag "relatedness". Up to now such discussions have had to resort to SQL (an implementation notation) to describe a design. If we get away from the SQL representation and start from first principles we come up with the following :-

Let the letters i, t, and u represent a specific 'item, 'tag' and 'user' respectively.
Let the uppercase I, T, and U represent mappings (loosely) as follows :-

Let U(i) be all users of an item i.
(SQL: select u.* from users u, user_items ui where u.id = ui.userid and ui.itemid = someitemid )

Let U(t) be all users of a tag t.
(SQL: select u.* from users u, user_tags ut where u.id = ut.userid and ut.tagid = sometagid )

Similarly,

Let I(u) be all items of a user u.
Let I(t) be all items with a tag t.
Let T(i) be all tags of an item i.
Let T(u) be all tags of a user u.

So now T(U(t)) is the set of all tags of all the users of a single tag t.
SQL : select distinct t.* from tags t, user_tags ut where t.id = ut.tagid and ut.userid in (select userid from user_tags where tagid = sometagid)

It's clear from this notation that translating T(U(t)) into English - "the set of all tags of all the users of a single tag t" - is pretty straightforward. It is far easier than translating from the SQL. In fact it is hard to tell from the SQL what the design intent is even when we use the sub-select implementation. If we had used the self-join on the user_tags table it would have been even harder. Before all the SQL experts in the crowd start rolling their eyes saying "what's hard about that ...?? !!!" let me clarify that my point is - it is hard to use the SQL statement as an expression of a *design* intent, not that this query is intrinsically difficult.
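For anyone who wants to run T(U(t)) rather than read it, here is a small sketch using Python's built-in sqlite3 module. The table and column names follow the SQL fragments in this post where they exist (user_tags.userid, user_tags.tagid); the `tags.name` column, the sample rows and the function name are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_tags (userid INTEGER, tagid INTEGER);
    INSERT INTO tags VALUES (1, 'python'), (2, 'sql'), (3, 'cooking');
    -- user 10 uses python and sql; user 20 uses cooking only
    INSERT INTO user_tags VALUES (10, 1), (10, 2), (20, 3);
""")

def tags_of_users_of_tag(tagid):
    """T(U(t)): all tags of all users of the tag t."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.name
        FROM tags t
        JOIN user_tags ut ON ut.tagid = t.id
        WHERE ut.userid IN (SELECT userid FROM user_tags WHERE tagid = ?)
        ORDER BY t.name
        """,
        (tagid,),
    ).fetchall()
    return [name for (name,) in rows]
```

The function name reads almost like the notation, which is the point: the design intent lives in the name T(U(t)), and the SQL is only one way of implementing it.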

So, back to the discussion.

Note that U(t) is a set of users, i.e. a collection of u's. So the expression T(U(t)) makes sense and I(U(t)) makes sense but U(U(t)) makes no sense. U(U(t)) is supposedly the "set of users of the set of users of the tag t". We set U(U(t)) = U(t), i.e. more formally, we make U idempotent. Similarly for the other operators, i.e. T(T()) = T() and I(I()) = I().
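One way to make the idempotence rule executable is to let each operator check the kind of collection it is handed. This is only a sketch: the `Coll` class, the toy `USER_TAGS` facts and the restriction to users and tags (no items) are simplifying assumptions of mine.

```python
from dataclasses import dataclass

# Hypothetical (user, tag) facts, invented for illustration.
USER_TAGS = {("ann", "jazz"), ("ann", "blues"), ("bob", "jazz"), ("cy", "rock")}

@dataclass(frozen=True)
class Coll:
    """A typed collection: kind is "user" or "tag"."""
    kind: str
    members: frozenset

def U(c):
    """Users of a collection of tags; applied to users it is a no-op."""
    if c.kind == "user":
        return c  # U(U(x)) = U(x)
    return Coll("user", frozenset(u for u, t in USER_TAGS if t in c.members))

def T(c):
    """Tags of a collection of users; applied to tags it is a no-op."""
    if c.kind == "tag":
        return c  # T(T(x)) = T(x)
    return Coll("tag", frozenset(t for u, t in USER_TAGS if u in c.members))

jazz = Coll("tag", frozenset({"jazz"}))
```

With this in place `U(U(jazz))` equals `U(jazz)`, and `T(U(jazz))` yields the tags of everyone who used "jazz".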

So now with the notational formalities out of the way we can just turn the crank and churn out a bunch of "related" operations (and by extrapolating the earlier SQL we can get the SQL we need when we want to implement it). It is somewhat surprising and interesting that there are so many different ways to look at the "related" question.

Given an item i,

I(T(i)) the set of other items with the same tags as this item i.e. the set of "tag related" items for this item.

I(U(i)) the set of other items with the same users as this item. i.e. the set of "user related" items for this item.

Note that most websites using the words "tags", "folksonomies", "social" etc. in their description, mostly focus on tag related items. In a social tagging website I want to know about other items with my tags, (the tagging dimension) but I also want to know more about other users with my tags and my items (the user dimension). I want to know their tags and their items and quick and easy ways to browse these.

Simpy is one of those that present the user dimension explicitly in "Related Users", but the usual suspects "technoratideliciousflickr" don't have a direct way to navigate the user dimension of their site. By "direct way to navigate the user dimension" I mean the ability to go from my page directly to "related users", rather than have to do a two-step via tags or items. Perhaps, I should do another rant blog post about how "users get no respect".

Arranging by the innermost letter i.e. item, tag, user

items
-----
I(T(i)) tag related items
I(U(i)) user related items
U(T(i)) tag related users for i    - users who tagged this item in a similar manner - the T-cluster of users for item i
[Update: Jeremy Dunck points out that the correct interpretation of U(T(i)) is "the set of users who used any of the tags of item i" - it is not necessary that they used the tag in the same way.]
T(U(i)) user related tags for i    - tags of all users who have this item - the collective tag-wisdom about this item


tags
----
T(U(t)) user related tags
T(I(t)) item related tags
U(I(t)) item related users for t    - users who have items with this tag   - I-cluster of users for tag t
I(U(t)) user related items for t    - items of all users with this tag   - items you might find interesting if you have t

users
-----
U(T(u)) tag related users of user u
U(I(u)) item related users of user u
I(T(u)) all items with all tags of user u    - all items of all tags of user u - the T-cluster of items for user u
T(I(u)) all tags of all items of user u   - all tags of all items of user u - the I-cluster of tags for user u


As you'll notice I have used the somewhat abstract descriptions such as "T-cluster of items for user u" .... where I couldn't easily come up with a common language description. Perhaps the collective wisdom of the net can help put these in less abstract terms.

The above twelve are all possible operations when we compose the operators to the second degree, ie the composition has two members.

We could look at this collection rearranged differently

Arranging by the outer operation - i.e Set of tags T(), users U() or items I()

T(U(i)) all tags of all users that have an item i
T(U(t)) all tags of all users that have a tag t
T(I(u)) all tags of all items of user u
T(I(t)) all tags of all items that have tag t

U(T(i)) all users of all tags of item i
U(T(u)) all users of all tags of user u
U(I(t)) all users of all items of tag t
U(I(u)) all users of all items of user u

I(U(t)) all items of all users of tag t
I(U(i)) all items of all users of item i
I(T(u)) all items of all tags of user u
I(T(i)) all items of all tags of item i

If you compute these, filtering and tuning appropriately, you see that each of these gives different but potentially interesting results. Depending on what you are looking for, one or other of these may be useful.

What happens then when we go another degree deeper such as T(U(T(u))) ? Don't know yet, but once we get away from using SQL to talk about this ... well at least we can talk about it easily.
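The notation is regular enough that the expressions themselves can be generated. The sketch below (the function name `compositions` is my own) applies the idempotence rule by never applying an operator to a collection of its own kind, and enumerates every composition of a given degree.

```python
def compositions(depth):
    """List every operator composition of the given degree over t, u, i.

    An operator is never applied to a collection of its own kind,
    because U(U(x)) = U(x), T(T(x)) = T(x) and I(I(x)) = I(x).
    """
    # Each entry is (expression text, kind of collection it denotes).
    exprs = [(k, k.upper()) for k in "tui"]
    for _ in range(depth):
        exprs = [(f"{op}({text})", op)
                 for text, kind in exprs
                 for op in "TUI"
                 if op != kind]
    return [text for text, _ in exprs]
```

`compositions(2)` gives the twelve expressions listed above. Each extra degree doubles the count, so `compositions(3)` has twenty-four members, `T(U(T(u)))` among them. Whether those are useful is still an open question; this only enumerates them.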

Sunday, July 10, 2005

Tags get no respect

The lowly tag started the folksonomy revolution. One would have thought we'd be setting it up on a pedestal. From a 'buzz' point of view we do. 'Tagosphere this', 'Blogosphere that', ... buzz, buzz, buzzzzzz .....

But look under the covers. Are Tags first class objects? Is there a 'Tag' table along with the Item (photo, url ...) and User tables? Maybe. Maybe not. Some have them, some don't.

Why do we want a whole table for a tag which is just a 'string' attribute of an item?

It's a good question. It's tempting to 'get going' and avoid 'analysis paralysis' by just adding a 'tag' column to a table here and there in the schema. Or better still - a 'tags' (plural) column in which we stuff a 'comma separated' list. This creates a number of issues which are prevalent in contemporary tag applications.

One of the first and unobvious issues that arise out of 'dissing' tags in this way is that users are forced to use ' '(space) as a separator for tags. Why? Because comma is already used up in the taglist stuffed in a single column in the dark corner of a table somewhere.

This snowballs into the issue that I can't use a multi-word phrase as a tag. Unless I invent my own personal word separator, say '_'(underscore) or '/' or '+' etc. all of which are just fine from a single user point of view but suboptimal from a social point of view. Multi-word tags are pretty much invisible for social searching as I don't know and don't want to deal with everyone's separator.

People argue that we shouldn't enforce a convention - this is the wild and woolly world of folksonomy after all. But we aren't enforcing a convention - we want to use a 'comma' as a separator. This is already in use in natural language. Except that practitioners have abused the comma for a separator in a list of tags stuffed into a column. That in turn has arisen from not having, or seeing the fundamental need for, a Tag table. So now we are stuck with this highly hobbled form of tagging for no good reason.
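Once tags live in their own table, the input box is free to use the comma the way natural language does. A minimal Python sketch of such a parser follows. Collapsing runs of whitespace, lowercasing and dropping duplicates are my own normalisation assumptions, not a requirement.

```python
def parse_tags(raw):
    """Split user input on commas so multi-word tags survive intact."""
    tags = []
    for piece in raw.split(","):
        # Collapse internal whitespace and normalise case (an assumption).
        tag = " ".join(piece.split()).lower()
        if tag and tag not in tags:  # skip empties and duplicates
            tags.append(tag)
    return tags
```

For example, `parse_tags("San Francisco, golden gate bridge,  fog,")` keeps "san francisco" and "golden gate bridge" as single tags, with no private underscore or plus-sign convention needed.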

Folks, if you are creating a folksonomy application from scratch today and you don't have a separate Tag table then you are, by a series of unforeseen consequences, hobbling your users and reducing the ultimate effectiveness of your application.

If 'User centered design' is what it's all about then please, please deal with this issue.

No amount of AJAX and Ruby on Rails (I love both of them by the way) will solve the problem created by a suboptimal Tag data model, blocking users from using commas as separators AND preventing them from using multi-word tags.

(taking a deep breath ..... thinking of a happy place .... palm trees, pineapples, white sands, blue waves ..... relaxing all muscles one by one ..... Ohhhhh Kayyy... there we go ...)

Ok, let us assume for a second that this 'taglist-comma-blocking-forced-to-use-danged-space-separator' issue doesn't exist. There's a whole other set of reasons why a Tag table is needed. As we see in practice, tags have different uses - action/todo, content-type, annotation .... and a single opaque string is already being implicitly overloaded.

Tags also have a 'language of origin' - a tag in a single charset, say 'ISO Latin', may mean different things in different countries in Europe.

Finally, when tags are stuffed into columns as attributes, it's much harder to get an idea of the number of unique tags used. The same string is saved again and again; perhaps the unique occurrences are counted, maybe not. Here we have to create a compensating application structure to 'uniqueify' the tags and count the unique tag occurrences, if we do that at all.

So when a folksonomy application without a Tag table says '1 million tags' these are not 1 million unique tags. The number of unique tags is approximately a couple of orders of magnitude smaller. I am curious to know how reports on tag usage, growth etc. are created when there's no way to inherently track unique tags as first class objects in the database. There's probably a lot of summarisation, matching and sorting done in the application layer that's best done by the database. All of this is done on the tag and any tag-lists and tag-substructure that may have been created.
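Here is the kind of compensating application code I mean, sketched in Python over a made-up 'tags' column. The three rows are invented; the point is only the gap between tag applications and unique tags.

```python
from collections import Counter

# Hypothetical contents of a comma-stuffed 'tags' column, one string per item.
rows = ["jazz,blues", "jazz", "blues,rock,jazz"]

# The application layer has to unpack every string just to count anything.
counts = Counter(
    tag.strip()
    for row in rows
    for tag in row.split(",")
    if tag.strip()
)

total_applications = sum(counts.values())  # what "N tags" usually reports
unique_tags = len(counts)                  # what a Tag table would hold
```

In this toy case there are six tag applications but only three unique tags. With a Tag table the second number is a plain row count in the database.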

Adding fine structure to a string and then creating an API layer to pack and unpack this private structure creates something akin to a 'foreign growth' on the data model. It's a mini-database within a database with its own syntax, logic and query language. It's as if some part of the database has unilaterally declared independence and is setting sail on its own.

It separates data related knowledge into two disjoint areas, one, the 'mini-database + API' which holds 'type' and other information and meaning, and two, the underlying database itself. For large scale applications that are meant to grow fast and furiously, this asymmetric complexity creates an area in the design where 80% of the effort could possibly be needed in future to solve 20% of the problems.

So, after all this verbiage, what am I saying?

Quite simply, tags are first class entities in the folksonomy data model. They have a number of important attributes such as 'type', 'lang', 'count', 'created', 'lastused' etc. From a Data 2.0 approach, a Tag is a slowly-varying-dimension and Tags should be saved in a Tag dimension table. Aside from avoiding the kinds of problems mentioned, this model allows evolution in the role of the Tag in the folksonomy data model without creating massive ripples in the application structure.
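A minimal sketch of such a Tag dimension table, using SQLite through Python's sqlite3 module. The attribute names come from the list above; the column types, the `user_item_tag` association table and the `apply_tag` helper are assumptions for illustration, not a recommended production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE,   -- each tag string is stored once
        type      TEXT,                   -- e.g. action/todo, annotation
        lang      TEXT,                   -- language of origin
        "count"   INTEGER NOT NULL DEFAULT 0,
        created   TEXT DEFAULT CURRENT_TIMESTAMP,
        lastused  TEXT
    );
    -- Hypothetical association table: who put which tag on which item.
    CREATE TABLE user_item_tag (
        userid INTEGER, itemid INTEGER, tagid INTEGER REFERENCES tag(id)
    );
""")

def apply_tag(userid, itemid, name, lang="en"):
    """Record that a user tagged an item, creating the tag row if needed."""
    conn.execute("INSERT OR IGNORE INTO tag (name, lang) VALUES (?, ?)",
                 (name, lang))
    conn.execute('UPDATE tag SET "count" = "count" + 1, '
                 "lastused = CURRENT_TIMESTAMP WHERE name = ?", (name,))
    (tagid,) = conn.execute("SELECT id FROM tag WHERE name = ?",
                            (name,)).fetchone()
    conn.execute("INSERT INTO user_item_tag VALUES (?, ?, ?)",
                 (userid, itemid, tagid))

apply_tag(1, 100, "jazz")
apply_tag(2, 101, "jazz")
apply_tag(2, 101, "blues")
```

After these three calls the `tag` table holds two rows while `user_item_tag` holds three, so "how many unique tags?" is answered by the database itself rather than by application code.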

It's best to give tags the respect they deserve and put them in a table by themselves. The logical symmetry between User, Tag, Item IMO should be reflected in a symmetric underlying data model. If not, we have a fundamental impedance mismatch between the data and the application logic. And the mismatch is usually created by disrespecting the thing that started it all - the tag.

Thursday, June 09, 2005

All My Items Are Belong To Us

We preview now, an important topic to be dealt with in more detail later. This is about a vital element of building communities based on folksonomies - sharing items and making sets of items visible to groups of Users.

Users publish items (photos, URL's ...) and may wish to restrict visibility to a defined group of Users. So another set of questions that needs to be answered with queries to the folksonomy database, relate to visibility of items.

e.g.

* What items belonging to Jim are visible to me?
* Who are all the users who can see the specific item I published, and made visible to a number of groups? Or even more complicated,
* Given a set of items I have published what is the set of users that can see all of them? Any of them?

These kinds of queries relate Users in two different roles, publishers and viewers. So queries that answer these questions involve a special case of the SQL JOIN called a SELF-JOIN where a table is joined to itself. As mentioned before, with a User table with a million rows, we don't want to be doing these kinds of queries. Fortunately, the fact table approach we saw before is also very useful here. More diagrams coming but here's a preview.

Create a table PublisherItemViewers or, if you wish, VisibilityFacts. It has three columns: 'publisher', 'item' and 'viewer'. The 'publisher' column contains userids of Users who have published Items. The 'item' column contains itemids of Items published and the 'viewer' column contains userids of Users who are allowed to view this Item.

Thus each row (publisher,item,viewer) records an atomic fact about a user X having published an item Y which is viewable by another user, Z. For a given User and a given Item, it will possibly be visible to a number of Users. So there will be many such rows in VisibilityFacts with identical value in the first two columns, one row for each user that can see this item, with userid of viewer in the third column.

Now we can answer all the questions mentioned above and more, using simple SELECT statements on the VisibilityFacts table.
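As a preview of what those SELECT statements might look like, here is a sketch using SQLite through Python's sqlite3 module. The table and column names are the ones described above; the integer ids, sample rows and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE VisibilityFacts (publisher INTEGER, item INTEGER, viewer INTEGER);
    -- Jim (user 1) publishes item 100 to users 2 and 3, and item 101 to user 2.
    INSERT INTO VisibilityFacts VALUES (1, 100, 2), (1, 100, 3), (1, 101, 2);
""")

def items_visible(publisher, viewer):
    """What items belonging to a publisher are visible to a viewer?"""
    rows = conn.execute(
        "SELECT item FROM VisibilityFacts "
        "WHERE publisher = ? AND viewer = ? ORDER BY item",
        (publisher, viewer)).fetchall()
    return [r[0] for r in rows]

def viewers_of(item):
    """Who are all the users who can see a specific item?"""
    rows = conn.execute(
        "SELECT DISTINCT viewer FROM VisibilityFacts "
        "WHERE item = ? ORDER BY viewer", (item,)).fetchall()
    return [r[0] for r in rows]

def viewers_of_all(items):
    """Which users can see every item in the given list of distinct items?"""
    marks = ",".join("?" * len(items))
    rows = conn.execute(
        f"SELECT viewer FROM VisibilityFacts WHERE item IN ({marks}) "
        "GROUP BY viewer HAVING COUNT(DISTINCT item) = ? ORDER BY viewer",
        (*items, len(items))).fetchall()
    return [r[0] for r in rows]
```

None of these queries joins the User table to itself: each one is a filter, or a filter plus a GROUP BY, on the single fact table.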

A quick numerical calculation. Given 1 million users with, say, 50 items each, with each item published to an avg. of 20 users, we have 1 million x 50 x 20 = 1 billion rows in this table.

Another Data 2.0 scalability challenge.