Sunday, October 02, 2005

Many dimensions of "related"ness in folksonomy

When one asks "What do we mean by 'related tags'?", the usual response is "here's how I do related tags", followed by a SQL query. SQL is a perfectly adequate language for querying tabular data, but not a particularly useful one for representing the abstractions that we want to talk about when exploring "relatedness" between users, tags and items.

We want to evolve a simple notation for discussing user-item-tag "relatedness". Up to now such discussions have had to resort to SQL (an implementation notation) to describe a design. If we get away from the SQL representation and start from first principles we come up with the following :-

Let the letters i, t, and u represent a specific 'item', 'tag' and 'user' respectively.
Let the uppercase I, T, and U represent mappings (loosely) as follows :-

Let U(i) be all users of an item i.
(SQL: select u.* from users u, user_items ui where u.id = ui.userid and ui.itemid = someitemid )

Let U(t) be all users of a tag t.
(SQL: select u.* from users u, user_tags ut where u.id = ut.userid and ut.tagid = sometagid )

Similarly,

Let I(u) be all items of a user u.
Let I(t) be all items with a tag t.
Let T(i) be all tags of an item i.
Let T(u) be all tags of a user u.
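
The six mappings can also be tried out directly on sets. This is only a minimal Python sketch: it assumes all the data lives in one set of (user, item, tag) triples, which is a simplification of the user_items and user_tags tables in the SQL above, and the sample names and the select helper are made up for illustration.

```python
from collections import namedtuple

# Hypothetical sample data: one row per "user tagged item with tag" event.
Tagging = namedtuple("Tagging", ["user", "item", "tag"])

TAGGINGS = {
    Tagging("alice", "page1", "python"),
    Tagging("alice", "page2", "sql"),
    Tagging("bob", "page1", "python"),
    Tagging("bob", "page3", "tagging"),
    Tagging("carol", "page3", "sql"),
}


def select(field, **where):
    """Return the set of `field` values over all rows matching `where`."""
    return {
        getattr(row, field)
        for row in TAGGINGS
        if all(getattr(row, key) == value for key, value in where.items())
    }


def U(**where):
    """U(i) or U(t): all users of an item or of a tag."""
    return select("user", **where)


def I(**where):
    """I(u) or I(t): all items of a user or with a tag."""
    return select("item", **where)


def T(**where):
    """T(i) or T(u): all tags of an item or of a user."""
    return select("tag", **where)


print(sorted(U(item="page1")))  # ['alice', 'bob']
print(sorted(T(user="bob")))    # ['python', 'tagging']
```

The keyword argument (item=, tag=, user=) plays the role of the lowercase letter in the notation: it says what kind of thing the operator is being applied to.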

So now T(U(t)) is the set of all tags of all the users of a single tag t.
SQL : select distinct t.* from tags t, user_tags ut where t.id = ut.tagid and ut.userid in (select userid from user_tags where tagid = sometagid)
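
The T(U(t)) query can be tried against a throwaway SQLite database. The Python sketch below assumes a tags table with id and name columns and a user_tags table with userid and tagid columns (only fragments of the schema appear here, so the column names and sample rows are guesses), and it spells out the join between tags and user_tags.

```python
import sqlite3

# In-memory database with a hypothetical minimal schema and sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_tags (userid INTEGER, tagid INTEGER);
    INSERT INTO tags VALUES (1, 'python'), (2, 'sql'), (3, 'tagging');
    -- user 1 uses python and sql; user 2 uses python and tagging; user 3 uses sql
    INSERT INTO user_tags VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 2);
    """
)


def tags_of_users_of_tag(tag_id):
    """T(U(t)): names of all tags used by any user who used tag `tag_id`."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.name
        FROM tags t, user_tags ut
        WHERE t.id = ut.tagid
          AND ut.userid IN (SELECT userid FROM user_tags WHERE tagid = ?)
        """,
        (tag_id,),
    )
    return {name for (name,) in rows}


print(sorted(tags_of_users_of_tag(3)))  # ['python', 'tagging']
```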

It's clear from this notation that translating T(U(t)) into English, "the set of all tags of all the users of a single tag t", is pretty straightforward. It is far easier than translating from the SQL. In fact it is hard to tell from the SQL what the design intent is, even when we use the sub-select implementation. If we had used the self-join on the user_tags table it would have been even harder. Before all the SQL experts in the crowd start rolling their eyes saying "what's hard about that ...?? !!!" let me clarify that my point is that it is hard to use the SQL statement as an expression of a *design* intent, not that this query is intrinsically difficult.

So, back to the discussion.

Note that U(t) is a set of users, i.e. a collection of u's. So the expressions T(U(t)) and I(U(t)) make sense, but U(U(t)) does not: it is supposedly the "set of users of the set of users of the tag t". We therefore set U(U(t)) = U(t), i.e. more formally, we make U idempotent. Similarly for the other operators, i.e. T(T()) = T() and I(I()) = I().
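
Here is one way this could look in Python, as a sketch rather than a definitive implementation. It assumes every entity is written as a (kind, name) pair so an operator can tell users, items and tags apart, and that all data is a set of (user, item, tag) triples; the sample data and the make_operator helper are hypothetical. An operator applied to a set maps each member and takes the union; members already of the operator's own kind pass through unchanged, which is what gives U(U(t)) = U(t).

```python
# Hypothetical sample data: (user, item, tag) triples.
TAGGINGS = {
    ("alice", "page1", "python"),
    ("alice", "page2", "sql"),
    ("bob", "page1", "python"),
    ("bob", "page3", "tagging"),
    ("carol", "page3", "sql"),
}

# Position of each kind of entity inside a triple.
FIELDS = {"user": 0, "item": 1, "tag": 2}


def make_operator(out_kind):
    """Build U, I or T as a function over (kind, name) entities or sets of them."""

    def op(entities):
        if isinstance(entities, tuple):  # a single (kind, name) entity
            entities = {entities}
        result = set()
        for kind, name in entities:
            if kind == out_kind:
                # Already the right kind: pass through, so op(op(x)) == op(x).
                result.add((kind, name))
            else:
                result |= {
                    (out_kind, row[FIELDS[out_kind]])
                    for row in TAGGINGS
                    if row[FIELDS[kind]] == name
                }
        return result

    return op


U, I, T = make_operator("user"), make_operator("item"), make_operator("tag")

t = ("tag", "sql")
print(sorted(T(U(t))))    # [('tag', 'python'), ('tag', 'sql')]
print(U(U(t)) == U(t))    # True
```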

So now with the notational formalities out of the way we can just turn the crank and churn out a bunch of "related" operations (and by extrapolating the earlier SQL we can get the SQL we need when we want to implement it). It is somewhat surprising and interesting that there are so many different ways to look at the "related" question.

Given an item i,

I(T(i)) the set of other items with the same tags as this item i.e. the set of "tag related" items for this item.

I(U(i)) the set of other items with the same users as this item. i.e. the set of "user related" items for this item.

Note that most websites using the words "tags", "folksonomies", "social" etc. in their description focus on tag related items. In a social tagging website I want to know about other items with my tags (the tagging dimension), but I also want to know more about other users with my tags and my items (the user dimension). I want to know their tags and their items, and I want quick and easy ways to browse these.

Simpy is one of those that present the user dimension explicitly in "Related Users", but the usual suspects "technoratideliciousflickr" don't have a direct way to navigate the user dimension of their site. By "direct way to navigate the user dimension" I mean the ability to go from my page directly to "related users", rather than have to do a two-step via tags or items. Perhaps, I should do another rant blog post about how "users get no respect".

Arranging by the innermost letter i.e. item, tag, user

items
-----
I(T(i)) tag related items
I(U(i)) user related items
U(T(i)) tag related users for i    - users who tagged this item in a similar manner - the T-cluster of users for item i
[Update: Jeremy Dunck points out that the correct interpretation of U(T(i)) is "the set of users who used any of the tags of item i" - it is not necessary that they used the tag in the same way.]
T(U(i)) user related tags for i    - tags of all users who have this item - the collective tag-wisdom about this item


tags
----
T(U(t)) user related tags
T(I(t)) item related tags
U(I(t)) item related users for t    - users who have items with this tag   - I-cluster of users for tag t
I(U(t)) user related items for t    - items of all users with this tag   - items you might find interesting if you have t

users
-----
U(T(u)) tag related users of user u
U(I(u)) item related users of user u
I(T(u)) all items carrying any tag of user u    - the T-cluster of items for user u
T(I(u)) all tags of all items of user u   - the I-cluster of tags for user u


As you'll notice I have used the somewhat abstract descriptions such as "T-cluster of items for user u" .... where I couldn't easily come up with a common language description. Perhaps the collective wisdom of the net can help put these in less abstract terms.

The above twelve are all the possible operations when we compose the operators to the second degree, i.e. when the composition has two members.
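
The count of twelve can be checked mechanically. The short Python sketch below enumerates every outer operator, inner operator and argument letter, then drops the combinations the notation rules out: X(X(..)) collapses by idempotence, and an operator applied to its own kind of thing, such as U(u), is not one of the six defined mappings. The function name is made up for illustration.

```python
from itertools import product

OPERATORS = "ITU"   # items, tags, users as mappings
ARGUMENTS = "itu"   # a specific item, tag or user


def second_degree_compositions():
    """List every meaningful expression of the form Outer(Inner(arg))."""
    expressions = []
    for outer, inner, arg in product(OPERATORS, OPERATORS, ARGUMENTS):
        if outer == inner:
            continue  # X(X(..)) = X(..), so this is not second degree
        if inner.lower() == arg:
            continue  # X(x), e.g. U(u), is not a defined mapping
        expressions.append(f"{outer}({inner}({arg}))")
    return expressions


compositions = second_degree_compositions()
print(len(compositions))  # 12
print(compositions)
```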

We could look at this collection rearranged differently

Arranging by the outer operation, i.e. the set of tags T(), users U() or items I()

T(U(i)) all tags of all users that have an item i
T(U(t)) all tags of all users that have a tag t
T(I(u)) all tags of all items of user u
T(I(t)) all tags of all items that have tag t

U(T(i)) all users of all tags of item i
U(T(u)) all users of all tags of user u
U(I(t)) all users of all items of tag t
U(I(u)) all users of all items of user u

I(U(t)) all items of all users of tag t
I(U(i)) all items of all users of item i
I(T(u)) all items of all tags of user u
I(T(i)) all items of all tags of item i

If you compute these, filtering and tuning appropriately, you see that each of these gives different but potentially interesting results. Depending on what you are looking for, one or another of these may be useful.
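
As one possible reading of "filtering and tuning", here is a Python sketch for I(T(i)): drop the starting item itself and rank the remaining items by how many tags they share with it. The ranking rule and the sample data are assumptions for illustration only; the notation itself does not prescribe any weighting.

```python
from collections import Counter

# Hypothetical sample data: (user, item, tag) triples.
TAGGINGS = {
    ("alice", "page1", "python"),
    ("alice", "page1", "sql"),
    ("bob", "page2", "python"),
    ("bob", "page2", "sql"),
    ("carol", "page3", "sql"),
    ("carol", "page4", "tagging"),
}


def tag_related_items(i):
    """I(T(i)) without i itself, ranked by number of tags shared with i."""
    tags_of_i = {tag for (_user, item, tag) in TAGGINGS if item == i}
    # Count each (item, tag) pair once, however many users applied it.
    item_tag_pairs = {(item, tag) for (_user, item, tag) in TAGGINGS}
    shared = Counter()
    for item, tag in item_tag_pairs:
        if item != i and tag in tags_of_i:
            shared[item] += 1
    return shared.most_common()


print(tag_related_items("page1"))  # [('page2', 2), ('page3', 1)]
```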

What happens then when we go another degree deeper such as T(U(T(u))) ? Don't know yet, but once we get away from using SQL to talk about this ... well at least we can talk about it easily.

5 Comments:

Blogger suttree said...

Great article. Navigating users is something that we looked closely at with Millionsofgames.com as the best thing about del.icio.us, in my experience, is finding other users, browsing their inboxes, seeing what they like.

Since MOG is a folksonomy of casual games, we had an easy way to make navigating users just as much fun - we made a game out of it. Users are automatically ranked once they start adding games to MOG and we have a score table that shows how well you're doing, as well as a few other bits of relevant user information.

4:22 AM  
Blogger Jeremy Dunck said...

"
U(T(i)) tag related users for i - users who tagged this item in a similar manner - the T-cluster of users for item i
"
According to your previous definition of the notation, U(T(i)) would be the users which have used any of the tags on i, unless I misunderstood. This does not imply that those users used that tag on i, or even tagged i at all.

I think you'd need to define a different function at that point. UT(i, t), users who tagged i with t, and similarly TU(i, t) all tags of users who tagged i with t.

Similarly for TI, IT, IU, UI.

And, of course, you can be a better judge of relevance by weighting edges (by count, by average, etc).

11:40 AM  
Blogger Nitin said...

Jeremy you're absolutely correct, I rushed the interpretation of that combination. So I will be fixing that.

I agree on the need for the new functions, but am not sure the two-caps name is the right way to go. So stay tuned on that.

12:16 PM  
Blogger Danny said...

Hi Nitin, I don't think Blogger supports trackback, so please consider this a manual ping:
http://dannyayers.com/archives/2005/10/04/tag-relation-notation/

3:06 AM  
Anonymous Bill Ward said...

Jeremy has a good point but I don't think the two-letter function name is right. I would say that U(i,t) would return the list of all users who tagged i with t. I think his other example, TU(i,t) would be better written T(U(i,t))

But clearly, there are 2-input queries that can be made, such as T(i,u) and I(u,t).

12:32 PM  
