Tuesday, June 17, 2008

Robert Scoble and The Pied-Piper Effect

I've noticed some interesting happenings on Twitter recently that have inspired me to coin a term "The Pied Piper Effect". The setup is as follows.

A user on a social network starts to accrete a large number of followers to the extent that this user becomes the most followed user on the network. This user is popular for one of a number of reasons -

a) They have something useful to say about the world outside the web.
b) They have something useful to say about the web.
c) The number of messages that they send attracts attention and people are drawn just to see what the noise is all about.
d) The number of messages that they send about themselves being on the web talking about the web or about other people talking about the web draws people even more to see why that might even be faintly interesting.

On Twitter Robert Scoble falls into category d) but he is not the only member of the category. This category of user, at the head of the power law distribution, becomes at some point the test of the infrastructure of the service. Some services like Facebook (5000 followers max), LinkedIn (500 connections max) limit the number of links to other users. Some services (Twitter, FriendFeed...) don't. It is the latter that provide fertile ground for Pied Piper behavior.

A Scoble-like user accretes a large number of users, stresses the network talking mainly about themselves talking about themselves - "come see me I am livecasting myself now and I will be talking about my next livecasting event". And then when the network begins to show signs of stress the Scoble-like user threatens to move, or actually moves, the focus of their attention to a new and as yet not fully saturated network.

The hypnotized children follow.

The Pied Piper does this because the denizens of the previous town refused to "pay the piper". The children follow because they are afraid they might miss a note or a beat and then, oh my god the horrors. And all this will happen to the next town at some point and the next.

But the more important question is - how many of these newly created social networks get populated in the first place just because a Scoble-like user happened to pass by the town with the kids in tow and happened to stop by? To massively mix metaphors, is there a Pied-Piper-Pollination effect in play?

And then there's the much bigger question about whether the tune that the Piper is playing is even music at all, and whether the price is worth it - but that's a whole other story.

Monday, May 19, 2008

It's all about Tupperware(tm)

On Sunday, I spent an afternoon talking with Om Malik of GigaOm, Matt Mullenweg of WordPress and Stanislav Shalunov of shlang about social data and Facebook and monetization of data and Google and data ownership and data portability.

You know, nothing topical ;-).

One of the insights for me, from this discussion, was that monetizing Facebook social connections, i.e. spamming your friends to sell them something, is a model that's at least 40 years old. In the past it was the "tupperware party" model.

The model behind "tupperware parties" was that friends were invited over to a party and then they all admired the Tupperware(tm) - which was plastic kitchenware. You focused on getting orders before the party was over, sold stuff to your friends and bootstrapped your business that way, hoping to convert some of them to Tupperware distributors.

This was in the era when plastic kitchen stuff was considered cool, because plastic anything was novel. Let that sink in. That's how old this model is.

But the huge difference there was you made real money by selling real stuff.


To quote:-

"This plan has been used primarily to sell items whose main appeal is to women, such as Tupperware itself (a food-storage system), kitchen utensils, home decor items, jewelry, skincare, cosmetics, and similar products; recent additions to the field include lingerie, sex toys and Landmark Education."

With the so-called social network ad model, you sell stuff to your friends (oh ok, "get them to install apps"). You piss them off and then you make .... zip, nada, bupkus.

So how long before people catch on that this isn't really about being "social", which means actually connecting with people in a real way? This is like being in a big tent for hosting Tupperware parties. You invite people to meet you at this tent because that's where all the cool people are going to have their parties. And you're going because you are with someone who is following someone else who says it's cool to go to this party.

[According to Stanislav, "the core demographic on Facebook is young women and the guys are there because the women are there." ]

And the women and the guys all come over and then after the party's over, prematurely, they get solicited for Tupperware, relentlessly, endlessly, crassly.

And eventually people in the tent start muttering to each other that this is not what they expected when they first got there. And that it was all so much fun meeting new people and connecting with old friends until it all became about Tupperware. And they all move on to the next tent where it's about, oh does it matter ... Chipperware, or something equally ludicrous and plastic.

So how long before the muttering starts? I'll give it two years max. Or else the tent folds up because not enough Tupperware gets sold and they can't pay the rent, and the people who bought the Tupperware to sell it are now stuck with ... well, plastic social stuff that no one considers cool any more.

I know I am probably in the minority but have you been to a Tupperware party - a real one - lately? Make that ever? Don't you think if selling stuff at social events was something that people loved, it would have been around for a little longer than a few years?

And just because it's over the web somehow people are supposed to develop a taste for buying plastic at parties?

Perhaps the wizards behind the curtain are forgetting that the entities on the other side of the Innertube are real people and that people move on from fake stuff and long for real stuff.

Ok, so call me a curmudgeon, I've been called worse. Just don't sell me stuff pretending to make it a party. After all wasn't this social media thing supposed to be about "the conversation" in the first place?

.. because you know maybe people prefer organic social stuff, that doesn't abuse the social, you know, environment - let's call it "green social". Something that people can grow in their neighborhoods and you can actually talk in person to the farmer at the farmers market. And what's the equivalent of that for social networks?

Once the Tupperware party in the sky is over, let's have a real conversation where no one is selling anything. Like Om said at the end of the post where he announced the mini-meetup on Sunday - "I will buy coffee and cakes, but please don’t pitch me your company. I want some honesty about this topic."

It was a refreshing conversation and I'd like to have more of those where people aren't selling plastic social.

Thursday, May 08, 2008

Guten Tag. Und Willkommen in Tag Schema

This blog was originally created to discuss database designs underlying the emerging tagging or folksonomy applications (Flickr, del.icio.us ....) at the leading edge of Web 2.0. Over the next year or so the frenetic activity in that area stabilised and the focus moved to the so-called 'social network' applications. While those were not the subject of the original folksonomy discussions, they underlined some important and related database issues.

One such issue is that of the database schemas induced by the need to model 'friend' relationships. As it turns out these are just as interesting as the folksonomy schemas. In parallel, other issues emerged - "data in the cloud" and "data portability". The former is about the move away from centralized, relational databases based on SQL and the latter is about data ownership issues created by having personal digital assets distributed all over the Internet captive inside Web 2.0 applications.

So the current trends on the net have major implications for underlying data structure. As the application architectures change and as new disruptive ones emerge, the underlying data layers experience corresponding tectonic shifts. And it is the new agenda of this blog to track all these data related issues as they emerge. This is much broader than the original narrow issue of folksonomy database design but in retrospect it is a natural evolution. The only problem that remains is the name 'tagschema' which seems so narrowly focused on tagging database schemas.

Luckily at a recent MySQL event the solution emerged in a conversation with Kaj Arno of MySQL, now Sun. "Ahhh ..." he said, looking at the word 'tagschema' on my name tag. "That sounds like a German daily TV show, 'Guten Tag und willkommen in Tag Schema' - that means 'Good day and welcome to Tag Schema - schema of the day'". I thought nothing more of it, but the phrase 'Guten Tag und willkommen in Tag Schema' kept playing in my mind. Later, when the question of a name for the new blog came back to mind, I realized that Tag Schema could mean 'schema of the day' or 'current schema' or 'current trends in schema' or, more generally, 'discussions of underlying structure of the day', which is where we will be going with this blog: exploring new data structures and technologies as they emerge.

So thanks very much Kaj Arno for that moment of zen serendipity.

And so 'Guten Tag. Und Willkommen in Tag Schema'

How many times do I have to tell you? ........ Don’t …. Repeat ….. Yourself.

It is 2008 – do you know where your avatar is? I don’t. I have copied it into so many different web apps I have no idea where it’s been. And I just got a request to update my address book from yet another address book provider with an address of mine copied 5 years ago.

The social web has become a giant cookie monster – growling “Gimme Copy, Gimme Copy, yeah yeah yeah … mmmmmm Copy”

The DRY (Don’t Repeat Yourself) principle has been touted by designers of Rails and other modern web app frameworks. It is ironic then that these frameworks have been used to build a whole generation (Web 2.0) of apps that force the user to make copies of data again and again into each web app. In the data world the DRY principle reads "Don't Make Copies" (DoMaCo).

How many times do I have to tell you – Don’t …. Repeat ….. Yourself. Dear web app builders – you created the Internet Copy Monster – you need to help stamp it out. But how? Read on.



Fig 1. Yesterday's web app architecture – forced violation of DRY principle at the data level.

Most Web 2.0 apps do not expose REST URIs for every data element. This means the user can’t access their data freely, which in turn means they can’t reuse their data in other web apps. The real added value of a successful social web app lies in the community interactions and the UI elegance that enables community interactions, not in the data management layer.

For example the popularity of a service such as Flickr is primarily due to their innovations around tagging and “interestingness” and the very active community, not because of their massive data storage facilities or their disk farms, which are a cost center.

That means Flickr would still be Flickr even if the data layer in Fig 1 were not owned by Flickr. Think about that for a minute and apply that across the social web. Note also that Flickr allows you to embed Flickr photos into other applications – so Flickr photos can become definitive instances of your photo data. Flickr exposes data pointers – in effect they have become a next generation Internet data layer for photos.

This leads to the possibility of a general purpose data layer - not part of the web app but part of the Internet infrastructure - a data layer which contains user digital assets and all social data. This would be provided by a new class of service provider, the “data service provider”, who would give you URIs to all your content, give you full control of and access to your data, and would be a for-fee service.

Web apps would only point to data in this data layer and not be part of the huge Internet Copy Monster.

Now consider tomorrow’s web application architecture which is already in place in parts. I call this the Yinas approach – YINAS being a recursive acronym for Yinas Is Not A Silo.



Fig 2 Yinas. A web app architecture that doesn’t violate the DRY principle at the data level and respects user data rights.

This may seem like it needs a massive redesign of all web apps, but it doesn’t. It would just require a uniform approach to data embedding in web apps, most of which is already in place. The needed work is already done for most content except text, avatars and structured content such as address books. We already embed photos, video, and audio via URIs to remote content hosting services. We just need to extend this uniformly to all content types, not just image, audio and video, and we need to use it as a pervasive design principle across the web.

In summary – let’s recognize “Don’t Make Copies” as a useful design principle for web app data and let’s consume pointers instead of copies.
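To make "pointers instead of copies" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `DataLayer` and `WebApp` classes, the in-memory dict standing in for a data service provider, and the made-up URI are illustrations only, not a description of any existing service.

```python
# Hypothetical sketch: a shared data layer keyed by URI, and web apps that
# store only the URI (a pointer) rather than a copy of the content.

class DataLayer:
    """Stands in for a 'data service provider'; just an in-memory dict here."""

    def __init__(self):
        self._store = {}

    def put(self, uri, content):
        self._store[uri] = content

    def get(self, uri):
        return self._store[uri]


class WebApp:
    """A web app that keeps pointers to user data, never the data itself."""

    def __init__(self, name, data_layer):
        self.name = name
        self.data_layer = data_layer
        self.avatar_uri = {}  # user -> URI of the avatar in the data layer

    def set_avatar(self, user, uri):
        self.avatar_uri[user] = uri

    def render_avatar(self, user):
        # Dereference the pointer at display time.
        return self.data_layer.get(self.avatar_uri[user])


layer = DataLayer()
uri = "https://data.example.com/alice/avatar"  # made-up URI
layer.put(uri, "avatar-v1")

photo_app = WebApp("photos", layer)
chat_app = WebApp("chat", layer)
photo_app.set_avatar("alice", uri)
chat_app.set_avatar("alice", uri)

# One update in the data layer is visible in every app; nothing to re-copy.
layer.put(uri, "avatar-v2")
```

Because both apps hold only the URI, the single update at the end is what each of them shows the next time it renders the avatar.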

Let's stamp out the Internet Copy Monster. Let's stamp out unnecessary repetition. Shall we? Shall we?

P.S.

And please forward a permalink to your friends, not a copy ;-)

Friday, February 01, 2008

Why Data Portability is a non-solution to a non-problem

I have written a draft on Backpack
Note: As of Feb 6 2008 the draft is now a post on GigaOm

Please leave comments there - comments on this post here are now closed.

Wednesday, July 25, 2007

Some thoughts on data rights

Tim O'Reilly was talking about data and data access in his keynote at Oscon2007 today (Wed Jul 25th 2007). I thought I'd post some thoughts I have been chewing on for a while, even while I am still in the keynote. These issues have technical and philosophical implications. They are not about tags per se but do apply very strongly to data currently captive in contemporary folksonomy applications as well as other Web 2.0 applications. Comments and criticism invited.

A manifesto for data rights in a globally networked world

(Draft 1 Jul 25th 2007) (cc) Published under Creative Commons "Attribution No Derivatives" Licence

We consider the following to be axiomatic and universal

  1. Data is a first class citizen of the network.

  2. Data must not be held captive in an application or locked in proprietary application-specific file formats.

  3. Data must be readable and exportable directly, programmatically, completely without restriction and stored in open, non-proprietary formats.

    1. Programmatic data access must allow FULL export and read capability independent of what the human UI allows
    2. Arbitrary restrictions must not be placed on data access by the application controlling the data, whether due to unintentional limitations of the application architecture or due to intentional design.


  4. Every unit of data must be independently addressable via a URI

    1. On the Internet, data should be accessible via REST based architectures


  5. Every unit of data must be capable of having an associated access policy, separately from other such units of data

    1. Each data unit must be able to have a possibly different access control policy
    2. The default access control policy of a data unit created by an individual must be "private"
    3. Policy change must be under the free control of the individual.
    4. Policy change must be under the control only of the individual.


  6. Data is property. Hence data access and ownership must be subject to rights strongly similar to or identical to physical property rights.

    1. No application, service, organization or other entity may require data exposure or implicit surrender of data ownership as a price of use or access to some facility

    2. Data exposure must be separately negotiated and be freely negotiable without coercion, according to the needs of the individual.
    3. "Website shrink wrapped licenses" are not considered to be a meaningful negotiation in this context.

    4. Data about an individual belongs to that individual and only to that individual, who may choose to share the data subject to their needs and no one else's

    5. Data does not belong to the incidental keepers of data representations (internet service providers, medical service providers, financial service providers, state and federal govt agencies)
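
As a rough illustration of point 5 (each unit of data carrying its own access policy, private by default and changeable only by the individual), here is a small Python sketch. The `DataUnit` class, its method names and the example URIs are all assumptions made for illustration; the manifesto itself does not prescribe any particular implementation.

```python
# Hypothetical sketch of point 5: each data unit has its own URI and its own
# access policy, private by default, changeable only by its owner.

class DataUnit:
    def __init__(self, uri, owner, content):
        self.uri = uri
        self.owner = owner
        self.content = content
        self.readers = set()  # empty set means "private" (5.2)

    def grant(self, actor, reader):
        # Only the owner may change the policy (5.3, 5.4).
        if actor != self.owner:
            raise PermissionError("only the owner may change the policy")
        self.readers.add(reader)

    def read(self, actor):
        if actor == self.owner or actor in self.readers:
            return self.content
        raise PermissionError("access denied")


photo = DataUnit("https://data.example.com/alice/photo/1", "alice", "beach.jpg")
address = DataUnit("https://data.example.com/alice/address", "alice", "12 Main St")

# Alice shares the photo with Bob; the address keeps its own (private) policy.
photo.grant("alice", "bob")
```

Here Bob can read the photo but not the address, and Bob cannot widen access to either unit himself.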

Sunday, October 01, 2006

Putting the "folk" back in folksonomy

Or ... The fat belly and recommendation systems



Since the beginning of Web 2.0 time, "folksonomy" has been synonymous with tagging. It's time to fill out the picture. As readers of this blog know, folksonomy involves tags, tagged-items, and tagger-users. This post digs deeper re: the role of users in the "holy trinity" of user-tag-item. And examines the relationship of users to recommendation systems, ... and to the "fat belly".

Yes, that does sound like a whole lot of ground to cover but

a) I have been gone for a while so need to catch up in a hurry - what can I say?
b) It's not that much ground to cover when we see the interesting relationships
c) The notation described in the previous post makes it possible to cover a lot of ground without too much verbiage.

So without further ado, here goes.

In a typical folksonomy system we have users attaching tags to items. As the system evolves we have, given an item 'i', the sets: -
T(i), the tags associated with i and
U(i) the users who use the item i.

Typical folksonomy apps have focused on navigating the various relationships with a focus on T(i). Recommendation systems that suggest 'related items' are also most often based on T(i), as follows. Given an item we find all tag related items via I(T(i)). Then we use some algorithm to trim this down to the "best" 5 or 10 by some definition of "best". Then we use these as recommendations. Given a user of item i, these are the recommended other items, or 'related items' based on tags.

For the rest of this discussion, we denote this set of recommendations as Rt(i), i.e. given an item i, the recommended other items based on tags.

Consider now, the other way to get related items, i.e. user-related items.
This is the famous "users who bought this item also bought ...." approach that we know and love.

Given an item i we get U(i) all the users of i, and then I(U(i)), all the items used by those users. Again we use some way to trim this down to the best 5 to 10 or so and recommend these. Given a user of item i, these are the recommended other items, based on users.
We denote this set of recommendations as Ru(i), i.e. the recommendations based on users of item i.
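
The two raw candidate sets, I(T(i)) and I(U(i)), can be sketched in a few lines of Python. The data and the function names below are made up for illustration, and the trimming step that turns these raw sets into Rt(i) and Ru(i) is deliberately left out.

```python
# Toy tagging events: (user, item, tag). All data here is made up.
events = [
    ("ann", "podcast_a", "design"),
    ("ann", "podcast_b", "furniture"),
    ("bob", "podcast_a", "tech"),
    ("bob", "podcast_c", "tech"),
    ("cat", "podcast_d", "design"),
]

def T(i):
    """Tags attached to item i."""
    return {t for (_, item, t) in events if item == i}

def U(i):
    """Users who use item i."""
    return {u for (u, item, _) in events if item == i}

def tag_related_items(i):
    """I(T(i)): other items sharing at least one tag with i."""
    tags = T(i)
    return {item for (_, item, t) in events if t in tags} - {i}

def user_related_items(i):
    """I(U(i)): other items used by the users of i."""
    users = U(i)
    return {item for (u, item, _) in events if u in users} - {i}
```

Even on this tiny data set the two routes give different candidates for `podcast_a`: the tag route reaches the other items tagged "design" or "tech", while the user route reaches whatever else ann and bob use.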

Now comes the interesting part derived from work done at Odeo and Greenplum over the last year or so. Experiments suggest the following two major results, which need much more qualification by further work and study. This is only an indicator of interesting areas for research, not a formal proof of anything.

a) Empirical results suggest that for even a small set of users Ru(i) gives better recommendations than Rt(i), i.e. using user-related items gives better recommendations than using tag-related items.

b) Empirical results suggest that the "algorithm" we use to go from I(U(i)) to Ru(i) makes a lot of difference to the relevance and 'interestingness' of recommendations.

Ok, b) was really cryptic so we'll take the rest of this post to unpack it into useful results and pretty pictures.

Step by step,

I(U(i)) is the raw set of user related items for item i (people who bought item i also bought a whole ton of other shtuff, namely I(U(i)) )

But that is too huge a set to use as recommendations - it could have anywhere from tens to tens of thousands of items depending on what data we are operating on. So we need to trim this down with a filter that filters out and keeps the best recommendations.

So I(U(i)) ---> Filter ---> Ru(i) i.e. after filtering the raw set of user-related items we get user-related recommendations.

Now we need to decide how to filter. Let's do the simple thing first.

First we sort the collection I(U(i)) by count, i.e. how many times does some item turn up in this collection.

The temptation is to take the top 10 items by count and use these as the recommendation. This is what I did in practice and found that the recommendations that are generated are only mildly customized, i.e. they are interesting in general but not necessarily interesting to me. Most of the time they are almost identical to the "most popular" items on the front page.

Why is this?

Because I *took* the most popular ones by count, I sampled the head of the distribution and didn't get anything new.

So then I decided to go the other way - I looked at the lower end of the counts and picked reco's from there, i.e. the proverbial "long tail". Now I got some strange and freaky recommendations - if you had subscribed to the Catholic podcast on Odeo you would have been recommended the Open Source Sex podcast. Not quite what we have in mind, when we say "recommendations".

This led me by accident to explore the remaining area of the range of counts, the middle, named the "fat belly" by Robert Young in a recent post on GigaOm.

Here is where things got very, very interesting in the recommendations generated. For example,
Evan Williams who has an interest in modern furniture got a recommendation for a podcast related to furniture although none of his current subscriptions had anything to do with furniture!

This was very exciting and stimulated further exploration which confirmed that the best recommendations came from the fat belly.


So

I(U(i)) ----> Sort by count, filter from the head ----> "popular (i.e. obvious) "

I(U(i)) ----> Sort by count, filter from the long tail ----> "freaky (i.e. too different)"

I(U(i)) ----> Sort by count, filter from the fat belly ----> "relevant and interesting"
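
The three filters above can be sketched in Python as follows. This is only a sketch under stated assumptions: the `recommend` function, the toy data, and in particular where the "belly" window sits (a window of k items centred in the ranked list) are choices made here for illustration; the experiments described in this post did not pin down exact boundaries.

```python
from collections import Counter

def recommend(user_items, i, band="belly", k=3):
    """Rank I(U(i)) by count and pick k items from the head, belly, or tail.

    user_items maps each user to the set of items they use. The band
    boundaries (a window centred in the ranked list for the belly) are an
    assumption made for this sketch.
    """
    counts = Counter()
    for items in user_items.values():
        if i in items:                      # this user is in U(i)
            counts.update(items - {i})      # add their other items to I(U(i))
    # Sort by count, highest first; break ties by name for stable output.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    if band == "head":
        return ranked[:k]
    if band == "tail":
        return ranked[-k:]
    start = max((len(ranked) - k) // 2, 0)  # centre a window of k items
    return ranked[start:start + k]


# Made-up data: "a" is the most popular co-item of "x", "e" the rarest.
user_items = {
    "u1": {"x", "a", "b", "c", "d", "e"},
    "u2": {"x", "a", "b", "c", "d"},
    "u3": {"x", "a", "b", "c"},
    "u4": {"x", "a", "b"},
    "u5": {"x", "a"},
    "u6": {"y", "a"},   # not a user of "x", so ignored
}
```

Calling `recommend(user_items, "x", band="head", k=1)` picks the most common co-item, `band="tail"` the rarest, and `band="belly"` something from the middle of the ranking.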


Recommendation systems and the powerlaw curve


Now the other interesting observation was that using similar techniques on I(T(i)) did not give such crisp recommendations, where I(T(i)) are all the tag-related items for a given item, i.e. collections of tags are not as useful as collections of users in creating a recommendation engine.

Why might this be and how do we understand it from first principles? Here's my little theory.

Let's think about this in terms of gestures, primary and secondary gestures. Users express interest in an item by various gestures. One of them is tagging an item, but prior to tagging an item is the act of focusing on an item and picking it out of the vast universe of items.
This primary selection process appears to be a far more powerful indication of interest than the secondary act of tagging or describing the already selected item. Hence, I hypothesize, a recommendation system based on user-related items is more crisp than one based on tag-related items.

The bigger picture here suggests that the user or people dimension in folksonomy is at least as interesting as the tag dimension. We need to look more deeply at the "folk" and not just the "..sonomy".

(This subject was discussed in a talk I gave at FooCamp, where some very smart people were present, including Hal Varian of Google, DeWitt Clinton ex of Amazon, Luke Lonergan CTO of Greenplum, Mary Hodder of Dabble, James Levine of SimplyHired, and Todd "the SEO Guy" .... They participated in a very energetic discussion and helped me refine these ideas. Thanks for that, guys.)

Sunday, October 02, 2005

Many dimensions of "related"ness in folksonomy

When one asks the question "What do we mean by 'related tags'?", the response usually is "here's how I do related tags" ... and a SQL query is presented. SQL is a perfectly adequate language for querying tabular data, but not a particularly useful one for representing the abstractions that we want to talk about in exploring "relatedness" between users, tags and items.

We want to evolve a simple notation for discussing user-item-tag "relatedness". Up to now such discussions have had to resort to SQL (an implementation notation) to describe a design. If we get away from the SQL representation and start from first principles we come up with the following :-

Let the letters i, t, and u represent a specific 'item', 'tag' and 'user' respectively.
Let the uppercase I, T, and U represent mappings (loosely) as follows :-

Let U(i) be all users of an item i.
(SQL: select u.* from users u, user_items ui where u.id = ui.userid and ui.itemid = someitemid )

Let U(t) be all users of a tag t.
(SQL: select u.* from users u, user_tags ut where u.id = ut.userid and ut.tagid = sometagid )

Similarly,

Let I(u) be all items of a user u.
Let I(t) be all items with a tag t.
Let T(i) be all tags of an item i.
Let T(u) be all tags of a user u.

So now T(U(t)) is the set of all tags of all the users of a single tag t.
SQL : select distinct t.* from tags t, user_tags ut where t.id = ut.tagid and ut.userid in (select userid from user_tags where tagid = sometagid)
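
For comparison, here is how the same mappings might look as plain Python sets. This is a minimal sketch over made-up data; the function names (`U_of_tag`, `T_of_user`, `T_of_users_of_tag`) are invented for illustration and are not part of the notation.

```python
# Toy data: each triple records that a user applied a tag to an item.
triples = [
    ("ann", "photo1", "sunset"),
    ("ann", "photo2", "beach"),
    ("bob", "photo1", "sky"),
    ("bob", "photo3", "sunset"),
    ("cat", "photo4", "city"),
]

def U_of_tag(t):
    """U(t): all users of a tag t."""
    return {u for (u, _, tag) in triples if tag == t}

def T_of_user(u):
    """T(u): all tags of a user u."""
    return {tag for (user, _, tag) in triples if user == u}

def T_of_users_of_tag(t):
    """T(U(t)): all tags of all the users of a single tag t."""
    return set().union(*(T_of_user(u) for u in U_of_tag(t)))
```

The composition reads the same way as the notation: first collect the users of the tag, then union their tags.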

It's clear from this notation that translating T(U(t)) into English - "the set of all tags of all the users of a single tag t" - is pretty straightforward. It is far easier than translating from the SQL. In fact it is hard to tell from the SQL what the design intent is even when we use the sub-select implementation. If we had used the self-join on the user_tags table it would have been even harder. Before all the SQL experts in the crowd start rolling their eyes saying "what's hard about that ...?? !!!" let me clarify that my point is - it is hard to use the SQL statement as an expression of a *design* intent, not that this query is intrinsically difficult.

So, back to the discussion.

Note that U(t) is a set of users i.e. a collection of u's. So the expression T(U(t)) makes sense and I(U(t)) makes sense but
U(U(t)) makes no sense. U(U(t)) is supposedly the "set of users of the set of users of the tag t". We set U(U(t)) = U(t) i.e. more formally, make U idempotent. Similarly for the other operators ie T(T()) = T() and I(I()) = I().
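
One way to make the operators composable, and idempotent in the sense just described, is to have each operator take a set of elements and return a set. The sketch below is an assumption about how this could be coded, not a reference implementation: elements are tagged with a kind letter so that an operator can tell users, items and tags apart, and `make_op` is a hypothetical helper.

```python
# Toy data: (user, item, tag) triples. Elements are tagged with their kind
# so an operator can tell a user from an item from a tag.
triples = [
    ("ann", "photo1", "sunset"),
    ("bob", "photo1", "sky"),
    ("bob", "photo3", "sunset"),
]
KINDS = ("u", "i", "t")   # position of each kind inside a triple

def make_op(kind):
    """Build the operator that maps a set of elements to related `kind`s."""
    pos = KINDS.index(kind)

    def op(elements):
        out = set()
        for (k, value) in elements:
            if k == kind:
                out.add((k, value))          # same kind: pass through
                continue
            src = KINDS.index(k)
            out |= {(kind, row[pos]) for row in triples if row[src] == value}
        return out

    return op

U, I, T = make_op("u"), make_op("i"), make_op("t")
sunset = {("t", "sunset")}
```

With this definition `U(U(sunset))` is the same set as `U(sunset)`, while `T(U(sunset))` and `I(U(sunset))` compose as the notation suggests.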

So now with the notational formalities out of the way we can just turn the crank and churn out a bunch of "related" operations (and by extrapolating the earlier SQL we can get the SQL we need when we want to implement it). It is somewhat surprising and interesting that there are so many different ways to look at the "related" question.

Given an item i,

I(T(i)) the set of other items with the same tags as this item i.e. the set of "tag related" items for this item.

I(U(i)) the set of other items with the same users as this item. i.e. the set of "user related" items for this item.

Note that most websites using the words "tags", "folksonomies", "social" etc. in their description, mostly focus on tag related items. In a social tagging website I want to know about other items with my tags, (the tagging dimension) but I also want to know more about other users with my tags and my items (the user dimension). I want to know their tags and their items and quick and easy ways to browse these.

Simpy is one of those that present the user dimension explicitly in "Related Users", but the usual suspects "technoratideliciousflickr" don't have a direct way to navigate the user dimension of their site. By "direct way to navigate the user dimension" I mean the ability to go from my page directly to "related users", rather than have to do a two-step via tags or items. Perhaps, I should do another rant blog post about how "users get no respect".

Arranging by the innermost letter i.e. item, tag, user

items
-----
I(T(i)) tag related items
I(U(i)) user related items
U(T(i)) tag related users for i    - users who tagged this item in a similar manner - the T-cluster of users for item i
[Update: Jeremy Dunck points out that the correct interpretation of U(T(i)) is "the set of users who used any of the tags of item i" - it is not necessary that they used the tag in the same way.]
T(U(i)) user related tags for i    - tags of all users who have this item - the collective tag-wisdom about this item


tags
----
T(U(t)) user related tags
T(I(t)) item related tags
U(I(t)) item related users for t    - users who have items with this tag   - I-cluster of users for tag t
I(U(t)) user related items for t    - items of all users with this tag   - items you might find interesting if you have t

users
-----
U(T(u)) tag related users of user u
U(I(u)) item related users of user u
I(T(u)) all items of all tags of user u    - the T-cluster of items for user u
T(I(u)) all tags of all items of user u    - the I-cluster of tags for user u


As you'll notice I have used the somewhat abstract descriptions such as "T-cluster of items for user u" .... where I couldn't easily come up with a common language description. Perhaps the collective wisdom of the net can help put these in less abstract terms.

The above twelve are all possible operations when we compose the operators to the second degree, ie the composition has two members.

We could look at this collection rearranged differently

Arranging by the outer operation - i.e Set of tags T(), users U() or items I()

T(U(i)) all tags of all users that have an item i
T(U(t)) all tags of all users that have a tag t
T(I(u)) all tags of all items of user u
T(I(t)) all tags of all items that have tag t

U(T(i)) all users of all tags of item i
U(T(u)) all users of all tags of user u
U(I(t)) all users of all items of tag t
U(I(u)) all users of all items of user u

I(U(t)) all items of all users of tag t
I(U(i)) all items of all users of item i
I(T(u)) all items of all tags of user u
I(T(i)) all items of all tags of item i
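
The count of twelve can be checked mechanically. The short Python sketch below enumerates second-degree compositions symbolically, skipping an operator applied to its own kind (such as I(i)) and repeated operators (such as T(T(i)), which collapse by idempotence). The function name and the string representation are illustrative choices only.

```python
from itertools import product

OPS = "ITU"                             # operators: Items, Tags, Users
ARGS = {"i": "I", "t": "T", "u": "U"}   # argument letter -> its own operator

def second_degree():
    """Enumerate non-trivial compositions outer(inner(arg)) as strings."""
    result = []
    for arg, inner, outer in product(ARGS, OPS, OPS):
        if inner == ARGS[arg]:   # e.g. I(i): operator applied to its own kind
            continue
        if outer == inner:       # e.g. T(T(i)) collapses to T(i)
            continue
        result.append(f"{outer}({inner}({arg}))")
    return result
```

Three argument kinds, two usable inner operators for each, and two usable outer operators for each of those give 3 x 2 x 2 = 12 compositions.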

If you compute these, filtering and tuning appropriately, you see that each of these gives different but potentially interesting results. Depending on what you are looking for, one or another of these may be useful.

What happens then when we go another degree deeper such as T(U(T(u))) ? Don't know yet, but once we get away from using SQL to talk about this ... well at least we can talk about it easily.