Detail and the long tail
Thursday, January 17, 2008


I've been reading Edward Tufte's fantastic book Envisioning Information and thinking about tag clouds as an approach to data visualisation.

Tufte's strategies are to find approaches for representing data that:

  • increase the number of dimensions represented on a plane surface
  • increase the data density (i.e. the amount of information represented per unit area)

Compared to a plain list of posts, a tag cloud adds a dimension to the data, because it also represents how frequently each tag is used.

Since a tag cloud is a compressed, compact structure it also increases data density.

Tufte uncovers an interesting paradox for data representation:

to clarify, add detail (Tufte 1990, p. 37)

He disagrees with the conventional chart wisdom that data representations need to be simplified in order to communicate, and argues that:

panorama, vista, and prospect deliver to viewers the freedom of choice that derives from overview, a capacity to compare and sort through detail (Tufte 1990, p. 38)

So what about the question of detail in tag clouds?

When you generate a tag cloud from a dataset, you are displaying the frequency distribution for the most commonly used tags in your set. One of the choices you have to make is how many tags to include. In the extreme case, if you include only one tag, your cloud will simply display the tag you have used most often. If you include all of your tags, your tag cloud may be enormous and hard to deploy because it won't fit on the screen.
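To make that choice concrete, here is a minimal Python sketch of building a cloud from tagged posts. The function name tag_cloud, the max_tags parameter, the font-size range and the linear scaling are all my own assumptions for illustration; this is not how delicious or any particular tag cloud library does it.

```python
from collections import Counter

def tag_cloud(posts, max_tags, min_size=10, max_size=32):
    """Return (tag, count, font_size) for the max_tags most used tags.

    posts is a list of tag lists, one per post (an assumed input shape).
    Font sizes are scaled linearly between min_size and max_size.
    """
    # Count how often each tag is used across all posts.
    counts = Counter(tag for tags in posts for tag in tags)
    # Keep only the most frequently used tags: this is the "detail" choice.
    top = counts.most_common(max_tags)
    if not top:
        return []
    highest = top[0][1]
    lowest = top[-1][1]
    span = highest - lowest
    cloud = []
    for tag, n in top:
        if span == 0:
            # All displayed tags are equally frequent: show them at full size.
            size = max_size
        else:
            size = min_size + (n - lowest) * (max_size - min_size) / span
        cloud.append((tag, n, round(size)))
    # Tag clouds are conventionally listed alphabetically.
    return sorted(cloud)

posts = [
    ["python", "tufte"],
    ["python", "tagcloud"],
    ["python", "tufte"],
    ["design"],
]
print(tag_cloud(posts, max_tags=1))
print(tag_cloud(posts, max_tags=10))
```

With max_tags=1 the cloud collapses to the single most used tag; with a large max_tags every tag appears, including the ones used only once.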



The illustration above is a screenshot from http://del.icio.us/help/tagrolls. This screen enables a delicious user to design a tag roll: a way of presenting delicious bookmarks and tags, for example on a blog. On the left, delicious enables me to define my options for the tag roll. On the right, I see the tag roll as it would appear if I selected this set of options. When the size parameter is set close to its minimum value, the resulting tag cloud has few tags in it. If the size parameter is set to its maximum, all tags are shown.

For most people, the frequency distribution of their tag use will approximate a power-law distribution: a small number of tags are used very frequently, and many tags have been used only once. This is the distribution of my delicious tags:



There are 188 tags that I've used only once.
The tag I have used most frequently has been used 61 times.

The famous long tail of power law distributions is very apparent.
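Figures like these are easy to compute for your own tags. The sketch below assumes you already have a mapping from each tag to the number of times you have used it; the function name usage_summary and the sample counts are made up for illustration and are not my actual delicious data.

```python
from collections import Counter

def usage_summary(tag_counts):
    """Summarise a mapping of tag -> number of uses."""
    # Frequency of frequencies: how many tags were used exactly k times.
    histogram = Counter(tag_counts.values())
    return {
        "distinct_tags": len(tag_counts),
        "used_once": histogram.get(1, 0),
        "max_uses": max(tag_counts.values(), default=0),
        "histogram": dict(sorted(histogram.items())),
    }

# Hypothetical sample data, not real usage figures.
sample = {
    "python": 6,
    "design": 3,
    "tufte": 2,
    "goffman": 1,
    "knitting": 1,
    "karaoke": 1,
}
summary = usage_summary(sample)
print(summary)
```

The used_once entry is the length of the tail at its thinnest point, and max_uses is the height of the head; the histogram shows how quickly the counts fall away between the two.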

So the detail question really becomes a matter of how much of your tail feathers you're prepared to shake.

This stimulating post by Fred Stutzman, Unit Structures: The Long Tail of Identity, explores the idea that the really interesting stuff you can learn from other people's tag clouds is in that long tail. He argues that the long tail will show the guilty music secrets, the dodgy movie choices, and the strange hobbies that overtake us for a weekend and then pass on.

I guess in terms of the Goffman distinction I discussed earlier, the high frequency tags are what is given whereas the long-tail tags are what is given-off.

