I’ve been trying to find a way to make this blog useful, alive and more widely read. A simple idea came to me yesterday: I subscribe to a spectrum of academic alert feeds that often let me find great drafts to suggest to friends. Instead of taking notes about a paper that might never end up in my references, or figuring out who would be interested and why, I thought I should publish my comments, maybe attract the authors’ attention to my remarks, and forward a link to those of my friends who could be interested.
After all, you (the taxpayers) are paying us to work; you might as well have the right to see how the sausages are made (well, no, not really, but I think it’s best if we open the room a little bit). The thought didn’t come to me earlier, probably because I tend to be rather detailed (i.e. harsh to the point of being rude) in many of the comments I want to make in workshops. More than (virtual) publishers or automated reviewer selection, what we (grad students) need are seminars, and the internet is the best place to hold them. During workshops, my contributions are much more relevant (and more accurate) when I can use Papers to go online and sort my arguments out, find the references I want to quote and check the data before making a comment.
So: let us learn to improve our work, asynchronously and publicly; let me learn how to have a great and visible blog; and let’s start with a paper about… learning how to publish videos. Well, not quite, but the conclusions could point in that direction.
“Crowdsourcing, Attention and Productivity” is a pre-print by Prof. Bernardo A. Huberman and Daniel M. Romero of HP Labs, made available on arXiv two weeks ago. Authors know best, so let them explain what this is all about with their abstract:
The tragedy of the digital commons does not prevent the copious voluntary production of content that one witnesses in the web. We show through an analysis of a massive data set from YouTube that the productivity exhibited in crowdsourcing exhibits a strong positive dependence on attention, measured by the number of downloads. Conversely, a lack of attention leads to a decrease in the number of videos uploaded and the consequent drop in productivity, which in many cases asymptotes to no uploads whatsoever. Moreover, uploaders compare themselves to others when having low productivity and to themselves when exceeding a threshold.
The paper has two main empirical results:
- the first could be described as obvious (although it is needed to reassure all those who love to sing that user-generated media is ugly), but it’s now official: people post more videos when they have many viewers;
- more importantly, the reference point appears to be the average number of views for early posters, and their own historic average for more experienced posters.
Combined with other results, such as the fact that people seem to improve the quality of their contributions (at least on Digg) and the fact that viewers attract viewers (at least on Flickr), we can now expect an econometric model of attention, output and quality. We could measure the rationality of posters, or even justify subsidies (!) for user-generated sites, because these could help sites reach a higher equilibrium level of participation.
Let us however focus on the work at hand: one bold claim made by Huberman and Romero concerns causality. Sounds like the classic “post hoc, ergo propter hoc” fallacy that any undergrad commits with statistics? No: the behaviours measured are quite distinct and notoriously unpredictable, and as explained in the paper, one explains the other, but not the other way around.
Although the conclusions are fine, one point is not addressed: are we seeing quality filtered through attention? Is success encouraging posters to spend more time and upload better videos? Or are the reactions mostly good advice and feedback that help posters learn how to make good, successful videos? The success of professionally edited videos and the increasing quality scores favour the learning hypothesis: but is it improvement by doing, or thanks to the social construct of references and anticipation? Data on comments would be useful, especially the presence of technical terms, precise references to elements in the video, replies and back-and-forth involving the author, connection data outside of posting, etc.
There is another detail that is not completely clear in the paper: the activity periods. Are they common to all users? Say the two-week period starts every other Saturday, at midnight before Sunday: if someone posts late on Saturday night, that single session spans two active periods, and one of them can look artificially low (e.g. posting from 8PM to half past midnight). I’m assuming the authors chose a day and time with the least activity to avoid that, but still: maybe it’s best to delimit activity periods for each user separately, not with a unique partition of the calendar. Actually, a distribution of the time-spans between two uploads would be a great addition to the paper: there would likely be a spike at very short spans (successive uploads) and the usual biorhythmic suspects, daily, weekly and monthly peaks. Comparing those would help measure the actual rhythms online, and finding a dip might help decide what delimits an individual period. Users’ productivity across periods could then be compared to the length of their sustained activity periods, to see if regularity helps, etc.
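The delimitation I have in mind could be sketched roughly as follows. This is a minimal illustration, not anything from the paper: the timestamps are hypothetical, and the two-week cutoff is just an example threshold; in practice one would pick it from the dip in each user’s gap distribution.

```python
# Sketch: per-user inter-upload gaps, and splitting uploads into activity
# periods at a gap threshold. All data below is hypothetical.
from datetime import datetime, timedelta

def gaps(timestamps):
    """Time-spans between consecutive uploads, in chronological order."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]

def activity_periods(timestamps, max_gap):
    """Split one user's uploads into periods: a new period starts whenever
    the gap since the previous upload exceeds max_gap."""
    ts = sorted(timestamps)
    periods = [[ts[0]]]
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev > max_gap:
            periods.append([cur])   # long silence: open a new period
        else:
            periods[-1].append(cur)
    return periods

uploads = [datetime(2008, 11, 1, 23, 50), datetime(2008, 11, 2, 0, 20),
           datetime(2008, 11, 9, 21, 0), datetime(2008, 11, 30, 18, 0)]

print(gaps(uploads))  # histogram these to look for daily/weekly/monthly peaks
print(len(activity_periods(uploads, timedelta(days=14))))  # prints 2
```

Note that the late-Saturday upload pair (23:50 and 00:20) stays in one period here, whereas a fixed calendar partition would split it in two.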
That could help answer another question: can quality be intrinsic, and come from spans longer than the active sessions? For example, dealing with the previous session takes time (replying to comments) that prevents posting; good ideas appear at a constant rate, or come from comments; and longer preparation (invisible once the inactive periods are filtered out) could be key to quality, but we can’t see it. All that is speculation, but these are questions that can now, thanks to those very extensive logs, be answered.