January 21, 2013

Freedom of information versus the economics of information

This content is being re-posted from my microblog, Tumbld Thoughts.

This is a follow-up to my last Synthetic Daisies post, and turns the argument (should information be free of cost and/or financial value) on its head [1]. Would you like to be paid to read books? Does this sound too good to be true? Actually, this is an idea from Kevin Kelly (of Wired fame) at his blog "The Technium". It is a bit like the "economics of the free" [2] and some of the ideas advanced in my recent slideshow on science innovation [3].

[1] Does information want to be free? Or in this case, of "negative value" to the creator? In this case, any negative value is temporary, and actually becomes a strategic investment. This is more like the electricity consumer who (through the use of renewables) supplies energy back to the electric grid rather than the investor who gets negative interest on their savings account.

[2] For more information, please see the book "Free" by Chris Anderson. Could you replace the concepts "books" and "reading" in Kelly's model with any other concepts? There may or may not be a free lunch today, depending on your reading of "Free".

[3] A New Route to Science Innovation, doi:10.6084/m9.figshare.106823. In addition, there are some interesting relationships between this model and the potential monetization of social capital. This would be especially powerful if Kelly's model were implemented on a large social media network (e.g. Facebook or LinkedIn).

January 15, 2013

Does information want to be free?

I had just finished preparing my lecture on alternatives to the IP/government grant science funding regime (slides on Figshare) when I heard that Aaron Swartz (co-founder of Reddit, open-source advocate) had committed suicide. Will he become a martyr for the open-source movement? Below is his open-source publishing call-to-arms, the Guerilla Open Access Manifesto, which I am reposting here with comments (no permissions needed). R.I.P. Aaron.

Aaron Swartz, as featured in a Talking Points Memo (TPM) article.

Information is power [1]. But like all power, there are those who want to keep it for themselves. The world’s entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations. Want to read the papers featuring the most famous results of the sciences? You’ll need to send enormous amounts to publishers like Reed Elsevier [2].

There are those struggling to change this. The Open Access Movement has fought valiantly to ensure that scientists do not sign their copyrights away but instead ensure their work is published on the Internet, under terms that allow anyone to access it. But even under the best scenarios, their work will only apply to things published in the future. Everything up until now will have been lost.

That is too high a price to pay. Forcing academics to pay money to read the work of their colleagues? Scanning entire libraries but only allowing the folks at Google to read them? Providing scientific articles to those at elite universities in the First World, but not to children in the Global South? It’s outrageous and unacceptable.

“I agree,” many say, “but what can we do? The companies hold the copyrights, they make enormous amounts of money by charging for access, and it’s perfectly legal — there’s nothing we can do to stop them.” But there is something we can, something that’s already being done: we can fight back.

Those with access to these resources — students, librarians, scientists — you have been given a privilege [3]. You get to feed at this banquet of knowledge while the rest of the world is locked out. But you need not — indeed, morally, you cannot — keep this privilege for yourselves. You have a duty to share it with the world. And you have: trading passwords with colleagues, filling download requests for friends.

Meanwhile, those who have been locked out are not standing idly by. You have been sneaking through holes and climbing over fences, liberating the information locked up by the publishers and sharing them with your friends.

But all of this action goes on in the dark, hidden underground. It’s called stealing or piracy, as if sharing a wealth of knowledge were the moral equivalent of plundering a ship and murdering its crew. But sharing isn’t immoral — it’s a moral imperative [4]. Only those blinded by greed would refuse to let a friend make a copy.

Large corporations, of course, are blinded by greed [5]. The laws under which they operate require it — their shareholders would revolt at anything less. And the politicians they have bought off back them, passing laws giving them the exclusive power to decide who can make copies.

There is no justice in following unjust laws. It’s time to come into the light and, in the grand tradition of civil disobedience, declare our opposition to this private theft of public culture.

We need to take information, wherever it is stored, make our copies and share them with the world. We need to take stuff that's out of copyright and add it to the archive. We need to buy secret databases and put them on the Web. We need to download scientific journals and upload them to file sharing networks. We need to fight for Guerilla Open Access [6].

With enough of us, around the world, we’ll not just send a strong message opposing the privatization of knowledge — we’ll make it a thing of the past. Will you join us?

Aaron Swartz
July 2008, Eremo, Italy

[1] Indeed. More specifically: Knowledge is power, but those who control access to that knowledge have the real power.

[2] Here is a link to the academic scientist-led Elsevier boycott. Interestingly, many of the high-profile open-access journals also charge a publishing fee for authors but keep access free. While this is good for sharing information, it could still exclude citizen scientists from broadcasting information.

[3] The economics of librarianship are vastly different from the market economics of, say, the entertainment industry. Part of this is due to the intangibility of the product, but traditional libraries and entertainment markets still make for a good comparison (the objects of exchange are similarly intangible -- unlike land or objects such as cars, it is hard to discretize, enclose, and assign value to these units). I did not cover this directly in my talk, but it brings up a host of interesting issues.

[4] The mores of downloading (or sharing information in all its forms) are not clear. Is it rightful sharing? Or is unrestricted downloading a form of theft? Here is a New York Times story on the Megaupload case, a legal perspective on digital "theft". And here is a list of content theft in all its forms from the MPAA. As a rebuttal to the NYT story, here is a Freakonomics blog post on why downloading is not theft. And finally, here is a Wired opinion piece on how "piracy" (the online dissemination of music, movies, and books) is simply the next step in publishing's technological evolution. In general, people have quite strong moralistic positions on this despite the inherent intangibility of the information in question.

[5] It is not so much that corporate entities are blinded by greed, but rather that their raison d'etre is to extract rent from both the creators of content and the consumers of content. This might make sense in a world where technology restricts how this information is distributed to the audience, or in cases where the costs are minimal and the benefits have the potential to be highly advantageous to the creator, but probably makes little sense in an age of high-bandwidth internet access. In other words, the proprietary journals et al. must now prove their worth.

[6] For more information (not necessarily of the Guerilla variety) on how to be more open access-friendly, please see two recent blog posts:

1) What you can do to promote open access (Peter Suber).

2) 10 things you can REALLY do to support Open Access (Jonathan Eisen, "Tree of Life").

In addition, Dan Cohen (at Wired Blogs) wrote a post in tribute to Aaron Swartz called "How to Make Open-Access Work". It fits well within the scope of my talk.

ADDED ON 1/17: A group of mathematicians has just announced that the Episciences project (open-source peer-review system and journal) will commence soon as an arXiv overlay journal. Good luck!

January 11, 2013

Metabiology and the Evolutionary Proof

I have recently read the book "Proving Darwin" by Gregory Chaitin. Greg is a mathematician who is best known for his work in computational complexity [1]. "Proving Darwin" is not only a mathematician's take on evolution by natural selection, but also a rebuttal of intelligent design. Despite its slimness [2], the book is quite interesting. Chaitin is particularly interested in the creative and algorithmic complexity aspects of natural selection. While this would seem a bit obscure to a biologist, this perspective provides crucial evidence for the viability of natural selection, and even parallels some empirical results from artificial life and evolutionary biology. Aside from providing a review of this book, I will also provide a conceptual evaluation of the core thesis by using simulated data.

To orient the biological audience, a mathematical proof is a formal logical construct one uses to demonstrate the validity of a given solution. In the case of an unsolved mathematical problem (e.g. the Riemann hypothesis, P vs. NP), the proof is what is required to claim that a problem is solved [3]. In this case, Chaitin does not lay out his proof directly, but does draw his conclusions from it. This is an approach he calls "Metabiology" [4]. The metabiological approach characterizes living systems as programs. Therefore, the primary difference between living and non-living entities in metabiology is the ability to transmit information. This is not a new observation, but is key to his characterization of evolutionary systems. A flame is used to highlight this difference: while a flame can have metabolism and can self-reproduce, it cannot evolve like a living entity because it cannot transmit information [5]. While this may not be a completely reasonable analogy, it does allow us to further understand the nature of creativity in living systems.

The algorithm of evolution proposed in the book is a k-bit program which can be modified via mutation [6]. This is similar to in silico approaches to evolutionary biology such as Tierra and Avida. The relative information content (or program size complexity) of a mutation in the k-bit program is the information content of a non-mutated program (e.g. genotype) given the information content of a mutant program [7]. The k-bit program (Figure 1), as a highly stylized and abstract population of genomes, results in a partially-stochastic dynamical system, the relevance of which will soon become clear.

Figure 1. Schematic of a k-bit program and how it evolves
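
To make the idea of a mutable k-bit program a little more concrete, here is a minimal Python sketch. It is a toy of my own and not Chaitin's construction: the program is just a list of bits, and the names `mutate` and `hamming`, as well as the default per-bit mutation rate of 1/k, are assumptions chosen for illustration.

```python
import random

def mutate(program, rng, rate=None):
    """Return a mutated copy of a k-bit program.

    Assumption: each bit flips independently with probability `rate`
    (default 1/k). This is a toy stand-in, not Chaitin's mutation model.
    """
    k = len(program)
    rate = 1.0 / k if rate is None else rate
    return [bit ^ 1 if rng.random() < rate else bit for bit in program]

def hamming(a, b):
    """Number of positions at which two equal-length programs differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)  # fixed seed so that runs are repeatable
k = 16
parent = [rng.randint(0, 1) for _ in range(k)]
child = mutate(parent, rng)
print(parent, child, hamming(parent, child))
```

The Hamming distance between parent and child is only a crude proxy for how much a mutation changes a program; the relative information content described above is a much stronger notion.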

In the context of evolution, creativity is the ability to create new combinations -- and ultimately forms. And according to Metabiology, the most creative lineages and forms are those that contain the most information-rich mutations. But what does it mean to possess "informative" mutations? To better understand this, Chaitin provides us with three evolutionary regimes to test his mathematical theory. These are: 1) intelligent design, 2) brainless exhaustive search, and 3) cumulative random evolution. Intelligent design is provided as a representative of evolution being a deterministic process that is limited to a fixed number of outcomes. By contrast, brainless exhaustive search assumes that evolution is entirely random, with no constraints. Furthermore, this model has no memory, and does not allow for evolutionary conservation or modularity. Cumulative random evolution provides a middle path between these two scenarios: random mutation given a limited fitness landscape, resulting in evolutionary trajectories that are highly path-dependent [8].
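
The three regimes can be contrasted with a small Python sketch. This is a toy under stated assumptions, not the model from the book: fitness is simply the number of 1 bits in a k-bit string (the book uses a Busy Beaver-based fitness), and the function names are hypothetical. Each function returns the best fitness found and the number of steps taken.

```python
import random

def fitness(program):
    # Toy assumption: fitness is the count of 1 bits, not Chaitin's fitness.
    return sum(program)

def intelligent_design(k):
    # The "designer" already knows the optimum and writes it down in one step.
    return fitness([1] * k), 1

def exhaustive_search(k, rng, max_steps):
    # Memoryless: a fresh random program is drawn at every step and
    # nothing is inherited from earlier draws.
    best = 0
    for step in range(1, max_steps + 1):
        candidate = [rng.randint(0, 1) for _ in range(k)]
        best = max(best, fitness(candidate))
        if best == k:
            return best, step
    return best, max_steps

def cumulative_evolution(k, rng, max_steps):
    # Path-dependent: flip one random bit of the current program and
    # keep the mutant only if it is fitter than its parent.
    current = [0] * k
    for step in range(1, max_steps + 1):
        mutant = current.copy()
        mutant[rng.randrange(k)] ^= 1
        if fitness(mutant) > fitness(current):
            current = mutant
        if fitness(current) == k:
            return k, step
    return fitness(current), max_steps

rng = random.Random(42)
k = 10
print("design:    ", intelligent_design(k))
print("exhaustive:", exhaustive_search(k, rng, 100_000))
print("cumulative:", cumulative_evolution(k, rng, 100_000))
```

In this toy the designed solution arrives in a single step, while the cumulative strategy needs at least k steps because it changes one bit at a time. The exact step counts depend on the seed, so they should be read as illustrative only.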

By solely considering the Big O metric of program complexity [9], it is shown that the intelligent design scenario achieves a maximum fitness value much faster than either brainless exhaustive search or cumulative random evolution (Figure 2). Why is this and how does such a result prove natural selection? Evolutionary algorithms tend to be much slower than comparable techniques in contexts such as multivariate optimization [10], even for a convex (e.g. where the goal is clearly defined) search space. However, the speed of attaining a maximum fitness value does not translate into a robust result. This is because in Metabiology, fitness [11] is defined as the rate of biological creativity rather than as being related to differential reproduction. The speed (or tempo) of evolution is measured by how fast fitness grows [12]. While this does not provide a specific test of selective forces, it does move us away from the naive adage of "survival of the fittest". Instead, we find lineages with varying amounts of creativity, and a situation akin to "survival of the most creative".

Figure 2. How quickly maximum fitness is achieved for three evolutionary scenarios. Inset: the earliest portion of evolution (indicated by bracket). Hypothetical scenario modeled using simulated (pseudo) data.

The maximum fitness for the intelligent design scenario (as shown in Figure 2) occurs very early in the evolutionary trajectory. Essentially, it shows a big initial gain for intelligent design. Presumably, this approach finds the known strategies first. The other two strategies reach their maximal fitness later on. However, it is the secondary properties, not simply speed of solution, which actually make for an adaptive outcome. Using these secondary properties as a guide, we can assess exactly why cumulative random evolution is more powerful than the alternatives. As characterized in the book, there are four additional ways to evaluate these models: creativity, time, robustness, and coherence of function. Figure 4 shows continua for each of these.

Figure 3. Table that shows the degree of creativity, time taken, robustness exhibited, and systemic coherence exhibited by each evolutionary scenario. Hypothetical scenario modeled using simulated (pseudo) data.

Finally, let us consider how these results square with findings from the artificial life literature. When I first started to read about Metabiology, I was struck by how parallel this approach is to the Avida platform. Both approaches use programs (in the form of instruction sets) as a stand-in for the genome. In particular, both show a tendency for fitness (and complexity) to grow over evolutionary time. This either suggests that there is a fundamental insight to be had here, or that systems like these are very good at maximizing yield.

Figure 4. Table that shows the degree of creativity, time taken, robustness exhibited, and systemic coherence exhibited by each evolutionary scenario. Values along continuum are arbitrary. 1 = Intelligent Design, 2 = Brainless Exhaustive Search, 3 = Cumulative Random Evolution.

Instruction-based evolutionary models are derived from Turing machines [12], for which the machine's state is indeterminate. This is based on the halting mechanism, which does not allow for guided behavior in any way. Yet evolution must be guided in some fashion, and intelligent design provides brittle results at best. This is where Chaitin introduces readers to the concept of an Oracle [13], which can be thought of as a top-down guidance mechanism for the halting (or mutational) behavior of a random system. Think of this as a natural version of site-directed mutagenesis, so that lethal combinations would be much less likely than otherwise. This natural (albeit hypothetical) oracle could also resolve some of the evolutionary paradoxes we often witness in natural populations. For example, an oracle might prevent populations from getting stuck in low-fitness regimes, or make large portions of a rugged fitness landscape more accessible [14]. In metabiological explorations, throwing out lethal mutants maintains high levels of fitness and allows for evolutionary dynamics such as competition and sustained arms races.

A naturalistic mechanism for the proposed oracle is still very much a mystery. However, the architecture of execution routines and the structure of genetic regulatory networks (GRNs) are very similar. From a programming perspective, a hierarchical structure could serve as a heuristic stand-in for this oracle. In particular, an approach called subrecursion (e.g. the use of subrecursive hierarchies -- [15]) might provide a mechanism for the supervised maximization of fitness. This is the direction the end of the book should have taken -- however, this might also be work in progress. A more general mechanism may lie in the mysteries of mathematical incompleteness and randomness itself. Mathematical incompleteness (discussed at length in this book in the form of Chaitin's Omega number) suggests that any sufficiently expressive, consistent formal system must include undecidable propositions which can neither be proven nor disproven. This might help us understand why, although the forces of evolution are well characterized, the explanatory power of evolutionary laws is limited [16].

So what are the theorems of evolution? Chaitin does not make them explicit in this book, which is a letdown. And a casual reader gets the sense that Metabiology is simply a reformulation of results already confirmed in the areas of population genetics and digital biology. However, it is interesting that Chaitin has come at these results using a parallel methodology. Perhaps this convergence is the best evidence for evolution by natural selection.


[1] The field of Algorithmic Information Theory, to be exact.

[2] Contrast this with the thickness of "The Structure of Evolutionary Theory" by Stephen Jay Gould.

[3] As opposed to a test of the null hypothesis (NHST) or transitional fossil.

[4] This is based on a course at UFRJ and a lecture at the Santa Fe Institute (which I cannot locate on YouTube). For more information, please view this blog post at Logical Atomist.

[5] Another difference between a flame and a living population is the scale of its replicators. A flame, while hard to characterize as a population, behaves more like a slime mold than an intermittently-coordinated population (e.g. flock of birds or herd of sheep). This is an interesting point which is beyond the scope of the book.

[6] The probability of mutation (or probability of M) scales asymptotically with program length as 2^-k.

[7] In other words, the lineage resembles a probabilistic graphical model (PGM - e.g. Bayesian network). The more complex lineages are simply those with more informative mutations.

For a tutorial on PGMs, please see the following reference: Airoldi, E.M.  Getting Started in Probabilistic Graphical Models. PLoS Computational Biology, 3(12), e252 (2007).

[8] For simplicity's sake we can assume that all programs maximize their fitness on a fitness landscape. The actual topology is not important in this example.

[9] A Beginner's Guide to Big O Notation. Rob Bell blog, June 2009. See also Big O Notation, Wikipedia page.

[10] For the speed of evolutionary algorithms in the context of multiobjective optimization, please see: Coello-Coello, C.A., Lamont, G.B., and Van Veldhuizen, D.A.  Evolutionary Algorithms for Solving Multi-objective Problems. Springer, Berlin (2007).

[11] Fitness is characterized using the Busy Beaver (BB) function. In metabiology, BB(N) is the maximum fitness function attainable given the template program, suite of mutations, and strategy of evolution pursued.

For comparative purposes, another person doing work on the role of creativity in evolution (in silico) is Joel Lehman at UT-Austin.

[12] For more on the speed (or rate) of evolution through the accumulation of mutations in E. coli (e.g. wet-lab experimentation), please see: Kryazhimskiy, S., Tkacik, G., and Plotkin, J.B.  The dynamics of adaptation on correlated fitness landscapes. PNAS, 106(44), 18638-18643 (2009) AND Desai, M.M., Fisher, D.S., and Murray, A.W.   The Speed of Evolution and Maintenance of Variation in Asexual Populations. Current Biology, 17(5), 385-394 (2007).

For an Avidian study on the rate of evolution in the context of fitness landscape ruggedness (e.g. in silico experimentation), please see: Clune, J., Misevic, D., Ofria, C., Lenski, R.E., Elena, S.F., Sanjuan, R.  Natural Selection Fails to Optimize Mutation Rates for Long-Term Adaptation on Rugged Fitness Landscapes. PLoS Computational Biology, 4(9), e1000187 (2008).

[13] In thermodynamics, such an oracle is referred to as Maxwell's Demon. While this demon is rendered hypothetically impossible, a computational oracle is less bound by the laws of physics. For a primer on Maxwell's Demon, please see this Java app (shown below).

[14] Franke, J., Klozer, A., de Visser, J.A.G.M., and Krug, J.  Evolutionary Accessibility of Mutational Pathways. PLoS Computational Biology, 7(8), e1002134 (2011).

[15] For more information, please see:  Calude, C.  Theories of Computational Complexity. Elsevier, Amsterdam (1988) AND Calude, C.  Incompleteness, Complexity, Randomness, and Beyond. Minds and Machines, 12(4), 503-517 (2002).

[16] For those who are unfamiliar, the best-known evolutionary law is allometry, or the scaling of size and growth across phylogeny.

January 9, 2013

Playing for Science

This is being re-posted from my microblog, Tumbld Thoughts.

Here is an interesting article from "The Scientist" called "Games for Science". It features a number of games designed for massively parallel data analysis [1]. The idea is to distribute data to personal computers, perform operations there, and then re-assemble the analyzed data at its source. Two projects are EteRNA (RNA structural design) and Phylo (multiple sequence alignments).

[1] One example of distributed computing is BOINC (administered at UC-Berkeley).
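
The distribute, analyze, and re-assemble pattern described above can be sketched in a few lines of Python. This is only a local illustration: a thread pool stands in for the volunteers' personal computers, the G/C base count is a placeholder analysis, and the names `split`, `analyze`, and `run_distributed` are hypothetical. It does not use BOINC or any of the game platforms mentioned.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, n_chunks):
    """Cut a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def analyze(chunk):
    # Work done "on the volunteer's machine": count G/C bases per sequence.
    return [sum(base in "GC" for base in seq) for seq in chunk]

def run_distributed(sequences, n_workers=3):
    chunks = split(sequences, n_workers)            # distribute
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(analyze, chunks))   # analyze; order is preserved
    return [count for part in partial for count in part]  # re-assemble

print(run_distributed(["GATTACA", "GGCC", "ATAT", "CCG"]))
```

Because `pool.map` returns results in the order the chunks were submitted, the re-assembled list lines up with the original sequences. A real volunteer-computing system would also have to handle lost or unreliable workers, which this sketch ignores.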

January 3, 2013

As seen on Carnival of Evolution, #55

The new installment of Carnival of Evolution (#55) is now live. A lot of interesting posts on theoretical biology, evolutionary methods, the latest papers from the scientific literature, and the practice of science. Thanks to Suzanne Elvidge at the Genome Engineering blog for her New Year's Resolution themed presentation. A post from my microblog (Tumbld Thoughts) was featured [1]. If you would like to submit to Carnival of Evolution for an upcoming month or host (they are always interested in contributors), the organizers would love to hear from you.

Reposted from Tumbld Thoughts (December 5):

Here is a link to the Evolution episode of Carl Sagan’s Cosmos [2] series. Carl Sagan had a way of making scientific concepts seem so simple, yet at the same time so profound.

Above are two visual representations of evolution [3] taken from the video:

1) evolutionary transitions from the first plants (8:30) to cell colonies (8:56).

2) evolution of sea life from polyps with tentacles (9:26) to the first ancestral fishes (filter feeders with gill slits - 9:40).


[1] No posts from Synthetic Daisies in CoE this month. Just didn't have a chance to create a longer-ish, evolution-related post last month (I tend to use my microblog for shorter features, or for less polished profiles and features).

[2] A reboot of Cosmos is planned for the near future (hosted by Neil deGrasse Tyson and produced by Seth MacFarlane).

[3] The progression of time in the video is roughly proportional to elapsed evolutionary time.