March 14, 2015

A Modest Framework for Scientific Transparency

Here are six points for the integration of open-access science publishing and open data. These points were developed from personal practice and research, as well as interactions with the Research Data Service (University of Illinois) and the SciFund Challenge. This pipeline begins at the write-up stage, but some points rely on practice prior to analysis and write-up.


A)   Preprint (e.g. kernel of hypothesis- or question-driven results).

A number of options exist for this, including arXiv, bioRxiv, PLoS One, or another permanent location that provides a formal archival address or digital object identifier (DOI). The core paper should be brief (6-12 pages) and formal.


B)   Advanced methods/theory.

These can be submitted as supplemental materials, either in the same repository as the preprint itself or on another permanent server. As opposed to simple auxiliary files, this should be set up more along the lines of an IPython notebook.


C)   Advanced Analysis.

This can be treated in the same manner as the advanced methods/theory. This will include transformational datasets (e.g. time-frequency decompositions, log transforms, combinations of data from multiple sources in a common framework) and the associated data tables and figures/graphs.


D)   Datasets.

1)   Raw Data: images, unprocessed vectorial or matricial output.

These will be stored as formatted image files, ASCII files, or tabular files.

2)   Processed Data: numeric variables, simple annotation.

These will be appended to the raw data either in the file or as linked files in the same directory.

3)   Higher-level Data: correlational, data fusion, decompositional.

These will include the transformational datasets mentioned in the section on Advanced Analysis. These datasets are to be linked to the raw and processed data directory. Simple annotation methods will confirm the identity of each dataset.

4)   Higher-level Representation: RDF/XML descriptive models, algorithmic (e.g. data landscapes, possibility spaces).

These types of representations can help us go beyond the typical reliance on “statistical significance” and “future directions” to provide a rigorous approach to guide future investigations. An example of this is parameterization models from existing data.
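
The linking of raw, processed, and higher-level datasets described under Datasets can be made concrete with a small manifest. The following is a minimal sketch, not a prescribed method: the directory names (raw, processed, higher_level) and the function names are hypothetical, and checksums are only one possible reading of "simple annotation methods".

```python
import hashlib
from pathlib import Path

# Hypothetical directory names for the three dataset levels.
LEVELS = ("raw", "processed", "higher_level")


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file under root (by relative path) to its level and checksum."""
    manifest = {}
    for level in LEVELS:
        folder = root / level
        if not folder.is_dir():
            continue  # a level may be absent in a small project
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                manifest[path.relative_to(root).as_posix()] = {
                    "level": level,
                    "sha256": sha256_of(path),
                }
    return manifest


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return relative paths that are missing or whose checksum has changed."""
    return [
        rel
        for rel, entry in manifest.items()
        if not (root / rel).is_file()
        or sha256_of(root / rel) != entry["sha256"]
    ]
```

Storing the manifest alongside the data (for example as a JSON file) lets a reader confirm that the files they downloaded are the ones the paper describes.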


E)   Blogging Publicity.

All materials should be promoted through a blog post. This can be in the form of a feature article, or as a series of annotated links. This can be followed up with reposting key features of the initial post to a social blog like Tumblr or sharing a link via Twitter.


F)   Peer Commentary.

While this is typically kept confidential, there are so-called post-peer-review venues that provide a means to review work (e.g. PeerJ, F1000). This includes both formal (actionable) statements and informal statements in the form of critiques. 


This outline represents the entirety of a scientific reporting pipeline (from formal write-up to published items), although I am no doubt missing something. I will be fleshing each of these points out in future posts with real data and examples from Orthogonal Research and my work at the University of Illinois.

March 9, 2015

Review of "Arrival of the Fittest"

"Arrival of the Fittest" is the latest book by Andreas Wagner, a professor at the University of Zurich in Switzerland. The book [1] tackles a subtly complicated topic: the evolution and evolvability of innovations using a biochemical and computational perspective. For the most part, Wagner succeeds at presenting an elegant case for how the ability to naturally evolve innovations lies at the heart of the evolutionary process. To get there, however, Wagner must introduce us to a number of semi-obscure concepts (at least to the layman). The book can be summarized in four parts: survival of the most novel (I), the concept of innovability (II), the safe and the risky (III), and multiple origins, multiple solutions (IV). I will give a technical review of the book by highlighting these four interrelated themes.



I. Survival of the most novel.
To understand the propagation of innovative solutions throughout the tree of life, it is important to understand the difference between the force of evolution and the role of selection. Wagner proposes that, rather than creating innovations, the role of natural selection is to preserve them. Innovations themselves are the result of mutation, recombination, and historical contingencies. The first two mechanisms are capable of randomly generating either very simple innovations or the components of more complex innovations. But to achieve the "tinkering" that seems to be prevalent in complex genetic pathways and phenotypes, we need to have standard meta-components that can lock in previous changes. This allows for the limited exploration of a fitness space without incurring the cost of losing previous advances.


Wagner teases historical contingencies apart into two classes: building blocks and standards. In the case of building blocks, the working components that result from previous innovation are modularized into larger units. This allows increasing complexity to emerge from a stochastic process (evolution) and accelerates the process of finding and retaining novelties (innovation). The coordination of building blocks often leads to standards, which are much more flexible than hyper-specialized systems and much better able to deal with hyper-complexity.

Building blocks assembled into a complex structure.

Wagner illustrates this by comparing metabolic engines (a high-tolerance biological engine) with the internal combustion engine (a high-precision mechanical innovation). While metabolic engines can use an interchangeable set of fuels and reactants, internal combustion engines require exact specifications to operate. While one might argue that this implies mechanical engines will be "optimized" and metabolic engines will be made "good enough", this is indeed the point. Rather than survival of the fittest, we observe a survival of the fit enough, with selection acting strongest on functional novelties that augment fitness.

A high-precision, highly specific engine (internal combustion) type.

II. Concept of innovability.
In the case of complex metabolic networks, the basic function has not changed throughout evolutionary history. What has changed is the number of reactions, which scales with evolutionary complexity. This scaling requires very general standards. But to make these standards interoperable across divergent evolution, a diversity of building blocks and regulatory mechanisms is also required. This leads us to a set of principles that explain general trends in innovation rather than explaining innovations on a case-by-case basis. Another such principle involves the number of parts and identity of a specific innovation. The number of interacting parts and their configuration provides a means to compare specific innovations in a hypothetical manner. Wagner proposes that this be done using a high-dimensional structure such as a hypercube [2] or a neutral network [3].

In a hypercube representation, each node represents a specific genotype, while the edges represent pathways (or the accumulation of mutations) between genotypes. In short, the shortest paths are the most probable. In a highly-connected space, a long distance can be traveled across the space in a small number of mutations (or edges).
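
To make the hypercube picture concrete, here is a minimal sketch (not taken from the book) that treats genotypes as tuples of bits, edges as single point mutations, and uses breadth-first search to count the fewest mutations between two genotypes. The function names and the optional `viable` set, a stand-in for non-lethal genotypes, are assumptions made for illustration.

```python
from collections import deque


def neighbors(genotype):
    """Yield every genotype one point mutation (one bit flip) away."""
    for i in range(len(genotype)):
        flipped = list(genotype)
        flipped[i] = 1 - flipped[i]
        yield tuple(flipped)


def mutational_distance(start, goal, viable=None):
    """Fewest mutations from start to goal, found by breadth-first search.

    If `viable` is given, paths may only pass through genotypes in that
    set (it should contain the goal). Returns None when no path exists.
    """
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        current, steps = queue.popleft()
        if current == goal:
            return steps
        for nxt in neighbors(current):
            if nxt in seen:
                continue
            if viable is not None and nxt not in viable:
                continue
            seen.add(nxt)
            queue.append((nxt, steps + 1))
    return None


# On the full 3-dimensional hypercube the distance is the Hamming distance.
print(mutational_distance((0, 0, 0), (1, 1, 0)))  # 2

# If the two direct intermediates are lethal, a longer detour is needed.
viable = {(0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0)}
print(mutational_distance((0, 0, 0), (1, 1, 0), viable))  # 4
```

The second call shows why connectivity matters: removing nodes from the space lengthens the route between two genotypes, or can cut it entirely.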

III. The safe and the risky.
How can the structure of a biological system make things safe for innovation? Certainly, blindly changing key components of a metabolic network or developmental scaffolding without regard for essential function can result in lethality. The presence of the building blocks and standards principles provides a failsafe means to tinker or change without lethal disruption. But there are two other principles at work: redundancy and connectivity. These principles are arguably more important in acting as gatekeepers for the fitness benefits reaped by innovations.

A machine finding its way through a maze. One way to search for novel solutions in a de novo fashion.

Redundancy involves the presence of multiple parts that play a similar or interchangeable role in the functioning of a system. For example, both genetic and metabolic networks can be robust to the removal of single components. When a component can be removed without a consequence for function [4], we can say that the component is redundant. Gene duplications can play a similar role: single copies can be knocked out without a large detriment to fitness. Related to redundant function is, of course, robustness. Robustness can be thought of as how much redundancy exists in a specific system. Wagner gives the example of phenocopying as a way in which developmental robustness can lead to redundant phenotypic configurations.
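
One hedged way to put a number on this idea is to knock out each intermediate component of a toy network in turn and ask whether the end product can still be reached. The sketch below assumes a directed network stored as a dictionary and defines "function" as reachability of a target from a source. Both are simplifications chosen for illustration, and the node names are hypothetical.

```python
def reachable(graph, source, target, removed=frozenset()):
    """Depth-first check that target can be reached from source."""
    if source in removed or target in removed:
        return False
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt not in removed:
                seen.add(nxt)
                stack.append(nxt)
    return False


def knockout_robustness(graph, source, target):
    """Fraction of single intermediate-node knockouts that preserve function."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    intermediates = sorted(nodes - {source, target})
    if not intermediates:
        return 1.0
    survived = sum(
        reachable(graph, source, target, removed=frozenset([node]))
        for node in intermediates
    )
    return survived / len(intermediates)


# Toy pathway: A and B are interchangeable, but everything passes through C.
toy = {
    "nutrient": ["A", "B"],
    "A": ["C"],
    "B": ["C"],
    "C": ["product"],
}
print(knockout_robustness(toy, "nutrient", "product"))  # 2 of 3 knockouts survive
```

In this toy case A and B are redundant with one another, while C is essential, so robustness under this measure is two thirds.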

Connectivity is a less appreciated aspect of evolution. Yet in terms of functional and conceptual unity, connectivity is an essential component of evolving systems that produce innovation. Three aspects of connectivity are most important here: interaction type, density, and connection order. There are a number of interaction types in biological networks that can give rise to innovation. For example, we can look to genetic networks (interaction between genes), breeding networks (interactions between conspecifics), or even the aforementioned neutral networks (interactions between evolutionary configurations) for ways in which connectivity can yield pathways that lead to innovations of high fitness.

An example of a genetic (gene interaction) network. Sometimes this approach is called "hairball science". COURTESY: Figure 1 in [5].

Network density (or the number of interconnections between nodes) allows a greater number of potential innovative solutions to be explored in fewer evolutionary steps and/or less time (depending on your perspective). In the neutral network, a number of "safe" pathways of equivalent fitness are created for the innovating pathway or organism to explore. While the network density is important in facilitating innovation, the connection order of the network is also important. In networks with a short connection order (e.g. small-world networks), even random paths through the network can yield large-scale, non-deleterious change.
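
The effect of density and shortcuts on how quickly a network can be crossed can be sketched with a toy ring network. This example reads "connection order" as average shortest-path length, which is an assumption on my part rather than Wagner's definition, and all function names are hypothetical.

```python
from collections import deque


def ring_lattice(n):
    """Undirected ring: each node is linked to its two nearest neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_edge(graph, a, b):
    """Add an undirected shortcut between nodes a and b."""
    graph[a].add(b)
    graph[b].add(a)


def density(graph):
    """Edges present divided by the number of possible edges."""
    n = len(graph)
    edges = sum(len(nbrs) for nbrs in graph.values()) / 2
    return edges / (n * (n - 1) / 2)


def average_path_length(graph):
    """Mean breadth-first distance over all pairs; assumes a connected graph."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


ring = ring_lattice(12)
print(density(ring), average_path_length(ring))

# Two long-range shortcuts raise density slightly and shorten typical paths.
add_edge(ring, 0, 6)
add_edge(ring, 3, 9)
print(density(ring), average_path_length(ring))
```

Only two added edges are needed to shorten the typical route across the ring, which is the small-world effect in miniature.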

An example of constituent genotypes in a neutral network with reference to robustness. COURTESY: Ricardo Azevedo, Wikipedia.

IV. Multiple origins, multiple solutions.
The final point of Wagner's book is to remind us that innovations can be achieved through multiple unique solutions, and that existing innovations may have had multiple origin points. While evolution also has no unique solution, innovation is particularly subject to parallel evolution. Since this book does not shy away from computational representations, Wagner applies the idea of evolution strategies [6] to illustrate how innovations can be modeled as a five-part process. Specifically, innovation involves trial-and-error, population-based exploration, solutions with multiple origins, a combinatorial structure, and a stochastic process with a mutation-selection structure. Since these factors also define the evolutionary process, we might say that innovation is inseparable from evolution by natural selection. Thinking more broadly and considering social systems, cultural innovation might also be best characterized as an evolutionary process. Wagner's approach to innovation as evolvability is a useful and accessible introduction to the topic, and places this new set of concepts and terminology directly into the context of evolution by natural selection.
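
For readers unfamiliar with evolution strategies, the mutation-selection loop can be sketched in a few lines. This is a minimal (mu + lambda) variant with a fixed mutation step size, written for illustration only. Practical evolution strategies usually adapt the step size, and the objective function, parameter values, and names here are all assumptions rather than anything from the book.

```python
import random


def sphere(x):
    """Toy objective to minimize: squared distance from the origin."""
    return sum(v * v for v in x)


def evolution_strategy(objective, dim=3, mu=5, lam=20, sigma=0.3,
                       generations=50, seed=0):
    """Minimal (mu + lambda) evolution strategy with a fixed step size.

    Returns the best objective value per generation (index 0 is the
    initial population).
    """
    rng = random.Random(seed)
    # Population-based exploration: start from mu random candidates.
    population = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(mu)]
    history = [min(objective(ind) for ind in population)]
    for _ in range(generations):
        # Mutation: each offspring is a randomly chosen parent plus noise.
        offspring = []
        for _ in range(lam):
            parent = rng.choice(population)
            offspring.append([v + rng.gauss(0, sigma) for v in parent])
        # Selection: keep the mu best of parents and offspring combined.
        population = sorted(population + offspring, key=objective)[:mu]
        history.append(objective(population[0]))
    return history


history = evolution_strategy(sphere)
print(history[0], history[-1])
```

Because parents compete with their offspring, the best value never gets worse from one generation to the next. This mirrors the point made in Section I that selection preserves innovations rather than creating them.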

NOTES:
[1] For other, briefer reviews, please see: Hoppe, R.B.   Andreas Wagner: Arrival of the Fittest: Solving Evolution’s Greatest Puzzle. Panda's Thumb blog, November 4 (2014) AND Pagel, M.   The Neighborly Nature of Evolution. Nature, 514, 34 (2014).

[2] For an example, please see: Gavrilets, S. and Gravner, J.   Percolation on the Fitness Hypercube and the Evolution of Reproductive Isolation. Journal of Theoretical Biology, 184(1), 51–64 (1997).

[3] For an example, please see: Wagner, A.   Robustness and Evolvability in Living Systems. Princeton University Press (2005).

[4] Li, J., Yuan, Z., and Zhang, Z.   The Cellular Robustness by Genetic Redundancy in Budding Yeast. PLoS Genetics, 6(11), e1001187 (2010).

[5] Magtanong, L. et al.   Dosage suppression genetic interaction networks enhance functional wiring diagrams of the cell. Nature Biotechnology, 29, 505-511 (2011).

[6] Evolution strategies assume that evolution is an optimizing process that results from the iterative application of a mutation-selection model. For more, please see: Schwefel, H-P.   Numerical Optimization of Computer Models. Wiley Press, Chichester (1981) AND Beyer, H-G. and Schwefel, H-P.   Evolution Strategies: A Comprehensive Introduction. Natural Computing, 1(1), 3–52 (2002).

February 24, 2015

Attack of the Bots (quest for the data)!

Is this the signature of an advertising bot invasion? To find out, look over the following three graphs and then compare with this post on a traffic surge involving Bitcoin and Ukraine from Summer 2014.

1) traffic spike over a period of 36 hours.

2) traffic entering blog from a large number of short URLs (some belonging to advertising networks).

3) traffic going to no specific set of posts (traffic patterns are not much different from those of a typical week).


February 12, 2015

Darwin Day Short

He did it all for the finches (and their beaks).

Here's wishing everyone a Happy Darwin Day for 2015 (Darwin's 206th posthumous birthday)! Here is a short pictorial profile. And check out the hashtag #darwinday on Tumblr for more features and events. And, as a bonus, here is a new paper in Nature [1] that uses genome resequencing to better understand the adaptive variation in finch beaks.

Young Darwin

Middle-aged Darwin

Old Man Darwin

NOTES:
[1] Lamichhaney, S. et al.   Evolution of Darwin’s finches and their beaks revealed by genome sequencing. Nature, doi:10.1038/nature14181 (2015).


February 8, 2015

Scientific Paradigm Network


Sometimes, a picture is worth 1000 morsels of food for thought. Here is a map of selected scientific topical categories courtesy of Seed Magazine and Map of Science. In this graph, each of the 776 categories is a paradigm with an epistemological basis. The linkages between them represent shared papers between paradigms. Visit the Information Esthetics website for more information (reprints of the poster version are sold out). Having been originally published in 2006, the arcs of the network (citation information) are a bit out of date. The overall topology, however, is still valid.
