December 30, 2016

New Paper at The Winnower

I have a new paper up at The Winnower, just in time for holiday (New Year's Day) reading. The Winnower is a publishing platform that allows people to post manuscripts and other writing, share them with the general public, and receive feedback. It uses a post-publication peer review system, allowing authors to gather reviews and revise the original submission (winnowing) before a formal doi is assigned (publication). This is my second experience with this type of publication system.

The paper is titled "On Braitenberg's Vehicles, Compound Polygons, and Evolutionary Developmental Structural Complexity", and applies network theory to the analysis of the geometry and spatial composition of biological phenotypes. The paper is currently open for review (which you can submit at the site). I invite you to read and evaluate!

December 1, 2016

Searching for Food and Better Data Science at the Same Time

Two presentations to announce, both of which are happening live on 12/2. The first is the latest OpenWorm Journal Club, happening via YouTube live stream. The title is "The Search For Food", and the talk surveys a recently-published paper on food search behaviors in C. elegans [1].

While the live-stream will be available in near-term perpetuity [2] on YouTube, the talk will begin at 12:45 EST [3]. The abstract is here:
Random search is a behavioral strategy used by organisms from bacteria to humans to locate food that is randomly distributed and undetectable at a distance. We investigated this behavior in the nematode Caenorhabditis elegans, an organism with a small, well-described nervous system. Here we formulate a mathematical model of random search abstracted from the C. elegans connectome and fit to a large-scale kinematic analysis of C. elegans behavior at submicron resolution. The model predicts behavioral effects of neuronal ablations and genetic perturbations, as well as unexpected aspects of wild type behavior. The predictive success of the model indicates that random search in C. elegans can be understood in terms of a neuronal flip-flop circuit involving reciprocal inhibition between two populations of stochastic neurons. Our findings establish a unified theoretical framework for understanding C. elegans locomotion and a testable neuronal model of random search that can be applied to other organisms.
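The flip-flop circuit described in the abstract can be caricatured as a two-state stochastic switch. This is only a toy sketch (not the authors' model): two mutually inhibiting states, with the system flipping between them at fixed per-step probabilities, which yields geometrically distributed dwell times.

```python
import random

def flip_flop(p_switch_a=0.05, p_switch_b=0.2, steps=100000, seed=1):
    """Toy two-state stochastic switch: the system dwells in state 'A'
    (e.g. forward runs) or 'B' (e.g. reorientation) and flips with a
    fixed per-step probability, giving geometric dwell-time distributions."""
    rng = random.Random(seed)
    state, dwell_a, dwell_b = 'A', 0, 0
    for _ in range(steps):
        if state == 'A':
            dwell_a += 1
            if rng.random() < p_switch_a:
                state = 'B'
        else:
            dwell_b += 1
            if rng.random() < p_switch_b:
                state = 'A'
    return dwell_a / steps, dwell_b / steps

frac_a, frac_b = flip_flop()
# Expected occupancy of state A is p_b / (p_a + p_b) = 0.8
```

The occupancy fractions come straight from the ratio of mean dwell times, which is the sense in which a simple flip-flop can account for the balance between search modes.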
The other presentation is one that I will give at the Champaign-Urbana Data Science Users' Group. This will be a bit more informal (20 minutes long), and part of the monthly meeting. The meeting will be live (12 noon CST) at the Enterprise Works building in the University Research Park. The archived slides are located here. The title is "Open Data Science and Theory", and the abstract is here:
Over the past few years, I have been working to develop a way to use secondary data and Open Science practices and standards for the purpose of establishing new systems-level discoveries as well as confirming theoretical propositions. While much of this work has been done in the field of comparative biology, many of the things I will be highlighting will apply to other disciplines. Of particular interest is in how the merger of data science and Open Science principles will facilitate interdisciplinary science.

[1] Subtitle: To boldly go where no worm has gone before. Yup, Star Trek pun. Full reference: Roberts, W. (2016). A stochastic neuronal model predicts random search behaviors at multiple spatial scales in C. elegans. eLife, 5, e12572.

[2] for as long as YouTube exists.

[3] Click here for UTC conversion.

November 21, 2016

Be as Brief as Possible but no Briefer

Nature Highlights article on the Journal of Brief Ideas, which itself is brief.

No, this is not an Einstein quote. But Einstein very well may have submitted to the Journal of Brief Ideas [1], an open-access version of Occam's razor. I just submitted a brief paper called "Playing Games with Ideas: when epistemology pays off", which is the equivalent of a fully-indexed abstract [2]. While some people might find 200 words to be too brief, the Journal allows attachments to be submitted, so the word limit can be circumvented a bit [3].

According to the Journal FAQ, submitting such brief reports is part of establishing something below the current standard for the minimal publishable unit. It is also important for enforcing good scientific citizenship practices [4]. Very short papers have occasionally been published in regular journals. Mathematics papers by Lander and Parkin [5] and Conway and Soifer [6] accomplished mathematical proofs in less than a paragraph (but with multiple figures). Other than these rather mythical examples, it is quite the challenge to integrate a well-formulated idea into the Journal of Brief Ideas' 200 word limit.

[1] Woolston, C. (2015). Journal publishes 200-word papers. Nature, 518, 277.

[2] Indexing done via a digital object identifier (doi) on Zenodo, doi:10.5281/zenodo.167647

[3] If a picture is worth 1000 words, then the Journal of Brief Ideas becomes less brief than its name implies.

[4] Neisseria (2015). All you need to publish in this journal is an idea. Science Made Easy blog, February 13.

[5] Lander, L.J. and Parkin, T.R. (1966). Counterexample to Euler's Conjecture on sums of like powers. Bulletin of the American Mathematical Society, 72(6), 1079.

[6] Conway, J.H. and Soifer, A. (2004). Can n² + 1 unit equilateral triangles cover an equilateral triangle of side > n, say n + ɛ? American Mathematical Monthly, 1.

October 27, 2016

Open Access Week: Working with Secondary Datasets

This is one of two posts in celebration of Open Access Week (on Twitter: #oaweek, #openaccess, #OpenScience, #OpenData). This post will focus on the use of secondary data in scientific discovery.

The analysis of open datasets has become a major part of my research program. There are many sources of secondary data, from web scraping [1] to downloading data from repositories. Likewise, there are many potential uses for secondary data, from meta-analysis to validating simulations [2]. If sufficiently annotated [3], secondary data can be used for conducting new analyses [4], fusion with other relevant data, and data visualization. Access to secondary (and tertiary) data relies on a philosophy of open data amongst researchers, one that has been validated by major funding agencies.

The first step in reusing a dataset is to select datasets that are relevant to the question or phenomenon you are interested in. While data reuse is not synonymous with exploratory data analysis, secondary datasets can be used for a variety of purposes, including for exploratory purposes. It is important to understand what data you need to address your set of issues, why you want to assemble the dataset, and how you want to manage the associated metadata [5]. Examples of data repositories include Dryad Digital Repository, Figshare, or the Gene Expression Omnibus (GEO). It is also important to remember that successful data reuse relies on good data management practices to allow for first-hand data to be applied to new contexts [6].

An example of an archived dataset from the Dryad repository (original analysis published in doi:10.1098/rsos.150333).

Now let's focus on three ways to reuse data. The simplest way to reuse data is to download and reanalyze data from a repository using a technique not used by the first-hand generators of the data. This could be done by using a different statistical model (e.g. Bayesian inference), or by including the data in a meta-analysis (e.g. surveying the effects size across multiple studies of similar design). Such research can be useful in terms of looking at the broader scope of a specific set of research questions.
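As a sketch of one such reanalysis route, here is a minimal fixed-effect (inverse-variance weighted) pooling of effect sizes across studies of similar design; the effect sizes and variances are entirely hypothetical numbers for illustration.

```python
def pooled_effect(effects, variances):
    """Fixed-effect meta-analysis: each study's effect estimate is
    weighted by the inverse of its variance, so more precise studies
    contribute more to the pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_var = 1.0 / sum(weights)
    return pooled, pooled_var

# Three hypothetical studies: effect sizes and their sampling variances
effect, var = pooled_effect([0.30, 0.45, 0.38], [0.01, 0.04, 0.02])
# The pooled estimate is pulled toward the most precise study,
# and the pooled variance is smaller than any single study's variance
```

The same downloaded data could instead be refit with a different statistical model (e.g. Bayesian inference); the pooling step above is just the simplest meta-analytic case.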

The second way is to download data from a repository for the purpose of combining data from multiple sources. This is what is sometimes referred to as data fusion or data integration, and can be done in a number of ways. One way this has been useful in my research has been for comparative analysis, such as computational analyses of gene expression data across different cell types within a species [7], or developmental processes across species [8]. Another potential use of recombined data is to verify models and validate theoretical assumptions. This is particularly relevant for datasets that focus on basic science.
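A minimal sketch of this kind of fusion, assuming two hypothetical annotation tables (invented values) that share gene identifiers as keys:

```python
# Hypothetical records from two repositories, keyed on a shared gene ID
source_a = {'lin-4': {'expr_embryo': 1.2}, 'let-7': {'expr_embryo': 0.4}}
source_b = {'let-7': {'expr_adult': 2.1}, 'unc-22': {'expr_adult': 0.9}}

def fuse(a, b):
    """Inner-join two annotation tables on their shared keys; fusion is
    only meaningful for identifiers present in both sources."""
    return {k: {**a[k], **b[k]} for k in a.keys() & b.keys()}

merged = fuse(source_a, source_b)
# Only 'let-7' appears in both sources, so only it survives the join
```

Real data fusion involves much more curation (unit harmonization, metadata reconciliation), but the join-on-shared-identifiers step is the core of it.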

In fact, the recent technological and cultural changes associated with increased data sharing are enabling broader research questions to be asked. Instead of asking strictly mechanistic questions (what is the effect of x on y), combined datasets enable larger-scale (e.g. systems-level) comparisons (what are the combinatorial effects of all x and all y) across experiments. Doing this in a reductionist manner might take many orders of magnitude more time than assembling and analyzing a well-combined dataset. This also allows us to verify the replicability of single experiments, in addition to applying statistical learning techniques [9] in order to find previously undiscovered relationships between datasets and experimental conditions.

The third way is to annotate and reuse data generated by your own research group [10]. This type of data reuse allows us to engage in data validation, test new hypotheses as we learn more about the problem, and compare different ways of attacking the same problem. The practice of data reuse within your own research group can encourage research continuity that transcends turnover in personnel, encouraging people to make data and methods open and explicit. Internal data reuse also provides educational opportunities, such as giving students hands-on experience analyzing and integrating well-characterized data. Be aware that reusing familiar data still requires extensive annotation of both the data and previous attempts at analysis, and that there is as of yet no culturally-coherent set of standard practices for sharing data [11].

There are a few caveats with respect to successfully using data. As is the case with experimental design, the data should be sufficient to answer the type of questions you would like to answer. This includes going back to the original paper and associated metadata to understand how the data was collected and what it was originally intended to measure. While this does not directly limit what you can do with the data, it is important to understand in terms of combining datasets. There is a need for ways of assessing internal validity for secondary datasets, whether they be single data sources or combinations of data sources.

To learn more about these techniques, please try to earn the Literature Mining badge series hosted by the OpenWorm badge system. You can earn Literature Mining I (working with papers), or both Literature Mining I and II (working with secondary data). Here you will learn about how to use secondary data sources to address scientific questions, as well as the interrelationship between the scientific literature and secondary data sources.

NOTES

[1] Marres, N. and Weltevrede, E. (2012). Scraping the Social? Issues in live social research. Journal of Cultural Economy, 6(3), 313-315. doi:10.1080/17530350.2013.772070

[2] Sargent, R.G. (2013). Verification and validation of simulation models. Journal of Simulation, 7(1), 12–24. doi:10.1057/jos.2012.20.

[3] For an example of how this has been a consideration in the ENCODE project, please see:
Hong, E.L., Sloan, C.A., Chan, E.T., Davidson, J.M., Malladi, V.S., Strattan, J.S., Hitz, B.C., Gabdank, I., Narayanan, A.K., Ho, M., Lee, B.T., Rowe, L.D., Dreszer, T.R., Roe, G.R., Podduturi, N.R., Tanaka, F., Hilton, J.A., and Cherry, J.M. (2016). Principles of metadata organization at the ENCODE data coordination center. Database, pii:bav001. doi: 10.1093/database/bav001.

[4] Church, R.M. (2001). The Effective Use of Secondary Data. Learning and Motivation, 33, 32–45. doi:10.1006/lmot.2001.1098.

[5] One example of this is: Kyoda, K., Tohsato, Y., Ho, K.H.L., and Onami, S. (2014). Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data. Bioinformatics, 31(7), 1044-1052. doi:10.1093/bioinformatics/btu767

[6] Fecher, B., Friesike, S., and Hebing, M. (2015). What drives academic data sharing? PLoS One, 10(2), e0118053. doi:10.1371/journal.pone.0118053.

[7] Alicea, Bradly (2016): Dataset for "Collective properties of cellular identity: a computational approach". Figshare, doi:10.6084/m9.figshare.4082400

[8] Here is an example of a comparative analysis based on data from two secondary datasets: Alicea, B. and Gordon, R. (2016). C. elegans Embryonic Differentiation Tree (10 division events). Figshare, doi:10.6084/m9.figshare.2118049 AND Alicea, B. and Gordon, R. (2016). C. intestinalis Embryonic Differentiation Tree (1- to 112-cell stage). Figshare, doi:10.6084/m9.figshare.2117152

[9] Dietterich, T.G. (2002). Machine Learning for Sequential Data: A Review. Structural, Syntactic, and Statistical Pattern Recognition, LNCS, 2396. doi:10.1007/3-540-70659-3_2

[10] Federer, L.M., Lu, Y-L., Joubert, D.J., Welsh, J., and Brandys, B. (2015). Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff. PLoS One, 10(6), e0129506. doi:10.1371/journal.pone.0129506.

[11] Pampel, H. and Dallmeier-Tiessen, S. (2014). Open Research Data: From Vision to Practice. In "Opening Science", S. Bartling and S. Friesike eds., Pgs. 213-224. Springer Open, Berlin. Dynamic version.

October 24, 2016

Open Access Week: How Am I Doing, Altmetrics?

This is one of two posts in celebration of Open Access Week (on Twitter: #oaweek, #openaccess, #OpenScience). To kick things off, we will go through an informal evaluation of Altmetrics and other indicators of research paper readership.

In this post, I will discuss some quick investigations I did using the Altmetric scoring system (known visually as the number within the multicolored donut). Altmetrics go beyond academic metrics based solely on journal prestige or the number of formal citations in academic papers (e.g. the h-index), and can be used to better understand the full impact of one's work.

The Altmetric donut and its diversity of input sources. The Altmetric score is based on how many interactions your content received from each source medium.

The first exercise I did was to acquire Altmetric donuts for journal articles and preprints for which I did not have such data. This includes venues such as arXiv, Stem Cells and Development, and Principles of Cloning II, which do not feature Altmetric donuts on their pages. Interestingly, the bioRxiv preprint server does, in addition to tracking .pdf download and abstract view counts.

Example of an Altmetric donut in context (top) and readership stats (bottom) from a recent Biology paper for which I am an author. 

Retrieving a donut and data summary from the Altmetric database is easy. You embed a few lines of code (see inset below) into an HTML document, and the donut and score appear where desired. While the donut is most useful for augmenting a publication list, in this case I simply created a test document for collating data from across many papers.

// Formal journal article citation
Alicea, B., Murthy, S., Keaton, S.A., Cobbett, P., Cibelli, J.B., and Suhr, S.T. Defining phenotypic respecification diversity using multiple cell lines and reprogramming regimens. Stem Cells and Development, 22(19), 2641-2654 (2013).
// Code for donut and database call: load the Altmetric badge script, then add
// an embed tag carrying the paper's identifier. Possible identifier attributes
// include data-doi, data-arxiv-id, and data-handle.
<script type="text/javascript" src="https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js"></script>
<div class="altmetric-embed" data-badge-type="donut" data-doi="[this paper's doi]"></div>

In context, the donut can provide useful information about how a given paper is diffusing through the academic internet. In the case of the Stem Cells and Development paper (see code), the paper has an Altmetric score of 9. While the Journal website does not have Altmetric or download data, it does provide a doi identifier and select forward citations.

Examples of the Altmetric database entry (top) and the Journal website (bottom) for the Stem Cells and Development paper.

Similar data exist for a follow-up paper to the Stem Cells and Development paper -- in this case, a preprint involving a specialized quantitative analysis (based on Signal Detection Theory) of the same data. For this paper, we have an arXiv identifier, which provides us with a donut and statistics on the relative popularity of the paper based on age and other similar documents in the Altmetric database.

A typical arXiv article page, in this case for an arXiv preprint related to the Stem Cells and Development paper.

This arXiv preprint comes with code for the analysis, which is posted to Github.

For this particular paper, there is an associated Github repository. Even for preprint repositories with Altmetric and readership data (such as bioRxiv), the integration of Github materials is rather poor, particularly in generating an Altmetric score. This is an opportunity for Github: user statistics linked back to the original paper would be appreciated.

Altmetrics for the same arXiv preprint. We can access data on the sources of the Altmetric score, as well as the attention score in the context of all other tracked documents in the Altmetrics database.

We can also integrate readership data across sources to come up with a picture of how our academic work is being shared, consumed, and diffused. In this example, I will show how data from a blog analytics engine and Altmetric data can be combined. Research blogs are an up-and-coming frontier in Altmetric statistics capture. I have taken two blogrolls (Carnival of Evolution #46 and Carnival of Evolution #70), for which citable versions were posted to Figshare immediately after going live. My blogging platform (Blogger) has readership stats but no Altmetrics, while Figshare has Altmetrics and readership stats for the Figshare version only.

Altmetric data for two blogrolls cross-posted to Figshare, which provides both a doi identifier and an Altmetric donut. There is also view and download information for the Figshare version, which may or may not be inclusive of people viewing such content on the blog site.

Let's look at the Figshare data first. Carnival #46 has an Altmetric score of 10 with 188 views and 58 downloads. By contrast, Carnival #70 has an Altmetric score of 6 with 331 views and 82 downloads. Clearly, direct engagement varies between the two posts, and it is not simply proportional to the score.

Readership statistics for Carnival of Evolution #46 (top) and Carnival of Evolution #70 (bottom). Blog analytics only provide the number of "reads" on the home site since publication.

There is also little relationship between the number of Blogger reads and the Altmetric score (as the Altmetric score does not directly capture this number). Carnival #46 has 7928 reads over roughly 4 years and 7 months. Carnival #70 has 1602 reads over roughly 2 years and 7 months. 
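Normalizing the Blogger counts by post age makes the comparison concrete. A back-of-the-envelope calculation using the figures above:

```python
def reads_per_month(reads, months):
    """Average monthly read rate, normalizing raw counts by post age."""
    return reads / months

rate_46 = reads_per_month(7928, 4 * 12 + 7)  # ~144 reads/month over 55 months
rate_70 = reads_per_month(1602, 2 * 12 + 7)  # ~52 reads/month over 31 months
```

Even corrected for age, Carnival #46 is read at roughly three times the rate of Carnival #70, a gap much larger than the difference in Altmetric scores.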

Even in cases where no Altmetric donut can be generated (such as for book chapters), there are still ways to evaluate an article's reach. In the case of one document-sharing platform, a new feature has been added that allows people to leave a comment when they interact with a document. This is a more qualitative assessment of engagement, but it also gives authors an idea of whether or not "reads" or "views" translate into more than just a passing glance.

Two consumers of a book chapter took time to express their gratitude. Other reasons can be quite interesting as well, particularly when they have to do with educational purposes.

Hope you have enjoyed this exercise. It is not meant to be an exhaustive discussion of the Altmetric evaluation system, nor is it the limit of what can be done with Altmetrics and other tools for tracking your work. While there is clearly more technical work to be done on this front, tools such as the Altmetric API are available. The biggest challenge is building a social economy based on a variety of research outputs. The field is moving quite rapidly, so what I have shown here is likely to be just the beginning.

October 20, 2016

OpenWorm Blog: Announcing the OpenWorm Open House 2016

The content is being cross-posted from the OpenWorm blog, and will be updated periodically.

Hello Everybody!

We want to announce our first Open House for 2016 that will happen on October 25th from 10:30am to 4pm EST (UTC-4) (check here for your timezone), so mark the date on your calendars! The event will be live streamed at this link.

If you were waiting for an opportunity to look at the recent progress we've made across all the projects, this is your chance. During the meeting, many contributors will present flash talks and various demos, so if you are interested in hearing the latest about PyOpenWorm, c302, Sibernetic, Geppetto, the Analysis toolbox, or anything else happening under our roof, don't miss this opportunity!

Click below for the schedule of events.

Streamed Online:

10:30 AM - 11AM: Welcome (Stephen Larson)

Flash talks
          11:00 - 11:05: Recent progress in OpenWorm (Stephen Larson)
          11:10 - 11:15: C. elegans nervous system simulation (Padraig Gleeson)
          11:20 - 11:25: C. elegans body simulation (Andrey Palyanov)
          11:30 - 11:35: OpenWorm Badge System (Chee-Wai Lee)
          11:40 - 11:45: DevoWorm Overview (Bradly Alicea)
          11:50 - 11:55: Neuroinformatics (Rick Gerkin)
          12:00 - 12:05: Geppetto (Matteo Cantarelli)
          12:10 - 12:15: Movement Validation (Michael Currie)
          12:20 - 12:25: WormSim (Giovanni Idili)

On social media channels
          12:30 - 1:30: Social media interactions & break out signup

Streamed online (links to be added)
          1:30 PM - 3:00: Multiple track breakout sessions
                    Morphozoic Tutorial - Tom Portegys

On social media channels
3:00 - 3:30: Wrap up & Social Media Networking

Oh and bring along your nerdy friends, the more the merrier!

Hope to see you there!

The OpenWorm team

September 23, 2016

Learning by Doing, Where Doing is Earning Badges

As members of the OpenWorm Foundation community committee (see previous post), we have been trying to find a means of engaging potential contributors within the context of the various projects. One type of activity is the Badge, a bite-sized [1] learning opportunity that we plan to use both as a certification of competency and as a concrete goal for the various projects. The OpenWorm Badge System is being spearheaded by Chee-Wai Lee, and badges are an emerging method in Educational Technology [2]. More details will be shared with the community by Chee-Wai in the form of a tutorial at the upcoming OpenWorm Open House.

An example of how semantic data on phenotypes can be extracted from the scientific literature. PICTURE:, BLOGPOST: Phenoscape blog

Each badge is designed to impart a specific skill. The OpenWorm badge system currently covers scientific topics (Muscle Model Builder, Hodgkin-Huxley) and research skills (Literature Mining). My contribution is the Literature Mining (LM) series. Literature mining is a technique used to organize the scientific literature, extract useful metadata (e.g. semantic data) from these sources, and identify secondary datasets for re-analysis [3]. Learning skills in Literature Mining will be useful to a wide range of badge earners, particularly those interested in Bioinformatics and Open Science research. These are skills used extensively in the DevoWorm project, and we will be planning more badges on related topical areas in the future.

The first LM badge is focused on working with the scientific literature, while the second (LMII) badge introduces learners to open-access secondary datasets. The only prerequisite is that you must earn Badge I in order to earn Badge II. Both of these badges recently went live, and you may start working on them immediately.

Example of the badge curriculum for LMI. The Badgelist system requires learners to complete each step one at a time, and then request feedback (if applicable) from the Admin (i.e. the instructor).

[1] why not "byte-sized", you say? Well, the Literature Mining badges are almost byte-sized (seven requirements apiece), so you could say that we are headed in that direction!

[2] Ferdig, R. and Pytash, K. (2014).  There's a badge for that. Tech and Learning, February 26.

[3] For examples of how Literature Mining can be useful, please see the Nature site for news on literature mining research.

September 6, 2016

Now Announcing the OpenWorm Open House

OpenWorm Browser. Courtesy Christian Grove, WormBase and Caltech.

About two years ago, I announced the start of the DevoWorm project to the OpenWorm community. Now both OpenWorm and DevoWorm have grown up a bit, with the former (OpenWorm) now being a Foundation and the latter (DevoWorm) resulting in multiple publications. Now we will be celebrating all of the projects that make up the OpenWorm Foundation in an Open House format, taking place in cyberspace and tentatively scheduled for October.

Image courtesy Matteo Farinella: These posters are the outcome of an OpenWorm Kickstarter campaign several years ago.

The details of the schedule are still being worked out, but the format is to include both short, 5-minute talks (Ignite-style) and longer tutorials (45-60 minutes, plus questions). The short talks will highlight the various ongoing projects within OpenWorm, while the tutorials will focus on specific methods or procedures employed by the projects. If you happen to be a project leader or major contributor, I have probably already asked you for content. Interested in either contributing content or attending? Please let me know.

Dr. Stephen Larson (pre-PhD), discussing the connection between Lt. Data and C. elegans at Ignite San Diego.

I have also been involved in committee work for the OpenWorm Foundation. One of the initiatives we are in the process of establishing is the OpenWorm badge system, which is being spearheaded by Dr. Chee-Wai Lee. Currently trendy in the online learning world, this is an experiment in open learning that provides micro-credentials to a global community. Badges are a great way to learn new skills, as well as a means to motivate people's contributions to different projects within OpenWorm. Currently, OpenWorm is offering tutorials on the Hodgkin-Huxley model, the Muscle Model builder, and the Muscle Model explorer. If there are any tutorials you would like to see us offer, or if you think there is a need for a particular skill to be highlighted, please let me know.

August 19, 2016

From Toy Models to Quantifying Mosaic Development

Time travel in the Terminator metaverse. COURTESY: Michael Talley.

Almost two years ago, Richard Gordon and I published a paper in the journal Biosystems called "Toy Models for Macroevolutionary Patterns and Trends" [1]. Now, almost exactly two years later [2], we have published a second paper (not quite a follow-up) called "Quantifying Mosaic Development: towards an evo-devo postmodern synthesis of the evolution of development via differentiation trees of embryos". While the title is quite long, the approach can be best described as computational/statistical evolution of development (evo-devo).

Sketch of a generic differentiation tree, which figures prominently in our theoretical synthesis and analysis. COURTESY: Dr. Richard Gordon.

This paper is part of a special issue in the journal Biology called "Beyond the Modern Evolutionary Synthesis: what have we missed?" and is a product of the DevoWorm project. The paper itself is a hybrid theoretical synthesis/research report, and introduces a variety of comparative statistical and computational techniques [3] that are used to analyze quantitative spatial and temporal datasets representing early embryogenesis. Part of this approach was previewed in our most recent public lecture to the OpenWorm Foundation.

The comparative data analysis involves investigations within and between two species from different parts of the tree of life: Caenorhabditis elegans (Nematode, invertebrate) and Ciona intestinalis (Tunicate, chordate). The main comparison involves different instances of early mosaic development, or a developmental process that is deterministic with respect to cellular fate. We also reference data from the regulative developing Axolotl (Amphibian, vertebrate) in one of the analyses. All of the analyses involve the reuse and analysis of secondary data, which is becoming an important part of the scientific process for many research groups.

One of the techniques featured in the paper is an information-theoretic technique called information isometry [4]. This method was developed within the DevoWorm group, and uses a mathematical representation called an isometric graph to visualize cell lineages organized in different ways (e.g. a lineage tree vs. a differentiation tree). This method is summarized and validated in our paper "Information Isometry Technique Reveals Organizational Features in Developmental Cell Lineages" [4]. Briefly, each level of the cell lineage is represented as an isoline, which contains points of a specific Hamming distance. Here, the Hamming distance measures how far a particular cell's position differs between two alternative cell lineage orderings (the aforementioned lineage and differentiation trees).

An example of an isometric graph from Caenorhabditis elegans, taken from Figure 12 in [5]. The position of a point representing a cell is based on the depth of its node in the cell lineage. The positions of all points are rotated 45 degrees clockwise from a bottom-to-top differentiation tree (in this case) ordering, where the one-cell stage is at the bottom of the graph.
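A toy version of the per-cell Hamming distance calculation might look like this. The binary addresses below are hypothetical stand-ins (the actual encoding in the paper is more involved): each cell gets a bit-string position in each of the two tree orderings, and the Hamming distance counts differing bits.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Hypothetical binary addresses for the same four early C. elegans cells
# in two alternative orderings: lineage tree vs. differentiation tree.
lineage_tree = {'ABa': '00', 'ABp': '01', 'EMS': '10', 'P2': '11'}
differentiation_tree = {'ABa': '00', 'ABp': '10', 'EMS': '01', 'P2': '11'}

distances = {cell: hamming(lineage_tree[cell], differentiation_tree[cell])
             for cell in lineage_tree}
# Cells with distance 0 occupy the same position in both orderings;
# larger distances mark cells the two orderings disagree about
```

In the isometric graph, cells sharing the same Hamming distance fall on the same isoline, which is what makes disagreements between the two orderings visible at a glance.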

A final word on the new Biology paper as it relates to the use of references. Recently, I ran across a paper called "The Memory of Science: Inflation, Myopia, and the Knowledge Network" [6], which introduced me to the statistical definition of citation age. This inspired me to calculate the citation age of all journal references from three papers: Toy Models, Quantifying Mosaic Development, and a Nature Reviews Neuroscience paper from Bohil, Alicea (me), and Biocca, published in 2011. The latter was used as an analytical control -- as it is a review, it should contain papers which are older than the contemporary literature. Here are the age distributions for all three papers.

Distribution of Citation Ages from "Toy Models for Macroevolutionary Patterns and Trends" (circa 2014).

Distribution of Citation Ages from "Quantifying Mosaic Development: Towards an Evo-Devo Postmodern Synthesis of the Evolution of Development Via Differentiation Trees of Embryos" (circa 2016).

Distribution of Citation Ages from "Virtual Reality in Neuroscience Research and Therapy" (circa 2011).

What is interesting here is that both "Toy Models" and "Quantifying Mosaic Development" show a long tail with respect to age, while the review article shows very little in terms of a distributional tail. While there are differences in topical literatures (the VR and associated perceptual literature is not that old, after all) that influence the result, it seems that the recurrent academic Terminators utilize the literature somewhat differently than most contemporary research papers do. While the respect for history is somewhat author- and topic-dependent, it does seem to add an extra dimension to the research.
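Citation age itself is simple to compute: the citing paper's publication year minus each cited reference's publication year. A minimal sketch, using hypothetical reference years:

```python
def citation_ages(citing_year, reference_years):
    """Citation age of each reference: years elapsed between a cited
    work's publication and the citing paper's publication."""
    return [citing_year - y for y in reference_years]

# Hypothetical reference list for a paper published in 2016
ages = citation_ages(2016, [1966, 2004, 2011, 2014, 2015, 2016])
mean_age = sum(ages) / len(ages)
# A single 50-year-old reference produces exactly the kind of
# long distributional tail discussed above
```

Plotting a histogram of these ages per paper is all that was needed to produce the distributions shown above.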

[1] the Toy Models paper was part of a Biosystems special issue called "Patterns in Evolution".

[2] This is a Terminator metaverse reference, in which the Terminator comes back every ten years to cause, effect, and/or stop Judgement Day.

[3] Gittleman, J.L. and Luh, H. (1992). On Comparing Comparative Methods. Annual Review of Ecology and Systematics, 23, 383-404.

[4] Alicea, B., Portegys, T.E., and Gordon, R. (2016). Information Isometry Technique Reveals Organizational Features in Developmental Cell Lineages. bioRxiv, doi:10.1101/062539

[5] Alicea, B. and Gordon, R. (2016). Quantifying Mosaic Development: Towards an Evo-Devo Postmodern Synthesis of the Evolution of Development Via Differentiation Trees of Embryos. Biology, 5(3), 33.

[6] Pan, R.K., Petersen, A.M., Pammolli, F., and Fortunato, S. (2016). The Memory of Science: Inflation, Myopia, and the Knowledge Network. arXiv, 1607.05606.

August 3, 2016

Slate and the Solitary Ethnographic Diagram

While his style and message do not resonate with me at all, I've always thought that Donald Trump's speeches were highly-structured rhetoric. He seems to be using a form of intersubjective signaling [1], understood by a number of constituencies as communicating their values in an authentic manner. Specifically, the speeches have a sentence structure and cadence that can be differentiated both from the literalism of contemporary mainstream society and from the more traditional forms of doublespeak ubiquitous in American politics.

This is why the most recent challenge from Slate Magazine was too good to pass up. The challenge (which has the feel of a Will Shortz challenge): diagram a passage from a Donald Trump speech given on July 21 in Sun City, South Carolina. The passage is as follows:
"Look, having nuclear—my uncle was a great professor and scientist and engineer, Dr. John Trump at MIT; good genes, very good genes, OK, very smart, the Wharton School of Finance, very good, very smart—you know, if you’re a conservative Republican, if I were a liberal, if, like, OK, if I ran as a liberal Democrat, they would say I’m one of the smartest people anywhere in the world—it’s true!—but when you’re a conservative Republican they try—oh, do they do a number—that’s why I always start off: Went to Wharton, was a good student, went there, went there, did this, built a fortune—you know I have to give my like credentials all the time, because we’re a little disadvantaged—but you look at the nuclear deal, the thing that really bothers me—it would have been so easy, and it’s not as important as these lives are (nuclear is powerful; my uncle explained that to me many, many years ago, the power and that was 35 years ago; he would explain the power of what’s going to happen and he was right—who would have thought?), but when you look at what’s going on with the four prisoners—now it used to be three, now it’s four—but when it was three and even now, I would have said it’s all in the messenger; fellas, and it is fellas because, you know, they don’t, they haven’t figured that the women are smarter right now than the men, so, you know, it’s gonna take them about another 150 years—but the Persians are great negotiators, the Iranians are great negotiators, so, and they, they just killed, they just killed us"
Okay, here you go -- an ethnographic-style diagram [2] based on one man, but perhaps instructive of an entire American subculture (click to enlarge). The diagram focuses on the relationship between John and Donald Trump (context-specific braintrust) and a specific worldview of power wielded through nuclear weapons, financial ability, and persuasion.

[1] In this case, intersubjective signaling could be used as a mechanism to reinforce group cohesion, particularly when the group's belief structure is defined by epistemic closure.

[2] Perceived lack of agency shown as red arcs terminated with a dot.

August 1, 2016

Reaction to the Future, part infinity

Alvin Toffler, futurist and author of Future Shock and The Third Wave, died recently at the age of 87. Future Shock and The Third Wave [1] were favorite books of mine when I was in High School, and contain a lot of unexplored themes. Future Shock's main argument was that rapid technological change is accompanied by a number of negative social effects, including reactionary political movements and collective psychocultural responses. As the rate and scope of technological change has increased [2], this shock to human society has become more acute [3]. The Third Wave was more directly related to cultural change, and assumed that the major observed transitions in cultural evolution [4] required profound shifts in sociology, economics, and psychology.

We can see this effect in our own society, particularly with respect to the economy of mind. As a cultural trend, more young people are pursuing a life of creative and/or mental productivity [5]. While some of this productivity is tangible (see the contemporary focus on applications of University research), much of it is not. In particular, there is a strain of austerity thinking [6] that has arisen since 2008 which views intellectual expertise generally, and academic activity specifically, as a superfluous fraud. Since many of these pursuits require public (government) funding and/or provide no immediate tangible return, there is ideological bias at play as well. More generally, future shock can manifest itself as a revolt against modernity.

The blogger Drugmonkey, on the advance of the Mellon Doctrine (among other types of reactionary thinking) in the realm of biomedical science.

Don't be a neo-reactionary! Hint: you don't need to appeal to religion to take this point of view.

The legacy of Toffler's ideas has gotten a bit muddled [7], and it exposes one of the problems with futurism: namely, it is hard to discern solid predictions from quasi-religious pronouncements. The unfortunate event of Toffler's death also coincides with the 50th Anniversary of the first episode of Star Trek (circa 1966). Star Trek's prime directive is an interesting detail of the Starfleet Academy rulebook consistent with Toffler's argument. The prime directive is more directly related to cultural evolution, and states that Starfleet cannot interfere in the normal trajectory of a given culture's development [8]. It is not clear how this works in practice, however, since mere cultural contact can change the trajectory of cultural evolution more than simple exposure to various foreign technologies [9]. On the other hand, if they are adopted, the introduction of single tools or cultural practices can have profound effects on a culture's trajectory.

This concept will take a long time to become culturally consistent. How long? Probably much longer than predicted by Gene Roddenberry (creator of Star Trek).

Some people might argue that Toffler and Roddenberry are simply products of their era (the 20th century, a period of rapid technological change). Their views on the outcomes of change (technological advancement and modern cultural mores) are biased towards a historical positivism. In other words, progressive technological change is inevitable, even if we mediate this path to eventual enlightenment. Yet this view ignores the basic outlines of historical complexity -- cultural and technological complexity does tend to increase, but the process is painful, chaotic, and uneven [10].

[1] Toffler, A.   Future Shock. Random House, 1970 AND Toffler, A.   The Third Wave. Bantam Books, 1980.

[2] this is not necessarily equivalent to the rate of innovation, but rather has to do with the dynamics of technology adoption. For those of you who are familiar with early period (pre-2005) Wired magazine, ideological constructions around the term "neo-Luddite" characterizes the cultural aspect of this effect. For more, please see: Katz, J.   Return of the Luddites. Wired, June 1, 1995.

[3] this can be observational (such as noticing the preponderance of payphones in an several-decades old movie) or more profound (such as automation-related job losses).

[4] the major transitions of cultural evolution may or may not result from directional trends in cultural complexity.

[5] AKA The "yuccie" manifesto. For more, please see: Infante, D.   The hipster is dead, and you might not like who comes next. Mashable, June 09, 2015.

[6] Austerity thinking is associated with an obsession with debt which is underlain by a number of cultural and epistemic biases. There are a number of cultural antecedents that stress the connections between debt and morality, while most if not all cultural traditions are ill-equipped to deal with the logic and technical details of modern economics and finance. This latter point (an incompatibility between cultural traditions and advanced technology) was addressed in the book "Technopoly" by Neil Postman. A similar conceptual gap is also seen amongst popular responses to technologies such as genetic modification, which comes into conflict with many traditional cultural themes involving cleanliness and purity.

[7] There is a fascinating political subtext to how Toffler's ideas played out in society, namely his association with Newt Gingrich and anti-neo-luddite politics in the 1990s. Not particularly in line with Toffler's own views, but definitely a study in historical context. For more, please see: Murphy, T. Newt's New-Age Love Gurus. Mother Jones, January 30, 2012.

[8] For one interesting dissent on the optimality of the prime directive, please see: Clint, E.   The Prime Directive: Star Trek’s doctrine of moral laziness. Skeptic Ink blog, November 4, 2012.

[9] This statement is consistent with a process called "trans-cultural diffusion". For more, please see. Albrecht, K.   Trans-cultural diffusion. September 13, 2013.

[10] The essential lesson from the emerging field of cliodynamics. For more, please see: Turchin, P.   Arise 'cliodynamics'. Nature, 454, 34-35 (2008).

July 24, 2016

Catching up on Free Alife

Here are three Alife-related resources to catch up on, some new and some not yet posted to this blog:

Alife XV just concluded, and was hosted in Cancun by Carlos Gershenson and the Self-organizing Systems Lab at UNAM. The proceedings are available here.

Here are the Proceedings from the previous Alife conference (XIV), held in NYC during the Summer of 2014.

And here is the Spring 2016 issue of Artificial Life journal, which features selected papers from the Alife XIV conference (held in NYC in 2014). Be sure to check out the paper "An Informational Study of the Evolution of Codes and of Emerging Concepts in Populations of Agents", which I reviewed.

July 18, 2016

The Data of Stories, Recent Developments

The following features are cross-posted on Tumbld Thoughts. The first feature is a nice set of resources on the shape of stories, beginning with a lecture (video) by Kurt Vonnegut [1], circa 1985, on the qualitative shape of various narratives.

An Infographic [2] can also be used to show Vonnegut’s story shapes in more detail. As we can see, there are a limited number of story motifs (the function), each with an associated emotional state (the amplitude of the function). In Vonnegut's formulation, these functions are largely qualitative, with no clear statistical validity.

A new paper [3] on the computational study of storytelling makes a more quantitative attempt to characterize the shape and statistics of Vonnegut's functions, applying data mining techniques to a large dataset (over 1,700 narratives from Project Gutenberg) to uncover these patterns.
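The basic idea can be sketched in a few lines of Python. This is my own toy formulation, not the authors' actual pipeline (which uses a large crowd-sourced valence lexicon plus SVD and SVM analyses): slide a window over the text, score each window with a word-valence lexicon, and treat the resulting curve as the story's emotional arc.

```python
def emotional_arc(words, lexicon, window=4):
    """Average word valence over a sliding window of the text.

    Returns one value per window position; the resulting curve
    approximates the 'shape' of the story's emotional trajectory.
    Words absent from the lexicon are treated as neutral (0.0).
    """
    arc = []
    for i in range(len(words) - window + 1):
        chunk = words[i:i + window]
        arc.append(sum(lexicon.get(w.lower(), 0.0) for w in chunk) / window)
    return arc

# Toy valence lexicon: positive words > 0, negative words < 0
toy_lexicon = {"happy": 1.0, "love": 1.0, "sad": -1.0, "death": -1.0}
text = "the hero found love then faced death and grew sad before a happy end"
arc = emotional_arc(text.split(), toy_lexicon)
```

On a real novel, a much larger lexicon and window would be used, and the arc would then be smoothed and clustered across many books to recover the small set of canonical shapes the paper reports.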

The two images above are from Figures 2 (an illustration with Harry Potter) and 4 (the full Support Vector Machine -- SVM -- Analysis) in [3], respectively.

Tangentially, we also have a dataset that describes the career of Robert DeNiro. In fact, we can characterize the self-imposed timelessness of Robert DeNiro in two images [4, 5]. Taken together, these images suggest there are actually two points in time (July 1999 and August 2002) at which Robert DeNiro stopped caring [4].

[1] Kurt Vonnegut on the shape of stories, YouTube.

[2] Infographic by mayaeilam.

[3] Reagan, A.J., Mitchell, L., Kiley, D., Danforth, C.M., and Dodds, P.S. The emotional arcs of stories are dominated by six basic shapes. arXiv, 1606.07772 (2016).

[4] SOURCE: Reddit’s dataisbeautiful

[5] Heisler, Y.   Nine ancient and abandoned websites from the 1990s that are still up and running. BGR, July 24, 2015.

June 15, 2016

Your Strandbeests Want to Engage in Sodaplay

The strandbeest Animaris Ordis in a non-native environment (a visit to MIT).

Several years ago [1], I discovered the wonder that is Theo Jansen's Strandbeests ("beach beasts" in Dutch). Strandbeests are mechatronic creatures, partially designed using evolutionary algorithms and built to roam the sands (or at least to be demonstrated at the beach). Strandbeests mimic the movement patterns of biological animals, despite having only approximations of limbs and joints and no conventional animal muscles. Some of these creatures even have a "stomach" [2].

The wing- and bottle-propelled "stomach" of Animaris percipiere. COURTESY:

A Strandbeest forelimb with each segment in its optimized proportions [3]. Jansen calls these "magic numbers", but in biological terms they more closely resemble allometric scaling.

While there is great artistic (kinetic sculpture) and scientific (biomechanical) value to the Strandbeest, it can also teach us a great deal about the ability of point masses to approximate biological movement. The Strandbeests are reminiscent of another model of movement, this one entirely digital. This model, Sodaplay, is a classic internet-based application first developed around the year 2000. Its Sodaconstructor allowed people to build animated creatures based on point physics and an approximation of muscle activity (via central pattern generation).

Simulated strandbeest on the move. COURTESY: YouTube user petabyte99.

In the sodaplay model, a mass-spring system is used to provide structure to the phenotype. Springs (connectors) are used to approximate muscles and connect point masses, which provide inertial responses to gravity and motion [4]. These connectors can be modulated as desired, going beyond the default sinusoidal response. In general, a networked mass-spring model can be used to examine the geometric effects of a phenotypic configuration. Depending on how the points are arranged, certain ranges of motion are possible. In the case of sodaplay, certain configurations can also lead to certain death (or collapse of the model due to gravitational conditions in the virtual environment).
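A minimal sketch of one such "muscle" (my own toy formulation, not Sodaplay's actual code): two point masses joined by a spring whose rest length oscillates sinusoidally, integrated with explicit Euler steps. The oscillating rest length is the central-pattern-generator-style modulation described above.

```python
import math

def simulate_muscle(steps=200, dt=0.01, k=50.0, m=1.0,
                    base_rest=1.0, amp=0.2, freq=2.0):
    """Two point masses on a spring whose rest length oscillates.

    The sinusoidal rest length plays the role of a Sodaplay muscle:
    the spring actively contracts and extends, driving the masses.
    Returns the separation of the masses at each time step.
    """
    x1, x2 = 0.0, base_rest  # positions of the two point masses
    v1, v2 = 0.0, 0.0        # their velocities
    separations = []
    for step in range(steps):
        t = step * dt
        rest = base_rest + amp * math.sin(2 * math.pi * freq * t)
        stretch = (x2 - x1) - rest
        f = k * stretch              # Hooke's law: restoring force
        v1 += (f / m) * dt           # mass 1 pulled toward mass 2
        v2 += (-f / m) * dt          # mass 2 pulled toward mass 1
        x1 += v1 * dt
        x2 += v2 * dt
        separations.append(x2 - x1)
    return separations

sep = simulate_muscle()
```

A full sodaplay-style creature is just a network of these springs and masses, plus gravity and ground contact; different spring phases and network geometries then yield different gaits (or, in unlucky configurations, collapse).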

An example of the Sodaconstructor (seemingly now defunct). Sodaplay models (for example, Daintywalker) are reliant upon human expertise and perceptual selection [5] rather than natural selection. Nevertheless, this form of constructivist selection has resulted in nearly boundless innovation, and Sodarace allows humans to innovate against a genetic algorithm.

An approximation of quadrupedal gait in Strandbeests by tracing joint and end-effector movement. COURTESY: [6].

UPDATE (6/15):
A regular reader of this blog (Dr. Richard Gordon) provided an insight that the blog's commenting system was not able to post: "It seems to me that Strandbeasts and tensegrity structures are special cases of a broader class of objects, which may be instantiated by cytoskeleton and its motor and attachment proteins".

Indeed, there are some interesting linkages between biomechanical systems and tensegrity structures that have yet to be explored. In the case of Strandbeests, Theo Jansen has actually hit upon very different (but equally functional) biomechanical systems for "limb movement" and "stomach movement". While Strandbeests do not have biological muscle (and its associated biochemistry), nor the ability to produce isometric force, they can still produce powered movements.

As is the case with homoplastic traits (e.g. bird, bat, and insect wings), both purely mechanical and biomechanical systems use identical physical principles (e.g. levers and pulleys) to produce biologically realistic movements.

[1] Alicea, B.   Theo Jansen, Lord of the Strandbeests. Synthetic Daisies, May 28 (2012).

[2] Revisiting this post as well: Alicea, B.   On Rats (cardiomyocytes) and Jellyfish (bodies). Synthetic Daisies blog, August 22 (2012).

[3] Thor, P.   Project 3(Strandbeest). Wikiversity, December 10 (2012).

[4] McOwan, P.W. and Burton, E.J.   Sodarace: Continuing Adventures in Artificial Life. In "Artificial Life Models in Software". M. Komosinski and A. Adamatzky, eds. Chapter 3, 61-77. Springer (2009).

[5] Ostler, E.   Sodaplay Sodaconstructor. Mathematics and Computer Education, Spring (2002).

[6] Walking Strandbeests Dynamics. Online Technical Discussion Groups, Wolfram Community.