October 27, 2016

Open Access Week: Working with Secondary Datasets

This is one of two posts in celebration of Open Access Week (on Twitter: #oaweek, #openaccess, #OpenScience, #OpenData). This post focuses on the use of secondary data in scientific discovery.


The analysis of open datasets has become a major part of my research program. There are many sources of secondary data, from web scraping [1] to downloading data from repositories. Likewise, there are many potential uses for secondary data, from meta-analysis to validating simulations [2]. If sufficiently annotated [3], secondary data can be used to conduct new analyses [4], to fuse with other relevant data, and for data visualization. Access to secondary (and tertiary) data relies on a philosophy of open data amongst researchers, which has been validated by major funding agencies.

The first step in reusing data is to select datasets that are relevant to the question or phenomenon you are interested in. While data reuse is not synonymous with exploratory data analysis, secondary datasets can be used for a variety of purposes, exploration among them. It is important to understand what data you need to address your set of issues, why you want to assemble the dataset, and how you want to manage the associated metadata [5]. Examples of data repositories include the Dryad Digital Repository, Figshare, and the Gene Expression Omnibus (GEO). It is also important to remember that successful data reuse relies on good data management practices, which allow first-hand data to be applied to new contexts [6].

An example of an archived dataset from the Dryad repository (original analysis published in doi:10.1098/rsos/150333).

Now let's focus on three ways to reuse data. The simplest way to reuse data is to download and reanalyze data from a repository using a technique not used by the first-hand generators of the data. This could be done by using a different statistical model (e.g. Bayesian inference), or by including the data in a meta-analysis (e.g. surveying effect sizes across multiple studies of similar design). Such research can be useful for looking at the broader scope of a specific set of research questions.
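As a rough illustration of the meta-analysis case, here is a minimal sketch of a fixed-effect (inverse-variance weighted) pooled estimate. The function name and the three effect sizes and variances are hypothetical, chosen only to make the arithmetic easy to follow; a real meta-analysis would also need to consider heterogeneity between studies, which this sketch ignores.

```python
import math


def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    if not effects or len(effects) != len(variances):
        raise ValueError("need at least one study and one variance per effect size")
    if any(v <= 0 for v in variances):
        raise ValueError("sampling variances must be positive")
    # Each study is weighted by the inverse of its sampling variance,
    # so more precise studies contribute more to the pooled estimate.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total_weight
    std_error = math.sqrt(1.0 / total_weight)
    return estimate, std_error


# Hypothetical effect sizes and sampling variances from three studies
# of similar design (illustrative values, not taken from any real dataset).
effects = [0.2, 0.5, 0.8]
variances = [0.04, 0.01, 0.04]

estimate, std_error = pooled_effect(effects, variances)
# Approximate 95% interval under a normal approximation.
low = estimate - 1.96 * std_error
high = estimate + 1.96 * std_error
print(f"pooled effect = {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

The middle study has the smallest variance, so it dominates the pooled value. Swapping in a random-effects model would be a natural next step when the studies differ in design.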

The second way is to download data from a repository for the purpose of combining data from multiple sources. This is sometimes referred to as data fusion or data integration, and can be done in a number of ways. One way this has been useful in my research has been for comparative analysis, such as computational analyses of gene expression data across different cell types within a species [7], or developmental processes across species [8]. Another potential use of recombined data is to verify models and validate theoretical assumptions. This is a particular concern for datasets that focus on basic science.
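One simple form of data fusion is joining two tables on a shared identifier. The sketch below assumes two hypothetical expression tables keyed by gene name (the gene names, cell types, and values are made up for illustration) and keeps only the genes measured in both, while reporting what was left out of each source.

```python
def fuse_by_key(left, right, left_name, right_name):
    """Inner-join two {key: value} tables on their shared keys.

    Returns the fused table plus the keys dropped from each source,
    so that the cost of the merge is visible rather than silent.
    """
    shared = sorted(left.keys() & right.keys())
    fused = {
        key: {left_name: left[key], right_name: right[key]}
        for key in shared
    }
    dropped = {
        left_name: sorted(left.keys() - right.keys()),
        right_name: sorted(right.keys() - left.keys()),
    }
    return fused, dropped


# Hypothetical expression values for two cell types (illustrative only).
neuron = {"geneA": 5.1, "geneB": 0.3, "geneC": 2.2}
muscle = {"geneB": 4.0, "geneC": 2.5, "geneD": 1.1}

fused, dropped = fuse_by_key(neuron, muscle, "neuron", "muscle")
for gene, values in fused.items():
    print(gene, values)
print("not shared:", dropped)
```

Reporting the dropped keys is a deliberate choice here: when datasets come from different labs, the records that fail to match are often as informative as the ones that do.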

In fact, the recent technological and cultural changes associated with increased data sharing are enabling broader research questions to be asked. Instead of asking strictly mechanistic questions (what is the effect of x on y), combined datasets enable larger-scale (e.g. systems-level) comparisons (what are the combinatorial effects of all x and all y) across experiments. Doing this in a reductionist manner might take many orders of magnitude more time than assembling and analyzing a well-combined dataset. This allows us to verify the replicability of single experiments, in addition to applying statistical learning techniques [9] in order to find previously undiscovered relationships between datasets and experimental conditions.

The third way is to annotate and reuse data generated by your own research group [10]. This type of data reuse allows us to engage in data validation, test new hypotheses as we learn more about the problem, and compare different ways of attacking the same problem. The practice of data reuse within your own research group can encourage research continuity that transcends turnover in personnel, encouraging people to make data and methods open and explicit. Internal data reuse also creates educational opportunities, such as providing students with hands-on opportunities to analyze and integrate well-characterized data. Be aware that reusing familiar data still requires extensive annotation of both the data and previous attempts at analysis, and that there is as yet no culturally-coherent set of standard practices for sharing data [11].
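To make the annotation step a bit more concrete, here is one possible sketch of a metadata "sidecar" record stored alongside a data file. The field names, the tiny CSV snippet, and the analysis notes are all hypothetical; this is not an established standard, just one way a group might record what a file contains and what has already been tried with it.

```python
import hashlib
import json


def build_annotation(data, description, variables, prior_analyses):
    """Build a metadata record for a dataset held in memory as bytes."""
    return {
        "description": description,
        # The checksum lets a later user confirm they have the same file
        # that the annotation describes.
        "sha256": hashlib.sha256(data).hexdigest(),
        "n_bytes": len(data),
        "variables": variables,
        "prior_analyses": prior_analyses,
    }


# Hypothetical raw data and annotations (illustrative only).
raw = b"cell_id,division_time_min\nAB,17\nP1,19\n"
record = build_annotation(
    raw,
    description="Example table of cell division times",
    variables={
        "cell_id": "name of the cell",
        "division_time_min": "time of division, in minutes",
    },
    prior_analyses=["summary statistics only; no model fitted yet"],
)

# Serialize with sorted keys so the sidecar is stable under version control.
sidecar = json.dumps(record, indent=2, sort_keys=True)
print(sidecar)
```

Keeping a list of previous analyses in the record addresses the point above: the next person to pick up the data can see what has already been attempted.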


There are a few caveats with respect to successfully reusing data. As is the case with experimental design, the data should be sufficient to answer the type of questions you would like to answer. This includes going back to the original paper and associated metadata to understand how the data was collected and what it was originally intended to measure. While this does not directly limit what you can do with the data, it is important to understand when combining datasets. There is a need for ways of assessing internal validity for secondary datasets, whether they be single data sources or combinations of data sources.

To learn more about these techniques, please try to earn the Literature Mining badge series hosted by the OpenWorm badge system. You can earn Literature Mining I (working with papers), or both Literature Mining I and II (working with secondary data). Here you will learn about how to use secondary data sources to address scientific questions, as well as the interrelationship between the scientific literature and secondary data sources.


NOTES (Try accessing the paper dois through http://oadoi.org):

[1] Marres, N. and Weltevrede, E. (2012). Scraping the Social? Issues in live social research. Journal of Cultural Economy, 6(3), 313-315. doi:10.1080/17530350.2013.772070

[2] Sargent, R.G. (2013). Verification and validation of simulation models. Journal of Simulation, 7(1), 12–24. doi:10.1057/jos.2012.20.

[3] For an example of how this has been a consideration in the ENCODE project, please see:
Hong, E.L., Sloan, C.A., Chan, E.T., Davidson, J.M., Malladi, V.S., Strattan, J.S., Hitz, B.C., Gabdank, I., Narayanan, A.K., Ho, M., Lee, B.T., Rowe, L.D., Dreszer, T.R., Roe, G.R., Podduturi, N.R., Tanaka, F., Hilton, J.A., and Cherry, J.M. (2016). Principles of metadata organization at the ENCODE data coordination center. Database, pii:bav001. doi:10.1093/database/bav001.

[4] Church, R.M. (2001). The Effective Use of Secondary Data. Learning and Motivation, 33, 32–45. doi:10.1006/lmot.2001.1098.

[5] One example of this includes: Kyoda, K., Tohsato, Y., Ho, K.H.L., and Onami, S. (2014). Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data. Bioinformatics, 31(7), 1044-1052. doi:10.1093/bioinformatics/btu767.

[6] Fecher, B., Friesike, S., and Hebing, M. (2015). What drives academic data sharing? PLoS One, 10(2), e0118053. doi:10.1371/journal.pone.0118053.

[7] Alicea, Bradly (2016): Dataset for "Collective properties of cellular identity: a computational approach". Figshare, doi:10.6084/m9.figshare.4082400

[8] Here is an example of a comparative analysis based on data from two secondary datasets: Alicea, B. and Gordon, R. (2016). C. elegans Embryonic Differentiation Tree (10 division events). Figshare, doi:10.6084/m9.figshare.2118049 AND Alicea, B. and Gordon, R. (2016). C. intestinalis Embryonic Differentiation Tree (1- to 112-cell stage). Figshare, doi:10.6084/m9.figshare.2117152

[9] Dietterich, T.G. Machine Learning for Sequential Data: A Review. Structural, Syntactic, and Statistical Pattern Recognition, LNCS, 2396. doi:10.1007/3-540-70659-3_2.

[10] Federer, L.M., Lu, Y-L., Joubert, D.J., Welsh, J., and Brandys, B. (2015). Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff. PLoS One, 10(6), e0129506. doi:10.1371/journal.pone.0129506.

[11] Pampel, H. and Dallmeier-Tiessen, S. (2014). Open Research Data: From Vision to Practice. In "Opening Science", S. Bartling and S. Friesike eds., Pgs. 213-224. Springer Open, Berlin. Dynamic version.
