Making neurophysiological data analysis reproducible. Why and how?
Matthieu Delescluse, Romain Franconville, Sébastien Joucla, Tiffany Lieury,
Christophe Pouzat
To cite this version:
Matthieu Delescluse, Romain Franconville, Sébastien Joucla, Tiffany Lieury, Christophe Pouzat. Making neurophysiological data analysis reproducible. Why and how? 2011. hal-00591455v3
HAL Id: hal-00591455
https://hal.archives-ouvertes.fr/hal-00591455v3
Preprint submitted on 1 Sep 2011
Making neurophysiological data analysis reproducible. Why and how?
Matthieu Delescluse (a), Romain Franconville (a,1), Sébastien Joucla (a,2), Tiffany Lieury (a), Christophe Pouzat (a,∗)
(a) Laboratoire de physiologie cérébrale, CNRS UMR 8118, UFR biomédicale, Université Paris-Descartes, 45, rue des Saints-Pères, 75006 Paris, France
Abstract
Reproducible data analysis is an approach that aims to complement classical printed scientific articles with everything required to independently reproduce the results they present. "Everything" covers here: the data, the computer codes and a precise description of how the code was applied to the data. A brief history of this approach is presented first, starting with what economists have been calling replication since the early eighties and ending with what is now called reproducible research in computational, data-analysis-oriented fields like statistics and signal processing. Since efficient tools are instrumental for a routine implementation of these approaches, a description of some of the available ones is presented next. A toy example then demonstrates the use of two open source software programs for reproducible data analysis: the "Sweave family" and the org-mode of emacs. The former is bound to R while the latter can be used with R, Matlab, Python and many more "generalist" data processing software packages. Both solutions can be used with the Unix-like, Windows and Mac families of operating systems. It is argued that neuroscientists could communicate their results much more efficiently by adopting the reproducible research paradigm from their lab books all the way to their articles, theses and books.
Keywords: Software, R, emacs, Matlab, Octave, LaTeX, org-mode, Python
1. Introduction
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

Thoughts of Jon Claerbout, "distilled" by Buckheit and Donoho (1995).
∗ Corresponding author
Email addresses: matthieu.delescluse@polytechnique.org (Matthieu Delescluse), franconviller@janelia.hhmi.org (Romain Franconville), sebastien.joucla@parisdescartes.fr (Sébastien Joucla), tiffany.lieury@parisdescartes.fr (Tiffany Lieury), christophe.pouzat@parisdescartes.fr (Christophe Pouzat)
1 Present address: Janelia Farm Research Campus, 19700 Helix Drive, Ashburn, VA 20147, USA
2 Present address: Institut de Neurosciences Cognitives et Intégratives d'Aquitaine, CNRS UMR 5287, Université de Bordeaux, Bâtiment B2 - Biologie Animale - 4ème étage, Avenue des Facultés, 33405 Talence cedex, France
Preprint submitted to Journal of Physiology Paris, August 29, 2011
The preparation of manuscripts and reports in neuroscience often involves a lot of data analysis as well as the careful design and realization of figures and tables, in addition to the time spent at the bench doing experiments. The data analysis part can require the setting of parameters by the analyst and often leads to the development of dedicated scripts and routines. Before reaching the analysis stage per se, the data frequently undergo a preprocessing that is rarely documented extensively in the methods section of the paper. When the article includes numerical simulations, key elements of the analysis, like the time step used for conductance-based neuronal models, are often omitted from the description. As readers or referees of articles and manuscripts we are therefore often led to ask questions like:
• What would happen to the analysis (or simulation) results if a given parameter had another value?
• What would be the effect of applying my preprocessing to the data instead of the one used by the authors?
• What would a given figure look like with a log-scale ordinate instead of the linear scale used by the authors?
• What would be the result of applying that same analysis to my own data set?

We can of course all think of a dozen similar questions. The problem is to find a way to address them. Clearly the classical journal article format cannot do the job. Editors cannot publish two versions of each figure to satisfy different readers. Many intricate analysis and modeling methods would require too long a description to fit the usual bounds of the printed paper. This is reasonable, for we all have a lot of different things to do and we cannot afford to systematically look at every piece of work as thoroughly as suggested above. Many people (Claerbout and Karrenbach, 1992; Buckheit and Donoho, 1995; Rossini and Leisch, 2003; Baggerly, 2010; Diggle and Zeger, 2010; Stein, 2010) nevertheless feel uncomfortable with the present way of diffusing scientific information as a canonical (printed) journal article. We suggest that what is needed are more systematic and more explicit ways to describe how the analysis (or modeling) was done.

These issues are not specific to published material. Any scientist after a few years of activity is very likely to have experienced a situation similar to the one we now sketch. A project is ended after an intensive period of work requiring repeated, daily, long sequences of sometimes "tricky" analysis. After six months or one year we need to do a very similar analysis for a related project; but the nightmare scenario starts since we forgot:
• The numerical filter settings we used.
• The detection threshold we used.
• The way to get initial guesses for our nonlinear fitting software to converge reliably.

In other words, given enough time, we often struggle to exactly reproduce our own work. The same mechanisms lead to know-how being lost from a laboratory when a student or a postdoc leaves: the few parameters having to be carefully set for a successful analysis were not documented as such and there is nowhere to find their typical range. This leads to an important loss of time which could ultimately culminate in the abandonment of a project.

We are afraid that similar considerations sound all too familiar to most of our readers. It turns out that the problems described above are not specific to our scientific domain; they seem instead to be rather common at least in the following domains: economics (Dewald et al., 1986; Anderson and Dewald, 1994; McCullough et al., 2006; McCullough, 2006), geophysics (Claerbout and Karrenbach, 1992; Schwab et al., 2000), signal processing (Vandewalle et al., 2009; Donoho et al., 2009), statistics (Buckheit and Donoho, 1995; Rossini, 2001; Leisch, 2002a), biostatistics (Gentleman and Temple Lang, 2007; Diggle and Zeger, 2010), econometrics (Koenker and Zeileis, 2007), epidemiology (Peng and Dominici, 2008) and climatology (Stein, 2010; McShane and Wyner, 2010), where the debate on analysis reproducibility has reached a particularly acrimonious stage. The good news is that researchers in these fields have already worked out solutions to our mundane problem. The reader should not conclude from the above short list that our community is immune to the "reproducibility concern": the computational neuroscience community is also now addressing the problem by proposing standard ways to describe simulations (Nordlie et al., 2009) as well as by developing software like Sumatra3 to make simulations and analyses reproducible. The "whole" scientific community is also giving growing publicity to the problem and its solutions, as witnessed by two workshops taking place in 2011: "Verifiable, reproducible research and computational science", a mini-symposium at the SIAM Conference on Computational Science & Engineering in Reno, NV on March 4, 20114; and "Reproducible Research: Tools and Strategies for Scientific Computing", a workshop in association with Applied Mathematics Perspectives 2011, University of British Columbia, July 13-16, 20115; as well as by "The Executable Paper Grand Challenge" organized by Elsevier6.

In the next section we review some of the already available tools for reproducible research, which include data sharing and software solutions for mixing code, text and figures.

2. Reproducible research tools
2.1. Sharing research results
Reproducing published analysis clearly depends, in
the general case, on the availability of both data and
code. It is perhaps worth reminding at this stage NIH
and NSF grantees of the data sharing policies of these
two institutions: “Data sharing should be timely and no
later than the acceptance of publication of the main findings from the final dataset. Data from large studies can
be released in waves as data become available or as they
are published” (NIH, 2003) and “Investigators are expected to share with other researchers, at no more than
3 http://neuralensemble.org/trac/sumatra/wiki
4 http://jarrodmillman.com/events/siam2011.html.
5 http://www.stodden.net/AMP2011/.
6 http://www.executablepapers.com/index.html.
incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants." (The National Science Foundation, 2011, Chap. VI, Sec. D 4 b). Publishers like Elsevier (Elsevier, 2011, Sec. Duties of Authors) and Nature (ESF, 2007, p. 15) also require that authors make their data available upon request, even if that seems to be mere lip service in some cases (McCullough and McKitrick, 2009, pp. 16-17).

In short, many of us already work and publish in a context where we have to share data, and sometimes code, even if we are not aware of it. The data sharing issue is presently actively debated but the trend set by the different committees and organizations seems clear: data will have to be shared sooner or later. We can expect, or hope, that the same will be true for publicly funded code developments.

2.2. Early attempts: database deposits
The economists took on very early the problem of reproducible analysis, which they usually call replication, of published results: "In 1982, the Journal of Money, Credit and Banking, with financial support from the National Science Foundation, embarked upon the JMCB Data Storage and Evaluation Project. As part of the Project, the JMCB adopted an editorial policy of requesting from authors the programs and data used in their articles and making these programs and data available to other researchers on request." (Dewald et al., 1986). The results turned out to be scary, with just 2 out of 54 (3.7%) papers having reproducible / replicable results (Dewald et al., 1986). This high level of non-reproducible results was largely due to authors not respecting the JMCB guidelines and not giving access to both data and codes. A more recent study (McCullough et al., 2006) using the 1996-2003 archives of the JMCB found a better – but still small – rate of reproducible results, 14/62 (22%). We do not expect neurobiologists to behave much better in the same context. Policies are obviously not sufficient. This points out the need for dedicated tools making reproducibility as effortless as possible. Besides data sharing platforms, this is a call for reproducible research software solutions. The ideal software should:
• Provide the complete analysis code along with its documentation (anyone working with data and code knows they are useless if the author did not take the time to properly document them) and its theoretical or practical description,
• Allow any user/reader to easily re-run and modify the presented analysis.

We will first present a brief survey of existing tools matching these criteria, before focusing on the two that appear to us the most promising and most generally applicable to neuroscience: Sweave and org-mode. We will then illustrate their use with a "toy example".

2.3. Implementations of reproducible analysis approaches
A first comprehensive solution: The Stanford Exploration Project. Claerbout and Karrenbach (1992), geophysicists of the Stanford Exploration Project, state in their abstract: "A revolution in education and technology transfer follows from the marriage of word processing and software command scripts. In this marriage an author attaches to every figure caption a pushbutton or a name tag usable to recalculate the figure from all its data, parameters, and programs. This provides a concrete definition of reproducibility in computationally oriented research. Experience at the Stanford Exploration Project shows that preparing such electronic documents is little effort beyond our customary report writing; mainly, we need to file everything in a systematic way." Going beyond the database-for-data-and-code concept, they decided to link organically some of the figures of their publications and reports with the data and the code which generated them. They achieved this goal by using TeX (Knuth, 1984b) and LaTeX (Lamport, 1986) to typeset their documents, a flavor of fortran called ratfor (for rational fortran) and C to perform the computations, and cake, a flavor of make, to repeat automatically the succession of program calls required to reproduce any given figure. Colleagues were then given a CD-ROM with the data, the codes, the final document as well as the other required open source software. They could then reproduce the final document – or change it by altering parameter values – provided they had a computer running UNIX as well as the appropriate C and fortran compilers. The approach was comprehensive in the sense that it linked organically the results of a publication with the codes used to generate them. It was nevertheless not very portable since it required computers running UNIX. The Stanford Exploration Project has since made a considerable effort towards portability with their Madagascar project7 (Fomel and Hennenfent, 2007). Madagascar users can work with compiled codes (C, C++, fortran) as well as

7 http://www.reproducibility.org/wiki/Main_Page
with "generalist" packages like Matlab or Octave or with interpreted scripting languages like Python. The use of the package on Windows still requires "motivation" since users have to install and use cygwin – a Linux-like environment for Windows – and this is, in our experience, a real barrier for users. In addition, although the link is "organic", figures and the code which generated them are not kept in the same file.

WaveLab: A more portable solution. Buckheit and Donoho (1995); Donoho et al. (2009), statisticians from Stanford University inspired by their colleague Jon Claerbout, proposed next a "reproducible research" solution based entirely on Matlab in the form of their WaveLab library8. Being wavelet experts, having to talk, work, and write articles with mathematicians and experimentalists, they had to face the problem of keeping a precise track of what the different contributors had done. They also had to frequently change the "details" of their analysis based on the asynchronous feedback of the different contributors – a constraint which naturally called for a fully scripted analysis, with adjustable parameters, as well as scripted figure and table generation. As Matlab users, they naturally solved their problems with this software. The result, the WaveLab library, is usable on Windows, Mac OS and Unix; it accompanies published articles and book chapters and includes Matlab codes, data and the documentation of both. As with Madagascar, codes are distributed separately from the final document. Besides, this approach obviously requires a Matlab license.

Sweave: A comprehensive and portable solution. The first approach we are going to illustrate in this article is portable since it is based on free software: R9 (R Development Core Team, 2010; Ihaka and Gentleman, 1996) and its user contributed packages10 for data analysis, and LaTeX or HTML for typesetting. These software programs are available for every operating system likely to be found in neurobiological laboratories. R is a general data analysis software whose popularity grows every day; it is intensively used by statisticians and sometimes by neurophysiologists (Tabelow et al., 2006; Wallstrom et al., 2007; Pouzat and Chaffiol, 2009; Pippow et al., 2009; Joucla et al., 2010). R has a special function called Sweave (Leisch, 2002a,b, 2003; Rossini and Leisch, 2003) to process specific text files where the text of a document is mixed with the code producing the analysis (calculations, figures, tables) presented by the document. The processed file is a new text document (in LaTeX or HTML) in which the text of the original has been copied verbatim and where the code has been replaced by the results of its execution. Sweave stems from the idea of "Literate Programming" (Knuth, 1984a), which consists in mixing a computer code with its description. The goal is to make the code easily readable and understandable by a human being rather than by a compiler. The human-readable text file with code and description can be preprocessed to generate:
• a file that the compiler can "understand", executing the computations and generating figures;
• a file that the TeX processor can "understand", giving a printable documentation as its output.
Sweave's users have to type their texts in TeX, LaTeX or HTML, but a user contributed package, odfWeave (Kuhn, 2010), allows users to type the text parts with OpenOffice. Examples of both Sweave and odfWeave will be given in the sequel.

Emacs and org-mode: A very versatile solution. We are well aware that the vast majority of our readers are unlikely to give up their usual data analysis software and switch to R just to make their analysis reproducible. We will therefore also detail in this article a solution which appeared recently: the org-mode11 of Emacs12. Emacs (Stallman, 1981) is an extremely powerful text editor. It is open source and runs on nearly every operating system. Emacs has many modes, specific to the editing of text files in different "languages": a C mode to edit C code, an html mode to edit web pages, several TeX and LaTeX modes to edit files in these languages, etc. Since Emacs is extensible and customizable, its users have extended it in a bewildering number of directions over its more than 30 years of existence. We are going to use one of these modes, org-mode, which allows users to type simple texts using ASCII or UTF8 encoding and which can output files in HTML, PDF, DocBook, LaTeX, etc. In other words you can type a very decent LaTeX document with org-mode without having to know the LaTeX syntax. Thanks to the Babel13 extension of org-mode (Schulte and Davison, 2011), users can also mix text with code, exactly like with Sweave. This makes it a tool for reproducible analysis.
8 http://www-stat.stanford.edu/~wavelab/
9 http://www.r-project.org
10 http://cran.at.r-project.org
11 http://orgmode.org/
12 http://www.gnu.org/software/emacs/
13 http://orgmode.org/worg/org-contrib/babel/index.html
Furthermore, users do not have to restrict themselves to code written in R; they can also use Matlab, Octave, Python, etc. (33 languages are supported as of org-mode 7.7). Even more interestingly, they can use different scripting languages in the same document.
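To make this concrete, here is a minimal org-mode file mixing text with a Babel code block (the file content is our own hypothetical illustration, not taken from the authors' material); the header arguments `:results output :exports both` ask Babel to insert the code's printed output into the exported document:

```org
#+TITLE: A toy reproducible analysis

* Detection threshold
We use a threshold of 4 standard deviations.

#+begin_src python :results output :exports both
factor = 4
print("detection threshold: %g SD" % factor)
#+end_src
```

Typing C-c C-c on the block evaluates it and inserts a #+RESULTS: section below it; exporting the file (to LaTeX, HTML, etc.) then carries both the code and its results.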
3. A toy example
We illustrate the reproducible analysis approach on a simple example of Local Field Potential (LFP) detection. The end-product paragraph in a published paper would look roughly like our next sub-section.
3.1. Paper version
Experimental methods. Experiments were performed on mouse embryonic medulla-spinal cord preparations that were isolated at embryonic day 12 (E12) and maintained in culture on 4 × 15 microelectrode arrays (MEAs). After two days in vitro, neuronal activity was recorded at the junction between the medulla and the spinal cord. Data were acquired at 10 kHz, off-line low-pass filtered at 200 Hz and 20-times downsampled. The activity was characterized by slow (several tens of ms) LFPs that were simultaneously recorded on a line of 4 electrodes with an inter-electrode interval of 250 µm.
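As a side note, the preprocessing just described (acquisition at 10 kHz, off-line low-pass filtering at 200 Hz, 20-fold downsampling) can be sketched as follows. This is our illustrative reconstruction in Python/NumPy, not the authors' actual code; in particular the brick-wall FFT filter merely stands in for the unspecified filter used in the experiments:

```python
import numpy as np

def preprocess(x, fs=10000.0, cutoff=200.0, down=20):
    """Low-pass filter x below `cutoff` Hz (brick-wall, in the
    frequency domain) and keep every `down`-th sample."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    X[freqs > cutoff] = 0.0          # zero all components above the cutoff
    y = np.fft.irfft(X, n=x.size)    # back to the time domain
    return y[::down]                 # 10 kHz / 20 = 500 Hz

# a 10 Hz component survives the preprocessing, a 2 kHz component is removed
t = np.arange(10000) / 10000.0
x = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 2000 * t)
y = preprocess(x)
```

Filtering before decimation is what prevents frequencies above the new 250 Hz Nyquist limit from being aliased into the downsampled trace.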
Figure 1: Detection of LFP activity on the first time-derivative of the data. The same scale applies to all channels (vertical bar: 200 µV/ms). Channel-specific detection thresholds are shown as dashed red lines. Detected events are shown as vertical red bars at the bottom.
Events detection. We want to estimate activity latencies with a simple thresholding method. To this end, we use the information redundancy between the signals recorded on each electrode. To enhance the signal-to-noise ratio, the detection was done on the time derivative of the raw data. An LFP was detected when the signal exceeded 4 times the standard deviation of the noise on at least 3 of the 4 channels. The resulting detection is presented on Fig. 1.

A reader of the paper, or the analyst himself coming back to the study months later, may ask the following questions:
• To what extent does taking the derivative of the signal increase the S/N, as the authors claim? Can we have a look at the raw data, since only the time derivative is shown in the paper?
• How sensitive is the detection to the choice of the threshold?
• How sensitive is the detection to the choice of the number of required channels?
• Can I easily test this method on my own data set?
• How would my own method for LFP detection perform on that data set?

These kinds of issues can be dealt with if, together with the paper, the authors provide a document in a "reproducible" format describing precisely the analysis/processing conducted and allowing readers to reproduce the figures, tables and, more generally, the computations of the paper. As we mentioned earlier, several options are available to produce such a document, and we are going to illustrate the idea with two of them: Sweave and org-mode.

3.2. Sweave version
Sweave relies on LaTeX or HTML for editing the text and processes R code.

LaTeX. Like HTML, LaTeX relies on markup: the source file does not look exactly like its output; for instance a web browser automatically interprets something like <i>Warning!</i> and shows Warning! in italics. In the same way the sequence of characters \textit{Warning!} in a LaTeX file will appear in italics in the final PDF document. This might seem annoying to people used to word processing software but it has two immediate advantages: you know exactly where your formatting commands start and end – how many times did you lose patience using a word processor because you did not want the next word you typed to be in italics like the previous one? – and the source file you are working with is a pure text (ASCII or UTF8) file –
meaning it is easy to store, can be directly included in your e-mails and is readily exchanged between different operating systems. LaTeX is unsurpassed for writing equations and for its typographic quality14, and it splits the content of a text (its logical structure) from its layout (the fonts used, the margin sizes, etc.). The motivation of this last feature is that the user / writer should focus on content and logical structure and let LaTeX do the formatting work. LaTeX is a free software program so you can give it a try without ruining yourself. We recommend the TeX Live distribution15 but many other ones are also available. The Not So Short Introduction to LaTeX2e (Oetiker et al., 2011) is an excellent starting point.

R. "R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files."16 It is open source software released and distributed by the R Foundation. "R is being developed for the Unix-like, Windows and Mac families of operating systems"17. R can be run directly, through editors like emacs – the best long-term solution in our opinion – or through sophisticated graphical interfaces like RStudio18 (very useful for beginners). Readers eager to learn R should take a couple of days to go through the lecture notes of Lumley (2006) and through the superb lectures of Ross Ihaka19. Many excellent books have been published showing how to use R to perform specific tasks. A good general reference is Adler (2009).

Sweave. A Sweave file (conventionally ending with a .Rnw or a .Snw extension) looks like a regular LaTeX or HTML file, as shown on Listing 120, except that R code chunks are included. Code chunks start with

<<name,option1=...,option2=...,...>>=

and end with

@

both on a single line. Here name is a label that can be used to refer to this specific code chunk somewhere else in the document. The first code chunk on Listing 1 starts with <<load-data>> (no options are specified). Various options control the way the code chunk is interpreted. For example, the option eval=FALSE in the get-doc-on-readBin chunk tells Sweave to simply print the code without evaluating it. Similarly, the option fig=TRUE can be used to display the output of plotting commands as a figure in the final document. It is clear that by giving access to the .Rnw file, one allows the reader / user to reproduce or modify an analysis with the corresponding figures. Listing 1 shows the whole .Rnw file with the commands required to download the data, compute the time derivatives, detect the events and generate Fig. 1. The first page of the PDF file, LFPdetection.pdf, obtained after processing LFPdetection.Rnw with R (see Appendix A.2 for details) and then processing the resulting LFPdetection.tex with LaTeX, is shown in Fig. 2. Finally, when two "programming" languages like R and LaTeX have to be used together it is extremely useful to have an editor providing facilities for both. The two interfaces mentioned in the previous paragraph, emacs and RStudio, provide such facilities. They do in fact much more, since they integrate the two in the sense that code chunks can be evaluated line by line or "en bloc" within a single interface (there is no need to switch back and forth between a running R program and a LaTeX editor).

odfWeave. Readers already familiar with R but not with LaTeX, or readers simply wanting to learn a single new piece of software at a time, can try the user contributed package odfWeave21. With this package OpenOffice is used instead of LaTeX or HTML for the text parts. The demarcation of the code chunks and their options is identical to the one used in Sweave, as shown on Fig. 3. The processing of these odfWeave files within R is slightly more complicated than that of the .Rnw files, as shown in Appendix A.3. These complications should nevertheless be a small price to pay for readers interested in the concept of reproducible data analysis but not ready to embark on learning two sophisticated languages.

14 See: http://nitens.org/taraborelli/latex
15 http://www.tug.org/texlive/
16 http://cran.r-project.org/doc/FAQ/R-FAQ.html
17 R-FAQ.
18 http://www.rstudio.org/
19 Statistical Computing (Undergraduate); Statistical Computing (Graduate); Information Visualisation. These courses can be accessed through the following page: http://www.stat.auckland.ac.nz/~ihaka/?Teaching
20 This listing has been edited to fit the whole code into a single page. It is therefore harder to read than the original file we worked with (many R commands are put on a single line separated by ";"). The R code itself is also compact, implying it will be difficult to follow for readers unfamiliar with R. The part which includes the generated PDF figure into the final document is missing due to space constraints but all the computations are shown.
\documentclass[a4paper,12pt,english]{article}
\usepackage{fullpage}
\begin{document}
\section{Loading data into \texttt{R}}
The data recorded from 4 electrodes result from a preprocessing briefly described in the main text and are
stored as signed integer coded on 4 Bytes. They must be mutiplied by 0.12715626 on channels 1 and 4 and by
0.01271439 on channels 2 and 3 to get voltage in $\mu$V. They are sampled at 500 Hz and 600 second are stored.
We can then read the data from our web repository and assign them to variable \texttt{Data\ raw} of our
\texttt{work space}:
<<load−data>>=
reposName <− ”http://www.biomedicale.univ−paris5.fr/physcerv/C Pouzat/Data folder/”
dN <− paste(”SpinalCordMouseEmbryo CH”,1:4,”.dat”,sep=””);fullN <− paste(reposName,dN,sep=””)
nSamples <− 500∗600; Data raw <− sapply(fullN, readBin, n=nSamples,what=”integer”)
Data raw <− t(t(Data raw)∗c(0.12715626,0.01271439,0.01271439,0.12715626))
@
For readers unfamiliar with \texttt{R}, the assignment operator ‘‘<−’’ can be replaced by the usual symbol
‘‘=’’. \texttt{R} users can always get the documentation of native and user contributed functions with:
<<get−doc−on−readBin, eval=FALSE>>=
?readBin
@
The time derivatives of the measurements are simply obtained using a difference equation whose precision is
$o(\deltaˆ2)$:
\begin{displaymath} f’(x) = \frac{f(x+\delta) − f(x−\delta)}{2 \, \delta} \end{displaymath}
<<Data derivative>>=
Data derivative <− apply(Data raw,2, function(x) c(0,diff(x,2)/2,0)∗500/1000)
@
Here the unit of \texttt{Data\ derivative} is $\mu$V / ms.
\section{LFP detection}
We are going to detect minima on each channel whose amplitudes are below a \emph{user set} multiple of the
channel standard deviation. We start by computing this quantity for each of the two versions of the data we
might choose to work with, ‘‘raw’’ of ‘‘derivative’’:
<<SD>>=
SD raw <− apply(Data raw, 2, sd); SD derivative <− apply(Data derivative, 2, sd)
@
Here \texttt{SD\ raw} and \texttt{SD\ derivative} are \emph{vectors} with as many elements as
\texttt{Data\ raw} and \texttt{Data\ derivative} have columns, that is, as many elements as recording channels.
We are going to use a threshold of 4 times the standard deviation on each channel:
<<threshold−on−derivative>>=
factor <− 4
@
A inquiring reader could easily make another choice like using a threshold of 3.5:
<<threshold−on−raw, eval=FALSE>>=
factor <− 3.5
@
As explained in the main text we \emph{decided} to identify events as minima exceeding (in absolute value) a
threshold on 3 channels simultaneously. To this end we define a variable, \texttt{activeElecNumber}, which
contains our number of required active channels. The value of this variable can easily be changed:
<<activeElecNumber>>=
activeElecNumber <− 3
@
The detection can now proceed:
<<detect−LFPs>>=
Times <− (1:dim(Data derivative)[1])/500
timeLFP <− Times[apply(t(t(Data derivative)/SD derivative) < −factor,1,sum) >= 3]
@
We decide moreover to keep only detected events which are more than 100 ms apart. When ‘‘too close’’ events are
found, the second one is discarded. This elimination is done recursively starting with the second event:
<<keep-far-apart-events>>=
timeLFP2 <- timeLFP; nbLFP <- length(timeLFP); last <- 1; new <- 2
while (new <= nbLFP) { tDiff <- timeLFP[new]-timeLFP2[last]
  if (tDiff >= 0.1) {last <- last+1; timeLFP2[last] <- timeLFP[new]}; new <- new+1}
timeLFP <- timeLFP2[1:last]; rm(timeLFP2)
@
We can now produce our summary Fig.~\ref{fig:detectionOnDerivativeData} with:
<<make-figure,fig=TRUE>>=
Data_derivativeN <- Data_derivative/diff(range(Data_derivative)); Data_derivativeN.min <- min(Data_derivativeN)
Data_derivativeN <- Data_derivativeN - Data_derivativeN.min; Data_derivativeN <- t(t(Data_derivativeN)-c(0,1,2,3))
thresh <- -factor*SD_derivative/diff(range(Data_derivative)) - Data_derivativeN.min - c(0,1,2,3)
lwr <- 0 - Data_derivativeN.min - c(0,1,2,3); upr <- 2/diff(range(Data_derivative)) - Data_derivativeN.min - c(0,1,2,3)
plot(0,0,type="n",xlab="Time (s)",ylab="",xlim=c(0,600),ylim=c(-3,1),axes=FALSE)
sapply(1:4, function(i) lines(Times, Data_derivativeN[,i], lwd=1))
sapply(1:4, function(i) text(550, 2-i, paste("Channel",i))); abline(h=thresh,col="red",lty=2,lwd=3)
axis(1,at=(0:6)*100,lwd=3); rug(timeLFP,col="red",lwd=5); segments(-5,lwr,-5,upr,lwd=5)
@
\end{document}
Listing 1: LFPdetection.Rnw, code chunks have a pale yellow background, documentation chunks have a white one.
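For readers who do not use R, the three computational steps of Listing 1 (central-difference derivative, detection of minima crossing -factor times the channel SD on at least activeElecNumber channels, and the 100 ms minimum separation) can be sketched in Python with NumPy. The function names and this NumPy translation are our illustrative reconstruction, not part of the original toy example:

```python
import numpy as np

def derivative(data_raw, dt_ms=2.0):
    # Central difference (f(t+dt) - f(t-dt)) / (2*dt), in uV/ms when
    # data_raw is in uV and dt_ms is the 2 ms sampling step at 500 Hz.
    # First and last samples are set to 0, as in the R chunk.
    d = np.zeros_like(data_raw, dtype=float)
    d[1:-1, :] = (data_raw[2:, :] - data_raw[:-2, :]) / (2.0 * dt_ms)
    return d

def detect_lfp(data, sd, factor=4.0, active=3, rate_hz=500.0):
    # Sample times (s) where at least `active` channels are below -factor*SD.
    below = (data / sd) < -factor          # samples x channels, boolean
    hits = below.sum(axis=1) >= active
    times = np.arange(1, data.shape[0] + 1) / rate_hz
    return times[hits]

def enforce_separation(times, min_gap=0.1):
    # Keep events at least `min_gap` s apart; when two events are too
    # close the second one is discarded, scanning forward from the first.
    kept = []
    for t in times:
        if not kept or t - kept[-1] >= min_gap:
            kept.append(float(t))
    return kept
```

With the raw data loaded as a samples-by-channels array, chaining derivative, detect_lfp and enforce_separation mirrors the logic of the successive code chunks of the listing.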
1  Loading data into R

The data recorded from 4 electrodes result from a preprocessing briefly described in the main text and are stored as signed integers coded on 4 bytes. They must be multiplied by 0.12715626 on channels 1 and 4 and by 0.01271439 on channels 2 and 3 to get voltage in µV. They are sampled at 500 Hz and 600 seconds are stored. We can then read the data from our web repository and assign them to the variable Data_raw of our work space:

> reposName <- "http://www.biomedicale.univ-paris5.fr/physcerv/C_Pouzat/Data_folder/"
> dN <- paste("SpinalCordMouseEmbryo_CH", 1:4, ".dat", sep = "")
> fullN <- paste(reposName, dN, sep = "")
> nSamples <- 500 * 600
> Data_raw <- sapply(fullN, readBin, n = nSamples, what = "integer")
> Data_raw <- t(t(Data_raw) * c(0.12715626, 0.01271439, 0.01271439,
+     0.12715626))

For readers unfamiliar with R, the assignment operator "<-" can be replaced by the usual symbol "=". R users can always get the documentation of native and user contributed functions with:

> `?`(readBin)

The time derivatives of the measurements are simply obtained using a difference equation whose precision is o(δ²):

f′(x) = (f(x + δ) − f(x − δ)) / (2δ)

> Data_derivative <- apply(Data_raw, 2, function(x) c(0, diff(x,
+     2)/2, 0) * 500/1000)

Here the unit of Data_derivative is µV / ms.

2  LFP detection

We are going to detect minima on each channel whose amplitudes are below a user set multiple of the channel standard deviation. We start by computing this quantity for each of the two versions of the data we might choose to work with, "raw" or "derivative":

> SD_raw <- apply(Data_raw, 2, sd)
> SD_derivative <- apply(Data_derivative, 2, sd)

Here SD_raw and SD_derivative are vectors with as many elements as Data_raw and Data_derivative have columns, that is, as many elements as recording channels. We are going to use a threshold of 4 times the standard deviation on each channel:

> factor <- 4

An inquiring reader could easily make another choice, like using a threshold of 3.5:

Figure 2: The first page of the PDF file obtained after processing LFPdetection.Rnw (Listing 1 – page 1 corresponds to lines 1 to 41) with R and LaTeX.
Figure 3: The beginning of the odfWeave file covering the first two code chunks shown on Listing 1.

cacheSweave. Analyses or simulations performed for a paper, or whose results simply get archived in a lab book, can be quite long. It therefore becomes interesting, when working with tools like Sweave, to be able to store intermediate results from the code chunks having a long run time. In this way users do not have to recompute everything every time they want to generate a PDF document from their .Rnw file. Saving intermediate results must clearly be done with care, since a code modification in chunk k could change the result of chunk k+j (for j positive), implying that chunk k+j should also be (re)evaluated even if its result was stored. The user contributed package cacheSweave (Peng, 2011) does precisely that: it saves on disk the results of the successive code chunks while keeping track of their dependencies and re-evaluates, following the modification of a single code chunk, as many chunks as necessary to keep the whole analysis/simulation consistent. It is also an obviously useful package when running an analysis in batch mode in a crash prone environment.

3.3. Org-mode version

org-mode is a mode of the emacs editor. org-mode facilitates the implementation of reproducible research in two ways:

• The syntax of the text part is considerably simplified compared to HTML or LaTeX, but perfectly valid source files for both of these languages can be generated directly from the same org-mode source.

• 28 programming languages in addition to R are supported, including Matlab, Octave (an open source clone of the former) and Python.

Emacs. This is the corner stone software of the Free Software Foundation, meaning that it is open source and that it runs on every operating system except some very exotic ones. This editor is a world in itself, but one does not have to know it in depth in order to start using it with org-mode. Going once through the Guided Tour of Emacs22 should be enough for the beginner.

org-mode. Org-mode files (conventionally ending with an .org extension) are quite similar to Sweave files since they are made of textual description and code blocks (the code chunks of org-mode), as shown on Listing 2. Comparing with Listing 1, we see that the beginning of a new section with its title

\section{Loading data into \texttt{R}}

in LaTeX (Listing 1, line 4) becomes (Listing 2, line 3)

* Downloading data with =Python= and...

So section headings in org-mode are introduced by "* " and get properly converted into LaTeX or HTML (or DocBook) depending on the user's choice. Sub-section headings are introduced by "** " and so on. Formatting commands like the "\texttt{R}" (typewriter text) of LaTeX become "=R=" – using "/R/" would produce italics, while "*R*" would produce a bold type face. Greek letters are simply entered as "\mu" for µ, and hyperlinks, like the one to the python web site (Listing 2, line 7), are entered as

[[http://www.python.org/][python]]

but they get immediately reformatted by emacs to appear colored and underlined. In short, org-mode brings most of the structuring and maths editing capabilities of LaTeX with minimal hassle, and it provides output flexibility while preserving the text file nature of the source file. Org-mode files can be opened and edited with any text editor, a feature that can be used with profit for collaborative projects where some participants do not know how to use emacs but still want to be able to modify the text (or the code part).

Code blocks. Source code can be included in org-mode files in the same spirit as, but with a different syntax than, the code chunks of Sweave. The code blocks of org-mode admit more optional arguments since they can fundamentally do "more things" than their Sweave counterparts. First of all, the language used in the code block has to be specified, like in

22 http://www.gnu.org/software/emacs/tour/
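The dependency tracking described for cacheSweave can be sketched in a language-agnostic way: key each chunk on a hash of its own code combined with the keys of its upstream chunks, so that editing chunk k automatically invalidates every chunk k + j downstream. The sketch below is our minimal illustration of that idea, not cacheSweave's actual implementation; all names are invented:

```python
import hashlib

def chunk_key(code, dep_keys):
    # Key = hash of the chunk's code plus the keys of its dependencies,
    # so editing chunk k changes the keys of all chunks depending on it.
    h = hashlib.sha256(code.encode())
    for k in dep_keys:
        h.update(k.encode())
    return h.hexdigest()

def run_chunks(chunks, deps, evaluate, cache):
    # chunks: name -> code, assumed listed in evaluation (topological) order.
    # deps: name -> list of upstream chunk names.
    # evaluate(name, code) recomputes a chunk; cache maps key -> stored result.
    keys, results = {}, {}
    for name, code in chunks.items():
        key = chunk_key(code, [keys[d] for d in deps.get(name, [])])
        if key not in cache:
            cache[key] = evaluate(name, code)  # only stale chunks are rerun
        keys[name] = key
        results[name] = cache[key]
    return results
```

Rerunning with an unchanged source evaluates nothing; changing an upstream chunk's code forces it and all its dependents to be recomputed, which is the consistency property described above.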
#+STYLE: <link rel="stylesheet" href="http://orgmode.org/org.css" type="text/css" />

* Downloading data with =Python= and loading them into =octave=
The data recorded from 4 electrodes result from a preprocessing briefly described in the main text and are
stored as signed integers coded on 4 Bytes. They must be multiplied by 0.12715626 on channels 1 and 4 and by
0.01271439 on channels 2 and 3 to get voltage in \mu V. They are sampled at 500 Hz and 600 seconds are stored.
We will here then download the data from our web repository using [[http://www.python.org/][python]]. To this
end we start by defining a =list= containing the file names under which we want to store the data on our
hard-drive:
#+srcname: dN
#+begin_src python :session *Python* :exports code :results value pp
dN=["SpinalCordMouseEmbryo_CH"+str(i)+".dat" for i in range(1,5)]
dN
#+end_src

#+results: dN
: ['SpinalCordMouseEmbryo_CH1.dat',
:  'SpinalCordMouseEmbryo_CH2.dat',
:  'SpinalCordMouseEmbryo_CH3.dat',
:  'SpinalCordMouseEmbryo_CH4.dat']

After loading the =urllib= library we can proceed and download the data:
#+srcname: download-data
#+begin_src python :session *Python* :exports code :results silent
import urllib
reposName = "http://www.biomedicale.univ-paris5.fr/physcerv/C_Pouzat/Data_folder/"
for n in dN:urllib.urlretrieve(reposName+n,n)
#+end_src

We then load the data in an [[http://www.gnu.org/software/octave/][octave]] session (that's the occasion to
make use of the
[[http://orgmode.org/worg/org-contrib/babel/intro.html#meta-programming-language][meta-programming language]]
capabilities of =org-mode= -- a variable created by =python=, =dN=, is going to be used directly in =octave=):
#+srcname: load-to-octave
#+begin_src octave :session *octave* :exports code :results silent :var fN=dN
nSamples = 500 * 600;
Data_raw = zeros(nSamples,4);
i2v = [0.12715626 0.01271439 0.01271439 0.12715626];
for i=1:4
  fid=fopen(fN(i,:),'r');
  [C,n]=fread(fid,nSamples,'int32');
  fclose(fid);
  Data_raw(:,i)=C*i2v(i);
end
#+end_src

The time derivatives of the measurements are simply obtained using a difference equation whose precision is
o(\delta^{2}): \[ f'(x) = (f(x+\delta) - f(x-\delta))/(2 \delta) \]
#+srcname: Data-derivative
#+begin_src octave :session *octave* :exports code :results silent
Data_derivative = zeros(nSamples,4);
for i=1:4
  Data_derivative(2:(nSamples-1),i)=(Data_raw(3:nSamples,i)-Data_raw(1:(nSamples-2),i))*500/2/1000;
end
#+end_src

Here the unit of =Data_derivative= is \mu V / ms.

* LFP detection
We are going to detect minima on each channel whose amplitudes are below a /user set/ multiple of the channel
standard deviation. We start by computing this quantity for each of the two versions of the data we might
choose to work with, "raw" or "derivative":
#+srcname: SD
#+begin_src octave :session *octave* :exports code :results silent
SD_raw = std(Data_raw);
SD_derivative = std(Data_derivative);
#+end_src

Listing 2: First part of LFPdetection.org, code blocks have a pale yellow background.
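The binary format described in the listing (one file per channel, 32-bit signed integers, a per-channel scale factor to µV) can equally be read in plain Python. The reader below is our illustrative addition, not part of the toy example; it assumes little-endian storage, which is what the Octave fread call relies on on a typical little-endian platform:

```python
import struct

def read_channel(path, n_samples, scale):
    # Read n_samples little-endian int32 values and scale them to microvolts.
    with open(path, "rb") as fid:
        raw = fid.read(4 * n_samples)
    counts = struct.unpack("<%di" % n_samples, raw)
    return [c * scale for c in counts]

# Per-channel ADC-to-microvolt factors, taken from the listing.
i2v = [0.12715626, 0.01271439, 0.01271439, 0.12715626]
```

Calling read_channel once per downloaded file, with nSamples = 500 * 600 and the matching i2v entry, reproduces the Data_raw matrix column by column.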
Figure 4: The first third of the HTML output of LFPdetection.org whose listing is shown on Listing 2.
#+begin_src python

at the beginning of the first code block in Listing 2 (line 11), where the Python language is used. This is the required minimum to open a code block, which is closed by (Listing 2, line 14):

#+end_src

But optional arguments can also be specified, like (Listing 2, line 11):

:session *Python* :exports code

followed by:

:results value pp

Here ":session *Python*" means that the evaluation of the code block will be performed in a Python session within emacs which will outlive the evaluation itself and which will appear in an emacs buffer called "*Python*". This session will be accessible independently of the org-mode file. Variables created by the evaluation, like dN in this code block, can be used and referred to later on in another Python code block with the same :session name (like the second code block in Listing 2, lines 24-28). The optional argument ":exports code" controls what is exported when the LaTeX or HTML output is generated. Here we just want the code part, and not the result of the evaluation, to be exported. The output produced can be seen on Fig. 4.

The argument ":results value pp" controls what happens to the org-mode file when the code block is evaluated. We want here the value of the last expression evaluated by the code block (dN on the second line) and we want it "pretty printed" (pp). Listing 2 shows this value below the line starting with "#+results: dN" (lines 16-20). The reader can see that this value appears only in the org-mode file (Listing 2) and not in its HTML output (Fig. 4). We could have exported both code and results in our HTML file by setting ":exports both". Remark that we have set a name for this code block with "#+srcname: dN". This name is used again in the third code block of Listing 2, starting with (line 35):

#+begin_src octave :session *octave*

and continuing with:

:exports code :results silent :var fN=dN

This third code block uses a different language, Octave23, something we could not do with Sweave. But org-mode allows us to do something pretty remarkable here: we are passing to an octave code block, as a variable value with ":var fN=dN", the result of another code block written in a different language: an instance of meta-programming. This means that within the Octave session the variable fN will be given the output value24 of the code block named dN, regardless of the language in which the latter was written.

Saving intermediate results. Babel proposes a feature similar to what the cacheSweave package of R brings to Sweave: the possibility to "store" intermediate results so that the code blocks generating them are not re-evaluated every time a PDF or HTML output is generated. This can be done by setting the optional argument :cache to yes (the default is no). See Schulte and Davison (2011) as well as the org-mode manual for details.

Thanks to org-mode, Matlab and Python users can also easily implement a reproducible data analysis approach, if they are ready to learn the basics of emacs. Being more interactive than Sweave (because of the difference between what is exported and what is written in the org-mode file upon code block evaluation), org-mode files can be used as a true lab book. The obvious benefit is that scripts and notes are stored in a single file.

4. Conclusions

We have advocated here the "reproducible data analysis / reproducible research" approach, illustrating, with a toy example, several dedicated tools facilitating its implementation. Reproducible research, and more specifically the creation of files mixing text and code blocks, should bring four major benefits:

• Analysis reproducibility.

• Analysis transparency.

• Conservation of results and progress on a daily basis.

• Transmission of the accumulated knowledge inside a group as well as within a scientific community.

23 If we had Matlab we could simply replace Octave by matlab here, change nothing else, and get the same result.

24 One should nevertheless use this functionality being aware that output values, when they exist, are stored as org-table objects, that is, floats are converted to ASCII. That can generate severe memory loads and computation slowdowns with output vectors or matrices containing of the order of 10^5 elements.
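What ":var fN=dN" does can be mimicked outside org-mode to see the principle at work: serialize the producing block's value and inject it into a fresh session of the consuming language under the expected variable name. The sketch below is only an analogy built with JSON and a subprocess (all names are ours); babel's real mechanism goes through org-table objects, as footnote 24 explains:

```python
import json
import subprocess
import sys

def pass_value(value, consumer_code):
    # Hand `value` to a fresh interpreter session as the variable fN,
    # the way ':var fN=dN' injects one block's result into another block.
    payload = json.dumps(value)
    script = ("import sys, json\n"
              "fN = json.loads(sys.stdin.read())\n" + consumer_code)
    out = subprocess.run([sys.executable, "-c", script],
                         input=payload, capture_output=True, text=True)
    return out.stdout

# The value produced by the 'dN' block of Listing 2.
names = ["SpinalCordMouseEmbryo_CH%d.dat" % i for i in range(1, 5)]
```

For instance, pass_value(names, "print(len(fN))") runs the consumer in a separate session, with fN already bound, just as the Octave block of Listing 2 receives dN.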
Analysis reproducibility. The first point should seem totally trivial to an outsider; after all, reproducibility is definitely one of the most basic tenets of the (natural) sciences, so how could we pretend to do science if our work is not reproducible? Sadly, empirical evidence shows that the gap between reality and principles can be significant (Dewald et al., 1986; McCullough et al., 2006; Vandewalle et al., 2009). It is also perhaps worth clarifying our choice of locution, "reproducible data analysis", in addition to the now more common one, "reproducible research". The latter has emerged in fields like statistics and signal processing where the data are (more or less) taken for granted. But when we talk about reproducibility in neuroscience we cover both the reproducibility of the data and the reproducibility of their analysis. We have discussed only the latter here. With that in mind it could in fact be better to come back to the original denomination of economists: replication. Still, an interesting "side effect" of the approach we have been advocating, which requires open access to the raw data, is that experimentalists, having access to others' raw data, can compare them with their own. This should allow the community to spot more easily the most obvious problems concerning data reproducibility.

Analysis transparency. The reproducible data analysis approach obviously facilitates the spread and the selection of efficient methods. It should greatly improve the trustworthiness of codes. Indeed, as soon as fairly complex codes are developed for data analysis, like for anything else, bugs are present. As the open source movement has clearly demonstrated over the years, letting other people see and contribute to code development is a reliable way to "good" software. And proper documentation, an integral part of transparency in our view, greases the wheels.

Results conservation. Thanks to the tools available on several operating systems for several programming languages, reproducible data analysis is becoming simpler to practice on a daily basis. Researchers can therefore use it as a tool for writing their lab books. They can not only keep a written trace of what they tried out, but save in the same document their ideas, comments, codes and settings, not to mention the link to the data. A more systematic and comprehensive approach to analysis archiving should reduce mistakes, simplify the routine analysis of many data sets and allow a straightforward investigation of the effect of parameter settings. Moreover, since those files will contain exhaustive information, analyses will be easier, and faster, to reuse.

Information transmission. Finally, generalizing the reproducible research approach within labs, especially in teams with high turnover, prevents or at least reduces the loss of accumulated knowledge. It is also a perfect medium for transmitting information from one researcher to another; it facilitates team work and collaborations.

Software version dependence. As any computer user knows, software and hardware evolve fast, resulting in stressful experiences where a document that could be easily opened with the last version of a piece of software cannot be opened anymore with the new version. Even with well designed software one can have bad surprises, like a change in the numerical value of a calculation following a library or compiler update (Belding, 2000). This is obviously an issue for the approach advocated here: we could get different results for the "same" analysis after a software update. That implies that reproducible data analysis practitioners should keep rigorous records of the software versions they have been using25. This also encourages these practitioners to use software whose license does not expire and whose "old" versions remain available for a long time; using open source software can help a lot here. Another solution is to build an image of the whole environment one has been using, including the operating system and all the dedicated software, an image that can later be run directly on a suitable virtual machine. This is the solution recommended by two challengers of Elsevier's "Executable Paper Grand Challenge" (Van Gorp and Mazanek, 2011; Brammer et al., 2011).

Raw data and code servers. Independently of the good will of scientists to share their "production" (data, code, etc.) as thoroughly as possible, the problem of the hardware framework required to implement data sharing on a large scale will have to be addressed. Experimentalists recording for hours from tens of extracellular electrodes at 15 to 20 kHz do not necessarily have the server infrastructure and the technical know-how required to make their data available to all after publication. The situation is luckily evolving fast and initiatives like the "International Neuroinformatics Coordinating Facility"26 are now supporting data storage services for the neurosciences27.

25 R's sessionInfo function is very useful in that perspective since it returns the version of R and of all the packages used in a session.
26 http://www.incf.org/.
27 http://datasharing.incf.org/ep/Resources.
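Footnote 25 mentions R's sessionInfo; the same kind of record is easy to keep from any language. Below is a minimal Python sketch (our addition, with invented names) that collects interpreter, platform and module versions for archiving next to the analysis results:

```python
import platform
import sys

def session_info(modules):
    # Collect interpreter and module versions, in the spirit of R's
    # sessionInfo(); store the returned dict alongside the analysis outputs.
    info = {"python": sys.version.split()[0],
            "platform": platform.platform()}
    for m in modules:
        mod = __import__(m)
        info[m] = getattr(mod, "__version__", "unknown")
    return info
```

Writing session_info(["numpy", "scipy"]) (for example) to a file at the end of each analysis run gives the rigorous version record argued for above.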
Copyright issues. If or, let us be optimistic, when reproducible data analysis practices generalize, copyright issues will appear: to what extent could the distribution of the code reproducing a published figure infringe the copyright protecting the figure? The issues on data copyright and use could be even more problematic. We have not touched these questions here since they are extensively discussed in Stodden (2009a,b).

Software. We have chosen to illustrate two types of tools to implement the reproducible data analysis approach: the "Sweave family" and the org mode of emacs. Hopefully our toy example has convinced our readers that one can reasonably easily go from principle to practice. We also give some "serious" examples on our website28. Our experience with both of these tools is that it is really possible to use them systematically when we do data analysis, at every stage of a manuscript preparation, starting with our lab books. Although we illustrated the use of some of the open-source tools available, the reader should not conclude that the ones we did not illustrate are "bad"; we just have little or no experience with them and our space is limited. Proprietary software also exists, like Inference for R29 for using R with Microsoft Office, but we don't have any experience with it. Open-source tools like Dexy30 and Sumatra31 are clearly very promising and have capabilities similar to Sweave and org-mode. Mathematica32 and Sage33 (an open source, Python based, environment running "on top" of many open source mathematical software programs like R and Octave) both include "active notebooks" that can be used for reproducible data analysis. We have moreover focused on "literate programming derived" tools, but alternative, "workflow based", solutions exist, like VisTrails34 and Taverna35. With these approaches, an analysis is constructed and described graphically. The advantage is an easy access for people without a programming background; the drawback is, like with any GUI based system, an inefficiency as soon as relatively complex tasks have to be performed36. Once the workflow has been specified, it can be shared with other researchers, making the analysis reproducible. In bioinformatics, dedicated web sites and servers for workflow sharing are already maintained37.

To conclude: reproducible data analysis is a challenge but a reachable one. The tools are there and we, like others, have been using them for a few years now (Delescluse, 2005). Give it a try, we will all win in the end!

Appendix A. Reproducing the toy example

Appendix A.1. Getting the source files

The source files of the three versions of our toy example are:

• LFPdetection.Rnw: identical to Listing 1 with a few extra lines at the end ensuring a proper inclusion of the generated figure in the final PDF. This file can be edited with any text editor.

• LFPdetection_in.odt: the beginning of this file is shown on Fig. 3; it can be edited with OpenOffice.

• LFPdetection.org: the beginning of this file is shown on Listing 2; it can be edited with any text editor but is best edited with emacs.

These files can be downloaded from: http://www.biomedicale.univ-paris5.fr/physcerv/C_Pouzat/ReproducibleDataAnalysis/.

Appendix A.2. Rnw

To generate LFPdetection.pdf from LFPdetection.Rnw, start R (assuming you have already installed it) from the folder where you have downloaded LFPdetection.Rnw. Type

> Sweave("LFPdetection.Rnw")

Once the command has been evaluated, process the resulting LFPdetection.tex in your folder like any LaTeX file (Oetiker et al., 2011) to get the PDF.

28 http://www.biomedicale.univ-paris5.fr/physcerv/C_Pouzat/ReproducibleDataAnalysis/ReproducibleDataAnalysis.html
29 http://inferenceforr.com/default.aspx
30 http://www.dexy.it/
31 http://neuralensemble.org/trac/sumatra/wiki
32 http://www.wolfram.com/mathematica/
33 www.sagemath.org
34 http://www.vistrails.org/index.php/Main_Page
35 http://www.taverna.org.uk/
36 There is only a very limited number of actions or concepts one can unambiguously specify with boxes and arrows (this paper is written with words, not with boxes and arrows), so users of workflows or GUIs end up spending a lot of time pointing at and clicking on pop-up menus allowing them to specify the required parameters. In the long term, we think that users are better off writing code directly.
37 http://www.myexperiment.org/
Appendix A.3. odfWeave

The package odfWeave allows the processing of an Open Office Writer document mixing text and sweave chunks. This package is not part of R by default, which means that after installing R you will have to install the package with:

> install.packages("odfWeave")

Then start R from the folder where you have downloaded LFPdetection_in.odt. The next step is to load the odfWeave library with:

> library(odfWeave)

Contrary to Sweave, the document settings in odfWeave, such as page dimensions, font settings, figure or table margins, are all defined in a list of options. It is not recommended to change the settings by modifying this list directly, since the default settings would be lost for the session. The pre-existing styles can be accessed by calling the function getStyleDefs and copied into a new variable, that we call here "myNewStyle":

> myNewStyle <- getStyleDefs()

Customisations of styles will only be made on "myNewStyle" with:

> myNewStyle$ttRed$fontColor = "#000000"
> myNewStyle$RlandscapePage$pageWidth <- "8.3in"
> myNewStyle$RlandscapePage$pageHeight <- "11.7in"
> myNewStyle$RlandscapePage$marginLeft <- "1in"
> myNewStyle$RlandscapePage$marginRight <- "1in"
> myNewStyle$RlandscapePage$marginTop <- "1.6in"
> myNewStyle$RlandscapePage$marginBottom <- "1.6in"

where the R commands change the font colour of the displayed code blocks from red to black and define new page margins. The new style definitions are registered by calling the function setStyleDefs:

> setStyleDefs(myNewStyle)

The image format and sizes are specified using getImageDefs and setImageDefs through a similar process:

> imageDefs <- getImageDefs()
> imageDefs$dispWidth <- 4
> imageDefs$dispHeight <- 4
> setImageDefs(imageDefs)

Finally, the input file is compiled by calling the odfWeave function, with the input file name as first argument and the output file name as second argument:

> odfWeave("LFPdetection_in.odt",
      "LFPdetection_out.odt")

Appendix A.4. Org

A note on org-mode. New emacs releases come about once a year while org-mode evolves faster, with two or more releases per year. This means that although org-mode is included in emacs it is unlikely that the org-mode of your emacs is the most recent one. So download the last stable version following the instructions for download and installation given on the org-mode web site38.

Required software. This version of the toy example uses Python39 and Octave40. You will therefore have to make these two software packages available on your computer in order to regenerate this version of the toy example.

Toy example regeneration. After downloading LFPdetection.org, open it in emacs. Press the "key-chord" C-c C-e (where C-c means pressing the Control key and the C key together) and, in the list that appears, select the output format you want: h for HTML, d to generate a PDF and view it immediately. After that emacs will start evaluating the code blocks one after the other, asking you every time to confirm that you want to evaluate them, so answer "yes" every time. After a few "yes" your output file will be ready.

Some tricks. The confirmation asked by emacs upon each code block evaluation can be suppressed by setting the variable org-confirm-babel-evaluate to nil. This can be done by typing in the *scratch* buffer of emacs:

(setq org-confirm-babel-evaluate nil)

before evaluating this expression by placing your cursor just after the closing parenthesis and pressing the key chord C-x C-e. To learn more about the variables controlling the default working of org-mode, read the manual.

Acknowledgments

We thank Jonathan Bradley, Alain Marty, Avner Bar-Hen and Eric Schulte for comments on the manuscript; Gaute Einevoll and Hans Plesser for comments, discussion and for pointing out Sumatra to us; and two anonymous reviewers for constructive comments and additional references/software suggestions which greatly improved the manuscript's scope.

38 http://orgmode.org/
39 http://www.python.org/
40 http://www.gnu.org/software/octave/
References

Adler, J., 2009. R in a Nutshell. O'Reilly.

Anderson, R.G., Dewald, W.G., 1994. Replication and Scientific Standards in Economics a Decade Later: The Impact of the JMCB Project. Working Paper 1994-007C. Federal Reserve Bank of St. Louis. Available at: http://research.stlouisfed.org/wp/more/1994-007/.

Baggerly, K., 2010. Disclose all data in publications. Nature 467, 401.

Belding, T.C., 2000. Numerical replication of computer simulations: Some pitfalls and how to avoid them. Eprint arXiv:nlin/0001057.

Brammer, G.R., Crosby, R.W., Matthews, S.J., Williams, T.L., 2011. Paper mâché: Creating dynamic reproducible science. Procedia Computer Science 4, 658–667. Proceedings of the International Conference on Computational Science, ICCS 2011.

Buckheit, J.B., Donoho, D.L., 1995. Wavelab and Reproducible Research, in: Wavelets and Statistics. Springer. Preprint available at: http://www-stat.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf.

Claerbout, J., Karrenbach, M., 1992. Electronic documents give reproducible research a new meaning, in: Proceedings of the 62nd Annual Meeting of the Society of Exploration Geophysics, pp. 601–604. Available at: http://sepwww.stanford.edu/doku.php?id=sep:research:reproducible:seg92.

Delescluse, M., 2005. Une approche Monte Carlo par Chaînes de Markov pour la classification des potentiels d'action. Application à l'étude des corrélations d'activité des cellules de Purkinje. Ph.D. thesis. Université Pierre et Marie Curie. Available at: http://tel.archives-ouvertes.fr/tel-00011123/fr/.

Dewald, W.G., Thursby, J.G., Anderson, R.G., 1986. Replication in empirical economics: The journal of money, credit, and banking project. American Economic Review 76, 587–603.

Diggle, P.J., Zeger, S.L., 2010. Editorial. Biostatistics 11, 375.

Donoho, D.L., Maleki, A., Rahman, I.U., Shahram, M., Stodden, V., 2009. Reproducible research in computational harmonic analysis. Computing in Science and Engineering 11, 8–18. Preprint available at: http://www-stat.stanford.edu/~donoho/Reports/2008/15YrsReproResch-20080426.pdf.

Elsevier, 2011. Ethical guidelines for journal publication. Web.

ESF, 2007. Shared responsibilities in sharing research data: Policies and partnerships. Report of an ESF-DFG workshop, 21 September 2007. Web. Available at: www.dfg.de/download/pdf/.../sharing_research_data_esf_dfg_0709.pdf.

Fomel, S., Hennenfent, G., 2007. Reproducible computational experiments using scons, in: Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, p. IV1257–IV1260.

…ries / Department of Statistics and Mathematics 60. Department of Statistics and Mathematics, WU Vienna University of Economics and Business, Vienna. Available at: http://epub.wu.ac.at/638/.

Kuhn, M., 2010. odfWeave: Sweave processing of Open Document Format (ODF) files. R package version 0.7.17.

Lamport, L., 1986. LaTeX: A Document Preparation System. Addison-Wesley, Reading, Massachusetts.

Leisch, F., 2002a. Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis, in: Härdle, W., Rönz, B. (Eds.), Compstat 2002 — Proceedings in Computational Statistics, Physica Verlag, Heidelberg, pp. 575–580. Available at: http://www.statistik.uni-muenchen.de/~leisch/Sweave/.

Leisch, F., 2002b. Sweave, Part I: Mixing R and LaTeX. R News 2, 28–31.

Leisch, F., 2003. Sweave, Part II: Package Vignettes. R News 3, 21–24.

Lumley, T., 2006. R Fundamentals and Programming Techniques. Available at: http://faculty.washington.edu/tlumley/Rcourse/.

McCullough, B., McKitrick, R., 2009. Check the Numbers: The Case for Due Diligence in Policy Formation. Research Studies. Fraser Institute. Available at: http://www.fraserinstitute.org/research-news/display.aspx?id=12933.

McCullough, B.D., 2006. Section editor's introduction. Journal of Economic and Social Measurement 31, 103–105. Available at: http://www.pages.drexel.edu/~bdm25/publications.html.

McCullough, B.D., McGeary, K.A., Harrison, T., 2006. Lessons from the JMCB archive. Journal of Money, Credit and Banking 38, 1093–1107. Available at: http://www.pages.drexel.edu/~bdm25/publications.html.

McShane, B.B., Wyner, A.J., 2010. A statistical analysis of multiple temperature proxies: Are reconstructions of surface temperatures over the last 1000 years reliable? To be published in The Annals of Applied Statistics.

NIH, 2003. NIH data sharing brochure. Web. Available at: http://grants.nih.gov/grants/policy/data_sharing/.

Nordlie, E., Gewaltig, M.O., Plesser, H.E., 2009. Towards reproducible descriptions of neuronal network models. PLoS Comput Biol 5, e1000456.

Oetiker, T., Partl, H., Hyna, I., Schlegl, E., 2011. The Not So Short Introduction To LaTeX2e. 5.01 edition. Available at: http://www.ctan.org/tex-archive/info/lshort/english.

Peng, R.D., 2011. cacheSweave: Tools for caching Sweave computations. With contributions from Tobias Abenius. R package version 0.6.

Peng, R.D., Dominici, F., 2008. Statistical Methods for Environmental Epidemiology with R. Use R!, Springer.

Pippow, A., Husch, A., Pouzat, C., Kloppenburg, P., 2009. Differences of Ca(2+) handling properties in identified central olfactory neurons of the antennal lobe. Cell Calcium 46, 87–98.

Pouzat, C., Chaffiol, A., 2009. Automatic Spike Train Analysis and Report Generation. An Implementation with R, R2HTML and STAR. J Neurosci Methods 181, 119–144. Pre-print available at: http://sites.google.com/site/spiketrainanalysiswithr/Home/PouzatChaffiol_JNM_2009.pdf?attredirects=0.

R Development Core Team, 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rossini, A., Leisch, F., 2003. Literate Statistical Practice. UW Biostatistics Working Paper Series 194. University of Washington. Available at: http://www.bepress.com/uwbiostat/paper194/.

Rossini, A.J., 2001. Literate Statistical Analysis, in: Hornik, K.,
Gentleman, R., Temple Lang, D., 2007. Statistical Analyses and
Reproducible Research. Journal of Computational and Graphical
Statistics 16, 1–23. http://pubs.amstat.org/doi/pdf/10.
1198/106186007X178663.
Ihaka, R., Gentleman, R., 1996. R: A Language for Data Analysis
and Graphics. Journal of Graphical and Computational Statistics
5, 299–314.
Joucla, S., Pippow, A., Kloppenburg, P., Pouzat, C., 2010. Quantitative estimation of calcium dynamics from ratiometric measurements: A direct, non-ratioing, method. Journal of Neurophysiology 103, 1130–1144.
Knuth, D.E., 1984a.
Literate programming.
The Computer
Journal 27, 97–111.
Reprint available at: http://www.
literateprogramming.com/knuthweb.pdf.
Knuth, D.E., 1984b. The TeXbook. Addison-Wesley, Reading, Massachusetts.
Koenker, R., Zeileis, A., 2007. Reproducible Econometric Research.
A Critical Review of the State of the Art. Research Report Se-
16
Leisch, F. (Eds.), Proceedings of the 2nd International Workshop
on Distributed Statistical Computing, Vienna, Austria. ISSN 1609395X.
Schulte, E., Davison, D., 2011.
Active document with orgmode.
Computing in Science & Engineering 13, 66–73.
Available at: http://www.cs.unm.edu/~eschulte/data/
CISE-13-3-SciProg.pdf.
Schwab, M., Karrenbach, N., Claerbout, J., 2000. Making scientific computations reproducible.
Computing in Science
& Engineering 6, 61– 67.
Preprinty available at: http:
//sep.stanford.edu/lib/exe/fetch.php?media=sep:
research:reproducible:cip.ps.
Stallman, R.M., 1981.
EMACS: The Extensible, Customizable, Self-Documenting Display Editor.
Technical Report
AIM-519A. MIT Artificial Intelligence Laboratory. Available
at: ftp://publications.ai.mit.edu/ai-publications/
pdf/AIM-519A.pdf.
Stein, M.L., 2010.
Editorial.
Available at: http://www.
e-publications.org/ims/submission/index.php/AOAS/
user/submissionFile/8887?confirm=6adde642.
Stodden, V., 2009a. Enabling reproducible research: Licensing for
scientific innovation. International Journal of Communications
Law and Policy 13.
Stodden, V., 2009b. The legal framework for reproducible research
in the sciences: Licensing and copyright. IEEE Computing in
Science and Engineering 11, 35–40. Available at: http://www.
stanford.edu/~vcs/Papers.html.
Tabelow, K., Polzehl, J., Voss, H., Spokoiny., V., 2006. Analyzing
fmri experiments with structural adaptive smoothing procedures.
NeuroImage 33, 55–62.
The National Science Foundation, 2011. Proposal and award policies and procedure guide. part ii – award & admistration guide.
web. Available at: http://www.nsf.gov/pubs/policydocs/
pappguide/nsf11001/index.jsp.
Van Gorp, P., Mazanek, S., 2011. Share: a web portal for creating
and sharing executable research papers. Procedia Computer Science 4, 589 – 597. Proceedings of the International Conference on
Computational Science, ICCS 2011.
Vandewalle, P., Kovacevic, J., Vetterli, M., 2009. Reproducible research in signal processing - what, why, and how. IEEE Signal
Processing Magazine 26, 37–47. Available at: http://rr.epfl.
ch/17/.
Wallstrom, G., Liebner, J., Kass, R.E., 2007. An Implementation of
Bayesian Adaptive Regression Splines (BARS) in C with S and R
Wrappers. Journal of Statistical Software 26, 1–21.
17