## Information transfer at multiple time-scales

• strict warning: Non-static method view::load() should not be called statically in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/views.module on line 879.
• strict warning: Declaration of views_handler_filter::options_validate() should be compatible with views_handler::options_validate($form, &$form_state) in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/handlers/views_handler_filter.inc on line 589.
• strict warning: Declaration of views_handler_filter::options_submit() should be compatible with views_handler::options_submit($form, &$form_state) in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/handlers/views_handler_filter.inc on line 589.
• strict warning: Declaration of views_handler_filter_boolean_operator::value_validate() should be compatible with views_handler_filter::value_validate($form, &$form_state) in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/handlers/views_handler_filter_boolean_operator.inc on line 149.
• strict warning: Declaration of views_plugin_style_default::options() should be compatible with views_object::options() in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/plugins/views_plugin_style_default.inc on line 25.
• strict warning: Declaration of views_plugin_row::options_validate() should be compatible with views_plugin::options_validate(&$form, &$form_state) in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/plugins/views_plugin_row.inc on line 135.
• strict warning: Declaration of views_plugin_row::options_submit() should be compatible with views_plugin::options_submit(&$form, &$form_state) in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/plugins/views_plugin_row.inc on line 135.
• strict warning: Non-static method view::load() should not be called statically in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/views.module on line 879.
• strict warning: Non-static method view::load() should not be called statically in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/views.module on line 879.
• strict warning: Declaration of views_handler_field_comment::init() should be compatible with views_handler_field::init(&$view,$options) in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/modules/comment/views_handler_field_comment.inc on line 50.
• strict warning: Declaration of views_handler_filter_node_status::operator_form() should be compatible with views_handler_filter::operator_form(&$form, &$form_state) in /customers/c/4/b/danielfurth.com/httpd.www/modules/views/modules/node/views_handler_filter_node_status.inc on line 14.

Summary:
Preliminary remark on the separate presentations held by Randy Gallistel, Greg Jensen, David Freestone during the last meetings of the Information, learning and temporal cognition seminars at Peter Balsam’s lab. I here argue that the concept of a trial is not a theoretical meaningful unit and can not even be sustained as a basis for methods of statistical evaluation.
An alternative 'psychophysical' approach is presented where time is conjectured as a fractal set. It can be shown that the scalar property of interval timing arise from properties of the wavelet function that represents physical time in the time-scale plane. Transfer entropy is introduced to define an asymmetric measurement of mutual information that deals with the third-variable problem, e.g. the influence of deprivation/satiation status.
Transfer entropy and wavelet analysis are combined to examine information transfer in a matching experiment. In agreement with previous studies on matching (Gallistel et al. (2007) 'Is Matching Innate?') it is shown that trial-by-trial (!) feed-back mechanisms are not necessary for matching to occur, and that the majority of information transfer flows from the response trajectory to the reinforcement trajectory at short to medium scales (trial-by-trial level) which is in line with a purely trial-by-trial feed-forward model that makes small-sample estimates of outcomes. It is shown that the sensitivity parameter of the matching law is determined by the extent that receiving a consecutive number of reinforcements (or lack thereof) cues the organism that a new schedule is introduced and that it is time to switch investment ratio (i.e. feedback mechanisms at higher time-scales)

# Introduction

The general question that has occured to most of us in these seminars is how to identify and characterize structure and hidden dependencies between components of a complex dynamical system such as the interaction between environment and the organisms behavior.

In it’s most general formulation the problem has been to characterize the set of functions that maps between the set of responses to that of the set of environmental events, such as a set of stimuli. Both the response and the stimulus is an event extended with time.

But what is an event? The answer to this question depend on whom you ask, the probability theorist will answer that an event is a set of outcomes (measurable subsets of the sample space, constituting a σ-algebra over the sample space itself) to which a probability is assigned. Of importance is that the σ-algebra is subset of the sample space that is closed under complementation and countable unions, an important topic I will return to in a later post (For now; note that a closed set restricts the semantics beyond what is reasonable to assume. The problem is that human everyday-reasoning on topics such as temporal contingency often violates the Law of Excluded Middle, $\large \dpi{80} {\color{Red} \lnot(p\wedge \lnot p)}$, as well as the Law of Double Negation $\large \dpi{80} {\color{Red} p = \lnot \lnot p}$. In future blog posts I will point to the epistemological problems associated with such a representation, especially if one aims at understanding memory).

The physicist may answer that an event indicates a physical situation or occurrence, located at a specific point in space and time (or space-time). The physicist may then go on to measure the distance between points in space and time through the use of appropriate metric tensor. Although, an important subject, physics provides little knowledge to deal with our problem. The physicist is interested in finding structure of our physical reality from measurement provided by our senses. The psychophysicist is interested in the opposite problem: to deduce the structure that constitute our sensory experience from physical measurements.

In my opinion the best answers are provided by philosophers. Bertrand Russell repeatedly pointed out an obvious problem for anyone who tries to build a theory of time perception grounded in the work of spatial perception. We begin with the simple observation that change is an ubiquitous property of our experience. Maybe one would like to claim that there can be change in things and change in events. Things change because they gain and lose properties, this is trivial. However, there cannot be change in events, because events have their properties essentially.

Objects are extended in time: if we encounter an object at a specific point in time we are not experiencing a temporal part of the object. In contrast, we often encounter spatial parts of very large objects as is the case when Crusoe is washed-up on the beach of an island. A process on the other hand is extended in time but we do not at any single moment encounter the whole process only parts of it. Since processes have parts that are themselves events it must be the case that events are characterized by change, in fact events are changes. It should thus be clear that asking if an event changes, that is often done in change-point analysis, is not always a meaningful question. An event ontology is crucial if we identify change as events. If changes are events then we can not include pure spatial variation as changes since we can not maintain the position of relationalism.

Within the framework of change-point analysis learning is conjectured merely as a process of change. The process is fully determined by its parts: events. These events occur in time. More importantly these events are thought to occur on a 'trial'. Therefore the concept of a trial needs to be investigated.

# Trials, sessions and the time-scale domain

The concept of a 'trial' is a convention made by the scientific community and by the scientist that designed the experiment. However, the ontological status of the 'trial' is comparable to any other social convention that splits continuous time into periods of repeated events like days, months, weeks etc. If we define the period of these repeated events by the unit of another temporal measurement we speak of different temporal frequencies.

The question then arise why one should settle for one specific temporal frequency over another. For example, sometimes learning theorists use the trial as the fundamental frequency of analysis and sometimes they use blocks of trials of any arbitrary length, mostly with the purpose to reduce what is regarded as high frequency 'noise' in the learning trajectory. Sometimes the training is clustered into the structure of trials embedded into sessions. The practice of arbitrary cluster trials together into blocks and sessions is thus highly prevalent in the literature, however, one could just as well argue that there exist some unmeasured period of high frequent events between trials, although this has never been put into practice by contemporary learning theorist because of the failure to see that the unit of a 'trial' is just one among an infinite number of possible frequencies of analysis. This fundamental problem has been acknowledge by Gallistel, and can best be expressed in my favorite passage from Memory and the Computational Brain "...life does not consist of trials" (Gallistel & King, 2009).

Nevertheless, what remains to be done is to take the leap of faith to get rid of the concept of a trail altogether, not just remove it as a unit in computational models of the organisms behavior but also ban the concept from the statistical methods by which we evaluate learning performance as well (e.g. change-point analysis). In what follows I will examine what exactly happens when we take this leap of faith, I assure you that the results will be quite informative.

Figure 1.A division of the time-scale plane by discrete-time wavelet series decomposition using ideal octave filter banks. The current time-scale plane, or temporal map, is of length 64 observations with 6 number of scales

# Wavelet filter banks, a non-probabilistic alternative to pacemaker models

All current models of time perception heavily rely on probability representations. However, as I remarked in the introduction probability representations (in fact all boolean algebras) are highly undesirable because of the structure of human contingency judgements. Therefore I will develop a completely deterministic account of time perception where probability theory have no room, n.b: boolean algebra is not (yet) rejected in the current model. We begin by defining the view that physical time is fractal by nature of the measurement procedure, a procedure that results in the time-scale space.

The time-scale space consists of a sequence of nested closed subspaces or time bundles that emit sensory signals $\large \dpi{80} \varphi$ for the brain to encode:

$\large \dpi{80} \cdots V_{2} \subset V_{1} \subset V_{0} \subset V_{-1} \subset V_{-2} \cdots$

With the following properties:

(iii) Time-scale invariance
$\large \dpi{80} f(t) \in V_{m} \rightarrow f(2^{m}t) \in V_{0}$

(iv) Translational ('shift') invariance
$\large \dpi{80} \all n \in \mathbb{Z}: f(t) \in V_{0} \rightarrow f(t-n) \in V_{0}$

(v) Normalized and orthogonal (orthonormal) basis:
$\large \dpi{80} \varphi(t) \in V_{0},\quad \{\varphi(t-n) | n \in \mathbb{Z} \}$

At present I will not comment on methods to verify or reject the empirical status of these axioms. However, the axiom of time-scale invariance is noteworthy. The axiom can be found in the introduction of John Gibbon's seminal paper 'Scalar Timing and Semi-Markov Chains in Free-Operant Avoidance' as equation 1a on page 111. Nevertheless, apart from that specific equation my approach and Gibbon's are very different. Gibbon conjectured time-scale invariance as a fundamental property of a probability distribution, hence randomness was an intricate part of the theory. In the theory that I will outline randomness is not a part of our perception of time. My final aim is a complete geometric representation and deterministic theory of time-perception, not a probabilistic one. At the heart of the theory is the concept of a temporal map represented as the time-scale plane in Figure 1.

The existence of an orthonormal basis in axiom (v) gives us the scaling function, $\large \dpi{80} \varphi(t)$. Together with time-scale invariance and translational invariance we can define the function or sensory time signal $\large \dpi{80} \varphi_{m, n}$ for any integers $\large \dpi{80} n$ and $\large \dpi{80} m$:

$\large \dpi{80} \varphi_{m, n}(t) = 2^{-\frac{m}{2}}\varphi(2^{-m}t-n),$

The factor $\large \dpi{80} 2^{-\frac{m}{2}}$ exists to normalize the norm $\large \dpi{80} \|\varphi_{m, n}\| = 1$ regardless of the choice of $\large \dpi{80} m$ and $\large \dpi{80} n$. It can be proven (Daubechies, ) that when (i)-(v) are satisfied for the sequence of subspaces (time bundles) there exists an orthonormal basis for $\large \dpi{80} L_{2}(\mathbb{R})$:

$\large \dpi{80} \psi_{m, n}(t) = 2^{-\frac{m}{2}}\psi(2^{-m}t-n),$

$\large \dpi{80} \{\psi_{m, n} \}$ serve as an orthonormal basis for another space: the psychometric time-scale space $\large \dpi{80} W_{m}$ which is the orthogonal complement of $\large \dpi{80} V_{m}$ in $\large \dpi{80} V_{m-1}$: $\large \dpi{80} V_{m-1} = V_{m} \oplus W_{m}$. The function $\large \dpi{80} \psi(t)$ is the wavelet function (or the brain's temporal encoding function). Because $\large \dpi{80} \varphi(t) \in V_{0} \subset V_{-1}$ and the functions of the form $\large \dpi{80} 2^{\frac{1}{2}}\varphi(2t-n)$ spans the space $\large \dpi{80} V_{m-1}$ the brain can decompose the sensory temporal signal into $\large \dpi{80} n$ number of components:

$\large \dpi{80} \varphi(t) =\sqrt{2} \displaystyle\sum_{n\in\mathbb{Z}}h_{0}(n) \varphi(2t-n),$

Where $\large \dpi{80} h_{0}(n)$ are normalized coefficients, $\large \dpi{80} \|h_{0}(n)\| = 1$. The preliminary representation has been outlined. next we will examine possible mechanism that could have physical parallels in the brain.

Down- and upsampling

The sensory signals that the brain receives contain too much information to be thoroughly sampled. Downsampling and upsampling are two signal processing techniques which are quite handy when a signal needs to be compressed but with out to much information loss. If we have a discrete input signal $\large \dpi{80} x_{n}$ then the downsampling of $\large \dpi{80} x_{n}$ results in a transformed signal $\large \dpi{80} y$. This can be written as:

$\large \dpi{80} y_{n} = x_{Nn}$

Where $\large \dpi{80} N$ is an integer. If we have a signal that transmit at a rate of 1000 Hz then if $\large \dpi{80} N = 1000$ we have downsampled the signal to 1 Hz. If the opposite is done we speak of upsampling the signal by inserting a bunch of zeros between the sampled intervals:

$\large \dpi{80} y_{n} = x_{n/M} n = kM$

Here $\large \dpi{80} k$ is an integer that gives the new sampling speed. Downsampling and upsampling is performed by a z-transformation of the original sequence of real or complex numbers into a complex frequency-domain representation. Because downsampling of a signal can induce aliasing it is recommended that a filter is used before the signal is downsampled. In a filter bank this is done by an array of different bandpass filters that separate different frequency components.

Two-channel filter bank

The two-channel filter bank transforms a signal, $\large \dpi{80} x_{n}$, into the complex frequency domain, $\large \dpi{80} X(z)$, and produces a perfect reconstruction $\large \dpi{80} \widehat{X}(z)$, where $\large \dpi{80} X(z)$ is the z-transform of $\large \dpi{80} x_{n}$ with $\large \dpi{80} z$ being a complex number. In matrix notation the perfect reconstruction of a two-channel filter bank is given by:

$\large \dpi{80} \hat{X}(z) = \frac{1}{2}\begin{bmatrix} F_{0}(z)\\ F_{1}(z) \end{bmatrix}^{T} \begin{bmatrix} H_{0}(z) & H_{0}(-z) \\ H_{1}(z)& H_{1}(-z) \end{bmatrix}\begin{bmatrix} X(z)\\ X(-z) \end{bmatrix}$

Where $\large \dpi{80} H_{1}(z)$ is a filter which passes only high frequency information further whereas $\large \dpi{80} H_{0}(z)$ is the corresponding low-pass filter. The lowest order high-pass filter gives us the threshold that establish a lower time-scale limit of our temporal perception, equivalent to the pacemaker rate in probabilistic models such as SET. The corresponding $\large \dpi{80} F(z)$ functions denotes the output functions of the system that reconstructs the signal after upsampling, analogous to the decision and memory components in SET. For a schematic of a multi-channel filter bank see Figure 2..

Figure 2.Schematic of a multi-channel filter bank (the neural machinery which can encode fractal time, V, into a wavelet representation in the time-scale plane, W, see Figure 1.). The filter bank can be extended by adding more and more components to filter the signal to the highest time-scale we can comprehend by experience. The mechanism by which the area of the time-scale plane (i.e. temporal map) is realized are open for discussion.

Scalar variability

SET is grounded on the observation that performance in interval timing obeys Weber's law, that the Just-noticeable difference of an interval is proportional to the interval's duration multiplied by a constant factor, the weber coefficient:

It was reasoned that a pacemaker timer with the arrival times drawn from a poisson distribution can not achieve the Weber law. It was hypothesized that small processing glitches in other components such as memory and decision making ad the desired normally distributed timing variability that follows Weber's law.

If one adopts a set of filter banks instead of a single pacemaker timer Weber's law follows naturally. Infact the channel capacity of the filter bank pacemaker system can be completely determined.

According to the axiom of translational scale invariance we see that the wavelet uses a varying width of the difference limen (i.e. the time window or j.n.d.), hence we have a fixed number of cycles $\large \dpi{80} \Delta fT=1$, increasing or decreasing the difference limen will affect the time resolution in the opposite direction in the time domain this is because $\large \dpi{80} \Delta f \Delta t$ is constant.

Therefore, $\large \dpi{80} \Delta f \propto f$ which gives us:

$\large \dpi{80} \frac{\Delta f}{f}=w$

Which is Weber's law in the frequency domain. It can further be shown (upon request I can show this in another blog post by the aid of the Hausdorff-Young inequality) that the channel capacity of the wavelet has the following limit:

$\large \dpi{80} \Delta f \Delta t \geq \frac{1}{2}$

This inequality establish why we experience instantaneous moments.
Inconclusion: the wavelet representation exhibit all the necessary psychophysical properties of interval timing and it can be physically realizable in the brain by, for example, biological mechanism that mimics the behavior of filter banks. Before implementing my view of temporal cognition to data I need to clearify why the concept of mutual information employed by Balsam and Gallistel is not a desirable measurement of contingency: since it is symmetric and subjected to the third-variable problem. This argument will be elaborated in the following section. In addition, the measurement of entropy transfer (Schreiber, 2000) is suggested to replace the use of mutual information.

# Information Transfer

For a discrete and stationary signal $\large \dpi{80} X(t)$, let $\large \dpi{80} p(x)$ be the time independent probability to observe symbol $\large \dpi{80} x$ from alphabet $\large \dpi{80} A$, e.g. $\large \dpi{80} x \in \{1, 2, ..., S\}$ where $\large \dpi{80} S$ denotes the number of symbols in the alphabet. Without taking possible correlations into account the average number of bits needed to optimally encode the signal $\large \dpi{80} X(t)$ is given by the Shannon entropy:

$\large \dpi{80} H_{X} = -\displaystyle\sum\limits_{x=1}^S p(x) \log_{2} p(x), \quad 0 \leq H_{X} \leq \log_{2}S,$

The Shannon entropy includes only the case where we have one possible probability distribution. In the settings of Pavlovian conditioning we however deals with multiple variables such as different stimuli or responses and their dependence. To resolve this we introduce the Kullback-Leibler divergence which is a measure of the expected number of extra bits required to code samples generated by $\large \dpi{80} p(\cdot )$ if we use $\large \dpi{80} q(\cdot )$ instead.

$\large \dpi{80} D_{KL}(P || Q) = \displaystyle\sum\limits_{x=1}^S p(x) \log_{2} \left( \frac{p(x)}{q(x)} \right),$

The Kullback-Leiber divergence resembles a metric distance since it is never negative, $\large \dpi{80} D_{KL}(P || Q) \geqslant 0$, nevertheless it can not be said to be a metric distance since it is asymmetric and does not obey the triangle inequality, properties which the norm derived metrics per definition obey. Hence forth I will simply refer to the Kullback-Leiber divergence as the relative entropy and we hereafter interpret the quantity as a measure of the information gain about the random variable $\large \dpi{80} X$ that is achieved if $\large \dpi{80} p$ can be used instead of $\large \dpi{80} q$.

We can use the relative entropy to measure how much information gain we can get about a random variable $\large \dpi{80} X$ (or random variable $\large \dpi{80} Y$) if we assume that $\large \dpi{80} X$ and $\large \dpi{80} Y$ are dependent instead of independent events.

$\large \dpi{80} I(X; Y) = \displaystyle\sum\limits_{x \in X}^S \displaystyle\sum\limits_{y \in Y}^S p(x, y) \log_{2} \left( \frac{p(x, y)}{p_{1}(x) p_{2}(y)} \right),$

Where $\large \dpi{80} p_{1}(x)$ and $\large \dpi{80} p_{2}(y)$ are the marginal distributions. If the random variables $\large \dpi{80} X$ and $\large \dpi{80} Y$ are independent then $\large \dpi{80} p_{1}(x) \cdot p_{2}(y) = p(x, y)$ and the ratio $\large \dpi{80} \log_{2}(p(x, y)/p_{1}(x) p_{2}(y))$ is 0, hence nothhing is gained by assuming that $\large \dpi{80} X$ and $\large \dpi{80} Y$ are dependent. The relative entropy function that measures the dependencies between random variables is known as the mutual information. One should note that the mutual information is a symmetric measure, it doesn't tell us if the information is gained from variable $\large \dpi{80} X$ or $\large \dpi{80} Y$.

Practically this means that if $\large \dpi{80} Y$ is some measurement of reinforcement delivery and $\large \dpi{80} X$ is some measurement of response elicitation then we cannot discriminate between if information flows from the delivery of reinforcement to the response elicitation (feedback system) or if the information flows purely from the response elicitation to the reinforcement (feedforward system). It has been suggested that these problems can be resolved by providing temporal structure to the measurement of mutual information in a rather ad hoc fashion by using the lagged mutual information;

$\large \dpi{80} I(X(t); Y(t-dt)) = \displaystyle\sum p(x_{t}, y_{t-dt}) \log_{2} \left( \frac{p(x_{t}, y_{t-dt})}{p(x) p(y)} \right),$

Where $\large \dpi{80} d$ denotes the time delay, or lag, between observing $\large \dpi{80} X(t)$ and $\large \dpi{80} Y(t)$. However, the lagged mutual information is not a suitable measurement for examining information flow between two or more time series in the context of learning. The reason is that the lagged mutual information is subjected to the third-variable problem, i.e. any possible relationship between X and Y is not due to the fact that Y provides input to X, or vice versa, but rather there exist a third, unmeasured variable, that affects the relationship between X and Y. Examples of these unmeasured variables could be the animal's status of satiation or deprivation in appetitive conditioning, or simply the general levels of arousal or attentional resources of the organism. In summary: all variables that clearly changes as a function of time into the conditioning protocol but are not necessarily related to the knowledge obtained by the organism are to be considered as 'third-variables'.

What we rather need is an asymmetric entropy measurement which can seperate statistical dependencies originating in the input signal Y from those dependencies which derives from the common history of both signals. This goal can be attained by considering the system as a k order Markov process:

$\large \dpi{80} p(x_{t+dt} | x_{1}, x_{2}, ... , x_{k}) = p(x_{t+dt} | x_{1}, x_{2}, ... , x_{k}, y_{1}, y_{2}, ... , y_{l})$

Where $\large \dpi{80} p( \cdot | \cdot )$ denotes the transition state probability and $\large \dpi{80} k$ and $\large \dpi{80} l$ denotes the length of the delay embedding vectors, or `block length' of the system's memory for Y and X, respectively. The average number of bits needed to encode one additional state of the system if all previous states are known can be formulated by the relative entropy between assuming either $\large \dpi{80} k +1$ or $\large \dpi{80} k$ dimensional delay vectors:

$\large \dpi{80} h_{X}(k) = -\sum p(x_{1}, x_{2}, ... , x_{k+1} ) \log_{2} \left( \frac{p(x_{1}, x_{2}, ... , x_{k}, x_{k+1} ) }{p(x_{1}, x_{2}, ... , x_{k} ) } \right)$

It is easy to see that this is merely the difference between the Shannon entropies of blocks with either length $\large \dpi{80} k +1$ or $\large \dpi{80} k$, that is to say the above equation can be rewritten as:

$\large \dpi{80} h_{X}(k) = H_{X} (k + 1) - H_{X} (k), \quad 0 \leq h_{x}(k) \leq H_{X}$

We will refer to $\large \dpi{80} h_{X}(k)$ as the entropy rate of the system. However, in the limit, as $\large \dpi{80} \lim_{k \to \infty} h_{X}(k) = h_{X}$, which I'll, for pedagogical reasons, refer to as the entropy of the source. In case of periodic and determinate signals $\large \dpi{80} h_{X} = 0$, and for purely stochastic signals $\large \dpi{80} h_{X} = H_{X}$. In practice, the block probabilities, e.g. $\large \dpi{80} p(x_{1}, x_{2}, ... , x_{k} )$, are estimated by the relative frequencies: $\large \dpi{80} p(x_{1}, x_{2}, ... , x_{k} ) = N^{-1} \cdot n(x_{1}, x_{2}, ... , x_{k} )$, where $\large \dpi{80} n(x_{1}, x_{2}, ... , x_{k} )$ denotes the number of occurrences of the sequence $\large \dpi{80} (x_{1}, x_{2}, ... , x_{k} )$ in the embedded series of length $\large \dpi{80} N$.

Obviously we have some severe estimation problems here since the limit $\large \dpi{80} k \rightarrow \infty$ is empirically impossible to achieve, hence we must search for a sufficiently large value of $\large \dpi{80} k$. However, as $\large \dpi{80} k$ increases the number of possible ordered samples with replacement (permutations with repetition) increases $\large \dpi{80} S^{k + l}$ where $\large \dpi{80} S$ is the number of symbols in the alphabet, if $\large \dpi{80} k$ is to large for the particular finite data set to support then the estimate of $\large \dpi{80} h_{X}$ will tend to 0 due to the finite sample effect. Hence, the amount of data needed is proportional to $\large \dpi{80} S^{k + l}$.

This estimation problem is the reason why the proposed method of analysis can not be used in small samples without careful adjustment of parameters. It is also the reason why my work on compiling this document got delayed so much. I would really appreciate any ideas or possible strategies (Bayesian?) to deal with the estimation problem in finite samples.

A simple, and rather ugly strategy, is to reduce the number of symbols, $\large \dpi{80} S$, in the alphabet by coarse graining the signal. One can also set the memory length of the hypothesized input signal $\large \dpi{80} l = 1$ and then choose $\large \dpi{80} k$ to be as large as possible. If this is done then we can estimate the entropy of the source extended to two time series, $\large \dpi{80} X(t)$ and $\large \dpi{80} Y(t)$. We will refer to this quantity as the {transfer entropy or information flow, $\large \dpi{80} T_{Y\rightarrow X}$. This quantity is asymmetric, i.e. in general $\large \dpi{80} T_{Y\rightarrow X} \neq T_{X\rightarrow Y}$. The transfer entropy $\large \dpi{80} T_{Y \rightarrow X}$, or information flow from $\large \dpi{80} Y$ into $\large \dpi{80} X$, is defined as the information gained about future observations of the dependent signal $\large \dpi{80} X(t + 1)$ from past joint observations of $\large \dpi{80} X$ and $\large \dpi{80} Y$ minus the information gained about future observations $\large \dpi{80} X(t+1)$ from past states of $\large \dpi{80} X$ only:

$\large \dpi{80} T_{Y \rightarrow X}(k, l) = \sum p(x_{1}, x_{2}, ... , x_{k+1}, y_{1}, ..., y_{l} ) \log_{2} \left( \frac{p( x_{k + 1} | x_{1}, x_{2}, ... , x_{k}, y_{1}, ..., y_{l}) }{p(x_{k + 1} | x_{1}, x_{2}, ... , x_{k},) } \right)$

Equation can be simplified to the block entropy form:

$\large \dpi{80} T_{Y \rightarrow X}(k, l) = h_{X}(k) - h_{XY}(k, l) , \quad 0 \leq T_{Y \rightarrow X} (k, l) \leq H_{X}$

One should be careful in choosing the value of $\large \dpi{80} k$, if it is too large to support the data then $\large \dpi{80} T_{Y \rightarrow X}$ is subjected to finite sample effects, if it is too small then we run the risk of misinterpret the information contained in past states of both variables as information flow from Y to X. Hence, if $\large \dpi{80} k$ is to small the transfer entropy measurement will not provide any additional information that the lagged mutual information not already reveals. In the next section we will combine what we now know about information transfer with our knowledge about wavelets to compute the information transfer at multiple time scales. In all analyses that follow the parameters are set to S = 3, k = 4, and l = 1. All wavelet representations are performed by using the Continious Wavelet Transform of the Morlet wavelet function with a central frequency of 5 radians per sample. The Morlet is used since it offers an excellent time versus frequency trade-off. When I am at Gallistel's lab I can demonstrate some principles and guidlines on how to estimate wavelet functions from data (a topic that I will not deal with here).

# Application to simulated data

Lets define an one-step autoregressive process, AR(1):

$\large \dpi{80} X(t + 1) = 0.7X(t) + 0.7\textup{cos}(0.3t) + \varepsilon_{X}$

First note that the prediction ahead combines 0.7 of the output from the past observation and combines it with a cosine frequency component together with signal specific random noise drawn from a normal distribution with mean zero and 0.2 in variance.

More importantly, let the AR(1) process $\large \dpi{80} X(t)$ serve as input to another AR(1) process $\large \dpi{80} Y(t)$:

$\large \dpi{80} Y(t + 1) = 0.7Y(t) + 0.7X(t) + 0.7\textup{sin}(0.06t) + \varepsilon_{Y}$

The reader should be comfortable with these two processes, their structure and dependencies otherwise the following simulation will not make sense. We begin by generating 5000 samples of the processes and then restrict analysis to the final 3000 observations. A small time window illustrating the processes can be seen in Figure 3., note that at this level we clearly see that some synchronicity is going on between the two trajectories, but we can not determine in which direction the coupling goes.

Figure 3.The two time-series X and Y which one is the input and which is the output?

The limitation of traditionally lagged measures such as the lagged mutual information can be seen in Figure 4.. The figure illustrates the Cross Correlation Function of the two time-series (a bivariate generalization of the autocorrelation function). Although we can clearly identify the phases where the processes synchronize we can not delineate in which direction they are coupled.

Figure 4.The failure of measures that try to recreate the asymmetry of temporal measurements by the use of ad hoc solutions can be illustrated by the cross correlation function (ccf) which is the bivariate generalization of the autocovariance function (ACF)

In contrast, the information transfer from Y to X (blue) and X to Y (black) depicted in Figure 5. clearly illustrate that the information flows from X into Y and not the other way around. In Figure 5. the transfer entropy is computed for different coupling parameters c between 0 and 1, notice that for the flow of information from X to Y the transmitted information increases monotonically until it reaches the channel's capacity. This simulation demonstrates that even for low coupled systems (c = 0.1~0.3) transfer entropy is sensitive enough.

Figure 5.Information transfer between the two time-series and as a function of the coupling parameter that directs the input of X to Y

However, both AR(1) processes contain frequency components. X includes a cosine wave with a frequency of 0.3 whereas the Y process both received a proportion of the lagged cosine term by it's coupling with X but also had a sine wave with a frequency of 0.6, a since wave that strictly originating from Y alone.

Any robust method of analysis not only needs to be able to extract information on coupled systems independent from a common source, but it also needs to able to decompose the information transfer into it's periodic elements. By using the Continious Wavelet Transform (CWT) of both time-series and it's representation into the time-scale plane on a total of 87 detail coefficients (see Figure 2. and Figure 1.) we can compute the information transfer between X and Y on each of the 87 time-scales. By definition an AR(1) process transfer information on a trial-by-trial basis (i.e. t+1), therefore we would expect a significant proportion of information transfer at the lowest time scale, 1.

In addition, because X with it's cosine 0.3 frequency oscillation partly serves as input into Y we should see information transfer at scales near the equivalent frequency of 0.3 (to understand how scales can be translated into frequencies the concept of central frequency needs to be understood, email me if you have any concerns about this). More importantly if multi-scale analysis of information transfer is sensitive we should not see any transfer around 0.06 which is the frequency of inherent oscillations of Y.

All of these three hypothesis, i.e. (1) transfer from X into Y at time-scale one (2) transfer of the cosine term at the 0.3 frequency (3) no transfer of the sine term around 0.06, can be seen in Figure 5 for the reader to verify. In Figure 5 information transfer is plotted as a function of time-scale represented on the abscissa, on the opposite axis the corresponding frequency is shown (note the logarithmic scale induced by the properties of the time-scale plane, compare with Figure 1.)

Figure 6.The wavelet representation of the information flow seen in Figure 5. Note that most information is transmitted in the lowest scale (1-5 Hz), more importantly we see a peak around 30 Hz which corresponds to the frequency of the cosine term in X. Lastly, note the absence of an equal amount of information transfered at the low frequency (0.06 Hz), hence the wavelet representation of information flow can distinguish between the relative contributions at different time-scales

# Implications for 'The Matching Law'

Since Richard Herrstein's original 1961 paper the attention received by what is today known as 'the matching law' has steadily increased, not just within the narrow field of behavioral analysis but within neuroscience and economics as well.

In 1974 Baum described the Generalized Matching Law that could incorporate deviations from strict matching by a sensitivity parameter, s, and a response bias, b. In the case of two concurrent operadum the Generalized Matching Law can be stated as follows:

$\large \dpi{80} \frac{B_{1}}{B_{2}}=b\left( \frac{R_{1}}{R_2}\right )^S$

In this section time-scale analysis of entropy transfer will be combined with multilevel linear modeling to examine the heterogeneity in matching seen among participants. Often this inter-individual variability is attributed to what is refered to as 'dynamical' or 'non-asymptotic' behavior and many studies of matching not interested in this 'dynamical' behavior therefore make certain that all data analyzed is obtained from sessions of well-trained organisms. This practice of ignoring valuable information is quite absurd, it is similar to standing at night under a street light with a flash light in your hand and directing the flash light's beam onto your own feet, a place that is obviously already illuminated. To their defence there are a souring interest in understanding this dynamical or non-asymptotic behavior. However, this approach is staggering with strange and silly concepts borrowed from physics/chemistry such as 'molecular', 'molar', 'momentum' together with a modeling jargong that compares animal behavior to the behavior of pistons, basins etc. as if an act of perception or remembering shared anything more than superficially to the mechanical behavior of a steam engine.

We begin by rearranging the behavioral ratio for the ith session with a specific concurrent schedule into it's log-transform. We then express the matching law as a function between random variables with Gaussian distributions. The mean is expressed as a linear function of a response bias parameter that is free to vary between individuals and the log-transformed reinforcement ratio multiplied by the sensitivity parameter (that also varies between individuals):

$\large \dpi{80} \log_{2} \left( \frac{[B_{1}]_{i}}{[B_{2}]_{i}} \right) \sim N\left( s_{j[i]} \log_{2}\left( \frac{[R_{1}]_{i}}{[R_2]_{i}}\right ) + \log_{2}(b)_{j[i]} \quad , \quad \sigma^{2} \right)$

As said before, the free parameters are allowed to vary between individuals j according to a normal distribution, formally:

$\large \dpi{80} \binom{\log_{2}(b)_{j}}{s_{j}} \sim N\left( \binom{\mu_{\log_{2}(b)}}{\mu_{s}}, \binom{\sigma^{2}_{\log_{2}(b)} \quad \rho \sigma_{\log_{2}(b)}\sigma_{s}}{\rho \sigma_{\log_{2}(b)}\sigma_{s} \quad \sigma^{2}_{s}} \right)$

The variability in parameters is determined by the between-individual correlation parameter $\large \dpi{80} \rho$. The above model will be applied to a instrumental social learning paradigm that mimics aspect of traditional free operant VI concurrent schedules.

The data to-be-analyzed is from an experiment on social dominance and aversive learning, see Figure 7.. In this simple paradigm the task is to participate in a ball thossing game, one can not win or loose. During the game if the participant throws the ball to one of the confederates there is a probability that the other confederate will disapprove of this and make a frown. At 1500 ms from the onset of the frown there is a possibility that a mild 100 ms electrotactile stimulation is delivered to the participant's armwrist. The Reinforcement Learning Module (RLM) of the paradigm is programmed by me for PC with the aid of DirectX, it was originally modelled after Eisenberger et al's (2009) paradigm for social exclusion.

For a short video that shows excerpt from the software (everything is made by me) see the following link: http://vimeo.com/21850959. Note that this is not a demonstration of the Reinforcement Learning Module (RLM) but of a paradigm that simulates social exclusion (see the article by Eisenberger, et al. 2003 in Science).

The current paradigm is far from an ideal paradigm to evaluate important aspects of matching since the paradigm comes with a rigid trial structure (one can not throw a ball that one does have). nevertheless, it is the only data set I have access to.

Figure 7.Example where the participant (here depicted as Paul Ekman) throws the ball to the male confederate, the female confederate frowns and 'delivers' a shock to the participant. The Reinforcement Learning Module (RLM), using frowning faces and electric shocks as US. The faces are animated from the Cohn-Kanade AU-Coded Facial Expression Database

All participants were exposed to the same order of programmed schedules occuring in blocks of 40 trials within in the same session without any interruption or cue. No change over delay (COD) was used since COD is not a natural part of the social interaction of interest. The probability schedule for a frowning face and electric shock was [ {0.7:0.3}, {0.3:0.7}, {0.9:0.1}, {0.15:0.85}, {0.85:0.15}, {0.1:0.9} ]. Each block consisted of 40 trials (throws made by the participant), resulting in 6 blocks with three ratios counter-balanced between confederates.

The matching plots from a sample of 25 individuals are shown in Figure 8. the reader can verify that some participants approximate the generalized matching law such that the fraction of allocated passes to a confederate is proportional to the fraction of obtained punishment delivered by that participant although note the variability in slopes (sensitivity parameter) between participants.

Figure 8.Sample of 25 individuals that participated in a matching law experiment on social dominance. Data points are obtained log-ratio of frequencies (behavior, B, or reinforcement, R) in each block, see text.

The statistical strategy is to enter the proportion of feedforward information transfer, $\large \dpi{80} \frac{{}T_{B_{1} \rightarrow R_{1}}}{T_{B_{1} \rightarrow R_{1}}+T_{R_{1} \rightarrow B_{1}}}$, as a subject-level predictor. $\large \dpi{80} {T_{B_{1} \rightarrow R_{1}}}$ denotes the entropy transfer from behavioral option 1 to reinforcement option 1, and $\large \dpi{80} T_{R_{1} \rightarrow B_{1}}$ denotes information flow in the opposite direction (reinforcement or feedback driven). By entering $\large \dpi{80} \frac{{}T_{B_{1} \rightarrow R_{1}}}{T_{B_{1} \rightarrow R_{1}}+T_{R_{1} \rightarrow B_{1}}}$, as a subject-level predictor. $\large \dpi{80} {T_{B_{1} \rightarrow R_{1}}}$ for each scale into the random effect model we can examine when the subject-predictor contributes significantly to the null model that has no information flow parameter that varies between individuals. The time-scale that reduces most variance in sensitivity parameters across participant is the time-scale that contributes the the hetereogenity seen in Figure 8. The null model has already been described, the alternative model that includes the subject-predictor for a specific time-scale is seen bellow:

$\large \dpi{80} \binom{\log_{2}(b)_{j}}{s_{j}} \sim N\left( \binom{\gamma^{log_{2}(b))}_{0}+\gamma^{log_{2}(b))}_{1}\mu_{j}}{\gamma^{s}_{0}+\gamma^{s}_{1}\mu_{j}}, \binom{\sigma^{2}_{\log_{2}(b)} \quad \rho \sigma_{\log_{2}(b)}\sigma_{s}}{\rho \sigma_{\log_{2}(b)}\sigma_{s} \quad \sigma^{2}_{s}} \right)$

The mean of the random effects have been replaced by a linear equation that describes the impact of on response bias $\large \dpi{80} \gamma^{log_{2}(b))}_{0}+\gamma^{log_{2}(b))}_{1}\mu_{j}$ and sensitivity $\large \dpi{80} \gamma^{s}+\gamma^{s}_{1}\mu_{j}$ across participants. Where $\large \dpi{80} \mu_{j} = \frac{{}T_{B_{1} \rightarrow R_{1}}}{T_{B_{1} \rightarrow R_{1}}+T_{R_{1} \rightarrow B_{1}}}$ for the jth participant.

Both the response onset and the delivery of an electric shock (option 1) were simultaneously sampled and offline downsampled for computational convenience to a sampling rate of 1 sample per 100 ms. The time-series were then transformed into time-scale space represented by 84 different time-scales. The resulting transfer entropy is seen in Figure 9.. The filled black circles and lines gives the transfer of information from the delivery of the US whereas the hollow circles gives the transfer of information from the response to the US.

Figure 9.Information transfer averaged across the 25 participants on each time-scale. Errorbars denotes +/-1 SEM.

Three properties are of importance in Figure 9.:

(1) The majority of information flows from the response to the US at small to mediate time-scales. This is equivalent to the behavior of a feed-forward model that makes episodic small-sample estimates (Gallistel et al., 2007)

(2) At time-scales 19 to 34 we see that the information that flows from the US to the response increases above that of the transfer from the response to the US. Scales 19 to 34 is equivalent the the average frequency of when the participant enters a new block and the schedule changes. The increase in information flow from the US to the response can be described by a feedback system at the time-scales of blocks where an apparent streak of shocks implicitly signals that the scheduled has changed.

(3) At higher time-scales the entropy transfer between the two series either converges or the total amount of information transmitted drops.

Before running the statistical model that will examine the time-scales that can explain variability in the sensitivity parameter, lets plot the information transfer in Figure 9. as a function of sensitivity/slope. This can be seen in the heatmap bellow:

Figure 10.Each thick on the y-axis denotes a single participant with it's corresponding sensitivity parameter. The x-axis shows the time-scales. The contours and colors in the heatmap shows the the proportion of information transfer that flows from the response to the US

On the y-axis each thick denotes a single participant with it's fitted sensitivity parameter (obtained by single subject OLS regression). The x-axis shows us the time-scale and the contours and colors in the heatmap shows us the measurement which I'll use as a between subject-predictor, i.e. the proportion of information transfer that flows from the response to the US.

In Figure 10. notice the following observations: (1) no systematic variability at short to medium time-scales between different slope values indicating that the episodic sampling behavior is not related to the sensitivity parameter. (2) at time-scales around 30 there seems to be the case that participants with a low sensitivity do not use the same amount of information from the US as highly sensitive individuals do, the dark blue area (3) green colors represents equilibrium (50-50).

Infact observation (2) is crucial. Because when we run our model we find that the majority of the variability in the sensitivity parameter can be accounted for by variability in information transfer at the rather narrow time scale 30 to 32, where it accounts for between 40-80% of the variability in slopes. It goes without saying that at these time-scales the model using information transfer as a fix effect is better than a model without, LL-ratio = 9.58, ChiSq = 9.59, df = 2, p< .01. The interaction of information transfer on sensitivity is depicted in Figure 11..

Figure 11.Individual participant's reinforcement sensitivity as a function of information transfer for time-scale 32. The error bars denotes individual subjects 95% CI. The regression lines CI is 95% as well

To grasp the temporal precision by which we can target the entropy transfer related to variability in reinforcement sensitivity we can superimpose these time-scales on the heatmap in Figure 11.. The three time-scales that contributes to the variability in the sensitivity parameter is highlighted as a short strip in the heatmap shown in Figure 12.:

Figure 12.The highlighted region shows the time-scale where variability in information transfer explains variability in reinforcement sensitivity

All scripts and functions used to compute information transfer will soon be available on this website, first for R then I will port the script to MatLab.

Yours Sincerely,
Daniel

### Blog

2012-10-26 10:34

It is very simple to inspect the source code of any command from a package in R by simply typing in the console the package...

2012-09-23 03:33

In this blog post I will especially look at the scaling problems that arise in the design of...

2011-04-12 20:51

Summary:
Preliminary remark on the separate presentations held by Randy Gallistel, Greg Jensen, David...