noctiluca.util
noctiluca.util.clean
Everything to do with cleaning up experimental data
- noctiluca.util.clean.split_trajectory_at_big_steps(traj, threshold)
Removes suspected mislinkages.
Exception: two consecutive big steps whose overall displacement is smaller than the threshold are treated as a single misconnected point in an otherwise fine trajectory. In this case we simply remove the misconnected point.
- Parameters:
traj (Trajectory) – the trajectory to investigate
threshold (float) – the maximum allowed frame to frame displacement
- Returns:
set of Trajectory
Notes
As of now, this only works on trajectories with N=1.
This really just checks for frame-to-frame connections exceeding the threshold. So if there are missing frames in a trajectory, the step across the missing data will not be considered.
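The frame-to-frame check at the core of this function can be sketched in a few lines of numpy. This is a simplified illustration under stated assumptions, not the library implementation; in particular, the helper name find_big_steps is hypothetical, and the two-consecutive-big-steps exception described above is not handled here:

```python
import numpy as np

def find_big_steps(positions, threshold):
    # positions: (T, d) array for a single-locus (N=1) trajectory.
    # Returns the indices i where the step positions[i] -> positions[i+1]
    # exceeds the threshold; these are the candidate split points.
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    return np.nonzero(steps > threshold)[0]

# A 1d trajectory with a suspected mislinkage between frames 2 and 3:
traj = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(find_big_steps(traj, threshold=1.0))  # [2]
```

The library function would then split the trajectory at these indices (and apply the single-misconnection exception), rather than just report them.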
- noctiluca.util.clean.split_dataset_at_big_steps(data, threshold)
Apply split_trajectory_at_big_steps to a whole data set.
- Parameters:
data (TaggedSet) – the dataset
threshold (float) – the maximum allowed step size
- Returns:
TaggedSet – a new data set with the split trajectories
noctiluca.util.mcmc
A small module for running MCMC sampling. It provides the abstract base class Sampler, which can be subclassed for concrete sampling schemes.
- class noctiluca.util.mcmc.MCMCRun(logLs=None, samples=None)
Bases:
object
The results of an MCMC run
- logLs
the likelihoods from the run
- Type:
np.array(dtype=float)
- samples
the actual MCMC sample
- Type:
list of <sample data type>
Notes
The list of samples is not automatically copied upon initialization.
- logLs_trunc()
Give only the likelihoods associated with the samples
Sampler.run returns likelihoods starting at the first iteration, but samples only after a given burn-in period. This function cuts this initial overhang from the likelihood array.
- Returns:
np.ndarray of shape (len(samples),)
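Conceptually, the truncation is a single slice; a minimal sketch, assuming logLs is a 1d array and samples a list (logLs_trunc_sketch is a hypothetical stand-in, not the library method):

```python
import numpy as np

def logLs_trunc_sketch(logLs, samples):
    # Keep only the tail of the likelihood array that lines up with the
    # recorded samples; the head covers the burn-in period.
    return logLs[len(logLs) - len(samples):]

logLs = np.arange(10.0)    # likelihoods for 10 iterations total
samples = [object()] * 7   # samples recorded after 3 burn-in steps
print(logLs_trunc_sketch(logLs, samples))  # [3. 4. 5. 6. 7. 8. 9.]
```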
- best_sample_logL()
Give the best sample (and log-likelihood)
- Returns:
sample (<sample data type>) – the maximum likelihood estimate
logL (float) – the maximum likelihood value
- acceptance_rate(criterion='sample_identity', comp_fun=None)
Calculate fraction of accepted moves
We can see whether a move was accepted or rejected by checking whether successive samples actually differ. There are three ways to do so. Syntactically correct would be sample comparison (using ==); however, since samples have a user-defined data type, the == operator is not necessarily defined. For mutable objects we can exploit that they would not be copied for a rejected step, i.e. we can use an identity check (is). As a last resort, we can use the likelihood as a proxy: it is very unlikely that a move leaves the likelihood exactly unchanged.
Alternatively, you can provide a comparison function comp_fun(sample0, sample1) -> bool that performs the comparison. This overrides the criterion setting.
- Parameters:
criterion ({'sample_equality', 'sample_identity', 'likelihood_equality'}) – which method to use to determine whether a step was accepted or rejected.
comp_fun (callable, optional) – should return True if the two arguments compare equal, False otherwise
- Returns:
float – the calculated acceptance rate
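The 'sample_identity' criterion can be illustrated as follows; acceptance_rate_identity is a hypothetical helper capturing just this one criterion, not the actual method:

```python
def acceptance_rate_identity(samples):
    # A rejected move keeps the same sample object, so consecutive
    # samples that are *not* the same object (checked via `is`)
    # correspond to accepted moves.
    accepted = sum(s0 is not s1 for s0, s1 in zip(samples[:-1], samples[1:]))
    return accepted / (len(samples) - 1)

a, b, c = [0], [1], [2]        # mutable sample objects
chain = [a, a, b, b, b, c]     # 5 moves, 2 of them accepted
print(acceptance_rate_identity(chain))  # 0.4
```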
- evaluate(fun)
Evaluate a function on all samples
This exploits that (if the sample data type is e.g. a user-defined class) many samples will actually be identical, so we have to evaluate the function significantly fewer than len(samples) times.
- Parameters:
fun (callable of signature fun(sample) --> object) – the function to evaluate. It should expect a single sample as input and return something.
- Returns:
list – a list of output values, in the order of samples.
Notes
This function is intended to decrease computational cost. It is usually quicker, however, to use a vectorized function fun instead.
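The caching idea can be sketched like this (evaluate_sketch is a hypothetical stand-in; it reuses results for samples that are the same object, which is the common case in an MCMC chain with rejected moves):

```python
def evaluate_sketch(fun, samples):
    # Cache results keyed on object identity: a rejected move repeats
    # the same sample object, so its result can be reused.
    cache = {}
    out = []
    for s in samples:
        key = id(s)
        if key not in cache:
            cache[key] = fun(s)
        out.append(cache[key])
    return out

calls = []
def fun(s):
    calls.append(s)   # record how often we actually evaluate
    return sum(s)

a, b = [1, 2], [3, 4]
print(evaluate_sketch(fun, [a, a, b, a]))  # [3, 3, 7, 3]
print(len(calls))                          # 2: fun ran only twice
```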
- class noctiluca.util.mcmc.Sampler
Bases:
ABC
Abstract base class for MCMC sampling
To implement a sampling scheme, subclass this and override
- propose_update
- logL
- callback_logging (optional)
- callback_stopping (optional)
See their documentations in this base class for more details.
The above methods take an argument params that is supposed to represent the parameters of the model. The structure of this representation (e.g. a list of parameters, a np.ndarray, or a custom class) is completely up to the user. Since all explicit handling of the parameters happens in the propose_update and logL methods, the code in this base class is completely independent of the parameterization.
Example
>>> import numpy as np
... import scipy.stats
>>> class normMCMC(Sampler):
...     def __init__(self, stepsize=0.1):
...         self.stepsize = stepsize
...
...     def propose_update(self, current_value):
...         proposed_value = current_value + np.random.normal(scale=self.stepsize)
...         logp_forward = -0.5*(proposed_value - current_value)**2/self.stepsize**2 - 0.5*np.log(2*np.pi*self.stepsize**2)
...         logp_backward = -0.5*(current_value - proposed_value)**2/self.stepsize**2 - 0.5*np.log(2*np.pi*self.stepsize**2)
...         return proposed_value, logp_forward, logp_backward
...
...     def logL(self, value):
...         return -0.5*value**2 - 0.5*np.log(2*np.pi)
...
...     def callback_logging(self, current_value, best_value):
...         print("Current value: {}".format(current_value))
>>> mc = normMCMC()
... mc.configure(iterations=100, burn_in=50, log_every=10, show_progress=True)
... logL, vals = mc.run(1)
- abstract propose_update(params)
Propose an update step.
- Parameters:
params (<user-specified parameter structure>) – the current parameters
- Returns:
proposed_values (<user-specified parameter structure>) – the proposed new parameters
logp_forward (float) – (log of the) probability of proposing these values from the current ones
logp_backward (float) – (log of the) probability of proposing the current values from the proposed ones
- abstract logL(params)
Calculate log-likelihood of the given parameters.
- Parameters:
params (<user-specified parameter structure>) – the parameters to use
- Returns:
float – the log-likelihood for the given parameters.
- callback_logging(current_params, best_params)
Callback upon logging.
This function will be called whenever a status line is printed. It can thus be used to provide additional information about where the sampler is, or how the current values compare to the best ones found so far, etc.
- Parameters:
current_params (<user-specified parameter structure>) – the current parameter set.
best_params (<user-specified parameter structure>) – the best (i.e. highest likelihood) parameter set the sampler has seen so far.
- callback_stopping(myrun)
Callback to enable early stopping
This function will be called at well-defined intervals to check whether sampling should abort. Judgement is to be made based on the data so far, which is handed over as the myrun argument; this is an MCMCRun object.
- Parameters:
myrun (MCMCRun) – the data generated so far
- Returns:
stop (bool) – should be True if the sampling is to stop, False otherwise
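A possible stopping criterion, sketched on a bare likelihood array. The function name, window size, and tolerance are all illustrative assumptions; a real subclass would apply such a check to myrun.logLs inside callback_stopping:

```python
import numpy as np

def should_stop(logLs, window=50, tol=1e-3):
    # Stop once the mean log-likelihood stops improving between the
    # last two windows of the chain (a simple plateau detector).
    if len(logLs) < 2 * window:
        return False
    recent = np.mean(logLs[-window:])
    previous = np.mean(logLs[-2 * window:-window])
    return abs(recent - previous) < tol

flat = np.zeros(200)               # converged chain: likelihood is flat
rising = np.linspace(0, 10, 200)   # still improving
print(should_stop(flat))    # True
print(should_stop(rising))  # False
```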
- configure(iterations=1, burn_in=0, log_every=-1, check_stopping_every=-1, show_progress=False)
Set the configuration of the sampler.
Note that after calling this function once, you can also access single configuration entries via the attribute config.
- Parameters:
iterations (int) – how many MCMC iterations to run total
burn_in (int) – how many steps to discard at the beginning
log_every (int) – print a status line every … steps. Set to -1 (default) to disable status lines.
check_stopping_every (int) – how often to call callback_stopping. Set to -1 (default) to never check.
show_progress (bool) – whether to show a progress bar using tqdm
- run(initial_values)
Run the sampling scheme
Remember to run configure before this.
- Parameters:
initial_values (<user-specified parameter structure>) – the initial values for the sampling scheme
- Returns:
MCMCRun – the output data of the run. This is essentially a wrapper class for an array of likelihoods and the associated samples. Plus some convenience functions.
Notes
The returned MCMCRun contains likelihoods for all steps starting at initialization, while the list of sampled parameters starts only after the burn-in period.
noctiluca.util.userinput
Convert user data to noctiluca objects (Trajectory, TaggedSet)
This module provides convenience functions to convert from various formats in which users might provide their data to the objects that noctiluca and libraries building on it use. These functions are designed to be used under the hood in downstream libraries, such that beginning users do not have to go the extra step of creating Trajectory/TaggedSet objects from their data. In the long run it is of course recommended to get familiar with these objects, as they provide some useful additional functionality.
For trajectories, the following inputs are accepted:
- noctiluca.Trajectory, which is returned unchanged
- pandas.DataFrame; columns should be named 'x', 'y', 'z' to indicate dimension, possibly with a suffix identifying different loci. So for example ('x', 'y', 'x2', 'y2') would indicate a trajectory with two loci in two spatial dimensions.
  - equivalent column specifications to the above include ('x_left', 'x_right', 'y_left', 'y_right'), ('x0', 'y0', 'x1', 'y1'), ('x_red', 'y_red', 'x_green', 'y_green')
  - the precise regular expression to identify columns containing positional data is '(x|y|z)(\d*|_.*)'
  - for each locus all dimensions should exist, i.e. ('x0', 'y0', 'x1') would be invalid
  - dimensions should be labelled starting with 'x', i.e. a dataframe containing just a 'y' column would be invalid
  Beyond the positional data, the dataframe may contain a column 'frame' indicating the frame number that each detection belongs to. This column will be coerced to integer.
- numpy array-like; anything that casts well to a numpy array is accepted. This case is simply forwarded to the noctiluca.Trajectory constructor, which accepts array-like structures. Possible shapes for array-like inputs:
  - (N, T, d) for a trajectory with N loci in d spatial dimensions over T time steps
  - (T, d) for a single-locus trajectory
  - (T,) for a trajectory with a single locus in a single spatial dimension
- pandas.Series, which is cast to a one-dimensional numpy array. In doing so, we use the index of the series as frame number, i.e. pd.Series(data=[1, 2, 3], index=[5, 7, 8]) would be read as a trajectory with 4 frames, where the second one is missing.
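The quoted regular expression can be tried directly. A small sketch, assuming the pattern is applied as a full match against each column name:

```python
import re

# The column-matching rule quoted above: a dimension letter, followed by
# either an optional locus number or an underscore-prefixed suffix.
pos_pattern = re.compile(r'(x|y|z)(\d*|_.*)')

columns = ['x', 'y2', 'z_red', 'frame', 'particle', 'xy']
positional = [c for c in columns if pos_pattern.fullmatch(c)]
print(positional)  # ['x', 'y2', 'z_red']
```

Note that 'xy' is rejected: after the leading 'x', the remainder 'y' matches neither a digit sequence nor an underscore-prefixed suffix.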
For datasets (i.e. a collection of trajectories) we accept:
- noctiluca.TaggedSet, which is returned unchanged
- any iterable of structures that make_Trajectory can deal with. Specifically this includes a list of numpy arrays (which can then be of different length / dimension) or a big numpy array, where the different trajectories are indexed along the first dimension.
- pandas.DataFrame with column names conforming to the patterns outlined above for make_Trajectory, plus an additional column 'particle' indicating which detections belong to which trajectory. The name 'particle' for this column was chosen for compatibility with trackpy. Note that contrary to make_Trajectory, the specification of the 'frame' column is now mandatory.
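The way such a long-format table splits into per-particle trajectories can be sketched without pandas; plain tuples stand in for DataFrame rows here, and make_TaggedSet would build Trajectory objects rather than plain lists:

```python
from collections import defaultdict

# Long-format detections as (particle, frame, x) rows, as they would
# appear in a trackpy-style DataFrame:
rows = [
    (0, 0, 1.0), (0, 1, 1.1), (0, 2, 1.3),
    (1, 0, 5.0), (1, 1, 4.9),
]

# Group detections by particle id, ordered by frame within each group:
trajectories = defaultdict(list)
for particle, frame, x in sorted(rows):
    trajectories[particle].append(x)

print(dict(trajectories))  # {0: [1.0, 1.1, 1.3], 1: [5.0, 4.9]}
```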
More control over the input process (e.g. the assumed column names for pandas dataframes) can be attained by using the functions in this module explicitly, in which case additional keyword arguments can be specified. For yet finer control we recommend directly using the noctiluca.Trajectory and noctiluca.TaggedSet classes.
- noctiluca.util.userinput.make_Trajectory(inpt, **kwargs)
Create a noctiluca.Trajectory from user-specified data
- Parameters:
inpt (<various>) – see module docstring for description.
- Keyword Arguments:
t_column (str) – name of the column containing frame numbers when giving a pandas.DataFrame as inpt
pos_columns ((list of) list of str) – names of the columns containing the positional data when giving a pandas.DataFrame as inpt. This should be either a list of identifiers for the different dimensions (e.g. ['x', 'y']) or a list of such lists, if the trajectory has multiple loci (e.g. [['x1', 'y1'], ['x2', 'y2']]).
- Returns:
noctiluca.Trajectory
- noctiluca.util.userinput.make_TaggedSet(inpt, **kwargs)
Convert a collection of trajectories to noctiluca.TaggedSet
- Parameters:
inpt (<various>) – see module docstring for description.
- Keyword Arguments:
id_column (str) – the name of the column containing trajectory identifiers (when specifying a pandas.DataFrame). The data in this column does not have to be numerical. Defaults to 'particle'.
**kwargs (<other keyword arguments>) – forwarded to make_Trajectory. Note in particular 't_column' and 'pos_columns'.