[1]:
import noctiluca as nl

TaggedSet

The TaggedSet is used to store and work with collections of trajectories. At its core, this is a list:

[2]:
data = nl.TaggedSet()
for i in range(10):
    data.add(nl.Trajectory(i)) # inject some dummy trajectories into `data`

print(f"Number of trajectories: {len(data)}")
print(f"Trajectory #5: {data[5]}")
Number of trajectories: 10
Trajectory #5: <noctiluca.trajectory.Trajectory object at 0x7fcfe6cc74c0>

So what’s the use of this? After all, Python already has lists. This is where the “Tagged” part comes in.

The fundamental idea of the TaggedSet is to identify different “kinds” of trajectories (experimental conditions, framerates, etc.) by attaching tags, thus facilitating queries like

  • plot <some nice plot> for all trajectories in the ΔCTCF condition

  • run <some analysis> on all 10ms trajectories

  • run <some detailed analysis> on all 10ms trajectories from the ΔCTCF condition

Note that the picture at this point stops being that of a classic list. Instead, a TaggedSet should be thought of as an unordered pile of labelled trajectories: just stuff all your data into it; if you want to work with some subset of trajectories, you pull everything with the proper label out of the pile. (This understanding is why it’s TaggedSet, not TaggedList.)

To see how that works, we start out by manually creating a TaggedSet. Note that in production this would be a bit more streamlined (see e.g. the beginning of the MSD tutorial).

[3]:
data = nl.TaggedSet()

data.add(nl.Trajectory(0), tags=[ '10ms', 'ΔCTCF'])
data.add(nl.Trajectory(1), tags=[ '10ms', 'ΔCTCF'])
data.add(nl.Trajectory(2), tags=['100ms', 'ΔCTCF'])
data.add(nl.Trajectory(3), tags=['100ms', 'ΔCTCF'])
data.add(nl.Trajectory(4), tags=[ '10ms',    'WT'])
data.add(nl.Trajectory(5), tags=[ '10ms',    'WT'])
data.add(nl.Trajectory(6), tags=['100ms',    'WT'])
data.add(nl.Trajectory(7), tags=['100ms',    'WT'])
data.add(nl.Trajectory(8))
data.add(nl.Trajectory(9))   # tags are optional

print(f"total number of trajectories: {len(data)}")
total number of trajectories: 10

data now contains 10 trajectories; each trajectory has only one data point, which we use as an index to identify the trajectories.

Now we can query data for different subsets of trajectories, based on the tags that are associated with them:

[4]:
data.makeSelection('ΔCTCF')
print(f"trajectories in ΔCTCF condition:                           ", " ".join([str(traj[0][0]) for traj in data]))

data.makeSelection('10ms')
print(f"trajectories with 10ms framerate:                          ", " ".join([str(traj[0][0]) for traj in data]))

data.makeSelection(tags=['10ms', 'ΔCTCF'], logic=all)
print(f"trajectories with 10ms framerate in ΔCTCF condition:       ", " ".join([str(traj[0][0]) for traj in data]))

data.makeSelection(tags=['10ms', 'ΔCTCF'], logic=any)
print(f"trajectories with either 10ms framerate or ΔCTCF condition:", " ".join([str(traj[0][0]) for traj in data]))

print()
data.makeSelection()
print(f"Still containing all {len(data)} trajectories")
trajectories in ΔCTCF condition:                            0 1 2 3
trajectories with 10ms framerate:                           0 1 4 5
trajectories with 10ms framerate in ΔCTCF condition:        0 1
trajectories with either 10ms framerate or ΔCTCF condition: 0 1 2 3 4 5

Still containing all 10 trajectories

A few things to note about the above code:

  • TaggedSet objects are iterable, i.e. can be used in for loops

  • once the trajectories of interest are selected by makeSelection(), the TaggedSet object behaves as if it contained only those trajectories; the calls that produce the actual output are identical in every case.

  • when selecting by multiple tags, you have to specify whether you want those trajectories that have all the tags, or those that have any of them. The logic keyword accepts callables like Python’s built-in all() or any(); you can also write a more specific logic yourself. See also below for tips on making complicated selections.

  • calling makeSelection() without arguments resets the selection to the whole data set
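
The role of the logic keyword can be illustrated without noctiluca at all. The following is a minimal sketch in plain Python, not the library's implementation: `pile` and `select` are hypothetical names, and the sketch assumes that the logic callable receives one boolean per queried tag, which is how all() and any() are used.

```python
# Toy stand-in for a TaggedSet: plain (value, tags) pairs. This is NOT noctiluca code.
pile = [
    (0, {"10ms", "ΔCTCF"}),
    (2, {"100ms", "ΔCTCF"}),
    (4, {"10ms", "WT"}),
    (8, set()),
]

def select(pile, tags, logic=any):
    """Return the values whose tag set satisfies `logic` over the queried tags."""
    if isinstance(tags, str):  # allow a single tag as shorthand
        tags = [tags]
    # `logic` sees one boolean per queried tag: is this tag attached to the item?
    return [value for value, have in pile if logic(tag in have for tag in tags)]

both = select(pile, ["10ms", "ΔCTCF"], logic=all)    # items carrying both tags
either = select(pile, ["10ms", "ΔCTCF"], logic=any)  # items carrying at least one

# A hand-written logic: exactly one of the queried tags is present
exactly_one = select(pile, ["10ms", "ΔCTCF"], logic=lambda hits: sum(hits) == 1)

print(both, either, exactly_one)
```

Under that assumption, any function that reduces an iterable of booleans to a single bool can serve as a logic, which is what makes "exactly one of these tags" expressible.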

Working with tags

Three useful features when working with tags:

  • TaggedSet.tagset() returns a set of all tags in the current data

  • TaggedSet.addTags() adds one or more tags to all trajectories in the current selection

  • calling the TaggedSet with giveTags=True, i.e. data(giveTags=True), also returns the tags when iterating

[5]:
data.makeSelection() # good style: clear selection before doing anything
print(f"Full set of available tags: {data.tagset()}")

# Adding new tag to identify trajectories with '10ms' and 'ΔCTCF' tags
data.makeSelection(['10ms', 'ΔCTCF'], logic=all)
data.addTags("10ms + ΔCTCF")
data.makeSelection()
print(f"Full set of available tags: {data.tagset()}")

data.makeSelection('10ms + ΔCTCF')
print(f"trajectories with '10ms + ΔCTCF' tag:", " ".join([str(traj[0][0]) for traj in data]))

print()

data.makeSelection()
for traj, tags in data(giveTags=True):
    print(f" - trajectory #{traj[0][0]:d} carries tags: {tags}")
Full set of available tags: {'100ms', '10ms', 'WT', 'ΔCTCF'}
Full set of available tags: {'100ms', '10ms', 'WT', 'ΔCTCF', '10ms + ΔCTCF'}
trajectories with '10ms + ΔCTCF' tag: 0 1

 - trajectory #0 carries tags: {'10ms + ΔCTCF', '10ms', 'ΔCTCF'}
 - trajectory #1 carries tags: {'10ms + ΔCTCF', '10ms', 'ΔCTCF'}
 - trajectory #2 carries tags: {'100ms', 'ΔCTCF'}
 - trajectory #3 carries tags: {'100ms', 'ΔCTCF'}
 - trajectory #4 carries tags: {'10ms', 'WT'}
 - trajectory #5 carries tags: {'10ms', 'WT'}
 - trajectory #6 carries tags: {'100ms', 'WT'}
 - trajectory #7 carries tags: {'100ms', 'WT'}
 - trajectory #8 carries tags: set()
 - trajectory #9 carries tags: set()

More complicated selections

Sometimes you need to pull trajectories by a criterion that does not naturally have a tag associated with it. The mechanism for custom selections is to hand a callable (i.e. function) to TaggedSet.makeSelection(). This callable should expect two arguments—the trajectory and the associated tags—and return a bool, indicating whether it should be selected or not. A natural use case for this mechanism is to filter by trajectory length.
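
As a minimal sketch of the length filter mentioned above: the helper `min_length` is a hypothetical name, not part of noctiluca, and plain lists stand in for trajectories because the selector only calls len() on them.

```python
def min_length(n):
    """Build a selector with the (trajectory, tags) signature described above."""
    def selector(traj, tags):
        # The tags are ignored here; only the trajectory length matters.
        return len(traj) >= n
    return selector

keep = min_length(3)

# Plain lists stand in for trajectories; the second argument plays the role of the tags.
long_enough = keep([0.1, 0.2, 0.3], set())
too_short = keep([0.1], {"WT"})
print(long_enough, too_short)
```

With a real data set, the same selector would presumably be handed over as `data.makeSelection(min_length(3))`.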

[6]:
data.makeSelection(lambda traj, _ : 1 <= traj[0][0] < 7)
print("Selected trajectories:", " ".join([str(traj[0][0]) for traj in data]))

data.makeSelection(lambda traj, tags : 1 <= traj[0][0] < 7 and 'ΔCTCF' in tags)
print("Selected trajectories:", " ".join([str(traj[0][0]) for traj in data]))
Selected trajectories: 1 2 3 4 5 6
Selected trajectories: 1 2 3

The second version, where we filter by a user-defined feature and a tag, could also be thought of as a selection in multiple steps. Let’s try:

[7]:
data.makeSelection(lambda traj, _: 1 <= traj[0][0] < 7)
data.makeSelection('ΔCTCF')
print("Selected trajectories:", " ".join([str(traj[0][0]) for traj in data]))
Selected trajectories: 0 1 2 3

This is not what we wanted! Trajectory #0 should have been excluded by the first filter; what happened?

makeSelection() always makes exactly the specified selection, overwriting any potentially existing previous ones. A call to makeSelection() thus provides a break point: the selection after this call is always clear from just the call itself, regardless of previous history. This ensures reproducibility and code readability, since you can structure your code into blocks that begin with makeSelection() statements (as we did above).

But stepwise selection still sounds like a useful feature—in fact it is highly recommended for code readability (a stepwise process is easier to follow than a single convoluted conditional). The function to use is called refineSelection():

[8]:
data.makeSelection('ΔCTCF')
data.refineSelection(lambda traj, _ : 1 <= traj[0][0] < 7)
print("Selected trajectories:", " ".join([str(traj[0][0]) for traj in data]))
Selected trajectories: 1 2 3

In fact, refineSelection(...) is simply an alias for makeSelection(..., refining=True); you can thus use it just like you would use makeSelection(), except that it respects previous selections.
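
The difference between overwriting and refining can be pictured with one boolean flag per item. This is a toy model in plain Python, under the assumption that a selection is just such a mask; `ToyPile`, `make_selection`, and `refine_selection` are hypothetical names and say nothing about how noctiluca stores selections internally.

```python
class ToyPile:
    """Toy model of selection state: one boolean flag per item (illustration only)."""

    def __init__(self, items):
        self.items = list(items)
        self.selected = [True] * len(self.items)

    def make_selection(self, keep=lambda item: True, refining=False):
        # Without refining, the previous flags are ignored;
        # with refining, an item must also have been selected before.
        self.selected = [
            keep(item) and (flag or not refining)
            for item, flag in zip(self.items, self.selected)
        ]

    def refine_selection(self, keep):
        self.make_selection(keep, refining=True)

    def __iter__(self):
        return (item for item, flag in zip(self.items, self.selected) if flag)


pile = ToyPile(range(10))

pile.make_selection(lambda i: 1 <= i < 7)
pile.make_selection(lambda i: i < 4)    # overwrites the first call: item 0 is back
overwritten = list(pile)

pile.make_selection(lambda i: 1 <= i < 7)
pile.refine_selection(lambda i: i < 4)  # respects the first call: item 0 stays out
refined = list(pile)

print(overwritten, refined)
```

The two results mirror cells [7] and [8] above: the overwriting variant lets item 0 back in, the refining variant does not.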

Selections can be saved temporarily and then restored. This can be useful for cycling through subselection steps—note, however, that it is often more readable to just repeat the full selection process.

[9]:
data.makeSelection(lambda traj, _ : 1 <= traj[0][0] < 7)
sel = data.saveSelection()
print("Selected trajectories:", " ".join([str(traj[0][0]) for traj in data]))

for tag in ['ΔCTCF', 'WT']:
    data.restoreSelection(sel)
    data.refineSelection(tag)
    print("Selected trajectories:", " ".join([str(traj[0][0]) for traj in data]))
Selected trajectories: 1 2 3 4 5 6
Selected trajectories: 1 2 3
Selected trajectories: 4 5 6

Sometimes you need a random subset of your data; makeSelection() accepts the keyword arguments nrand=... to randomly select a given number of trajectories, and prand=... to select each trajectory with a given probability:

[10]:
data.makeSelection(nrand=3)
print("selected trajectories:", " ".join(str(traj[0][0]) for traj in data))
selected trajectories: 1 5 7
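
The two random modes differ in what they fix: nrand fixes the number of selected trajectories, prand fixes the per-trajectory probability, so the resulting count varies. A minimal sketch of that distinction using only Python's standard random module (an illustration of the two sampling schemes, not of how noctiluca draws its samples):

```python
import random

rng = random.Random(0)  # fixed seed, so repeated runs give the same picks
items = list(range(10))

# nrand-style: exactly n items, drawn without replacement
n_picked = sorted(rng.sample(items, 3))

# prand-style: each item is kept independently with probability p,
# so the number of selected items fluctuates from draw to draw
p_picked = [item for item in items if rng.random() < 0.3]

print(len(n_picked), len(p_picked))
```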

Data processing

A TaggedSet is iterable (see above), which means that it naturally interfaces with built-in functions such as map(). For example, to get the length of all trajectories in the set:

[11]:
data.makeSelection()
print("Trajectory lengths:", list(map(len, data)))
Trajectory lengths: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

To modify the actual data, use TaggedSet.apply(). Compared to map(), this has the benefit of keeping each trajectory’s tags attached to the processed result.

[12]:
data.makeSelection('WT') # selects trajectories 4, 5, 6, 7 (cf. above)
data_new = data.apply(lambda traj: traj.rescale(-1))
data.makeSelection()
data_new.makeSelection() # this is unnecessary, just for clarity
print("data:",     " ".join(str(traj[0][0]) for traj in data), "   # all data, unmodified")
print("data_new:", " ".join(str(traj[0][0]) for traj in data_new), "       # just the processed data")
data: 0 1 2 3 4 5 6 7 8 9    # all data, unmodified
data_new: -4 -5 -6 -7        # just the processed data

To overwrite the original data with the new results, you can use apply(..., inplace=True).

Sometimes you need some information about your trajectories that should be the same for all of them (e.g. length in the example above; more commonly things like “number of particles”, “number of spatial dimensions”, etc.). A common quick fix is to simply query the first trajectory. TaggedSet.map_unique() provides a safer version of this: it also checks that all remaining trajectories give the same result and raises a RuntimeError if that is not the case:

[13]:
data.makeSelection()
print(f"All trajectories have N={data.map_unique(lambda traj: traj.N)}, "
                            f"d={data.map_unique(lambda traj: traj.d)}, "
                            f"and {data.map_unique(lambda traj: len(traj))} frames")

try:
    x = data.map_unique(lambda traj: traj[0][0])
    print(f"First data point of all trajectories is {x}")
except RuntimeError:
    print("First data point differs across trajectories")
All trajectories have N=1, d=1, and 1 frames
First data point differs across trajectories
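
The "query the first item, then check the rest" idea behind map_unique() can be sketched as a free function in plain Python. This is an illustration only, not the library's code; in particular, raising a RuntimeError for an empty collection is a choice made for this sketch, not documented behavior of noctiluca.

```python
def map_unique(func, items):
    """Apply func to every item; return the common result or raise RuntimeError."""
    iterator = iter(items)
    try:
        first = func(next(iterator))
    except StopIteration:
        # Assumption of this sketch: an empty collection has no unique value.
        raise RuntimeError("no items to evaluate") from None
    for item in iterator:
        if func(item) != first:
            raise RuntimeError("func does not give the same result for all items")
    return first


common_length = map_unique(len, ["ab", "cd", "ef"])  # all strings have length 2

try:
    map_unique(len, ["ab", "c"])
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True

print(common_length, mismatch_detected)
```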

Merging data sets

Merging the trajectories from one TaggedSet into another can be achieved by the mergein() function, or equivalently the |= operator (the latter being inspired by the understanding of TaggedSet as a set).

[14]:
data.makeSelection()
print(f"before adding trajectories: len(data) = {len(data)}")

more_data = nl.TaggedSet((nl.Trajectory(i) for i in range(10, 15)), hasTags=False)
data |= more_data
print(f"added 5 new trajectories:   len(data) = {len(data)}")
before adding trajectories: len(data) = 10
added 5 new trajectories:   len(data) = 15
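
The equivalence of a merge method and the |= operator follows a standard Python pattern: the in-place operator is implemented by `__ior__`, which delegates to the method and returns the object itself. A toy sketch of that pattern follows; `ToyPile` is a hypothetical stand-in, and nothing here describes how noctiluca handles tags or duplicates during a merge.

```python
class ToyPile:
    """Toy container of (item, tags) pairs; a stand-in, not noctiluca's TaggedSet."""

    def __init__(self, pairs=()):
        self.pairs = list(pairs)

    def mergein(self, other):
        # Copy over every (item, tags) pair; the tags travel with their items.
        self.pairs.extend(other.pairs)

    def __ior__(self, other):
        # `a |= b` calls a.__ior__(b); returning self keeps `a` bound to the same object.
        self.mergein(other)
        return self

    def __len__(self):
        return len(self.pairs)


data = ToyPile((i, {"WT"}) for i in range(10))
more_data = ToyPile((i, set()) for i in range(10, 15))

same_object = data
data |= more_data
print(len(data), data is same_object)
```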

Next: In- & Output from/to file or memory

In the next tutorial we will learn how to read/write data, thus concluding the core tutorial series.