{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import noctiluca as nl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TaggedSet\n",
    "The [TaggedSet](../noctiluca.rst#noctiluca.taggedset.TaggedSet) is used to store and work with collections of trajectories. At its core, this is a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = nl.TaggedSet()\n",
    "for i in range(10):\n",
    "    data.add(nl.Trajectory(i)) # inject some dummy trajectories into `data`\n",
    "\n",
    "print(f\"Number of trajectories: {len(data)}\")\n",
    "print(f\"Trajectory #5: {data[5]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what's the use of this? Afterall, python already has lists. This is where the \"Tagged\" part comes in.\n",
    "\n",
    "The fundamental idea of the ``TaggedSet`` is to identify different \"kinds\" of trajectories (experimental conditions, framerates, etc.) by attaching **tags**, thus facilitating queries like\n",
    "\n",
    "+ plot ``<some nice plot>`` for all trajectories in the **ΔCTCF** condition\n",
    "+ run ``<some analysis>`` on all **10ms** trajectories\n",
    "+ run ``<some detailed analysis>`` on all **10ms** trajectories from the **ΔCTCF** condition\n",
    "\n",
    "Note that the picture at this point stops being that of a classic list. Instead, a ``TaggedSet`` should be thought of as an unordered pile of labelled trajectories: just stuff all your data into it; if you want to work with some subset of trajectories, you pull everything with the proper label out of the pile. (This understanding is why it's ``TaggedSet``, not ``TaggedList``.)\n",
    "\n",
    "To see how that works, we start out by manually creating a ``TaggedSet``. Note that in production this would be a bit more streamlined (see e.g. the beginning of the [MSD tutorial](04_MSD_analysis.ipynb))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = nl.TaggedSet()\n",
    "\n",
    "data.add(nl.Trajectory(0), tags=[ '10ms', 'ΔCTCF'])\n",
    "data.add(nl.Trajectory(1), tags=[ '10ms', 'ΔCTCF'])\n",
    "data.add(nl.Trajectory(2), tags=['100ms', 'ΔCTCF'])\n",
    "data.add(nl.Trajectory(3), tags=['100ms', 'ΔCTCF'])\n",
    "data.add(nl.Trajectory(4), tags=[ '10ms',    'WT'])\n",
    "data.add(nl.Trajectory(5), tags=[ '10ms',    'WT'])\n",
    "data.add(nl.Trajectory(6), tags=['100ms',    'WT'])\n",
    "data.add(nl.Trajectory(7), tags=['100ms',    'WT'])\n",
    "data.add(nl.Trajectory(8))\n",
    "data.add(nl.Trajectory(9))   # tags are optional\n",
    "\n",
    "print(f\"total number of trajectories: {len(data)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`data` now contains 20 trajectories; each trajectory has only one data point, which we use as index to indentify the trajectories.\n",
    "\n",
    "Now we can query `data` for different subsets of trajectories, based on the tags that are associated with them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection('ΔCTCF')\n",
    "print(f\"trajectories in ΔCTCF condition:                           \", \" \".join([str(traj[0][0]) for traj in data]))\n",
    "\n",
    "data.makeSelection('10ms')\n",
    "print(f\"trajectories with 10ms framerate:                          \", \" \".join([str(traj[0][0]) for traj in data]))\n",
    "\n",
    "data.makeSelection(tags=['10ms', 'ΔCTCF'], logic=all)\n",
    "print(f\"trajectories with 10ms framerate in ΔCTCF condition:       \", \" \".join([str(traj[0][0]) for traj in data]))\n",
    "\n",
    "data.makeSelection(tags=['10ms', 'ΔCTCF'], logic=any)\n",
    "print(f\"trajectories with either 10ms framerate or ΔCTCF condition:\", \" \".join([str(traj[0][0]) for traj in data]))\n",
    "\n",
    "print()\n",
    "data.makeSelection()\n",
    "print(f\"Still containing all {len(data)} trajectories\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few things to note about the above code:\n",
    "\n",
    "+ ``TaggedSet`` objects are iterable, i.e. can be used in ``for`` loops\n",
    "+ once the trajectories of interest are selected by ``makeSelection()``, the ``TaggedSet`` object behaves as if it contained only those trajectories; the calls for producing the actual output are exactly identical.\n",
    "+ when selecting by multiple tags, you have to specify whether you want those trajectories that have *all* the tags, or those that have *any* of them. The ``logic`` keyword accepts callables like python's built-in ``all()`` or ``any()``; you can also write a more specific logic yourself. See also below for tips on making complicated selections\n",
    "+ calling ``makeSelection()`` without arguments resets the selection to the whole data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with tags\n",
    "Three useful features when working with tags:\n",
    "\n",
    "+ ``TaggedSet.tagset()`` returns a set of all tags in the current data\n",
    "+ ``TaggedSet.addTags()`` adds one or more tags to all trajectories\n",
    "+ you can use call syntax to also return tags when iterating"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection() # good style: clear selection before doing anything\n",
    "print(f\"Full set of available tags: {data.tagset()}\")\n",
    "\n",
    "# Adding new tag to identify trajectories with '10ms' and 'ΔCTCF' tags\n",
    "data.makeSelection(['10ms', 'ΔCTCF'], logic=all)\n",
    "data.addTags(\"10ms + ΔCTCF\")\n",
    "data.makeSelection()\n",
    "print(f\"Full set of available tags: {data.tagset()}\")\n",
    "\n",
    "data.makeSelection('10ms + ΔCTCF')\n",
    "print(f\"trajectories with '10ms + ΔCTCF' tag:\", \" \".join([str(traj[0][0]) for traj in data]))\n",
    "\n",
    "print()\n",
    "\n",
    "data.makeSelection()\n",
    "for traj, tags in data(giveTags=True):\n",
    "    print(f\" - trajectory #{traj[0][0]:d} carries tags: {tags}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More complicated selections\n",
    "Sometimes you need to pull trajectories by a criterion that does not naturally have a tag associated with it. The mechanism for custom selections is to hand a callable (i.e. function) to ``TaggedSet.makeSelection()``. This callable should expect two arguments—the trajectory and the associated tags—and return a bool, indicating whether it should be selected or not. A natural use case for this mechanism is to filter by trajectory length."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection(lambda traj, _ : 1 <= traj[0][0] < 7)\n",
    "print(\"Selected trajectories:\", \" \".join([str(traj[0][0]) for traj in data]))\n",
    "\n",
    "data.makeSelection(lambda traj, tags : 1 <= traj[0][0] < 7 and 'ΔCTCF' in tags)\n",
    "print(\"Selected trajectories:\", \" \".join([str(traj[0][0]) for traj in data]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second version, where we filter by a user-defined feature *and* a tag, could also be thought of as a selection in multiple steps. Let's try:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection(lambda traj, _: 1 <= traj[0][0] < 7)\n",
    "data.makeSelection('ΔCTCF')\n",
    "print(\"Selected trajectories:\", \" \".join([str(traj[0][0]) for traj in data]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is not what we wanted! Trajectory \\#0 should have been excluded by the first filter; what happened?\n",
    "\n",
    "``makeSelection()`` always makes exactly the specified selection, overwriting any potentially existing previous ones. A call to ``makeSelection()`` thus provides a break point: the selection after this call is always clear from just the call itself, regardless of previous history. This ensures reproducibility and code readability, since you can structure your code into blocks that begin with ``makeSelection()`` statements (as we did above).\n",
    "\n",
    "But stepwise selection still sounds like a useful feature—in fact it is highly recommended for code readability (a stepwise process is easier to follow than a single convoluted conditional). The function to use is called ``refineSelection()``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection('ΔCTCF')\n",
    "data.refineSelection(lambda traj, _ : 1 <= traj[0][0] < 7)\n",
    "print(\"Selected trajectories:\", \" \".join([str(traj[0][0]) for traj in data]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In fact, ``refineSelection(...)`` is simply an alias for ``makeSelection(..., refining=True)``; you can thus use it just like you would use ``makeSelection()``, except that it respects previous selections.\n",
    "\n",
    "Selections can be saved temporarily and then restored. This can be useful for cycling through subselection steps—note, however, that it is often more readable to just repeat the full selection process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection(lambda traj, _ : 1 <= traj[0][0] < 7)\n",
    "sel = data.saveSelection()\n",
    "print(\"Selected trajectories:\", \" \".join([str(traj[0][0]) for traj in data]))\n",
    "\n",
    "for tag in ['ΔCTCF', 'WT']:\n",
    "    data.restoreSelection(sel)\n",
    "    data.refineSelection(tag)\n",
    "    print(\"Selected trajectories:\", \" \".join([str(traj[0][0]) for traj in data]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes you need a random subset of your data; ``makeSelection()`` accepts the keyword arguments ``nrand=...`` to randomly select a given number of trajectories, and ``prand=...`` to select each trajectory with a given probability:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection(nrand=3)\n",
    "print(\"selected trajectories:\", \" \".join(str(traj[0][0]) for traj in data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data processing\n",
    "A ``TaggedSet`` is iterable (see above), which means that it naturally interfaces with built-in functions such as ``map()``. For example, to get the length of all trajectories in the set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection()\n",
    "print(\"Trajectory lengths:\", list(map(len, data)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To modify the actual data use ``TaggedSet.apply()``. This has the benefit (over ``map()``) of keeping the tags associated with the trajectory in order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection('WT') # selects trajectories 4, 5, 6, 7 (c.f. above)\n",
    "data_new = data.apply(lambda traj: traj.rescale(-1))\n",
    "data.makeSelection()\n",
    "data_new.makeSelection() # this is unnecessary, just for clarity\n",
    "print(\"data:\",     \" \".join(str(traj[0][0]) for traj in data), \"   # all data, unmodified\")\n",
    "print(\"data_new:\", \" \".join(str(traj[0][0]) for traj in data_new), \"       # just the processed data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To overwrite the original data with the new results, you can use ``apply(..., inplace=True)``.\n",
    "\n",
    "Sometimes you need to access some information about your trajectories that should be the same for all of them (e.g. length in the example above; more commonly things like \"number of particles\", \"number of spatial dimensions\", etc.); a common quickfix solution is to simply query the first trajectory. ``TaggedSet.map_unique()`` provides a safer version of this by then also proceeding to check that all remaining trajectories give the same result and raising a ``RuntimeError`` if that is not the case:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection()\n",
    "print(f\"All trajectories have N={data.map_unique(lambda traj: traj.N)}, \"\n",
    "                            f\"d={data.map_unique(lambda traj: traj.d)}, \"\n",
    "                            f\"and {data.map_unique(lambda traj: len(traj))} frames\")\n",
    "\n",
    "try:\n",
    "    x = data.map_unique(lambda traj: traj[0][0])\n",
    "    print(f\"First data point of all trajectories is {x}\")\n",
    "except RuntimeError:\n",
    "    print(\"First data point differs across trajectories\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merging data sets\n",
    "Merging the trajectories from one ``TaggedSet`` into another can be achieved by the ``mergein()`` function, or equivalently the ``|=`` operator (the latter being inspired by the understanding of ``TaggedSet`` as a set)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.makeSelection()\n",
    "print(f\"before adding trajectories: len(data) = {len(data)}\")\n",
    "\n",
    "more_data = nl.TaggedSet((nl.Trajectory(i) for i in range(10, 15)), hasTags=False)\n",
    "data |= more_data\n",
    "print(f\"added 5 new trajectories:   len(data) = {len(data)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next: In- & Output from/to file or memory\n",
    "In [the next tutorial](03_IO.ipynb) we will learn how to read/write data, thus concluding the core tutorial series."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}