{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import noctiluca as nl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# In- & Output from/to file\n", "SPT trajectories are commonly saved in ``.csv`` format, i.e. as human-readable text files; this satisfies most basic needs and is easy to read for anyone, human or machine. More versatile storage—e.g. including arbitrary meta data—can be achieved by the binary, tree-structured HDF5 format. ``noctiluca`` is equipped to handle both.\n", "\n", "## .csv files\n", "``noctiluca`` can read ``.csv`` files with minimal requirements on the format. Let's take a look:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write a file like what we might expect to see in the wild\n", "filename = \"example_data_in.csv\"\n", "with open(filename, 'wt') as f:\n", " f.write(\"trajid\\tframe\\tpos_x\\tpos_y\\tpos_z\\n\")\n", " f.write(\"5 \\t 7 \\t 0.3 \\t 0.7 \\t -1.2 \\n\")\n", " f.write(\"5 \\t 8 \\t 1.5 \\t -0.4 \\t -0.8 \\n\")\n", " f.write(\"5 \\t 10 \\t 1.1 \\t -0.2 \\t 0.1 \\n\")\n", " f.write(\"6 \\t 4 \\t 2.1 \\t 11.3 \\t 5.1 \\n\")\n", " f.write(\"6 \\t 5 \\t 2.3 \\t 12.5 \\t 4.7 \\n\")\n", " \n", "print(f\"Contents of {filename}:\\n\")\n", "with open(filename, 'r') as f:\n", " for line in f:\n", " print(line[:-1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load .csv file as `TaggedSet`\n", "data = nl.io.load.csv(filename, columns=['id', 't', 'x', 'y', 'z'], delimiter='\\t', skip_header=1)\n", "\n", "# Check what we got\n", "for i, traj in enumerate(data):\n", " print(f\"traj #{i}\")\n", " print(traj[:])\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the call to [nl.io.load.csv()](../noctiluca.io.rst#noctiluca.io.load.csv) to understand how this works. We start from the end:\n", "\n", "+ ``load.csv()`` accepts any keywords that [numpy.genfromtxt()](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) understands, most importantly the last two in the call above, ``delimiter`` and ``skip_header``. The latter gives the number of lines at the start of the file to ignore; in this case we set it to 1, such that we start reading right at the data.\n", "+ But wait, that means we ignore the column names completely? Yes! ``load.csv()`` does not care what your columns are called (or whether they have names at all, for that matter). How to process the data is instead specified by the ``columns`` argument. This list assigns pre-defined identifiers to the columns, in the order in which they appear in the file.\n", "+ two of those identifiers are mandatory: ``'id'`` should be a column that has a unique entry for each trajectory, identifying which localization belongs to which trajectory; ``'t'`` should be a column containing integer frame numbers. Note that missing frames (above: frame 9 in trajectory 5) are simply patched with ``np.nan``.\n", "+ beyond those two, possible identifiers are ``x, y, z, x2, y2, z2`` for spatial coordinates of trajectories with up to three spatial dimensions and two particles; and ``None``, which indicates that the corresponding column should be ignored\n", "+ finally, any string not recognized as one of the defined identifiers means that the data from the corresponding column will be written to the ``meta``-dict of the corresponding trajectory, with that string as key. Refer to [nl.io.load.csv()](../noctiluca.io.rst#noctiluca.io.load.csv) for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Writing ``.csv`` files is straight-forward:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = \"example_data_out.csv\"\n", "nl.io.write.csv(data, filename)\n", "\n", "# Let's see what that produced\n", "print(f\"Contents of {filename}:\\n\")\n", "with open(filename, 'r') as f:\n", " for line in f:\n", " print(line[:-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that when loading data we do not keep track of the \"real\" trajectory ID, or frame numbers. Therefore, when writing back to file, these are just indexed starting at zero." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HDF5 (.h5 / .hdf5) files\n", "[HDF5](https://www.hdfgroup.org/solutions/hdf5/) is a binary, tree-structured file format that is optimized for storage of large data structures, specifically numerical arrays. ``noctiluca`` implements an interface to store ``TaggedSet`` objects in HDF5 files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# still using the dataset from above\n", "filename = \"example_data_out.h5\"\n", "nl.io.write.hdf5(data, filename)\n", "\n", "# Let's check what's in there\n", "print(nl.io.hdf5.ls(filename))\n", "print(nl.io.hdf5.ls(filename, '/_data'))\n", "print(nl.io.hdf5.ls(filename, '/_data/0'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure of HDF5 files is similar to a file system: data is organized in \"groups\" that can contain \"datasets\" (i.e. numerical arrays, indicated by ``[]`` in the above output), \"attributes\" (meta data, indicated by ``{}``), or subgroups. You can parse the file with [nl.io.hdf5.ls()](../noctiluca.io.rst#noctiluca.io.hdf5.ls) and inspect its structure, starting from the root group ``'/'``.\n", "\n", "Internally, ``noctiluca`` uses the [h5py](https://www.h5py.org/) package to handle HDF5 files. See there for more details on HDF5 and how to work with it in python (if you want/need to go beyond the base functionality exposed by ``noctiluca``).\n", "\n", "In the above example, the root group directly contains the ``TaggedSet`` we wrote to the file. It is often useful to write some documentation of what the file contains, like so:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nl.io.write.hdf5({}, filename) # clear all file contents\n", "nl.io.write.hdf5(data, filename, group='data') # write actual data\n", "nl.io.write.hdf5(\"\"\"\n", "This is an example file, showing how to write HDF5 files. It contains the following:\n", "+ 'data': sample `TaggedSet` used in the demonstration\n", "\"\"\"[1:-1], filename, group='info') # write a comment telling people what's in the file\n", "\n", "# So now what does this look like?\n", "print(nl.io.hdf5.ls(filename))\n", "print()\n", "print(nl.io.load.hdf5(filename, group='info'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notes on the above snippet:\n", "\n", "+ HDF5 files can be written incrementally; we first write the data, then add the description, and below we will add to that\n", "+ when ``group`` is not specified, like in the first line, the existing file is overwritten. Otherwise the specified group is added; this might overwrite existing entries, but will not delete anything else\n", "+ [nl.io.load.hdf5()](../noctiluca.io.rst#noctiluca.io.load.hdf5) can be used to load any data from an HDF5 file, not just ``TaggedSet`` objects. Note the omnipresent attribute ``_HDF5_ORIG_TYPE_`` which tells the loader what data type the data in this group should be loaded as.\n", "\n", "HDF5 files also support random access, which plays well with the [selection machinery](../noctiluca.rst#noctiluca.taggedset.TaggedSet.makeSelection) of ``TaggedSet``. Specifically, you can save references to a subset of your data: if you have a big data set out of which you routinely need only a specific part, you can use [write.hdf5_subTaggedSet()](../noctiluca.io.rst#noctiluca.io.write.hdf5_subTaggedSet):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save a specific part of a data set as directly loadable `TaggedSet`\n", "data.makeSelection(lambda traj, _: len(traj) > 2) # some selection on the \"big data set\"\n", "nl.io.write.hdf5_subTaggedSet(data, filename,\n", " group='important data', # where to store this subset\n", " refTaggedSet='data', # where the big data set is stored in the file\n", " )\n", "\n", "# Don't forget to update the description!\n", "nl.io.write.hdf5(nl.io.load.hdf5(filename, 'info') + \"\"\"\n", "+ 'important data': subset of 'data' that is very important in its own right\n", "\"\"\"[:-1], filename, 'info')\n", "\n", "# Check what we have\n", "print(nl.io.hdf5.ls(filename))\n", "print()\n", "print(nl.io.load.hdf5(filename, group='info'))\n", "print()\n", "\n", "# 'important data' is now lloadable as `TaggedSet` by itself, and contains only the specified data\n", "important_data = nl.io.load.hdf5(filename, 'important data')\n", "print(type(important_data))\n", "important_data.makeSelection() # just to demonstrate that there is no selection active here\n", "print(f\"len(important_data) = {len(important_data)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other formats\n", "Beyond ``.csv`` and HDF5, currently ``noctiluca`` supports writing MATLAB ``.mat`` files, though of course we recommend using non-proprietary formats. If you have a favorite format that is currently not supported, feel free to [submit an issue on GitHub](https://github.com/OpenTrajectoryAnalysis/noctiluca/issues) or implement it yourself and submit a pull request." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Input from memory\n", "If you have your data available in form of some python object (e.g. a ``pandas.DataFrame`` or ``numpy.ndarray``), you can convert it to a ``TaggedSet`` (or ``Trajectory``) using the [make_TaggedSet()](../noctiluca.util.rst#noctiluca.util.userinput.make_TaggedSet) or [make_Trajectory()](../noctiluca.util.rst#noctiluca.util.userinput.make_Trajectory) functions. For example you can circumvent ``nl.io.load.csv()`` by using ``pandas.read_csv()`` and then converting to ``TaggedSet``:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "filename = \"example_data_in.csv\" # reuse the file we wrote above in the .csv example\n", "df = pd.read_csv(filename, delimiter='\\t')\n", "data = nl.make_TaggedSet(df,\n", " id_column='trajid',\n", " t_column='frame',\n", " pos_columns=['pos_x', 'pos_y', 'pos_z'],\n", " )\n", "\n", "# Resulting `TaggedSet` is the same as reading directly from .csv (compare above)\n", "for i, traj in enumerate(data):\n", " print(f\"traj #{i}\")\n", " print(traj[:])\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See [nl.util.userinput](../noctiluca.util.rst#module-noctiluca.util.userinput) for more details.\n", "\n", "The main purpose of ``make_TaggedSet()`` and ``make_Trajectory()`` is actually to provide downstream analysis libraries with an easy way to accept data in non-``noctiluca`` formats, such that they are more independent. For examples see\n", "\n", "+ [bayesmsd](https://github.com/OpenTrajectoryAnalysis/bayesmsd)\n", "+ [bild](https://github.com/OpenTrajectoryAnalysis/bild)\n", "\n", "which are written such that for a first pass the user does not have to be familiar with ``noctiluca``." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }