.. _data_specs: Data specifications, spaces, and sources ======================================== Data specifications, often called ``data_specs``, are used as a specification for requesting and providing data in a certain format across different parts of a Pylearn2 experiment. A ``data_specs`` is a ``(space, source)`` pair, where ``source`` is an identifier or the data source or sources required (for instance, inputs and targets), and ``space`` is an instance of :class:`.space.Space` representing the *format* of these data (for instance, a vector, a 3D tensor representing an RGB image, or a one-hot vector). The main use of ``data_specs`` is to request data from a :class:`.datasets.Dataset` object, via an iterator. Various objects can request data this way: models, costs, monitoring channels, training algorithms, even some datasets that perform a transformation on data. The ``Space`` object ==================== A ``Space`` represents a way in which a mini-batch of data can be formatted. For instance, a batch of RGB images (each of shape ``(rows, columns)``) can be represented in different ways, for instance: - as a matrix where each row corresponds to a different image, and is of length ``rows * columns * 3``: the corresponding space would be a :class:`.space.VectorSpace`, more precisely ``VectorSpace(dim=(rows * columns * 3))``; - as a 4-dimensional tensor, where rows, columns, and channels (here: red, green, and blue) are different axes: the corresponding space would be a :class:`.space.Conv2DSpace`. Theano convolutions prefer that tensor to have shape ``(batch_size, channels, rows, columns)``, which corresponds to ``Conv2DSpace(shape=(rows, columns), num_channels=3, axes=('b', 'c', 0, 1))``; - as a 4-dimensional tensors with a different shape: for instance, cuda-convnet prefers ``(channels, rows, columns, batch_size)``: the space would be ``Conv2DSpace(shape=(rows, columns), num_channels=3, axes=('c', 0, 1, 'b'))``. Spaces can be either elementary, representing one mini-batch from one source of data, such as ``VectorSpace`` and ``Conv2DSpace`` mentioned above, or *composite* (:class:`.space.CompositeSpace`), representing the aggregation of several sources of data (some of these may in turn be aggregations of sources). A mini-batch for an elementary space will usually be a NumPy ``ndarray``, whereas a mini-batch for a ``CompositeSpace`` will be a Python tuple of elementary (or composite) mini-batches. Notable methods of the :class:`Space` class are: - :meth:`Space.make_theano_batch`: creates a Theano Variable (or tuple of Theano Variable in the case of ``CompositeSpace``) representing a *symbolic* mini-batch of data. For instance, ``VectorSpace(...).make_theano_batch(...)`` will essentially call ``theano.tensor.matrix()``. - :meth:`Space.validate(batch)` will check that symbolic variable ``batch`` can correctly represent a mini-batch of data for the corresponding space. For instance, ``VectorSpace(...).validate(theano.tensor.matrix())`` will work, but ``VectorSpace(...).validate(theano.tensor.vector())`` will raise an exception. - :meth:`Space.np_validate(batch)` (where ``np`` stands for NumPy) is similar, but operates on a mini-batch of numeric data, rather than on a symbolic variable. This enables more checks to be performed. For instance, ``VectorSpace(dim=3).validate(np.zeros((4, 3)))`` will work, because it correctly describes a mini-batch of 4 samples of dimension 3, but ``VectorSpace(dim=4).validate(np.zeros((4, 3)))`` will raise an exception. - :meth:`Space.format_as(batch, space)` and :meth:`Space.np_format_as(batch, space)` are the way we can convert data from their original space into the destination ``space``. ``format_as`` operates on a symbolic ``batch``, and returns a symbolic expression of the newly-formatted data, whereas ``np_format_as`` operates on a numeric batch, and returns numeric data. This formatting can happen between different instances of the same ``Space`` class, for instance, converting between two instances of ``Conv2DSpace`` with different ``axes`` amounts to correctly transpose the ``batch``. It can also happen between different subclasses of ``Space``, for instance, converting between a ``VectorSpace`` and ``Conv2DSpace`` of compatible shape involves reshaping and transposition of the data. Sources ======= Sources are simple identifiers that specify *which* data should be returned, whereas spaces specify *how* that data should be formatted. An elementary source is identified by a Python string. For instance, the most used sources are ``'features'`` and ``'targets'``. ``'features'`` usually denotes the part of the data that models will use as input, and ``'targets'``, for labeled datasets, contains the value the model will try to predict. However, this is only a convention, and some datasets will declare other sources, that can be used in varying ways by models, for instance when using multi-modal data. A composite source is identified by a tuple of sources. For instance, to request features and targets from a dataset, the `source` would be ``('features', 'targets')``. Structure of data specifications ================================ When using data specifications ``data_specs=(space, source)``, ``space`` and ``source`` have to have the same *structure*. This means that: - if ``space`` is an elementary space, then ``source`` has to be an elementary source, i.e., a string; - if ``space`` is a composite space, then ``source`` has to be a composite source (a tuple), with exactly as many components as the number of sub-spaces of ``space``; and the corresponding sub-sources and sub-spaces again have to have the same *structure*. For example, let us define the following spaces: .. code-block:: python input_vecspace = VectorSpace(dim=(32 * 32 * 3)) input_convspace = Conv2DSpace(shape=(32, 32), num_channels=3, axes=('b', 'c', 0, 1)) target_space = VectorSpace(dim=10) and suppose ``"features"`` and ``"targets"`` are sources present in our data. Then, the following data_specs are correct: - ``(input_vecspace, "features")``: only the features, mini-batches will be matrices; - ``(input_convspace, "features")``: only the features, mini-batches will be 4-D tensors; - ``(target_space, "targets")``: only the targets, mini-batches will be matrices; - ``(CompositeSpace((input_vecspace, target_space)), ("features", "targets"))``: features and targets, in that order; mini-batches will be (matrix, matrix) pairs; - ``(CompositeSpace((target_space, input_convspace)), ("targets", "features"))``: targets and features, in that order; mini-batches will be (matrix, 4-D tensor) pairs; - ``(CompositeSpace((input_vecspace, input_vecspace, input_vecspace, target_space)), ("features", "features", "features", "targets"))``: features repeated 3 times, then targets; mini-batches will be (matrix, matrix, matrix, matrix) tuples; - ``(CompositeSpace((CompositeSpace((input_vecspace, input_vecspace, input_vecspace)), target_space)), (("features", "features", "features"), "targets"))``: same as above, but the repeated features are in another CompositeSpace; mini-batches will be ((matrix, matrix, matrix), matrix) pairs with the first element being a triplet. The following ones are **incorrect**: - ``(target_vecspace, "features")``: it will not crash immediately, but as soon as actual data are used, it will crash because feature data will have a width of 32 * 32 * 3 = 3072, but ``target_vecspace.dim`` is 10; - ``(CompositeSpace((input_vecspace, input_convspace)), "features")``: the ``source`` part has to have as many elements as there are sub-spaces of the ``CompositeSpace``, but ``"features"`` is not a pair. You would need to write ``(CompositeSpace((input_vecspace, input_convspace)), ("features", "features"))``; - ``(CompositeSpace((input_vecspace,)), "features")``: the ``source`` part should be a tuple of length 1, not a string. You would need to write ``(CompositeSpace((input_vecspace,)), ("features",))``; - ``(CompositeSpace((input_vecspace, input_vecspace, input_vecspace, target_space)), (("features", "features", "features"), "targets"))``: even if the total number of elementary spaces and elementary sources match, their *structure* do not: the sub-spaces are in a flat tuple of length 4, the sources are in a nested tuple; - ``(CompositeSpace((CompositeSpace((input_vecspace, input_vecspace, input_vecspace)), target_space)), ("features", "features", "features", "targets"))``: it is the same problem, the other way around. Examples of use =============== Here are some examples of how data specifications are currently used in different Pylearn2 objects. The big picture --------------- The ``TrainingAlgorithm`` object (for instance ``DefaultTrainingAlgorithm``, or ``SGD``) is usually the one requesting the data_specs from the various objects defined in an experiment script (model, costs, monitor channels), combines them in one nested data_specs, flattens it, requests iterators from the datasets, iterates over the dataset, converting back the flat version of the data so it can be correctly dispatched between all the objects requiring data. Input of a model ---------------- A Model object used in an experiment has to declare its input source and space, so the right data will be provided to it by the dataset iterator, in the appropriate format. This is done by the methods :meth:`.models.Model.get_input_source()` and :meth:`.models.Model.get_input_space()`. By default, most models will simply use ``"features"`` as input source, but that could be changed for an experiment where the user wants to apply the model on a different source of the dataset, or on a dataset where sources are named differently. Models that do not care for the topology of the input will use a ``VectorSpace`` as input space, whereas convolutional models, for instance, will use an instance of ``Conv2DSpace``. Models also declare an output space, which can be useful for the cost, for instance, or for other objects that can use or embed a model. Input of a cost --------------- A Cost object needs to implement the :meth:`.costs.Cost.get_data_specs(self, model)` method, which will be used to determine which data (and format) will be passed as the ``data`` argument of :meth:`.costs.Cost.expr(self, model, data)` and :meth:`.costs.Cost.get_gradients(self, model, data)`. Example 1: cost without data ++++++++++++++++++++++++++++ For instance, a cost that does not depend on data at all, but only on the model parameters, like an L1 regularization penalty, would typically use ``(NullSpace(), '')`` for data specifications, and ``expr`` would be passed ``data=None``. Example 2: unsupervised cost ++++++++++++++++++++++++++++ An unsupervised cost, that uses only unlabeled features, and not targets, will usually use ``(model.get_input_space(), model.get_input_source())``, so the ``data`` passed to ``expr`` will directly be usable by the model. Example 3: supervised cost ++++++++++++++++++++++++++ Finally, a supervised cost, needing both features and targets, will usually request the targets to be in the same space as the model's predictions (the model's output space): .. code-block:: python def get_data_specs(self, model): return (CompositeSpace((model.get_input_space(), model.get_output_space())), (model.get_input_source(), "targets")) Then, ``data`` would be a pair, the first element of which can be passed directly to the model. Of course, it does not have to be implemented that way, and the following is as correct (if more confusing) if you prefer having ``data`` be a (targets, inputs) pair instead: .. code-block:: python def get_data_specs(self, model): return (CompositeSpace((model.get_output_space(), model.get_input_space())), ("targets", model.get_input_source())) Input of a monitoring channel ----------------------------- As for costs used for training, variables monitored by MonitorChannels have to declare data specs corresponding to the input variables necessary to compute the monitored value. It is passed directly to the constructor, for instance, when calling: .. code-block:: python channel = MonitorChannel( graph_inputs=input_variables, val=monitored_value, name='channel_name', data_specs=data_specs, dataset=dataset) ``data_specs`` describe the format and semantics of ``input_variables``. As in the previous section, if ``val`` does not need any input data, for instance if it is a shared variable, ``data_specs`` will be ``(NullSpace(), '')``. If ``val`` corresponds to an unsupervised cost, or quantity depending only on the ``"features"`` source, ``data_specs`` could be ``(VectorSpace(...), "features")``, etc. For monitored values defined in :meth:`.models.Model.get_monitoring_channels(self, data)`, the data_specs of ``data``, which are also the ``data_specs`` to pass to MonitorChannel's constructor, are returned by a call to :meth:`.models.Model.get_monitoring_channels_data(self)`. Nesting and flattening data_specs --------------------------------- In order to avoid duplicating data and creating lots of symbolic inputs to Theano functions (which also do not support nested arguments), it can be useful to convert a nested, composite data_specs into a flat, non-redundant one. That *flat* data_specs can be used to create theano variables or get mini-batches of data, for instance, which are then *nested* back into the original *structure* of the data_specs. We use the :class:`.utils.data_specs.DataSpecsMapping` class to build a *mapping* between the original, nested data specs, and the flat one. For instance, using the spaces defined earlier: .. code-block:: python source = ("features", ("features", "targets")) space = CompositeSpace((input_vecspace, CompositeSpace((input_convspace, target_space)))) mapping = DataSpecsMapping((space, source)) flat_source = mapping.flatten(source) # flat_source == ('features', 'features', 'targets') flat_space = mapping.flatten(space) # flat_space == (input_vecspace, input_convspace, target_space) # We can use the mapping the other way around nested_source = mapping.nest(flat_source) assert source == flat_source nested_space = mapping.nest(flat_space) assert space == flat_space # We can also nest other things print mapping.nest((1, 2, 3)) # (1, (2, 3)) Here, ``'features'`` appear twice in the flat source, that is because the corresponding space is different. However, if there is an actual duplicate, it gets removed: .. code-block:: python source = (("features", "targets"), ("features", "targets")) space = CompositeSpace((CompositeSpace((input_vecspace, target_space)), CompositeSpace((input_vecspace, target_space)))) mapping = DataSpecsMapping((space, source)) flat_source = mapping.flatten(source) # flat_source == ('features', 'targets') flat_space = mapping.flatten(space) # flat_space == (input_vecspace, target_space) # We can use the mapping the other way around nested_source = mapping.nest(flat_source) assert source == flat_source nested_space = mapping.nest(flat_space) assert space == flat_space # We can also nest other things print mapping.nest((1, 2)) # ((1, 2), (1, 2)) The flat tuple of spaces can be used to create non-redundant Theano input variables, which will be nested back to be dispatched between the different components having requested them: .. code-block:: python # From the block above: # flat_space == (input_vecspace, target_space) flat_composite_space = CompositeSpace(flat_space) flat_inputs = flat_composite_space.make_theano_variables(name='input') print flat_inputs # (input[0], input[1]) # We can use the mapping to nest the theano variables nested_inputs = mapping.nest(theano_inputs) print nested_inputs # ((input[0], input[1]), (input[0], input[1])) # Then, we can build expressions from these input variables. # Finally, a Theano function will be compiled with f = theano.function(flat_inputs, outputs, ...) # A dataset iterator can also be created from the flat composite space it = my_dataset.iterator(..., data_specs=(flat_composite_space, flat_source)) # When it is time to call f on data, we can then do for flat_data in it: out = f(*flat_data)