Data specifications, often called data_specs, are used as a specification for requesting and providing data in a certain format across different parts of a Pylearn2 experiment.
A data_specs is a (space, source) pair, where source is an identifier of the data source or sources required (for instance, inputs and targets), and space is an instance of space.Space representing the format of these data (for instance, a vector, a 3D tensor representing an RGB image, or a one-hot vector).
The main use of data_specs is to request data from a datasets.Dataset object, via an iterator. Various objects can request data this way: models, costs, monitoring channels, training algorithms, even some datasets that perform a transformation on data.
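A Space represents a way in which a mini-batch of data can be formatted. For example, a batch of RGB images (each of shape (rows, columns)) can be represented in different ways, such as a matrix in which each row holds one flattened image (a VectorSpace), or a 4D tensor with batch, channel, row, and column axes (a Conv2DSpace).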
Spaces can be either elementary, representing one mini-batch from one source of data, such as VectorSpace and Conv2DSpace mentioned above, or composite (space.CompositeSpace), representing the aggregation of several sources of data (some of these may in turn be aggregations of sources). A mini-batch for an elementary space will usually be a NumPy ndarray, whereas a mini-batch for a CompositeSpace will be a Python tuple of elementary (or composite) mini-batches.
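To make the two layouts concrete, here is a minimal sketch in plain Python that only computes the shape a mini-batch would have in each representation. The helper names vector_batch_shape and conv2d_batch_shape are hypothetical and not part of Pylearn2; the axis labels ('b', 'c', 0, 1) follow the convention used for Conv2DSpace elsewhere on this page.

```python
def vector_batch_shape(batch_size, rows, cols, channels):
    """Shape of a batch stored as one flattened row per example."""
    return (batch_size, rows * cols * channels)


def conv2d_batch_shape(batch_size, rows, cols, channels,
                       axes=('b', 'c', 0, 1)):
    """Shape of a batch stored as a 4D tensor with the given axis order."""
    # 'b' is the batch axis, 'c' the channel axis, 0 and 1 the image axes.
    size_of = {'b': batch_size, 'c': channels, 0: rows, 1: cols}
    return tuple(size_of[axis] for axis in axes)


# 100 RGB images of 32 x 32 pixels, in three possible layouts.
vec_shape = vector_batch_shape(100, 32, 32, 3)
conv_shape = conv2d_batch_shape(100, 32, 32, 3)
alt_shape = conv2d_batch_shape(100, 32, 32, 3, axes=('b', 0, 1, 'c'))

# A composite mini-batch is a tuple of such elementary mini-batches;
# here it is imitated by a tuple of shapes (inputs, one-hot targets).
composite_shapes = (vec_shape, (100, 10))
```

The same data can therefore be requested in whichever layout suits the consumer, which is what the space half of a data_specs expresses.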
Notable methods of the Space class are:
Sources are simple identifiers that specify which data should be returned, whereas spaces specify how that data should be formatted.
An elementary source is identified by a Python string. For instance, the most used sources are 'features' and 'targets'. 'features' usually denotes the part of the data that models will use as input, and 'targets', for labeled datasets, contains the value the model will try to predict. However, this is only a convention, and some datasets will declare other sources that can be used in varying ways by models, for instance when using multi-modal data.
A composite source is identified by a tuple of sources. For instance, to request features and targets from a dataset, the source would be ('features', 'targets').
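When using data specifications data_specs=(space, source), space and source must have the same structure. This means that an elementary space is paired with a single string source, and a CompositeSpace is paired with a tuple of sources of the same length, whose components match the sub-spaces in the same way.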
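The requirement that space and source share the same structure can be checked recursively. The following is a minimal sketch, not Pylearn2 code: ElementarySpace, Composite, and same_structure are hypothetical stand-ins that only mimic the nesting of real Space objects.

```python
class ElementarySpace(object):
    """Stand-in for an elementary space such as a vector space."""

    def __init__(self, label):
        self.label = label


class Composite(object):
    """Stand-in for a composite space holding several sub-spaces."""

    def __init__(self, components):
        self.components = tuple(components)


def same_structure(space, source):
    """Return True if source is nested the same way as space."""
    if isinstance(space, Composite):
        # A composite space needs a tuple of the same length,
        # and every component has to match recursively.
        return (isinstance(source, tuple)
                and len(source) == len(space.components)
                and all(same_structure(sub_space, sub_source)
                        for sub_space, sub_source
                        in zip(space.components, source)))
    # An elementary space needs a single string identifier.
    return isinstance(source, str)


vec = ElementarySpace('vector')
conv = ElementarySpace('conv2d')
target = ElementarySpace('target')

ok_flat = same_structure(Composite((vec, target)), ('features', 'targets'))
ok_nested = same_structure(Composite((vec, Composite((conv, target)))),
                           ('features', ('features', 'targets')))
bad_missing_tuple = same_structure(Composite((vec, target)), 'features')
bad_extra_tuple = same_structure(vec, ('features', 'targets'))
bad_length = same_structure(Composite((vec, target)),
                            ('features', 'targets', 'extra'))
```

In this sketch the first two checks succeed and the last three fail, because a composite is paired with a plain string, an elementary space with a tuple, or the lengths differ.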
For example, let us define the following spaces:
from pylearn2.space import VectorSpace, Conv2DSpace

input_vecspace = VectorSpace(dim=(32 * 32 * 3))
input_convspace = Conv2DSpace(shape=(32, 32), num_channels=3,
                              axes=('b', 'c', 0, 1))
target_space = VectorSpace(dim=10)
and suppose "features" and "targets" are sources present in our data. Then, the following data_specs are correct:
The following ones are incorrect:
Here are some examples of how data specifications are currently used in different Pylearn2 objects.
The TrainingAlgorithm object (for instance DefaultTrainingAlgorithm, or SGD) is usually the one that requests data_specs from the various objects defined in an experiment script (model, costs, monitor channels). It combines them into one nested data_specs, flattens it, requests iterators from the datasets, and iterates over the data, converting each flat mini-batch back to the nested structure so it can be dispatched correctly to all the objects requiring data.
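The flow described above (collect requests, fetch each source once, dispatch to every consumer) can be illustrated with a toy loop. This is only a sketch under simplifying assumptions: it ignores spaces and format conversion entirely, and the names requests, toy_dataset, unique_sources, and dispatch are hypothetical rather than Pylearn2 APIs.

```python
# Hypothetical requests: each consumer names the source(s) it needs.
requests = {
    'model': 'features',
    'cost': ('features', 'targets'),
    'monitor': ('features', 'targets'),
}

# Toy dataset: each source maps to a list of mini-batches.
toy_dataset = {
    'features': [[0.1, 0.2], [0.3, 0.4]],
    'targets': [[1], [0]],
}


def unique_sources(requests):
    """Collect every requested source once, in a deterministic order."""
    flat = []
    for name in sorted(requests):
        wanted = requests[name]
        for source in ((wanted,) if isinstance(wanted, str) else wanted):
            if source not in flat:
                flat.append(source)
    return tuple(flat)


def dispatch(flat_sources, flat_batch, wanted):
    """Rebuild the data one consumer asked for from the flat batch."""
    lookup = dict(zip(flat_sources, flat_batch))
    if isinstance(wanted, str):
        return lookup[wanted]
    return tuple(lookup[source] for source in wanted)


flat_sources = unique_sources(requests)
received = []
# Iterate once over the data, fetching each unique source a single time.
for flat_batch in zip(*(toy_dataset[s] for s in flat_sources)):
    received.append(dict((name, dispatch(flat_sources, flat_batch, wanted))
                         for name, wanted in requests.items()))
```

Each entry of received holds, for one mini-batch, the data every consumer asked for, even though 'features' was only read once per batch.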
A Model object used in an experiment has to declare its input source and space, so the right data will be provided to it by the dataset iterator, in the appropriate format. This is done by the methods models.Model.get_input_source() and models.Model.get_input_space().
By default, most models will simply use "features" as input source, but that could be changed for an experiment where the user wants to apply the model on a different source of the dataset, or on a dataset where sources are named differently.
Models that do not care for the topology of the input will use a VectorSpace as input space, whereas convolutional models, for instance, will use an instance of Conv2DSpace.
Models also declare an output space, which can be useful for the cost, for instance, or for other objects that can use or embed a model.
A Cost object needs to implement the costs.Cost.get_data_specs(self, model) method, which determines which data (and in which format) will be passed as the data argument of costs.Cost.expr(self, model, data) and costs.Cost.get_gradients(self, model, data).
For instance, a cost that does not depend on data at all, but only on the model parameters, like an L1 regularization penalty, would typically use (NullSpace(), '') for data specifications, and expr would be passed data=None.
An unsupervised cost, that uses only unlabeled features, and not targets, will usually use (model.get_input_space(), model.get_input_source()), so the data passed to expr will directly be usable by the model.
Finally, a supervised cost, needing both features and targets, will usually request the targets to be in the same space as the model’s predictions (the model’s output space):
def get_data_specs(self, model):
    return (CompositeSpace((model.get_input_space(),
                            model.get_output_space())),
            (model.get_input_source(),
             "targets"))
Then, data would be a pair, the first element of which can be passed directly to the model.
Of course, it does not have to be implemented that way, and the following is equally correct (if more confusing) if you prefer having data be a (targets, inputs) pair instead:
def get_data_specs(self, model):
    return (CompositeSpace((model.get_output_space(),
                            model.get_input_space())),
            ("targets",
             model.get_input_source()))
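Whichever order get_data_specs declares, the cost has to unpack data in that same order. The sketch below illustrates this with plain Python lists standing in for Theano variables; squared_error, squared_error_targets_first, and double are hypothetical helpers written for this illustration, not Pylearn2 functions.

```python
def squared_error(predict, data):
    """Cost whose data_specs source is (input_source, 'targets')."""
    inputs, targets = data
    predictions = [predict(x) for x in inputs]
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return total / len(targets)


def squared_error_targets_first(predict, data):
    """Same cost, for data_specs whose source is ('targets', input_source)."""
    targets, inputs = data
    return squared_error(predict, (inputs, targets))


def double(x):
    """Toy model: predicts twice its input."""
    return 2.0 * x


inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 7.0]

cost_a = squared_error(double, (inputs, targets))
cost_b = squared_error_targets_first(double, (targets, inputs))
```

Both calls compute the same value; only the order in which the pair is built and unpacked changes, which is why the order of the source tuple and of the CompositeSpace components must agree.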
As with costs used for training, variables monitored by MonitorChannels have to declare data specs corresponding to the input variables necessary to compute the monitored value. These data specs are passed directly to the constructor, for instance, when calling:
channel = MonitorChannel(
    graph_inputs=input_variables,
    val=monitored_value,
    name='channel_name',
    data_specs=data_specs,
    dataset=dataset)
data_specs describe the format and semantics of input_variables.
As in the previous section, if val does not need any input data, for instance if it is a shared variable, data_specs will be (NullSpace(), ''). If val corresponds to an unsupervised cost, or quantity depending only on the "features" source, data_specs could be (VectorSpace(...), "features"), etc.
For monitored values defined in models.Model.get_monitoring_channels(self, data), the data_specs of data, which are also the data_specs to pass to MonitorChannel's constructor, are returned by a call to models.Model.get_monitoring_data_specs(self).
In order to avoid duplicating data and creating lots of symbolic inputs to Theano functions (which also do not support nested arguments), it can be useful to convert a nested, composite data_specs into a flat, non-redundant one. That flat data_specs can be used to create Theano variables or get mini-batches of data, for instance, which are then nested back into the original structure of the data_specs.
We use the utils.data_specs.DataSpecsMapping class to build a mapping between the original, nested data specs, and the flat one. For instance, using the spaces defined earlier:
from pylearn2.space import CompositeSpace
from pylearn2.utils.data_specs import DataSpecsMapping

source = ("features", ("features", "targets"))
space = CompositeSpace((input_vecspace,
                        CompositeSpace((input_convspace,
                                        target_space))))
mapping = DataSpecsMapping((space, source))
flat_source = mapping.flatten(source)
# flat_source == ('features', 'features', 'targets')
flat_space = mapping.flatten(space)
# flat_space == (input_vecspace, input_convspace, target_space)
# We can use the mapping the other way around
nested_source = mapping.nest(flat_source)
assert source == nested_source
nested_space = mapping.nest(flat_space)
assert space == nested_space
# We can also nest other things
print(mapping.nest((1, 2, 3)))
# (1, (2, 3))
Here, 'features' appears twice in the flat source because the two corresponding spaces are different. However, if there is an actual duplicate, it gets removed:
source = (("features", "targets"), ("features", "targets"))
space = CompositeSpace((CompositeSpace((input_vecspace, target_space)),
                        CompositeSpace((input_vecspace, target_space))))
mapping = DataSpecsMapping((space, source))
flat_source = mapping.flatten(source)
# flat_source == ('features', 'targets')
flat_space = mapping.flatten(space)
# flat_space == (input_vecspace, target_space)
# We can use the mapping the other way around
nested_source = mapping.nest(flat_source)
assert source == nested_source
nested_space = mapping.nest(flat_space)
assert space == nested_space
# We can also nest other things
print(mapping.nest((1, 2)))
# ((1, 2), (1, 2))
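To show how such a mapping might deduplicate (space, source) pairs, here is a small self-contained sketch. It is not the DataSpecsMapping implementation: TinyMapping is a hypothetical class, spaces are imitated by strings, and composite spaces by plain tuples, so that it runs without Pylearn2.

```python
class TinyMapping(object):
    """Illustrative nested <-> flat mapping (not Pylearn2 code)."""

    def __init__(self, space, source):
        self.keys = []  # unique (space, source) pairs, first-seen order
        self.layout = self._build(space, source)

    def _build(self, space, source):
        # Record, for each leaf, the index of its unique pair.
        if isinstance(source, tuple):
            return tuple(self._build(sub_space, sub_source)
                         for sub_space, sub_source in zip(space, source))
        key = (space, source)
        if key not in self.keys:
            self.keys.append(key)
        return self.keys.index(key)

    def flatten(self, nested):
        """Return a flat tuple with one entry per unique pair."""
        flat = [None] * len(self.keys)
        self._fill(nested, self.layout, flat)
        return tuple(flat)

    def _fill(self, nested, layout, flat):
        if isinstance(layout, tuple):
            for sub_nested, sub_layout in zip(nested, layout):
                self._fill(sub_nested, sub_layout, flat)
        else:
            flat[layout] = nested

    def nest(self, flat, layout=None):
        """Rebuild the original nesting from a flat tuple."""
        layout = self.layout if layout is None else layout
        if isinstance(layout, tuple):
            return tuple(self.nest(flat, sub) for sub in layout)
        return flat[layout]


# Same source requested in two different spaces: both are kept.
first = TinyMapping(('vec', ('conv', 'target')),
                    ('features', ('features', 'targets')))
first_flat = first.flatten(('features', ('features', 'targets')))
first_nested = first.nest((1, 2, 3))

# Identical (space, source) pairs: duplicates collapse.
second = TinyMapping((('vec', 'target'), ('vec', 'target')),
                     (('features', 'targets'), ('features', 'targets')))
second_flat = second.flatten((('features', 'targets'),
                              ('features', 'targets')))
second_nested = second.nest((1, 2))
```

The key design choice in this sketch is that the deduplication key is the (space, source) pair, not the source alone, which mirrors why 'features' survives twice in the first example but not in the second.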
The flat tuple of spaces can be used to create non-redundant Theano input variables, which will be nested back to be dispatched between the different components having requested them:
# From the block above:
# flat_space == (input_vecspace, target_space)
flat_composite_space = CompositeSpace(flat_space)
flat_inputs = flat_composite_space.make_theano_batch(name='input')
print(flat_inputs)
# (input[0], input[1])
# We can use the mapping to nest the Theano variables
nested_inputs = mapping.nest(flat_inputs)
print(nested_inputs)
# ((input[0], input[1]), (input[0], input[1]))
# Then, we can build expressions from these input variables.
# Finally, a Theano function will be compiled with
f = theano.function(flat_inputs, outputs, ...)
# A dataset iterator can also be created from the flat composite space
it = my_dataset.iterator(..., data_specs=(flat_composite_space, flat_source))
# When it is time to call f on data, we can then do
for flat_data in it:
    out = f(*flat_data)