Create Predefined Custom Data Models for your datasets

This tutorial will teach you how to define classes derived from SampleData, in order to create datasets with an automatically generated data model that is tailored to a specific need.

I - SampleData derived classes

The SampleData class lets you create and interact with complex HDF5 datasets. New datasets are created empty, and can be constructed freely according to the needs of the user. When the class is used to work with many datasets that should share the same internal organization and content, users have to rebuild this internal data model for each new dataset. In addition, to define scripts or classes that batch process data items found in each of these datasets, they have to make sure that the names and/or paths of these items are identical in all datasets.

These considerations show that the automatic generation of a non-empty, specific data model is a useful addition to the features of SampleData. For that purpose, the class implements two simple mechanisms through class inheritance, which are the subject of the present tutorial.

Custom Data Model

The SampleData class defines a minimal data model that structures every dataset it creates. This data model is an organized collection of data item indexnames, paths and types, provided via two dictionaries:

minimal_content_index_dic: the path of each data item in the data model
minimal_content_type_dic: the type of each data item in the data model

The content index dictionary

Each item of this dictionary defines a data item of the data model. Its key will be the indexname given to the data item in the dataset, and the item value must be a string giving a valid path for the data item in the dataset. When a dataset is created, the class will automatically create a data item for each key of this dictionary, and set its path in the dataset with the associated value in the dictionary.

For the SampleData class, this dictionary is empty: no data model is prescribed. Hence, datasets that are created with SampleData are empty (they just have a Root Group, as explained in a previous tutorial). To create datasets with a prescribed data model, the idea is to implement a class derived from SampleData, with a non-empty minimal_content_index_dic that implements the desired data model.

This dictionary should hence look like this:

minimal_content_index_dic = {'item1': '/path_to_item1',
                             'item2': '/path_to_item1/path_to_item2',
                             'item3': '/path_to_item3',
                             'item4': '/path_to_item1/path_to_item4',
                             '...': '...',}

An item of the form 'wrongitem': '/undeclared_item/path_to_wrong_item' would not be valid, because its parent /undeclared_item is not declared in the dictionary.

The dictionary example just above would lead to the creation of at least 4 data items, named item1, item2, item3 and item4, with items 1 and 3 directly attached to the dataset Root Group, and items 2 and 4 being children of item 1.
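The parent rule described above can be checked before the dictionary is handed to a class. The sketch below is plain Python and does not use SampleData. The helper name find_undeclared_parents is hypothetical, and it assumes that a path is valid only when its parent is the Root Group or another path declared in the same dictionary.

```python
import posixpath


def find_undeclared_parents(content_index):
    """Return {indexname: parent_path} for items whose parent is not declared.

    Assumption (not taken from the SampleData source): a path is valid when
    its parent is the Root Group '/' or a path declared in the dictionary.
    """
    declared = set(content_index.values())
    problems = {}
    for name, path in content_index.items():
        parent = posixpath.dirname(path)
        if parent != '/' and parent not in declared:
            problems[name] = parent
    return problems


index = {'item1': '/path_to_item1',
         'item2': '/path_to_item1/path_to_item2',
         'item3': '/path_to_item3',
         'item4': '/path_to_item1/path_to_item4'}

# Every parent is either '/' or '/path_to_item1': nothing to report.
valid_report = find_undeclared_parents(index)

# An item placed under an undeclared parent is flagged.
bad_index = dict(index, wrongitem='/undeclared_item/path_to_wrong_item')
invalid_report = find_undeclared_parents(bad_index)
print(invalid_report)
```

Running such a check on a candidate dictionary is a cheap way to catch a misspelled parent path before any file is created.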

The content type dictionary

The second dictionary must have the same keys as minimal_content_index_dic. Its values must be valid SampleData data item types. They prescribe the type of each data item that is automatically created, at dataset creation, with the name and path specified by minimal_content_index_dic.

Possible values and associated data types are (see previous tutorials for description of these data types):

Group: creates an HDF5 group data item
2DImage, 3DImage, or Image: creates an empty Image group
2DMesh, 3DMesh, or Mesh: creates an empty Mesh group
data_array: creates an empty Data Array
field_array: creates an empty Field Array (its path must be a child of an Image or Mesh group)
string_array: creates an empty String Array
a numpy.dtype or a tables.IsDescription class: creates an empty Structured Table (see the tutorial on basic data items)

This dictionary should share the same keys as minimal_content_index_dic, and should look like this:

minimal_content_type_dic = {'item1': '3DMesh',
                            'item2': 'field_array',
                            'item3': 'data_array',
                            'item4':  array_np.dtype,
                            '...': '...',}

In this case, the first item would be created as a Mesh Group, the second as a field data item stored in this mesh, the third as a data array attached to the Root Group, and the last as a Structured Table attached to the Mesh Group.
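The constraints stated in this section (identical keys, known type names, field arrays stored under an Image or Mesh group) can be sketched as a small consistency check. This is plain Python that does not call SampleData. The helper check_data_model and the TABLE_DESCRIPTION placeholder are hypothetical names, and any non-string type is assumed to be a table description such as a numpy.dtype.

```python
# Type names listed in this tutorial.
GROUP_TYPES = {'Group', '2DImage', '3DImage', 'Image',
               '2DMesh', '3DMesh', 'Mesh'}
ARRAY_TYPES = {'data_array', 'field_array', 'string_array'}
FIELD_PARENTS = GROUP_TYPES - {'Group'}  # Image and Mesh groups only


def check_data_model(index_dic, type_dic):
    """Return a list of human-readable problems found in a data model."""
    problems = []
    if set(index_dic) != set(type_dic):
        mismatched = sorted(set(index_dic) ^ set(type_dic))
        problems.append('keys differ: ' + ', '.join(mismatched))
    path_to_name = {path: name for name, path in index_dic.items()}
    for name, item_type in type_dic.items():
        if not isinstance(item_type, str):
            continue  # assumed to be a table description
        if item_type not in GROUP_TYPES | ARRAY_TYPES:
            problems.append(f'{name}: unknown type {item_type!r}')
        if item_type == 'field_array' and name in index_dic:
            parent_path = index_dic[name].rsplit('/', 1)[0]
            parent_type = type_dic.get(path_to_name.get(parent_path))
            if parent_type not in FIELD_PARENTS:
                problems.append(f'{name}: parent is not an Image or Mesh')
    return problems


TABLE_DESCRIPTION = object()  # hypothetical stand-in for a numpy.dtype

index_dic = {'item1': '/path_to_item1',
             'item2': '/path_to_item1/path_to_item2',
             'item3': '/path_to_item3',
             'item4': '/path_to_item1/path_to_item4'}
type_dic = {'item1': '3DMesh',
            'item2': 'field_array',
            'item3': 'data_array',
            'item4': TABLE_DESCRIPTION}
ok_problems = check_data_model(index_dic, type_dic)

# Declaring item3 as a field array is flagged: it hangs from the Root Group.
bad_problems = check_data_model(index_dic, dict(type_dic, item3='field_array'))
print(bad_problems)
```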

These two dictionaries are returned by the minimal_data_model method of the SampleData class. They are used during the initialization of the dataset object to create the prescribed data model and populate it with empty objects that have the right names and organization. This makes it possible to predefine the set of names and paths that form the data model shared by all objects created by the class.

To create a dataset class with the above data model, its implementation should thus look like this at this stage:

class MyDatasets(SampleData):
    """Example of SampleData derived class.

       This is how to implement a class of datasets with a custom data model.
    """

    def minimal_data_model(self):

        minimal_content_index_dic = {'item1': '/path_to_item1',
                                     'item2': '/path_to_item1/path_to_item2',
                                     'item3': '/path_to_item3',
                                     'item4': '/path_to_item1/path_to_item4'}
        minimal_content_type_dic = {'item1': '3DMesh',
                                    'item2': 'field_array',
                                    'item3': 'data_array',
                                    'item4':  array_np.dtype}

        return minimal_content_index_dic, minimal_content_type_dic

These dictionaries are labeled as the minimal data model, as they only prescribe the data items and organization that will be generated in each dataset created by the subclass. The user is free to enrich the datasets with any additional data item (see the previous tutorial to learn how to do it).

To sum up, to create an interface for creating and interacting with datasets with a prescribed data model, you have to:

Implement a new class, inherited from SampleData
Override the minimal_data_model method and write your data model in the two dictionaries returned by this method

You will then get a class derived from SampleData (hence with all its methods and features), that creates datasets with this prescribed data model.

Custom initialization

The other mechanism that is important for designing subclasses of SampleData is the specification of initialization commands that are run each time a dataset is opened. These operations can include, for instance, the definition of class attributes, prints, sanity checks, etc. The _after_file_open method of the SampleData class has been designed to this end. It is called by the class constructor after opening the HDF5 dataset and loading the dataset Index and data tree in the class instance.

To create your custom dataset initialization routine, you can hence override the _after_file_open method in your derived class, and implement your initialization procedure. For instance, if you want your class to warn the user that a data item is empty in the dataset, you could implement your class as follows:

class MyDatasets(SampleData):
    """Example of SampleData derived class.

       This is how to implement a class of datasets with a custom data model
       and initialization procedure.
    """

    def minimal_data_model(self):
        """Define data model of MyDatasets class."""
        minimal_content_index_dic = {'item1': '/path_to_item1',
                                     'item2': '/path_to_item1/path_to_item2',
                                     'item3': '/path_to_item3',
                                     'item4': '/path_to_item1/path_to_item4'}
        minimal_content_type_dic = {'item1': '3DMesh',
                                    'item2': 'field_array',
                                    'item3': 'data_array',
                                    'item4':  array_np.dtype}

        return minimal_content_index_dic, minimal_content_type_dic

    def _after_file_open(self):
        """Initialization procedure for MyDatasets."""

        if self._is_empty('item3'):
            print('Warning: data array "item3" is empty in the dataset !')
        else:
            print('"item3" is not empty !')
        return
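The way the two overridable methods cooperate can be illustrated with a toy base class. The sketch below is not the pymicro implementation: ToyDataset and ToyMyDatasets are hypothetical names, a plain dictionary stands in for the HDF5 file, and the constructor is assumed to build the data model first and to call the _after_file_open hook last.

```python
class ToyDataset:
    """Toy stand-in for SampleData (NOT the pymicro implementation)."""

    def __init__(self, name):
        self.name = name
        self.items = {}  # path -> type, plays the role of the HDF5 file
        index_dic, type_dic = self.minimal_data_model()
        # Create parents before children by sorting on path depth.
        for key in sorted(index_dic, key=lambda k: index_dic[k].count('/')):
            self.items[index_dic[key]] = type_dic[key]
        # Hook called last, once the data model exists.
        self._after_file_open()

    def minimal_data_model(self):
        """The base class prescribes nothing: datasets start empty."""
        return {}, {}

    def _after_file_open(self):
        """The base class has no custom initialization."""


class ToyMyDatasets(ToyDataset):
    """Subclass overriding both the data model and the hook."""

    def minimal_data_model(self):
        index_dic = {'item2': '/path_to_item1/path_to_item2',
                     'item1': '/path_to_item1',
                     'item3': '/path_to_item3'}
        type_dic = {'item1': '3DMesh',
                    'item2': 'field_array',
                    'item3': 'data_array'}
        return index_dic, type_dic

    def _after_file_open(self):
        # The prescribed items are already present when the hook runs.
        self.n_items = len(self.items)


empty = ToyDataset('empty')
custom = ToyMyDatasets('custom')
print(list(custom.items))
```

The base class stays generic, while each subclass only states what differs: the content of its data model and the code to run once the dataset is ready.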

II - A practical example: The Microstructure Class

The Microstructure class has been designed to build datasets representing polycrystalline material samples. The Microstructure class also offers many application specific methods to interact with polycrystalline materials datasets, that are detailed in dedicated pages of this User’s guide.

Following the principles detailed in the previous section, the Microstructure class is implemented as a subclass of the SampleData class:

class Microstructure(SampleData):

Let us review its prescribed data model and initialization procedure to use it as a practical example of custom data model creation.

Class data model

The code of the minimal_data_model method of the Microstructure class is replicated below:

def minimal_data_model(self):
    """Data model for a polycrystalline microstructure.

    Specify the minimal contents of the hdf5 (Group names, paths and group
    types) in the form of a dictionary {content: location}. This extends
    `~pymicro.core.SampleData.minimal_data_model` method.

    :return: a tuple containing the two dictionnaries.
    """
    minimal_content_index_dic = {'Image_data': '/CellData',
                                 'grain_map': '/CellData/grain_map',
                                 'phase_map': '/CellData/phase_map',
                                 'mask': '/CellData/mask',
                                 'Mesh_data': '/MeshData',
                                 'Grain_data': '/GrainData',
                                 'GrainDataTable': ('/GrainData/'
                                                    'GrainDataTable'),
                                 'Phase_data': '/PhaseData'}
    minimal_content_type_dic = {'Image_data': '3DImage',
                                'grain_map': 'field_array',
                                'phase_map': 'field_array',
                                'mask': 'field_array',
                                'Mesh_data': 'Mesh',
                                'Grain_data': 'Group',
                                'GrainDataTable': GrainData,
                                'Phase_data': 'Group'}
    return minimal_content_index_dic, minimal_content_type_dic

Datasets initialization

The _after_file_open method of the Microstructure class is composed of the following lines of code:

def _after_file_open(self):
    """Initialization code to run after opening a Sample Data file."""
    self.grains = self.get_node('GrainDataTable')
    if self._file_exist:
        self.active_grain_map = self.get_attribute('active_grain_map',
                                                   'CellData')
        if self.active_grain_map is None:
            self.set_active_grain_map()
        self._init_phase(phase)
        if not hasattr(self, 'active_phase_id'):
            self.active_phase_id = 1
    else:
        self.set_active_grain_map()
        self._init_phase(phase)
        self.active_phase_id = 1
    return

When a dataset is opened, a class attribute grains is associated with the Structured Array node GrainDataTable. This grains attribute is used by many of the class methods. Hence, the _after_file_open method is used here to ensure that this attribute is properly associated with the GrainDataTable data item each time the dataset is opened. The class initialization also executes the _init_phase and set_active_grain_map methods, which serve a similar purpose for other data items.

Creating a Microstructure dataset

To conclude this tutorial, we will create a Microstructure object and look at its content. The class constructor arguments are similar to those of the SampleData class.

# import the Microstructure class
from pymicro.crystal.microstructure import Microstructure

# create a microstructure dataset
micro = Microstructure(filename='test_microstructure', autodelete=True)

# print the content of the microstructure dataset
print(micro)

# print class attributes that are initialized by the _after_file_open method
print('The Grain object has been initialized:')
print(micro.grains)

# close the dataset
del micro

Adding empty field /CellData/grain_map to mesh group /CellData
Adding empty field /CellData/phase_map to mesh group /CellData
Adding empty field /CellData/mask to mesh group /CellData
new phase added: unknown
Microstructure
* name: micro
* lattice: Lattice (Symmetry.cubic) a=1.000, b=1.000, c=1.000 alpha=90.0, beta=90.0, gamma=90.0

Dataset Content Index :
------------------------:
index printed with max depth `3` and under local root `/`

         Name : Image_data                                H5_Path : /CellData
         Name : Mesh_data                                 H5_Path : /MeshData
         Name : Grain_data                                H5_Path : /GrainData
         Name : Phase_data                                H5_Path : /PhaseData
         Name : grain_map                                 H5_Path : /CellData/grain_map
         Name : Image_data_Field_index                    H5_Path : /CellData/Field_index
         Name : phase_map                                 H5_Path : /CellData/phase_map
         Name : mask                                      H5_Path : /CellData/mask
         Name : GrainDataTable                            H5_Path : /GrainData/GrainDataTable
         Name : phase_01                                  H5_Path : /PhaseData/phase_01

Printing dataset content with max depth 3
  |--GROUP CellData: /CellData (emptyImage)
     --NODE Field_index: /CellData/Field_index (string_array - empty) (   63.999 Kb)
     --NODE grain_map: /CellData/grain_map (field_array - empty) (   64.000 Kb)
     --NODE mask: /CellData/mask (field_array - empty) (   64.000 Kb)
     --NODE phase_map: /CellData/phase_map (field_array - empty) (   64.000 Kb)

  |--GROUP GrainData: /GrainData (Group)
     --NODE GrainDataTable: /GrainData/GrainDataTable (structured_array - empty) (    0.000 bytes)

  |--GROUP MeshData: /MeshData (emptyMesh)
  |--GROUP PhaseData: /PhaseData (Group)
    |--GROUP phase_01: /PhaseData/phase_01 (Group)


The Grain object has been initialized:
/GrainData/GrainDataTable (Table(0,)) ''
Microstructure Autodelete:
 Removing hdf5 file test_microstructure.h5

The dataset has indeed been created with a content that conforms to the data model prescribed by the minimal_data_model method of the Microstructure class. Each of these items corresponds to data used systematically to study a polycrystalline material sample. In this context, the implementation of the data model serves the following purposes:

it can be used as a standard data model for polycrystalline data sets, thus promoting data exchange and interoperability
it makes it possible to implement a high level interface with reduced complexity to interact with these data items

The interface is provided by the Microstructure class, which allows you to perform data processing operations that are frequently used in material science on polycrystalline datasets. In addition, the pre-existing data model facilitates the implementation of new processing functionalities within the class. This is illustrated here by the grains class attribute, which has been associated with the GrainDataTable data item in the dataset, as shown by the printed information above. This attribute provides an accessible and explicit object to get information and apply processing on the data describing the grains of the microstructure represented by the dataset.

This concludes this short tutorial on creating custom data models with SampleData. The features and use of the Microstructure class are detailed in a dedicated part of this User Guide.