.. _data_management_label:

Scientific Data Management with Pymicro
========================================

One of Pymicro's main goals is to help materials scientists handle the large and varied data generated when studying material samples with simulation or characterization techniques. To this end, a specific data platform has been designed and implemented in the ``pymicro.core`` package. It provides a **data model**, a **file format**, and an interface to create, store, and work with material science datasets.

This page presents the different concepts and objects used for data management in Pymicro. It is followed by a step-by-step tutorial to master its practical use.

Pymicro's Data Platform
-----------------------

The ``pymicro.core`` module has been designed to be an independent module. Its aim is to serve as a prototype of what could constitute a generic data platform for the material science community. It is centered on a main *class*, the ``SampleData`` class.

The ``SampleData`` class is designed to handle all data modalities relevant to material science, to bring them together and organize them into multimodal datasets, and to reduce the technical complexity of data management for the end user. The class provides two essential features:

1. A file format and a data model for these multimodal datasets
2. A high-level interface to interact with datasets

As its name suggests, it is designed to associate a dataset with each studied material sample. Each ``SampleData`` class instance is associated with one dataset file in memory. The class methods then provide the interface to interact with the data contained in the dataset. The interface makes it possible to:

* Add, remove, and organize data of various types, formats, and shapes in the same dataset.
* Add, remove, and organize light metadata for all elements of a dataset.
* Get simple or detailed information on the dataset content.
* Visualize in 3D spatially organized data (images, maps, measured or simulated fields).
* Apply efficient and distinct compression options to each element of a dataset.
* Automate the interface between the datasets and data processing or numerical simulation tools.
* Derive classes from ``SampleData`` that are specialized for a specific type of material sample, such as the ``Microstructure`` class of the ``pymicro.crystal.microstructure`` module.

File format
-----------

HDF5
^^^^

``SampleData`` associates an **HDF5 dataset file** with each class instance. `HDF5 `_ is a library and file format designed to store large data, and it supports an unlimited variety of datatypes. It is a hierarchical format: multiple data items can be gathered and linked together, creating an internal hierarchy that replicates a file system within a single file. Text metadata, called *HDF5 attributes*, can also be attached to every data element in the file.

The interface with HDF5 files implemented in ``SampleData`` is built on top of the `Pytables `_ package, which extends the HDF5 data model and provides efficient data compression and I/O tools.

XDMF
^^^^

In addition, the ``SampleData`` class can produce an XDMF file associated with the HDF5 dataset. `XDMF `_ is an XML-based file format that provides a data model for arrays representing spatially organized data, *i.e.* 2D or 3D data (fields). In practice, it is an XML file containing, for each grid supporting the fields, a description of its topology, geometry, and dimensions. It also describes the fields defined on the grid (type, dimension), and refers to data arrays stored in HDF5 data files to define their values. **It can be seen as a light metadata file allowing software to interpret the raw binary data stored in an HDF5 file**.
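To make the role of the XDMF file more concrete, the sketch below builds such a description with Python's standard ``xml.etree.ElementTree`` module. It is only an illustration written for this page, not the writer used by ``SampleData``: the function ``build_xdmf``, its arguments, the grid name, and the field name are all hypothetical, and the element and attribute names are the generic ones from the XDMF format.

```python
import xml.etree.ElementTree as ET


def build_xdmf(h5_file, node_path, cells, field_name):
    """Describe one scalar field on a regular 3D grid (illustrative helper).

    `cells` holds the number of cells along each axis, in the same order
    as the shape of the array stored in the HDF5 node (an assumption).
    """
    xdmf = ET.Element("Xdmf", Version="2.2")
    domain = ET.SubElement(xdmf, "Domain")
    grid = ET.SubElement(domain, "Grid", Name="image", GridType="Uniform")

    # Topology: a regular grid is sized by its number of nodes (cells + 1).
    node_dims = " ".join(str(n + 1) for n in cells)
    ET.SubElement(grid, "Topology", TopologyType="3DCoRectMesh",
                  Dimensions=node_dims)

    # Geometry: origin of the grid, then spacing along each axis.
    geometry = ET.SubElement(grid, "Geometry", GeometryType="ORIGIN_DXDYDZ")
    for values in ("0 0 0", "1 1 1"):
        item = ET.SubElement(geometry, "DataItem", Format="XML", Dimensions="3")
        item.text = values

    # Field: the values are not copied, only a reference to the HDF5 node.
    attribute = ET.SubElement(grid, "Attribute", Name=field_name,
                              AttributeType="Scalar", Center="Cell")
    data = ET.SubElement(attribute, "DataItem", Format="HDF",
                         Dimensions=" ".join(str(n) for n in cells))
    data.text = f"{h5_file}:{node_path}"
    return xdmf


# Hypothetical file, node, and field names, chosen for illustration only.
root = build_xdmf("test.hdf", "/images/0/image", (400, 300, 300), "field")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The point of the sketch is that the XDMF file carries no field values itself: it only records the grid description and the ``file:/node/path`` reference that a reader needs to locate the array in the HDF5 file.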
This XDMF/HDF5 file pair is a data format used by data processing and numerical simulation software that is widely used by the material science community (`Fenics `_, `DAMASK `_, `Dream3D `_...). HDF5/XDMF file pairs can be loaded by the powerful and popular free 3D visualization software `Paraview `_. **This is a practical way to visualize 2D or 3D data stored in a Pymicro HDF5 dataset.**

You will find below an example of an XDMF file associated with an HDF5 file:

.. code-block:: xml

    0 0 0
    1 1 1
    test.hdf:/images/0/image

This file pair contains a 3D image (regular grid) of 400x300x300 voxels, which supports a scalar field ``second 0``, stored in the node ``/images/0/image`` of the HDF5 file ``test.hdf5``.

The ``SampleData`` class handles the automatic generation of such files to describe the HDF5 datasets, so that they can be visualized in 3D with Paraview.

Data Model
----------

``SampleData`` datasets comply with a specific **data model**. A data model is the set of conventions for storing and reading data that must be known to make sense of it. For instance, a three-dimensional array of floating point values may be the X, Y, Z geometry of a grid, or calculated vector values. Without a data model, it is impossible to tell the difference.

The data model implemented within the ``SampleData`` class provides several **data items** that can be used to store the various types of data that must be gathered into material science datasets. These items are briefly presented here. Specific notebooks will follow to detail the features and interface for each type of item.

HDF5 Data Model
^^^^^^^^^^^^^^^

Pymicro's data model is built on top of the HDF5 data model. The latter is based on the following items:

* **Groups**: a data structure that can be linked to other groups or datasets, and is used to organize data objects. They can be seen as the 'directories' of an HDF5 dataset. Every HDF5 file contains a root group that can contain other groups.
* **Data arrays** or **Nodes**: arrays of data that can have different types/shapes, and are attached to a Group.
* **Attributes**: Name/Value pairs, just like in a Python dictionary. Values can take any form as long as they remain small (large arrays must not be stored as attributes). This is the way to store metadata in HDF5 files. Both Groups and Datasets can hold as many Attributes as required.

Pymicro's Data Model
^^^^^^^^^^^^^^^^^^^^

The ``SampleData`` data model introduces two particular types of HDF5 Groups, dedicated to the representation of spatially organized data. Such data consist of geometrical grids supporting data arrays that represent field values. These grids can be regular grids (images), or have more complex topologies (meshes). These two situations correspond to the two Group types in the data model:

* **Image Groups** are HDF5 groups designed to store data describing 2D or 3D images, *i.e.* regular grids supporting scalar or tensorial fields. They are used, for instance, to store data coming from SEM/EBSD or X-ray tomography imaging, or FFT-based simulation results. Three sub-types of image groups are implemented:

  - ``2DImage`` groups: two-dimensional images, *i.e.* grids of :math:`(N_x,N_y)` pixels
  - ``3DImage`` groups: three-dimensional images, *i.e.* grids of :math:`(N_x,N_y,N_z)` voxels
  - ``emptyImage`` groups: image groups that do not support any data or topology yet

* **Mesh Groups** are HDF5 groups designed to store data describing 2D or 3D meshes, *i.e.* grids described by node coordinates and element connectivities, supporting scalar or tensorial fields. They can be used to store data coming from finite element simulations, CAD designs, etc. Three sub-types of mesh groups are implemented:

  - ``2DMesh`` groups: two-dimensional meshes, *i.e.* grids of nodes defined by their :math:`(X_i,Y_i)` coordinate pairs.
  - ``3DMesh`` groups: three-dimensional meshes, *i.e.* grids of nodes defined by their :math:`(X_i,Y_i,Z_i)` coordinate triplets.
  - ``emptyMesh`` groups: mesh groups that do not support any data or topology yet.

The data model also introduces two additional types of HDF5 **Nodes**:

* **Structured Tables** are heterogeneous two-dimensional data arrays, *i.e.* arrays that may contain data of different types (integers, floats, strings, sub-arrays...) within the same row, all rows having the same format. Each column of those arrays can have a specific name. Those arrays are equivalent to `Numpy structured arrays `_.
* **Fields** are specific data arrays that must belong to a grid group (Image or Mesh Group), and whose shape and dimensions must comply with the grid topology. They are used to store and manipulate spatially organized arrays that represent mechanical fields (for instance, displacement or temperature fields, microstructure phase maps, EBSD orientation maps...).

The specific data objects introduced by ``SampleData`` all have a specific data model, associated metadata, and a specific interface. Each of them is the subject of a dedicated Notebook in this User Guide.

Data Model example
^^^^^^^^^^^^^^^^^^

To illustrate what ``SampleData`` datasets may look like, a virtual example of a polycrystalline material sample dataset is represented in the schematic diagram below:

.. image:: ./Images/Introduction/DataModel-figure1.png

Groups are used to organize data into coherent categories. Metadata are represented in green, and datasets in red. As shown in the diagram, Groups can be dedicated solely to organizing metadata, for instance to document material nature, composition, elaboration process, or the experimental set-up used for imaging and mechanical tests.


* The *Macro Data* and *Statistics* Groups will typically contain simple *array* datasets or *structured tables*, to store mechanical test outputs (loading curves) or statistics on the microstructure geometry or mechanical state.
* The *Heavy Data* group will typically contain the spatially organized data coming from 3D or 2D in-situ imaging techniques and simulation software. It will most likely contain *Image or Mesh Groups*, which in turn contain *Fields*.
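
The layout described above can be sketched directly with the PyTables package (imported as ``tables``) and NumPy, both assumed to be installed. This is a minimal illustration of the underlying HDF5 items (groups, attributes, a structured table, a data array) and deliberately does not use the ``SampleData`` interface. The file name, group names, attribute value, and array contents are all hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical dataset file, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.h5")

with tables.open_file(path, mode="w") as h5:
    # A Group used only to carry metadata, stored as HDF5 attributes.
    material = h5.create_group("/", "Material")
    material._v_attrs.composition = "hypothetical alloy"

    # A structured table: named columns, every row with the same format.
    curve = np.zeros(3, dtype=[("strain", "f8"), ("stress", "f8")])
    curve["strain"] = [0.0, 0.01, 0.02]
    macro = h5.create_group("/", "MacroData")
    h5.create_table(macro, "loading_curve", obj=curve)

    # Heavy data: a small 3D array standing in for a field of an image group.
    heavy = h5.create_group("/", "HeavyData")
    image = h5.create_group(heavy, "image")
    h5.create_array(image, "phase_map",
                    obj=np.zeros((4, 3, 3), dtype=np.uint8))

with tables.open_file(path, mode="r") as h5:
    # Walk the hierarchy, then read back one item of each kind.
    node_paths = [node._v_pathname for node in h5.walk_nodes("/")]
    composition = h5.get_node("/Material")._v_attrs.composition
    column_names = h5.get_node("/MacroData/loading_curve").colnames
    field_shape = h5.get_node("/HeavyData/image/phase_map").shape

print(node_paths)
```

In this sketch nothing ties the array to a grid topology. That link between a field and its image or mesh group is precisely what the *Image Groups*, *Mesh Groups*, and *Fields* of the ``SampleData`` data model add on top of plain HDF5.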