Pymicro’s Datasets

This first tutorial will introduce you to the creation and deletion of Pymicro’s datasets.

I - Create and Open datasets with the SampleData class

In this first section, we will see how to create SampleData datasets, or open pre-existing ones. These two operations are performed by instantiating a SampleData class object.

Before that, you will need to import the SampleData class. We will import it with the alias name SD, by executing:

Import SampleData and get help

[1]:
from pymicro.core.samples import SampleData as SD

Before starting to create our datasets, we will take a look at the SampleData class documentation, to discover the arguments of the class constructor. You can read it on the pymicro.core package API doc page, or print it interactively by executing:

>>> help(SD)

or, if you are working with a Jupyter notebook, by executing the magic command:

>>> ?SD

Do not hesitate to systematically use the ``help`` function or the ``?`` magic command to get information on a method when you encounter it for the first time. All SampleData methods are documented with explanatory docstrings that detail the method arguments and return values.

Dataset creation

The class docstring is divided in multiple rubrics, one of them giving the list of the class constructor arguments. Let us review them one by one.

  • filename: basename of the HDF5/XDMF pair of files of the dataset

This is the first and only mandatory argument of the class constructor. If this string corresponds to an existing file, the SampleData class will open this file and create a class instance to interact with the already existing dataset. If the filename does not correspond to an existing file, the class will create a new dataset, which is what we want to do here.
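The open-or-create rule described above can be pictured with the standard library alone. The sketch below is only an illustration: `resolve_dataset_action` is a hypothetical helper, not a Pymicro function, and it assumes that the dataset is identified by a `<basename>.h5` file.

```python
import tempfile
from pathlib import Path


def resolve_dataset_action(basename, directory="."):
    """Return 'open' if <basename>.h5 exists in directory, else 'create'.

    Hypothetical helper for illustration only; it is not part of Pymicro.
    """
    h5_path = Path(directory) / f"{basename}.h5"
    return "open" if h5_path.is_file() else "create"


with tempfile.TemporaryDirectory() as tmp:
    first = resolve_dataset_action("my_first_dataset", tmp)   # no file yet
    (Path(tmp) / "my_first_dataset.h5").touch()                # simulate a dataset file
    second = resolve_dataset_action("my_first_dataset", tmp)  # file now exists

print(first, second)
```

The same basename therefore leads to two different outcomes depending on what is already on disk, which is worth keeping in mind when reusing names.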

Let us create a SampleData dataset:

[2]:
data = SD(filename='my_first_dataset')

That is it. The class has created a new HDF5/XDMF pair of files, and associated the interface with this dataset to the variable data. No message has been returned by the code, so how can we know that the dataset has been created?

When the name of the file is not an absolute path, the default behavior of the class is to create the dataset in the current working directory. Let us print the content of this directory!

[3]:
import os # load python module to interact with operating system
cwd = os.getcwd() # get current directory
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only HDF5 files
print('Our dataset files:')
for file in file_list:
    if file.endswith('.h5'):
        print(file)
['EBSD Graph.ipynb', 'Polycrystalline_Datasets.rst', 'Phases.ipynb', 'Get_Datasets_Information.ipynb', 'Mesh_data.ipynb', 'Cell_Data.ipynb', 'Grain_Data.ipynb', 'Custom_Data_Models.ipynb', 'Datasets_Files.ipynb', 'Reference_Sheet_Datasets.ipynb', 'Image_data.ipynb', 'Data_Items.ipynb', 'Images', 'Data_compression.ipynb', 'my_first_dataset.h5', 'Microstructure_class.ipynb', 'Data_Management.rst', 'Data_Management_tutorial.rst']

Our dataset files:
my_first_dataset.h5

The file my_first_dataset.h5 has indeed been created. If you want printed feedback about the dataset creation, you can set the verbose argument of the constructor to True. This activates the verbose mode of the class, in which the class instance prints a lot of information about what it is doing. On an existing instance, this flag can be set with the set_verbosity method:

[4]:
data.set_verbosity(True)

Let us now close our dataset, and see if the class instance prints information about it:

[5]:
del data

Deleting DataSample object
.... writing xdmf file : my_first_dataset.xdmf

.... building xdmf tree
.... Storing content index in my_first_dataset.h5:/Index attributes
.... flushing data in file my_first_dataset.h5
File my_first_dataset.h5 synchronized with in memory data tree

Dataset and Datafiles closed

Note

It is good practice to always delete your SampleData instances once you are done working with a dataset, or if you want to re-open it. As the class instance keeps the files open for as long as it exists, deleting it ensures that the files are properly closed. Otherwise, files may be closed at unpredictable times or stay open, and you may encounter undesired behavior of your datasets.

The class indeed prints some messages during the instance destruction. As you can see, the class instance writes the data that it stores into the HDF5 file, and then closes the dataset instance and the files.
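One caveat about `del` is worth a short sketch. In CPython, `del name` only removes one reference; the destructor runs when the last reference to the object disappears. The toy class below is a hypothetical stand-in for a file-backed object (it is not SampleData), used only to show that a second variable pointing to the same instance keeps it alive.

```python
class ToyDataset:
    """Hypothetical stand-in for a file-backed object; not the SampleData class."""

    closed_names = []  # records which instances ran their destructor

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # A real class would flush and close its files here.
        ToyDataset.closed_names.append(self.name)


data = ToyDataset("my_first_dataset")
alias = data          # second reference to the same object

del data              # removes one name only; the object is still alive
closed_after_first_del = list(ToyDataset.closed_names)

del alias             # last reference gone: CPython runs __del__ now
closed_after_second_del = list(ToyDataset.closed_names)

print(closed_after_first_del, closed_after_second_del)
```

In a notebook, a reference may also be held by the output history if the instance was displayed as a cell result, so it is safer not to leave an instance as the last expression of a cell.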

Dataset opening and verbose mode

Let us now try to create a new SD instance for the same dataset file "my_first_dataset". As the HDF5 dataset already exists, this new SampleData instance will open it and synchronize with it. With the verbose mode activated, SampleData class instances display messages about the actions performed by the class (creating or deleting data items, for instance).

[6]:
data = SD(filename='my_first_dataset', verbose=True)
-- Opening file "my_first_dataset.h5"

Data model initialization....

Data model initialization done


**** FILE CONTENT ****
Printing dataset content with max depth 2

****** DATA SET CONTENT ******
 -- File: my_first_dataset.h5
 -- Size:     3.453 Kb
 -- Data Model Class: SampleData

 GROUP /
=====================
 -- Parent Group : None -- Root Group
 -- Group attributes :
         * description :
         * sample_name :
 -- Childrens : Index,
----------------
************************************************


.... Storing content index in my_first_dataset.h5:/Index attributes
.... flushing data in file my_first_dataset.h5
File my_first_dataset.h5 synchronized with in memory data tree

You can see that the printed information states that the dataset file my_first_dataset.h5 has been opened, and not created, because we provided a filename that already existed to the class constructor.

Some information about the dataset content is also printed by the class in verbose mode. This information can be retrieved with specific methods that will be detailed in the next section of this Notebook. Let us focus for now on one part of it.

The printed info reveals that our dataset content is composed only of one data item, a Group data object named /.

This group is the Root Group of the dataset. Each dataset necessarily has a Root Group, automatically created along with the dataset. You can see that this Group has no parent group, and already has a child, named Index. This particular data object will be presented in the third section of this Notebook. You can also observe that the Root Group already has attributes (recall from the introduction Notebook that they are Name/Value pairs used to store metadata in datasets). Two of those attributes match arguments of the SampleData class constructor:

  • the description attribute

  • the sample_name attribute

The description and sample_name attributes are not modified when an existing dataset is opened: these SD constructor arguments are only used when creating a dataset. They are string metadata whose role is to give a general name/title to the dataset, and a general description. However, they can be set to a new value after the dataset creation with the methods set_sample_name and set_description, used a little further in this Notebook.

Now we know how to open a dataset previously created with SampleData. We may also want to create a new dataset under the name of an already existing one, overwriting it. The SampleData constructor allows this, as we will see in the next subsection. But first, we will close our dataset again:

[7]:
del data

Deleting DataSample object
.... writing xdmf file : my_first_dataset.xdmf

.... building xdmf tree
.... Storing content index in my_first_dataset.h5:/Index attributes
.... flushing data in file my_first_dataset.h5
File my_first_dataset.h5 synchronized with in memory data tree

Dataset and Datafiles closed

Overwriting datasets

If the overwrite_hdf5 argument of the class constructor is set to True and the dataset designated by filename already exists, the class removes it and creates a new, empty one:

[8]:
data = SD(filename='my_first_dataset',  verbose=True, overwrite_hdf5=True)

-- File "/home/docs/checkouts/readthedocs.org/user_builds/pymicro/checkouts/latest/examples/UserGuide/my_first_dataset.h5" exists  and will be overwritten

-- File "my_first_dataset.h5" not found : file created

Data model initialization....

Data model initialization done


**** FILE CONTENT ****
Printing dataset content with max depth 2

****** DATA SET CONTENT ******
 -- File: my_first_dataset.h5
 -- Size:    96.000 bytes
 -- Data Model Class: SampleData

 GROUP /
=====================
 -- Parent Group : None -- Root Group
 -- Group attributes :
 -- Childrens : Index,
----------------
************************************************


.... Storing content index in my_first_dataset.h5:/Index attributes
.... flushing data in file my_first_dataset.h5
File my_first_dataset.h5 synchronized with in memory data tree

As you can see, the dataset files have been overwritten, as requested. We will now close our dataset again and continue to see the possibilities offered by the class constructor.

[9]:
del data

Deleting DataSample object
.... writing xdmf file : my_first_dataset.xdmf

.... building xdmf tree
.... Storing content index in my_first_dataset.h5:/Index attributes
.... flushing data in file my_first_dataset.h5
File my_first_dataset.h5 synchronized with in memory data tree

Dataset and Datafiles closed

Our dataset is now closed and we can move on to other ways to create and remove datasets.

So far, no mechanism is implemented in the class to protect datasets from being overwritten. Be careful with your data when using this functionality!
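Since the class offers no protection, one possible safeguard is to back up the files yourself before requesting an overwrite. The sketch below is only one way to do it: `backup_before_overwrite` is a hypothetical helper, not a Pymicro function, and the `.h5`/`.xdmf` extensions and the `.bak` suffix are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def backup_before_overwrite(basename, directory=".", suffix=".bak"):
    """Copy <basename>.h5 and <basename>.xdmf to *.bak files if they exist.

    Hypothetical helper, not part of Pymicro. Returns the backup file names.
    """
    backups = []
    for ext in (".h5", ".xdmf"):
        src = Path(directory) / f"{basename}{ext}"
        if src.is_file():
            dst = src.with_name(src.name + suffix)
            shutil.copy2(src, dst)  # copy content and file metadata
            backups.append(dst.name)
    return backups


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my_first_dataset.h5").write_bytes(b"valuable data")
    made = backup_before_overwrite("my_first_dataset", tmp)
    restored = (Path(tmp) / "my_first_dataset.h5.bak").read_bytes()

print(made, restored)
```

Such a helper would be called just before creating an instance with overwrite_hdf5=True, so that the previous content can still be recovered from the backup files.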

Copying datasets

One last thing that may be interesting to do with already existing dataset files is to create a new dataset that is a copy of them, associated with a new class instance. This is useful, for instance, when you want to try new processing on a set of valuable data without risking damage to the original.

To do this, you may use the copy_sample method of the SampleData class. Its main arguments are:

  • src_sample_file: basename of the dataset files to copy (source file)

  • dst_sample_file: basename of the dataset to create as a copy of the source (destination file)

  • get_object: if False, the method will just create the new dataset files and close them. If True, the method will leave the files open and return a SampleData instance that you may use to interact with your new dataset.

Let us try to create a copy of our first dataset:

[10]:
data2 = SD.copy_sample(src_sample_file='my_first_dataset', dst_sample_file='dataset_copy', get_object=True)
[11]:
cwd = os.getcwd() # get current directory
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only files that start with our dataset basename
print('Our dataset files:')
for file in file_list:
    if file.startswith('dataset_copy'):
        print(file)
['EBSD Graph.ipynb', 'Polycrystalline_Datasets.rst', 'Phases.ipynb', 'Get_Datasets_Information.ipynb', 'dataset_copy.h5', 'Mesh_data.ipynb', 'Cell_Data.ipynb', 'Grain_Data.ipynb', 'Custom_Data_Models.ipynb', 'Datasets_Files.ipynb', 'my_first_dataset.xdmf', 'Reference_Sheet_Datasets.ipynb', 'Image_data.ipynb', 'Data_Items.ipynb', 'Images', 'Data_compression.ipynb', 'my_first_dataset.h5', 'Microstructure_class.ipynb', 'Data_Management.rst', 'Data_Management_tutorial.rst']

Our dataset files:
dataset_copy.h5

The dataset_copy.h5 HDF5 file has indeed been created, and is a copy of my_first_dataset.h5.

Note that copy_sample is a static method, which can be called even without a SampleData instance. Note also that it has an overwrite argument, which allows overwriting an already existing dst_sample_file. Like the class constructor, it also has an autodelete argument, which we will discover in the next subsection.
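The static-method calling pattern and the role of an overwrite flag can be shown with a toy class. Everything below is hypothetical (`ToyStore`, `copy_pair` and its arguments are invented for the example); it performs a plain byte-for-byte copy of the file pair, whereas Pymicro's copy_sample may well do more than that.

```python
import shutil
import tempfile
from pathlib import Path


class ToyStore:
    """Hypothetical class used only to show the static-method calling pattern."""

    @staticmethod
    def copy_pair(src_basename, dst_basename, directory=".", overwrite=False):
        """Copy <src>.h5 / <src>.xdmf to <dst>.h5 / <dst>.xdmf; return new names."""
        created = []
        for ext in (".h5", ".xdmf"):
            src = Path(directory) / f"{src_basename}{ext}"
            dst = Path(directory) / f"{dst_basename}{ext}"
            if not src.is_file():
                continue  # skip a missing half of the pair
            if dst.exists() and not overwrite:
                raise FileExistsError(f"{dst.name} already exists")
            shutil.copyfile(src, dst)
            created.append(dst.name)
        return created


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my_first_dataset.h5").write_bytes(b"placeholder")
    (Path(tmp) / "my_first_dataset.xdmf").write_text("<Xdmf/>")
    # Called on the class itself: no instance is needed.
    copied = ToyStore.copy_pair("my_first_dataset", "dataset_copy", tmp)
    try:
        ToyStore.copy_pair("my_first_dataset", "dataset_copy", tmp)
        refused = False
    except FileExistsError:
        refused = True  # second copy refused because overwrite is False

print(copied, refused)
```

Refusing to overwrite by default is a design choice of this sketch; check the copy_sample docstring with ``help`` to see how the actual method behaves.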

Automatically removing dataset files

On some occasions, we may want to remove our dataset files after using our SampleData class instance. This can be the case, for instance, if you are trying some new data processing, or using the class for visualization purposes, and are not interested in keeping your test data.

The class has an autodelete attribute for this purpose. If it is set to True, the class destructor will remove the dataset file pair in addition to deleting the class instance. The class constructor and the copy_sample method also have an autodelete argument, which, if True, will automatically set the class instance autodelete attribute to True.
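The mechanism can be pictured with a toy class whose destructor removes its file when a flag is set. This is only a sketch of the idea: `ToyAutodelete` is hypothetical, handles a single file, and does not reproduce what the SampleData destructor actually does.

```python
import os
import tempfile


class ToyAutodelete:
    """Hypothetical file-backed object with an autodelete flag; not SampleData."""

    def __init__(self, path, autodelete=False):
        self.path = path
        self.autodelete = autodelete
        open(path, "a").close()  # create the file if it does not exist yet

    def __del__(self):
        # Remove the file on destruction only when the flag is set.
        if self.autodelete and os.path.exists(self.path):
            os.remove(self.path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset_copy.h5")
    toy = ToyAutodelete(path)
    toy.autodelete = True             # flag set after construction
    existed_before = os.path.exists(path)
    del toy                           # last reference: destructor removes the file
    exists_after = os.path.exists(path)

print(existed_before, exists_after)
```

The flag can be passed at construction or changed later on the instance; in both cases it is only read when the object is destroyed.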

To illustrate this feature, we will try to change the autodelete attribute of our copied dataset to True, and remove it.

[12]:
# set the autodelete attribute to True
data2.autodelete = True
# Set the verbose mode on for copied dataset
data2.set_verbosity(True)
[13]:
# Close copied dataset
del data2

Deleting DataSample object
.... writing xdmf file : dataset_copy.xdmf

.... building xdmf tree
.... Storing content index in dataset_copy.h5:/Index attributes
.... flushing data in file dataset_copy.h5
File dataset_copy.h5 synchronized with in memory data tree

Dataset and Datafiles closed
SampleData Autodelete:
 Removing hdf5 file dataset_copy.h5

In verbose mode, the class destructor ends by printing a message confirming the removal of the dataset file, as you can see in the cell above. Let us verify that the file has indeed been deleted:

[14]:
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only files that start with our dataset basename
print('Our copied dataset files:')
for file in file_list:
    if file.startswith('dataset_copy'):
        print(file)
['EBSD Graph.ipynb', 'Polycrystalline_Datasets.rst', 'Phases.ipynb', 'Get_Datasets_Information.ipynb', 'Mesh_data.ipynb', 'Cell_Data.ipynb', 'Grain_Data.ipynb', 'Custom_Data_Models.ipynb', 'Datasets_Files.ipynb', 'my_first_dataset.xdmf', 'Reference_Sheet_Datasets.ipynb', 'Image_data.ipynb', 'Data_Items.ipynb', 'Images', 'Data_compression.ipynb', 'my_first_dataset.h5', 'Microstructure_class.ipynb', 'Data_Management.rst', 'Data_Management_tutorial.rst']

Our copied dataset files:

As you can see, the dataset file has been removed. Now we can also open and remove our first created dataset using the class constructor autodelete option:

[15]:
data = SD(filename='my_first_dataset',  verbose=True, autodelete=True)

print(f'Is autodelete mode on ? {data.autodelete}')

del data
-- Opening file "my_first_dataset.h5"

Data model initialization....

Data model initialization done


**** FILE CONTENT ****
Printing dataset content with max depth 2

****** DATA SET CONTENT ******
 -- File: my_first_dataset.h5
 -- Size:     3.453 Kb
 -- Data Model Class: SampleData

 GROUP /
=====================
 -- Parent Group : None -- Root Group
 -- Group attributes :
         * description :
         * sample_name :
 -- Childrens : Index,
----------------
************************************************


.... Storing content index in my_first_dataset.h5:/Index attributes
.... flushing data in file my_first_dataset.h5
File my_first_dataset.h5 synchronized with in memory data tree
Is autodelete mode on ? True

Deleting DataSample object
.... writing xdmf file : my_first_dataset.xdmf

.... building xdmf tree
.... Storing content index in my_first_dataset.h5:/Index attributes
.... flushing data in file my_first_dataset.h5
File my_first_dataset.h5 synchronized with in memory data tree

Dataset and Datafiles closed
SampleData Autodelete:
 Removing hdf5 file my_first_dataset.h5
[16]:
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only files that start with our dataset basename
print('Our dataset files:')
for file in file_list:
    if file.startswith('my_first_dataset'):
        print(file)
['EBSD Graph.ipynb', 'Polycrystalline_Datasets.rst', 'Phases.ipynb', 'Get_Datasets_Information.ipynb', 'Mesh_data.ipynb', 'Cell_Data.ipynb', 'Grain_Data.ipynb', 'Custom_Data_Models.ipynb', 'Datasets_Files.ipynb', 'Reference_Sheet_Datasets.ipynb', 'Image_data.ipynb', 'Data_Items.ipynb', 'Images', 'Data_compression.ipynb', 'Microstructure_class.ipynb', 'Data_Management.rst', 'Data_Management_tutorial.rst']

Our dataset files:
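The directory checks used throughout this tutorial repeat the same listing loop. As a possible shortcut, they can be wrapped in a small helper based on pathlib. `dataset_files` is a hypothetical name, and the pattern `<basename>.*` is an assumption that matches the `.h5` and `.xdmf` files of one dataset without also matching other basenames that merely start with the same characters.

```python
import tempfile
from pathlib import Path


def dataset_files(basename, directory="."):
    """Return the sorted names of files named <basename>.<ext> in directory."""
    return sorted(p.name for p in Path(directory).glob(f"{basename}.*"))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("my_first_dataset.h5", "my_first_dataset.xdmf", "Phases.ipynb"):
        (Path(tmp) / name).touch()      # placeholder files for the demonstration
    found = dataset_files("my_first_dataset", tmp)
    missing = dataset_files("dataset_copy", tmp)

print(found, missing)
```

An empty list, as returned for dataset_copy here, corresponds to the empty listing shown above after the dataset removal.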

Note

Using the autodelete option is useful when you are using the class for trials or tests, and do not want to keep the dataset files on your computer.

This first tutorial of the Pymicro User Guide on data management is now finished. You should now know how to create, open, copy and remove SampleData datasets.