{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "75a1458f", "metadata": {}, "source": [ "# Pymicro's Data Items " ] }, { "attachments": {}, "cell_type": "markdown", "id": "ea2b8080", "metadata": {}, "source": [ "This third Data Management tutorial will introduce you to most of Pymicro's data items, and how to create, remove and retrieve them. " ] }, { "attachments": {}, "cell_type": "markdown", "id": "1deed6f9", "metadata": {}, "source": [ "## I - Getting data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f54d98c4", "metadata": {}, "source": [ "This first section will present the generic ways to get the data contained in a data item of a *SampleData* dataset. Like in the [previous tutorial](./Get_Datasets_Information.ipynb), we will use the reference dataset used for the `pymicro.core` package unit tests: " ] }, { "cell_type": "code", "execution_count": null, "id": "452b95f4", "metadata": {}, "outputs": [], "source": [ "from pymicro.core.samples import SampleData as SD" ] }, { "cell_type": "code", "execution_count": null, "id": "d7f1f538", "metadata": {}, "outputs": [], "source": [ "from pymicro import get_examples_data_dir # import file directory path\n", "PYMICRO_EXAMPLES_DATA_DIR = get_examples_data_dir() # get the file directory path\n", "import os\n", "dataset_file = os.path.join(PYMICRO_EXAMPLES_DATA_DIR, 'test_sampledata_ref') # test dataset file path\n", "data = SD(filename=dataset_file)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c080535d", "metadata": {}, "source": [ "We will start by printing the content of the dataset and its Index (see [previous tutorial, section II](./Get_Datasets_Information.ipynb)), to see which data we could load from the dataset: " ] }, { "cell_type": "code", "execution_count": null, "id": "917d9482", "metadata": {}, "outputs": [], "source": [ "data.print_dataset_content(short=True)\n", "data.print_index()" ] }, { "cell_type": "markdown", "id": "4eb0a145", "metadata": {}, "source": [ 
"*SampleData* datasets can contain many types of data items, with different formats, shapes and contents. For this reason, the class provides specific methods to get each type of data item. They will be presented, for each type of data, in the next sections of this Notebook. \n", "\n", "In addition, the *SampleData* class provides two generic mechanisms to retrieve data, that only require the name of the targeted data item. They automatically try to identify which type of data matches the provided name, and call the adapted specific \"*get*\" method. They are usefull to quickly and easily get data, but do not allow to access all options offered by specific methods. We will start by reviewing these generic data access mechanisms." ] }, { "cell_type": "markdown", "id": "d5febfad", "metadata": {}, "source": [ "### Dictionary like access to data " ] }, { "attachments": {}, "cell_type": "markdown", "id": "5a4999d9", "metadata": {}, "source": [ "The first way to get a data item is to use the *SampleData* class instance as if ot was a dictionary whose keys and values were respectively the data item Names and content. For a given data item, as dictionary key, you can use one of its 4 possible identificators, *Name*, *Path*, *Indexname* or *Aliases* (see [tutorial 1, section I](./Get_Datasets_Information.ipynb)). \n", "\n", "Let us see an example, by trying to get the array `test_array` of our dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "e48cc005", "metadata": {}, "outputs": [], "source": [ "# get array in a variable, using the data item Name\n", "array = data['test_array']\n", "print(array.shape,'\\n', array)\n", "print(type(array))\n", "\n", "# directly print array, getting it with its Indexname\n", "print('\\n',data['array'])" ] }, { "cell_type": "markdown", "id": "b4ae3625", "metadata": {}, "source": [ "As you can see, when used as a dictionary, the class returned the content of the `test_array` data item as a *numpy* array." 
] }, { "cell_type": "markdown", "id": "408993ee", "metadata": {}, "source": [ "### Attribute like access to data" ] }, { "cell_type": "markdown", "id": "50d55384", "metadata": {}, "source": [ "In addition to the dictionary like access, you can also get data items as if they were attributes of the class, using their *Name*, *Indexname* or *Alias*:" ] }, { "cell_type": "code", "execution_count": null, "id": "d76e7cad", "metadata": {}, "outputs": [], "source": [ "print(data.array)" ] }, { "cell_type": "code", "execution_count": null, "id": "497f63ed", "metadata": {}, "outputs": [], "source": [ "# get the test array in a variable with the attribute like access\n", "array2 = data.test_array\n", "\n", "# Test if both arrays are equal\n", "import numpy as np\n", "np.all(array == array2)" ] }, { "cell_type": "markdown", "id": "2f45dce8", "metadata": {}, "source": [ "Now that these two generic mechanisms have been presented, we will review the basic data item types that can compose your datasets, and how to create or retrieve them with specific class methods. The more complex data types, representing grids and fields, will not be presented here. Dedicated tutorials follow this one to introduce you to these more advanced features of the class. \n", "\n", "To do that, we will create our own dataset. So first, we have to close the test dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "5ae4d09c", "metadata": {}, "outputs": [], "source": [ "data.set_verbosity(True)\n", "del data" ] }, { "cell_type": "markdown", "id": "44005cac", "metadata": {}, "source": [ "## II - HDF5 Groups" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5c8d75b5", "metadata": {}, "source": [ "We will start with HDF5 Groups, as they are the simplest type of data item in the data model. Groups have 2 functions within *SampleData* datasets:\n", "1. organize data by containing other data items \n", "2. 
organize metadata by containing attributes\n", "\n", "First, we will start by creating a dataset, with the `verbose` and `autodelete` options, so that we get information on the actions performed by the class, and so that our dataset is removed once we end this tutorial. We also set the `overwrite_hdf5` option to `True`, in case *tutorial_dataset.h5* exists in the current working directory (created and not removed by another tutorial file for instance)." ] }, { "cell_type": "code", "execution_count": null, "id": "8d0b68fe", "metadata": {}, "outputs": [], "source": [ "data = SD(filename='tutorial_dataset', sample_name='test_sample', verbose = True, autodelete=True, overwrite_hdf5=True)" ] }, { "cell_type": "markdown", "id": "2e77cce0", "metadata": {}, "source": [ "Note that the verbose mode of the class informed us that dataset files with the required name already existed and were hence deleted. \n", "\n", "For now, our dataset is empty and thus contains only the Root group `'/'`. To create a group, you must use the `add_group` method. It has 4 arguments:\n", "* `groupname`: the group to create will have this *Name*\n", "* `location`: the parent group that will contain the created group. By default, its value is `'/'`, the root group\n", "* `indexname`: the group to create will have this *Indexname*. If none is provided, the indexname will be duplicated from the Name\n", "* `replace`: if a group with the same *Name* exists, the *SampleData* class will remove it to create the new one only if this argument is set to `True`. If not, the group will not be created. By default, it is set to `False`\n", "\n", "Let us create a test group, from the root group. 
We will call it `test_group`, and give it a short indexname, for instance `testG`:" ] }, { "cell_type": "code", "execution_count": null, "id": "20e8b084", "metadata": {}, "outputs": [], "source": [ "data.add_group(groupname='test_group', location='/', indexname='testG')" ] }, { "cell_type": "markdown", "id": "90f8cfc2", "metadata": {}, "source": [ "As you can see, the verbose mode prints the confirmation that the group has been created. You can also observe that the method returns a Group object, which is an instance of a class from the *Pytables* package. In practice, you do not need to use *Pytables* objects when working with the *SampleData* class. However, if you want to use them, you can find the documentation of the *group* class [here](https://www.pytables.org/usersguide/libref/hierarchy_classes.html#the-group-class).\n", "\n", "Let us now look at the content of our dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "1fcf3ee5", "metadata": {}, "outputs": [], "source": [ "print(data)" ] }, { "cell_type": "markdown", "id": "b81a28e1", "metadata": {}, "source": [ "The group has indeed been created, with the right *Path*, *Name* and *Indexname*. \n", "\n", "Let us try to create a new group, with the same path and name, but a different indexname:" ] }, { "cell_type": "code", "execution_count": null, "id": "22c804c1", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import tables\n", "\n", "# We run the command in a try structure as it will raise an exception\n", "try:\n", " data.add_group(groupname='test_group', location='/', indexname='Gtest')\n", "except tables.NodeError as NodeError:\n", " print(NodeError)" ] }, { "cell_type": "markdown", "id": "049ac904", "metadata": {}, "source": [ "We got an error, more specifically a *NodeError* linked to the HDF5 dataset structure, as the group already exists. 
As explained earlier, if the `replace` argument is set to `False` (the default value), the class protects the pre-existing data and does not create the new data item. As we are sure that we want to overwrite the Group, we must set the correct argument value: " ] }, { "cell_type": "code", "execution_count": null, "id": "95a124e4", "metadata": {}, "outputs": [], "source": [ "group = data.add_group(groupname='test_group', location='/', indexname='Gtest', replace=True)" ] }, { "cell_type": "markdown", "id": "20d79e24", "metadata": {}, "source": [ "This time we got no error, the previously created group has been deleted, and the new one created. We also assigned the returned *Group* object to the variable `group`. Let us verify:" ] }, { "cell_type": "code", "execution_count": null, "id": "b82f3399", "metadata": {}, "outputs": [], "source": [ "print(data)\n", "print(group)" ] }, { "cell_type": "markdown", "id": "b470c29e", "metadata": {}, "source": [ "As explained in the first section, to get this group data item from the dataset, you can use the dictionary or attribute like data item access. Both mechanisms will call the `get_node` *SampleData* method. \n", "\n", "This method is meant to return simple data items in the form of a *numpy* array, or a *Pytables* node/group format. It takes one of the 4 possible data item identifiers (name, indexname, path or alias) as argument. In this case, it should return a *Group* object, that should be the same as the one returned by the `add_group` method. \n", "\n", "Let us verify it:" ] }, { "cell_type": "code", "execution_count": null, "id": "55eb4325", "metadata": {}, "outputs": [], "source": [ "# get the Group object\n", "group2 = data.get_node('test_group')\n", "print(group2)\n", "\n", "# Group objects can be compared:\n", "print(f' Do the two Group instances represent the same group ? 
{group == group2}')\n", "\n", "# get again with dictionary like access\n", "group2 = data['test_group']\n", "print(group2)\n", "\n", "# get again with attribute like access\n", "group2 = data.Gtest\n", "print(group2)" ] }, { "cell_type": "markdown", "id": "c887610b", "metadata": {}, "source": [ "As you can see, the `get_node` method and the attribute/dictionary like access return the same *Group* instance, that was returned by the `add_group` method.\n", "\n", "Now you know how to create and retrieve *Groups*." ] }, { "cell_type": "markdown", "id": "1ab96998", "metadata": {}, "source": [ "## III - HDF5 attributes " ] }, { "cell_type": "markdown", "id": "bc6b57c7", "metadata": {}, "source": [ "As explained in the previous section, one of the uses of Groups can be to contain *Attributes*, to organize metadata. In this section, we will see how to create attributes. It is actually very simple to add attributes to a data item with *SampleData*.\n", "\n", "Let us see an example. Suppose that we want to add to our new group `test_group` the name of the tutorial notebook file that created it, and the tutorial section where it is created. 
We will start by creating a dictionary gathering this metadata:" ] }, { "cell_type": "code", "execution_count": null, "id": "aa466ec2", "metadata": {}, "outputs": [], "source": [ "metadata = {'tutorial_file':'2_SampleData_basic_data_items.ipynb', \n", " 'tutorial_section':'Section II'}" ] }, { "cell_type": "markdown", "id": "00fe60a4", "metadata": {}, "source": [ "Then, we simply add this metadata to `test_group` with the `add_attributes` method: " ] }, { "cell_type": "code", "execution_count": null, "id": "88c4f821", "metadata": {}, "outputs": [], "source": [ "data.add_attributes(metadata, nodename='Gtest')" ] }, { "cell_type": "markdown", "id": "cf508bc4", "metadata": {}, "source": [ "Let us look at the content of your group to verify that the attributes have been added:" ] }, { "cell_type": "code", "execution_count": null, "id": "d35f1597", "metadata": {}, "outputs": [], "source": [ "data.print_node_attributes('Gtest')" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ed71d180", "metadata": {}, "source": [ "As you can see, the metadata that we just added to the Group is printed. You can also observe that the Group already had metadata, the `group_type` attribute. Here this attribute is `Group`, which indicates that it is a standard HDF5 group, and not a Grid group (image or mesh, see [data model here](./Data_Management.rst))." ] }, { "attachments": {}, "cell_type": "markdown", "id": "5a446924", "metadata": {}, "source": [ "The methods to get attributes have been presented in the [previous tutorial (sec. I-8)](./1_Getting_Information_from_SampleData_datasets.ipynb). 
We will reuse them here:" ] }, { "cell_type": "code", "execution_count": null, "id": "29432c27", "metadata": {}, "outputs": [], "source": [ "tutorial_file = data.get_attribute('tutorial_file','Gtest')\n", "tutorial_sec = data.get_attribute('tutorial_section', 'Gtest')\n", "print(f'The group Gtest has been created with the notebook {tutorial_file}, at the section {tutorial_sec}')" ] }, { "cell_type": "code", "execution_count": null, "id": "59b7b76a", "metadata": {}, "outputs": [], "source": [ "Gtest_attrs = data.get_dic_from_attributes('Gtest')\n", "print(f'The group Gtest has been created with the notebook {Gtest_attrs[\"tutorial_file\"]},'\n", " f' at the section {Gtest_attrs[\"tutorial_section\"]}')" ] }, { "cell_type": "markdown", "id": "65a45d80", "metadata": {}, "source": [ "To conclude this section on attributes, we will introduce the `set_description` and `get_description` methods. These methods are a shortcut to create or get the content of a specific data item attribute, `description`. This attribute is intended to be a string of one or a few sentences that explains the content, origin and purpose of the data item: " ] }, { "cell_type": "code", "execution_count": null, "id": "2f031594", "metadata": {}, "outputs": [], "source": [ "data.set_description(description=\"Just a short example to see how to use descriptions. This is a description.\",\n", " node='Gtest')\n", "data.print_node_attributes('Gtest')" ] }, { "cell_type": "code", "execution_count": null, "id": "522b0e52", "metadata": {}, "outputs": [], "source": [ "print(data.get_description('Gtest'))" ] }, { "cell_type": "markdown", "id": "80d5fc57", "metadata": {}, "source": [ "## IV - Data arrays" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1bffd40c", "metadata": {}, "source": [ "Now that we can create Groups and organize our datasets, we will want to add actual data to them. \n", "\n", "The most common form of scientific data is an array of numbers. 
The most common and powerful Python package used to manipulate large numeric arrays is the *Numpy* package. Through its implementation, and the support of the *Pytables* package, the *SampleData* class can directly load and return *Numpy* arrays, for the storage of numerical arrays in the datasets.\n", "\n", "The method that you will need to use to add a numeric array is `add_data_array`. It accepts the following arguments:\n", "\n", "* `location`: the parent group that will contain the created data array. Mandatory argument\n", "* `name`: the data array to create will have this *Name*\n", "* `indexname`: the data array to create will have this *Indexname*. If none is provided, the indexname will be duplicated from the Name\n", "* `replace`: if an array with the same *Name* exists, the *SampleData* class will remove it to create the new one only if this argument is set to `True`. By default, it is set to `False`\n", "* `array`: a `numpy.ndarray`, the numeric array to be stored into the dataset\n", "\n", "In addition, it accepts two arguments linked to data compression, the `chunkshape` and `compression_options` arguments, that will not be discussed here, but rather in the tutorial dedicated to [data compression](./Data_compression.ipynb). The `name`, `indexname`, `location` and `replace` arguments work exactly as for the `add_group` method presented in the previous section, and the `array` argument is pretty explicit. \n", "\n", "It is possible to create an empty data array item in the dataset. The main purpose of this option will be highlighted in the tutorial explaining how to use `SampleData` to build [custom data models](./Custom_Data_Models.ipynb). For now, note that it allows you to create the internal organization of your dataset without having to add any data to it. It allows you, for instance, to reserve some data item names and indexnames, and to already add metadata. 
We will see below how to create an empty data array item, and later, add actual data to it.\n", "\n", "Let us start by creating a random array of data with the `numpy.random` package, and store it into a data array in the group `test_group`." ] }, { "cell_type": "code", "execution_count": null, "id": "1198fa91", "metadata": {}, "outputs": [], "source": [ "# we start by importing the numpy package\n", "import numpy as np\n", "# we create a random array of 20 elements\n", "A = np.random.rand(20)\n", "print(A)" ] }, { "cell_type": "markdown", "id": "f5949970", "metadata": {}, "source": [ "Now we add the array `A` to our dataset with name `test_array`, indexname `Tarray`, to the group `test_group`:" ] }, { "cell_type": "code", "execution_count": null, "id": "f9ac9f2d", "metadata": {}, "outputs": [], "source": [ "data.add_data_array(location='Gtest', name='test_array', indexname='Tarray', array=A)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9d6aa444", "metadata": {}, "source": [ "As for the group creation in the previous section, the verbose mode of the class informed us that the array has been added to the dataset, in the desired group. Once again, the method has returned a *Pytables* object. Here it is a `tables.Node` object and not a `tables.Group` object. \n", "\n", "You can see that this object contains an array, that is a `Carray`. A `Carray` is a chunkable array. It is a specific type of HDF5 array, whose data are split into multiple chunks which are all stored separately in the file. This enables a strong optimization of the reading speed when dealing with multidimensional arrays. We will not detail this feature of the HDF5 library in this tutorial, but rather in the tutorial dedicated to [data compression](./Data_compression.ipynb). 
However, it is strongly advised to study the concept of HDF5 chunked layout to optimally use *SampleData* datasets (see [Pytables optimization tips](https://www.pytables.org/usersguide/optimization.html) or [HDF5 group dedicated page](https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/)).\n", "\n", "Let us look at the content of our dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "eb6d4dc5", "metadata": {}, "outputs": [], "source": [ "print(data)" ] }, { "cell_type": "markdown", "id": "d0b21f27", "metadata": {}, "source": [ "We now see that our array has been added as a child of the `test_group` Group, as requested. Like for the Group created in section II, we can add metadata to this data item:" ] }, { "cell_type": "code", "execution_count": null, "id": "48767f3e", "metadata": {}, "outputs": [], "source": [ "metadata = {'tutorial_file':'2_SampleData_basic_data_items.ipynb', \n", " 'tutorial_section':'Section IV'}\n", "data.add_attributes(metadata, 'Tarray')\n", "data.print_node_attributes('Tarray')" ] }, { "cell_type": "markdown", "id": "a96917da", "metadata": {}, "source": [ "As you can observe, array data items have an `empty` attribute, that indicates if the data item is associated or not to an empty data array (see a few cells above). They also have a `node_type` attribute, indicating their data item nature, here a *Data Array*.\n", "\n", "Let us try to create an empty array now. In this case, you just have to remove the `array` argument from the method call:" ] }, { "cell_type": "code", "execution_count": null, "id": "88e61b4e", "metadata": {}, "outputs": [], "source": [ "data.add_data_array(location='Gtest', name='empty_array', indexname='emptyA')" ] }, { "cell_type": "markdown", "id": "200433d1", "metadata": {}, "source": [ "The verbose mode of the class indeed informs us that we created an empty data array item. 
Let us print again the content of our dataset, this time with the detailed format to also see the attributes of our data items:" ] }, { "cell_type": "code", "execution_count": null, "id": "459b206e", "metadata": {}, "outputs": [], "source": [ "data.print_dataset_content(short=False)" ] }, { "cell_type": "markdown", "id": "8a5a293a", "metadata": {}, "source": [ "We have as requested our array with the 20 element array, and the empty array. You can see that *SampleData* stores a 1 element array in empty arrays, as it is not possible to create a node with a 0d array. That is why the `empty` attribute is attached to array data items." ] }, { "cell_type": "markdown", "id": "1fb37be3", "metadata": {}, "source": [ "We will try to get our newly added data items as numpy arrays now. As for the `test_group` Group, we can use the `get_node` method for this. We can now introduce its second argument, the `as_numpy` option. If this argument is set to `True`, `get_node` returns the data item as a `numpy.ndarray`. If it is set to `False` (its default value), the method returns a `tables.Node` object from the *Pytables* package, identical to the one returned by the `add_data_array` method when the data item was created. \n", "\n", "Let us see an example:" ] }, { "cell_type": "code", "execution_count": null, "id": "9ab02a42", "metadata": {}, "outputs": [], "source": [ "array_node = data.get_node('test_array')\n", "array = data.get_node('test_array', as_numpy=True)\n", "print('array_node returned with \"as_numpy=False\":\\n',array_node,'\\n')\n", "print('array returned with \"as_numpy=True\":\\n', array, type(array))" ] }, { "cell_type": "markdown", "id": "45abf0a6", "metadata": {}, "source": [ "What happens if we try to get our empty array ?" 
] }, { "cell_type": "code", "execution_count": null, "id": "47247a58", "metadata": {}, "outputs": [], "source": [ "empty_array_node = data.get_node('empty_array')\n", "empty_array = data.get_node('empty_array', as_numpy=True)\n", "\n", "print(f'The empty array node {empty_array_node} is not truly empty in the dataset','\\n')\n", "print(f'It actually contains ... {empty_array} ... a one element array with value 0')" ] }, { "cell_type": "markdown", "id": "13f5a6b9", "metadata": {}, "source": [ "As mentioned earlier, the empty array is not truly empty." ] }, { "cell_type": "markdown", "id": "86a55225", "metadata": {}, "source": [ "We have seen that the `get_node` method can have two different behaviors with data arrays. So what happens if we try to get a data array from our dataset using one of the two generic getter mechanisms explained in section I ?" ] }, { "cell_type": "code", "execution_count": null, "id": "515053e5", "metadata": {}, "outputs": [], "source": [ "array = data['test_array']\n", "print(f'What we got with the dictionary like access is a {type(array)}')\n", "\n", "array = data.test_array\n", "print(f'What we got with the attribute like access is a {type(array)}')" ] }, { "cell_type": "markdown", "id": "53033f75", "metadata": {}, "source": [ "As you can see, the generic mechanisms call the `get_node` method with the `as_numpy=True` option." ] }, { "cell_type": "markdown", "id": "9f36e81f", "metadata": {}, "source": [ "The last thing we have to discuss about data arrays, is how to add actual data to an empty data array item. Actually, when calling the `add_data_array` method with the name/location of an empty array, the method behaves as if it had been called with `replace=True`. 
However, in this case, all metadata that was attached to the empty node is preserved and reattached to the data item created with the input array.\n", "\n", "Let us add some metadata to the empty array, to test this feature:" ] }, { "cell_type": "code", "execution_count": null, "id": "104b653d", "metadata": {}, "outputs": [], "source": [ "metadata = {'tutorial_file':'2_SampleData_basic_data_items.ipynb', \n", " 'tutorial_section':'Section IV'}\n", "data.add_attributes(metadata, 'empty_array')\n", "data.print_node_attributes('empty_array')" ] }, { "cell_type": "markdown", "id": "8b3205cd", "metadata": {}, "source": [ "Let us create a data array and try to add it to the empty array." ] }, { "cell_type": "code", "execution_count": null, "id": "b8fe6dbb", "metadata": {}, "outputs": [], "source": [ "A = np.arange(20)\n", "data.add_data_array(location='Gtest', name='empty_array', indexname='emptyA', array=A)" ] }, { "cell_type": "code", "execution_count": null, "id": "c60dcccf", "metadata": {}, "outputs": [], "source": [ "print(data['emptyA'])\n", "data.print_node_attributes('empty_array')" ] }, { "cell_type": "markdown", "id": "ba2c84d7", "metadata": {}, "source": [ "As you can see thanks to the verbose mode, the old empty data array node is removed, and replaced with a new one containing our data, but also the metadata previously attached to the empty array. A `node_type` has additionally been attached to it to account for the nature of the data newly stored in the data item." ] }, { "cell_type": "markdown", "id": "73410d96", "metadata": {}, "source": [ "Except for the control of data compression parameters, you now know all there is to know to create and retrieve data arrays with *SampleData*. " ] }, { "cell_type": "markdown", "id": "3c4d00a5", "metadata": {}, "source": [ "## V - String arrays" ] }, { "cell_type": "markdown", "id": "79d4d44f", "metadata": {}, "source": [ "We now move to another useful type of data item. 
In many cases, it may be useful to store long lists of strings. Data arrays are restricted to numerical arrays. Attributes are meant to store data of small size and are thus not suited for it either.\n", "\n", "To do this, you will need to rely on `String arrays`, that can be added thanks to the `add_string_array` method. It is basically a mapping of a Python string list to a HDF5 *Pytables* data item. Hence, this method has arguments that you now know well: `name`, `location`, `indexname`, `replace`. In addition, it has a `data` argument that must be the Python list of strings that you want to store in the dataset.\n", "\n", "Let us see an example:" ] }, { "cell_type": "code", "execution_count": null, "id": "d3b09787", "metadata": {}, "outputs": [], "source": [ "List = ['this','is','a','not so long','list','of strings','for','the tutorial !!']\n", "data.add_string_array(name='string_array', location='test_group', indexname='Sarray', data=List)" ] }, { "cell_type": "markdown", "id": "32ef6d1c", "metadata": {}, "source": [ "Like for the previous methods, the verbose mode informs you that the string array has been created, and the method returns the *Pytables* node object associated to the created data item.\n", "\n", "Let us look at the dataset content:" ] }, { "cell_type": "code", "execution_count": null, "id": "739c9202", "metadata": {}, "outputs": [], "source": [ "data.print_dataset_content(short=False)" ] }, { "cell_type": "markdown", "id": "b973ab95", "metadata": {}, "source": [ "You can see in the information printed about the string array that we just created, that it has a `node_type` attribute, indicating that it is a *String Array*. " ] }, { "cell_type": "markdown", "id": "e58de970", "metadata": {}, "source": [ "To manipulate a string array, use the `get_node` method to get the array, and then manipulate it as a list of binary strings. Indeed, strings are automatically converted to `bytes` when creating this type of data item. 
You will hence need to use the `bytes.decode()` method to get the elements of the string_array as UTF-8 or ASCII formatted strings:" ] }, { "cell_type": "code", "execution_count": null, "id": "0bfa4d55", "metadata": {}, "outputs": [], "source": [ "sarray = data['Sarray']\n", "S1 = sarray[0] # here we get a Python bytes string\n", "S2 = sarray[0].decode('utf-8') # here we get a utf-8 string\n", "print(S1)\n", "print(S2,'\\n')\n", "\n", "# Let us print all strings contained in the string array:\n", "for string in sarray:\n", " print(string.decode('utf-8'), end=' ')" ] }, { "cell_type": "markdown", "id": "181021a1", "metadata": {}, "source": [ "The particularity of String arrays is that they are enlargeable. To add additional elements to them, you may use the `append_string_array` method, that takes as arguments the name of the string array, and a list of strings:" ] }, { "cell_type": "code", "execution_count": null, "id": "e54c3507", "metadata": {}, "outputs": [], "source": [ "# we add 3 new elements to the array\n", "data.append_string_array(name='Sarray', data=['We can make','it','bigger !'])\n", "# now Let us print the enlarged list of strings:\n", "for string in data['Sarray']:\n", " print(string.decode('utf-8'), end=' ') " ] }, { "cell_type": "markdown", "id": "1c538c90", "metadata": {}, "source": [ "And that is all there is to know about String arrays !" ] }, { "cell_type": "markdown", "id": "d8d17361", "metadata": {}, "source": [ "## VI - Structured arrays" ] }, { "cell_type": "markdown", "id": "3825d055", "metadata": {}, "source": [ "The last data item type that we will review in this tutorial is analogous to the data array item type (section IV), but is meant to store *Numpy* [structured arrays](https://numpy.org/doc/stable/user/basics.rec.html) (you are strongly encouraged to read this documentation). Those are ndarrays whose datatype is a composition of simpler datatypes organized as a sequence of named fields, in other words, heterogeneous arrays. 
\n", "\n", "The *Pytables* package, which handles the HDF5 dataset within the *SampleData* class, uses a class called `table` to store structured arrays. This terminology is reused within the *SampleData* class. Hence, to add a structured array, you may use the `add_table` method. This method accepts the same arguments as `add_data_array`, plus a `description` argument, that may be an instance of the `tables.IsDescription` class ([see here](https://www.pytables.org/usersguide/libref/declarative_classes.html#description-helper-functions)), or a `numpy.dtype` object. It is an object whose role is to describe the structure of the array (name and type of array *Fields*). The `data` argument value must be a `numpy.ndarray` whose *dtype* is consistent with the `description`.\n", "\n", "Let us see an example. Imagine that we want to create a structured array to store data describing material particles, containing for each particle, its nature, an identity number, its dimensions, and a boolean value indicating the presence of damage at the particle. 
\n", "\n", "To do this, we have to create a suitable `numpy.dtype` and a *numpy* structured array:" ] }, { "cell_type": "code", "execution_count": null, "id": "b2a47685", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# creation of a numpy dtype --> takes as input a list of tuples ('field_name', 'field_type'),\n", "# optionally followed by a shape for fields that hold several values\n", "# Numpy dtype reminder: S25: byte strings of up to 25 characters, ?: boolean values\n", "#\n", "sample_type = np.dtype([('Nature','S25'), ('Id_number',np.int16), ('Dimensions',np.double,(3,)), ('Damaged','?')])\n", "\n", "# now we create an empty array of 2 elements of this type\n", "sample_array = np.empty(shape=(2,), dtype=sample_type)\n", "\n", "# now we create data to represent 2 samples\n", "sample_array['Nature'] = ['Intermetallic', 'Carbide']\n", "sample_array['Id_number'] = [1,2]\n", "sample_array['Dimensions'] = [[20,20,50],[2,2,3]]\n", "sample_array['Damaged'] = [True,False]\n", "\n", "print(sample_array)" ] }, { "cell_type": "markdown", "id": "f67281c4", "metadata": {}, "source": [ "Now that we have our `numpy.dtype` and our structured array, we can create the *table*:" ] }, { "cell_type": "code", "execution_count": null, "id": "46be7061", "metadata": {}, "outputs": [], "source": [ "# create the structured array data item\n", "tab = data.add_table(name='test_table', location='test_group', indexname='tableT', description=sample_type,\n", "                     data=sample_array)\n", "\n", "# adding one attribute to the table\n", "data.add_attributes({'tutorial_section':'VI'},'tableT')\n", "\n", "# printing information on the table\n", "data.print_node_info('tableT')" ] }, { "cell_type": "markdown", "id": "f3df7c52", "metadata": {}, "source": [ "You see above that the method `print_node_info` prints the description of the structured array stored in the dataset, which lets you see its fields and their associated data types. 
Note, however, that these fields and their types are specific objects from the *Pytables* package: **column** objects (`StringCol`, `Int16Col`, ...). You can also see here a `node_type` attribute indicating that the node is a *Structured Array* data item.\n", "\n", "As for the previously studied data items, the method returned the equivalent *Pytables* Node object. Interestingly, this table object has a `description` attribute and a `dtype` attribute holding the corresponding `numpy.dtype`:" ] }, { "cell_type": "code", "execution_count": null, "id": "c56d2325", "metadata": {}, "outputs": [], "source": [ "print(tab.description)\n", "print(tab.dtype)" ] }, { "cell_type": "markdown", "id": "f6fe91cf", "metadata": {}, "source": [ "Once the table is created, it is still possible to expand it in two ways:\n", "1. adding new rows\n", "2. adding new columns\n", "\n", "To add new rows to the table, you will need to create a `numpy.ndarray` that is compatible with the table, *i.e.* that meets the following 2 criteria:\n", "1. having a dtype compatible with the table description (the same fields associated with the same types)\n", "2. having a compatible shape (all dimensions except the last must match)\n", "\n", "Let us try to add two new sample rows to our table:" ] }, { "cell_type": "code", "execution_count": null, "id": "aa585025", "metadata": {}, "outputs": [], "source": [ "# we reuse the 2-element array and overwrite its fields with two new samples\n", "sample_array['Nature'] = ['Intermetallic', 'Carbide']\n", "sample_array['Id_number'] = [3,4]\n", "sample_array['Dimensions'] = [[50,20,30],[3,2,3]]\n", "sample_array['Damaged'] = [True,True]" ] }, { "cell_type": "code", "execution_count": null, "id": "db71b76a", "metadata": {}, "outputs": [], "source": [ "data.append_table(name='tableT', data=sample_array)" ] }, { "cell_type": "markdown", "id": "594aeb9d", "metadata": {}, "source": [ "We should now look at our dataset to see if the data has been appended to our structured array. 
Let us see what happens if we use a generic getter mechanism on our structured array:" ] }, { "cell_type": "code", "execution_count": null, "id": "a6299d95", "metadata": {}, "outputs": [], "source": [ "print(type(data['tableT']),'\\n')\n", "print(data['tableT'].dtype, '\\n')\n", "print(data['tableT'],'\\n')\n", "# you can also directly get table columns:\n", "print('Damaged particle ?',data['tableT']['Damaged'],'\\n')\n", "print('Particles Nature:',data['tableT']['Nature'])" ] }, { "cell_type": "markdown", "id": "9fa29742", "metadata": {}, "source": [ "As for the data array item type, the generic mechanisms here return the data item as a `numpy.ndarray`, with the dtype of the structured table.\n", "We can indeed read the right number of rows, and the right values. Note that strings are necessarily stored as `bytes` in the dataset, so you must decode them to print or use them in standard string format: " ] }, { "cell_type": "code", "execution_count": null, "id": "4a95ed4f", "metadata": {}, "outputs": [], "source": [ "print(data['tableT']['Nature'][0].decode('ascii'))" ] }, { "cell_type": "markdown", "id": "a1ebb301", "metadata": {}, "source": [ "We will now see how to add new columns to the table, using the `add_tablecols` method. It adds a structured `numpy.ndarray` as additional columns to an already existing *table*. It takes 3 arguments: `tablename` (Name, Path, Indexname or Alias of the table to which you want to add columns), `description` and `data`. As for the `add_table` method, the `data` argument *dtype* must be consistent with the `description` argument. 
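Conceptually, adding columns amounts to merging two structured arrays that have the same number of rows. The sketch below shows this idea in pure *Numpy*. `merge_columns` is a hypothetical helper written for illustration only; it is not part of Pymicro, and the actual `add_tablecols` implementation may differ:

```python
import numpy as np

def merge_columns(table, new_cols):
    # hypothetical helper (not a Pymicro function): returns a new structured
    # array holding the fields of `table` followed by those of `new_cols`
    if table.shape != new_cols.shape:
        raise ValueError('the new columns must have one row per table row')
    # build the merged description from the two existing dtypes
    fields = [(name, table.dtype[name]) for name in table.dtype.names]
    fields += [(name, new_cols.dtype[name]) for name in new_cols.dtype.names]
    merged = np.empty(table.shape, dtype=np.dtype(fields))
    # copy the data field by field
    for name in table.dtype.names:
        merged[name] = table[name]
    for name in new_cols.dtype.names:
        merged[name] = new_cols[name]
    return merged

# small stand-in for a stored table, with two rows
table = np.zeros(2, dtype=[('Id_number', np.int16), ('Damaged', '?')])
table['Id_number'] = [1, 2]

# new column to add, also with two rows
extra = np.zeros(2, dtype=[('Position', np.double, (3,))])
extra['Position'] = [[1., 2., 3.], [4., 5., 6.]]

merged = merge_columns(table, extra)
print(merged.dtype.names)
```

The row-count check in this sketch illustrates why the new columns must be given one value per existing row.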
\n", "\n", "Let us add two columns to our structured array, to store for instance the particle position and chemical composition:" ] }, { "cell_type": "code", "execution_count": null, "id": "1bfa0190", "metadata": {}, "outputs": [], "source": [ "# we create a new dtype with the new fields\n", "cols_dtype = np.dtype([('Position',np.double,(3,)), ('Composition','S25')])\n", "\n", "# now we create an empty array of 4 elements of this type (one per table row)\n", "new_cols = np.empty(shape=(4,), dtype=cols_dtype)\n", "\n", "# now we create data to fill the new columns\n", "new_cols['Position'] = [[100.,150.,300],[10,25,10],[520,300,450],[56,12,45]]\n", "new_cols['Composition'] = ['Cr3Si','Fe3C','MgZn2','SiC']" ] }, { "cell_type": "code", "execution_count": null, "id": "d5f75c18", "metadata": {}, "outputs": [], "source": [ "data.add_tablecols(tablename='tableT', description=cols_dtype, data=new_cols)" ] }, { "cell_type": "code", "execution_count": null, "id": "203bfda3", "metadata": {}, "outputs": [], "source": [ "data.print_node_info('tableT')\n", "print(data['tableT'],'\\n')" ] }, { "cell_type": "markdown", "id": "5ed13020", "metadata": {}, "source": [ "As you can see from the verbose mode prints, when adding new columns to a table, the *SampleData* class gets the data from the original table, and creates a new table including the additional columns. From the `print_node_info` output, you can verify that in the process, the metadata attached to the original table has been preserved. You can also observe that the table description has been correctly enriched with the *Position* and *Composition* fields.\n", "\n", "Note that if you provide an additional columns array that does not match the shape of the stored table, you will get a mismatch error."
] }, { "cell_type": "markdown", "id": "24adf9ba", "metadata": {}, "source": [ "To conclude this section on structured arrays, we will see a method that sets the values of a full column of a stored structured array, the `set_tablecol` method. You have to pass as arguments the name of the table, the name of the column field you want to modify, and a *numpy* array containing the new values of the column. Of course, the array type must be consistent with the type of the modified column. \n", "\n", "As an example, let us set all the values of the `'Damaged'` column of our table to `True`:" ] }, { "cell_type": "code", "execution_count": null, "id": "f80518ee", "metadata": {}, "outputs": [], "source": [ "data.set_tablecol(tablename='tableT', colname='Damaged', column=np.array([True,True,True,True]))\n", "print('\\n',data['tableT'],'\\n')\n", "print(data['tableT']['Damaged'],'\\n')" ] }, { "cell_type": "markdown", "id": "3b30af82", "metadata": {}, "source": [ "**You have now learned how to create and get values from all basic data item types that can be stored in SampleData datasets.**\n", "\n", "Before closing this tutorial, we will see how to remove data items from datasets." ] }, { "cell_type": "markdown", "id": "6bf99bee", "metadata": {}, "source": [ "## VII - Removing data items from datasets" ] }, { "cell_type": "markdown", "id": "23b9e7f3", "metadata": {}, "source": [ "Removing data items is very easy: you just have to call the `remove_node` method, and provide the name of the data item you want to remove. When removing non-empty Groups from the dataset, the optional `recursive` argument should be set to `True`, to allow the method to remove the group and all of its children. If this is not the case, the method will not remove Groups that have children. 
\n", "\n", "Let us try to remove our test_array from the dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "c20259b1", "metadata": {}, "outputs": [], "source": [ "data.remove_node('test_array')\n", "data.print_dataset_content()" ] }, { "cell_type": "markdown", "id": "7d4fde0f", "metadata": {}, "source": [ "The array data item has indeed been removed from the dataset. Let us now try to remove our test group and all of its children. We should end up with an empty dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "2f8c9475", "metadata": {}, "outputs": [], "source": [ "data.remove_node('test_group', recursive=True)\n", "data.print_dataset_content()" ] }, { "cell_type": "markdown", "id": "c4d04d39", "metadata": {}, "source": [ "You can see from the verbose mode output that the method has indeed removed the group and all of its children. \n", "You may also remove node attributes using the `remove_attribute` and `remove_attributes` methods. " ] }, { "cell_type": "markdown", "id": "80bbc15f", "metadata": {}, "source": [ "**That's it! You now know how to remove data items from your datasets. \n", "This tutorial is finished, we can now close our test dataset.**" ] }, { "cell_type": "code", "execution_count": null, "id": "de412087", "metadata": {}, "outputs": [], "source": [ "del data" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8 (default, Aug 18 2020, 08:33:21) \n[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]" }, "vscode": { "interpreter": { "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" } } }, "nbformat": 4, "nbformat_minor": 5 }