{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Information on Datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This second User Guide tutorial will introduce you to getting information on Pymicro's datasets, and how to visualize datasets content. You will find a short summary of all methods reviewed in this tutorial at the end of this page. \n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "**Warning** \n",
    "    \n",
    "This Notebook review the methods to get information on SampleData HDF5 datsets content. Some of the methods detailed here produce very long outputs, that have been conserved in the documentation version. Reading completely the content of this output is absolutely not necessary to learn what is detailed on this page, they are just provided here as examples. So do not be afraid free to scroll down quickly when you see large prints !\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymicro.core.samples import SampleData as SD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## I - The SampleData Naming system "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SampleData datasets are composed of a set of organized data items. When handling datasets, you will need to specify which item you want to interact with or create. The `SampleData` class provides 4 different ways to refer to datasets. The first type of data item identificator is :\n",
    "\n",
    "1. the **Path** of the data item in the HDF5 file. \n",
    "\n",
    "Like a file within a filesystem has a path, HDF5 data items have a Path within the dataset. Each data item is the children of a HDF5 Group (analogous to a file contained in a directory), and each Group  may also be children of a Group (analogous to a directory contained in a directory). The origin directory, that we already encountered above, is called the *root* group, and has the path `'/'`. The Path offers a completely non-ambiguous way to designate a data item within the dataset, as it is unique. A typical path of a dataitem will look like that : `/Parent_Group1/Parent_Group2/ItemName`. However, pathes can become very long strings, and are usually not a convenient way to name data items. For that reason, you also can refer to them in `SampleData` methods using:\n",
    "\n",
    "2. the **Name** of the data item.\n",
    "\n",
    "It is the last element of its **Path**, that comes after the last `/` character. For a dataset that has the path `/Parent_Group1/Parent_Group2/ItemName`, the dataset Name is `ItemName`. It allows to refer quickly to the data item without writing its whole Path.\n",
    "\n",
    "However, note that two different datasets may have the same Name (but different pathes), and thus it may be necessary to use additional names to refer to them with no ambiguity without having to write their full path. In addition, it may be convenient to be able to use, in addition to its storage name, one or more additional and meaningfull names to designate a data item. For these reasons, two additional identificators can be used:\n",
    "\n",
    "3. the **Indexname** of the data item\n",
    "4. the **Alias** or aliases of the data item\n",
    "\n",
    "Those two types of indentificators are strings that can be used as additional data item **Names**. They play completely similar roles. The **Indexname** is also used in the dataset Index (see below), that gather the data item indexnames together with their pathes within the dataset. **All data items must have an Indexname**, that can be identical to their Name. If additionnal names are given to a dataset, they are stored as an **Alias**.\n",
    "\n",
    "Many `SampleData` methods have a `nodename` or `name` argument. Everytime you will encounter it, you may use one of the 4 identificators presented in this section, to provide the name of the dataset you want to create or interact with. Many examples will follow in the rest of this Notebook, and of this User Guide.\n",
    "\n",
    "Let us now move on to discover the methods that allow to explore the datasets content."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## II- Interactively get information on datasets content "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this section is to review the various way to get interactive information on your SampleData dataset (interactive in the sens that you can get them by executing SampleData class methods calls into a Python interpreter console).\n",
    "\n",
    "For this purpose, we will use a pre-existing dataset that already has some data stored, and look into its content. This dataset is a reference SampleData dataset used for the `core` package unit tests. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymicro import get_examples_data_dir # import file directory path\n",
    "PYMICRO_EXAMPLES_DATA_DIR = get_examples_data_dir() # get the file directory path\n",
    "import os\n",
    "dataset_file = os.path.join(PYMICRO_EXAMPLES_DATA_DIR, 'test_sampledata_ref') # test dataset file path\n",
    "data = SD(filename=dataset_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1- The Dataset Index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As explained in the previous section, all data items have a **Path**, and an **Indexname**. The collection of **Indexname/Path** pairs forms the Index of the dataset. For each `SampleData` dataset, an `Index` Group is stored in the root Group, and the collection of those pairs is stored as attributes of this `Index` Group. Additionnaly, a class attribute `content_index` stores them as a dictionary in the clas instance, and allows to access them easily. The dictionary and the `Index` Group attributes are automatically synchronized by the class. \n",
    "\n",
    "Let us see if we can see the dictionary content:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.content_index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see the dictionary keys that are names of data items, and associated values, that are hdf5 pathes. You can see also data item Names at the end of their Pathes. The data item **aliases** are also stored in a dictionary, that is an attribute of the class, named `aliases`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.aliases"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that this dictionary contains keys only for data item that have additional names, and also that those keys are the data item indexnames.\n",
    "\n",
    "The dataset index can be plotted together with the aliases, with a prettier aspect, by calling the method `print_index`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method prints the content of the dataset Index, with a given depth and from a specific root. The depth is the number of parents that a data item has. The root Group has thus a depth of 0, its childrens a depth of 1, the childrens of its childrens a depth of 2, and so on... The *local root* argument can be changed, to print only the Index for data items that are children of a specific group. When used without arguments, `print_index` uses a depth of 3 and the dataset root as default settings.\n",
    "\n",
    "As you can see, our dataset already contains some data items. We can already identify 3 HDF5 Groups (`test_group`, `test_image`, `test_group`), as they have childrens, and a lot of other data items. \n",
    "\n",
    "Let us try different to print Indexes with different parameters. To start, Let us try to print the Index from a different local root, for instance the group with the path `/test_image`. The way to do it is to use the `local_root` argument. We will hence give it the value of the `/test_image` path. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_index(local_root=\"/test_image\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `print_index` method local root arguments needs the name of the Group whose children Index must be printed. As explained in section II, you may use for this other identificators than its **Path**. Let us try its **Name** (last part of its path), which is `test_image`, or its **Indexname**, which is `image`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_index(local_root=\"test_image\")\n",
    "data.print_index(local_root=\"image\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the result is the same in the 3 cases.\n",
    "\n",
    "Let us now try to print the dataset Index with a maximal data item depth of 2, using the `max_depth` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_index(max_depth=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the information about data items stored into the `/test_mesh/Geometry` group, that have a depth of 3, has not been printed. \n",
    "\n",
    "Of course, you can combine those two arguments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_index(max_depth=2, local_root='mesh')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `print_index` method is usefull to get a glimpse of the content and organization of the whole dataset, and to quickly see the short indexnames or aliases that you can use to refer to data items. In practical applications, datasets can become complex, so be sure to use the `local_root` and `max_depth` option to limit the printed information and quickly find the data items you are looking for. \n",
    "\n",
    "To add aliases to data items or Groups, you can use the `add_alias` method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Index allows to quickly see the internal structure of your dataset, however, it does not provide detailed information on the data items. We will now see how to retrieve it with the `SampleData` class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2- The Dataset content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `SampleData` class provides a method to print an organized and detailed overview of the data items in the dataset, the `print_dataset_content` method. Let us see what the methods prints when called with no arguments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_dataset_content()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, this method prints by increasing depth, detailed information on each Group and each data item of the dataset, with a maximum depth that can be specified with a `max_depth` argument (like the method `print_index`, that has a default value of 3). The printed output is structured by groups: each Group node is described first, followed by a Group CONTENT string that provide decripstion for all of its childrens. \n",
    "\n",
    "For each data item, or Group, the method prints their name, path, type, attributes, content, compression settings and memory size if it is an array, children names if it is a Group. Hence, when calling this method, you can see the content and organization of the dataset, all the metadata attached to all data items, and the disk size occupied by each data item. As you progress through this tutorial, you will learn the meaning of those informations for all types of *SampleData* data items.\n",
    "\n",
    "The `print_dataset_content` method has a `short` boolean argument, that allows to plot a condensed string representation of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_dataset_content(short=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This shorter print, provide a complete and visual overview of the dataset organization, and indicate the memory size and type of each data item or Group in the dataset. The printed output distinguishes Group data items, from Nodes data item. The later regroups all types of arrays that may be stored in the HDF5 file. \n",
    "\n",
    "Both short and long version of the `print_dataset_content` output can be written into a text file, if a filename is provided as value for the `to_file` method argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_dataset_content(short=True, to_file='dataset_information.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let us open the content of the created file, to see if the dataset information has been written in it:\n",
    "%cat dataset_information.txt\n",
    "\n",
    "os.remove('dataset_information.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Note** \n",
    "    \n",
    "The string representation of the *SampleData* class is composed of a first part, which is the output of the `print_index` method, and a second part, that is the output of the `print_datase_content` method (short output).\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SampleData string representation :\n",
    "print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you know how to get a detailed overview of the dataset content. However, with large datasets, that may have a complex internal organization (many Groups, lot of data items and metadata...), the `print_dataset_content` return string can become very large. In this case, it becomes cumbersome to look for a specific information on a Group or on a particular data item. For this reason, the *SampleData* class provides methods to only print information on one or several data items of the dataset. They are presented in the next subsections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3- Get information on data items"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get information on a specific data item (including Groups), you may use the `print_node_info` method. This method has 2 arguments: the **name** argument, and the **short** argument. As explain in section II, the **name** argument can be one of the 4 possible identifier that the target node can have (name, path, indexname or alias). The **short** argument has the same effect on the printed output as for the `print_dataset_content` method. Let us look at some examples. Its default value is `False`, *i.e.* the detailed output.\n",
    "\n",
    "First, we will for instance want to have information on the Image Group that is stored in the dataset. The `print_index` and short `print_dataset_content` allowed us to see that this group has the name `test_image`, the indexname `image`, and the path `/test_image`. We will call the method with two of those identificators, and with the two possible values of the `short` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Method called with data item indexname, and short output\n",
    "data.print_node_info(nodename='image', short=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Method called with data item Path and long output\n",
    "data.print_node_info(nodename='/test_image', short=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can observe that this method prints the same block of information that the one that appeared in the `print_dataset_content` method output, for the description of the *test_image* group. With this block, we can learn that this Group is a children of the root Group ('/'), that it has five childrens that are the data items named *Field_index*, *test_image_field*, and three items *test_tensor_TX_X*. We can also see its attributes names and values. Here they provide information on the nature of the Group, that is a 3D image group, and on the topology of this image (for instance, that it is a 9x9x9 voxel image, of size 0.2). Finally, we can see that grid data can be associated to a specific time. Indeed, three time values are provided for the image in the `time_list` attribute. They correspond to the time values associated to the three children arrays named *test_tensor_TX_X*.\n",
    "\n",
    "Let us now apply this method on a data item that is not a group, the `test_array` data item. The `print_index` function instructed us that this node has an alias name, that is `test_alias`. We will use it here to get information on this node, to illustrate the use of the only type of node indicator that has not been used throughout this notebook: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_node_info('test_alias')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we can learn which is the Node parent, what is the node Name, see that it has no attributes, see that it is an array of shape (51,), that it is not stored with data compression (compresion level to 0), and that it occupies a disk space of 64 Kb.\n",
    "\n",
    "The `print_node_info` method is usefull to get information on a specific target, and avoir dealing with the sometimes too large output returned by the `print_dataset_content` method. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4- Get information on Groups content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The previous subsection showed that the `print_node_info` method applied on Groups returns only information about the group name, metadata and children names. The *SampleData* class offers a method that allows to print this information, with in addition, the detailed content of each children of the target group: the `print_group_content` method.\n",
    "\n",
    "Let us try it on the Mesh group of our test dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_group_content(groupname='test_mesh')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obviously, this methods is identical  to the `print_dataset_content` method, but restricted to one Group. As the first one, it has a `to_file`, a `short` and a `max_depth` arguments. These arguments work just as for `print_dataset_content` method, hence there use is not detailed here. the  However, you may see one difference here. In the output printed above, we see that the `test_mesh` group has a `Geometry` children which is a group, but whose content is not printed. The `print_group_content` has indeed, by default, a non-recursive behavior. To get a recursive print of the group content, you must set the `recursive` argument to `True`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_group_content('test_mesh', recursive=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the information on the childrens of the *Geometry* group have been printed. Note that the `max_depth` argument is considered by this method as an absolute depth, meaning that you have to specify a depth that is at least the depth of the target group to see some output printed for the group content. The default maximum depth for this method is set to a very high value of 1000. Hence, `print_group_content` prints by defaults group contents with a *recursive* behavior. Note also that `print_group_content` with the *recursive* option, is equivalent to  `print_dataset_content` but prints the dataset content as if the target group was the root."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5- Get information on grids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the `SampleData` class main functionalities is the manipulation and storage of spatially organized data, which is handled by **Grid** groups in the data model. Because they are usually key data for mechanical sample datasets, the *SampleData* class provides a method to print Grouo informations only for Grid groups, the `print_grids_info` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_grids_info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method also has the `to_file` and `short` arguments of the `print_dataset_content` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_grids_info(short=True, to_file='dataset_information.txt')\n",
    "%cat dataset_information.txt\n",
    "\n",
    "os.remove('dataset_information.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6- Get xdmf tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As explained in the [data management information page](./Data_Management.rst) the *SampleData* class can automatically produce a XDMF file containing metadata, describing Grid groups topology, data types and fields, contained in the HDF5 dataset. The class handles internally the construction of a XDMF tree structure that matches the data stored in the dataset, with the help of the **lxml** package. However, you may want to look at the content of the XDMF tree while you are interactively using your *SampleData* instance. In this case, you can use the `print_xdmf` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_xdmf()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can observe, you will get a print of the content of the XDMF file that would be written if you would close the file right now. You can observe that the XDMF file provides information on the grids that match those given by the Groups and Nodes attributes printed above with the previously studied method: the test image is a regular grid of 10x10x10 nodes, *i.e.* a 9x9x9 voxels grid. Only one field is defined on *test_image*, *test_image_field*, whereas two are defined on *test_mesh*.\n",
    "\n",
    "This XDMF file can directly be opened in Paraview, if both file are closed. If any syntax or formatting issue is encountered when Paraview reads the XDMF file, it will return an error message and the data visualization will not be rendered. The `print_xdmf` method allows you to verify your XDMF data and syntax, to make sure that the data formatting is correct."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To generate the XDMF file associated to this tree, you can use the following method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.write_xdmf(filename='test.xdmf')\n",
    "%cat test.xdmf  # print XDMF file content\n",
    "\n",
    "os.remove('test.xdmf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method has a `filename` argument, that can be used to specify the name of the written XDMF file. If it is not specified, the file will be written with the same basename as the HDF5 dataset file. This XDMF can be opened with Paraview to visualize the 3D data that the dataset contains. For that, the two files must be in the same directory.\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Note** \n",
    "    \n",
    "**The XDMF file associated to the dataset is automatically written when you close a dataset.**\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7- Get memory size of file and data items"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*SampleData* is designed to create large datasets, with data items that can reprensent tens Gb of data or more. Being able to easily see and identify which data items use the most disk space is a crucial aspect for data management. Until now, with the method we have reviewed, we only have been able to print the Nodes disk sizes together with a lot of other information. In order to speed up this process, the *SampleData* class has one method that allow to directly query and print only the memory size of a Node, the `get_node_disk_size` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.get_node_disk_size(nodename='test_array')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the default behavior of this method is to print a message indicating the Node disk size, but also to return a tuple containing the value of the disk size and its unit. If you want to print data in bytes, you may call this method with the `convert` argument set to `False`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.get_node_disk_size(nodename='test_array', convert=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to use this method to get a numerical value within a script, but do not want the class to print anything, you can use the `print_flag` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "size, unit = data.get_node_disk_size(nodename='test_array', print_flag=False)\n",
    "print(f'node size is {size} {unit}')\n",
    "\n",
    "size, unit = data.get_node_disk_size(nodename='test_array', print_flag=False, convert=False)\n",
    "print(f'node size is {size} {unit}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The disk size of the whole HDF5 file can also be printed/returned, using the `get_file_disk_size` method, that has the same `print_flag` and `convert` arguments: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.get_file_disk_size()\n",
    "\n",
    "size, unit = data.get_file_disk_size(convert=False, print_flag=False)\n",
    "print(f'\\nPrinted by script: file size is {size} {unit}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8- Get nodes/groups attributes (metadata)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another central aspect of the *SampleData* class is the management of metadata, that can be attached to all Groups or Nodes of the dataset. Metadata comes in the form of HDF5 attributes, that are Name/Value pairs, and that we already encountered when exploring the outputs of methods like `print_dataset_content`, `print_node_info`...\n",
    "\n",
    "Those methods print the Group/Node attributes together with other information. To only print the attributes of a given data item, you can use the `print_node_attributes` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_node_attributes(nodename='test_mesh')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, this method prints a list of all data item attributes, with the format `* Name : Value \\n`.\n",
    "It allows you to quickly see what attributes are stored together with a given data item, and their values. \n",
    "\n",
    "If you want to get the value of a specific attribute, you can use the `get_attribute` method. It takes two arguments, the name of the attribute you want to retrieve, and the name of the data item where it is stored:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Nnodes = data.get_attribute(attrname='number_of_nodes', nodename='test_mesh')\n",
    "print(f'The mesh test_mesh has {Nnodes} nodes')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also get all attributes of a data item as a dictionary. In this case, you just need to specify the name of the data item from which you want attributes, and use the `get_dic_from_attributes` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mesh_attrs = data.get_dic_from_attributes(nodename='test_mesh')\n",
    "\n",
    "for name, value in mesh_attrs.items():\n",
    "    print(f' Attribute {name} is {value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have now seen how to explore all of types of information that a *SampleData* dataset may contain, individually or all together, interactively, from a Python console. Let us review now how to explore the content of *SampleData* datasets with external softwares."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## III - Visualize dataset contents with Vitables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All the information that you can get with all the methods presented in the previous section can also be accessed externally by opening the HDF5 dataset file with the **Vitables** software. This software is usually part of the *Pytables* package, that is a dependency of *pymicro*. You should be able to use it in a Python environement compatible with *pymicro*. If needed, you may refer to the *Vitables* website to find download and installations instructions for *PyPi* or *conda*: https://vitables.org/.\n",
    "\n",
    "*Vitables* provide a graphical interface that allows you to browse through all your dataset data items, and access or modify their stored data and metadata values. You may either open *Vitables* and then open your HDF5 dataset file from the *Vitables* interface, or you can directly open Vitables to read a specfic file from command line, by running:\n",
    "`vitables my_dataset_path.h5`. \n",
    "\n",
    "This command will work only if your dataset file is closed (if the *SampleData* instance still exists in your Python console, this will not work, you first need to delete your instance to close the files). \n",
    "\n",
    "However, the *SampleData* class has a specific method to allowing to open your dataset with *Vitables* interactively, directly from your Python console: the method `pause_for_visualization`. As explained just above, this method closes the HDF5 dataset, and runs in your shell the command `vitables my_dataset_path.h5`. Then, it freezes the interactive Python console and keep the dataset files closed, for as long as the *Vitables* software is running. When *Vitables* is shutdown, the *SampleData* class will reopen the HDF5 file, synchronize with them and resume the interactive Python console. \n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "**Warning** \n",
    "    \n",
    "When calling the `pause_for_visualization` method from a python console (ipython, Jupyter...), you may face environment issue leading to your shell not finding the proper *Vitables* software executable. To ensure that the right *Vitables* is found, the method can take an optional argument `Vitables_path`. If this argument is passed, the method will run, after closing the HDF5 dataset, the command \n",
    "`Vitables_path my_dataset_path.h5`\n",
    "</div>\n",
    "\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Note** \n",
    "    \n",
    "The method is not called here to allow automatic execution of the Notebook when building the documentation on a platform that do not have Vitables available.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# uncomment to test \n",
    "# data.pause_for_visualization(Vitables=True, Vitables_path='Path_to_Vitables_executable')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please refer to the *Vitables* documentation, that can be downloaded here https://sourceforge.net/projects/vitables/files/ViTables-3.0.0/, to learn how to browse through your HDF5 file. The Vitables software is very intuitive, you will see that it is provides a usefull and convenient tool to explore your SampleData datasets outside of your interactive Python consoles. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IV - Visualize datasets grids and fields with Paraview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As for **Vitables**, the `pause_for_visualization` method allows you to open your dataset with Paraview, interactively from a Python console. \n",
    "\n",
    "Paraview will provide you with a very powerfull visualization tool to render your spatially organized data (grids) stored in your datasets. Unlike *Vitables*, *Paraview* must read a XDMF file to be able to render 3D data. Hence, if you want to open your dataset with Paraview, outside of a Python console, make sure to have Paraview in your PATH, that you have generated the XDMF file associated to your HDF5 dataset and that both are in the same directory (see above), and run in your shell the command:\n",
    "`paraview my_dataset_path.xdmf`.\n",
    "\n",
    "As you may have guessed, the `pause_for_visualization` method, when called interactively with the *Paraview* argument set to *True*, will execute those steps automatically, just like for the *Vitables* option. The dataset will remained closed and the Python console freezed for as long as you will keep the Paraview software running. When you shutdown Paraview, the *SampleData* class will reopen the HDF5 file, synchronize with them and resume the interactive Python console. \n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "**Warning** \n",
    "    \n",
    "When calling the `pause_for_visualization` method from a python console (ipython, Jupyter...), you may face environment issues, leading to the interactive console not finding the proper *Paraview* executable file. To ensure that the right *Paraview* is found, the method can take an optional argument `Paraview_path`. If this argument is passed, the method will run, after closing the HDF5 and XDMF files, the command \n",
    "`Paraview_path my_dataset_path.xdmf`\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Like for Vitables --> uncomment to test\n",
    "# data.pause_for_visualization(Paraview=True, Paraview_path='Path_to_Paraview_executable')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Note** \n",
    "    \n",
    "**It is recommended to use a recent version of the Paraview software to visualize SampleData datasets (>= 5.0).**\n",
    "When opening the XDMF file, Paraview may ask you to choose a specific file reader. It is recommended to choose the \n",
    "**XDMF_reader**, and not the **Xdmf3ReaderT**, or **Xdmf3ReaderS**.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## V - Using command line tools"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also examine the content of your HDF5 datasets with generoc HDF5 command line tools, such as h5ls or h5dump:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "**Warning** \n",
    "    \n",
    "In the following, executable programs that come with the HDF5 library and the Pytables package are used. If you are executing this notebook with Jupyter, you may not be able to have those executable in your path, if your environment is not suitably set. A workaround consist in finding the absolute path of the executable, and replacing the executable name in the following cells by its full path. For instance, replace \n",
    "    \n",
    "    `ptdump file.h5` \n",
    "with\n",
    "    \n",
    "    `/full/path/to/ptdump file.h5`\n",
    "    \n",
    "To find this full path, you can run in your shell the command `which ptdump`. Of course, the same applies for `h5ls` and `h5dump`.\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Note** \n",
    "    \n",
    "Most code lines below are commented as they produce very large outputs, that otherwise pollute the documentation if they are included in the automatic build process. Uncomment them to test them if you are using interactively these notebooks !\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For that, you must first close your dataset. If you don't, this tools will not be able to open the HDF5 file as it is opened by the SampleData class in the Python interpretor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del data "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# raw output of H5ls --> prints the childrens of the file root group\n",
    "!h5ls ../data/test_sampledata_ref.h5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# recursive output of h5ls (-r option) -->  prints all data items \n",
    "!h5ls -r ../data/test_sampledata_ref.h5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# recursive (-r) and detailed (-d) output of h5ls --> also print the content of the data arrays\n",
    "# !h5ls -rd ../data/test_sampledata_ref.h5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# output of h5dump:\n",
    "# !h5dump ../data/test_sampledata_ref.h5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see if you uncommented and executed this cell, *h5dump* prints a a fully detailed description of your dataset: organization, data types, item names and path, and item content (value stored in arrays). As it produces a very large output, it may be convenient to write its output in a file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !h5dump ../data/test_sampledata_ref.h5 > test_dump.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !cat test_dump.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also use the command line tool of the *Pytables* software **ptdump**, that also takes as argument the HDF5 file, and has two command options, the verbose mode `-v`, and the detailed mode `-d`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# uncomment to test !\n",
    "# !ptdump ../data/test_sampledata_ref.h5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# uncomment to test!\n",
    "# !ptdump -v ../data/test_sampledata_ref.h5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# uncomment to test !\n",
    "# !ptdump -d ../data/test_sampledata_ref.h5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**This second tutorial on Data Management with Pymicro User Guide is now finished. You should now know how to retrieve all the information you need in your datasets, and visualize 3D data with Paraview from your datasets. **"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.17"
  },
  "vscode": {
   "interpreter": {
    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}