{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e7dbd6fe",
   "metadata": {},
   "source": [
    "# Data Compression with Pymicro"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0945f582",
   "metadata": {},
   "source": [
    "This short Notebook will introduce you to how to efficiently compress your data within SampleData datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc913479",
   "metadata": {},
   "source": [
    "## I - Data compression with HDF5 and Pytables"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8698fcd3",
   "metadata": {},
   "source": [
    "### HDF5 and data compression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d17c269",
   "metadata": {},
   "source": [
    "HDF5 allows compression filters to be applied to datasets in a file to minimize the amount of space it consumes. These compression features allow to drastically improve the storage space required for you datasets, as well as the speed of I/O access to datasets, and can differ from one data item to another within the same HDF5 file. A detailed presentation of HDF5 compression possibilities is provided [here](https://support.hdfgroup.org/HDF5/faq/compression.html). \n",
    "\n",
    "The two main ingredients that control compression performances for HDF5 datasets are the compression filters used (which compression algorithm, with which parameters), and the using a chunked layout for the data. This two features are briefly developed hereafter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2263a8d2",
   "metadata": {},
   "source": [
    "### Pytables and data compression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ceaf1ad",
   "metadata": {},
   "source": [
    "The application of compression filters to HDF5 files with the *SampleData* class is handled by the *Pytable* package, on which is built the *SampleData* HDF5 interface. *Pytables* implements a specific containers class, the `Filters` class, to gather the various settings of the compression filters to apply to the datasets in a HDF5 file.\n",
    "\n",
    "When using the *SampleData* class, you will have to specify this compression filter settings to class methods dedicated to data compression.  These settings are the parameters of the *Pytables* `Filters` class. These settings and their possible values are detailed in the next subsection. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5036668a",
   "metadata": {},
   "source": [
    "#### Available filter settings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "098763e1",
   "metadata": {},
   "source": [
    "The description given below of compression options available with *SampleData*/*Pytables* is exctracted from [the *Pytables* documentation of the *Filter* class](https://www.pytables.org/usersguide/libref/helper_classes.html#the-filters-class)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9dc47fe0",
   "metadata": {},
   "source": [
    "* **complevel** (int) – Specifies a compression level for data. The allowed range is 0-9. A value of 0 (the default) disables compression.\n",
    "\n",
    "* **complib** (str) – Specifies the compression library to be used. Right now, `zlib` (the default), `lzo`, `bzip2` and `blosc` are supported. Additional compressors for Blosc like `blosc:blosclz` (‘blosclz’ is the default in case the additional compressor is not specified), `blosc:lz4`, `blosc:lz4hc`, `blosc:snappy`, `blosc:zlib` and `blosc:zstd` are supported too. Specifying a compression library which is not available in the system issues a FiltersWarning and sets the library to the default one.\n",
    "\n",
    "* **shuffle** (bool) – Whether or not to use the Shuffle filter in the HDF5 library. This is normally used to improve the compression ratio. A false value disables shuffling and a true one enables it. The default value depends on whether compression is enabled or not; if compression is enabled, shuffling defaults to be enabled, else shuffling is disabled. Shuffling can only be used when compression is enabled.\n",
    "\n",
    "* **bitshuffle** (bool) – Whether or not to use the BitShuffle filter in the Blosc library. This is normally used to improve the compression ratio. A false value disables bitshuffling and a true one enables it. The default value is disabled.\n",
    "\n",
    "* **fletcher32** (bool) – Whether or not to use the Fletcher32 filter in the HDF5 library. This is used to add a checksum on each data chunk. A false value (the default) disables the checksum.\n",
    "\n",
    "* **least_significant_digit** (int) – If specified, data will be truncated (quantized). In conjunction with enabling compression, this produces ‘lossy’, but significantly more efficient compression. For example, if least_significant_digit=1, data will be quantized using around(scale*data)/scale, where scale = 2^bits, and bits is determined so that a precision of 0.1 is retained (in this case bits=4). Default is None, or no quantization."
   ]
  },
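  {
   "cell_type": "markdown",
   "id": "7a1e4c90",
   "metadata": {},
   "source": [
    "As an illustration, the settings listed above map directly onto the keyword arguments of the *Pytables* `Filters` class. The following cell is a minimal sketch, independent of *SampleData*: it assumes that the *Pytables* package can be imported as `tables`, and the `settings` dictionary (its name and its values) is only an example chosen for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c85d2f13",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tables\n",
    "\n",
    "# Example settings (illustrative values): keys are Filters keyword arguments\n",
    "settings = {'complib': 'zlib', 'complevel': 1, 'shuffle': True}\n",
    "\n",
    "# Build the Pytables container that gathers the compression filter settings\n",
    "filters = tables.Filters(**settings)\n",
    "\n",
    "# Inspect the resulting settings, including those left to their default value\n",
    "print(filters)\n",
    "print('complib:', filters.complib)\n",
    "print('complevel:', filters.complevel)\n",
    "print('shuffle:', filters.shuffle)\n",
    "print('fletcher32:', filters.fletcher32)"
   ]
  },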
  {
   "cell_type": "markdown",
   "id": "7dc127a8",
   "metadata": {},
   "source": [
    "### Chunked storage layout"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c6edfe6",
   "metadata": {},
   "source": [
    "Compressed data is stored in a data array of an HDF5 dataset using a chunked storage mechanism. When chunked storage is used, the data array is split into equally sized chunks each of which is stored separately in the file, as illustred on the diagram below. Compression is applied to each individual chunk. When an I/O operation is performed on a subset of the data array, only chunks that include data from the subset participate in I/O and need to be uncompressed or compressed.\n",
    "\n",
    "Chunking data allows to:\n",
    "\n",
    "* Generally improve, sometimes drastically, the I/O performance of datasets. This comes from the fact that the chunked layout removes the reading speed anisotropy for data array that depends along which dimension its elements are read (*i.e* the same number of disk access are required when reading data in rows or columns).\n",
    "* Chunked storage also enables adding more data to a dataset without rewriting the whole dataset.\n",
    "\n",
    "<img src=\"./Images/Tutorial_5/chuncked_layout.png\" width=\"50%\">\n",
    "\n",
    "**By default, data arrays are stored with a chunked layout in SampleData datasets.** The size of chunks is the key parameter that controls the impact on I/O performances for chunked datasets. **The shape of chunks is computed automatically by the Pytable package, providing a value yielding generally good I/O performances.** if you need to go further in the I/O optimization, you may consult the *Pytables* [documentation page](https://www.pytables.org/usersguide/optimization.html) dedicated to compression optimization for I/O speed and storage space. In addition, it is highly recommended to read [this document](https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-ImprovingIOPerformanceCompressedDatasets.pdf) in order to be able to efficiently optimize I/O and storage performances for your chunked datasets. These performance issues will not be discussed in this tutorial."
   ]
  },
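  {
   "cell_type": "markdown",
   "id": "3e9b6a47",
   "metadata": {},
   "source": [
    "To see what chunk shape *Pytables* picks on its own, the sketch below creates a compressed array directly with *Pytables* (not through *SampleData*) in a temporary file and prints its `chunkshape`. The file name, the node name and the array size are arbitrary choices made for this illustration, and the chunk shape that is printed may differ between *Pytables* versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d04f71ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tempfile\n",
    "import numpy as np\n",
    "import tables\n",
    "\n",
    "# Work in a temporary directory so that nothing is left on disk afterwards\n",
    "with tempfile.TemporaryDirectory() as tmpdir:\n",
    "    path = os.path.join(tmpdir, 'chunk_demo.h5')  # hypothetical file name\n",
    "    with tables.open_file(path, mode='w') as h5file:\n",
    "        # A CArray is the Pytables chunked array container that supports compression\n",
    "        array = h5file.create_carray(h5file.root, 'demo', obj=np.zeros((100, 100, 100)),\n",
    "                                     filters=tables.Filters(complevel=1, complib='zlib'))\n",
    "        print('array shape:', array.shape)\n",
    "        print('chunk shape chosen by Pytables:', array.chunkshape)"
   ]
  },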
  {
   "cell_type": "markdown",
   "id": "fab0cf21",
   "metadata": {},
   "source": [
    "## II - Compressing your datasets with SampleData"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c63fb275",
   "metadata": {},
   "source": [
    "Within *Sampledata*, data compression can be applied to:\n",
    "\n",
    "* data arrays\n",
    "* structured data arrays\n",
    "* field data arrays\n",
    "\n",
    "There are two ways to control the compression settings of your *SampleData* data arrays:\n",
    "\n",
    "1. Providing compression settings to data item creation methods\n",
    "2. Using the `set_chunkshape_and_compression` and `set_nodes_compression_chunkshape` methods"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50e22a8e",
   "metadata": {},
   "source": [
    "### The compression options dictionary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31c961fd",
   "metadata": {},
   "source": [
    "In both cases, you will have to pass the various settings of the compression filter you want to apply to your data to the appropriate *SampleData* method. All of these methods accept for that purpose a `compression_options` argument, which must be a dictionary. Its keys and associated values can be chosen among the ones listed in the **Available filter settings** subsection above."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c448bda1",
   "metadata": {},
   "source": [
    "### Compress already existing data arrays"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8258df17",
   "metadata": {},
   "source": [
    "We will start by looking at how we can change compression settings of already existing data in *SampleData* datasets.\n",
    "For that, we will use a material science dataset that is part of the *Pymicro* example datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bfa7d5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pymicro import get_examples_data_dir # import file directory path\n",
    "PYMICRO_EXAMPLES_DATA_DIR = get_examples_data_dir() # get file directory path\n",
    "dataset_file = os.path.join(PYMICRO_EXAMPLES_DATA_DIR, 'example_microstructure') # test dataset file path\n",
    "tar_file = os.path.join(PYMICRO_EXAMPLES_DATA_DIR, 'example_microstructure.tar.gz') # dataset archive path"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "061eb446",
   "metadata": {},
   "source": [
    "This file is zipped in the package to reduce its size. We will have to unzip it to use it and learn how to reduce its size with the *SampleData* methods. If you are just reading the documentation and not executing it, you may just skip this cell and the next one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b6aefa2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save current directory\n",
    "cwd = os.getcwd()\n",
    "# move to example data directory\n",
    "os.chdir(PYMICRO_EXAMPLES_DATA_DIR)\n",
    "# unarchive the dataset\n",
    "os.system(f'tar -xvf {tar_file}')\n",
    "# get back to UserGuide directory\n",
    "os.chdir(cwd)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "329cf696",
   "metadata": {},
   "source": [
    "#### Dataset presentation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3ee3aa9",
   "metadata": {},
   "source": [
    "In this tutorial, we will work on a copy of this dataset, to leave the original data unaltered. \n",
    "We will start by creating an autodeleting copy of the file, and print its content to discover its content. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "573e1fb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import SampleData class\n",
    "from pymicro.core.samples import SampleData as SD\n",
    "# import Numpy\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98f24ea2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a copy of the existing dataset\n",
    "data = SD.copy_sample(src_sample_file=dataset_file, dst_sample_file='Test_compression', autodelete=True,\n",
    "                      get_object=True, overwrite=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d25c8af",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(data)\n",
    "data.get_file_disk_size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4b47944",
   "metadata": {},
   "source": [
    "As you can see, this dataset already contains a rich content. It is a digital twin of a real polycristalline microstructure of a grade 2 Titanium sample, gathering both experimental and numerical data obtained through Diffraction Contrast Tomography imaging, and FFT-based mechanical simulation. \n",
    "\n",
    "This dataset has actually been constructed using the `Microstructure` class of the pymicro package, which is based on the *SampleData* class. The link between these classes will be discussed in the next tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8dc062bc",
   "metadata": {},
   "source": [
    "This dataset contains only uncompressed data. We will try to reduce its size by using various compression methods on the large data items that it contains. You can see that most of them are stored in the *3DImage Group* `CellData`. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3c95962",
   "metadata": {},
   "source": [
    "### Apply compression settings for a specific array"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cc5883e",
   "metadata": {},
   "source": [
    "We will start by compressing the `grain_map` *Field data array* of the `CellData` image. Let us look more closely on this data item:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "823b9ea2",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_node_info('grain_map')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d610cd0",
   "metadata": {},
   "source": [
    "We can see above that this data item is not compressed (`complevel=0`), and has a disk size of almost 2 Mb.\n",
    "\n",
    "To apply a set of compression settings to this data item, you need to:\n",
    "\n",
    "1. create a dictionary specifying the compression settings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a769a5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "compression_options = {'complib':'zlib', 'complevel':1}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa8edc6c",
   "metadata": {},
   "source": [
    "2. use the *SampleData* `set_chunkshape_and_compression` method with the dictionary and the name of the data item as arguments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55e7bad9",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
    "data.get_node_disk_size('grain_map')\n",
    "data.print_node_compression_info('grain_map')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3ceebce",
   "metadata": {},
   "source": [
    "As you can see, the storage size of the data item has been greatly reduced, by more than 10 times (126 Kb vs 1.945 Mb), using this compression settings. Let us see what will change if we use different settings :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f5f5597",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# No `shuffle` option:\n",
    "print('\\nUsing the shuffle option, with the zlib compressor and a compression level of 1:')\n",
    "compression_options = {'complib':'zlib', 'complevel':1, 'shuffle':True}\n",
    "data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
    "data.get_node_disk_size('grain_map')\n",
    "\n",
    "# No `shuffle` option:\n",
    "print('\\nUsing no shuffle option, with the zlib compressor and a compression level of 9:')\n",
    "compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':False}\n",
    "data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
    "data.get_node_disk_size('grain_map')\n",
    "\n",
    "# No `shuffle` option:\n",
    "print('\\nUsing the shuffle option, with the lzo compressor and a compression level of 1:')\n",
    "compression_options = {'complib':'lzo', 'complevel':1, 'shuffle':True}\n",
    "data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
    "data.get_node_disk_size('grain_map')\n",
    "\n",
    "# No `shuffle` option:\n",
    "print('\\nUsing no shuffle option, with the lzo compressor and a compression level of 1:')\n",
    "compression_options = {'complib':'lzo', 'complevel':1, 'shuffle':False}\n",
    "data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
    "data.get_node_disk_size('grain_map')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4f0e86f",
   "metadata": {},
   "source": [
    "As you may observe, is significantly affected by the choice of the compression level. The higher the compression level, the higher the compression ratio, but also the lower the I/O speed. On the other hand, you can also remark that, in the present case, using the `shuffle` filter deteriorates the compression ratio.\n",
    "\n",
    "Let us try to with another data item:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73633419",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_node_info('Amitex_stress_1')\n",
    "\n",
    "# No `shuffle` option:\n",
    "print('\\nUsing the shuffle option, with the zlib compressor and a compression level of 1:')\n",
    "compression_options = {'complib':'zlib', 'complevel':1, 'shuffle':True}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "\n",
    "# No `shuffle` option:\n",
    "print('\\nUsing no shuffle option, with the zlib compressor and a compression level of 1:')\n",
    "compression_options = {'complib':'zlib', 'complevel':1, 'shuffle':False}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
    "data.get_node_disk_size('Amitex_stress_1')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f35a417",
   "metadata": {},
   "source": [
    "On the opposite, for this second array, the shuffle filter improves significantly the compression ratio. However, in this case, you can see that the compression ratio achieved is much lower than for the `grain_map` array. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f39e182",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "**Warning 1** \n",
    "    \n",
    "The efficiency of compression algorithms in terms of compression ratio is strongly affected by the data itself (variety, value and position of the stored values in the array). Compression filters will not have the same behavior with all data arrays, as you have observed just above. Be aware of this fact, and do not hesitate to conduct tests to find the best settings for you datasets !\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1418cbe7",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "**Warning 2** \n",
    "    \n",
    "Whenever you change the compression or chunkshape settings of your datasets, the data item is re-created into the *SampleData* dataset, which may be costly in computational time. Be careful if you are dealing with very large data arrays and want to try out several settings to find the best I/O speed / compression ratio compromise, with the `set_chunkshape_and_compression` method. You may want to try on a subset of your large array to speed up the process.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de49744e",
   "metadata": {},
   "source": [
    "### Apply same compression settings for a serie of nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e258bb2",
   "metadata": {},
   "source": [
    "If you need to apply the same compression settings to a list of data items, you may use the `set_nodes_compression_chunkshape`. This method works exactly like `set_chunkshape_and_compression`, but take a list of nodenames as arguments instead of just one. The inputted compression settings are then applied to all the nodes in the list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58fcec3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print current size of disks and their compression settings\n",
    "data.get_node_disk_size('grain_map_raw')\n",
    "data.print_node_compression_info('grain_map_raw')\n",
    "data.get_node_disk_size('uncertainty_map')\n",
    "data.print_node_compression_info('uncertainty_map')\n",
    "data.get_node_disk_size('mask')\n",
    "data.print_node_compression_info('mask')\n",
    "\n",
    "# Compress datasets\n",
    "compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':True}\n",
    "data.set_nodes_compression_chunkshape(node_list=['grain_map_raw', 'uncertainty_map', 'mask'], \n",
    "                                      compression_options=compression_options)\n",
    "\n",
    "# Print new size of disks and their compression settings\n",
    "data.get_node_disk_size('grain_map_raw')\n",
    "data.print_node_compression_info('grain_map_raw')\n",
    "data.get_node_disk_size('uncertainty_map')\n",
    "data.print_node_compression_info('uncertainty_map')\n",
    "data.get_node_disk_size('mask')\n",
    "data.print_node_compression_info('mask')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84c35fcf",
   "metadata": {},
   "source": [
    "### Lossy compression and data normalization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa0decf0",
   "metadata": {},
   "source": [
    "The compression filters used above preserve exactly the original values of the stored data. However, it is also possible with specific filters **a lossy compression**, which remove non relevant part of the data. As a result, data compression ratio is usually strongly increased, at the cost that stored data is no longer exactly equal to the inputed data array."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d866cf70",
   "metadata": {},
   "source": [
    "One of the most important feature of data array that increase their compressibility, is the presence of patterns in the data. If a value or a serie of values is repeated multiple times throughout the data array, data compression can be very efficient (the pattern can be stored only once). \n",
    "\n",
    "Numerical simulation and measurement tools usually output data in a standard simple or double precision floating point numbers, yiedling data arrays with values that have a lot of digits. Typically, these values are all different, and hence these arrays cannot be efficiently compressed."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2671816f",
   "metadata": {},
   "source": [
    "The `Amitex_stress_1` and `Amitex_strain_1` data array are two tensor fields outputed by a continuum mechanics FFT-based solver, and typical fall into this category: they have almost no equal value or clear data pattern. \n",
    "\n",
    "As you can see above, the best achieved compression ratio is 60% while for the dataset `grain_map`, the compression is way more efficient, with a best ratio that climbs up to 97% (62 Kb with *zlib* compressor and compression level of 9, versus an initial data array of 1.945 Mb). This is due to the nature of the `grain_map` data array, which is a tridimensional map of grains identification number in microstructure of the Titanium sample represented by the dataset. It is hence an array containing a few integer values that are repeated many times.\n",
    "\n",
    "Let us analyze these two data arrays values to illustrate this difference:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcedfd42",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "print(f\"Data array `grain_map` has {data['grain_map'].size} elements,\"\n",
    "      f\"and {np.unique(data['grain_map']).size} different values.\\n\")\n",
    "print(f\"Data array `Amitex_stress_1` has {data['Amitex_stress_1'].size} elements,\"\n",
    "      f\"and {np.unique(data['Amitex_stress_1']).size} different values.\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83ecdf8c",
   "metadata": {},
   "source": [
    "#### Lossy compression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20c2e1fc",
   "metadata": {},
   "source": [
    "Usually, the relevant precision of data is only of a few digits, so that many values of the array should be considered equal. The idea of lossy compression is to truncate values up to a desired precision, which  increases the number of equal values in a dataset and hence increases its compressibility. "
   ]
  },
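  {
   "cell_type": "markdown",
   "id": "5b28e0f6",
   "metadata": {},
   "source": [
    "The effect of truncation on the number of distinct values can be sketched with plain NumPy, using the quantization formula quoted in the **Available filter settings** subsection (`around(scale*data)/scale`, with `scale = 2^bits`). The `quantize` helper and the synthetic random array below are hypothetical and only serve this illustration; the helper is not the *Pytables* implementation and only handles positive numbers of digits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e61c9d3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def quantize(values, least_significant_digit):\n",
    "    # Number of bits needed to retain a precision of 10**(-least_significant_digit)\n",
    "    # (illustrative helper, valid for positive integer numbers of digits)\n",
    "    bits = int(np.ceil(np.log2(10.0 ** least_significant_digit)))\n",
    "    scale = 2.0 ** bits\n",
    "    return np.around(scale * values) / scale\n",
    "\n",
    "# Synthetic floating point data: almost all values are distinct\n",
    "rng = np.random.default_rng(0)\n",
    "values = rng.normal(size=100000)\n",
    "print(f'original array: {np.unique(values).size} different values')\n",
    "\n",
    "# Fewer retained digits means fewer distinct values, at the cost of a larger error\n",
    "for digits in (3, 2, 1):\n",
    "    quantized = quantize(values, digits)\n",
    "    max_error = np.abs(quantized - values).max()\n",
    "    print(f'least_significant_digit={digits}: {np.unique(quantized).size} different values, '\n",
    "          f'max absolute error {max_error:.1e}')"
   ]
  },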
  {
   "cell_type": "markdown",
   "id": "457a7375",
   "metadata": {},
   "source": [
    "Lossy compression can be applied to floating point data arrays in *SampleData* datasets using the `least_significant_digit` compression setting. If you set the value of this option to $N$, the data will be truncated after the $N^{th}$ siginificant digit after the decimal point. Let us see an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a1b439b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will store a value of an array to verify how it evolves after compression\n",
    "original_value = data['Amitex_stress_1'][20,20,20]\n",
    "\n",
    "# Apply lossy compression\n",
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "# Set up compression settings with lossy compression: truncate after third digit adter decimal point\n",
    "compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':True,  'least_significant_digit':3}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "\n",
    "# Get same value after lossy compression\n",
    "new_value = data['Amitex_stress_1'][20,20,20]\n",
    "\n",
    "print(f'Original array value: {original_value} \\n'\n",
    "      f'Array value after lossy compression: {new_value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d72112f",
   "metadata": {},
   "source": [
    "As you may observe, the compression ratio has been improved, and the retrieved values after lossy compression are effectively equal to the original array up to the third digit after the decimal point. \n",
    "\n",
    "We will now try to increase the compression ratio by reducing the number of conserved digits to 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccdafcb3",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Set up compression settings with lossy compression: truncate after third digit adter decimal point\n",
    "compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':True,  'least_significant_digit':2}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "\n",
    "# Get same value after lossy compression\n",
    "new_value = data['Amitex_stress_1'][20,20,20]\n",
    "\n",
    "print(f'Original array value: {original_value} \\n'\n",
    "      f'Array value after lossy compression 2 digits: {new_value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b491c6f",
   "metadata": {},
   "source": [
    "As you can see, the compression ratio has again been improved, now close to 75%. Know, you know how to do to choose the best compromise between lost precision and compression ratio. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53859c21",
   "metadata": {},
   "source": [
    "#### Normalization to improve compression ratio"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4f20a34",
   "metadata": {},
   "source": [
    "If you look more closely to the `Amitex_stress_1` array values, you can observe that the value of this array have been outputed within a certain scale of values, which in particular impact the number of significant digits that come before the decimal point. Sometimes precision of the data would require less significant digits than its scale of representation.\n",
    "\n",
    "In that case, storing the complete data array at its original scale is not necessary, and very inefficient in terms of data size. To optimize storage of such datasets, one can normalize them to a form with very few digits before the decimal point (1 or 2), and stored separately their scale to be able to revert the normalization operation when retrieiving data.\n",
    "\n",
    "This allows to reduce the total number of significant digits of the data, and hence further improve the achievable compression ratio with lossy compression. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c6d2aac",
   "metadata": {},
   "source": [
    "The *SampleData* class allows you to aplly automatically this operation when applying compression settings to your dataset. All you have to do is add to the `compression_option` dictionary the key `normalization` with one of its possible values. \n",
    "\n",
    "To try it, we will close (and delete) our test dataset and recopy the original file, to apply normalization and lossy compression on the original raw data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e5cf73b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# removing dataset to recreate a copy\n",
    "del data\n",
    "# creating a copy of the dataset to try out lossy compression methods\n",
    "data = SD.copy_sample(src_sample_file=dataset_file, dst_sample_file='Test_compression', autodelete=True,\n",
    "                      get_object=True, overwrite=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06deb269",
   "metadata": {},
   "source": [
    "##### Standard Normalization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0396189f",
   "metadata": {},
   "source": [
    "The `standard` normalization setting will center and reduce the data of an array $X$ by storing a new array $Y$ that is:\n",
    "\n",
    "$Y = \\frac{X - \\bar{X}}{\\sigma(X)}$\n",
    "\n",
    "where $\\bar{X}$ and $\\sigma(X)$ are respectively the mean and the standard deviation of the data array $X$. \n",
    "\n",
    "This operation reduces the number of significant digits before the decimal point to 1 or 2 for the large majority of the data array values. After *standard normalization*, lossy compression will yield much higher compression ratios for data array that have a non normalized scale. \n",
    "\n",
    "The SampleData class ensures that when data array are retrieved, or visualized, the user gets or sees the original data, with the normalization reverted.  \n",
    "\n",
    "Let us try to apply it to our stress field `Amitex_stress_1`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca39ef58",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set up compression settings with lossy compression: truncate after third digit adter decimal point\n",
    "compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':True,  'least_significant_digit':2,\n",
    "                       'normalization':'standard'}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "\n",
    "# Get same value after lossy compression\n",
    "new_value = data['Amitex_stress_1'][20,20,20,:]\n",
    "\n",
    "# Get in memory value of the node\n",
    "memory_value = data.get_node('Amitex_stress_1', as_numpy=False)[20,20,20,:]\n",
    "\n",
    "print(f'Original array value: {original_value} \\n'\n",
    "      f'Array value after normalization and lossy compression 2 digits: {new_value}',\n",
    "      f'Value in memory: {memory_value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78ad4362",
   "metadata": {},
   "source": [
    "As you can see, the compression ratio has been strongly improved by this normalization operation, reaching 90%. \n",
    "When looking at the retrieved value after compression, you can see that depending on the field component that is observed, the relative precision loss varies. The third large component value error is less than 1%, which is consistent with the truncation to 2 significant digits. However, it is not the other components, that have smaller values by two or three orders of magnitude, and that are retrieved with larger errors.\n",
    "\n",
    "This is explained by the fact that the `standard` normalization option scales the array as a whole. As a result, if there are large differencies in the scale of different components of a vector or tensor field, the precision of the smaller components will be less preserved."
   ]
  },
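  {
   "cell_type": "markdown",
   "id": "b7e41c9a",
   "metadata": {},
   "source": [
    "A rough error estimate makes this explicit. It is only a sketch, under the assumption that truncating to $d$ digits (`least_significant_digit = d`) changes each normalized value by at most about $10^{-d}$. Writing $Y'$ for the normalized value after truncation (a notation introduced here for illustration), reverting the normalization with $X' = Y'*\\sigma(X) + \\bar{X}$ multiplies this error by the standard deviation of the whole array:\n",
    "\n",
    "$|X' - X| = \\sigma(X)\\,|Y' - Y| \\lesssim \\sigma(X)\\,10^{-d}$\n",
    "\n",
    "The absolute error is therefore roughly the same for every component, while the relative error on a value $x$ scales as $\\sigma(X)\\,10^{-d}/|x|$. For $d = 2$, a component whose values are comparable to $\\sigma(X)$ keeps a relative error of about 1%, whereas a component a hundred times smaller can lose most of its significant digits."
   ]
  },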
  {
   "cell_type": "markdown",
   "id": "4f4c89b2",
   "metadata": {},
   "source": [
    "##### Standard Normalization per components for vector/tensor fields"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbed3c8f",
   "metadata": {},
   "source": [
    "Another normalization option is available for *SampleData* field arrays, that allows to apply standard normalization individually to each component of a field in order to keep a constant relative precision for each component when applying lossy compression to the field data array.\n",
    "\n",
    "To use this option, you will need to set the `normalization` value to `standard_per_component`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e303ecf",
   "metadata": {},
   "outputs": [],
   "source": [
    "del data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "981227a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = SD.copy_sample(src_sample_file=dataset_file, dst_sample_file='Test_compression', autodelete=True,\n",
    "                      get_object=True, overwrite=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77a9ab3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set up compression settings with lossy compression: truncate after third digit adter decimal point\n",
    "compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':True,  'least_significant_digit':2,\n",
    "                       'normalization':'standard_per_component'}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "\n",
    "# Get same value after lossy compression\n",
    "new_value = data['Amitex_stress_1'][20,20,20,:]\n",
    "\n",
    "# Get in memory value of the node\n",
    "memory_value = data.get_node('Amitex_stress_1', as_numpy=False)[20,20,20,:]\n",
    "\n",
    "print(f'Original array value: {original_value} \\n'\n",
    "      f'Array value after normalization per component and lossy compression 2 digits: {new_value}\\n',\n",
    "      f'Value in memory: {memory_value}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "665242e5",
   "metadata": {},
   "source": [
    "As you can see, the error in the retrieved array is now less than 1% for each component of the field value. However, the cost was a reduced improvement of the compression ratio. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed8c9d68",
   "metadata": {},
   "source": [
    "##### Visualization of normalized data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91c60a05",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_xdmf()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7c9ca21",
   "metadata": {},
   "source": [
    "As you can see, the `Amitex_stress_1` *Attribute* node data in the dataset XDMF file is now provided by a `Function` item type, involving three data array with the original field shape. This function computes:\n",
    "\n",
    "$X' = Y*\\sigma(X) + \\bar{X}$\n",
    "\n",
    "where $*$ and $+$ are element-wise product and addition operators for multidimensional arrays. This operation allows to revert the component-wise normalization of data. The Paraview software is able to interpret this syntax of the XDMF format and hence, when visualizing data, you will see the values with the original scaling."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28d6c9d0",
   "metadata": {},
   "source": [
    "This operation required the creation of two large arrays in the dataset, that the store the mean and standard deviation of each component of the field, repeted for each spatial dimensions of the field data array. It is mandatory to allow visualization of the data with the right scaling in Paraview. However, as these array contain a very low amount of data ($2*N_c$: two times de number of components of the field), they can be very easily compressed and hence do not significantly affect the storage size of the data item, as you may see below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39095d35",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "data.get_node_disk_size('Amitex_stress_1_norm_std')\n",
    "data.get_node_disk_size('Amitex_stress_1_norm_mean')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbde9fc7",
   "metadata": {},
   "source": [
    "### Changing the chunksize of a node"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0084866a",
   "metadata": {},
   "source": [
    "#### Compressing all fields when adding Image or Mesh Groups"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37171e3c",
   "metadata": {},
   "source": [
    "Changing the chunksize of a data array with *SampleData* is very simple. You just have to pass as a *tuple* the news shape of the chunks you want for your data array, and pass it as an argument to the `set_chunkshape_and_compression` or `set_nodes_compression_chunkshape`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c181fe99",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_node_compression_info('Amitex_stress_1')\n",
    "data.get_node_disk_size('Amitex_stress_1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2820e9c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Change chunkshape of the array\n",
    "compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':True,  'least_significant_digit':2,\n",
    "                       'normalization':'standard_per_component'}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', chunkshape=(10,10,10,6),\n",
    "                                   compression_options=compression_options)\n",
    "data.get_node_disk_size('Amitex_stress_1')\n",
    "data.print_node_compression_info('Amitex_stress_1')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b09292e",
   "metadata": {},
   "source": [
    "As you can see, the chunkshape has been changed, which has also affected the memory size of the compressed data array. We have indeed reduced the number of chunks in the dataset, which reduces the number of data to store. This modification can also improve or deteriorate the I/O speed of access to your data array in the dataset. The reader is once again refered to dedicated documents to know more ion this matter: [here](https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-ImprovingIOPerformanceCompressedDatasets.pdf) and [here](https://www.pytables.org/usersguide/optimization.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f1ef5ea",
   "metadata": {},
   "source": [
    "### Compression data and setting chunkshape upon creation of data items"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41e4a29d",
   "metadata": {},
   "source": [
    "Until here we have only modified the compression settings of already existing data items. In this process, the data items are replaced by the new compressed version of the data, which is a costly operation. For this reason, if they are known in advance, it is best to apply the compression filters and appropriate chunkshape when creating the data item.\n",
    "\n",
    "If you have read through all the tutorials of this user guide, you should know all the method that allow to create data items in your datasets, like `add_data_array`, `add_field`, `ædd_mesh`... All of these methods accept the two arguments `chunkshape` and `compression_options`, that work exaclty as for the `set_chunkshape_and_compression` or `set_nodes_compression_chunkshape` methods. You hence use them to create your data items directly with the appropriate compression settings.\n",
    "\n",
    "Let us see an example. We will get an array from our dataset, and try to recreate it with a new name and some data compression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6fc1edb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# removing dataset to recreate a copy\n",
    "del data\n",
    "# creating a copy of the dataset to try out lossy compression methods\n",
    "data = SD.copy_sample(src_sample_file=dataset_file, dst_sample_file='Test_compression', autodelete=True,\n",
    "                      get_object=True, overwrite=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f350b10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# getting the `orientation_map` array\n",
    "array = data['Amitex_stress_1']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f2a6fe7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a new field for the CellData image group with the `orientation_map` array and add compression settings\n",
    "compression_options = {'complib':'zlib', 'complevel':1, 'shuffle':True,  'least_significant_digit':2,\n",
    "                       'normalization':'standard'}\n",
    "new_cshape = (10,10,10,3)\n",
    "\n",
    "# Add data array as field of the CellData Image Group\n",
    "data.add_field(gridname='CellData', fieldname='test_compression', indexname='testC', array=array,\n",
    "              chunkshape=new_cshape, compression_options=compression_options, replace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d89a24f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check size and settings of new field\n",
    "data.print_node_info('testC')\n",
    "data.get_node_disk_size('testC')\n",
    "data.print_node_compression_info('testC')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c5628eb",
   "metadata": {},
   "source": [
    "The node has been created with the desired chunkshape and compression filters."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31f0a430",
   "metadata": {},
   "source": [
    "### Repacking files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62f5f7db",
   "metadata": {},
   "source": [
    "We now recreate a new copy of the original dataset, and try to reduce the size of oll heavy data item, to reduce as much as possible the size of our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a995a75",
   "metadata": {},
   "outputs": [],
   "source": [
    "# removing dataset to recreate a copy\n",
    "del data\n",
    "# creating a copy of the dataset to try out lossy compression methods\n",
    "data = SD.copy_sample(src_sample_file=dataset_file, dst_sample_file='Test_compression', autodelete=True,\n",
    "                      get_object=True, overwrite=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5353403",
   "metadata": {},
   "outputs": [],
   "source": [
    "compression_options1 = {'complib':'zlib', 'complevel':9, 'shuffle':True,  'least_significant_digit':2,\n",
    "                       'normalization':'standard'}\n",
    "compression_options2 = {'complib':'zlib', 'complevel':9, 'shuffle':True}\n",
    "data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options1)\n",
    "data.set_nodes_compression_chunkshape(node_list=['grain_map', 'grain_map_raw','mask'],\n",
    "                                      compression_options=compression_options2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae23362a",
   "metadata": {},
   "source": [
    "Now that we have compressed a few of the items of our dataset, the disk size of its HDF5 file should have diminished. Let us check again the size of its data items, and of the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e92347b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.print_dataset_content(short=True)\n",
    "data.get_file_disk_size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6611d8ce",
   "metadata": {},
   "source": [
    "The file size has not changed, surprisingly, even if the large `Amitex_stress_1` array has been shrinked from almost 50 Mo to roughly 5 Mo. This is due to a specific feature of HDF5 files: they do not free up the memory space that they have used in the past. The memory space remains associated to the file, and is used in priority when new data is written into the dataset. \n",
    "\n",
    "After changing the compression settings of one or several nodes in your dataset, if that induced a reduction of your actual data memory size, and that you want your file to be smaller on disk. To retrieve the fried up memory spacae, you may **repack** your file (overwrite it with a copy of itself, that has just the size require to store all actual data).\n",
    "\n",
    "To do that, you may use the *SampleData* method `repack_h5file`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9996d533",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.repack_h5file()\n",
    "data.get_file_disk_size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "243b5a37",
   "metadata": {},
   "source": [
    "You see that repacking the file has allowed to free some memory space and reduced its size. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa8e7e6f",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "**Note** \n",
    "    \n",
    "Note that the size of the file is larger than the size of data items printed by `print_dataset_content`. This extra size is the memory size occupied by the data array storing *Element Tags* for the mesh `grains_mesh`. Element tags are not printed by the printing methods as they can be very numerous and pollute the lecture of the printed information.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4189ff7",
   "metadata": {},
   "source": [
    "Once again, you should repack your file at carefully chosen times, as is it a very costly operation for large datasets. The *SampleData* class constructor has an **autorepack** option. If it is set to `True`, the file is automatically repacked when closing the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20821759",
   "metadata": {},
   "source": [
    "We can now close our dataset, and remove the original unarchived file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4748842",
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove SampleData instance\n",
    "del data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bef0cd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.remove(dataset_file+'.h5')\n",
    "os.remove(dataset_file+'.xdmf')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env_dev",
   "language": "python",
   "name": "env_dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}