{
"cells": [
{
"cell_type": "markdown",
"id": "e7dbd6fe",
"metadata": {},
"source": [
"# Data Compression with Pymicro"
]
},
{
"cell_type": "markdown",
"id": "0945f582",
"metadata": {},
"source": [
"This short Notebook will introduce you to how to efficiently compress your data within SampleData datasets."
]
},
{
"cell_type": "markdown",
"id": "bc913479",
"metadata": {},
"source": [
"## I - Data compression with HDF5 and Pytables"
]
},
{
"cell_type": "markdown",
"id": "8698fcd3",
"metadata": {},
"source": [
"### HDF5 and data compression"
]
},
{
"cell_type": "markdown",
"id": "6d17c269",
"metadata": {},
"source": [
"HDF5 allows compression filters to be applied to datasets in a file to minimize the amount of space it consumes. These compression features allow to drastically improve the storage space required for you datasets, as well as the speed of I/O access to datasets, and can differ from one data item to another within the same HDF5 file. A detailed presentation of HDF5 compression possibilities is provided [here](https://support.hdfgroup.org/HDF5/faq/compression.html). \n",
"\n",
"The two main ingredients that control compression performances for HDF5 datasets are the compression filters used (which compression algorithm, with which parameters), and the using a chunked layout for the data. This two features are briefly developed hereafter."
]
},
{
"cell_type": "markdown",
"id": "2263a8d2",
"metadata": {},
"source": [
"### Pytables and data compression"
]
},
{
"cell_type": "markdown",
"id": "4ceaf1ad",
"metadata": {},
"source": [
"The application of compression filters to HDF5 files with the *SampleData* class is handled by the *Pytable* package, on which is built the *SampleData* HDF5 interface. *Pytables* implements a specific containers class, the `Filters` class, to gather the various settings of the compression filters to apply to the datasets in a HDF5 file.\n",
"\n",
"When using the *SampleData* class, you will have to specify this compression filter settings to class methods dedicated to data compression. These settings are the parameters of the *Pytables* `Filters` class. These settings and their possible values are detailed in the next subsection. "
]
},
{
"cell_type": "markdown",
"id": "5036668a",
"metadata": {},
"source": [
"#### Available filter settings"
]
},
{
"cell_type": "markdown",
"id": "098763e1",
"metadata": {},
"source": [
"The description given below of compression options available with *SampleData*/*Pytables* is exctracted from [the *Pytables* documentation of the *Filter* class](https://www.pytables.org/usersguide/libref/helper_classes.html#the-filters-class)."
]
},
{
"cell_type": "markdown",
"id": "9dc47fe0",
"metadata": {},
"source": [
"* **complevel** (int) – Specifies a compression level for data. The allowed range is 0-9. A value of 0 (the default) disables compression.\n",
"\n",
"* **complib** (str) – Specifies the compression library to be used. Right now, `zlib` (the default), `lzo`, `bzip2` and `blosc` are supported. Additional compressors for Blosc like `blosc:blosclz` (‘blosclz’ is the default in case the additional compressor is not specified), `blosc:lz4`, `blosc:lz4hc`, `blosc:snappy`, `blosc:zlib` and `blosc:zstd` are supported too. Specifying a compression library which is not available in the system issues a FiltersWarning and sets the library to the default one.\n",
"\n",
"* **shuffle** (bool) – Whether or not to use the Shuffle filter in the HDF5 library. This is normally used to improve the compression ratio. A false value disables shuffling and a true one enables it. The default value depends on whether compression is enabled or not; if compression is enabled, shuffling defaults to be enabled, else shuffling is disabled. Shuffling can only be used when compression is enabled.\n",
"\n",
"* **bitshuffle** (bool) – Whether or not to use the BitShuffle filter in the Blosc library. This is normally used to improve the compression ratio. A false value disables bitshuffling and a true one enables it. The default value is disabled.\n",
"\n",
"* **fletcher32** (bool) – Whether or not to use the Fletcher32 filter in the HDF5 library. This is used to add a checksum on each data chunk. A false value (the default) disables the checksum.\n",
"\n",
"* **least_significant_digit** (int) – If specified, data will be truncated (quantized). In conjunction with enabling compression, this produces ‘lossy’, but significantly more efficient compression. For example, if least_significant_digit=1, data will be quantized using around(scale*data)/scale, where scale = 2^bits, and bits is determined so that a precision of 0.1 is retained (in this case bits=4). Default is None, or no quantization."
]
},
{
"cell_type": "markdown",
"id": "7dc127a8",
"metadata": {},
"source": [
"### Chunked storage layout"
]
},
{
"cell_type": "markdown",
"id": "2c6edfe6",
"metadata": {},
"source": [
"Compressed data is stored in a data array of an HDF5 dataset using a chunked storage mechanism. When chunked storage is used, the data array is split into equally sized chunks each of which is stored separately in the file, as illustred on the diagram below. Compression is applied to each individual chunk. When an I/O operation is performed on a subset of the data array, only chunks that include data from the subset participate in I/O and need to be uncompressed or compressed.\n",
"\n",
"Chunking data allows to:\n",
"\n",
"* Generally improve, sometimes drastically, the I/O performance of datasets. This comes from the fact that the chunked layout removes the reading speed anisotropy for data array that depends along which dimension its elements are read (*i.e* the same number of disk access are required when reading data in rows or columns).\n",
"* Chunked storage also enables adding more data to a dataset without rewriting the whole dataset.\n",
"\n",
"
\n",
"\n",
"**By default, data arrays are stored with a chunked layout in SampleData datasets.** The size of chunks is the key parameter that controls the impact on I/O performances for chunked datasets. **The shape of chunks is computed automatically by the Pytable package, providing a value yielding generally good I/O performances.** if you need to go further in the I/O optimization, you may consult the *Pytables* [documentation page](https://www.pytables.org/usersguide/optimization.html) dedicated to compression optimization for I/O speed and storage space. In addition, it is highly recommended to read [this document](https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-ImprovingIOPerformanceCompressedDatasets.pdf) in order to be able to efficiently optimize I/O and storage performances for your chunked datasets. These performance issues will not be discussed in this tutorial."
]
},
{
"cell_type": "markdown",
"id": "fab0cf21",
"metadata": {},
"source": [
"## II - Compressing your datasets with SampleData"
]
},
{
"cell_type": "markdown",
"id": "c63fb275",
"metadata": {},
"source": [
"Within *Sampledata*, data compression can be applied to:\n",
"\n",
"* data arrays\n",
"* structured data arrays\n",
"* field data arrays\n",
"\n",
"There are two ways to control the compression settings of your *SampleData* data arrays:\n",
"\n",
"1. Providing compression settings to data item creation methods\n",
"2. Using the `set_chunkshape_and_compression` and `set_nodes_compression_chunkshape` methods"
]
},
{
"cell_type": "markdown",
"id": "50e22a8e",
"metadata": {},
"source": [
"### The compression options dictionary"
]
},
{
"cell_type": "markdown",
"id": "31c961fd",
"metadata": {},
"source": [
"In both cases, you will have to pass the various settings of the compression filter you want to apply to your data to the appropriate *SampleData* method. All of these methods accept for that purpose a `compression_options` argument, which must be a dictionary. Its keys and associated values can be chosen among the ones listed in the **Available filter settings** subsection above."
]
},
{
"cell_type": "markdown",
"id": "c448bda1",
"metadata": {},
"source": [
"### Compress already existing data arrays"
]
},
{
"cell_type": "markdown",
"id": "8258df17",
"metadata": {},
"source": [
"We will start by looking at how we can change compression settings of already existing data in *SampleData* datasets.\n",
"For that, we will use a material science dataset that is part of the *Pymicro* example datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5bfa7d5c",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pymicro import get_examples_data_dir # import file directory path\n",
"PYMICRO_EXAMPLES_DATA_DIR = get_examples_data_dir() # get file directory path\n",
"dataset_file = os.path.join(PYMICRO_EXAMPLES_DATA_DIR, 'example_microstructure') # test dataset file path\n",
"tar_file = os.path.join(PYMICRO_EXAMPLES_DATA_DIR, 'example_microstructure.tar.gz') # dataset archive path"
]
},
{
"cell_type": "markdown",
"id": "061eb446",
"metadata": {},
"source": [
"This file is zipped in the package to reduce its size. We will have to unzip it to use it and learn how to reduce its size with the *SampleData* methods. If you are just reading the documentation and not executing it, you may just skip this cell and the next one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b6aefa2",
"metadata": {},
"outputs": [],
"source": [
"# Save current directory\n",
"cwd = os.getcwd()\n",
"# move to example data directory\n",
"os.chdir(PYMICRO_EXAMPLES_DATA_DIR)\n",
"# unarchive the dataset\n",
"os.system(f'tar -xvf {tar_file}')\n",
"# get back to UserGuide directory\n",
"os.chdir(cwd)"
]
},
{
"cell_type": "markdown",
"id": "329cf696",
"metadata": {},
"source": [
"#### Dataset presentation"
]
},
{
"cell_type": "markdown",
"id": "e3ee3aa9",
"metadata": {},
"source": [
"In this tutorial, we will work on a copy of this dataset, to leave the original data unaltered. \n",
"We will start by creating an autodeleting copy of the file, and print its content to discover its content. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "573e1fb9",
"metadata": {},
"outputs": [],
"source": [
"# import SampleData class\n",
"from pymicro.core.samples import SampleData as SD\n",
"# import Numpy\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98f24ea2",
"metadata": {},
"outputs": [],
"source": [
"# Create a copy of the existing dataset\n",
"data = SD.copy_sample(src_sample_file=dataset_file, dst_sample_file='Test_compression', autodelete=True,\n",
" get_object=True, overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d25c8af",
"metadata": {},
"outputs": [],
"source": [
"print(data)\n",
"data.get_file_disk_size()"
]
},
{
"cell_type": "markdown",
"id": "e4b47944",
"metadata": {},
"source": [
"As you can see, this dataset already contains a rich content. It is a digital twin of a real polycristalline microstructure of a grade 2 Titanium sample, gathering both experimental and numerical data obtained through Diffraction Contrast Tomography imaging, and FFT-based mechanical simulation. \n",
"\n",
"This dataset has actually been constructed using the `Microstructure` class of the pymicro package, which is based on the *SampleData* class. The link between these classes will be discussed in the next tutorial."
]
},
{
"cell_type": "markdown",
"id": "8dc062bc",
"metadata": {},
"source": [
"This dataset contains only uncompressed data. We will try to reduce its size by using various compression methods on the large data items that it contains. You can see that most of them are stored in the *3DImage Group* `CellData`. "
]
},
{
"cell_type": "markdown",
"id": "c3c95962",
"metadata": {},
"source": [
"### Apply compression settings for a specific array"
]
},
{
"cell_type": "markdown",
"id": "3cc5883e",
"metadata": {},
"source": [
"We will start by compressing the `grain_map` *Field data array* of the `CellData` image. Let us look more closely on this data item:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "823b9ea2",
"metadata": {},
"outputs": [],
"source": [
"data.print_node_info('grain_map')"
]
},
{
"cell_type": "markdown",
"id": "9d610cd0",
"metadata": {},
"source": [
"We can see above that this data item is not compressed (`complevel=0`), and has a disk size of almost 2 Mb.\n",
"\n",
"To apply a set of compression settings to this data item, you need to:\n",
"\n",
"1. create a dictionary specifying the compression settings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a769a5b",
"metadata": {},
"outputs": [],
"source": [
"compression_options = {'complib':'zlib', 'complevel':1}"
]
},
{
"cell_type": "markdown",
"id": "fa8edc6c",
"metadata": {},
"source": [
"2. use the *SampleData* `set_chunkshape_and_compression` method with the dictionary and the name of the data item as arguments"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55e7bad9",
"metadata": {},
"outputs": [],
"source": [
"data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
"data.get_node_disk_size('grain_map')\n",
"data.print_node_compression_info('grain_map')"
]
},
{
"cell_type": "markdown",
"id": "e3ceebce",
"metadata": {},
"source": [
"As you can see, the storage size of the data item has been greatly reduced, by more than 10 times (126 Kb vs 1.945 Mb), using this compression settings. Let us see what will change if we use different settings :"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f5f5597",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# No `shuffle` option:\n",
"print('\\nUsing the shuffle option, with the zlib compressor and a compression level of 1:')\n",
"compression_options = {'complib':'zlib', 'complevel':1, 'shuffle':True}\n",
"data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
"data.get_node_disk_size('grain_map')\n",
"\n",
"# No `shuffle` option:\n",
"print('\\nUsing no shuffle option, with the zlib compressor and a compression level of 9:')\n",
"compression_options = {'complib':'zlib', 'complevel':9, 'shuffle':False}\n",
"data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
"data.get_node_disk_size('grain_map')\n",
"\n",
"# No `shuffle` option:\n",
"print('\\nUsing the shuffle option, with the lzo compressor and a compression level of 1:')\n",
"compression_options = {'complib':'lzo', 'complevel':1, 'shuffle':True}\n",
"data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
"data.get_node_disk_size('grain_map')\n",
"\n",
"# No `shuffle` option:\n",
"print('\\nUsing no shuffle option, with the lzo compressor and a compression level of 1:')\n",
"compression_options = {'complib':'lzo', 'complevel':1, 'shuffle':False}\n",
"data.set_chunkshape_and_compression(nodename='grain_map', compression_options=compression_options)\n",
"data.get_node_disk_size('grain_map')"
]
},
{
"cell_type": "markdown",
"id": "b4f0e86f",
"metadata": {},
"source": [
"As you may observe, is significantly affected by the choice of the compression level. The higher the compression level, the higher the compression ratio, but also the lower the I/O speed. On the other hand, you can also remark that, in the present case, using the `shuffle` filter deteriorates the compression ratio.\n",
"\n",
"Let us try to with another data item:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73633419",
"metadata": {},
"outputs": [],
"source": [
"data.print_node_info('Amitex_stress_1')\n",
"\n",
"# No `shuffle` option:\n",
"print('\\nUsing the shuffle option, with the zlib compressor and a compression level of 1:')\n",
"compression_options = {'complib':'zlib', 'complevel':1, 'shuffle':True}\n",
"data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
"data.get_node_disk_size('Amitex_stress_1')\n",
"\n",
"# No `shuffle` option:\n",
"print('\\nUsing no shuffle option, with the zlib compressor and a compression level of 1:')\n",
"compression_options = {'complib':'zlib', 'complevel':1, 'shuffle':False}\n",
"data.set_chunkshape_and_compression(nodename='Amitex_stress_1', compression_options=compression_options)\n",
"data.get_node_disk_size('Amitex_stress_1')"
]
},
{
"cell_type": "markdown",
"id": "6f35a417",
"metadata": {},
"source": [
"On the opposite, for this second array, the shuffle filter improves significantly the compression ratio. However, in this case, you can see that the compression ratio achieved is much lower than for the `grain_map` array. "
]
},
{
"cell_type": "markdown",
"id": "0f39e182",
"metadata": {},
"source": [
"