More: Speeding up loading by pre-parsing

Since bioimageloader is designed for computer vision ML/DL, it expects image arrays for both an image and its annotation. But datasets sometimes come with encoded annotations, or with annotations in formats other than image formats. By design, bioimageloader does not transform or modify the original source. As you may guess, however, decoding and parsing annotations to build image arrays on every access takes a while and easily becomes a bottleneck. The solution is simply to pre-parse them once and save the results.
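The "parse once, then reuse" idea can be sketched in a few lines of standard-library Python. This is only an illustration of the general pattern, not how bioimageloader implements it: the `<Region>`/`<Vertex>` XML layout, the function names `parse_polygons` and `load_polygons`, and the choice of a `.json` cache file are all assumptions made for the sketch.

```python
import json
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_polygons(xml_path: Path) -> list[list[tuple[float, float]]]:
    """Slow path: decode every <Region> into a list of (x, y) vertices.

    The <Region>/<Vertex X=... Y=...> layout is a made-up schema for this sketch.
    """
    root = ET.parse(xml_path).getroot()
    return [
        [(float(v.get("X")), float(v.get("Y"))) for v in region.iter("Vertex")]
        for region in root.iter("Region")
    ]


def load_polygons(xml_path: Path) -> list[list[tuple[float, float]]]:
    """Fast path: reuse a cached copy if present, else parse once and save."""
    cache_path = xml_path.with_suffix(".json")
    if cache_path.exists():
        # JSON stores tuples as lists, so convert back for a stable return type
        cached = json.loads(cache_path.read_text())
        return [[tuple(v) for v in poly] for poly in cached]
    polygons = parse_polygons(xml_path)
    cache_path.write_text(json.dumps(polygons))
    return polygons


# Demo on a temporary file so nothing is written next to real data
with tempfile.TemporaryDirectory() as tmp:
    xml_path = Path(tmp) / "sample.xml"
    xml_path.write_text(
        "<Annotations><Region>"
        '<Vertex X="0" Y="0"/><Vertex X="4" Y="0"/><Vertex X="4" Y="3"/>'
        "</Region></Annotations>"
    )
    first = load_polygons(xml_path)   # parses the XML and writes sample.json
    second = load_polygons(xml_path)  # reads sample.json, no XML parsing
    cache_written = xml_path.with_suffix(".json").exists()
```

The original `.xml` file is never modified; the cache is a separate file written beside it, so deleting the cache simply falls back to the slow path.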

Let’s see an example. We have the ComputationalPathology dataset, which comes with fully annotated instance masks. It is one of the high-quality datasets available for instance segmentation tasks. But its annotations are stored in .xml format and thus need a parsing step. Conveniently, you do not have to worry about how to parse them, because parsing is already implemented in bioimageloader. As mentioned, however, iterating over these masks and parsing them one by one is a huge bottleneck.

[1]:
from bioimageloader.collections import ComputationalPathology
[2]:
# `mask_tif` is specific to the ComputationalPathology dataset
compath = ComputationalPathology(
    '../../Data/ComputationalPathology',
    mask_tif=False  # the default: parse .xml annotations on every access
)
print(compath, len(compath))
ComPath 30
[3]:
%%timeit
for data in compath:
    ...
16.1 s ± 41.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[4]:
# Annotations are stored in .xml format
compath.anno_dict[0]
[4]:
PosixPath('../../Data/ComputationalPathology/Annotations/TCGA-18-5592-01Z-00-DX1.xml')

The save_xml_to_tif() method below is specific to ComputationalPathology. What it does is clear from its name. Let’s print out its documentation.

[5]:
compath.save_xml_to_tif?
Signature: compath.save_xml_to_tif()
Docstring:
Parse .xml to mask and write it as tiff file

Having masks in images is much faster than parsing .xml for each call.
This func iterates through ``anno_dict``, parse and save each in .tif
format in the same annotation directory. Re-initiate an instance with
``mask_tif`` argument to load them.
File:      ~/workspace/bioimageloader/bioimageloader/collections/_compath.py
Type:      method

Let’s execute it.

[6]:
compath.save_xml_to_tif()
[0/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-18-5592-01Z-00-DX1.tif'
[1/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-21-5784-01Z-00-DX1.tif'
[2/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-21-5786-01Z-00-DX1.tif'
[3/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-38-6178-01Z-00-DX1.tif'
[4/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-49-4488-01Z-00-DX1.tif'
[5/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-50-5931-01Z-00-DX1.tif'
[6/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-A7-A13E-01Z-00-DX1.tif'
[7/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-A7-A13F-01Z-00-DX1.tif'
[8/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-AR-A1AK-01Z-00-DX1.tif'
[9/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-AR-A1AS-01Z-00-DX1.tif'
[10/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-AY-A8YK-01A-01-TS1.tif'
[11/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-B0-5698-01Z-00-DX1.tif'
[12/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-B0-5710-01Z-00-DX1.tif'
[13/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-B0-5711-01Z-00-DX1.tif'
[14/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-CH-5767-01Z-00-DX1.tif'
[15/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-DK-A2I6-01A-01-TS1.tif'
[16/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-E2-A14V-01Z-00-DX1.tif'
[17/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-E2-A1B5-01Z-00-DX1.tif'
[18/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-G2-A2EK-01A-02-TSB.tif'
[19/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-G9-6336-01Z-00-DX1.tif'
[20/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-G9-6348-01Z-00-DX1.tif'
[21/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-G9-6356-01Z-00-DX1.tif'
[22/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-G9-6362-01Z-00-DX1.tif'
[23/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-G9-6363-01Z-00-DX1.tif'
[24/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-HE-7128-01Z-00-DX1.tif'
[25/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-HE-7129-01Z-00-DX1.tif'
[26/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-HE-7130-01Z-00-DX1.tif'
[27/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-KB-A93J-01A-01-TS1.tif'
[28/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-NH-A8F7-01A-01-TS1.tif'
[29/29] Wrote '../../Data/ComputationalPathology/Annotations/TCGA-RD-A8N9-01A-01-TS1.tif'
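As the output above shows, each .tif is written next to its .xml with the same file stem. A quick sanity check that every annotation has been converted could look like the sketch below. The helper name `missing_tifs` is hypothetical and not part of bioimageloader; it relies only on that same-stem, same-directory naming.

```python
import tempfile
from pathlib import Path


def missing_tifs(anno_dir: Path) -> list[Path]:
    """Return .xml annotation files that have no same-named .tif beside them."""
    return sorted(
        xml for xml in anno_dir.glob("*.xml")
        if not xml.with_suffix(".tif").exists()
    )


# Demo on a throwaway directory; point `anno_dir` at the real
# Annotations folder to check an actual dataset.
with tempfile.TemporaryDirectory() as tmp:
    anno_dir = Path(tmp)
    (anno_dir / "a.xml").touch()
    (anno_dir / "a.tif").touch()
    (anno_dir / "b.xml").touch()  # no b.tif yet
    todo = [p.name for p in missing_tifs(anno_dir)]
    print(todo)
```

An empty list means every .xml file already has a pre-parsed counterpart.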

We will re-initialize an instance with mask_tif=True to load pre-parsed masks in .tif format.

[7]:
compath_tif = ComputationalPathology(
    '../../Data/ComputationalPathology',
    mask_tif=True
)
[8]:
%%timeit
for data in compath_tif:
    ...
1.21 s ± 7.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The iteration that took 16.1 seconds now takes 1.21 seconds, roughly a 13x speedup!
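The %%timeit magic is only available in IPython and Jupyter. In a plain Python script, a comparable measurement can be sketched with `time.perf_counter`. The helpers `time_iteration` and `speedup` are hypothetical names for this sketch, and unlike %%timeit, which reports mean ± standard deviation, this version reports the best of a few runs.

```python
import time
from typing import Iterable


def time_iteration(dataset: Iterable, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds for one full pass over `dataset`.

    `dataset` must be re-iterable (a list, a range, or a dataset object),
    not a one-shot generator.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _item in dataset:
            pass
        best = min(best, time.perf_counter() - start)
    return best


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the slower timing to the faster one."""
    return before_s / after_s


# Stand-in iterable; pass `compath` or `compath_tif` here instead
elapsed = time_iteration(range(1000))

# Ratio of the two timings reported in this notebook
ratio = speedup(16.1, 1.21)
print(f"{ratio:.1f}x")
```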