ML: Data statistics with PCA#

Once you have chosen datasets, it is time to load and inspect them. This notebook demonstrates the steps for a simple machine learning (ML) analysis, in this case principal component analysis (PCA).

We are going to use additional packages: scikit-learn for PCA, plotly for drawing graphs, and pandas for tabular data.

Install the additional dependencies. This notebook was written in JupyterLab; if you are in Jupyter Notebook or another environment, plotly may not render. If you have trouble, follow the instructions at https://plotly.com/python/getting-started/#installation.

Install dependencies and imports#

[1]:
# !pip install scikit-learn
# !pip install pandas
# !pip install plotly
# !pip install "jupyterlab>=3" "ipywidgets>=7.6"
[2]:
# # or use conda
# !conda install scikit-learn
# !conda install pandas
# !conda install -c plotly plotly
# !conda install "jupyterlab>=3" "ipywidgets>=7.6"
[3]:
import albumentations as A
import numpy as np
import pandas as pd
import plotly.express as px
from plotly import graph_objects as go
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

from bioimageloader import Config, ConcatDataset, BatchDataloader
from bioimageloader.utils import random_split_dataset
from bioimageloader.plots import cycle_colors, to_hex_color

Load collections#

Start by loading collections using bioimageloader. You have two options.

  1. If you have a config file (e.g. cfg.yaml)

Here I am loading all available collections.

[4]:
cfg = Config('../configs/all_collections_image.yml')
print(len(cfg), cfg.keys())
22 dict_keys(['DSB2018', 'TNBC', 'ComputationalPathology', 'S_BSST265', 'MurphyLab', 'FRUNet', 'BBBC006', 'BBBC007', 'BBBC008', 'BBBC018', 'BBBC020', 'BBBC039', 'DigitalPathology', 'UCSB', 'BBBC002', 'BBBC013', 'BBBC014', 'BBBC015', 'BBBC016', 'BBBC026', 'BBBC021', 'BBBC041'])
  2. Otherwise, make a dictionary with the necessary arguments for each dataset

Note that we only need output='image'. Some collections can output 'both' or 'mask', and if they do, the default is to load 'both', so we need to force output='image'.

[5]:
# _cfg = {
#     'BBBC007': {
#         'root_dir': '../Data/bbbc/007',
#         'output': 'image',
#     },
#     'BBBC008': {
#         'root_dir': '../Data/bbbc/008',
#         'output': 'image',
#     },
#     'BBBC013': {
#         'root_dir': '../Data/bbbc/013',
#     },
# }
# cfg = Config.from_dict(_cfg)

Resize all images to the same size so that they are easy to handle.

[6]:
resize = A.Resize(256, 256)
datasets = cfg.load_datasets(transforms=resize)
lengths = [len(dset) for dset in datasets]
print(sum(lengths), lengths)
4965 [670, 50, 30, 79, 100, 72, 768, 16, 12, 56, 25, 150, 141, 26, 50, 96, 96, 144, 72, 864, 240, 1208]
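
If you need more preprocessing than a bare resize, albumentations transforms compose cleanly, and an A.Compose object is called the same way as a single transform. A minimal sketch, not used in the rest of this notebook; whether load_datasets accepts a Compose is an assumption to verify against your bioimageloader version.

[ ]:
# Optional variant (not used below): compose several transforms.
# A.Compose is callable like a single transform, so it should drop in
# for `resize` above; verify with your bioimageloader version.
pipeline = A.Compose([
    A.Resize(256, 256),
    A.ToGray(p=1.0),  # e.g. collapse RGB to 3-channel grayscale
])
datasets_gray = cfg.load_datasets(transforms=pipeline)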
[7]:
for dset in datasets:
    print(f'{dset.acronym:10s}: {len(dset):10d}')
DSB2018   :        670
TNBC      :         50
ComPath   :         30
S_BSST265 :         79
MurphyLab :        100
FRUNet    :         72
BBBC006   :        768
BBBC007   :         16
BBBC008   :         12
BBBC018   :         56
BBBC020   :         25
BBBC039   :        150
DigitPath :        141
UCSB      :         26
BBBC002   :         50
BBBC013   :         96
BBBC014   :         96
BBBC015   :        144
BBBC016   :         72
BBBC026   :        864
BBBC021   :        240
BBBC041   :       1208

hist(datasets)#

Plot histogram

[8]:
# Choose your color map
# https://plotly.com/python/discrete-color/#color-sequences-in-plotly-express
colors = cycle_colors(px.colors.qualitative.Alphabet, len(datasets))
[9]:
bar_df = pd.DataFrame({
    'acronym': [dset.acronym for dset in datasets],
    'length': [len(dset) for dset in datasets],
    'color': colors,
})
bar_df.head()
[9]:
acronym length color
0 DSB2018 670 #AA0DFE
1 TNBC 50 #3283FE
2 ComPath 30 #85660D
3 S_BSST265 79 #782AB6
4 MurphyLab 100 #565656

Add percentage info

[10]:
bar_df['perc'] = bar_df.length / bar_df.length.sum()
bar_df.head()
[10]:
acronym length color perc
0 DSB2018 670 #AA0DFE 0.134945
1 TNBC 50 #3283FE 0.010070
2 ComPath 30 #85660D 0.006042
3 S_BSST265 79 #782AB6 0.015911
4 MurphyLab 100 #565656 0.020141
[11]:
# Sort by length in descending order
bar_df = bar_df.sort_values('length', ascending=False)
bar_df.head()
[11]:
acronym length color perc
21 BBBC041 1208 #C075A6 0.243303
19 BBBC026 864 #2ED9FF 0.174018
6 BBBC006 768 #16FF32 0.154683
0 DSB2018 670 #AA0DFE 0.134945
20 BBBC021 240 #B10DA1 0.048338
[12]:
HOVERTEMPLATE = '%{x}, %{y}, %{customdata:.2f}%<extra></extra>'

fig = go.Figure()

fig.add_trace(
    go.Bar(x=bar_df.acronym,
           y=bar_df.length,
           marker_color=bar_df.color,
           customdata=bar_df.perc,
           hovertemplate=HOVERTEMPLATE)
)
fig.update_layout(
    title_text='Collections (#collections: {}, #images: {})'.format(
        len(bar_df), bar_df.length.sum())
)
fig.show()
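
If you care about each collection's share rather than the raw counts, plotly express can draw the same table as a pie chart in one call; a minimal sketch reusing bar_df:

[ ]:
# Sketch: the same data shown as shares of the whole
fig = px.pie(bar_df, names='acronym', values='length',
             title='Share of images per collection')
fig.show()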

PCA(datasets)#

Using BatchDataloader won’t gain much in this example, since the only transformation is Resize, which is a cheap operation. Still, it is good practice to benefit from multiprocessing.
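
For heavier pipelines you can spread the loading over several processes. The sketch below is hypothetical: whether BatchDataloader exposes a num_workers argument is an assumption, so check help(BatchDataloader) in your installed version.

[ ]:
# Hypothetical: parallel batch loading. `num_workers` is an assumed
# keyword; confirm it exists in your bioimageloader version.
# loader = BatchDataloader(dataset, batch_size=32, num_workers=4)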

Sample#

Loading all images at once would require a lot of memory, so we will sample 10% (with a minimum of 5 images) from each dataset.

[13]:
SAMPLE_RATE = 0.1
MINIMUM_N_IMAGES = 5

datasets_sampled = []
for i in range(len(datasets)):
    n = int(SAMPLE_RATE * len(datasets[i]))
    n = max(n, MINIMUM_N_IMAGES)  # at least 5 images per dataset
    sampled, _ = random_split_dataset(datasets[i], [n, len(datasets[i])-n])
    print(f'{datasets[i].acronym:10s}: {len(datasets[i]):>10d} => {len(sampled):>10d}')
    datasets_sampled.append(sampled)
DSB2018   :        670 =>         67
TNBC      :         50 =>          5
ComPath   :         30 =>          5
S_BSST265 :         79 =>          7
MurphyLab :        100 =>         10
FRUNet    :         72 =>          7
BBBC006   :        768 =>         76
BBBC007   :         16 =>          5
BBBC008   :         12 =>          5
BBBC018   :         56 =>          5
BBBC020   :         25 =>          5
BBBC039   :        150 =>         15
DigitPath :        141 =>         14
UCSB      :         26 =>          5
BBBC002   :         50 =>          5
BBBC013   :         96 =>          9
BBBC014   :         96 =>          9
BBBC015   :        144 =>         14
BBBC016   :         72 =>          7
BBBC026   :        864 =>         86
BBBC021   :        240 =>         24
BBBC041   :       1208 =>        120

Concatenate them and load the images in batches.

[14]:
BATCH_SIZE = 32

cat_sampled = ConcatDataset(datasets_sampled)
callcat_sampled = BatchDataloader(cat_sampled, batch_size=BATCH_SIZE)
[15]:
len(cat_sampled), len(callcat_sampled), callcat_sampled.batch_size
[15]:
(505, 16, 32)

Prep for PCA#

Flatten all image arrays and stack them so that the resulting array has shape (n_samples, n_features).

[16]:
enum_callcat_sampled = enumerate(callcat_sampled)
[17]:
features = []
for i, meow in enum_callcat_sampled:
    print(i, end=' ')
    for img in meow['image']:
        features.append(img.ravel())
features = np.asarray(features)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[18]:
features.shape
[18]:
(505, 196608)
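
PCA is driven by variance, so collections with wider intensity ranges will dominate the leading components. If that is a concern, you could standardize each feature first; a minimal scikit-learn sketch, not applied in the rest of this notebook:

[ ]:
from sklearn.preprocessing import StandardScaler

# Optional variant (not used below): zero-mean, unit-variance features
features_scaled = StandardScaler().fit_transform(features)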

Perform PCA#

[19]:
N_COMPONENTS = 10

pca = PCA(n_components=N_COMPONENTS)
pca.fit(features)
[19]:
PCA(n_components=10)
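
If you would rather fit on every image instead of the 10% sample, scikit-learn's IncrementalPCA consumes data batch by batch, which pairs naturally with BatchDataloader. A minimal sketch, not used below; note that each batch must contain at least n_components images.

[ ]:
from sklearn.decomposition import IncrementalPCA

# Sketch (not used below): fit on all images without holding the whole
# feature matrix in memory; each batch needs >= N_COMPONENTS images.
ipca = IncrementalPCA(n_components=N_COMPONENTS)
for batch in BatchDataloader(ConcatDataset(datasets), batch_size=64):
    imgs = batch['image']
    ipca.partial_fit(imgs.reshape((len(imgs), -1)))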

Plot results#

Scree Plot#

[20]:
x = list(range(1, N_COMPONENTS+1))
y = pca.explained_variance_ratio_
[21]:
fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=x,
        y=y
    )
)
fig.update_layout(
    title_text='PCA scree plot',
    xaxis=dict(title_text='Principal Component',
               tickmode='array', tickvals=x),
    yaxis=dict(title_text='Explained Variance Ratio'),
)
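
The scree plot shows each component's individual contribution; a cumulative sum tells you how much variance the first k components explain together:

[ ]:
# Cumulative explained variance for the first k components
cum_var = np.cumsum(pca.explained_variance_ratio_)
for k, v in enumerate(cum_var, start=1):
    print(f'PC1..PC{k}: {v:.2%}')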

2D PCA (from sample)#

[22]:
fitted = pca.transform(features)
fitted.shape
[22]:
(505, 10)
[23]:
cat_sampled.cumulative_sizes
[23]:
array([ 67,  72,  77,  84,  94, 101, 177, 182, 187, 192, 197, 212, 226,
       231, 236, 245, 254, 268, 275, 361, 385, 505])

You can interact with the plot below. For example, click each dataset in the legend to toggle its visibility.

[24]:
fig = go.Figure()
i_offset = 0
for i, size in enumerate(cat_sampled.cumulative_sizes):
    # print(i_offset, size)
    fig.add_trace(
        go.Scatter(
            x=fitted[i_offset:size, 0],
            y=fitted[i_offset:size, 1],
            mode='markers',
            opacity=0.7,
            marker_color=colors[i],
            name=cat_sampled.datasets[i].acronym,
        )
    )
    i_offset = size

fig.update_layout(
    height=600,
    title_text='PCA',
    xaxis=dict(title_text=f'PC1 ({pca.explained_variance_ratio_[0]:.2f})'),
    yaxis=dict(title_text=f'PC2 ({pca.explained_variance_ratio_[1]:.2f})'),
)

fig.show()

2D PCA (full size)#

We have a PCA model fitted on the samples. Now we will load all images in batches using bioimageloader.BatchDataloader, which is a generator and thus requires much less memory.

This one will take a while (~10 min).

[25]:
fullcat = ConcatDataset(datasets)
len(fullcat)
[25]:
4965
[26]:
# I want more cats for every batch
BATCH_SIZE = 512

callfullcat = BatchDataloader(fullcat, batch_size=BATCH_SIZE)
[27]:
len(fullcat), len(callfullcat), callfullcat.batch_size
[27]:
(4965, 10, 512)
[28]:
fitted_full = []
for i, meow in enumerate(callfullcat):
    imgs = meow['image']
    feat = imgs.reshape((len(imgs), -1))  # flattened image arrays are our features
    feat_transformed = pca.transform(feat)  # project onto the principal components
    fitted_full.append(feat_transformed)
    print(i, end=' ')
    # 1m14s for one loop
0 1 2 3 4 5 6 7 8 9

I tried cupy just for fun. The gain is not worth it, because our matrices are not big and numpy is already very fast. Above all, the bottleneck is the loading part, not the matrix multiplication.

[30]:
# import cupy as cp

# pca_comp_cp = cp.array(pca.components_)
# pca_mean_cp = cp.array(pca.mean_)
# fitted_full = []
# for i, meow in enumerate(callfullcat):
#     imgs = meow['image']
#     feat = imgs.reshape((len(imgs), -1))  # flattened image arrays are our features
#     feat_cp = cp.asarray(feat) - pca_mean_cp  # center first, as pca.transform does
#     feat_cp_transformed = feat_cp @ pca_comp_cp.T  # project onto principal components
#     fitted_full.append(feat_cp_transformed.get())
#     print(i, end=' ')
#     # 1m14s for one loop
[29]:
fitted_full = np.vstack(fitted_full)
fitted_full.shape
[29]:
(4965, 10)
[31]:
fullcat.cumulative_sizes
[31]:
array([ 670,  720,  750,  829,  929, 1001, 1769, 1785, 1797, 1853, 1878,
       2028, 2169, 2195, 2245, 2341, 2437, 2581, 2653, 3517, 3757, 4965])

You can interact with the plot below. For example, click each dataset in the legend to toggle its visibility.

[34]:
fig = go.Figure()
i_offset = 0
for i, size in enumerate(fullcat.cumulative_sizes):
    # print(i_offset, size)
    fig.add_trace(
        go.Scatter(
            x=fitted_full[i_offset:size, 0],
            y=fitted_full[i_offset:size, 1],
            mode='markers',
            opacity=0.4,
            marker_color=colors[i],
            name=fullcat.datasets[i].acronym,
        )
    )
    i_offset = size

fig.update_layout(
    height=600,
    title_text='PCA',
    xaxis=dict(title_text=f'PC1 ({pca.explained_variance_ratio_[0]:.2f})'),
    yaxis=dict(title_text=f'PC2 ({pca.explained_variance_ratio_[1]:.2f})'),
)

fig.show()
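
Since we kept 10 components, nothing stops you from inspecting a third axis. A minimal 3D sketch with plotly express; the per-point labels are rebuilt from cumulative_sizes:

[ ]:
# Sketch: first three PCs in an interactive 3D scatter
labels = np.repeat(
    [dset.acronym for dset in fullcat.datasets],
    np.diff(fullcat.cumulative_sizes, prepend=0),
)
fig = px.scatter_3d(
    x=fitted_full[:, 0], y=fitted_full[:, 1], z=fitted_full[:, 2],
    color=labels, opacity=0.4, height=600,
)
fig.show()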