ML: Data statistics with PCA#
Once you have chosen datasets, it is time to load and inspect them. This notebook demonstrates the steps of a simple machine learning (ML) analysis, in this case principal component analysis (PCA).
We are going to use additional packages: scikit-learn for PCA, plotly for drawing graphs, and pandas for tabular data.
Install the additional dependencies first. This notebook was written in JupyterLab; if you are in classic Jupyter Notebook or another environment, plotly may not work out of the box. If you have trouble, follow the instructions at https://plotly.com/python/getting-started/#installation.
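If plotly figures still do not render, a common workaround (a sketch, assuming plotly ≥ 4 where the renderers framework is available) is to set the default renderer explicitly:

[ ]:
import plotly.io as pio

# Pick the renderer matching your environment: 'notebook' for classic
# Jupyter Notebook, 'iframe' if the JupyterLab plotly extension is missing,
# 'browser' to open figures in a new browser tab.
pio.renderers.default = 'notebook'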
Install dependencies and imports#
[1]:
# !pip install scikit-learn
# !pip install pandas
# !pip install plotly
# !pip install "jupyterlab>=3" "ipywidgets>=7.6"
[2]:
# # or use conda
# !conda install scikit-learn
# !conda install pandas
# !conda install -c plotly plotly
# !conda install "jupyterlab>=3" "ipywidgets>=7.6"
[3]:
import albumentations as A
import numpy as np
import pandas as pd
import plotly.express as px
from plotly import graph_objects as go
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from bioimageloader import Config, ConcatDataset, BatchDataloader
from bioimageloader.utils import random_split_dataset
from bioimageloader.plots import cycle_colors, to_hex_color
Load collections#
Start by loading collections using bioimageloader. You have two options.
If you have a config file (e.g. cfg.yaml), load it with Config. Here we load all available collections.
[4]:
cfg = Config('../configs/all_collections_image.yml')
print(len(cfg), cfg.keys())
22 dict_keys(['DSB2018', 'TNBC', 'ComputationalPathology', 'S_BSST265', 'MurphyLab', 'FRUNet', 'BBBC006', 'BBBC007', 'BBBC008', 'BBBC018', 'BBBC020', 'BBBC039', 'DigitalPathology', 'UCSB', 'BBBC002', 'BBBC013', 'BBBC014', 'BBBC015', 'BBBC016', 'BBBC026', 'BBBC021', 'BBBC041'])
Otherwise, make a dictionary with the necessary arguments for each dataset. Note that we only need output='image'. Some collections can output ‘both’ or ‘mask’, and for those the default is to load ‘both’, so we need to force ‘image’.
[5]:
# _cfg = {
#     'BBBC007': {
#         'root_dir': '../Data/bbbc/007',
#         'output': 'image',
#     },
#     'BBBC008': {
#         'root_dir': '../Data/bbbc/008',
#         'output': 'image',
#     },
#     'BBBC013': {
#         'root_dir': '../Data/bbbc/013',
#     },
# }
# cfg = Config.from_dict(_cfg)
Resize all images to the same size so that they can later be stacked into a single feature matrix.
[6]:
resize = A.Resize(256, 256)
datasets = cfg.load_datasets(transforms=resize)
lengths = [len(dset) for dset in datasets]
print(sum(lengths), lengths)
4965 [670, 50, 30, 79, 100, 72, 768, 16, 12, 56, 25, 150, 141, 26, 50, 96, 96, 144, 72, 864, 240, 1208]
[7]:
for dset in datasets:
    print(f'{dset.acronym:10s}: {len(dset):10d}')
DSB2018 : 670
TNBC : 50
ComPath : 30
S_BSST265 : 79
MurphyLab : 100
FRUNet : 72
BBBC006 : 768
BBBC007 : 16
BBBC008 : 12
BBBC018 : 56
BBBC020 : 25
BBBC039 : 150
DigitPath : 141
UCSB : 26
BBBC002 : 50
BBBC013 : 96
BBBC014 : 96
BBBC015 : 144
BBBC016 : 72
BBBC026 : 864
BBBC021 : 240
BBBC041 : 1208
hist(datasets)#
Plot histogram
[8]:
# Choose your color map
# https://plotly.com/python/discrete-color/#color-sequences-in-plotly-express
colors = cycle_colors(px.colors.qualitative.Alphabet, len(datasets))
[9]:
bar_df = pd.DataFrame({
    'acronym': [dset.acronym for dset in datasets],
    'length': [len(dset) for dset in datasets],
    'color': colors,
})
bar_df.head()
[9]:
|    | acronym   | length | color   |
|----|-----------|--------|---------|
| 0  | DSB2018   | 670    | #AA0DFE |
| 1  | TNBC      | 50     | #3283FE |
| 2  | ComPath   | 30     | #85660D |
| 3  | S_BSST265 | 79     | #782AB6 |
| 4  | MurphyLab | 100    | #565656 |
Add percentage info
[10]:
bar_df['perc'] = bar_df.length / bar_df.length.sum()
bar_df.head()
[10]:
|    | acronym   | length | color   | perc     |
|----|-----------|--------|---------|----------|
| 0  | DSB2018   | 670    | #AA0DFE | 0.134945 |
| 1  | TNBC      | 50     | #3283FE | 0.010070 |
| 2  | ComPath   | 30     | #85660D | 0.006042 |
| 3  | S_BSST265 | 79     | #782AB6 | 0.015911 |
| 4  | MurphyLab | 100    | #565656 | 0.020141 |
[11]:
# Sort by length in descending order
bar_df = bar_df.sort_values('length', ascending=False)
bar_df.head()
[11]:
|    | acronym | length | color   | perc     |
|----|---------|--------|---------|----------|
| 21 | BBBC041 | 1208   | #C075A6 | 0.243303 |
| 19 | BBBC026 | 864    | #2ED9FF | 0.174018 |
| 6  | BBBC006 | 768    | #16FF32 | 0.154683 |
| 0  | DSB2018 | 670    | #AA0DFE | 0.134945 |
| 20 | BBBC021 | 240    | #B10DA1 | 0.048338 |
[12]:
HOVERTEMPLATE = '%{x}, %{y}, %{customdata:.2f}%<extra></extra>'
fig = go.Figure()
fig.add_trace(
    go.Bar(x=bar_df.acronym,
           y=bar_df.length,
           marker_color=bar_df.color,
           customdata=100 * bar_df.perc,  # perc is a fraction; scale to percent for the hover label
           hovertemplate=HOVERTEMPLATE)
)
fig.update_layout(
    title_text='Collections (#collections: {}, #images: {})'.format(
        len(bar_df), bar_df.length.sum())
)
fig.show()
PCA(datasets)#
Using BatchDataloader won’t do much in this example, since the only transform is Resize, which is a cheap operation. But it is good practice to benefit from multiprocessing.
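As a sketch of what that could look like (hedged: the num_workers argument is an assumption here, not verified against every release; check help(BatchDataloader) for the exact signature of your installed bioimageloader version):

[ ]:
# Hypothetical sketch: spread image loading and the Resize transform
# across worker processes. num_workers is an assumed parameter; verify
# that it exists in your version before uncommenting.
# loader = BatchDataloader(datasets[0], batch_size=16, num_workers=4)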
Sample#
Loading all images would require a lot of memory, so we will sample 10% (with a minimum of 5 images) from each dataset.
[13]:
SAMPLE_RATE = 0.1
MINIMUM_N_IMAGES = 5
datasets_sampled = []
for dset in datasets:
    n = int(SAMPLE_RATE * len(dset))
    n = max(n, MINIMUM_N_IMAGES)  # minimum 5 images
    sampled, _ = random_split_dataset(dset, [n, len(dset) - n])
    print(f'{dset.acronym:10s}: {len(dset):>10d} => {len(sampled):>10d}')
    datasets_sampled.append(sampled)
DSB2018 : 670 => 67
TNBC : 50 => 5
ComPath : 30 => 5
S_BSST265 : 79 => 7
MurphyLab : 100 => 10
FRUNet : 72 => 7
BBBC006 : 768 => 76
BBBC007 : 16 => 5
BBBC008 : 12 => 5
BBBC018 : 56 => 5
BBBC020 : 25 => 5
BBBC039 : 150 => 15
DigitPath : 141 => 14
UCSB : 26 => 5
BBBC002 : 50 => 5
BBBC013 : 96 => 9
BBBC014 : 96 => 9
BBBC015 : 144 => 14
BBBC016 : 72 => 7
BBBC026 : 864 => 86
BBBC021 : 240 => 24
BBBC041 : 1208 => 120
Concatenate them and load images in batches.
[14]:
BATCH_SIZE = 32
cat_sampled = ConcatDataset(datasets_sampled)
callcat_sampled = BatchDataloader(cat_sampled, batch_size=BATCH_SIZE)
[15]:
len(cat_sampled), len(callcat_sampled), callcat_sampled.batch_size
[15]:
(505, 16, 32)
Prep for PCA#
Flatten all image arrays and stack them so that the resulting array has shape (n_samples, n_features).
[16]:
enum_callcat_sampled = enumerate(callcat_sampled)
[17]:
features = []
for i, meow in enum_callcat_sampled:
    print(i, end=' ')
    for img in meow['image']:
        features.append(img.ravel())
features = np.asarray(features)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[18]:
features.shape
[18]:
(505, 196608)
Perform PCA#
[19]:
N_COMPONENTS = 10
pca = PCA(n_components=N_COMPONENTS)
pca.fit(features)
[19]:
PCA(n_components=10)
Plot results#
Scree Plot#
[20]:
x = list(range(1, N_COMPONENTS+1))
y = pca.explained_variance_ratio_
[21]:
fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=x,
        y=y
    )
)
fig.update_layout(
    title_text='PCA scree plot',
    xaxis=dict(title_text='Principal Component',
               tickmode='array', tickvals=x),
    yaxis=dict(title_text='Explained Variance Ratio'),
)
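To see how much total variance the first k components capture together, take the cumulative sum of the ratios (numpy is already imported above):

[ ]:
# Entry k-1 is the fraction of total variance explained by the first k
# principal components combined.
np.cumsum(pca.explained_variance_ratio_)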
2D PCA (from sample)#
[22]:
fitted = pca.transform(features)
fitted.shape
[22]:
(505, 10)
[23]:
cat_sampled.cumulative_sizes
[23]:
array([ 67, 72, 77, 84, 94, 101, 177, 182, 187, 192, 197, 212, 226,
231, 236, 245, 254, 268, 275, 361, 385, 505])
cat_sampled.cumulative_sizes marks where each dataset ends in the concatenated array, so slicing between consecutive offsets recovers one dataset’s points. You can interact with the plot below; for example, click each dataset in the legend to toggle its visibility.
[24]:
fig = go.Figure()
i_offset = 0
for i, size in enumerate(cat_sampled.cumulative_sizes):
    # print(i_offset, size)
    fig.add_trace(
        go.Scatter(
            x=fitted[i_offset:size, 0],
            y=fitted[i_offset:size, 1],
            mode='markers',
            opacity=0.7,
            marker_color=colors[i],
            name=cat_sampled.datasets[i].acronym,
        )
    )
    i_offset = size
fig.update_layout(
    height=600,
    title_text='PCA',
    xaxis=dict(title_text=f'PC1 ({pca.explained_variance_ratio_[0]:.2f})'),
    yaxis=dict(title_text=f'PC2 ({pca.explained_variance_ratio_[1]:.2f})'),
)
fig.show()
2D PCA (full size)#
We have a PCA model fitted on the samples. Now we will load all images in batches using bioimageloader.BatchDataloader, which is a generator and thus requires much less memory. This will take a while (~10 min).
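Transforming batch by batch is safe because pca.transform applies the same affine map to each row independently (subtract the fitted mean, then multiply by the component matrix), so batching cannot change the result. A quick sanity check on the sampled features we already have in memory:

[ ]:
# Stacking the transforms of two half-batches equals transforming the
# whole block at once.
whole = pca.transform(features[:64])
parts = np.vstack([pca.transform(features[:32]), pca.transform(features[32:64])])
assert np.allclose(whole, parts)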
[25]:
fullcat = ConcatDataset(datasets)
len(fullcat)
[25]:
4965
[26]:
# I want more cats for every batch
BATCH_SIZE = 512
callfullcat = BatchDataloader(fullcat, batch_size=BATCH_SIZE)
[27]:
len(fullcat), len(callfullcat), callfullcat.batch_size
[27]:
(4965, 10, 512)
[28]:
fitted_full = []
for i, meow in enumerate(callfullcat):
    imgs = meow['image']
    feat = imgs.reshape((len(imgs), -1))    # flattened image arrays are our features
    feat_transformed = pca.transform(feat)  # center by the fitted mean, then project onto the components
    fitted_full.append(feat_transformed)
    print(i, end=' ')
# ~1m14s per iteration
0 1 2 3 4 5 6 7 8 9
I tried cupy just for fun. The gain is not worth it because our matrices are not big and numpy is already very fast. Above all, the bottleneck is the loading part, not the matrix multiplication. Note that pca.transform also centers the data with the fitted mean before multiplying by the components, so the GPU version below does the same.
[30]:
# import cupy as cp
# pca_comp_cp = cp.array(pca.components_)
# pca_mean_cp = cp.array(pca.mean_)
# fitted_full = []
# for i, meow in enumerate(callfullcat):
#     imgs = meow['image']
#     feat = imgs.reshape((len(imgs), -1))  # flattened image arrays are our features
#     feat_cp = cp.asarray(feat)
#     # center by the fitted mean, then multiply by the decomposed components
#     feat_cp_transformed = (feat_cp - pca_mean_cp) @ pca_comp_cp.T
#     fitted_full.append(feat_cp_transformed.get())
#     print(i, end=' ')
# # ~1m14s per iteration
[29]:
fitted_full = np.vstack(fitted_full)
fitted_full.shape
[29]:
(4965, 10)
[31]:
fullcat.cumulative_sizes
[31]:
array([ 670, 720, 750, 829, 929, 1001, 1769, 1785, 1797, 1853, 1878,
2028, 2169, 2195, 2245, 2341, 2437, 2581, 2653, 3517, 3757, 4965])
As before, you can interact with the plot below; click each dataset in the legend to toggle its visibility.
[34]:
fig = go.Figure()
i_offset = 0
for i, size in enumerate(fullcat.cumulative_sizes):
    # print(i_offset, size)
    fig.add_trace(
        go.Scatter(
            x=fitted_full[i_offset:size, 0],
            y=fitted_full[i_offset:size, 1],
            mode='markers',
            opacity=0.4,
            marker_color=colors[i],
            name=fullcat.datasets[i].acronym,
        )
    )
    i_offset = size
fig.update_layout(
    height=600,
    title_text='PCA',
    xaxis=dict(title_text=f'PC1 ({pca.explained_variance_ratio_[0]:.2f})'),
    yaxis=dict(title_text=f'PC2 ({pca.explained_variance_ratio_[1]:.2f})'),
)
fig.show()