---
title: "05-b SVD et Analyse en Composantes Principales"
author: Pierre-Edouard Portier
date: mars 2022
output:
  beamer_presentation:
    incremental: false
---

# Projection sur les axes principaux

>- $\mathbf{X}\in\mathbb{R}^{n \times p} \quad ; \quad \mathbf{X} = \sum_{\alpha}\sqrt{\lambda_\alpha}\mathbf{u_\alpha}\mathbf{v_\alpha}^T \quad\equiv\quad \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T$
>- Coordonnées (facteurs) des observations sur les axes principaux $\mathbf{F} \triangleq \mathbf{X}\mathbf{V} = \mathbf{U} \mathbf{D} \mathbf{V}^T\mathbf{V} = \mathbf{U} \mathbf{D}$
>- Covariance des facteurs $\left(\mathbf{U} \mathbf{D}\right)^T\left(\mathbf{U} \mathbf{D}\right) = \mathbf{D}^2$
  >- Les facteurs sont orthogonaux (par construction)
  >- $\lambda_\alpha = \sum_i f_{i,\alpha}^2$ est la variance expliquée par l'axe $\mathbf{v_\alpha}$
>- Contribution des observations aux axes $CTR_{i,\alpha} = \frac{f_{i,\alpha}^2}{\sum_i f_{i,\alpha}^2} = \frac{f_{i,\alpha}^2}{\lambda_\alpha}$
  >- $\sum_i CTR_{i,\alpha} = 1$ et $\mathbf{x_i}$ pèse sur $\mathbf{v_\alpha}$ quand $CTR_{i,\alpha} \geq 1/n$
>- Contribution des axes aux observations $COS2_{i,\alpha} = \frac{f_{i,\alpha}^2}{\sum_\alpha f_{i,\alpha}^2} = \frac{f_{i,\alpha}^2}{d_{i,g}^2}$
  >- $d_{i,g}^2 = \sum_j \left(x_{i,j}-g_j\right)^2$ ($\sum_j x_{i,j}^2$ si données centrées)
  >- $\sum_i d_{i,g}^2 = \mathcal{I} = \sum_\alpha \lambda_\alpha$
>- Contribution des variables aux axes $VARCTR_{j,\alpha} = \left(\frac{\lambda_\alpha}{\sqrt{n-1}}\mathbf{v_\alpha}\right)^2$

# Exemple `abalone`

## Récupération du jeu de données

```{r cache=TRUE}
abalone.cols = c("sex", "length", "diameter", "height",
                 "whole.wt", "shucked.wt", "viscera.wt",
                 "shell.wt", "rings")

url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone <- read.table(url, sep=",", row.names=NULL,
                      col.names=abalone.cols,
                      nrows=4177)
abalone <- subset(abalone, height!=0)
```

## Sélection des variables explicatives numériques

```{r}
X <- abalone[,c("length", "diameter", "height", "whole.wt",
                "shucked.wt", "viscera.wt", "shell.wt")]
```

# Exemple `abalone`

## Standardisation

```{r}
standard <- function(X) {
  meanX <- apply(X, 2, mean);
  sdX <- apply(X, 2, sd);
  Y <- sweep(X, 2, meanX, FUN = "-");
  Y <- sweep(Y, 2, sdX, FUN = "/");
  res <- list("X" = Y, "meanX" = meanX, "sdX" = sdX)
}
```

# Exemple `abalone`

## Clustering

```{r warning=FALSE}
set.seed(1123)
std.X <- standard(X)
A <- std.X$X
nbClust <- 100
clst <- kmeans(A, nbClust, nstart = 25)
```

# Exemple `abalone`

## SVD des centres des clusters

```{r}
std.centers <- standard(clst$centers)
C <- std.centers$X
svd.C <- svd(C)
```

## Pourcentage de variance expliquée par chaque axe principal

```{r}
round((svd.C$d^2 / sum(svd.C$d^2))*100, 2)
```

# Exemple `abalone`

## Facteurs et contributions des observations aux axes

- $\mathbf{F} \triangleq \mathbf{X}\mathbf{V} = \mathbf{U} \mathbf{D} \mathbf{V}^T\mathbf{V} = \mathbf{U} \mathbf{D}$
- $CTR_{i,\alpha} = \frac{f_{i,\alpha}^2}{\sum_i f_{i,\alpha}^2} = \frac{f_{i,\alpha}^2}{\lambda_\alpha}$

```{r}
fact <- svd.C$u %*% diag(svd.C$d)
fact2 <- fact^2
ctr <- sweep(fact2, 2, colSums(fact2), "/")
```

# Exemple `abalone`

## Affichage des observations importantes pour les deux premiers axes principaux

```{r, eval=FALSE}
n <- dim(fact)[1]
d1BestCtr <- which(ctr[,1] > 1/n)
d2BestCtr <- which(ctr[,2] > 1/n)
d1d2BestCtr <- union(d1BestCtr,d2BestCtr)
plot(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2], pch="")
text(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2],
     d1d2BestCtr, cex=0.8)
```

# Exemple `abalone`

```{r, echo=FALSE}
n <- dim(fact)[1]
d1BestCtr <- which(ctr[,1] > 1/n)
d2BestCtr <- which(ctr[,2] > 1/n)
d1d2BestCtr <- union(d1BestCtr,d2BestCtr)
plot(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2], pch="")
text(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2],
     d1d2BestCtr, cex=1)
```

# Exemple `abalone`

```{r}
f2.max <- which.max(abs(fact[,2]))
f2.max.size <- clst$size[f2.max]
f2.max.name <- names(clst$cluster[clst$cluster==f2.max])
f2.max.ctr <- round(ctr[f2.max,2]*100, 2)
```

- Le cluster `r f2.max` explique `r f2.max.ctr` % de la variance du second axe.
- Il est composé de `r f2.max.size` élément

```{r}
abalone[f2.max.name,]
```

# Exemple `abalone`

## Nouvelle analyse

```{r, warning=FALSE}
abalone <- abalone[!(row.names(abalone) %in% f2.max.name),]
X <- abalone[,c("length", "diameter", "height", "whole.wt","shucked.wt", "viscera.wt",
                "shell.wt")]
std.X <- standard(X)
A <- std.X$X
nbClust <- 100
clst <- kmeans(A, nbClust, nstart = 25)
std.centers <- standard(clst$centers)
C <- std.centers$X
svd.C <- svd(C)
round((svd.C$d^2 / sum(svd.C$d^2))*100, 2)
```

# Exemple `abalone`

## Nouvelle analyse (suite) et affichage des observations sur les deux premiers axes

```{r}
fact <- svd.C$u %*% diag(svd.C$d)
fact2 <- fact^2
ctr <- sweep(fact2, 2, colSums(fact2), "/")
n <- dim(fact)[1]
d1BestCtr <- which(ctr[,1] > 1/n)
d2BestCtr <- which(ctr[,2] > 1/n)
d1d2BestCtr <- union(d1BestCtr,d2BestCtr)
```

```{r, eval=FALSE}
plot(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2], pch="")
text(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2],
     d1d2BestCtr, cex=0.8)
```

# Exemple `abalone`

```{r, echo=FALSE}
plot(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2], pch="")
text(fact[d1d2BestCtr,1], fact[d1d2BestCtr,2], d1d2BestCtr, cex=0.8)
```

# Exemple `abalone`

## Un cluster intrigant...

```{r}
intrig<-43
intrig.name <- names(clst$cluster[clst$cluster==intrig])
abalone[intrig.name,]
```

# Exemple `abalone`

## Un cluster intrigant...

```{r}
boxplot(abalone$height)
```

# Exemple `abalone`

## Un cluster intrigant...

- Contributions des axes aux écarts des observations au centre d'inertie $COS2_{i,\alpha} = \frac{f_{i,\alpha}^2}{\sum_\alpha f_{i,\alpha}^2} = \frac{f_{i,\alpha}^2}{d_{i,g}^2}$

```{r}
cos2 <- sweep(fact2, 1, rowSums(fact2), "/")
round(cos2[intrig,]*100,2)
```

- Observation intrigante toute expliquée par les deux premiers axes.
- Pas de risque à considérer comme significatif son écart à l'origine.

# Exemple `abalone`

## Contributions des variables aux axes principaux

- $VARCTR_{j,\alpha} = \left(\frac{\lambda_\alpha}{\sqrt{n-1}}\mathbf{v_\alpha}\right)^2$

```{r}
varctr <- ((svd.C$v %*% diag(svd.C$d))/sqrt(n-1))^2
rownames(varctr) <- colnames(X)
round(varctr,2)
```

- Variables très corrélées qui pèsente toutes fortement sur la variance expliquée par l'axe 1