diff --git a/ca_experiment/ca_experiment.Rmd b/ca_experiment/ca_experiment.Rmd
new file mode 100644
index 0000000..c600729
--- /dev/null
+++ b/ca_experiment/ca_experiment.Rmd
@@ -0,0 +1,471 @@
+---
+output:
+  pdf_document: default
+  html_document: default
+---
+```{r}
+set.seed(1123)
+source('clust.R')
+```
+# Dataset
+```{r}
+dat <- read.csv(file="../data/housing.csv", header=TRUE)
+str(dat)
+```
+# Univariate analysis and discretization
+
+## Transform String variables to Factor
+
+Variable `ocean_proximity` is of type `chr`. It is in fact a categorical variable with modalities such as "NEAR BAY", "NEAR OCEAN", etc.
+We transform it into an explicit categorical variable.
+```{r}
+dat$ocean_proximity <- as.factor(dat$ocean_proximity)
+summary(dat)
+```
+
+## Longitude
+The distribution of the longitudes appears bi-modal, with a first peak around -122°, near the West coast, and a second peak around -118°, more inland.
+```{r}
+hist(dat$longitude)
+```
+
+We could cut the longitude into, for example, 4 intervals with equal densities.
+```{r}
+cuts <- quantile(dat$longitude, probs = seq(0,1,1/4))
+hist(dat$longitude)
+abline(v=cuts, lwd=4)
+```
+
+Or, we could use a clustering-based approach to obtain cuts, of potentially different sizes, that may be more representative of the data.
+```{r}
+cuts <- kcuts(x = dat$longitude, centers = 4)
+cuts
+```
+
+```{r}
+hist(dat$longitude)
+abline(v=cuts, lwd=4)
+```
+
+We discretize the longitudes into a new categorical variable named `c_longitude` with modalities `LO-W` (West), `LO-M` (Mid-West), `LO-ME` (Mid-East), `LO-E` (East).
+```{r}
+dat$c_longitude <- cut(x = dat$longitude, unique(cuts), include.lowest = TRUE)
+levels(dat$c_longitude) <- c('LO-W', 'LO-MW', 'LO-ME', 'LO-E')
+summary(dat$c_longitude)
+```
+Analyzing data means initiating a dialogue with them and letting them guide us towards a model. It is an iterative process. In first intention, we test a cut of the longitude variable into 4 modalities based on the output of the k-means algorithm.
+
+## Latitude
+We proceed similarly with the latitudes.
+We discretize them into a new categorical variable named `c_latitude` with modalities `LA-S` (South), `LA-MS` (Mid-South), `LA-MN` (Mid-North) and `LA-N` (North).
+```{r}
+cuts <- kcuts(x = dat$latitude, centers = 4)
+cuts
+```
+
+```{r}
+hist(dat$latitude)
+abline(v=cuts, lwd=3)
+```
+
+```{r}
+dat$c_latitude <- cut(x = dat$latitude, unique(cuts), include.lowest = TRUE)
+levels(dat$c_latitude) <- c('LA-S','LA-MS','LA-MN','LA-N')
+summary(dat$c_latitude)
+```
+
+## Housing median age
+The histogram of the housing median age seems unimodal.
+```{r}
+hist(dat$housing_median_age)
+```
+
+There is a noticeably high number of housing ages greater than 50. We can tabulate some of the age values to better explore this phenomenon.
+```{r}
+table(dat$housing_median_age[dat$housing_median_age>45])
+```
+
+There is a cap on ages greater than 52. It appears clearly on an histogram with more bins. This phenomenon may have an impact on future analysis.
+```{r}
+nb_age_52 <- length(dat$housing_median_age[dat$housing_median_age == 52])
+pc_age_52 <- round(100 * (nb_age_52 / dim(dat)[1]))
+hist(dat$housing_median_age, breaks = 40)
+```
+
+`r pc_age_52`% of the observations have a median age of 52, enough to regroup them into a category.
+```{r}
+cuts <- c(quantile(dat$housing_median_age[dat$housing_median_age<52]), 52)
+cuts
+```
+
+Guided by the quantile analysis, we discretize the housing median age variable into a new categorical variable named `c_age` with easy to grasp round numbers as modalities' boundaries (viz., less than 15, between 15 and 25, etc.).
+```{r}
+cuts <- c(min(dat$housing_median_age), 15, 25, 35, 51, 52)
+hist(dat$housing_median_age, breaks = 40)
+abline(v=cuts, lwd=3)
+```
+
+```{r}
+dat$c_age <- cut(x = dat$housing_median_age, unique(cuts), include.lowest = TRUE)
+levels(dat$c_age) <- c('A<=15','A(15,25]','A(25,35]','A(35,51]', 'A=52')
+summary(dat$c_age)
+```
+
+## Rooms
+The histogram of the rooms has a long tail due to infrequent large values.
+Also, the values, whose range extends from `r min(dat$total_rooms)` to `r max(dat$total_rooms)`, are difficult to interpret since they depend on the number of households.
+```{r}
+cuts <- quantile(dat$total_rooms)
+hist(dat$total_rooms)
+abline(v=cuts, lwd=3)
+```
+
+We create a new variable `rooms` to count the relative number of rooms by households.
+```{r}
+dat$rooms <- dat$total_rooms / dat$households
+nb_rooms_gt_8 <- length(dat$rooms[dat$rooms>8])
+pc_rooms_gt_8 <- round(100 * (nb_rooms_gt_8 / dim(dat)[1]))
+quantile(dat$rooms)
+```
+
+Guided by the quantile analysis, we create the categorical variable `c_rooms` with modalities: `R<=4` less than 4 rooms, `R(4,6]` between 4 and 6 rooms, `R(6,8]` between 6 and 8 rooms, `R>8` more than 8 rooms. We add the last category `R>8` to explicitly register the long tail: only `r pc_rooms_gt_8`% of the observations belong to this category.
+To better visualize the cuts, we plot the histogram after a log transform of the number of rooms.
+```{r}
+cuts <- quantile(dat$rooms, probs = seq(0,1,1/3))
+cuts
+```
+
+```{r}
+cuts <- c(min(dat$rooms), 4, 6, 8, max(dat$rooms))
+hist(log10(dat$rooms))
+abline(v=log10(cuts), lwd=3)
+```
+
+```{r}
+dat$c_rooms <- cut(x = dat$rooms, unique(cuts), include.lowest = TRUE)
+levels(dat$c_rooms) <- c('R<=4','R(4,6]','R(6,8]', 'R>8')
+summary(dat$c_rooms)
+```
+
+## Bedrooms
+As for the rooms, we create a new variable `bedrooms` to count the relative number of bedrooms by households.
+We also note that this variable has missing values encoded with the `NA` symbol.
+We have to tell some R functions to ignore the missing values by setting the parameter `na.rm` at `TRUE`.
+```{r}
+dat$bedrooms <- dat$total_bedrooms / dat$households
+quantile(dat$bedrooms, na.rm = TRUE)
+```
+
+Guided by the quantile analysis, we create the categorical variable `c_bedrooms` with modalities: `B<=1` 0 or 1 bedroom, `B>1` more than 1 bedroom.
+```{r}
+cuts <- c(min(dat$bedrooms, na.rm = TRUE), 1.1, max(dat$bedrooms, na.rm = TRUE))
+hist(log10(dat$bedrooms))
+abline(v=log10(cuts), lwd=3)
+```
+
+```{r}
+dat$c_bedrooms <- cut(x = dat$bedrooms, unique(cuts), include.lowest = TRUE)
+levels(dat$c_bedrooms) <- c('B<=1','B>1')
+summary(dat$c_bedrooms)
+```
+
+## Population
+As for the rooms and the bedrooms, we create a new variable `pop` to count the number of persons by households.
+```{r}
+dat$pop <- dat$population / dat$households
+quantile(dat$pop, probs = seq(0,1,1/4))
+```
+
+Guided by the quantile analysis, we create the categorical variable `c_pop` with modalities: `P<=2`, `P(2,3]`, `P(3,4]` and `P>4`.
+```{r}
+cuts <- c(min(dat$pop), 2, 3, 4, max(dat$pop))
+hist(log10(dat$pop))
+abline(v=log10(cuts), lwd=3)
+```
+
+```{r}
+dat$c_pop <- cut(x = dat$pop, unique(cuts), include.lowest = TRUE)
+levels(dat$c_pop) <- c('P<=2','P(2,3]', 'P(3,4]', 'P>4')
+summary(dat$c_pop)
+```
+
+## Households
+After a quick quantile analysis, we create the categorical variable `c_households` with modalities: less than 300, between 300 and 400, between 400 and 600, more than 600.
+```{r}
+quantile(dat$households, probs = seq(0,1,1/4))
+```
+
+```{r}
+cuts <- c(min(dat$households), 300, 400, 600, max(dat$households))
+hist(log10(dat$households))
+abline(v=log10(cuts), lwd=3)
+```
+
+```{r}
+dat$c_households <- cut(x = dat$households, cuts, include.lowest = TRUE)
+levels(dat$c_households) <- c('H<=3', 'H(3,4]', 'H(4,6]', 'H>6')
+summary(dat$c_households)
+```
+
+## Median income
+```{r}
+cuts <- quantile(dat$median_income, probs = seq(0,1,1/4))
+hist(dat$median_income)
+abline(v=cuts, lwd=3)
+```
+
+There is a noticeably high number of incomes greater than 15. We can tabulate some of the income values to better explore this phenomenon.
+```{r}
+table(dat$median_income[dat$median_income>14])
+```
+
+There is a cap on incomes greater than 15. It appears more clearly on an histogram with more bins. This phenomenon may have an impact on future analysis.
+```{r}
+nb_income_gt_15 <- length(dat$median_income[dat$median_income > 15])
+pc_income_gt_15 <- round(100 * (nb_income_gt_15 / dim(dat)[1]), digits = 2)
+hist(dat$median_income, breaks = 40)
+```
+
+Only `r pc_income_gt_15`% of the observations have a median income greater than 15.
+We still build a specific category for them to be able to study them later.
+```{r}
+cuts <- c(cuts[1:length(cuts)-1], 15, max(dat$median_income))
+dat$c_income <- cut(x = dat$median_income, cuts, include.lowest = TRUE)
+levels(dat$c_income) <- c('IL', 'IML', 'IMH', 'IH', 'I>15')
+summary(dat$c_income)
+```
+
+## House value
+```{r}
+hist(dat$median_house_value, breaks = 30)
+```
+
+```{r}
+nb_house_value_gt_50k <- length(dat$median_house_value[dat$median_house_value > 500000])
+pc_house_value_gt_50k <- round(100 * (nb_house_value_gt_50k / dim(dat)[1]), digits = 2)
+table(dat$median_house_value[dat$median_house_value>499000])
+```
+
+The median house value variable has been capped after 500000. `r pc_house_value_gt_50k`% of the observations are affected by this simplification.
+```{r}
+quantile(dat$median_house_value[dat$median_house_value < 500000])
+```
+
+After observing a division of the dataset into 4 quantiles, we propose a discretization of the median house value variable while keeping a category for the values above 500000.
+```{r}
+cuts <- c(min(dat$median_house_value), 115000, 175000, 250000, 500000, max(dat$median_house_value))
+dat$c_house_value <- cut(x = dat$median_house_value, cuts, include.lowest = TRUE)
+levels(dat$c_house_value) <- c('V<=115', 'V(115,175]', 'V(175,250]', 'V(250,500]', 'V>500')
+summary(dat$c_house_value)
+```
+
+```{r}
+hist(dat$median_house_value)
+abline(v=cuts, lwd=3)
+```
+
+# Correspondence analysis
+
+## Ventilation of small modalities
+
+```{r}
+dat.all <- dat
+dat <- dat.all[c('ocean_proximity', 'c_longitude', 'c_latitude', 'c_age',
+                 'c_rooms', 'c_bedrooms', 'c_pop', 'c_households', 'c_income', 'c_house_value')]
+summary(dat)
+```
+
+Before correspondence analysis, we ventilate the small modality `R>8` of variable `c_rooms` into the other modalities.
+```{r}
+c_rooms_sup <- ventilate(dat$c_rooms, "R>8")
+dat$c_rooms[c_rooms_sup$sup_i] <- c_rooms_sup$smpl
+```
+
+We do the same for the modality `I>15` of variable `c_income`, `ISLAND` of variable `ocean_proximity` and for the missing values of variable `c_bedrooms`.
+
+```{r}
+c_income_sup <- ventilate(dat$c_income, "I>15")
+dat$c_income[c_income_sup$sup_i] <- c_income_sup$smpl
+
+ocean_proximity_sup <- ventilate(dat$ocean_proximity, "ISLAND")
+dat$ocean_proximity[ocean_proximity_sup$sup_i] <- ocean_proximity_sup$smpl
+
+c_bedrooms_sup <- ventilate(dat$c_bedrooms, "NA")
+dat$c_bedrooms[c_bedrooms_sup$sup_i] <- c_bedrooms_sup$smpl
+
+dat <- droplevels(dat)
+summary(dat)
+```
+
+## Supplementary variables
+
+We consider `c_house_value` to be a supplementary variable.
+
+```{r}
+# supplementary categories must be in last positions in dataframe
+sup_ind <- which(names(dat) == "c_house_value")
+dat_act <- dat[,-sup_ind]
+dat_sup <- dat[,sup_ind]
+I <- dim(dat_act)[1]
+Q <- dim(dat_act)[2]
+dat[c(1:5, I),]
+```
+
+## Indicator matrix
+
+```{r}
+lev_n <- unlist(lapply(dat, nlevels))
+n <- cumsum(lev_n)
+J_t <- sum(lev_n)
+Q_t <- dim(dat)[2]
+Z <- matrix(0, nrow = I, ncol = J_t)
+numdat <- lapply(dat, as.numeric)
+offset <- c(0, n[-length(n)])
+for (i in 1:Q_t)
+  Z[1:I + (I * (offset[i] + numdat[[i]] - 1))] <- 1
+cn <- rep(names(dat), lev_n)
+ln <- unlist(lapply(dat, levels))
+dimnames(Z)[[1]] <- as.character(1:I)
+dimnames(Z)[[2]] <- paste(cn, ln, sep = "")
+
+Z_sup_min <- n[sup_ind[1] - 1] + 1
+Z_sup_max <- n[sup_ind[length(sup_ind)]]
+Z_sup_ind <- Z_sup_min : Z_sup_max
+Z_act <- Z[,-Z_sup_ind]
+J <- dim(Z_act)[2]
+
+Z_act[c(1:5, I), ]
+```
+
+## Burt matrix
+
+```{r}
+B <- t(Z_act) %*% Z_act
+B[1:5, 1:5]
+```
+
+## Principal inertia
+
+```{r}
+P     <- B / sum(B)
+r     <- apply(P, 2, sum)
+rr    <- r %*% t(r)
+S     <- (P - rr) / sqrt(rr)
+dec   <- eigen(S)
+delt  <- dec$values[1 : (J-Q)]
+expl  <- 100 * (delt / sum(delt))
+lam   <- delt^2
+expl2 <- 100 * (lam / sum(lam))
+```
+
+## Standard and principal coordinates
+
+```{r}
+K <- J - Q
+a <- sweep(dec$vectors, 1, sqrt(r), FUN = "/")
+a <- a[,(1 : K)]
+f <- a %*% diag(delt)
+```
+
+## Labels
+
+```{r}
+lbl_dic <- c(
+  "O:<1H" = "ocean_proximity<1H OCEAN",
+  "O:INL" = "ocean_proximityINLAND",
+  "O:NB" = "ocean_proximityNEAR BAY",
+  "O:NO" = "ocean_proximityNEAR OCEAN",
+  "LO:W" = "c_longitudeLO-W",
+  "LO:MW" = "c_longitudeLO-MW",
+  "LO:ME" = "c_longitudeLO-ME",
+  "LO:E" = "c_longitudeLO-E",
+  "LA:S" = "c_latitudeLA-S",
+  "LA:MS" = "c_latitudeLA-MS",
+  "LA:MN" = "c_latitudeLA-MN",
+  "LA:N" = "c_latitudeLA-N",
+  "AG:15]" = "c_ageA<=15",
+  "AG:25]" = "c_ageA(15,25]",
+  "AG:35]" = "c_ageA(25,35]",
+  "AG:51]" = "c_ageA(35,51]",
+  "AG:52" = "c_ageA=52",
+  "RO:4]" = "c_roomsR<=4",
+  "RO:6]" = "c_roomsR(4,6]",
+  "RO:8]" = "c_roomsR(6,8]",
+  "BE:1]" = "c_bedroomsB<=1",
+  "BE:>1" = "c_bedroomsB>1",
+  "PO:2]" = "c_popP<=2",
+  "PO:3]" = "c_popP(2,3]",
+  "PO:4]" = "c_popP(3,4]",
+  "PO:>4" = "c_popP>4",
+  "HH:3]" = "c_householdsH<=3",
+  "HH:4]" = "c_householdsH(3,4]",
+  "HH:6]" = "c_householdsH(4,6]",
+  "HH:>6" = "c_householdsH>6",
+  "IC:L" = "c_incomeIL",
+  "IC:ML" = "c_incomeIML",
+  "IC:MH" = "c_incomeIMH",
+  "IC:H" = "c_incomeIH",
+  "HV:A" = "c_house_valueV<=115",
+  "HV:B" = "c_house_valueV(115,175]",
+  "HV:C" = "c_house_valueV(175,250]",
+  "HV:D" = "c_house_valueV(250,500]",
+  "HV:E" = "c_house_valueV>500"
+)
+
+lbl_act_dic <- lbl_dic[1:J]
+fac_names <- paste("F", paste(1 : K), sep = "")
+rownames(a) <- names(lbl_act_dic)
+colnames(a) <- fac_names
+rownames(f) <- names(lbl_act_dic)
+colnames(f) <- fac_names
+```
+
+## Contribution of a variable to a given factor
+
+Inertia of variable $i$ along principal axis (or factor) $k$ is the mass $r_i$ times the squared distance to origin $f_{i,k}^2$.
+The contribution of variable $i$ to factor $k$ is its inertia normalized so that the sum of the contributions to a factor is $1$.
+`ctr[i,k]` is contribution of variable $i$ to factor $k$.
+```{r}
+temp <- sweep(f^2, 1, r, FUN = "*")
+sum_ctr <- apply(temp, 2, sum)
+ctr <- sweep(temp, 2, sum_ctr, FUN = "/")
+```
+
+## Correlation of a given variable with a factor
+
+For a given variable $i$ the sum of its correlations with the factors is $1$.
+```{r}
+temp <- f^2
+sum_cor <- apply(temp, 1, sum)
+cor <- sweep(temp, 1, sum_cor, FUN="/")
+```
+
+
+## plot
+
+```{r}
+sel <- ctr > (1 / dim(ctr)[1])
+sel12 <- sel[,1] | sel[,2]
+par(pty = "s")
+plot(f[sel12,1], f[sel12,2], xlab = "F1", ylab = "F2", type = "n", asp = 1)
+text(f[sel12,1], f[sel12,2], names(lbl_act_dic)[sel12])
+```
+
+```{r}
+# sort(cor["O:NB",] / sum(cor["O:NB",]), decreasing = TRUE)
+```
+
+```{r}
+sel34 <- sel[,3] | sel[,4]
+par(pty = "s")
+plot(f[sel34,3], f[sel34,4], xlab = "F3", ylab = "F4", type = "n", asp = 1)
+text(f[sel34,3], f[sel34,4], names(lbl_act_dic)[sel34])
+```
+
+```{r}
+sel56 <- sel[,5] | sel[,6]
+par(pty = "s")
+plot(f[sel56,5], f[sel56,6], xlab = "F5", ylab = "F6", type = "n", asp = 1)
+text(f[sel56,5], f[sel56,6], names(lbl_act_dic)[sel56])
+```
+
diff --git a/ca_experiment/ca_experiment.Rproj b/ca_experiment/ca_experiment.Rproj
new file mode 100644
index 0000000..367ad5b
--- /dev/null
+++ b/ca_experiment/ca_experiment.Rproj
@@ -0,0 +1,15 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
+
+SpellingDictionary: en_US
diff --git a/ca_experiment/clust.R b/ca_experiment/clust.R
new file mode 100644
index 0000000..5f58ac1
--- /dev/null
+++ b/ca_experiment/clust.R
@@ -0,0 +1,42 @@
+# `kcuts` cuts variable `x` into `centers` categories with k-means.
+kcuts <- function(x, centers)
+{
+  km <- kmeans(x = x, centers = centers)
+  cuts <- unlist(lapply(order(km$centers), function(clustId) { 
+    min(x[km$cluster == clustId]) }))
+  cuts <- c(cuts, max(x))
+}
+
+# ventilate modality `mod` of categorical var `cat` of dataframe `dat`
+# into the remaining modalities according to their relative frequencies.
+ventilate <- function(cat, mod)
+{
+  if (mod == "NA") isna <- TRUE else isna <- FALSE
+  tab <- table(cat)
+  if (isna)
+  {
+    sup_i <- which(is.na(cat))
+    sup_n <- length(sup_i)
+    act_mod <- tab
+  }
+  else
+  {
+    sup_i <- which(cat == mod)
+    sup_n <- tab[mod]
+    act_mod <- tab[! names(tab) %in% mod] 
+  }
+  prob <- act_mod / sum(act_mod)
+  smpl <- sample(names(prob), size = sup_n, replace = TRUE, prob = prob)
+  r <- list(sup_mod = mod,
+            sup_n = sup_n,
+            sup_i = sup_i,
+            smpl = smpl)
+  return(r)
+}
+
+# Plot axes
+plaxes <- function(a,b)
+{
+  segments(  min(a),0,  max(a),0  )
+  segments(  0,min(b),  0,max(b)  )
+}
\ No newline at end of file