The isoForest package is a simple reimplementation of the Isolation Forest algorithm for outlier detection; it uses the ranger package to build the forests. It also provides visualizations of outliers to make the prediction results easier to inspect.
Installation
# Development version
devtools::install_github("flystar233/isoForest")
Feature Contribution Analysis
The feature_contribution() function helps you understand which features contribute most to a sample’s anomaly score. This is crucial for interpreting anomaly detection results and understanding why certain samples are flagged as outliers.
Methods Available
- Path-based analysis (default): Analyzes decision paths in isolation trees to determine feature importance
- Permutation importance: Measures how much each feature affects the anomaly score when its values are randomly permuted
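To make the permutation idea concrete, here is a minimal base R sketch. It is not the package's implementation: `score_fn` is a stand-in scorer (Mahalanobis distance) and `permutation_contribution` is a hypothetical helper, both introduced only for illustration. The sketch perturbs one feature of a single sample with values drawn from the data and records how much the score moves on average.

```r
# Stand-in scoring function (an assumption for illustration): Mahalanobis
# distance from the column means, NOT the isolation forest anomaly score.
score_fn <- function(x, ref) {
  mahalanobis(x, center = colMeans(ref), cov = cov(ref))
}

# Hypothetical helper: for one sample, replace a single feature with values
# drawn from the data and record the mean absolute change in score.
permutation_contribution <- function(data, sample_id, n_permutations = 50) {
  base_score <- score_fn(data[sample_id, , drop = FALSE], data)
  sapply(names(data), function(feature) {
    drawn <- sample(data[[feature]], n_permutations, replace = TRUE)
    perturbed <- data[rep(sample_id, n_permutations), , drop = FALSE]
    perturbed[[feature]] <- drawn
    mean(abs(score_fn(perturbed, data) - base_score))
  })
}

set.seed(1)
contrib <- permutation_contribution(iris[1:4], sample_id = 42)
round(contrib / sum(contrib), 3)  # normalized share per feature
```

A larger share means that changing that feature moves the sample's score more, which is the intuition behind the permutation method described above.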
Basic Usage
# Train isolation forest
model <- isoForest(iris[1:4])
# Analyze feature contributions for anomalous samples
contributions <- feature_contribution(model, data = iris[1:4])
print(contributions)
# Analyze specific samples
contributions <- feature_contribution(model,
                                      sample_ids = c(42, 107, 119),
                                      data = iris[1:4])
print(contributions)
Using Different Methods
# Path-based analysis (default, faster)
path_contributions <- feature_contribution(model,
                                           sample_ids = c(1, 50),
                                           data = iris[1:4],
                                           method = "path")
# Permutation importance (more accurate but slower)
perm_contributions <- feature_contribution(model,
                                           sample_ids = c(1, 50),
                                           data = iris[1:4],
                                           method = "permutation",
                                           n_permutations = 50)
Anomaly Threshold Setting
The package provides multiple methods for setting anomaly detection thresholds. Instead of choosing a threshold by hand, you can use statistical and geometric methods to derive one automatically from the distribution of anomaly scores.
Available Methods
| Method | Description | Best For |
|---|---|---|
| contamination | Set threshold based on expected outlier proportion | Known anomaly rate |
| quantile | Use a specific quantile as threshold | Percentile-based detection |
| iqr | Interquartile range (Q3 + 1.5×IQR) | Box-plot style analysis |
| zscore | Z-score based (mean + 2×sd) | Normal distributions |
| mad | Median Absolute Deviation | Robust, symmetric distributions |
| kde_weighted | KDE-weighted mean (density-weighted robust mean) | Heavy tails, extreme outliers |
| mtt | Modified Thompson Tau test | Small to medium samples |
| manual | User-specified threshold | Custom requirements |
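The closed-form rules in the table can be sketched in base R on a plain vector of scores. This is an illustration under stated assumptions, not the package's internal code: `scores` is simulated stand-in data, the 0.95 quantile corresponds to a 5% contamination rate, and the `mad` rule is assumed here to be median + 3 × MAD (the table does not spell out the formula).

```r
# Stand-in anomaly scores (an assumption): a bulk of typical scores
# plus three clearly larger ones.
set.seed(42)
scores <- c(runif(97, min = 0.40, max = 0.55), 0.70, 0.75, 0.80)

# contamination / quantile: flag roughly the top 5% of scores
thr_quantile <- unname(quantile(scores, probs = 0.95))

# iqr: upper box-plot fence, Q3 + 1.5 * IQR
thr_iqr <- unname(quantile(scores, probs = 0.75)) + 1.5 * IQR(scores)

# zscore: mean + 2 * sd
thr_zscore <- mean(scores) + 2 * sd(scores)

# mad: median + 3 * MAD (assumed form; R's mad() applies the 1.4826
# consistency constant by default)
thr_mad <- median(scores) + 3 * mad(scores)

thresholds <- c(quantile = thr_quantile, iqr = thr_iqr,
                zscore = thr_zscore, mad = thr_mad)

# Number of scores each rule would flag as anomalous
sapply(thresholds, function(thr) sum(scores > thr))
```

The quantile rule always flags a fixed share of the data, while the other three flag only scores that sit far from the bulk, so they can return few or no anomalies on clean data.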
Quick Start
# Load the package and train the model
library(isoForest)
model <- isoForest(iris[1:4])
# Method 1: Contamination-based (most common)
result <- set_anomaly_threshold(model, method = "contamination", contamination = 0.05)
print(result)
# Get anomalous samples
anomalies <- iris[result$predictions$is_anomaly, ]
head(anomalies)
Robust Methods
For data with extreme outliers or heavy-tailed distributions:
# KDE-weighted method (density-aware, highly robust)
result_kde <- set_anomaly_threshold(model, method = "kde_weighted", kde_multiplier = 3)
# MAD method (robust and fast)
result_mad <- set_anomaly_threshold(model, method = "mad", mad_multiplier = 3)
# Compare results
cat("KDE-weighted detected:", sum(result_kde$predictions$is_anomaly), "anomalies\n")
cat("MAD detected:", sum(result_mad$predictions$is_anomaly), "anomalies\n")
Statistical Testing
For small to medium sample sizes, where a test with an explicit significance level is preferred:
# Modified Thompson Tau test
result_mtt <- set_anomaly_threshold(
  model,
  method = "mtt",
  mtt_alpha = 0.05,   # Significance level
  mtt_max_iter = 30   # Maximum iterations
)
# Adjust sensitivity
result_strict <- set_anomaly_threshold(model, method = "mtt", mtt_alpha = 0.01) # More conservative
result_loose <- set_anomaly_threshold(model, method = "mtt", mtt_alpha = 0.10) # More sensitive
Feature Distribution Visualization
The package provides visualization tools that show how anomalies differ from normal data across features.
Visualizing Single Anomaly with Contribution Analysis
When you want to understand why a specific sample is anomalous:
# Calculate feature contributions
model <- isoForest(iris[1:4])
contributions <- feature_contribution(model, sample_ids = 42, data = iris[1:4])
# Single boxplot view (shows top contributing features)
plot_anomaly_boxplot(contributions, iris[1:4], sample_id = 42, top_n = 5)
# Faceted view (better for many features)
plot_anomaly_boxplot_faceted(contributions, iris[1:4], sample_id = 42, top_n = 8)
Visualizing Multiple Anomalies (Without Contribution Analysis)
When you want to see where all detected anomalies fall in the feature distributions:
# Detect anomalies using threshold
data <- read.csv('test.csv')
model2 <- isoForest(data)
result <- set_anomaly_threshold(model2, method = "mtt", mtt_alpha = 0.05)
anomaly_ids <- which(result$predictions$is_anomaly)
# Visualize all anomalies at once
plot_anomaly_boxplot(
  contribution_obj = NULL,   # No contribution object needed
  data = data,
  sample_id = anomaly_ids    # Can be a vector of IDs
)
# Faceted view (recommended for multiple features)
plot_anomaly_boxplot_faceted(
  contribution_obj = NULL,
  data = data,
  sample_id = anomaly_ids,
  top_n = NULL   # Show all features
)
High-Dimensional Data Visualization
For high-dimensional data (>4 features), visualize anomalies in 2D using dimensionality reduction:
# model2 and data are the model and data frame from the previous example
# PCA projection (fast, interpretable)
plot_anomaly_projection(model2, data, dim_reduction = "pca")
# UMAP projection (better for non-linear patterns, requires 'umap' package)
plot_anomaly_projection(model2, data, dim_reduction = "umap")
# Compare both methods side-by-side (requires 'umap' and 'gridExtra' packages)
plot_anomaly_projection_all(model2, data)
Features:
- Anomalies highlighted in red, normal points in blue
- Smart sampling for large datasets (preserves all anomalies)
- Adjust sampling: sample_rate = 0.05 (default, anomalies = 5% of display)
See ?plot_anomaly_projection for more details.
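For readers who want to see what a PCA projection of this kind involves, the following base R sketch reproduces the idea without the package. It is an approximation under stated assumptions: the anomaly flags come from a stand-in rule (the 5% of rows with the largest Mahalanobis distance), not from an isolation forest, and the plot is a plain scatter plot rather than the output of plot_anomaly_projection().

```r
# Stand-in anomaly flags (an assumption): mark the rows farthest from the
# data centre by Mahalanobis distance, not by isolation forest scores.
x <- iris[1:4]
d <- mahalanobis(x, center = colMeans(x), cov = cov(x))
is_anomaly <- d > quantile(d, probs = 0.95)

# Project the centred and scaled features onto the first two principal components
pca <- prcomp(x, center = TRUE, scale. = TRUE)
proj <- pca$x[, 1:2]

# Anomalies in red, normal points in blue
plot(proj, col = ifelse(is_anomaly, "red", "blue"), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA projection (sketch)")
legend("topright", legend = c("anomaly", "normal"),
       col = c("red", "blue"), pch = 19)
```

Because PCA is linear, points that are anomalous only along non-linear structure may not stand out in this view, which is the case the UMAP option is meant to cover.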
Basic Visualization
result <- isoForest(iris[1:2])
plot_anomaly_basic(result, iris[1:2], plot_type = "heatmap")