Flag or remove outliers in movement data based on joint movement probabilities

Detects outliers in a move2 object by computing joint probabilities from the empirical distributions of step lengths, turning angles, and their consecutive changes. Locations that fall in low-probability regions of this joint space are flagged as potential outliers.

Usage

mt_flag_outliers(
  x,
  threshold = NULL,
  prob_type = "joint",
  remove = FALSE,
  plot = TRUE,
  drop_na = FALSE,
  autodiff_alpha = "acf",
  method = "histogram",
  iterations = 1,
  quality_columns = NULL,
  time_normalize = TRUE,
  threshold_type = "gap",
  step_transform = c("none", "log"),
  step_floor = 0,
  reference = NULL,
  pool_by = NULL,
  silent = FALSE
)

Arguments

x

A move2 object. Must contain at least 3 non-empty locations. Coordinates may be lon/lat or projected; projected input is transformed internally to WGS84 lon/lat for the turning-angle computation (see move2::mt_azimuth) and the result is returned in the original CRS.

threshold

Numeric, or NULL for automatic defaults. When NULL, the default for the chosen threshold_type is used. Interpretation depends on threshold_type:

With "gap" (default): how many times the local spacing a gap in the sorted log-probabilities must exceed to be considered a natural break. Default is 3. Higher values are more conservative.

With "entropy": the maximum allowed valley-to-peak density ratio in the KDE of log-probabilities; a valley deeper than this declares an outlier regime. Default is 0.3 (unified package-wide entropy default; see mt_flag_outliers_bridge for the same value's rationale). Higher values admit shallower valleys (more sensitive); lower values demand a deeper split.

With "significance": the significance level for flagging based on robust z-scores of log-probabilities. Default is 0.001.

With "percentile": the bottom fraction to flag (e.g. 0.001 flags the bottom 0.1 percent). Default is 0.001.
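As a rough illustration of how the "significance" and "percentile" thresholds act on log-probabilities, the base-R sketch below uses simulated values. The variable names, the one-sided normal test, and the fraction of 0.01 are assumptions made for the example; this is not the package's internal code.

```r
# Hypothetical log-probabilities: 200 typical fixes plus 3 very unlikely ones.
set.seed(1)
log_p <- c(rnorm(200, mean = -5, sd = 1), -25, -28, -30)

# "significance": robust z-score from the median and MAD, lower tail only.
alpha <- 0.001
z <- (log_p - median(log_p)) / mad(log_p)
flag_sig <- pnorm(z) < alpha

# "percentile": always flags the bottom fraction, whatever the data look like.
frac <- 0.01
flag_pct <- log_p <= quantile(log_p, probs = frac)

which(flag_sig)   # indices flagged by the robust z-score rule
sum(flag_pct)     # number of fixes flagged by the percentile rule
```

The percentile rule would flag the same number of fixes on a perfectly clean track, which is why it is the least data-adaptive of the options.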

prob_type

Character. Which probability to use for outlier detection. One of "joint" (default), "step_turn", "delta_step", "delta_turn", or "custom" (product of step_turn and delta_step only).

remove

Logical. If TRUE, return the object with outliers removed. If FALSE (default), return the original object with outlier flags and probabilities added as columns.

plot

Logical. If TRUE (default), create a two-panel diagnostic plot showing probability-coloured locations and flagged outliers.

drop_na

Logical. If TRUE, also remove locations where the probability could not be calculated (e.g. first/last locations). Default is FALSE.

autodiff_alpha

Exponent on the auto-difference terms in the joint probability (Equation 1 in the paper). Controls how much weight the persistence components (delta-speed, delta-angular velocity) carry relative to the step–turn component.

The default "acf" derives alpha from the data: the lag-1 autocorrelations of speed ($r_v$) and angular velocity ($r_\omega$) are computed, and $\alpha = \sqrt{r_v \cdot r_\omega}$. High autocorrelation means persistence is informative and auto-differences should carry weight; low autocorrelation means they are noise and should be downweighted.

Numeric values override the ACF estimate: 0 ignores auto-differences, 0.5 applies a geometric mean, 1.0 gives the simple product. The string "auto" selects alpha by maximising the trimmed mean of log-probabilities (legacy mode).
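A minimal base-R sketch of the ACF-derived alpha, using simulated series in place of real speed and angular velocity. The helper lag1() and the clamping of negative autocorrelations to zero are assumptions for this example; how the package handles negative values is not stated here.

```r
set.seed(42)
n <- 300
# Hypothetical persistent series (AR(1) processes) standing in for
# speed and angular velocity along a track.
speed   <- as.numeric(arima.sim(list(ar = 0.7), n = n)) + 10
ang_vel <- as.numeric(arima.sim(list(ar = 0.4), n = n))

# Lag-1 autocorrelation; element 1 of $acf is lag 0, element 2 is lag 1.
lag1 <- function(v) {
  acf(v, lag.max = 1, plot = FALSE, na.action = na.pass)$acf[2]
}
r_v <- lag1(speed)
r_w <- lag1(ang_vel)

# Assumption: clamp negative autocorrelations to 0 so the root is defined.
alpha <- sqrt(max(r_v, 0) * max(r_w, 0))
alpha
```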

method

Character. Method for computing step-turn probabilities. One of "histogram" (default, 2D histogram with bilinear interpolation) or "copula" (parametric marginal distributions: Weibull for step lengths, von Mises for turning angles). The copula method requires the circular and MASS packages.

iterations

Integer. Number of iterative refinement passes. Default is 1 (no iteration). When greater than 1, after each pass flagged outliers are masked and movement metrics are recomputed on the cleaned track. This helps detect consecutive outliers. Iteration stops early if no new outliers are found.

quality_columns

Named list of functions, or NULL (default). Each name must be a column in x, and each function maps the raw column values to a [0,1] quality score. The product of all quality scores multiplies the joint probability before thresholding. Example:

quality_columns = list(
  "gps.satellite.count" = function(s) pnorm(s, mean=7, sd=2),
  "gps.hdop" = function(h) 1 - pnorm(h, mean=3, sd=1.5)
)

time_normalize

Logical. If TRUE (default), use speed (step_length / time_lag) and angular velocity (turning_angle / time_lag) instead of raw step lengths and turning angles. This makes the method time-aware: the same displacement over different time intervals produces different probabilities. For regular data, dividing by a constant time lag simply rescales uniformly and does not change relative probabilities. Set to FALSE only if the data has no meaningful time information. Zero time lags (duplicate timestamps) are not permitted and will raise an error – clean duplicates before running outlier detection.

threshold_type

Character. One of "gap" (default), "entropy", "significance", or "percentile".

The gap method uses a two-stage approach: (1) a broken stick null model (MacArthur 1957) screens for candidate breaks by comparing observed gap sizes in the sorted log-probabilities to their expected sizes under a single continuous distribution; (2) a tail-decay inflection analysis tracks how the tail shortens as points are removed from the extreme left: genuine outliers cause steep drops, while entering the bulk distribution produces small, linear changes. The inflection point where the second derivative of the tail-length curve drops to the noise floor determines the natural boundary. No distributional assumptions are made, and clean data produces no or very few outliers. Default threshold is 3.

The entropy method estimates the density of log-probabilities via KDE and searches for the deepest local minimum (valley) below the main mode; a fix is flagged when its log-probability sits below that valley. The default threshold = 0.3 is the unified package-wide maximum valley-to-peak density ratio (see mt_flag_outliers_bridge for the same value's rationale and the Raven-sweep validation). Returns no outliers on clean unimodal data.

The significance method uses robust z-scores (median + MAD) on log-probabilities, assuming approximate normality. It can over-flag in left-skewed distributions.

The percentile method always flags the bottom fraction regardless of data quality and does not converge during iteration.
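The valley search described for the entropy method can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the log-probabilities are simulated, the bandwidth is fixed at 1, and local minima are found with a simple sign-change test on the density curve.

```r
set.seed(7)
log_p <- c(rnorm(500, mean = -5,  sd = 1),   # bulk of the track
           rnorm(15,  mean = -20, sd = 1))   # hypothetical outlier regime

d <- density(log_p, bw = 1)   # fixed bandwidth is an assumption of this sketch
peak <- which.max(d$y)        # main mode

# Interior local minima of the density curve, kept only left of the main mode.
minima <- which(diff(sign(diff(d$y))) > 0) + 1
minima <- minima[minima < peak]

is_outlier <- rep(FALSE, length(log_p))
if (length(minima) > 0) {
  valley <- minima[which.min(d$y[minima])]   # deepest valley
  ratio  <- d$y[valley] / d$y[peak]          # valley-to-peak density ratio
  if (ratio < 0.3) is_outlier <- log_p < d$x[valley]
}
sum(is_outlier)
```

On unimodal data no interior minimum exists to the left of the mode, so nothing is flagged, which mirrors the behaviour described above for clean tracks.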

step_transform

Character. Controls the axis on which the 2D turn/step histogram is built when method = "histogram". "none" (default) uses the raw step length and preserves the joint turn/step structure that the synthetic benchmark was validated on. "log" applies log(1 + step) to the y-axis; useful only for tracks with a pathologically heavy step-length tail (e.g. teleport-class GPS errors spanning several orders of magnitude) where raw-scale binning collapses real movement into a single row of the histogram. The log transform is a monotonic, invertible change of variable, not a distributional assumption — the density is still estimated non-parametrically from the transformed data. Note: log-transform can hide physiologically-plausible joint outliers (wrong-turn / impossible acceleration at reasonable step lengths), so it is not the default. If you pass a track whose diff(range(step))/IQR(step) is very large, the function emits a suggestion to try "log". Ignored when method = "copula", where the parametric Weibull marginal handles the tail on its own.

step_floor

Numeric, non-negative. Minimum absolute step length (metres) for a rate-flagged fix to actually be flagged as an outlier. The default of 0 disables the floor and reproduces the pre-2026 rate-only behaviour. Set to a positive value (commonly 5–25 m, or your device's nominal accuracy) to add a two-axis criterion: a fix is flagged only where the joint-probability threshold is tripped and the absolute step length exceeds this floor. Recommended for real-world data with burst-mode sampling, where a few metres of GPS jitter over a 1-second time lag produces a "high speed" that is not a physical outlier. The floor is off by default so that existing synthetic-ground-truth benchmarks (which can include sub-noise displacement outliers by construction) continue to pass. Applied after thresholding, before iterative refinement.
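The two-axis criterion amounts to a logical AND between the rate-based flag and a step-length test. A small sketch with made-up values (step_m and rate_flag are hypothetical vectors, not columns produced by the function):

```r
# Hypothetical per-fix values: step length in metres and the flag obtained
# from thresholding the joint probability alone.
step_m    <- c(120, 3, 850, 8, 40)
rate_flag <- c(FALSE, TRUE, TRUE, TRUE, FALSE)

step_floor <- 10
is_outlier <- rate_flag & step_m > step_floor
is_outlier
# FALSE FALSE  TRUE FALSE FALSE
```

The 3 m and 8 m fixes trip the rate threshold but stay below the floor, so only the 850 m step remains flagged.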

reference

Optional move2 object from which to build the probability surfaces. If NULL (default), each individual's own data are used. Supply a longer or cleaner track to improve distributions for short or contaminated tracks. To pool all individuals into a single reference distribution, pass reference = x. Mutually exclusive with pool_by.

pool_by

Optional character vector of length 1 or 2 naming column(s) in mt_track_data(x). NULL (default) preserves per-track behaviour. Mutually exclusive with reference.

Length 1: single column used as the pool source (e.g. "individual_id").

Length 2: c(outer, inner) where outer names the fit-source column – the union of its events supplies one (step, turn, delta_step, delta_turn, gaps) reference distribution per outer group, injected into each member track's per-track dispatch. Length 2 requires strict nesting: every distinct inner value must map to exactly one outer value. Length $> 2$ is rejected.

Note: for the probability primitive the pool is integrated (acts through the per-track dispatcher) rather than post-hoc, so inner has no role here – it is validated for consistency with the orchestrator but does not affect the prob primitive's flagging beyond what outer drives.

NA values in the named column(s) cause those tracks to fall back to per-track processing with a warning.
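The strict-nesting requirement for a length-2 pool_by can be checked before calling the function. The sketch below uses a hand-built table in place of mt_track_data(x); the column names colony and animal and the helper is_strictly_nested() are hypothetical.

```r
# Hypothetical track-level table, standing in for mt_track_data(x).
track_data <- data.frame(
  track  = c("t1", "t2", "t3", "t4"),
  colony = c("A", "A", "B", "B"),      # outer: fit-source column
  animal = c("a1", "a1", "b1", "b2")   # inner
)

# TRUE when every distinct inner value maps to exactly one outer value.
is_strictly_nested <- function(td, outer, inner) {
  n_outer <- tapply(td[[outer]], td[[inner]], function(v) length(unique(v)))
  all(n_outer == 1)
}

ok <- is_strictly_nested(track_data, "colony", "animal")

track_data$animal[4] <- "a1"   # a1 now appears in colonies A and B
broken <- is_strictly_nested(track_data, "colony", "animal")

c(ok = ok, broken = broken)
```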

silent

Logical. If FALSE (default) the function prints a brief running narration (per-iteration counts, threshold diagnostics, final summary). Set TRUE to suppress. Errors and warnings are always shown.

Value

A move2 object. If remove = FALSE, the following columns are added: step_turn_prob (probability from the 2D step/turn histogram), delta_step_prob (probability of the change in step length), delta_turn_prob (probability of the change in turning angle), joint_prob (product of all three), outlier_percentile (0–100, higher = more unusual), is_outlier (logical flag), and is_na_prob (logical, TRUE where probability could not be calculated). If remove = TRUE, outlier rows (and optionally NA rows) are removed.

Details

The method builds three probability components for each location: (1) step-turn probability, from a 2D histogram of step lengths and turning angles with circular wrapping and bilinear interpolation; (2) delta-step probability, from the kernel density of changes in step length between consecutive steps; and (3) delta-turn probability, from the kernel density of changes in turning angle between consecutive steps.

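The joint probability is the product of all three, with the two auto-difference components weighted by the exponent autodiff_alpha. Locations whose probability (as selected by prob_type) falls below the cutoff determined by threshold_type and threshold are flagged as outliers.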
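How the exponent on the auto-difference terms changes the joint probability can be sketched with a few made-up values. The functional form used here, the step-turn probability times the product of the two delta terms raised to alpha, is an assumption chosen to match the described behaviour at alpha = 0, 0.5 and 1; the exact form is given by Equation 1 of the companion paper.

```r
# Hypothetical component probabilities for four fixes; fix 3 is unusual.
p_step_turn  <- c(0.20, 0.15, 0.001, 0.18)
p_delta_step <- c(0.30, 0.25, 0.002, 0.28)
p_delta_turn <- c(0.40, 0.35, 0.010, 0.33)

# Assumed form: auto-difference terms enter with exponent alpha.
joint <- function(alpha) p_step_turn * (p_delta_step * p_delta_turn)^alpha

joint(0)     # auto-differences ignored: equals p_step_turn
joint(0.5)   # geometric mean of the two delta terms
joint(1)     # simple product of all three components
```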

When the input contains multiple individuals, each is processed separately by default. To build a pooled reference distribution from all individuals, pass reference = x. To use an external clean track as the reference, pass it via reference.

Two methods are available for computing the step-turn probability: "histogram" (default) uses a 2D histogram with circular wrapping and bilinear interpolation; "copula" fits parametric marginal distributions (Weibull for step lengths, von Mises for turning angles) and uses the product of marginal densities. The copula method is faster and can work better with small samples.

When iterations > 1, the detection runs iteratively: after each pass, flagged outliers are masked (removed from the track) and movement metrics are recomputed on the cleaned track. This allows detection of consecutive outliers, because removing the first outlier reveals the true step from the last good location to the next good location. Iteration stops after the specified number of passes or when no new outliers are found, whichever comes first. All flags are mapped back to the original object.

When quality_columns is provided, each quality function maps a raw data column to a [0,1] quality score. The product of all quality scores multiplies the movement probability before thresholding: $$P_{\text{final}} = P_{\text{movement}} \times \prod_{k} q_k$$ where $q_k$ is the score returned by the $k$-th quality function.
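The weighting can be traced by hand on a small table. In this sketch the data frame fixes and its p_movement column are hypothetical stand-ins for the internal movement probability; the quality functions are the ones shown for the quality_columns argument.

```r
quality_columns <- list(
  "gps.satellite.count" = function(s) pnorm(s, mean = 7, sd = 2),
  "gps.hdop"            = function(h) 1 - pnorm(h, mean = 3, sd = 1.5)
)

# Hypothetical fixes: same movement probability, decreasing fix quality.
fixes <- data.frame(
  p_movement          = c(0.02, 0.02, 0.02),
  gps.satellite.count = c(10, 7, 3),
  gps.hdop            = c(1.0, 3.0, 8.0)
)

# One column of scores per quality function, one row per fix.
scores <- mapply(function(f, col) f(fixes[[col]]),
                 quality_columns, names(quality_columns))
weight  <- apply(scores, 1, prod)      # product of quality scores per fix
p_final <- fixes$p_movement * weight
p_final
```

With identical movement probabilities, the fix with few satellites and a high HDOP ends up with the lowest final probability and is therefore the most likely to be flagged.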

When time_normalize = TRUE, the method uses speed (step_length / time_lag) and angular velocity (turning_angle / time_lag) instead of raw step lengths and turning angles. This makes the method time-aware, which is recommended for irregularly sampled data.
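A toy example of why the normalisation matters for irregular sampling: the same displacement and turn yield very different rates depending on the time lag. The vectors below are made up, and the units (metres, radians, seconds) are assumptions for the illustration.

```r
# Hypothetical irregular track: identical 500 m steps, different time lags.
step_m   <- c(500, 500, 500)
turn_rad <- c(0.2, 0.2, 0.2)
lag_s    <- c(3600, 600, 10)

# Zero time lags would give infinite rates, so they are rejected up front.
if (any(lag_s <= 0)) stop("zero time lags: remove duplicate timestamps first")

speed   <- step_m / lag_s     # m/s
ang_vel <- turn_rad / lag_s   # rad/s

data.frame(lag_s, speed, ang_vel)
```

A 500 m step over an hour is unremarkable, whereas the same step in 10 seconds corresponds to 50 m/s and would sit far out in the tail of the speed distribution.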

References

Safi, K. (in preparation). Self-thresholding hierarchical outlier-detection for animal movement tracks. Companion paper to the move2utils R package. Preprint: bioRxiv (DOI forthcoming).

Examples

if (FALSE) { # \dontrun{
library(move2)

## load example data
fishers <- mt_read(mt_example())
fishers <- fishers[!sf::st_is_empty(fishers), ]

## flag outliers per individual (automatic when multiple IDs present)
result <- mt_flag_outliers(fishers)

## remove outliers, single individual
leroy <- fishers[mt_track_id(fishers) == "M4", ]
cleaned <- mt_flag_outliers(leroy, remove = TRUE)

## copula method (faster, works well with small samples)
result_cop <- mt_flag_outliers(leroy, method = "copula")

## pooled reference: score against distribution from all individuals
result_pop <- mt_flag_outliers(fishers, reference = fishers)

## default: ACF-derived alpha (adapts to the data)
result_acf <- mt_flag_outliers(leroy)

## manual override: ignore auto-differences
result_a0 <- mt_flag_outliers(leroy, autodiff_alpha = 0)

## manual override: full product (no downweighting)
result_a1 <- mt_flag_outliers(leroy, autodiff_alpha = 1.0)

## iterative refinement for consecutive outliers
result_iter <- mt_flag_outliers(leroy, iterations = 3)

## quality weighting with satellite count and HDOP
result_qc <- mt_flag_outliers(leroy, quality_columns = list(
  "gps.satellite.count" = function(s) pnorm(s, mean = 7, sd = 2),
  "gps.hdop" = function(h) 1 - pnorm(h, mean = 3, sd = 1.5)
))

## raw step/turn metrics (without time normalisation)
result_raw <- mt_flag_outliers(leroy, time_normalize = FALSE)
} # }