Flag or remove outliers in movement data based on joint movement probabilities
Source: R/mt_flag_outliers.R
Detects outliers in a move2 object by computing joint probabilities
from the empirical distributions of step lengths, turning angles, and their
consecutive changes. Locations that fall in low-probability regions of this
joint space are flagged as potential outliers.
Usage
mt_flag_outliers(
x,
threshold = NULL,
prob_type = "joint",
remove = FALSE,
plot = TRUE,
drop_na = FALSE,
autodiff_alpha = "acf",
method = "histogram",
iterations = 1,
quality_columns = NULL,
time_normalize = TRUE,
threshold_type = "gap",
step_transform = c("none", "log"),
step_floor = 0,
reference = NULL,
pool_by = NULL,
silent = FALSE
)
Arguments
- x
A move2 object. Must contain at least 3 non-empty locations. Either lon/lat or projected; projected input is transformed internally to WGS84 lon/lat for the turning-angle computation (see move2::mt_azimuth) and the result is returned in the original CRS.
- threshold
Numeric, or NULL for automatic defaults. Interpretation depends on threshold_type:
With "gap" (default): how many times the local spacing a gap in the sorted log-probabilities must exceed to be considered a natural break. Default is 3. Higher values are more conservative.
With "entropy": the maximum allowed valley-to-peak density ratio in the KDE of log-probabilities; a valley deeper than this declares an outlier regime. Default is 0.3 (unified package-wide entropy default; see mt_flag_outliers_bridge for the same value's rationale). Higher values admit shallower valleys (more sensitive); lower values demand a deeper split.
With "significance": the significance level for flagging based on robust z-scores of log-probabilities. Default is 0.001.
With "percentile": the bottom fraction to flag (e.g. 0.001 flags the bottom 0.1 percent). Default is 0.001.
When NULL, the default for the chosen threshold_type is used.
- prob_type
Character. Which probability to use for outlier detection. One of "joint" (default), "step_turn", "delta_step", "delta_turn", or "custom" (product of step_turn and delta_step only).
- remove
Logical. If TRUE, return the object with outliers removed. If FALSE (default), return the original object with outlier flags and probabilities added as columns.
- plot
Logical. If TRUE (default), create a two-panel diagnostic plot showing probability-coloured locations and flagged outliers.
- drop_na
Logical. If TRUE, also remove locations where the probability could not be calculated (e.g. first/last locations). Default is FALSE.
- autodiff_alpha
Exponent on the auto-difference terms in the joint probability (Equation 1 in the paper). Controls how much weight the persistence components (delta-speed, delta-angular velocity) carry relative to the step–turn component.
The default "acf" derives alpha from the data: the lag-1 autocorrelation of speed (\(r_v\)) and angular velocity (\(r_\omega\)) are computed, and \(\alpha = \sqrt{r_v \cdot r_\omega}\). High autocorrelation means persistence is informative and auto-differences should carry weight; low autocorrelation means they are noise and should be downweighted.
Numeric values override the ACF estimate: 0 ignores auto-differences, 0.5 applies a geometric mean, 1.0 gives the simple product. The string "auto" selects alpha by maximising the trimmed mean of log-probabilities (legacy mode).
- method
Character. Method for computing step-turn probabilities. One of "histogram" (default, 2D histogram with bilinear interpolation) or "copula" (parametric marginal distributions: Weibull for step lengths, von Mises for turning angles). The copula method requires the circular and MASS packages.
- iterations
Integer. Number of iterative refinement passes. Default is 1 (no iteration). When greater than 1, after each pass flagged outliers are masked and movement metrics are recomputed on the cleaned track. This helps detect consecutive outliers. Iteration stops early if no new outliers are found.
- quality_columns
Named list of functions, or NULL (default). Each name must be a column in x, and each function maps the raw column values to a [0,1] quality score. The product of all quality scores multiplies the joint probability before thresholding. Example: see the quality weighting call (satellite count and HDOP) under Examples.
- time_normalize
Logical. If TRUE (default), use speed (step_length / time_lag) and angular velocity (turning_angle / time_lag) instead of raw step lengths and turning angles. This makes the method time-aware: the same displacement over different time intervals produces different probabilities. For regular data, dividing by a constant time lag simply rescales uniformly and does not change relative probabilities. Set to FALSE only if the data has no meaningful time information. Zero time lags (duplicate timestamps) are not permitted and will raise an error; clean duplicates before running outlier detection.
- threshold_type
Character. One of "gap" (default), "entropy", "significance", or "percentile".
The gap method uses a two-stage approach: (1) a broken stick null model (MacArthur 1957) screens for candidate breaks by comparing observed gap sizes in the sorted log-probabilities to their expected sizes under a single continuous distribution; (2) a tail-decay inflection analysis tracks how the tail shortens as points are removed from the extreme left: genuine outliers cause steep drops, while entering the bulk distribution produces small, linear changes. The inflection point where the second derivative of the tail-length curve drops to the noise floor determines the natural boundary. No distributional assumptions are made, and clean data produces no or very few outliers. Default threshold is 3.
The entropy method estimates the density of log-probabilities via KDE and searches for the deepest local minimum (valley) below the main mode; a fix is flagged when its log-probability sits below that valley. The default threshold = 0.3 is the unified package-wide maximum valley-to-peak density ratio (see mt_flag_outliers_bridge for the same value's rationale and the Raven-sweep validation). Returns no outliers on clean unimodal data.
The significance method uses robust z-scores (median + MAD) on log-probabilities, assuming approximate normality. It can over-flag in left-skewed distributions.
The percentile method always flags the bottom fraction regardless of data quality and does not converge during iteration.
- step_transform
Character. Controls the axis on which the 2D turn/step histogram is built when method = "histogram". "none" (default) uses the raw step length and preserves the joint turn/step structure that the synthetic benchmark was validated on. "log" applies log(1 + step) to the y-axis; useful only for tracks with a pathologically heavy step-length tail (e.g. teleport-class GPS errors spanning several orders of magnitude) where raw-scale binning collapses real movement into a single row of the histogram. The log transform is a monotonic, invertible change of variable, not a distributional assumption; the density is still estimated non-parametrically from the transformed data. Note: the log transform can hide physiologically plausible joint outliers (wrong-turn / impossible acceleration at reasonable step lengths), so it is not the default. If you pass a track whose diff(range(step))/IQR(step) is very large, the function emits a suggestion to try "log". Ignored when method = "copula", where the parametric Weibull marginal handles the tail on its own.
- step_floor
Numeric, non-negative. Minimum absolute step length (metres) for a rate-flagged fix to actually be flagged as an outlier. Default 0 (disabled), which is the pre-2026 rate-only behaviour. Set to a positive value (commonly 5–25 m, or your device's nominal accuracy) to add a two-axis criterion: a fix is flagged only where both the joint-probability threshold is tripped AND the absolute step length exceeds this floor. Recommended for real-world data with burst-mode sampling, where a few-metre GPS jitter over a 1-second dt produces a "high speed" that is not a physical outlier. Not on by default so existing synthetic-ground-truth benchmarks (which can include sub-noise displacement outliers by construction) continue to pass. Applied after thresholding, before iterative refinement.
- reference
Optional move2 object from which to build the probability surfaces. If NULL (default), each individual's own data are used. Supply a longer or cleaner track to improve distributions for short or contaminated tracks. To pool all individuals into a single reference distribution, pass reference = x. Mutually exclusive with pool_by.
- pool_by
Optional character vector of length 1 or 2 naming column(s) in mt_track_data(x). Length 1: single column used as the pool source (e.g. "individual_id"). Length 2: c(outer, inner) where outer names the fit-source column: the union of its events supplies one (step, turn, delta_step, delta_turn, gaps) reference distribution per outer group, injected into each member track's per-track dispatch. Length 2 requires strict nesting: every distinct inner value must map to exactly one outer value. Length \(> 2\) is rejected. Note: for the probability primitive the pool is integrated (acts through the per-track dispatcher) rather than post-hoc, so inner has no role here; it is validated for consistency with the orchestrator but does not affect the prob primitive's flagging beyond what outer drives. NULL (default) preserves per-track behaviour. NA values in the named column(s) cause those tracks to fall back to per-track processing with a warning. Mutually exclusive with reference.
- silent
Logical. If FALSE (default), the function prints a brief running narration (per-iteration counts, threshold diagnostics, final summary). Set to TRUE to suppress. Errors and warnings are always shown.
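The "acf" default of autodiff_alpha described above can be sketched in base R. This is a minimal illustration, not the package's code: lag1_acf and acf_alpha are hypothetical helper names, and clamping negative autocorrelations to zero is an assumption made here so that the square root is always defined.

```r
# Hypothetical helpers (not exported by the package): derive alpha from the
# lag-1 autocorrelation of speed and angular velocity, alpha = sqrt(r_v * r_w).
lag1_acf <- function(v) {
  # stats::acf() returns lags 0..lag.max; element 2 is the lag-1 value
  stats::acf(v, lag.max = 1, plot = FALSE, na.action = na.pass)$acf[2]
}

acf_alpha <- function(speed, ang_vel) {
  r_v <- max(lag1_acf(speed), 0)    # assumption: clamp negative values to 0
  r_w <- max(lag1_acf(ang_vel), 0)
  sqrt(r_v * r_w)
}

# Simulated, positively autocorrelated series stand in for real track metrics
set.seed(42)
speed   <- as.numeric(arima.sim(list(ar = 0.8), n = 500))
ang_vel <- as.numeric(arima.sim(list(ar = 0.5), n = 500))
alpha <- acf_alpha(speed, ang_vel)
```

A strongly persistent track yields an alpha near 1 (auto-differences carry full weight), while an uncorrelated track yields an alpha near 0.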
Value
A move2 object. If remove = FALSE, the following
columns are added: step_turn_prob (probability from the 2D
step/turn histogram), delta_step_prob (probability of the change
in step length), delta_turn_prob (probability of the change in
turning angle), joint_prob (product of all three),
outlier_percentile (0–100, higher = more unusual),
is_outlier (logical flag), and is_na_prob (logical, TRUE
where probability could not be calculated).
If remove = TRUE, outlier rows (and optionally NA rows) are removed.
Details
The method builds three probability components for each location: (1) step-turn probability, from a 2D histogram of step lengths and turning angles with circular wrapping and bilinear interpolation; (2) delta-step probability, from the kernel density of changes in step length between consecutive steps; and (3) delta-turn probability, from the kernel density of changes in turning angle between consecutive steps.
The joint probability is the product of all three, with the two delta components weighted by the autodiff_alpha exponent. Locations whose joint probability falls below the cutoff selected by threshold_type are flagged as outliers.
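As a rough sketch of how the components might combine, the snippet below applies an exponent alpha to the two delta terms and then flags low values with a robust z-score, in the spirit of threshold_type = "significance". The exact form of Equation 1 is not given on this page; the formula is inferred from the autodiff_alpha description (0 ignores auto-differences, 0.5 is a geometric mean, 1 is the plain product). joint_prob and flag_significance are hypothetical names.

```r
# Assumed form: P_joint = P_step_turn * (P_delta_step * P_delta_turn)^alpha
joint_prob <- function(p_step_turn, p_delta_step, p_delta_turn, alpha = 1) {
  p_step_turn * (p_delta_step * p_delta_turn)^alpha
}

# Robust z-score on log-probabilities (median + MAD), one-sided lower tail.
# Assumes approximate normality of the log-probabilities.
flag_significance <- function(p, level = 0.001) {
  lp <- log(p)
  z <- (lp - median(lp)) / mad(lp)  # mad() includes the 1.4826 scaling
  pnorm(z) < level
}

# Made-up probabilities: seven ordinary fixes and one extremely unlikely fix
p <- c(0.20, 0.22, 0.18, 0.21, 0.19, 0.23, 0.17, 1e-9)
flagged <- which(flag_significance(p))
```

Here only the last fix lies far enough below the median log-probability to be flagged.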
When the input contains multiple individuals, each is processed separately
by default. To build a pooled reference distribution from all individuals,
pass reference = x. To use an external clean track as the reference,
pass it via reference.
Two methods are available for computing the step-turn probability:
"histogram" (default) uses a 2D histogram with circular wrapping
and bilinear interpolation; "copula" fits parametric marginal
distributions (Weibull for step lengths, von Mises for turning angles)
and uses the product of marginal densities. The copula method is faster
and can work better with small samples.
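The product-of-marginals idea behind "copula" can be illustrated in base R. This sketch skips the fitting step (which the package delegates to circular and MASS) and uses fixed, made-up parameters; dvonmises_sketch and step_turn_density are hypothetical names.

```r
# von Mises density written out with base R's modified Bessel function
dvonmises_sketch <- function(theta, mu, kappa) {
  exp(kappa * cos(theta - mu)) / (2 * pi * besselI(kappa, nu = 0))
}

# Step-turn density as the product of a Weibull marginal (step length)
# and a von Mises marginal (turning angle, radians)
step_turn_density <- function(step, turn, shape, scale, mu, kappa) {
  dweibull(step, shape = shape, scale = scale) * dvonmises_sketch(turn, mu, kappa)
}

# Illustrative parameters only (not fitted to any data)
typical  <- step_turn_density(50,  0,  shape = 1.5, scale = 60, mu = 0, kappa = 2)
unlikely <- step_turn_density(500, pi, shape = 1.5, scale = 60, mu = 0, kappa = 2)
```

A long step combined with a full reversal receives a far lower density than a typical forward step, which is what makes it a candidate outlier.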
When iterations > 1, the detection runs iteratively: after each
pass, flagged outliers are masked (removed from the track) and movement
metrics are recomputed on the cleaned track. This allows detection of
consecutive outliers, because removing the first outlier reveals the
true step from the last good location to the next good location.
Iteration stops after the specified number of passes or when no new
outliers are found, whichever comes first. All flags are mapped back
to the original object.
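The mask-and-recompute loop can be sketched with a toy detector. The detector below is a deliberately simple stand-in (it flags only the fix with the largest arrival step when that step exceeds k times the median) and is not the package's probability-based method; toy_detect and flag_iteratively are hypothetical names. The loop structure, with early stopping and flags mapped back to original row indices, follows the description above.

```r
# Toy stand-in detector: flag the single fix reached by the longest step,
# if that step is more than k times the median step length.
toy_detect <- function(x, y, k = 5) {
  step <- sqrt(diff(x)^2 + diff(y)^2)
  flag <- rep(FALSE, length(x))
  worst <- which.max(step)
  if (step[worst] > k * median(step)) flag[worst + 1] <- TRUE
  flag
}

flag_iteratively <- function(x, y, iterations = 3) {
  keep <- seq_along(x)                  # original indices still in the track
  is_outlier <- rep(FALSE, length(x))
  for (i in seq_len(iterations)) {
    if (length(keep) < 3) break
    new_flags <- toy_detect(x[keep], y[keep])
    if (!any(new_flags)) break          # early stop: no new outliers
    is_outlier[keep[new_flags]] <- TRUE # map flags back to original rows
    keep <- keep[!new_flags]            # mask; steps are recomputed next pass
  }
  is_outlier
}

# Planar toy track with two consecutive bad fixes (positions 4 and 5)
x <- c(0, 1, 2, 50, 51, 5, 6, 7)
y <- rep(0, 8)
one_pass     <- which(flag_iteratively(x, y, iterations = 1))
three_passes <- which(flag_iteratively(x, y, iterations = 3))
```

With a single pass only the first bad fix is caught, because the step between the two bad fixes looks ordinary; the second is revealed once the first has been masked.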
When quality_columns is provided, each quality function maps a
raw data column to a [0,1] quality score. The product of all quality
scores multiplies the movement probability before thresholding:
$$P_{final} = P_{movement} \times \prod quality\_weights$$
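A minimal sketch of this weighting, assuming a plain data frame in place of a move2 object; apply_quality is a hypothetical helper, and the column names and quality functions are taken from the Examples section.

```r
# Multiply movement probabilities by the product of per-column quality scores.
apply_quality <- function(p_movement, data, quality_columns) {
  weights <- rep(1, length(p_movement))
  for (col in names(quality_columns)) {
    q <- quality_columns[[col]](data[[col]])
    stopifnot(all(q >= 0 & q <= 1, na.rm = TRUE))  # scores must lie in [0, 1]
    weights <- weights * q
  }
  p_movement * weights
}

# Made-up fixes: the second has few satellites and a high HDOP
fixes <- data.frame(gps.satellite.count = c(7, 3), gps.hdop = c(3, 8))
quality <- list(
  "gps.satellite.count" = function(s) pnorm(s, mean = 7, sd = 2),
  "gps.hdop"            = function(h) 1 - pnorm(h, mean = 3, sd = 1.5)
)
p_final <- apply_quality(c(0.1, 0.1), fixes, quality)
```

Because every score is at most 1, quality weighting can only lower a fix's probability, so poor-quality fixes move toward the flagged tail.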
When time_normalize = TRUE, the method uses speed
(step_length / time_lag) and angular velocity (turning_angle / time_lag)
instead of raw step lengths and turning angles. This makes the method
time-aware, which is recommended for irregularly sampled data.
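The normalisation itself is a simple division, sketched below. to_rates is a hypothetical helper; how the package aligns each turning angle with a time lag is not specified here, so the element-wise pairing is a simplifying assumption.

```r
# Convert raw step lengths and turning angles into rates.
to_rates <- function(step_length, turning_angle, time_lag) {
  if (any(time_lag <= 0, na.rm = TRUE)) {
    stop("Zero or negative time lags found; remove duplicate timestamps first.")
  }
  list(
    speed            = step_length / time_lag,
    angular_velocity = turning_angle / time_lag
  )
}

# The same 100 m displacement over 10 s and over 1000 s
r <- to_rates(step_length = c(100, 100),
              turning_angle = c(0.5, 0.5),
              time_lag = c(10, 1000))
```

The two steps are identical in raw length but differ by a factor of 100 in speed, so they fall in very different regions of the probability surface.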
References
Safi, K. (in preparation). Self-thresholding hierarchical outlier-detection for animal movement tracks. Companion paper to the move2utils R package. Preprint: bioRxiv (DOI forthcoming).
Examples
if (FALSE) { # \dontrun{
library(move2)
## load example data
fishers <- mt_read(mt_example())
fishers <- fishers[!sf::st_is_empty(fishers), ]
## flag outliers per individual (automatic when multiple IDs present)
result <- mt_flag_outliers(fishers)
## remove outliers, single individual
leroy <- fishers[mt_track_id(fishers) == "M4", ]
cleaned <- mt_flag_outliers(leroy, remove = TRUE)
## copula method (faster, works well with small samples)
result_cop <- mt_flag_outliers(leroy, method = "copula")
## pooled reference: score against distribution from all individuals
result_pop <- mt_flag_outliers(fishers, reference = fishers)
## default: ACF-derived alpha (adapts to the data)
result_acf <- mt_flag_outliers(leroy)
## manual override: ignore auto-differences
result_a0 <- mt_flag_outliers(leroy, autodiff_alpha = 0)
## manual override: full product (no downweighting)
result_a1 <- mt_flag_outliers(leroy, autodiff_alpha = 1.0)
## iterative refinement for consecutive outliers
result_iter <- mt_flag_outliers(leroy, iterations = 3)
## quality weighting with satellite count and HDOP
result_qc <- mt_flag_outliers(leroy, quality_columns = list(
  "gps.satellite.count" = function(s) pnorm(s, mean = 7, sd = 2),
  "gps.hdop" = function(h) 1 - pnorm(h, mean = 3, sd = 1.5)
))
## raw step/turn metrics (without time normalisation)
result_raw <- mt_flag_outliers(leroy, time_normalize = FALSE)
} # }