Multi-scale persistence score for outlier flags

What this vignette covers

When mt_clean_track() flags a fix as an outlier, the next question is usually: how confident should I be that this is really an error? mt_persistence_score() answers that by re-checking each flagged fix against the track at wider temporal scales — if a fix looks anomalous when compared to its immediate neighbours and also when compared to the fixes 2, 4, and 8 steps away, that’s much stronger evidence than if it only looks weird at one scale.

The result is a per-fix confidence score from 1 to 4 (with the default settings), stored in the persistence_count column. You decide whether to drop low-confidence flags or keep them all; the function never modifies is_outlier itself.

It works on the output of any flagger in the package (mt_clean_track, mt_flag_outliers_bridge, mt_flag_outliers_detour, mt_flag_outliers, mt_flag_speed_cap, mt_sequential_outliers, mt_combined_outliers) — it doesn’t care how the candidates were flagged, only whether they still look out of place when you zoom out.

The worked examples below show that the score is most useful when you filter it conditional on the cascade’s error_class column rather than applying it across the board. We’ll show you what that looks like in practice.

How it works — intuition first

For each flagged fix, the function asks: “if I look at the track at a coarser temporal scale — comparing this fix to the one 2 steps away instead of 1 step away, and again at 4 steps and 8 steps — does it still look like an outlier?”

The “look like an outlier” check at each scale uses the same kind of joint distribution that the per-fix detectors use, just on windowed step lengths and turn angles instead of single-step ones. The reference at each scale is built from every interior fix of the track (not just the flagged ones), so each candidate is scored against the track’s own multi-scale geometry rather than against an external assumption.

A fix that only trips at the native (single-step) resolution gets score 1. A fix that also trips at the 2-step, 4-step, and 8-step windows gets score 4. The higher the score, the more “robust” the anomaly — its geometric extent survives temporal coarsening.

The formal picture (for the curious)

For each flagged fix $i$ and each validation scale $k \in \{2, 4, 8\}$ (the default), the function computes:

$\text{step}_k^\text{in}(i)$ : distance from $x_{i-k}$ to $x_i$
$\text{step}_k^\text{out}(i)$ : distance from $x_i$ to $x_{i+k}$
$\text{turn}_k(i)$ : turn angle at $x_i$ between the two long arms
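
The three quantities above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name windowed_geometry() is hypothetical, and it assumes planar (projected) coordinates with Euclidean distances, whereas the package may work with geodesic distances on longitude/latitude.

```r
# Hypothetical helper (not part of the package): windowed step lengths and
# turn angle at scale k, for planar coordinates in metres.
windowed_geometry <- function(x, y, k) {
  n <- length(x)
  step_in <- step_out <- turn <- rep(NA_real_, n)
  i <- seq_len(n)
  idx <- i[i > k & i <= n - k]        # fixes that have both k-step arms

  dx_in  <- x[idx] - x[idx - k]       # incoming arm: x_{i-k} -> x_i
  dy_in  <- y[idx] - y[idx - k]
  dx_out <- x[idx + k] - x[idx]       # outgoing arm: x_i -> x_{i+k}
  dy_out <- y[idx + k] - y[idx]

  step_in[idx]  <- sqrt(dx_in^2 + dy_in^2)
  step_out[idx] <- sqrt(dx_out^2 + dy_out^2)
  # Signed angle between the two arms, in (-pi, pi]: atan2(cross, dot).
  turn[idx] <- atan2(dx_in * dy_out - dy_in * dx_out,
                     dx_in * dx_out + dy_in * dy_out)
  data.frame(step_in, step_out, turn)
}

# Toy track (made-up values): a straight line east with one spike at fix 6.
x <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
y <- c(0,  0,  0,  0,  0, 40,  0,  0,  0,  0,   0)
g2 <- windowed_geometry(x, y, k = 2)
g2[c(3, 6), ]   # an ordinary fix vs. the spike, both seen at scale 2
```

On this toy track the spike at fix 6 still produces a sharp reversal at k = 2, while fix 3 shows a zero turn. Fixes within k steps of either end have no complete window and are left as NA.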

The reference distribution at scale $k$ is built from the same quantities computed at every interior fix of the track (the candidate is not removed). A 2-D histogram of $(\log\text{step}_k, \text{turn}_k)$ gives the joint density; a gap-on-($-\log$ probability) threshold determines whether each candidate is flagged at scale $k$.
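
As a rough sketch of the density step only, the code below bins made-up $(\log\text{step}_k, \text{turn}_k)$ values into a 2-D histogram and converts each fix's cell probability into a $-\log$ probability. The bin widths and the toy values are assumptions chosen for illustration, and the gap rule that turns these values into a threshold is not reproduced here.

```r
# Toy reference values at one scale (made-up numbers, not package output).
# The last observation is a planted anomaly.
log_step <- c(rep(c(3.0, 3.1, 3.2, 3.3), times = 25), 6.0)
turn     <- c(rep(c(-0.2, -0.1, 0.0, 0.1, 0.2), times = 20), 3.0)

# Bin each axis, then cross-tabulate to get the 2-D histogram.
# Bin edges are an assumption made for this sketch.
step_bin <- cut(log_step, breaks = seq(2.5, 6.5, by = 0.5),
                include.lowest = TRUE)
turn_bin <- cut(turn, breaks = seq(-pi, pi, length.out = 13),
                include.lowest = TRUE)
counts <- table(step_bin, turn_bin)
prob   <- counts / sum(counts)

# -log probability of the cell that each observation falls in.
cell      <- cbind(as.integer(step_bin), as.integer(turn_bin))
surprisal <- as.numeric(-log(prob[cell]))

which.max(surprisal)   # index of the least probable observation
```

The planted anomaly sits alone in its cell, so it has the largest $-\log$ probability of the 101 observations.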

The persistence score is

$p(i) = 1 + \sum_{k \in \text{scales}} \mathbb{1}\bigl[i \text{ flagged at scale } k\bigr],$

so $p(i) \in \{1, 2, 3, 4\}$ for the default scales = c(2, 4, 8).
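
The formula amounts to one plus a row sum over per-scale flags. A minimal sketch, using a made-up logical matrix (the name flag_at_scale is hypothetical and does not correspond to a package object):

```r
# Rows = flagged fixes, columns = validation scales; TRUE means the fix
# was also flagged at that scale. Values are made up for illustration.
flag_at_scale <- matrix(
  c(TRUE,  TRUE,  TRUE,    # trips at k = 2, 4 and 8
    TRUE,  FALSE, FALSE,   # trips at k = 2 only
    FALSE, FALSE, FALSE),  # trips at no wider scale
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("k2", "k4", "k8"))
)

# p(i) = 1 + number of validation scales at which fix i is flagged.
persistence <- 1L + rowSums(flag_at_scale)
persistence
#> [1] 4 2 1
```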

Why no thinning?

A natural-sounding alternative is “multi-scale voting”: thin the track at each scale and run a detector on each thinned subset, then count votes. That has a structural index-parity problem: a fix at original index 7 with scales = c(1, 2, 4, 8) is in the thinned subset only at scale 1; it can vote at most once, no matter how anomalous it looks. The persistence score evaluates every flagged fix at every validation scale, dissolving the bias.
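
The parity problem can be reproduced in a few lines. The sketch assumes that thinning at scale k keeps the fixes whose index is a multiple of k; a different offset changes which indices are favoured, but some fixes still get fewer chances to vote than others.

```r
# Assumption: the thinned subset at scale k keeps indices that are
# multiples of k. votes_available() is a hypothetical helper.
scales <- c(1, 2, 4, 8)
votes_available <- function(i, scales) sum(i %% scales == 0)

votes_available(7, scales)
#> [1] 1
votes_available(8, scales)
#> [1] 4

# Maximum number of votes each of the first 16 fixes could ever receive.
max_votes <- vapply(1:16, votes_available, numeric(1), scales = scales)
```

Under this convention only indices 8 and 16 can reach four votes, regardless of how anomalous the other fixes are.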

Worked example

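library(move2)       # mt_as_move2(), mt_track_id()
library(sf)          # st_is_empty()
library(move2utils)  # mt_clean_track(), mt_persistence_score()

d <- read.csv(gzfile(system.file("extdata", "synthetic_tracks.csv.gz",
                                   package = "move2utils")),
               stringsAsFactors = FALSE)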
d$timestamp <- as.POSIXct(d$timestamp, tz = "UTC")
x <- mt_as_move2(d,
  coords = c("location.long", "location.lat"),
  time_column = "timestamp",
  track_id_column = "individual.local.identifier", crs = 4326)
x <- x[!st_is_empty(x), ]
gt <- readRDS(system.file("extdata", "synthetic_ground_truth.rds",
                            package = "move2utils"))

cpf_a <- x[mt_track_id(x) == "CPF_A", ]
truth_a <- gt[["CPF_A"]]$index

Run the cascade on the spike-contaminated CPF_A track and annotate the result:

clean <- mt_clean_track(cpf_a, plot = FALSE, remove = FALSE)
#> No physiological speed cap supplied -- running with a data-driven cap chosen from your track.  This works well for most cases.  If your animal has multiple behavioural states (e.g. perched and flying) or you expect sustained-spoof errors, supplying `v_max =` (a published top speed in m/s) or `(mass = ..., mode = ...)` for the allometric estimate gives sharper results.  See `?v_phys_estimate` for the allometric helper; `?mt_clean_track` documents the failure modes of the auto-cap in detail.
#> Auto-cap landed at 60.2 m/s -- above the Hirt 2017 95% upper CI of the maximum biological speed (~52.6 m/s).  The gap finder is detecting a structural break within the outlier tail. Supply `(mass, mode)` or a hard `v_max` for a principled physiological cap.  See `?v_phys_estimate`.
#> Iter 1: bridge=20 prob=5 speed=23 detour=11 (v_max=60.2) | conjunction=19 | new=19 cumulative=19
#> Iter 2: bridge=4 prob=17 speed=6 detour=5 (v_max=25.5) | conjunction=6 | new=6 cumulative=25
#> Iter 3: bridge=0 prob=12 speed=0 detour=2 (v_max=-) | conjunction=0 | new=0 cumulative=25
#> === mt_clean_track: 25 flagged (1.430% of 1748); stopped: no_new_flags ===
#>     Returning all rows with flag columns attached. To drop flagged rows, either re-run with remove = TRUE (the default) or subset: x[!x$is_outlier, ].
ann <- mt_persistence_score(clean, silent = TRUE)

flagged <- which(ann$is_outlier)
cat("Cascade flagged:", length(flagged), "fixes\n")
#> Cascade flagged: 25 fixes
cat("Persistence score distribution:\n")
#> Persistence score distribution:
print(table(ann$persistence_count[flagged]))
#> 
#>  3  4 
#>  1 24

Of the 25 cascade flags, 24 persist at the maximum score (p = 4) and one scores 3, as expected for spike-class outliers whose geometric anomaly survives every coarsening.

The class-conditional finding

The empirical question is: does persistence usefully discriminate true positives from false positives within the cascade’s flag set? The answer depends on which error_class the cascade assigned.

The mt_clean_track() cascade attaches one of six error classes to each flag: geometric_spike, consensus, state_anomaly, kinematic_confluence, block, physiological. Each class captures a different combination of the four primitives that fired, and each has a different empirical relationship with persistence.

Pooling across the CPF synthetic tracks (CPF_A through CPF_F), the table below reports, for each error class, what fraction of true positives (TPs) and false positives (FPs) reach a persistence score of $p \geq 3$:

| `error_class` | $n$ TPs | $n$ FPs | % TPs at $p \geq 3$ | % FPs at $p \geq 3$ | TP $-$ FP gap |
|---|---|---|---|---|---|
| `geometric_spike` | 76 | 0 | 51% | (no FPs) | (class-pure on synthetic) |
| `state_anomaly` | 24 | 23 | 96% | 57% | +39.3 pp |
| `consensus` | 34 | 9 | 82% | 44% | +37.9 pp |
| `kinematic_confluence` | 7 | 7 | 71% | 86% | $-$14.3 pp (small $n$) |

Three findings:

The geometric_spike class is empirically pure on the synthetic tracks: the cascade produces no false positives in this class, so persistence has nothing to filter.
The state_anomaly and consensus classes show a clean discrimination gap of about +38 to +39 pp: TPs persist substantially more often than FPs, so a persistence-based filter on these classes removes proportionally more FPs than TPs and improves precision.
The kinematic_confluence class rests on a small sample (7 TPs and 7 FPs, concentrated on the multi-state CPF_F track), so the direction of the signal is uncertain.
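
A per-class comparison of this kind can be computed with tapply(). The sketch below uses a small made-up flag table, not the pooled CPF results; the is_tp column is hypothetical and would in practice come from matching flags against ground-truth indices.

```r
# Toy flag table (made-up values): one row per flagged fix.
flags <- data.frame(
  error_class = rep(c("consensus", "state_anomaly"), each = 4),
  is_tp = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  persistence_count = c(4, 3, 1, 3, 4, 4, 2, 1)
)
flags$persists <- flags$persistence_count >= 3

# Percentage of FPs ("FALSE" column) and TPs ("TRUE" column) at p >= 3,
# computed separately within each error class.
pct <- 100 * tapply(flags$persists,
                    list(flags$error_class, flags$is_tp), mean)

# TP - FP gap in percentage points, one value per class.
gap_pp <- pct[, "TRUE"] - pct[, "FALSE"]
gap_pp
```

A positive gap means persistence separates TPs from FPs in that class; a gap near zero or below means a persistence filter has little to offer or would remove TPs preferentially.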

Recommended usage pattern

Apply the helper as annotation only by default; never modify is_outlier automatically. Where filtering is desired, gate the filter on error_class:

clean <- mt_clean_track(cpf_a, plot = FALSE, remove = FALSE)
#> No physiological speed cap supplied -- running with a data-driven cap chosen from your track.  This works well for most cases.  If your animal has multiple behavioural states (e.g. perched and flying) or you expect sustained-spoof errors, supplying `v_max =` (a published top speed in m/s) or `(mass = ..., mode = ...)` for the allometric estimate gives sharper results.  See `?v_phys_estimate` for the allometric helper; `?mt_clean_track` documents the failure modes of the auto-cap in detail.
#> Auto-cap landed at 60.2 m/s -- above the Hirt 2017 95% upper CI of the maximum biological speed (~52.6 m/s).  The gap finder is detecting a structural break within the outlier tail. Supply `(mass, mode)` or a hard `v_max` for a principled physiological cap.  See `?v_phys_estimate`.
#> Iter 1: bridge=20 prob=5 speed=23 detour=11 (v_max=60.2) | conjunction=19 | new=19 cumulative=19
#> Iter 2: bridge=4 prob=17 speed=6 detour=5 (v_max=25.5) | conjunction=6 | new=6 cumulative=25
#> Iter 3: bridge=0 prob=12 speed=0 detour=2 (v_max=-) | conjunction=0 | new=0 cumulative=25
#> === mt_clean_track: 25 flagged (1.430% of 1748); stopped: no_new_flags ===
#>     Returning all rows with flag columns attached. To drop flagged rows, either re-run with remove = TRUE (the default) or subset: x[!x$is_outlier, ].
ann <- mt_persistence_score(clean, silent = TRUE)

## Class-aware filter: drop low-persistence flags only in the
## classes where persistence is empirically discriminative.
demote <- ann$is_outlier &
            ann$error_class %in% c("state_anomaly", "consensus") &
            ann$persistence_count < 3

cat("Cascade flags:           ", sum(ann$is_outlier), "\n")
#> Cascade flags:            25
cat("Demoted by class filter: ", sum(demote), "\n")
#> Demoted by class filter:  0
cat("Final flagged after filter:",
    sum(ann$is_outlier & !demote), "\n")
#> Final flagged after filter: 25

On CPF_A the filter demotes nothing: all 25 flags have a persistence score of 3 or 4, so none falls below the threshold of 3.

When NOT to apply the persistence filter

The score’s discriminative power decays in the following cases:

Halo-style outliers. Repeated wandering returns to a fixed location have anomalies that average out over wider windows. Their persistence score is structurally low even when the fix is genuinely an outlier. The cascade typically classifies these as geometric_spike (where the class is already pure and no filtering is needed), but on detector outputs that don’t carry an error_class column, applying a persistence filter to halo data will preferentially remove true positives.
Bridge or detour standalone output. The structural review showed that on raw bridge or detour flags, false positives persist more than true positives, so a persistence filter would hurt precision rather than help. Both primitives flag on local-window geometry: their TPs include scale-dependent cases (halo), while their FPs sit in structurally complex track regions where any window-based score fires consistently.

The safe pattern is therefore: annotate, then filter by error_class. Without an error-class column, treat the persistence score as an inspectable confidence column rather than an automatic filter.
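
One way to encode that pattern is a small wrapper that filters only when class information is present. The helper persistence_demote() is hypothetical (it is not exported by the package) and is shown on a made-up data frame; it assumes the is_outlier, error_class and persistence_count columns used in the example above.

```r
# Hypothetical helper: TRUE marks flags to demote. Without an error_class
# column it returns all FALSE, i.e. annotation only.
persistence_demote <- function(d,
                               classes = c("state_anomaly", "consensus"),
                               min_score = 3) {
  if (!"error_class" %in% names(d)) {
    return(rep(FALSE, nrow(d)))   # no class information: never filter
  }
  demote <- d$is_outlier &
    d$error_class %in% classes &
    d$persistence_count < min_score
  demote[is.na(demote)] <- FALSE  # unscored rows are never demoted
  demote
}

# Made-up rows for illustration.
toy <- data.frame(
  is_outlier        = c(TRUE, TRUE, TRUE, FALSE),
  error_class       = c("consensus", "geometric_spike", "state_anomaly", NA),
  persistence_count = c(2, 1, 4, NA)
)
persistence_demote(toy)
#> [1]  TRUE FALSE FALSE FALSE
```

Only the low-persistence consensus flag is demoted. The geometric_spike flag is kept despite its score of 1, because that class is not in the filtered set.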

Custom scales

The default scales = c(2, 4, 8) is appropriate for typical animal-tracking datasets. Adjust if your sampling rate or contamination geometry warrants:

## Denser scale ladder for high-frequency GPS:
ann <- mt_persistence_score(clean, scales = c(2L, 3L, 5L, 8L, 13L))

## Coarser-only validation for sparse Argos data:
ann <- mt_persistence_score(clean, scales = c(4L, 8L, 16L))

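The function requires each $k \geq 2$; scale 1 is the original flagger’s resolution and is implicit in the score’s $+1$ constant. Each scale adds at most one point, so $n$ scales give a maximum score of $n + 1$: 6 for the five-scale ladder above and 4 for the three-scale one. A cut-off such as $p \geq 3$ therefore covers a different share of the ladder once the number of scales changes.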
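
A pre-flight check along these lines can catch an invalid ladder before a long run and report the resulting score range. The helper check_scales() is hypothetical, and the whole-number requirement is an assumption of this sketch rather than documented behaviour.

```r
# Hypothetical pre-flight check mirroring the k >= 2 rule.
# Assumption: scales are whole numbers of steps.
check_scales <- function(scales) {
  stopifnot(is.numeric(scales),
            all(scales >= 2),
            all(scales == round(scales)))
  # Each scale can add at most one point on top of the implicit +1.
  list(scales = as.integer(scales), max_score = 1L + length(scales))
}

check_scales(c(2, 4, 8))$max_score
#> [1] 4
check_scales(c(2L, 3L, 5L, 8L, 13L))$max_score
#> [1] 6
```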

Relationship to the retired `mt_flag_outliers_multiscale`

mt_persistence_score() replaces the previous mt_flag_outliers_multiscale() function (retired in v0.3). The predecessor thinned the track at multiple scales and voted; the new helper does not thin and instead computes a windowed view at every fix, which removes the structural index-parity bias of the old design and makes the score detector-agnostic. See the package NEWS entry for v0.3 for the empirical motivation behind the change.