Abstract
Cambridge Open Engage version of the Abstract

We conduct an exhaustive and rigorous mathematical study of a temporal-difference (TD) learning scheme characterized by sign-gated, dual eligibility traces, denoted as a pair of vectors $(e_t^+, e_t^-)$. The analysis is situated within a canonical and idealized experimental environment featuring a single, isolated feature spike, which is subsequently followed by a delayed and stochastic outcome. The agent's value function is assumed to be linear in its features, represented as $V_\theta(s_t) = \langle \theta, \varphi_t \rangle$, and the learning process is driven by the temporal-difference error, defined as $\delta_t = r_{t+1} + \gamma \cdot V_\theta(s_{t+1}) - V_\theta(s_t)$, where the discount factor $\gamma$ is a value within the interval $[0,1)$. The parameter update rule, which forms the core of the learning mechanism, is given by $\Delta\theta_t = \alpha \cdot \delta_t \cdot \left(e_t^+ \cdot \mathbb{I}(\delta_t > 0) + e_t^- \cdot \mathbb{I}(\delta_t < 0)\right)$. The dual eligibility traces evolve according to the linear recursive equations $e_{t+1}^\pm = n_\pm \cdot e_t^\pm + \varphi_t$, with distinct decay rates $n_\pm \in (0,1)$ and initial conditions $e_0^\pm = 0$. Our analytical setting involves a single, non-zero feature spike $\varphi_\tau = x$ at a specific time $\tau$, and a consequential outcome $r_{\tau+L}$ from the set $\{+R, -S\}$ arriving at a later time $\tau + L$, with respective probabilities $p$ and $1-p$. A crucial finding of this paper is that under this single-spike assumption, the expected parameter update is exactly independent of the discount factor $\gamma$ for all values in $[0,1)$. We provide a complete and exact characterization of the conditions under which the expected parameter update positively aligns with the feature spike $x$, a phenomenon we term "undesirable reinforcement," even when the expected reward is non-positive. Furthermore, we extend our analysis to more realistic scenarios by deriving a robust, conservative lower bound for the expected update under general perturbations and "leaky features," where small, non-zero features may be present at other times. This bound rigorously isolates the distinct contributions of direct reward, bootstrapping discrepancy, and non-reward-driven updates. All mathematical derivations, from the primary theorems to the edge-case analyses (e.g., $L = 0, 1$; $n_+ = n_-$; $pR = (1-p)S$; $\gamma$ approaching 1), are presented in a complete, stepwise fashion with exhaustive detail to ensure full transparency, reproducibility, and verifiability.
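To make the update rule and trace recursions above concrete, the following is a minimal Python sketch of the single-spike setting. It is an illustration under stated assumptions, not the paper's implementation: the function names (`dot`, `episode_update`, `expected_update`), the finite horizon `T`, and all numerical values are hypothetical choices made here. The expectation over the two outcomes is computed exactly by running one deterministic episode per outcome and weighting by $p$ and $1-p$.

```python
def dot(a, b):
    """Inner product <a, b> of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def episode_update(theta0, x, tau, L, T, reward, alpha, gamma, n_plus, n_minus):
    """Run t = 0..T-1 once and return the summed change in theta.

    Assumes T >= tau + L so that the outcome r_{tau+L} enters some delta_t.
    """
    zero = [0.0] * len(x)

    def phi(t):
        # Single feature spike: phi_tau = x, zero at all other times.
        return x if t == tau else zero

    def r(t):
        # Single outcome arriving at time tau + L.
        return reward if t == tau + L else 0.0

    theta = list(theta0)
    e_plus, e_minus = list(zero), list(zero)  # e_0^+ = e_0^- = 0
    total = list(zero)
    for t in range(T):
        # TD error: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        delta = r(t + 1) + gamma * dot(theta, phi(t + 1)) - dot(theta, phi(t))
        # Sign gate: positive errors use e^+, negative errors use e^-.
        if delta > 0:
            trace = e_plus
        elif delta < 0:
            trace = e_minus
        else:
            trace = zero
        for i in range(len(x)):
            step = alpha * delta * trace[i]
            theta[i] += step
            total[i] += step
        # Trace recursion: e_{t+1} = n * e_t + phi_t (applied after the update).
        e_plus = [n_plus * e + f for e, f in zip(e_plus, phi(t))]
        e_minus = [n_minus * e + f for e, f in zip(e_minus, phi(t))]
    return total


def expected_update(p, R, S, **kwargs):
    """Exact expectation over the outcomes +R (prob. p) and -S (prob. 1 - p)."""
    up = episode_update(reward=+R, **kwargs)
    down = episode_update(reward=-S, **kwargs)
    return [p * u + (1 - p) * d for u, d in zip(up, down)]


if __name__ == "__main__":
    # Hypothetical values; here p*R = (1-p)*S, so the expected reward is zero.
    params = dict(theta0=[0.3, -0.2], x=[1.0, 2.0], tau=3, L=4, T=10,
                  alpha=0.1, n_plus=0.9, n_minus=0.5)
    for gamma in (0.0, 0.9):
        print(gamma, expected_update(p=0.5, R=1.0, S=1.0, gamma=gamma, **params))
```

In this sketch the traces are still zero at the only steps where $V_\theta$ is non-zero, which is why changing `gamma` leaves the result unchanged, and choosing `n_plus > n_minus` yields an expected update with a positive component along `x` even though the expected reward is zero.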