Incorrect Behavior Order: Validate Before Canonicalize
This weakness occurs when software validates user input *before* converting it to its canonical (standard) form. An attacker can craft input using alternate encodings, escape sequences, or representations that pass validation but transform into malicious content once normalized. A classic example is validating a file path before decoding URL-encoded characters — the validator might miss ..%2F..%2F (encoded path traversal), which becomes ../../ after decoding.
02 How It Happens
The vulnerability arises from a logic-order mistake: validation happens on the "raw" input, while the application later processes a canonicalized version. Between validation and use, the input is decoded, unescaped, or normalized in some way. If the canonical form differs from what was validated, an attacker can bypass security checks by using alternate representations. Common scenarios include double-encoding, Unicode normalization, case folding, or filesystem-specific path resolution (e.g., Windows short names like PROGRA~1).
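The gap between the raw and canonical forms can be seen in a few lines of Python. This is a minimal sketch, not a complete list of bypasses: `naive_check` is a hypothetical raw-string filter, and the sample strings are illustrative values chosen for this example.

```python
import unicodedata
import urllib.parse

def naive_check(raw: str) -> bool:
    """Hypothetical raw-string filter: accept anything without a literal '../'."""
    return "../" not in raw

# Double encoding: '%25' decodes to '%', so a second decode exposes '../'.
double_encoded = "%252e%252e%252f"
once = urllib.parse.unquote(double_encoded)   # '%2e%2e%2f'
twice = urllib.parse.unquote(once)            # '../'

# Unicode compatibility normalization: fullwidth dot and slash fold to ASCII.
fullwidth = "\uff0e\uff0e\uff0f"
normalized = unicodedata.normalize("NFKC", fullwidth)  # '../'

# Case folding: a case-sensitive blocklist misses a mixed-case spelling.
mixed_case = "ADMIN.PHP"
folded = mixed_case.casefold()                # 'admin.php'

# Every raw form passes the filter, yet each canonical form is '../'.
for raw in (double_encoded, once, fullwidth):
    print(repr(raw), "passes raw check:", naive_check(raw))
```

Each raw string passes the filter, while the form the application would eventually use contains the sequence the filter was meant to stop.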
The root cause is a false assumption that validation on the raw form is sufficient. In reality, the application should validate *after* canonicalization, ensuring that the final form used by the application is the one that was checked.
03 Real-World Impact
Attackers can exploit this to access restricted files, traverse directory structures, bypass allowlists, or inject malicious content. For example, a file upload filter that checks only the trailing extension might accept shell.exe%00.jpg; after decoding, the name contains a null byte, and file APIs that treat a null byte as a string terminator see only shell.exe. Similarly, path validation might block the literal ../ sequence but allow %2e%2e%2f, which becomes ../ after URL decoding. The consequences range from information disclosure to remote code execution, depending on what the canonicalized input is used for.
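The upload-filter case can be sketched as follows. The allowlist, the function names, and the payload are assumptions made for illustration; a real filter would typically also inspect file content, not only the name.

```python
import urllib.parse

ALLOWED_EXTENSIONS = {".jpg", ".png"}  # hypothetical allowlist

def naive_is_allowed(raw_name: str) -> bool:
    """Checks the raw, still-encoded name: the order this weakness describes."""
    return any(raw_name.lower().endswith(ext) for ext in ALLOWED_EXTENSIONS)

def is_allowed(raw_name: str) -> bool:
    """Decode first, reject control characters, then check the extension."""
    name = urllib.parse.unquote(raw_name)
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in name):
        return False
    return any(name.lower().endswith(ext) for ext in ALLOWED_EXTENSIONS)

payload = "shell.exe%00.jpg"
print(naive_is_allowed(payload))  # True: the raw name ends in .jpg
print(is_allowed(payload))        # False: the decoded name contains a null byte
print(is_allowed("photo.jpg"))    # True
```

Rejecting the decoded name outright, rather than trying to strip the null byte, avoids guessing which part of the name a downstream API will actually use.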
04 Vulnerable & Fixed Patterns
Vulnerable pattern
import urllib.parse

def validate_filename(user_input):
    # Validate BEFORE canonicalizing: only the raw, still-encoded string is checked
    if ".." in user_input or user_input.startswith("/"):
        raise ValueError("Invalid path")
    return user_input

def read_file(filename):
    # Canonicalize AFTER validation: decoding happens too late to be checked
    decoded = urllib.parse.unquote(filename)
    with open(f"/var/data/{decoded}", "r") as f:
        return f.read()

# Attacker passes "%2e%2e%2f%2e%2e%2fetc%2fpasswd"
# Validation finds no literal ".." in the raw string (the dots are encoded as %2e)
# After unquote(), it becomes "../../etc/passwd" and the path traversal succeeds
user_file = validate_filename("%2e%2e%2f%2e%2e%2fetc%2fpasswd")
content = read_file(user_file)
Why it's vulnerable: The validation checks the encoded form, which doesn't contain the literal .. sequence. After URL decoding, the path traversal becomes active and bypasses the check.
Fixed pattern
import urllib.parse
from pathlib import Path

def read_file(filename):
    # Canonicalize FIRST
    decoded = urllib.parse.unquote(filename)
    # Validate AFTER canonicalization
    safe_base = Path("/var/data").resolve()
    requested = (safe_base / decoded).resolve()
    # Compare path components, not string prefixes: a plain startswith()
    # check would also accept a sibling directory such as /var/data-old
    if safe_base not in requested.parents:
        raise ValueError("Path traversal detected")
    with open(requested, "r") as f:
        return f.read()
Vulnerable pattern
<?php
function validate_filename($user_input) {
    // Validate BEFORE canonicalizing: only the raw, still-encoded string is checked
    if (strpos($user_input, "..") !== false || strpos($user_input, "/etc") !== false) {
        throw new Exception("Invalid path");
    }
    return $user_input;
}

function read_file($filename) {
    // Canonicalize AFTER validation: decoding happens too late to be checked
    $decoded = urldecode($filename);
    $content = file_get_contents("/var/data/" . $decoded);
    return $content;
}

// Attacker passes "%2e%2e%2f%2e%2e%2fetc%2fpasswd"
// Validation sees neither ".." nor "/etc" (the dots and slashes are encoded)
// After urldecode(), it becomes "../../etc/passwd"
$file = validate_filename("%2e%2e%2f%2e%2e%2fetc%2fpasswd");
echo read_file($file);
?>
Why it's vulnerable: The validation runs on the URL-encoded string, which doesn't match the dangerous patterns. Once decoded, the path traversal is active.
Fixed pattern
<?php
function read_file($filename) {
    // Canonicalize FIRST
    $decoded = urldecode($filename);
    $base = realpath("/var/data");
    $realpath = realpath("/var/data/" . $decoded);
    // Validate AFTER canonicalization; the trailing separator keeps
    // sibling directories such as /var/database from matching the prefix
    if ($base === false || $realpath === false
        || strpos($realpath, $base . DIRECTORY_SEPARATOR) !== 0) {
        throw new Exception("Path traversal detected");
    }
    $content = file_get_contents($realpath);
    return $content;
}
?>
05 Prevention Checklist
Canonicalize first, validate second. Always decode, unescape, and normalize input before applying security checks.
Use allowlists on canonical form. Define what characters, patterns, or values are acceptable *after* normalization, not before.
Leverage platform APIs for path validation. Use realpath() (PHP), Path.resolve() (Python), or equivalent to resolve paths to their true form before checking boundaries.
Be aware of multiple encoding layers. Watch for double-encoding (e.g., %252F), Unicode normalization, case folding, and filesystem-specific tricks (Windows short names, alternate data streams).
Test with encoded payloads. Include URL-encoded, HTML-encoded, and Unicode variants in your test cases to catch validation-order bugs.
Document canonicalization rules. Make it explicit in code comments and design docs which transformations happen and in what order.
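Several checklist items can be combined into one helper. The sketch below assumes percent-encoding is the only encoding layer in play; `MAX_DECODE_ROUNDS`, `canonicalize`, `resolve_under`, `rejected`, and the payload list are hypothetical names and values introduced for illustration.

```python
import urllib.parse
from pathlib import Path

MAX_DECODE_ROUNDS = 5  # assumed limit; tune to the application

def canonicalize(raw: str) -> str:
    """Percent-decode until the string stops changing, or give up."""
    current = raw
    for _ in range(MAX_DECODE_ROUNDS):
        decoded = urllib.parse.unquote(current)
        if decoded == current:
            return current
        current = decoded
    raise ValueError("too many encoding layers")

def resolve_under(base: Path, raw: str) -> Path:
    """Canonicalize, resolve, then check the result stays inside base."""
    base = base.resolve()
    candidate = (base / canonicalize(raw)).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError("path escapes base directory")
    return candidate

# Encoded variants worth including in tests: plain, single- and
# double-encoded traversal, plus an absolute path.
PAYLOADS = ["../secret", "%2e%2e%2fsecret", "%252e%252e%252fsecret", "/etc/passwd"]

def rejected(base: Path, raw: str) -> bool:
    """Return True when resolve_under refuses the input."""
    try:
        resolve_under(base, raw)
    except ValueError:
        return True
    return False
```

Decoding until the string is stable is one possible design. A stricter alternative is to decode exactly once and reject anything that still contains percent-escapes, which avoids altering legitimate names that include a literal percent sign.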
06 Signs You May Already Be Affected
Look for file-access logs showing requests with unusual encoding patterns (e.g., %2F, %2e, %00) that successfully accessed files outside intended directories. Check for unexpected files in sensitive locations or evidence of directory traversal in web server logs. Review validation code to see if checks happen before URL decoding, HTML unescaping, or path resolution.
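A first pass over access logs can be automated. The sketch below is a rough filter under stated assumptions: the pattern covers only a handful of percent-encoded indicators, and the sample lines are invented for illustration rather than taken from a real server.

```python
import re

# Assumed indicators: percent-encoded dot, slash, backslash, percent sign, or NUL.
SUSPICIOUS = re.compile(r"%(?:2e|2f|5c|25|00)", re.IGNORECASE)

def flag_suspicious(log_lines):
    """Yield (line_number, line) for lines with an encoded traversal indicator."""
    for number, line in enumerate(log_lines, start=1):
        if SUSPICIOUS.search(line):
            yield number, line.rstrip("\n")

# Hypothetical sample lines in a simplified access-log shape.
sample = [
    '10.0.0.5 - - "GET /files/report.pdf HTTP/1.1" 200 5120\n',
    '10.0.0.9 - - "GET /files/%2e%2e%2f%2e%2e%2fetc%2fpasswd HTTP/1.1" 200 1843\n',
    '10.0.0.9 - - "GET /files/a.exe%00.jpg HTTP/1.1" 200 77\n',
]
hits = list(flag_suspicious(sample))
for number, line in hits:
    print(number, line)
```

A match is only a lead: encoded slashes also appear in legitimate URLs, so flagged requests matter most when they returned a success status for a resource outside the intended directory.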