CWE-172: Encoding Error — Weakness Reference

01 Summary

An encoding error occurs when software fails to properly convert data between different character sets, formats, or representations. This mismatch can cause security checks to be bypassed because the application validates one form of the data while processing another. For example, a filter might block a dangerous string in UTF-8, but if the application later reinterprets the same bytes as a different encoding, the filter's decision becomes meaningless.
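
As a minimal sketch of this mismatch, assume a hypothetical filter named `looks_safe` that rejects angle brackets. The same ASCII bytes pass the check when read as UTF-8 and turn into markup if a later stage decodes them as UTF-7. The function name and the sample payload are illustrative assumptions, not taken from any particular application.

```python
def looks_safe(text: str) -> bool:
    """Hypothetical filter: reject anything containing angle brackets."""
    return "<" not in text and ">" not in text


# The same bytes, interpreted under two different encodings.
payload = b"+ADw-script+AD4-"

as_utf8 = payload.decode("utf-8")  # what the filter sees
as_utf7 = payload.decode("utf-7")  # what a later stage sees

filter_verdict = looks_safe(as_utf8)

print(filter_verdict)  # True: no angle brackets in the UTF-8 view
print(as_utf7)         # <script>
```

The filter's answer is only valid for the decoding it was computed on, which is why the check and the consumer need to agree on one encoding.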

02 How It Happens

Encoding errors arise when an application assumes data is in one character encoding (or format) but processes it as another, or when it fails to normalize data before validation. This commonly occurs at boundaries where data moves between systems with different encoding assumptions—for instance, between a web form (typically UTF-8), a database (which may use a different collation), and backend processing logic. If validation happens before encoding conversion, or if the conversion is incomplete or inconsistent, an attacker can craft input that passes the filter in one form but becomes dangerous after re-encoding. The root cause is usually a lack of explicit encoding declaration, inconsistent use of encoding functions, or failure to canonicalize data before security checks.
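
The ordering problem described above can be sketched with percent-encoding. The check `is_path_allowed` is a hypothetical stand-in for any filter, and the sample path is an assumption chosen for illustration; only `urllib.parse.unquote` is a real standard-library call.

```python
from urllib.parse import unquote


def is_path_allowed(path: str) -> bool:
    """Hypothetical check: reject directory traversal sequences."""
    return ".." not in path


raw = "%2e%2e/%2e%2e/etc/passwd"

# Vulnerable order: validate the still-encoded form, decode afterwards.
checked_before_decode = is_path_allowed(raw)
used_path = unquote(raw)

# Safer order: decode to the canonical form first, then validate that form.
checked_after_decode = is_path_allowed(unquote(raw))

print(checked_before_decode)  # True
print(used_path)              # ../../etc/passwd
print(checked_after_decode)   # False
```

If input can be encoded more than once, a single decoding pass may not reach the canonical form; one conservative option is to reject input that still changes when decoded a second time.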

03 Real-World Impact

Encoding errors can lead to authentication bypass, SQL injection, cross-site scripting (XSS), and file upload vulnerabilities. For example, a filename filter might reject files with dangerous extensions by checking the string in one encoding, but if the filesystem interprets the filename in a different encoding, a blocked filename could become executable. Similarly, encoding confusion in URL or parameter parsing can allow an attacker to inject SQL or script code that passes input validation but executes after decoding. The severity depends on what security boundary the encoding error crosses.
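
The filename case can be illustrated with Unicode compatibility normalization. This sketch assumes that some later layer (storage, sync, or another service) applies NFKC normalization to the name; whether a given system does so is an assumption to verify, not a general rule. The blocklist regex mirrors the extension check used in this document's PHP example.

```python
import re
import unicodedata

BLOCKED = re.compile(r"\.(exe|bat|sh)$", re.IGNORECASE)

# "shell." followed by fullwidth e, x, e (U+FF45, U+FF58, U+FF45)
filename = "shell.\uff45\uff58\uff45"

blocked_raw = bool(BLOCKED.search(filename))

# What the name becomes if a later layer applies NFKC normalization.
normalized = unicodedata.normalize("NFKC", filename)
blocked_normalized = bool(BLOCKED.search(normalized))

print(blocked_raw)         # False
print(normalized)          # shell.exe
print(blocked_normalized)  # True
```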

04 Vulnerable & Fixed Patterns

Vulnerable pattern (Python)

import sqlite3

def find_user(user_input):
    # user_input is the raw value of request.args.get('username');
    # it is assumed to be UTF-8, but the encoding is never handled explicitly
    if len(user_input) > 50:
        return "Username too long"

    # Later, the database layer may apply a different encoding assumption,
    # and the raw value is interpolated straight into the SQL text
    conn = sqlite3.connect(':memory:')
    query = f"SELECT * FROM users WHERE name = '{user_input}'"
    cursor = conn.execute(query)
    return cursor.fetchall()

Why it's vulnerable:
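The length check runs on the raw string, before any decoding or normalization, so it does not describe the value the database finally receives. The query is then built by string interpolation, so any quote character in the input, including one that only appears after a later re-encoding or normalization step, is parsed as SQL rather than as data. At that point the validation is meaningless and SQL injection becomes possible.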

Fixed pattern (Python)

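import sqlite3
import unicodedata

def find_user(user_input):
    # user_input is the raw value of request.args.get('username', '')
    # Replace code points that cannot be encoded as UTF-8 (lone surrogates),
    # then normalize to a single canonical Unicode form before any check
    user_input = user_input.encode('utf-8', errors='replace').decode('utf-8')
    user_input = unicodedata.normalize('NFKC', user_input)

    if len(user_input) > 50:
        return "Username too long"

    # Use parameterized queries to separate data from code
    conn = sqlite3.connect(':memory:')
    query = "SELECT * FROM users WHERE name = ?"
    cursor = conn.execute(query, (user_input,))
    return cursor.fetchall()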
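
To make the difference between the two Python patterns observable, the following self-contained sketch runs both query styles against a small in-memory table. The table contents and the injected string are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

user_input = "nobody' OR '1'='1"

# String interpolation: the quote in the input ends the literal early.
interpolated = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'"
).fetchall()

# Parameter binding: the whole input is compared as a single value.
bound = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(interpolated))  # 2
print(len(bound))         # 0
```

With binding, the input never becomes part of the SQL text, so its encoding cannot change the shape of the statement.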

Vulnerable pattern (PHP)

<?php
$filename = $_FILES['upload']['name'];
// Check for dangerous extensions
if (preg_match('/\.(exe|bat|sh)$/i', $filename)) {
    die("Blocked extension");
}
// But don't normalize encoding before the check
move_uploaded_file($_FILES['upload']['tmp_name'], "/uploads/$filename");
?>

Why it's vulnerable:
The extension check assumes a specific encoding and character interpretation, but the filesystem may interpret the filename differently if encoding is inconsistent. An attacker could craft a filename that passes the regex in one encoding but becomes dangerous after the filesystem processes it.

Fixed pattern (PHP)

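<?php
$filename = $_FILES['upload']['name'];
// Transliterate UTF-8 to ASCII; iconv() returns false if the input is invalid
$filename = iconv('UTF-8', 'ASCII//TRANSLIT', $filename);
if ($filename === false) {
    die("Invalid filename encoding");
}
// Drop every character outside a small allowlist
$filename = preg_replace('/[^a-zA-Z0-9._-]/', '', $filename);
if ($filename === '') {
    die("Invalid filename");
}

// Check for dangerous extensions on the normalized name
if (preg_match('/\.(exe|bat|sh)$/i', $filename)) {
    die("Blocked extension");
}
move_uploaded_file($_FILES['upload']['tmp_name'], "/uploads/$filename");
?>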

05 Prevention Checklist

Declare encoding explicitly at every boundary: HTTP headers (Content-Type: text/html; charset=utf-8), database connections, and file operations.

Normalize input early by converting all user input to a canonical encoding (typically UTF-8) before any validation or processing.

Use parameterized queries and prepared statements to separate data from code, so that re-encoded input cannot change the structure of a query.

Validate after decoding and normalization, not before: ensure security checks operate on the final, normalized form of data.

Test with non-ASCII characters during development: include accented characters, emoji, and multi-byte sequences in test cases.

Use framework-provided encoding functions (e.g., htmlspecialchars() in PHP, html.escape() in Python) rather than manual string manipulation.
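
Several of these checklist items can be combined into one small helper. The sketch below is one possible arrangement, not a definitive implementation: `canonicalize`, `validate_username`, and the allowlist pattern are hypothetical names and rules chosen for illustration, and the choice of NFKC as the canonical form is an assumption that may not suit every application.

```python
import re
import unicodedata

# Hypothetical allowlist: lowercase ASCII letters, digits, underscore.
USERNAME_RE = re.compile(r"[a-z0-9_]{1,50}")


def canonicalize(raw: bytes) -> str:
    """Decode strictly as UTF-8, then apply NFKC normalization."""
    # Strict decoding raises UnicodeDecodeError on malformed or overlong bytes.
    text = raw.decode("utf-8", errors="strict")
    return unicodedata.normalize("NFKC", text)


def validate_username(raw: bytes) -> str:
    """Return the canonical username, or raise ValueError."""
    try:
        name = canonicalize(raw)
    except UnicodeDecodeError as exc:
        raise ValueError("input is not valid UTF-8") from exc
    # The check runs on the canonical form, never on the raw bytes.
    if not USERNAME_RE.fullmatch(name):
        raise ValueError("username contains disallowed characters")
    return name


# Fullwidth "alice" is folded to ASCII before the check runs.
print(validate_username("\uff41\uff4c\uff49\uff43\uff45".encode("utf-8")))  # alice
```

Because validation sees only the canonical form, a fullwidth apostrophe (U+FF07) is folded to an ASCII apostrophe and then rejected by the allowlist, instead of slipping past as an unfamiliar character.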

06 Signs You May Already Be Affected

Look for unexpected characters or mojibake (garbled text) in logs or user-submitted data. Check for files or database records with names that appear corrupted or contain unusual byte sequences. Review access logs for requests with unusual percent-encoded characters or multi-byte sequences that bypass your filters but appear to execute successfully.
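
A log review of this kind can be partly automated. The following sketch flags access-log lines that contain double percent-encoding or percent-encoded overlong UTF-8 lead bytes. The two patterns and the function name `flag_request` are illustrative assumptions; they are heuristics that can produce false positives and do not form a complete detection rule set.

```python
import re

# Heuristic patterns; illustrative assumptions, not a complete list.
SUSPICIOUS = {
    # "%25" followed by two hex digits suggests an already-encoded "%XX".
    "double_encoding": re.compile(r"%25[0-9a-fA-F]{2}"),
    # Lead bytes 0xC0/0xC1 only occur in overlong (invalid) UTF-8 sequences.
    "overlong_utf8": re.compile(r"%c[01]%[89ab][0-9a-f]", re.IGNORECASE),
}


def flag_request(line: str) -> list[str]:
    """Return the names of the heuristics that match one access-log line."""
    return [name for name, pattern in SUSPICIOUS.items() if pattern.search(line)]


print(flag_request("GET /a/%252e%252e/secret HTTP/1.1"))  # ['double_encoding']
print(flag_request("GET /scripts/..%c0%af../x HTTP/1.1"))  # ['overlong_utf8']
print(flag_request("GET /index.html HTTP/1.1"))            # []
```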

07 Related Recent Vulnerabilities

CVE-2026-42926: NGINX ngx_http_proxy_v2_module vulnerability (CVSS 5.8/10, MEDIUM)

CVE-2024-48909: SpiceDB calls to LookupResources using LookupResources2 with caveats may return context is missing when it is not (CVSS 2.0/10, LOW)

CVE-2021-33604: Reflected cross-site scripting in development mode handler in Vaadin 14, 15-19 (CVSS 2.5/10, LOW)

CVE-2019-12677: Cisco Adaptive Security Appliance Software SSL VPN Denial of Service Vulnerability (CVSS 7.7/10, HIGH)