Quite frankly, I cannot count how many times I’ve seen these three terms mixed up by folks at all levels in both the development and security fields. Hearing them misused, or treated as some sort of magical cure for all vulnerabilities (without any context on the proper and timely implementation of each control), makes me nervous. Oftentimes, the terms are mixed and matched without concern for their actual definitions. Let’s dive into Sanitizing vs Encoding vs Escaping.
Input sanitization is the process of making user input clean and safe by removing (or rejecting) any characters that are not permitted.
There are many ways to sanitize user input but the 2 most common ones are the use of whitelists and the use of blacklists.
Whitelist: With a whitelist, you specify a set of characters that are allowed, and all other characters are rejected. This is an effective way to sanitize input, but it can be difficult to implement if you have a lot of user input that needs to be accepted.
Blacklist: With a blacklist, you specify a set of characters that are not allowed, and all other characters are accepted. This is an easier way to sanitize input, but it can be less effective because it is possible to miss characters that should be rejected.
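To make the two approaches concrete, here is a minimal sketch in Python. The function names, the username-style whitelist, and the particular blacklisted characters are all assumptions chosen for illustration, not a recommended policy.

```python
import re

# Hypothetical whitelist policy: only letters, digits and underscore are allowed.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9_]")

# Hypothetical blacklist: a handful of characters that are special in HTML.
DENIED = set("<>\"'&")


def sanitize_whitelist(value: str) -> str:
    """Remove every character that is not explicitly allowed."""
    return NOT_ALLOWED.sub("", value)


def sanitize_blacklist(value: str) -> str:
    """Remove only the characters that are explicitly denied."""
    return "".join(ch for ch in value if ch not in DENIED)


sample = "bob<script>`x`"
print(sanitize_whitelist(sample))  # bobscriptx
print(sanitize_blacklist(sample))  # bobscript`x`
```

Note that the backtick survives the blacklist simply because nobody thought to list it, which is exactly the weakness described above.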
The best way to sanitize user input is to use a combination of both whitelists and blacklists. However, input sanitization as a whole is considered to be a complicated and error-prone process. Sanitization irreversibly removes data (it is literally deleted from the input), which often leaves the remaining data mangled. Additionally, it is hard to implement correctly and is often easily bypassed by attackers. For example, here is a classic (and, sadly, common) case of easily bypassable input sanitization:
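One illustrative sketch of such a bypass (not necessarily the exact case the author had in mind) assumes a hypothetical filter, `naive_sanitize`, that strips the literal string `<script>` in a single pass:

```python
def naive_sanitize(value: str) -> str:
    """Strip the literal string "<script>" once. Deliberately flawed."""
    return value.replace("<script>", "")


# The attacker nests the forbidden string inside itself.
payload = "<scr<script>ipt>alert(1)</script>"

# Removing the inner "<script>" joins the outer fragments back together.
print(naive_sanitize(payload))  # <script>alert(1)</script>
```

The filter removes exactly what it was told to remove, and in doing so reassembles the very tag it was meant to block. A mixed-case `<SCRIPT>` would also pass through this filter untouched.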
Input escaping is the process of marking characters that would otherwise be interpreted as syntax so that they are treated as plain data instead. A famous example of “escaping” is the ANSI “escape codes”. A more common example, which most folks who have worked with JSON have run into, is needing to put a double-quote character inside a JSON string value (which is itself enclosed in double quotes). If the double quote is added without an escape, it will break the syntax:
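As a small sketch of that failure, using Python's standard `json` module and a made-up `name` field:

```python
import json

# The inner quotes around Bob terminate the string early,
# so the parser sees stray text where it expects a comma or brace.
broken = '{"name": "Robert "Bob" Smith"}'

try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print("Invalid JSON:", err.msg)
```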
Therefore, the backslash escape character is prepended to the double-quote character in order to tell the parser to treat it just as a character and not the start or end of a string literal:
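A minimal sketch of the escaped form, again with Python's `json` module and a made-up `name` field:

```python
import json

# Raw string, so Python leaves the backslashes in place for the JSON parser.
escaped = r'{"name": "Robert \"Bob\" Smith"}'

data = json.loads(escaped)
print(data["name"])  # Robert "Bob" Smith

# Serializers apply the same escaping for you on the way out.
round_trip = json.dumps({"name": 'Robert "Bob" Smith'})
print(round_trip)  # {"name": "Robert \"Bob\" Smith"}
```

In practice it is usually safer to let a serializer such as `json.dumps` do the escaping than to insert backslashes by hand.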
Input escaping can be accomplished in a number of ways, depending on the type of data being input and the context in which it will be used. For example, if user input is going to be used in an SQL query, it might be escaped by doubling quote characters or prefixing them with a backslash (the exact syntax depends on the database) so that any quotes in the input are treated as literal quotes, rather than as part of the SQL code. In HTML, input might be escaped by replacing special characters with their HTML entity equivalents.
Input escaping is used in the prevention of multiple types of vulnerabilities, most notably injection attacks. One well-known example is SQL injection. In this type of attack, the attacker enters malicious SQL code into an input field in order to compromise the database. The SQL code is executed by the database, and the attacker can use it to do things like delete data, change passwords, or even take over the entire system. To protect against SQL injection by escaping, user input must be escaped before it is used in an SQL query, for example by doubling single quotes or prefixing quotes with a backslash, depending on the database. The database will then treat these quotes as literal characters, and the query will run as intended. Because manual escaping is easy to get wrong, parameterized queries (prepared statements), which keep the input out of the SQL text entirely, are generally the more robust defense.
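The following sketch shows the attack and two defenses using Python's built-in `sqlite3` module. The `users` table, its contents, and the helper `escape_sql_literal` are assumptions made up for illustration. SQLite escapes a single quote by doubling it rather than with a backslash, so that is the form used here; other databases differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

user_input = "nobody' OR '1'='1"

# Vulnerable: the input is pasted into the SQL text and changes the query logic.
unsafe_sql = "SELECT name FROM users WHERE name = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()


def escape_sql_literal(value: str) -> str:
    """Hypothetical helper: escape a string literal by doubling single quotes."""
    return value.replace("'", "''")


# Escaped: the quotes in the input are now literal characters inside the string.
escaped_sql = (
    "SELECT name FROM users WHERE name = '" + escape_sql_literal(user_input) + "'"
)
escaped_rows = conn.execute(escaped_sql).fetchall()

# Parameterized: the input never becomes part of the SQL text at all.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(escaped_rows), len(safe_rows))  # 2 0 0
```

The unescaped query returns every row because the injected `OR '1'='1'` is always true, while the escaped and parameterized versions look for a user literally named `nobody' OR '1'='1` and find none.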
In the security field, the term “escaping” is often used as a synonym of the term encoding. As you will see shortly, they are very similar in meaning.
In computing, input encoding is the process of transforming data so that it can be properly (and safely) read by an application. This transformation is necessary because computers only understand numbers, not letters or other characters.
In security, input encoding is used to ensure that data is properly formatted and free of any potentially harmful characters or code. This is especially important when accepting user input, as it can help to prevent security vulnerabilities.
There are a number of different input encoding schemes in use, each with its own advantages and disadvantages. The most common schemes are ASCII, Unicode, and UTF-8.
ASCII is the oldest and most basic form of input encoding. It is a 7-bit code, normally stored as a single byte per character, which limits it to a total of 128 characters (the 8-bit extensions that reach 256 characters are not part of ASCII itself). This is enough to cover the English alphabet, numerals, and common punctuation marks, but not much else.
Unicode is a more modern standard that supports a much larger range of characters. Strictly speaking, it is a character set rather than a single encoding: it defines over one million code points (1,114,112 in total), which are stored using encodings such as UTF-16 (two or four bytes per character) or UTF-32 (four bytes per character).
UTF-8 is a newer encoding scheme that is backward-compatible with ASCII. It uses a variable-length encoding that can use one to four bytes per character. This makes it very efficient for storing English text, but also allows for other languages to be represented as well.
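A quick sketch in Python makes the size differences visible. The four sample characters are arbitrary picks from different parts of the Unicode range.

```python
# Latin letter, accented letter, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))  # -le avoids counting a byte-order mark
    utf32_len = len(ch.encode("utf-32-le"))
    print(f"U+{ord(ch):04X}  utf-8={utf8_len}  utf-16={utf16_len}  utf-32={utf32_len}")

# UTF-8 is backward-compatible with ASCII: the bytes are identical for ASCII text.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")

# Anything outside the ASCII range cannot be encoded as ASCII at all.
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The UTF-8 lengths come out as 1, 2, 3 and 4 bytes respectively, while UTF-16 needs 2 bytes for the first three characters and 4 for the emoji, and UTF-32 always needs 4.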
Input encoding is used in the prevention of multiple types of vulnerabilities. One well-known example is cross-site scripting (XSS). In this type of attack, the attacker injects malicious code into a web page. When someone visits the page, the code is executed by the browser and can be used to do things like stealing cookies or redirecting the user to a malicious site. To protect against XSS attacks, user input must be encoded before it is used in an HTML page. This can be done by replacing special characters with their HTML entity equivalents. For example, the < character would be replaced with &lt; and the > character would be replaced with &gt;.
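As a minimal sketch, Python's standard `html.escape` function performs this replacement. The comment text and the surrounding `<p>` tag are made up for illustration.

```python
import html

user_comment = '<script>alert("xss")</script>'

# html.escape replaces &, < and > (and, by default, both quote characters).
encoded = html.escape(user_comment)
page = f"<p>{encoded}</p>"

print(page)
# <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

The browser now renders the comment as visible text instead of executing it. This encoding is only appropriate for HTML element content and quoted attribute values; other contexts need their own encoding.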
Which is best? Encoding, Escaping, or Sanitizing?
Contextually encoding untrusted values at the time of usage is generally considered the preferred method from a security standpoint. The challenging part is understanding the output context of the data (HTML body, attribute, URL, JavaScript, SQL, and so on) and knowing which encoding applies to it, because an encoding that is safe in one context offers no protection in another.
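To illustrate what "contextual" means, the sketch below encodes the same untrusted value for three different output contexts using only Python's standard library. The sample value is made up, and the choice of these three contexts is an assumption for illustration.

```python
import html
import json
from urllib.parse import quote

untrusted = 'Tom & "Jerry" <b>'

# HTML element content or quoted attribute value: HTML entity encoding.
html_body = html.escape(untrusted)

# URL query parameter value: percent-encoding (safe="" also encodes "/").
url_param = quote(untrusted, safe="")

# JSON string value: JSON escaping, including the surrounding quotes.
json_value = json.dumps(untrusted)

print(html_body)   # Tom &amp; &quot;Jerry&quot; &lt;b&gt;
print(url_param)   # Tom%20%26%20%22Jerry%22%20%3Cb%3E
print(json_value)  # "Tom & \"Jerry\" <b>"
```

Each result is safe only in its own context. The JSON form, for instance, still contains raw `<` and `>` characters, so it would need further encoding before being placed inside an HTML page.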