What is 'taint analysis' and why do I care?

by G. Ann Campbell

He covered a wet, hacking cough with his hand, then pushed through the door off the ward. I reached the same door, and hesitated. The Cougher had just tainted the door with his germs. If I touched it, I'd be tainted too.


These days we all know what germs are and how they're passed from person to person, and from hand to door to hand. The fact is that, particularly in cold and flu season, you have to regard every doorknob and every elevator button as suspicious. You always wash your hands afterward, because you never know which doorknob is tainted with germs. You have to assume they all are.

And the same is true for the data you get from your users. Not every user is a bad actor. In fact, most aren't. But some are. Some want to infect your systems - to get access to your users, their passwords, their mothers' maiden names, and anything else they can sell - and they'll do anything to accomplish that. So you have to treat every user's data as if it contained The Plague, and sanitize accordingly.

Unfortunately, in large systems that's easier said than done. First you have to find all the places you accept data from users, and then you have to sanitize the data before you use it. The hard part is making sure you've found all the sources of user data and intervened before any kind of use. That's where taint analysis comes in. 

Taint analysis identifies every source of user data - form inputs, headers, you name it - and follows each piece of data all the way through your system to make sure it gets sanitized before you do anything with it. And by "all the way through" I mean all the way through. Here's a simple example from the OWASP Benchmark project, an intentionally insecure application built to test analyzers:

Here, SonarQube shows us that

  • At line 47, data provided by the user is retrieved and assigned to the variable 'param'. 'param' is now tainted by user input.
  • At line 51, 'param' gets manipulated - but not sanitized! It's still tainted.
  • At line 54, 'param' is incorporated into the value of 'sql'. 'sql' is now tainted too!
  • At lines 58-59, 'sql', which is tainted with raw user input, is sent to the database :-(
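
If you'd rather see that flow as code, here's a minimal sketch of the same four steps in Python. To be clear, this is not the OWASP Benchmark code (that project is Java); the table, the function names, and the attack string are all made up for illustration, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3

def make_db():
    # Hypothetical in-memory table, for illustration only.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    db.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return db

def lookup_unsafe(db, request):
    param = request["name"]   # source: 'param' is tainted by user input
    param = param.strip()     # manipulated, but not sanitized: still tainted
    sql = "SELECT secret FROM users WHERE name = '" + param + "'"  # 'sql' is tainted too
    return db.execute(sql).fetchall()  # sink: tainted SQL reaches the database

def lookup_safe(db, request):
    param = request["name"].strip()
    # A bound parameter is passed to the database as data, never as SQL text.
    return db.execute(
        "SELECT secret FROM users WHERE name = ?", (param,)
    ).fetchall()

db = make_db()
attack = {"name": "nobody' OR '1'='1"}  # hypothetical malicious input
print(lookup_unsafe(db, attack))
print(lookup_safe(db, attack))
```

With the attack string, the unsafe version hands back every secret in the table, while the safe version returns nothing because no user has that literal name.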

Of course, in that example, everything is contained in a single method. The problem is easy to spot... if you know what to look for… and where to look… and that you should look.

So let's look at something slightly more complicated. This one's from Securibench micro, another test-the-analyzers project:

Here, in the 'doGet' method, user-supplied data is stored in a collection. Then, in another method in a different file, it's retrieved from the collection and sent to the database. Again, without being sanitized. In the SonarQube UI this example is easy to understand because all the relevant files are shown together, with each propagation of the taint highlighted, but it would be much harder than the first example to find manually. That's because, if you start from the 'doGet' method, you have to find every place the method is called from and then follow the data it returns until it's no longer "live" to make sure it's not misused. On the other hand, you could start from the other end and go backward to the source of every value sent to this "sink" (the place where the data is stored or used). That might be a little cleaner, but it's no less painful.
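
To give a feel for what "going backward from the sink" involves, here's a rough sketch in Python. It assumes someone has already written down which value was derived from which; that hand-built map, and every name in it, is hypothetical and not taken from Securibench micro. Working out that map across files is exactly the tedious part an analyzer does for you.

```python
from collections import deque

# Hypothetical data-flow map: each value lists the values it was derived from.
DERIVED_FROM = {
    "executeQuery(sql)": ["sql"],
    "sql": ["query_prefix", "name"],
    "name": ["names.get(i)"],
    "names.get(i)": ["names.add(param)"],  # the collection links the two files
    "names.add(param)": ["param"],
    "param": ["request.getParameter"],
    "query_prefix": ["string literal"],
}
USER_SOURCES = {"request.getParameter"}

def trace_back(sink):
    """Walk backward from a sink; return each path that starts at a user source."""
    paths = []
    queue = deque([[sink]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in USER_SOURCES:
            paths.append(list(reversed(path)))  # report in source-to-sink order
            continue
        for origin in DERIVED_FROM.get(node, []):
            if origin not in path:  # guard against cycles
                queue.append(path + [origin])
    return paths

for path in trace_back("executeQuery(sql)"):
    print(" -> ".join(path))
```

The harmless branch (the query prefix, which comes from a string literal) dead-ends quietly. The only path reported is the one that runs from the user's request, through the collection, to the query.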

And that's why you want taint analysis. Because it traces user-tainted data from its source to your sinks, and raises the alarm when you use that data without sanitizing it. It helps you protect your data, your users, and your reputation from hackers and accidents.
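
Here's a toy version of that idea in Python, just to show the shape of it. It is nowhere near a real analyzer, and certainly not how SonarQube is implemented: it only handles a straight-line list of statements, with no branches, method calls, or collections. The statement format and the function names ('get_parameter', 'transform', 'escape_sql', 'execute_query') are assumptions invented for this sketch; the line numbers echo the first example.

```python
SOURCES = {"get_parameter"}    # functions that return user-controlled data
SANITIZERS = {"escape_sql"}    # functions whose result is considered clean
SINKS = {"execute_query"}      # functions that must never see tainted data

def find_taint_flows(program):
    """Return one list of line numbers per tainted value that reaches a sink."""
    tainted = {}   # variable name -> lines the taint has travelled through
    findings = []
    for line, target, func, args in program:
        # Trail of the first tainted argument, if there is one.
        trail = next((tainted[a] for a in args if a in tainted), None)
        if func in SINKS:
            if trail is not None:
                findings.append(trail + [line])  # raise the alarm
            continue
        if func in SOURCES:
            tainted[target] = [line]             # taint starts here
        elif func in SANITIZERS or trail is None:
            tainted.pop(target, None)            # result is clean
        else:
            tainted[target] = trail + [line]     # taint propagates

    return findings

# Each statement: (line, assigned variable, function called, argument names)
vulnerable = [
    (47, "param", "get_parameter", []),
    (51, "param", "transform", ["param"]),
    (54, "sql", "concat", ["query_prefix", "param"]),
    (58, None, "execute_query", ["sql"]),
]
sanitized = list(vulnerable)
sanitized[1] = (51, "param", "escape_sql", ["param"])

print(find_taint_flows(vulnerable))  # [[47, 51, 54, 58]]
print(find_taint_flows(sanitized))   # []
```

Swap the harmless 'transform' step for a sanitizer and the finding disappears, which is the whole point: the alarm only goes off when tainted data reaches a sink without being cleaned first.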

Taint analysis of Java, C#, PHP, and Python is free on SonarCloud for open source projects, and available in SonarQube commercial editions as part of SonarSource's larger SAST (Static Application Security Testing) offering. Later in 2020, SonarSource's SAST offering will expand to include JavaScript, TypeScript, C and C++.