Stale-State Serverless Security Flaws

I'd like to introduce a possibly new class of serverless security flaws with a story.

For our training business at we45, we orchestrate servers. A lot of servers. These servers are used by our many students to practice and learn concepts around Application Security, Cloud Security, and Kubernetes Security, among other things.

Last week, we rewrote a large chunk of this orchestrator code, as we moved to a different cloud vendor. To ensure that server names don't collide and cause errors, we use a PRNG to generate a reasonably random string value that we append to the server name when it's being spun up over the cloud provider's API.

Our orchestrator runs on AWS Lambda (Functions as a Service) and we've been running a different version of this orchestrator for a really long time now. The orchestrator service was undergoing changes in this iteration.

The orchestrator framework had been subjected to a battery of unit tests and integration tests in CI/CD, all of which had passed with flying colors. However, a problem still slipped through. See if you can find the issue:

server_id = generate(size=21) # global var that generates a random str

def launch_server(event, context):
	server_name = "bti-{}".format(server_id)

During our tests, this code performed as intended. The global variable would generate a new random value every single time when invoked. It was deployed successfully to Lambda and it was working.

As we were running a final set of tests, we realized that there was a HUGE issue. We were testing the orchestrator in quick succession, with lots of servers being spun up and destroyed. And every one of them had the same ID!!

My first thought went to a seed that I had not initialized correctly or something like that. I had made the same mistake recently with some Go code that I wrote, and suspected that it might be the case here as well.

But that was NOT it. The library worked just fine. No insecure or undefined security parameters. I figured that using a global variable might possibly be the reason for this and changed the code to this:

def gen_random_id():
	return generate(size=21)

def launch_server(event, context):
	server_name = "bti-{}".format(gen_random_id())

And it worked just fine after. But I was a little concerned. I started reading more about Lambda's handling of Global Variables and came across a couple of blogs (footnote) where they've discussed the same problem. The issue apparently has to do with Lambda's caching of Global variables when the Lambda function is warm.
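
The difference between the two versions can be reproduced locally. The following is a minimal sketch, assuming that repeated calls within one Python process behave like warm invocations of the same execution environment. It uses the standard library's `secrets.token_hex` as a stand-in for the `generate` call above, and the handler names are hypothetical.

```python
import secrets

# Module-level code: runs once when the execution environment is initialised
# (cold start); every warm invocation then reuses the same value.
STALE_ID = secrets.token_hex(8)


def stale_handler(event, context):
    # Reads the value computed at import time, so it is identical on every call.
    return "bti-{}".format(STALE_ID)


def fresh_handler(event, context):
    # Generates the value inside the handler, so it is new on every call.
    return "bti-{}".format(secrets.token_hex(8))


if __name__ == "__main__":
    # Three calls in one process stand in for three warm invocations.
    print([stale_handler({}, None) for _ in range(3)])
    print([fresh_handler({}, None) for _ in range(3)])
```

Running this should print the same name three times for the first handler and three different names for the second.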

What are Lambda Warm and Cold Functions?

Lambda (and nearly every other Cloud Functions as a Service provider) has the concept of Warm and Cold Starts.

Lambda basically follows a "per-execution" billing model. This means that even after you deploy your application, it doesn't bill you or perform any operations until the function is invoked.

When a function is invoked for the first time, or only infrequently, it "cold starts": a dedicated execution environment is prepared for the function, the code and its dependencies are downloaded to that environment (probably an EC2 instance running the Firecracker MicroVM), and the code executes and produces whatever output you've defined. This cold start takes approximately 0.8 seconds, since the function is being "thawed" from "cold storage".

However, if it's a frequently invoked function, the function remains "warm". Essentially, the container that runs the Lambda is still available in the execution environment. In this case, instead of downloading the code and its dependencies, loading the function into memory, and so on, the function "warm starts", which essentially means that only the handler code runs. Naturally, it's much faster when a function is kept warm.
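
The cold/warm distinction can be modelled in a few lines of plain Python. This is only a toy sketch, not an AWS API: the `ExecutionEnvironment` class and `FUNCTION_SOURCE` string are hypothetical, and "cold start" is approximated by executing the module source in a fresh namespace.

```python
# Hypothetical function source, standing in for a deployed Lambda module.
FUNCTION_SOURCE = '''
import os
token = os.urandom(8).hex()   # init code: runs once per environment

def handler(event, context):
    return token
'''


class ExecutionEnvironment:
    """Toy model of one execution environment (an assumption, not AWS code)."""

    def __init__(self, source):
        # Cold start: load and run the module-level code in a fresh namespace.
        self.namespace = {}
        exec(source, self.namespace)

    def invoke(self, event):
        # Warm start: only the handler runs; init code is not re-executed.
        return self.namespace["handler"](event, None)


def run_demo():
    env_a = ExecutionEnvironment(FUNCTION_SOURCE)   # cold start
    warm = [env_a.invoke({}) for _ in range(3)]     # three warm invocations
    env_b = ExecutionEnvironment(FUNCTION_SOURCE)   # a second cold start
    return warm, env_b.invoke({})
```

In this model, `run_demo()` returns three identical tokens from the warm environment and a different token from the freshly initialised one.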

Cold vs Warm Start

So, when a function is in its "warm" state, global variables are cached and effectively treated as constants: the module-level code that assigns them ran once, during the cold start, and is not run again on each invocation. When you're dealing with a real constant value, for example a static value that you're fetching from Amazon's SSM, that's fine. But once you are using a module-level global variable for some dynamic use case, especially a security-specific one like:

  • Generating random IDs
  • Generating salts for passwords
  • Generating Primary Key values, among others

you're likely to find that the integrity you rely on is, despite your best efforts, completely compromised.
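
One possible guard is a test that invokes the handler several times within the same process and checks that the results differ. The sketch below assumes the handler returns a hashable value and that the test imports the module only once. The helper `outputs_vary` and both handlers are hypothetical names used for illustration.

```python
import uuid


def outputs_vary(handler, invocations=5):
    """Return True if repeated calls in one process all yield distinct results."""
    results = [handler({}, None) for _ in range(invocations)]
    return len(set(results)) == len(results)


# Hypothetical handlers used only to exercise the check.
_MODULE_LEVEL_KEY = str(uuid.uuid4())   # computed once, at import time


def buggy_handler(event, context):
    # Returns the import-time value: stale across warm invocations.
    return _MODULE_LEVEL_KEY


def fixed_handler(event, context):
    # Generates a new key on every invocation.
    return str(uuid.uuid4())
```

Here `outputs_vary(buggy_handler)` is false and `outputs_vary(fixed_handler)` is true. A test suite that starts a new process for every single call would not see the difference.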

I realized that this could create a pretty nasty set of security vulnerabilities and exceptions for several developers, and in many cases, completely unknown to them, because they expect their global-vars to behave as global vars. Look at this ascii video for the problem with global vars.

I am making an HTTP GET request to an API Gateway path that triggers a Lambda function. This Lambda function generates a unique ID (probably an object reference for something the user does). Note how "unique" the unique ID actually is:

Serverless State Bug

Now, let's look at another story.

The Dynamic, Static Salt

Imagine you're building a sign up functionality for your application.

You have users signing up with their email and setting a password for your application.

You do the right thing. You decide to use BCrypt. You decide to set up a reasonably strong work factor and generate a dynamic salt for every single user password.

Your code looks something like this:

import json
import bcrypt

uppu = bcrypt.gensalt()

def gen_pass(event, context):
    data = json.loads(event.get('body'))
    userpw = str(data.get('password')).encode() #user entered password
    bpass = bcrypt.hashpw(password=userpw, salt=uppu).decode()
    body_dict = {"password": bpass} #this would go into the DB
    return {
        "statusCode": 200,
        "body": json.dumps(body_dict) #contrived example
    }

Your service is much-awaited and you expect a lot of user signups to hit your signup Lambda function. Your function remains "warm".

Naturally, users being users, will not all sign up with strong passwords. Lots of them are going to use crappy passwords like:

  • "password"
  • "letmein"
  • etc

But you've thought about that. And while you'd like to enforce strong passwords, you want to make sure that it's more convenient for your users. Besides, you use BCrypt, which is supposed to be pretty secure, even in the face of poor passwords. And with per-password salts, you're quite happy that things will be much more secure, especially in comparison with a plain hashing function.

However, because of Lambda's stale state, now your supposedly dynamic salt (line 4 from above) is actually a static salt.

So if 100 users have the password "password", you now have:

http POST https://XXXXXXX.execute-api.us-east-1.amazonaws.com/dev/gen-pass password=password

{
    "password": "$2b$12$TBJUCJ7i2LXUPxfdFiR7.uljlxNpXXx4AJW2mls0T6TBeFB8CZ3sC"
}

http POST https://XXXXXXX.execute-api.us-east-1.amazonaws.com/dev/gen-pass password=password

{
    "password": "$2b$12$TBJUCJ7i2LXUPxfdFiR7.uljlxNpXXx4AJW2mls0T6TBeFB8CZ3sC"
}

http POST https://XXXXXXX.execute-api.us-east-1.amazonaws.com/dev/gen-pass password=password

{
    "password": "$2b$12$TBJUCJ7i2LXUPxfdFiR7.uljlxNpXXx4AJW2mls0T6TBeFB8CZ3sC"
}
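
The effect of a static versus a per-call salt can be shown without deploying anything. The sketch below is a stand-in, not the PoC code: it uses the standard library's `hashlib.pbkdf2_hmac` instead of bcrypt so that it runs without extra dependencies, and the function names and iteration count are illustrative assumptions.

```python
import hashlib
import os

ITERATIONS = 10_000  # illustrative work factor, not a recommendation

# Module level: fixed for the life of the execution environment.
STATIC_SALT = os.urandom(16)


def hash_with_static_salt(password: str) -> str:
    # Every password is hashed with the same salt, as in the stale-state case.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), STATIC_SALT, ITERATIONS)
    return STATIC_SALT.hex() + "$" + digest.hex()


def hash_with_fresh_salt(password: str) -> str:
    # The salt is generated inside the function, so it is new for each password.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt.hex() + "$" + digest.hex()
```

Two calls to `hash_with_static_salt("password")` return the same string, while two calls to `hash_with_fresh_salt("password")` do not. With bcrypt, the corresponding change would be to call `bcrypt.gensalt()` inside `gen_pass` rather than at module level.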

Additional Experiments

We have simulated this vulnerability in the following additional ways:

  • To rule out possibilities of some kind of user/IP specific caching, 4 of my colleagues and I ran the same example from different locations and different networks and we had the same results. The same bcrypt hash (and salt) was generated till the time the function was re-initialized from a cold-start.
  • To rule out possibilities of different results from the urandom function of the Operating System, we replaced the bcrypt salt function with the raw os urandom value, with the same results.
  • To rule out possibilities of it being a language/platform specific bug, we simulated this with NodeJS and Python, with the same results.

Attack Vector Possibilities

I definitely see that this could be a vulnerability that can be leveraged to perform:

  • Insecure Direct Object Reference attacks
  • Password-based attacks, possibly making offline compromises easier
  • Any other attack that relies on integrity bypasses/compromises
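
To illustrate why a shared salt could make offline compromise easier, here is a small sketch with entirely made-up data. It assumes an attacker has obtained a table of hashes that were all produced with the same salt. `SHARED_SALT`, `kdf`, `leaked` and `crack` are hypothetical names, and PBKDF2 with a low iteration count stands in for bcrypt so the example runs quickly.

```python
import hashlib
from collections import defaultdict

SHARED_SALT = b"stale-salt-value"   # hypothetical salt reused for every user
ITERATIONS = 1_000                   # kept low so the sketch runs quickly


def kdf(password: str) -> str:
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode(), SHARED_SALT, ITERATIONS
    ).hex()


# Hypothetical leaked table: user -> stored hash (all made-up data).
leaked = {
    "alice": kdf("password"),
    "bob": kdf("letmein"),
    "carol": kdf("password"),
    "dave": kdf("x9!Lq#27vT"),
}


def crack(leaked_hashes, wordlist):
    # Group users by stored hash; identical hashes mean identical passwords.
    by_hash = defaultdict(list)
    for user, stored in leaked_hashes.items():
        by_hash[stored].append(user)
    # One KDF computation per candidate word covers every user at once,
    # because all rows share the same salt.
    recovered = {}
    for word in wordlist:
        for user in by_hash.get(kdf(word), []):
            recovered[user] = word
    return recovered
```

With per-password salts, the attacker would instead have to repeat the expensive hash for every (user, candidate) pair, and identical passwords would no longer show up as identical hashes.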

Info from AWS

I have not come across a specific mention of this in any AWS document, let alone a security document. I'm not saying it's not there, only that I have not seen it. I have looked in their documents on Lambda Best Practices and the AWS Lambda Security Model. There's definitely no mention of this there, including the security implications it has.

I have come across some blogs that talk about the global variable caching issue, but they've not (at least ostensibly) connected it to a security vulnerability.

PoC Code

we45/serverless-state-state

References