I was recently working on a project that contained secrets in source control. The team was aware of this but had never been able to allocate time to get rid of them. Circumstances changed and I was tasked with cleansing the repository. I was still unfamiliar with the codebase, so I started looking around for config files. I quickly realised this approach would not work:
- Some secrets were hard-coded directly in the code
- Some secrets had previously been committed to source control but had since then been removed
I needed a tool that would not only attempt to identify secrets but would also do so over the complete Git history.
Identify secrets in code
While git-secrets found AWS access keys, it missed pretty much everything else (private keys, API keys for other services…). My next pick was truffleHog.
Based on the name alone I had a clear winner. truffleHog uses both entropy and known patterns to attempt to find secrets. This approach results in a high number of false positives, but it is also the only one that discovered credentials I was unaware of.
For entropy checks, truffleHog will evaluate the shannon entropy for both the base64 char set and hexadecimal char set for every blob of text greater than 20 characters comprised of those character sets in each diff. If at any point a high entropy string >20 characters is detected, it will print to the screen.
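To make the entropy check more concrete, here is a minimal Python sketch of the idea described above. It is not truffleHog's actual code: the function names are hypothetical, and the thresholds (4.5 bits per character for base64, 3.0 for hexadecimal) are assumptions chosen for illustration.

```python
import math
from collections import Counter

BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
HEX_CHARS = "1234567890abcdefABCDEF"


def shannon_entropy(data):
    """Return the Shannon entropy of data in bits per character."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def runs_of(text, charset, min_length=20):
    """Yield maximal runs of charset characters longer than min_length."""
    run = []
    for ch in text + "\0":  # the sentinel is in neither charset and flushes the last run
        if ch in charset:
            run.append(ch)
        else:
            if len(run) > min_length:
                yield "".join(run)
            run = []


def suspicious_strings(text, b64_threshold=4.5, hex_threshold=3.0):
    """Return runs whose entropy exceeds the (assumed) threshold for their charset."""
    found = []
    for charset, threshold in ((BASE64_CHARS, b64_threshold), (HEX_CHARS, hex_threshold)):
        for candidate in runs_of(text, charset):
            if shannon_entropy(candidate) > threshold:
                found.append(candidate)
    return found
```

A short, low-entropy value such as `hunter2` never reaches the entropy test because it is under the length limit, which is exactly the blind spot described in the warning that follows.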
Warning: no automated approach will uncover all the secrets. There is no way to prevent developers from creating short secrets with low entropy and using them in production. Your best hope in this case is that those secrets were committed together with stronger secrets and that they will appear in the output of truffleHog.
Running truffleHog on a repository is an iterative process. The first run will yield an enormous number of results, which will be impossible to review thoroughly by hand. The goal of the initial phase should be to discard files which are unlikely to contain secrets. Package manager lock files and SVGs are amongst those files.
A good starting point to reduce the volume of the haystack is to use this exclude file:
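As a rough illustration of how an exclude file behaves, the Python sketch below loads one regular expression per line and checks which paths would be skipped. The patterns are hypothetical examples covering the file types mentioned above, not the author's file. Treating `#` lines as comments and matching from the start of the path are assumptions of this sketch.

```python
import re

# Hypothetical exclude file content: one regular expression per line.
EXCLUDE_FILE_CONTENT = r"""
# lock files generated by package managers
.*package-lock\.json$
.*yarn\.lock$
# vector images
.*\.svg$
"""


def load_patterns(content):
    """Compile every non-empty line that is not a comment."""
    patterns = []
    for line in content.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def is_excluded(path, patterns):
    """Return True when any pattern matches from the start of the path."""
    return any(p.match(path) for p in patterns)
```

Checking a handful of real paths from the repository against the patterns before a long scan is a cheap way to catch a typo in a regular expression.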
truffleHog is written in Python and distributed using pip. If you’re like me and have no idea what those words mean, the quickest way to get started is to use Docker. Browse to the directory where the Git repository is located and run the following container:
<output-and-settings-directory> should contain the exclude file we created previously.
Running the previous command will give you a bash session within a container with Python installed. You’ll then need to install truffleHog and run it:
truffleHog to look for known patterns (ranging from private keys through AWS access keys to API keys). The switch --exclude_paths points to the exclude file we created previously (in this instance I named it
truffleHog expects to be looking at a remote Git repository but you can direct it to your file system by using
truffleHog takes some time (13 minutes on a repository with many thousands of commits) but by being cheeky we’ll be able to reduce the number of runs required.
truffleHog outputs its results to the terminal. The potential secrets are coloured in bright yellow using ANSI escape codes:
truffleHog gets over-enthusiastic and surrounds a potential secret with many ANSI escape codes:
False positives litter the output. In the screenshot above, the secret is actually a portion of the path of an S3 object. I decided to post-process truffleHog’s output using C#, but you could use any language to do so. In the LINQPad script below I:
- Replace duplicate ANSI escape codes by a single one
- Remove ANSI escape codes surrounding false positives and known secrets
This script runs in a few seconds and you’ll be able to iterate quickly.
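The same two post-processing steps can be sketched in Python. This is not a port of the LINQPad script. It assumes the highlight is the bright yellow sequence `ESC[93m` closed by the reset sequence `ESC[0m`, and all function names are hypothetical.

```python
import re

HIGHLIGHT = "\x1b[93m"  # assumed bright yellow start code
RESET = "\x1b[0m"       # assumed reset code


def collapse_duplicate_codes(text):
    """Replace runs of identical escape codes with a single code."""
    text = re.sub("(?:" + re.escape(HIGHLIGHT) + ")+", HIGHLIGHT, text)
    text = re.sub("(?:" + re.escape(RESET) + ")+", RESET, text)
    return text


def unhighlight(text, values_to_discard):
    """Strip the highlight around false positives and already known secrets."""
    for value in values_to_discard:
        text = text.replace(HIGHLIGHT + value + RESET, value)
    return text


def highlighted_values(text):
    """List the values that are still highlighted and need a manual review."""
    pattern = re.escape(HIGHLIGHT) + "(.*?)" + re.escape(RESET)
    return re.findall(pattern, text)
```

Each cycle then consists of adding a reviewed value to `values_to_discard`, running the three functions again, and looking at whatever `highlighted_values` still returns.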
Visual Studio Code
- Search for
- Add the secret to the values to discard list (if it is an actual secret, write it down)
- Run the
- Return to step
After some cycles you’ll reach a much cleaner output. You might discover files you want to exclude from truffleHog (which would require you to run truffleHog again) or you could decide to discard those via scripting.
By now you should have a list of secrets and entire files that are secrets (private keys, license files…).
BFG Repo-Cleaner removes big files and secrets from your Git history. It requires Java 8; I already had it installed on my machine, but you could run it in Docker if you needed to.
The first step is to clone the repository as a bare repository:
A bare repository […] does not have a locally checked-out copy of any of the files under revision control. That is, all of the Git administrative and control files that would normally be present in the hidden .git sub-directory are directly present in the […] directory instead, and no other files are present and checked out.
You can clone a repository as a bare repository using the following command:
By convention the directory containing a bare repository should end with the suffix .git.
Copy the content of the project-backup.git directory into a directory called project-secrets.git (this is so that we don’t have to clone the repository again at every successive try).
You can then run BFG Repo-Cleaner with the following command:
truffleHog I identified that two directories (
useless-directory) have not been in use for quite some time. They contain many files I want to purge from the Git history. The --delete-folders switch is used to remove a directory and its content from history.
BFG Repo-Cleaner does not support full paths for directories and files, so you can only delete objects with a unique name, or delete all objects sharing the same name.
With --delete-files I’m deleting the backup from our production database amongst other files and all the PSD files. Another approach is to use the --strip-blobs-bigger-than switch to delete files bigger than a certain size.
--replace-text points to the secrets we found when running truffleHog. BFG Repo-Cleaner will replace them with the string ***REMOVED***. Each secret should be on its own line:
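If the secrets were collected in a script or a spreadsheet, a small helper can produce the file in the expected one-per-line layout. This is a minimal Python sketch. The function name and the values in the usage note are made up for illustration, and trimming whitespace around each value is an assumption that may not suit every secret.

```python
from pathlib import Path


def write_replacements_file(secrets, path):
    """Write one secret per line, skipping blanks and duplicates.

    Returns the number of secrets written.
    """
    unique = []
    for secret in secrets:
        secret = secret.strip()  # assumption: surrounding whitespace is not part of the secret
        if secret and secret not in unique:
            unique.append(secret)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)
```

For example, `write_replacements_file(["hunter2", "hunter2", "not-a-real-key"], "secrets.txt")` writes two lines to a hypothetical `secrets.txt`. Keep that file outside the repository and delete it once the clean-up is done.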
BFG Repo-Cleaner runs super quickly (a handful of seconds on the repository I was working on) but you need to scrutinise the output with care. If you get a warning about dirty files, this means some secrets are still present in HEAD: by default BFG Repo-Cleaner doesn’t modify the contents of the latest commit on your HEAD. Here is an example of such a warning:
You need to remove those secrets through a commit and then run BFG Repo-Cleaner again. Do not move on to the next step until this warning is gone. This is the output you should expect when the HEAD is clean:
Finally, let’s ensure that Git itself no longer stores anything about the objects we’ve just removed from history:
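The BFG documentation suggests expiring the reflog and then running an aggressive garbage collection. Below is a minimal Python sketch that wraps those two Git commands. It assumes `git` is on the PATH and that `repo_dir` is the bare clone being cleaned. The function name and the `runner` parameter are hypothetical, the latter added only so the commands can be inspected without touching a repository.

```python
import subprocess

# Commands suggested by the BFG documentation to drop unreachable objects.
CLEANUP_COMMANDS = [
    ["git", "reflog", "expire", "--expire=now", "--all"],
    ["git", "gc", "--prune=now", "--aggressive"],
]


def purge_unreachable_objects(repo_dir, runner=subprocess.run):
    """Run each cleanup command inside repo_dir, stopping on the first failure."""
    for command in CLEANUP_COMMANDS:
        # check=True makes subprocess.run raise if Git exits with an error.
        runner(command, cwd=repo_dir, check=True)
```

Calling `purge_unreachable_objects("project-secrets.git")` from the parent directory would run both commands against the working copy used in this walkthrough.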
The last step involves pushing the changes back to the remote. Quite often the master branch will be protected; you will need to lift this restriction before pushing:
You also need to keep in mind that you will lose the links between your Pull Requests, work items, builds… and commits, as the IDs identifying the commits will change. You should merge as many Pull Requests as you can before starting this process and warn your teammates that they will need to clone the repository again after you’re done.