I wrote git-format-staged to apply an automatic code formatter or linter to staged files. It ignores unstaged changes, and leaves those changes unstaged. When run in a Git pre-commit hook git-format-staged guarantees that committed files are formatted properly, and does not clobber unstaged changes if formatting cannot be applied to working tree files cleanly.
How I learned to love automatic formatting
I used to pay a lot of attention to code formatting. I would split or join lines, indent to just the right column, and so on. Then I learned about Prettier, and tried it out in my JavaScript projects. After setting up an editor plugin I could tap a key combination, and Prettier would do the same things that I was doing by hand in an instant without any thought on my part. The code did not wind up in exactly the style that I was used to - but it was formatted according to a set of consistent, sensible rules that I can live with. I realized that if I can trust a program to format my code nicely I can free up a not-insubstantial chunk of my attention. That leaves me with more mental capacity to devote to matters that actually require human input. I have not looked back!
Automatic formatting is not specific to JavaScript. The idea has been around for a while - but I credit the Golang community with boosting its popularity. Early on the community made gofmt a core part of Go programming conventions. The spotlight on gofmt encouraged wider adoption of code formatting in other languages. These days there is a tool available to prettily format just about every language out there.
Automatic formatting in a team
Automatic formatting can help a team to be more focused too. The formatter makes style expectations explicit, so people joining the team do not have to expend effort learning the style rules for a project. But if some contributors run a formatter and some do not then you get unnecessary code churn when two contributors work on the same file, and one contributor's formatting rewrites the other's recent changes. That can happen in a solo project too if you forget to run the formatter consistently. (When I forget I feel guilty about not following my own rules!) The most reliable way to ensure that all code committed to your project is formatted consistently is to run the formatter in a Git pre-commit hook.
The problem with content changes on pre-commit
The naïve way to format files in a pre-commit hook is to:
- get a list of files with changes that will be committed (staged files)
- run the formatter on those files
- run git add to stage any changes from formatting
The last step is important: if you do not add changes to the staging area after
formatting then changes from formatting will not make it into the commit. But
running git add non-interactively will stage the entire file, which will
irritate contributors who like to use git add --patch or a similar editor
feature to selectively stage changes to each file. A contributor might
deliberately leave some changes to a file unstaged if they want to save those
changes for a later commit, or if those changes contain debugging code. In the
worst case an unstaged change might include a password that the contributor
has temporarily pasted into a source file for use in development. If your
pre-commit hook runs git add then every unstaged change in the formatted files
will be committed, and the contributor might push that commit before they
realize what happened. Even if that does not happen you risk receiving
a sternly-worded issue report for interfering with another developer's
workflow!
Enter git-format-staged
I wrote git-format-staged to reconcile automatic formatting with partially staged files. Git-format-staged runs your formatter on the staged version of each file. It ignores unstaged changes, and leaves those changes unstaged. Or it can run a linter, which can be helpful if you do not want your linter to report problems in unstaged hunks.
You can use git-format-staged in any project that uses Git for version control. It is a standalone script, and its only dependencies are Git and Python. There are detailed instructions in the project's readme, but in case you want to get started right away here is the two-step process to set up automatic formatting in your JavaScript project:
Install git-format-staged from NPM. I recommend installing it as a development dependency of your project. While you are at it install a code formatter, and Husky, which will hook up your pre-commit script.
$ yarn add --dev git-format-staged prettier husky
Git does not provide a way to pre-install event hooks in repo clones. Husky
fills that gap. Did you know that by default npm packages can run arbitrary
code when the package is installed? Husky uses that power to copy a bunch of
hooks into .git/hooks/. Each hook checks for an npm script with
a corresponding name. Once Husky is installed all you need to do is to add
a script called "precommit" to your package.json file that formats your code:
"scripts": {
  "precommit": "git-format-staged --formatter 'prettier --stdin' '*.js'"
}
You can provide any command that you want for formatting, but it must be
"pipeable": it must read file content from stdin and write formatted code to
stdout.
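To make the "pipeable" requirement concrete, here is a minimal sketch of a command that satisfies it. The function name strip_trailing_ws is hypothetical, and it is only a stand-in for a real formatter: what matters is that it reads file content on stdin and writes the result to stdout without touching any file on disk.

```shell
# Hypothetical stand-in for a formatter: removes trailing whitespace.
# It reads the file content from stdin and writes the result to stdout,
# which is the contract described above.
strip_trailing_ws() {
  sed 's/[[:space:]]*$//'
}

# Feed content in through a pipe and read the formatted result back.
printf 'const foo = 1;   \nconst bar = 2;\t\n' | strip_trailing_ws
```

A command that only rewrites files in place, or that insists on a file name argument instead of reading stdin, would not fit this contract as it stands.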
Note that quotes are required around both the formatter command that you
provide and the patterns for files that you want to format. The file
patterns are similar to those in .gitignore, except that * matches files in
nested subdirectories. You can supply multiple patterns, and exclude files from
formatting with a pattern that starts with !. See the readme for details.
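The pattern rules can be approximated in a few lines of shell. This is a sketch of the semantics described above, not the tool's own code: the function name should_format is hypothetical, and it assumes that patterns are checked from left to right with the last matching pattern deciding the outcome.

```shell
# should_format PATH PATTERN...
# Succeeds (exit status 0) when PATH is selected by the patterns.
# Assumption for this sketch: patterns are checked left to right and the
# last pattern that matches decides, so a later "!" pattern can exclude.
# The body runs in a subshell, so its variables stay local and "exit"
# only leaves that subshell.
should_format() (
  path=$1
  shift
  status=1
  for pattern in "$@"; do
    case $pattern in
      '!'*) on_match=1; glob=${pattern#?} ;;  # leading "!" excludes
      *)    on_match=0; glob=$pattern ;;
    esac
    # In a case pattern "*" also matches "/", so '*.js' reaches into
    # nested subdirectories.
    case $path in
      $glob) status=$on_match ;;
    esac
  done
  exit "$status"
)

should_format src/lib/util.js '*.js' && echo "src/lib/util.js: format"
should_format vendor/lib.js '*.js' '!vendor/*' || echo "vendor/lib.js: skip"
```

The quotes around each pattern matter here for the same reason as on the real command line: without them the shell would expand *.js itself before the pattern reaches the function.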
And that's it! Happy coding!
Git objects and the index
Explaining exactly how git-format-staged works requires getting into some details of Git's internal operation. If you want a high-level view of how Git works behind the scenes I recommend reading The Git Parable. Git is complicated - but it is more approachable than you might think.
Put briefly, every version of every file that you commit is stored as a distinct
file in the Git object database in .git/objects/. Every object has
a unique name, which is the hash of its content. You can see the hashes of files
in your project with this command:
$ git ls-files --stage
100644 1fed445333e85fb9996542978fa56866de90a2fb 0 .flowconfig
100644 d95266a6abbfb88067c449565b3ed01ab08fc639 0 .gitignore
100644 0e81c64902c1e6d5455addac38a9c6a3f01c2190 0 .travis.yml
100644 792ca2246057929ed88cd5ecc02eda6f1472cea9 0 LICENSE
100644 ee9a2bc0c0226cff24154937e70ef8bb4599e25d 0 README.md
100644 721bafb9848dcc0e5bd5166e5a227adfd8ccfe92 0 commitlint.config.js
100755 3f1ffbb770142bd0b37eb1e855a742fb38c2cb8b 0 git-format-staged
100644 be28f28111a221450bb8b8f11e9a7f6fe397947d 0 no-main.js
100644 75e8c8c29ae642f52462a5286bcd342b089d4783 0 package-lock.json
100644 294e47fcffe47264e1a3d53678890cadac0b26ba 0 package.json
100644 236c1ff26fcd39f5efdf8b31c0c68f7b3839762f 0 test/git-format-staged_test.js
100644 408f26e1aa5450a8a42987f2f46df5041e3fdd75 0 test/helpers/git.js
From left to right those columns show each object's mode bits, object name
/ hash, stage number, and file path. The output of git ls-files --stage shows
the state of the Git index. The index (sometimes called the cache) can be
approximately described as the state of your repository content right now. It
initially matches the state of the most recent commit in your working branch.
Whenever you stage changes, the index changes. When you create a commit the
state in the index becomes the state of content in the new commit.
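If you want to pull those columns apart in a script, a small read loop is enough. The sketch below uses a hypothetical function name, describe_index_entries, and assumes plain paths: Git quotes paths that contain unusual characters unless you ask for -z output, which this loop does not handle.

```shell
# Reads "git ls-files --stage" style lines on stdin and labels each column.
# The stage number and the path are separated by a tab, and the default
# field splitting of "read" handles both spaces and tabs.
describe_index_entries() {
  while read -r mode object stage path; do
    printf '%s -> object %s (mode %s, stage %s)\n' \
      "$path" "$object" "$mode" "$stage"
  done
}

# Sample line copied from the listing above.
printf '100755 3f1ffbb770142bd0b37eb1e855a742fb38c2cb8b 0\tgit-format-staged\n' \
  | describe_index_entries
```

Inside a repository the same function can be fed directly with git ls-files --stage | describe_index_entries.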
Try creating a new file, and stage it:
$ echo "const foo = ()=>'foo'" > new_file.js && git add new_file.js
What happens when you run git add is that Git creates a new object in the
object database using the content of new_file.js, and adds an entry to the
index that points to the new object:
$ git ls-files --stage | grep new_file.js
100644 9a622ce1db369d03a7eaca94c4306b9b0f00429c 0 new_file.js
When you stage changes to a previously-committed file the process is similar: Git creates a new object with the latest content of the file, and changes the index entry to point to the new object.
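Because an object's name is the hash of its content, you can compute the name without asking Git at all. The sketch below assumes a repository that uses SHA-1 object names and a system with sha1sum installed; blob_name is a hypothetical helper. It hashes a short header, "blob", the size in bytes and a NUL byte, followed by the file content.

```shell
# blob_name FILE
# Prints the object name Git would give a blob with FILE's exact bytes:
# the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the
# content. Assumptions: sha1sum is installed, and the repository uses
# SHA-1 object names.
blob_name() (
  size=$(wc -c < "$1" | tr -d '[:space:]')
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
)

# An empty file always maps to the same well-known object name.
empty=$(mktemp)
blob_name "$empty"   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
rm -f "$empty"
```

Running blob_name on a file right after staging it should print the same name as the index entry, provided Git did not convert the content (line endings or filters) on the way in. The built-in git hash-object command does the same calculation.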
Now make another change to the same file, but do not stage the change:
$ echo "const bar = ()=>'bar'" >> new_file.js
Unstaged changes are not represented in the index, or in Git's object database.
If you check the index again you can see that it is still holding onto the first
version of new_file.js:
$ git ls-files --stage | grep new_file.js
100644 9a622ce1db369d03a7eaca94c4306b9b0f00429c 0 new_file.js
The object name in the index has not changed; and Git objects are immutable,
which means that the content of new_file.js in the index is the
same as before. You can verify this by dumping the content of the object:
$ git cat-file -p 9a622ce1db369d03a7eaca94c4306b9b0f00429c
const foo = ()=>'foo'
What this shows us is that the staged version of a file is a file on disk, distinct from the working tree version of the file. It just happens that the staged version exists in the Git object database. When you view "staged changes" what you see is actually a diff between the latest commit and the index.
Git-format-staged works by bypassing the working tree
With the right commands you can read and write directly to the index without
touching the working tree. For example you can emulate the process of staging
changes to a file by running the low-level steps yourself. First create an
object with the content that you want. Let's format new_file.js
with Prettier:
$ git cat-file -p 9a622ce1db369d03a7eaca94c4306b9b0f00429c \
    | prettier --stdin \
    | git hash-object -w --stdin
d562cee83a7d2a4108c9e37a4372e509d49e59ee
We pulled the staged version of new_file.js from the object database, fed the
content to Prettier via a pipe, and piped the formatted code to git hash-object,
which creates a new Git object. Because we pulled file content from the object
database we got the staged version of new_file.js, which does not include the
definition for bar.
Next update the index entry for new_file.js
to point to the formatted version:
$ git update-index --cacheinfo 100644,d562cee83a7d2a4108c9e37a4372e509d49e59ee,new_file.js
The --cacheinfo argument is of the form MODE_BITS,OBJECT_NAME,FILE_PATH. We
kept the same mode bits and file path from before, and supplied a new object
name / hash.
If you look at staged changes you will see that the staged version of
new_file.js
has now been prettily formatted.
$ git diff --cached
diff --git a/new_file.js b/new_file.js
new file mode 100644
index 0000000..d562cee
--- /dev/null
+++ b/new_file.js
@@ -0,0 +1 @@
+const foo = () => "foo";
When you run git-format-staged it runs the same steps, using the same commands.
Keeping the working tree in sync
The staged version of the file is now nicely formatted. But the working tree
file has not been changed. And the unstaged definition for bar
is still there:
$ cat new_file.js
const foo = ()=>'foo'
const bar = ()=>'bar'
Directly manipulating the Git object database and index means that we did not read or write the working tree file. This leaves us with a problem: when the staged version of the file is changed by automatic formatting we want the same changes to be made to the working tree file. Otherwise any discrepancies between the working tree file and the staged file will be presented as unstaged changes that did not exist before the pre-commit hook ran.
To get the working tree back in sync with the index git-format-staged gets a diff between the original staged file and the formatted staged file to compute a patch of changes introduced by formatting. Then it applies that patch to the working tree:
$ STAGED_OBJECT=9a622ce1db369d03a7eaca94c4306b9b0f00429c
$ FORMATTED_OBJECT=d562cee83a7d2a4108c9e37a4372e509d49e59ee
$ git diff $STAGED_OBJECT $FORMATTED_OBJECT | git apply -
The patch actually needs to be massaged a bit to fix up working tree paths
before it can be given to git apply, so the command above does not work
literally. But it gives an idea of what git-format-staged does.
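One way to picture the path fix-up is a filter that rewrites the patch header. This is a sketch under stated assumptions, not the tool's actual code: it assumes that a diff between two bare objects names its two sides a/OLD_OBJECT and b/NEW_OBJECT, the helper name retarget_patch is hypothetical, and the sample patch uses shortened object names purely for illustration.

```shell
# retarget_patch OLD_OBJECT NEW_OBJECT PATH
# Reads a patch on stdin and rewrites the file names in its header
# (everything up to the first hunk) so that they name PATH.
# Assumptions: the header names the two sides a/OLD_OBJECT and
# b/NEW_OBJECT, and PATH contains no "|", "&" or backslash characters,
# which are special in the sed expressions.
retarget_patch() {
  sed -e "1,/^@@/s|a/$1|a/$3|" -e "1,/^@@/s|b/$2|b/$3|"
}

# Illustrative input with shortened object names.
cat <<'EOF' | retarget_patch 9a622ce d562cee new_file.js
diff --git a/9a622ce b/d562cee
--- a/9a622ce
+++ b/d562cee
@@ -1 +1 @@
-const foo = ()=>'foo'
+const foo = () => "foo";
EOF
```

The rewritten patch names new_file.js on both sides, which is the form that could then be handed to git apply from the top of the working tree.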
In most cases merging formatting changes with unstaged changes works transparently from the user's perspective. Unstaged portions of the file did not get run through the formatter, and they end up as unformatted islands in an otherwise-formatted file. Sometimes there is a conflict applying the patch. In that case git-format-staged aborts the merge, leaving the working tree file entirely unformatted. This is the least-lossy outcome possible: changes that are committed are properly formatted; unstaged changes are preserved; and formatting changes to the working tree file can be recomputed by running the formatter again.