Estimating the Dollar Cost of LLVM

tl;dr 1.2k developers have produced a 6.9M line code base with an estimated price tag of $530M.


“Compilers are expensive to build” is a recurring theme in my research. I hope this is an uncontroversial sentiment, but wouldn’t it be nice if we could attach a metric to this claim? For example, how much has LLVM development cost? I haven’t been able to find a credible estimate for this, but the good news is that the source code and full development history is freely available, so we can use that to make our own.

We’ll start by grabbing the source code for the LLVM umbrella project and its test suite:

$ git clone https://github.com/llvm/llvm-project.git llvm
$ git clone https://github.com/llvm/llvm-test-suite.git llvm/test-suite

We’ll use the most recent release at the time of writing, 8.0.1:

$ git -C llvm checkout llvmorg-8.0.1
$ git -C llvm/test-suite checkout llvmorg-8.0.1

To start with, let’s count commits:

$ (git -C llvm shortlog -s -n; git -C llvm/test-suite shortlog -s -n) \
    | q 'SELECT SUM(c1) FROM -'

Which informs us there have been 312,036 commits across the two projects. The same git commands can tell us that there have been 1,210 unique contributors:

$ (git -C llvm shortlog -s -n; git -C llvm/test-suite shortlog -s -n) \
    | awk '{$1=""}1' | sort -u | wc -l

How many lines of addition and deletion is that?

$ (git -C llvm log --format=tformat: --numstat; \
   git -C llvm/test-suite log --format=tformat: --numstat \
   ) | q -t 'SELECT SUM(c1), SUM(c2) FROM -'

LLVM 8.0.1 is the culmination of 81,033,801 line additions and 22,048,147 line deletions. That is nice, but it (a) doesn’t tell us how much code there is, and (b) doesn’t assign a dollar cost.

Let’s turn to David A. Wheeler’s sloccount to get both logical source line counts, and a development cost estimate based on a COCOMO model. We’ll leave the parameters of the model as default, but we’ll update the 19-year old salary estimate using a contemporary US average from glassdoor:

$ sloccount --personcost 103035 llvm

Which tells us that LLVM 8.0.1 comprises 6,887,489 lines of code, with an estimated development cost of $529,895,190. I suppose by most definitions that should qualify as expensive. At least, it is certainly more expensive than funding research to make compiler construction cheaper :-)

Do you have any references for the dollar costs of building a compiler? Please get in touch!


P.S: I used the same methodology to compute estimated costs for 10 popular compilers, which made its way into the introduction of my thesis:

AI researcher specializing in compilers and code optimization.