Build Optimizations

Problem Description

Engineers voiced concerns that smake builds were taking too long. After running some tests comparing local and remote builds with smake, Hong Bin wrote:

Since two weeks ago when the NP3400 subsystem (host side s/w) migrated to the neopon vob and smake environment, we've got the feeling that the build takes away a lot of our precious time. While Andrew is monitoring and improving the smake build environment, Roy and myself have done some simple measurement on the s/make on our local machine and clear-case server.

The result shows that using local machine to build is much quicker. If your subsystem is relatively big, we suggest you setting up your local view and local tool chain to facilitate the total local build.

Taking our NP3400 subsystem + BSP for ONU as an example, it took 2m49.082s to complete a clean build on local view + local machine while it dragged to 10m4.902s to complete the same build on remote view + remote machine. This is a big save of our time.

This concerned me. Using a remote view and smake to build on sons-clearcase should mean that everything is done locally on sons-clearcase. Why, then, did Hong Bin's tests show a build on a local desktop running faster than what is essentially a local build on sons-clearcase, a more powerful machine? Monitoring sons-clearcase did not show any telltale signs of saturation, so what was the problem?

Test Environment

In order to determine the problem I decided to first create a test environment and do some measurements. Then I would analyze the problem, implement some optimizations and measure the results. If things went well I should see markedly better results after optimization.

The first step was to create a local environment with local views, then build both locally and remotely while timing the results. To do this I created a local view and copied the tools to the local system. I then wrote a script to perform multiple builds and capture the timing statistics. The script takes two parameters: the first is "local" or "remote", signifying where to perform the build, and the second is the number of build iterations to perform. Depending on the first parameter, the script uses either my local view with local tools or my remote view on sons-clearcase. For each iteration a make clean is done first, followed by a full make of ONU, which is timed. The local machine was my desktop, a 1 GHz P6 with 512 MB of memory running Windows XP and Cygwin 1.3.6. The remote machine was sons-clearcase, a dual 1 GHz P6 with 1 GB of memory running Windows 2000 and Cygwin 1.3.6.
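The original timing script is not reproduced here. As a rough sketch of the procedure just described, a driver along the following lines would work. The function name time_builds, the view paths, the LOCAL_MAKE, REMOTE_MAKE and BUILD_ARGS variables, and the idea that "smake clean" performs the clean step are all assumptions made for illustration.

```shell
# Sketch of a build-timing driver. The paths are hypothetical placeholders,
# not the views used in the original tests.
LOCAL_VIEW=${LOCAL_VIEW:-/path/to/local/view}
REMOTE_VIEW=${REMOTE_VIEW:-/view/path/to/remote/view}

# time_builds MODE COUNT
#   MODE  - "local" (make in the local view) or "remote" (smake in the remote view)
#   COUNT - number of clean-then-full builds to time
# Prints one line per iteration: "<mode> <iteration> <elapsed seconds>".
time_builds() {
    local mode="$1" count="$2" dir builder i start end
    case $mode in
        local)  dir=$LOCAL_VIEW;  builder=${LOCAL_MAKE:-make}  ;;
        remote) dir=$REMOTE_VIEW; builder=${REMOTE_MAKE:-smake} ;;
        *) echo "usage: time_builds local|remote count" >&2; return 2 ;;
    esac
    [ -d "$dir" ] || { echo "no such view directory: $dir" >&2; return 1; }
    i=1
    while [ "$i" -le "$count" ]; do
        # Untimed clean, run in a subshell so the caller's directory is unchanged.
        (cd "$dir" && $builder clean) > /dev/null 2>&1
        start=$(date +%s)
        # Timed full build; BUILD_ARGS (assumed) would carry the ONU target, if any.
        (cd "$dir" && $builder $BUILD_ARGS) > /dev/null 2>&1
        end=$(date +%s)
        echo "$mode $i $(($end - $start))"
        i=$((i + 1))
    done
}
```

Calling "time_builds remote 40" would then produce 40 timing lines that can be pasted into a spreadsheet. Whole-second resolution is coarse, but it is adequate for builds that take minutes.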

Test Results Verifying the Problem

After performing 40 builds both locally and remotely, I gathered the test results and obtained the following graph:

Local vs. Remote Build

As can be seen, local builds were faster than remote builds by well over a minute and a half (approximately 100 seconds). There are spikes in both the local and remote builds; however, the running average still indicates that local builds are consistently faster. This does not make sense: smake builds in a remote view, which is essentially local on sons-clearcase, with tools that are also essentially local on sons-clearcase, and sons-clearcase is a more powerful machine with 2 CPUs and more memory than my desktop. Why were local builds running faster than essentially local smakes on sons-clearcase? Performance monitoring of sons-clearcase did not show the server to be anywhere close to saturated.
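For readers who want to reproduce the running average from raw timings, here is a minimal sketch. It assumes the running average is a cumulative mean over all builds so far; the graph may instead have used a moving window. The function name running_average is hypothetical.

```shell
# running_average: read one build time in seconds per line on stdin and
# print the cumulative mean after each build, to one decimal place.
running_average() {
    awk '{ sum += $1; printf "%.1f\n", sum / NR }'
}
```

Feeding it a file with one elapsed time per line yields a column that smooths out the individual spikes.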

Short Circuit Assumptions

Next I got to thinking that there must be some network reads or writes slowing down the build process without showing up as a CPU, memory or disk drain on the server. Was there anything else in the smake scenario that might cause things to happen over the network instead of locally?

As you know, we build using tools mounted on the T drive. All systems need to have the T drive mapped to //sons-clearcase/Tools. This includes sons-clearcase, for consistency's sake. Also, a remote view essentially means that the view storage is contained on the share //sons-clearcase/Views, which is mounted as /view under Cygwin. I had assumed that Windows would be smart enough to recognize that the T drive and the /view Cygwin mount were in fact local when on sons-clearcase, and therefore would not perform any network reads or writes on these file systems for a process running on sons-clearcase. It turns out that this is not the case: Windows does not short circuit network connections that resolve to the current host, so smake was indeed still using network reads and writes.

Optimizations Implemented

I implemented two optimizations to the smake system. First, I changed smake to set the TOOLS_ROOT environment variable to E:/Tools if and only if the build is running on sons-clearcase, because that is the local path where the tools reside on sons-clearcase. Smake silently inserts "export TOOLS_ROOT=E:/Tools" before executing make on sons-clearcase, and adds the -e option to make so that make will pick up this environment variable (-e gives environment variables precedence over assignments in the makefiles). Normally one would worry about running make -e, because one would not want the user's environment influencing the build; however, since we go through rsh to sons-clearcase and rsh does not pass along the user's environment, we are safe. The new rsh command line in smake is:

$ rsh $build_server "cd $(pwd -P) && export TOOLS_ROOT=E:/Tools && nice make -e $@"
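The quoting matters here: the whole "cd ... && export ... && nice make" chain has to reach the remote shell as a single string, otherwise the local shell would interpret the && operators itself. A small sketch of how that string could be assembled and inspected before it is handed to rsh follows. The helper name remote_build_command is hypothetical and is not part of smake.

```shell
# remote_build_command DIR [MAKE ARGS...]
# Prints the single command string intended for the remote shell.
# TOOLS_ROOT=E:/Tools is the local tools path on sons-clearcase, as
# described above; make -e lets that environment value take precedence.
remote_build_command() {
    local dir="$1"
    shift
    echo "cd $dir && export TOOLS_ROOT=E:/Tools && nice make -e $*"
}

# Hypothetical use inside smake:
#   rsh "$build_server" "$(remote_build_command "$(pwd -P)" "$@")"
```

Printing the string first makes it easy to confirm that TOOLS_ROOT is exported on the remote side, before make runs, rather than on the machine where smake was invoked.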

The second optimization was to mount the Views share using a local path. In the past all machines, sons-clearcase included, had //sons-clearcase/Views Cygwin mounted to /view by the cygwin_setup script. Doing this on sons-clearcase, however, means that all access to /view there goes through the network, which only slows things down. I changed the Cygwin mount of /view to C:/ClearCaseStorage/Views for sons-clearcase and sons-clearcase alone. I also modified the cygwin_setup script to detect when it is running on sons-clearcase so that it always establishes a non-networked mount there.
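The actual change to cygwin_setup is not shown here. One way the host check could look is sketched below, with a hypothetical helper named view_mount_source. The mount options in the trailing comment are an assumption and should be checked against the installed Cygwin version.

```shell
# view_mount_source HOST
# Prints the path that /view should be mounted from when running on HOST:
# the local storage directory on sons-clearcase, the network share elsewhere.
view_mount_source() {
    local host
    # Windows host names are case-insensitive, so normalize before comparing.
    host=$(echo "$1" | tr '[:upper:]' '[:lower:]')
    case $host in
        sons-clearcase) echo "C:/ClearCaseStorage/Views" ;;
        *)              echo "//sons-clearcase/Views" ;;
    esac
}

# Hypothetical use in cygwin_setup (mount options are an assumption):
#   mount -f "$(view_mount_source "$(hostname)")" /view
```

Keeping the decision in one small function means every other machine continues to get the share path unchanged, and only sons-clearcase takes the local path.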

Test Results Verifying the Optimization

Again I ran my tests by performing 40 remote builds and graphing the results:

Local vs Remote Builds (optimized)

Here the local build lines are the same as before, since I did not rerun the local builds, but remote build speeds improved dramatically. Remote builds also stayed fairly consistently at about 200 seconds (3 minutes and 20 seconds) for a build of ONU.

Comparing the average build time for the three methods tested shows us:

Local vs. Remote build (optimized - average)

You can also view the Excel spreadsheet containing the test result data.