add advice on how to deal with test failures in the Go project

2025-05-05 15:43:04 +00:00 · 2022-05-27 11:08:36 -04:00 · 2022-05-27 11:08:36 -04:00 · 16d53bf158
commit 16d53bf158
parent d3a7656d24
1 changed files with 190 additions and 0 deletions
--- a/TestFailures.md
+++ b/TestFailures.md
@ -0,0 +1,190 @@
+# TestFailures
+
+If you notice a failure in a test in the Go project, what should you do?
+
+## Goals of testing
+
+The goal of writing (and running) tests for a Go package is to learn about the
+behavior of that package and its dependencies.
+
+A test failure for a Go package may give us information about:
+* implementation defects in the package or its dependencies,
+* mistaken assumptions about the package API,
+* bugs in the test itself (such as invalid assumptions about timing),
+* unexpectedly high resource requirements (such as RAM or CPU time), or
+* bugs in the underlying platform or defects in the test infrastructure (which
+  may need to be escalated or worked around).
+
+In some cases, the cause of a test failure might not be clear: it might be
+caused by _more than one_ of the above conditions. Much like repeating a
+scientific experiment, allowing a test to fail multiple times can sometimes
+provide more information from the specific pattern of failures.
+
+However, if a test fails without telling us anything new, then the test is not
+fulfilling its purpose.
+
+## Finding test failures
+
+A test failure is typically noticed from:
+* the [Go build dashboard](https://build.golang.org), especially during [builder
+  triage](https://go.dev/issue/52653);
+* TryBot or [SlowBot](https://go.dev/wiki/SlowBots) failures on a pending
+  change;
+* running `go test` on a specific package or packages, either working within a
+  Go project repository or as part of (say) `go test all` in a user's own
+  module;
+* or running `all.bash` or `all.bat` when [installing from
+  source](https://go.dev/doc/install/source#install) or [testing a contributed
+  change](https://go.dev/doc/contribute#testing).
+
+## Triaging a test failure
+
+Once we have noticed a failure, we need to _triage_ it.
+The goal of triage is to identify:
+
+1. Is the information from the failure new?
+2. Who is best equipped to analyze the new information from the failure?
+
+### Identifying new information
+
+Start by searching the [open issues](https://go.dev/issue) for key details from
+the failure, such as the name of the failing test and/or other distinctive
+fragments of the error text (such as error codes).
+
+If you find an existing issue, first check the issue discussion to see whether
+the failure has already been fixed, and whether the information from your failure
+contributes relevant new information. If so, _comment on it_ with details:
+* describe what Go version you were testing (`go version`)
+* how and where you were running the test, such as:
+   * `go env` output
+   * your machine and OS configuration
+   * your network configuration and status
+* whether or how often you are able to reproduce the failure.
+
+Ideally, attach or link to the full test logs (possibly in a `<details>` block).
+
+If you don't find an existing issue, file a new one.
+
+### Filing an issue
+
+Paste in enough of the test log for future reporters to be able to search for
+the issue, including the name of the test and any distinctive parts of its log
+output. (If the test log is long — such as if it contains a large goroutine
+dump — consider posting a shorter excerpt and/or enclosing the complete failure
+message in a `<details>` block.)
+
+You can use the
+[`fetchlogs`](https://pkg.go.dev/golang.org/x/build/cmd/fetchlogs) and
+[`greplogs`](https://pkg.go.dev/golang.org/x/build/cmd/greplogs) tools to
+search for similar failures in the build dashboard:
+
+```
+# download recent logs
+fetchlogs -n 1024 -repo all
+```
+```
+# search logs for some regexp describing the failure message
+greplogs -l -e $FAILURE_REGEXP
+```
+
+If the failure appears to be specific to a package, consult
+https://dev.golang.org/owners to find the maintainers of that package and mention
+them on the issue. (If no owners are listed for the package, check the recent
+history of the package at https://cs.opensource.google/go and/or escalate to
+someone who can help to identify a relevant owner — and consider updating the
+[owners
+table](https://cs.opensource.google/go/x/build/+/master:devapp/owners/table.go)
+with what you learn!)
+
+If the failure appears to be specific to a `GOOS` or `GOARCH` label the issue
+with the corresponding GOOS and/or GOARCH labels, and mention the relevant
+subteam(s) of
+[@golang/port-maintainers](https://github.com/orgs/golang/teams/port-maintainers)
+on the issue.
+
+If the failure appears to affect at least one
+[first class port](https://go.dev/wiki/PortingPolicy#first-class-ports),
+add the issue to the current release milestone and label it
+[`release-blocker`](https://github.com/golang/go/labels/release-blocker).
+Otherwise, add the issue to the `Backlog` milestone.
+
+If the failure appears to be specific to a builder (such as a network
+connectivity issue, or a platform bug requiring a system update), consult
+[`x/build/dashboard/builders.go`](https://cs.opensource.google/go/x/build/+/master:dashboard/builders.go)
+to find the maintainer for that builder and mention them on the issue.
+(For builders without an explicit maintainer listed, instead mention
+the [@golang/release](https://github.com/orgs/golang/teams/release) team.)
+
+## Addressing a test failure
+
+Once an issue has been filed for a test failure, the relevant package, port,
+and/or builder maintainers should examine the information gleaned from the test
+failure and decide how to _address_ it, by doing one or more of:
+
+* _revert_ a change to the code or the test infrastructure believed to have
+  introduced the problem,
+* _fix_ (or apply a workaround for) the root cause of the failure,
+* _collect_ more information, by subscribing to issue updates and/or running
+  more tests,
+* _report_ an underlying defect in a dependency, platform, or test
+  infrastructure, and/or
+* _deprioritize_ the failure, by skipping the failure on affected platforms (or
+  marking those platforms as broken) and then moving the issue to a future or
+  `Backlog` milestone and/or removing the `release-blocker` label.
+
+When a maintainer decides to deprioritize a test failure, they determine that
+_additional failures of the test will not provide useful new information_. At
+that point, the test no longer fulfills its purpose, and the maintainer should
+suppress the failure — typically by adding a call to
+[`testenv.SkipFlaky`](https://pkg.go.dev/internal/testenv#SkipFlaky) or
+[`t.Skipf`](https://pkg.go.dev/testing#F.Skipf).
+
+### Skipping a test failure
+
+When we add a call to `testenv.SkipFlaky`, our goal is to eliminate failure
+modes that do not provide new information while still preserving as much of the
+value of the test as is feasible.
+
+* If the observed failure is only one of several possible failure modes for
+  the test, skip the test only for that failure mode.
+
+   * For example, if the error is always something specific like
+     `syscall.ECONNRESET`, use `errors.Is` to check for that specific error.
+
+* If the failure is believed to affect all versions of a particular `GOOS`
+  and/or `GOARCH`, or the affected versions cannot be identified, check against
+  `runtime.GOOS` and/or `runtime.GOARCH` and skip only the affected platform.
+
+* If the failure is due to a bug on a _specific version_ of a platform, skip the
+  test based on `testenv.Builder` or the `GO_BUILDER_NAME` environment variable.
+  (If the test fails for external Go users, they have the option to upgrade to
+  an unaffected version of the platform — and they probably ought to see the
+  test failure to find out that the bug exists!)
+
+   * Also consider adding another environment variable that users and contributors
+     can set to acknowledge the bug and suppress the failure.
+
+### Marking a builder or port as broken
+
+A platform bug or bug in a core package (such as `os`, `net`, or `runtime`) may
+impact so many tests that the failures are not feasible to skip, or may manifest
+as a failure at compile or link time. If such a bug occurs, the options are more
+limited: if we cannot revert a change or fix or work around the root cause, and
+don't need to collect more information, we can only deprioritize the failure by
+marking the _entire builder or platform_ as broken.
+
+To mark a builder as broken, edit its configuration in
+[`x/build/dashboard/builders.go`](https://cs.opensource.google/go/x/build/+/master:dashboard/builders.go)
+to add an issue in the `KnownIssue` field; note that builders with known issues
+will generally be skipped during dashboard triage.
+
+A broken builder for a
+[first class port](https://go.dev/wiki/PortingPolicy#first-class-ports)
+should have its known issue(s) labeled `release-blocker`, pending a decision
+to either fix the builder or drop support for the affected version of the
+platform.
+
+If all of the builders for a secondary port are broken, the port itself may be
+considered broken. [Discussion
+#53060](https://github.com/golang/go/discussions/53060)
+aims to resolve the question of how broken secondary ports should be handled.