Quality control of extracted spectra

The number of spectra in the pre-release of May 2009 (1235) was small enough that each spectrum could be inspected visually. In this visual quality control process, ~40% of the spectra were discarded, primarily because of heavy contamination from nearby sources, but also because of deviant pixels that were either saturated or lay near the edges of combined chips. When selecting the spectra to be published, we adopted the general principle that, even when contaminated to some degree, many spectra can still provide useful information on the nature and redshift of the source. In many cases where the extraction region is not optimal, a simple re-extraction of the 1D spectrum from the 2D slitless cutout can produce a better result. For this reason we include those spectra in the release and provide a script for a customized extraction.
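
As an illustration of what such a customized extraction involves, the sketch below sums a 2D slitless cutout over a narrow range of rows around the spectral trace. It is not the release script; the file name, extension names and aperture parameters are assumptions for illustration only.

    # Minimal sketch of a customized 1D re-extraction from a 2D slitless cutout.
    # File name, extension names and aperture values are assumptions.
    import numpy as np
    from astropy.io import fits

    def extract_1d(cutout_file, row_center, half_width):
        """Sum the cutout over a narrow range of rows around the trace."""
        with fits.open(cutout_file) as hdul:
            sci = hdul["SCI"].data   # assumed science extension
            err = hdul["ERR"].data   # assumed error extension
        lo, hi = row_center - half_width, row_center + half_width + 1
        flux = sci[lo:hi, :].sum(axis=0)                    # simple box extraction
        error = np.sqrt((err[lo:hi, :] ** 2).sum(axis=0))   # errors added in quadrature
        return flux, error

    # Example: re-extract with a narrower aperture to reduce contamination
    flux, error = extract_1d("cutout_12345.fits", row_center=32, half_width=3)
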
Since visual inspection of the full sample (73581) of this release was no longer an option, we have explored ways of automatic classification using machine-learning techniques. The idea is to first train an algorithm with a relatively small number of hand-classified spectra using measured values such as the estimated contamination fraction, the signal-to-noise ratio, the magnitude, the position of the maximum of light on the 2D spectrum or the exposure time. In a second step then, ideally the remaining spectra can be classified automatically with the algorithm returning the classification of the spectrum at hand by finding the classification of the most similar spectrum in the training set.
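
The following sketch illustrates this idea in its simplest form, assigning to a new spectrum the label of the most similar hand-classified spectrum. The feature names and the nearest-neighbour rule are illustrative assumptions; the classification actually used for the release was carried out with Weka, as described below.

    # Basic idea: assign a new spectrum the label of the most similar
    # hand-classified spectrum, using measured diagnostics as features.
    # Feature names and the 1-nearest-neighbour rule are illustrative assumptions.
    import numpy as np

    FEATURES = ["contam_fraction", "snr", "magnitude", "peak_position", "exptime"]

    def classify_by_similarity(train_features, train_labels, new_features):
        """Return the label of the nearest training spectrum
        (Euclidean distance on standardized features)."""
        mean, std = train_features.mean(axis=0), train_features.std(axis=0)
        t = (train_features - mean) / std
        n = (new_features - mean) / std
        dist = np.linalg.norm(t - n, axis=1)
        return train_labels[np.argmin(dist)]
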

In practice, a surprisingly large fraction of the spectra (roughly 25%) turn out to be borderline cases between those that are unambiguously publishable and those that are not. Instead of a final, fully automatic classification, we therefore use the machine to separate the spectra into two groups, "good" and "bad", and then visually inspect only the group classified as "good", discarding catastrophic failures. Such failures are typically obvious and also visible in the measured parameters, but they occur too rarely in the training set for the algorithm to build up expertise; a typical example is a spectrum with gaps in its values caused by saturation or cosmic-ray hits. The "bad" spectra, mostly spectra heavily contaminated by the overlapping dispersed light of nearby sources aligned along the dispersion direction, are inspected only cursorily or not at all.

To illustrate the difficulty of the classification problem, we show some examples of spectra classified as "bad" by the algorithm but considered "good" or "acceptable" on visual inspection, as well as spectra that were wrongly classified as "good" by the automated process.

The classification was carried out with the Weka software package (Hall et al. 2009). The machine-learning algorithms are trained on 2/3 of the consolidated training set of 2020 visually inspected spectra (out of a total of 73581 spectra), and the quality of the classification is measured by applying the trained algorithm to the remaining, independent 1/3 of the inspected set. Several algorithms showed similar performance on this task. The algorithm finally chosen was ClassificationViaRegression using an M5P tree, which achieved a very high total classification rate (90%) and a very low rate of false negatives ("good" spectra classified as "bad"). This classification quality is roughly what a single scientist can achieve by visual inspection.
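
The sketch below illustrates the classification-via-regression scheme and the 2/3-1/3 evaluation split under stated assumptions: the measured diagnostics are collected in an array X and the visual labels in an array y, and scikit-learn's DecisionTreeRegressor is swapped in as a stand-in for Weka's M5P model tree used in the actual release.

    # Sketch of "classification via regression" for the good/bad separation.
    # Assumptions: X is an (n_spectra, n_features) array of measured diagnostics,
    # y a NumPy array of visual labels ("good"/"bad"); DecisionTreeRegressor
    # stands in for Weka's M5P model tree.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    def fit_cvr(X_train, y_train, classes=("good", "bad")):
        """Fit one regressor per class on a 0/1 membership indicator."""
        models = {}
        for c in classes:
            reg = DecisionTreeRegressor(min_samples_leaf=5)
            reg.fit(X_train, (y_train == c).astype(float))
            models[c] = reg
        return models

    def predict_cvr(models, X):
        """Predict the class whose regressor returns the largest value."""
        classes = list(models)
        scores = np.column_stack([models[c].predict(X) for c in classes])
        return np.array([classes[i] for i in scores.argmax(axis=1)])

    # Train on 2/3 of the hand-classified spectra, evaluate on the remaining 1/3
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3)
    models = fit_cvr(X_train, y_train)
    pred = predict_cvr(models, X_test)
    total_rate = (pred == y_test).mean()                             # overall accuracy
    false_negatives = ((y_test == "good") & (pred == "bad")).mean()  # "good" flagged as "bad"
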

Maintained by Martin Kümmel <hla@eso.org>