# GCalignR: How does the Algorithms work?

## Introduction

The intention of this vignette is to illustrate how the algorithms in GCalignR handles the data during each of the processing steps. Simultaneously we deploy some simple datasets that can be generated within R to visualise the outcome of the processing steps in an easy way. For further descriptions of the algorithm we refer to our manuscript (Ottensmann et al. 2018). Here, we give a detailed introduction into the concept behind the package and illustrate how it works by simulating simple datasets with arbitrary peaks. To enhance readability, not all code lines are show within this vignette, but they can be easily accessed by typing browseVignettes("GCalignR") and clicking on R code. The datasets are pseudo-randomly created and can therefore differ from run to run. For consistency, we supply the dataset that is used to demonstrate the alignment progress with the package. An accompanying vignette “GCalignR: Step by step” focuses on the workflow with this package and the integration into a broader analysis pipeline.

## What´s a Peak List and how to create one ?

GCalignR performs all steps on a so called peak list. Such a list can be generated for every chromatogram, defined as the output of a Chromatograph (e.g. Gas Chromatography Flame Ionization Detector, GC-FID) that plots the measured electric current over the time course of a separation run. Figure 1 shows a chromatogram containing five peaks that represent five chemicals detected in this simulated sample. Here, all the distinct peaks approximate perfect Gaussian distributions and are clearly separated. Essentially every vendor of a Chromatograph offers software that allows to detect peaks (i.e. substances) in the generated signal based on procedures that take into account the shape of a peak, the noise level of the signal and many things more that are out of the scope of GCalignR and we refer to specialised resources (“Qualitative and Quantitative Analysis by Gas Chromatography” 2004). Nevertheless, the quality of the chromatograms as well as a sophisticated way to detect and quantify peaks is a crucial step before one should start to think about aligning peaks for downstream analysis!

In this example it would be adequate to define each peak by two values, the retention time and the peak height as a measure of the concentration, simply by calculating the local maxima. Figure 2 shows the peaks annotated by the intensity. The dashed vertical lines indicate the retention time of the peaks maximum that is written on top of each of the five peaks.

Now these and further information (e.g. the peak area) can be summarised in the form of a peak list that contains information for every detected peak. One row of such a peak list refers to a single peak. In general, peaks are ordered with increasing retention time, thereby starting with the most volatile substances.

time height
Peak 1 5.01 1.17
Peak 2 10.02 1.05
Peak 3 13.10 1.21
Peak 4 20.22 1.81
Peak 5 24.57 1.38

Figure 3 shows chromatograms of four samples “A1” to “A4” that were analysed on the same hypothetical GC-FID run. These peaks can be individually characterised by their retention times (see labels on each peak). In this small set of samples, one can easily see that several peaks appear in consistent temporal sequence with increasing retention times in sample order approximating intervals of 0.7 minutes. Here, it would be possible to account for this variation manually, but consider a scenario where there are many more samples and peaks, perhaps in noisier chromatograms. For this reason we developed GCalignR and implemented simple algorithms that are explained below. Before these chromatograms can be analysed, we need to obtain peak retention times and peak heights again. Additionally we need to format it for using GCalignR to align the peaks. GCalignR is distributed with sample files that can be used as templates.

The standard input format is a tab-delimited text file that contains all the required information.(1) The first row lists the sample identifiers of all the samples included for aligning, (2) column names are written in the second row and specify the content of individual peak lists that are (3) incorporated for each individual below. Peak lists have to be concatenated column-wise in the order specified in the first row. The input file belonging to the chromatograms shown in Figure 3 is depicted below. Several sample files are distributed with GCalignR. Type system.file("extdata", package = "GCalignR") to obtain the path on your computer. Any row of a given sample that only contains zeros or NAs is treated as empty an will be deleted prior to running the algorithms.

>
> rt    height

## Aligning peak lists

Over the course of the analytic pipeline, retention times of the same substance can vary for a number of reasons that include column ageing, perturbations of the carrier gas flow or temperature fluctuations, all of which can be avoided with varying success. GCalignR comes into place, when a question regarding the similarity of a number of samples is addressed by analysing their chemical composition. For this purpose it is crucial to cluster peaks that belong to the putatively homologous substance across samples. All alignment steps take only the retention time of a peak into account and are embedded within a single function align_chromatograms that conducts the alignment sequentially. Any other variable included in the dataset remains always associated with the retention time of the peak and is not treated any further. The single steps undertaken by the function to align the data can be traced back from the output that is returned after execution of align_chromatograms.

### Full alignment of peaks lists

The first step in the alignment acts on linear drifts across samples. Therefore, a reference sample is required that will be selected automatically, by picking the sample with the highest average similarity to all others samples based on the original retention times. Otherwise, a reference can be selected manually. Linear shifts between each sample and the reference are evaluated in a pairwise comparison similar to a cross-correlation based on the peak retention time alone. Thereby, the sample is slided in a user-defined window in discrete time steps. Figure 4 illustrates this for two samples (Sample “B” & “C”), where the linear shift was evaluated by in a window ranging from -2 to +2 minutes. Here, the “true” linear shift between reference (Sample A) and each sample is approximated by searching for the minimum total deviation in retention times between all reference peaks and the closest peak in a sample. When more than one shift size (including 0!) yield to the same similarity score, the smallest value is taken to avoid overshooting. In this example, Sample B is shifted by - 2 and Sample C by + 1, whereas all other considered steps would result in greater deviations in retention times to the reference peaks. However, it is important to note that putatively three peaks are shared between the reference Sample A and Sample B and Sample C respectively, but due to non-linear perturbations not all peaks align well after the full alignment took place.

### Partial alignment of peaks

As shown in Figure 4, non-linear errors in retention times need to be accounted for in order to cluster similar retention times that belong to putatively homologous substances. This step of the algorithm takes the linearly adjusted peak retention times as input, but does not manipulate them any further. All computation is from now onwards based on the concept of a retention matrix, in which columns represent samples and rows are the operational unit of a substance. Because of the more or less subtle variations in the retention times, the core alignment is focusing one substance (i.e. row of the matrix) at a time and works itself through all the substances it will get to see, starting with the first row in the matrix. Figure 5 shows the matrix operations for an already linearly corrected example. All the operations are based on the comparison of a focal retention time with the mean retention time of all previous samples within the same row, always starting with the first column. If a focal retention time deviates significantly from the mean a matrix manipulation is conducted to solve the conflict. Whenever the focal peak has a larger retention time, a gap is introduced by inserting a zero and moving the peak to the next row, whereas a gap is introduced in all previous samples is introduced if the focal peak has a smaller retention time value. Here, a user-defined threshold, set by the parameter max_diff_peak2mean = 0.5 in this example, defines when manipulations will be conducted. In order to follow the steps more easily, the order of samples is kept constant in this illustration, whereas the implemented algorithm uses random starts for each row to prevent any systematic errors that remained until now, from inflating results.

## Merging rows

Looking at final retention matrix (bottom left) in Figure 5, we immediately see that rows 1 and 2 have very similar mean retention times. Here, the difference is even smaller than specified by max_diff_peak2mean and the separation was triggered accidentally by comparing the two extreme values prior to the first manipulation (upper left matrix). For this and similar situations, a third step in the alignment is available that tries to merge rows that differ by (i) less than min_diff_peak2peak in mean retention times and (ii) show a clear pattern, by which not a single sample shows peaks in all of the focal rows. Setting min_diff_peak2peak to a value of two-times max_diff_peak2mean will solve this conflict by merging both of the rows (Figure 6).

## Aligning the a hypothetical dataset

Now we are going to apply the alignment to a dataset that is highly similar to the example depicted in Figure 3. Here, all peaks are shown with the same shape, which is appropriate as we are only interested in the retention times. Based on the inspection of the graph shown in Figure 7, we pick “A2” as a reference and take into account that peaks of “A3” and “A4” are seemingly postponed by approx. 0.7 and 1.4 minutes, whereas “A1” shows peaks 0.7 minutes earlier. Therefore, a good estimate for the required window size to correct for linear shifts is given by max_linear_shift = 1.6 yielding to a window of 1.6 Minutes around the retention times of the reference sample, including a safety margin. We can check if the search window is of appropriate size after executing the algorithm. For aligning individual peaks were stick to the default values max_diff_peak2mean = 0.02 and min_diff_peak2peak = 0.08 for now.

## path to the data
path <- system.file("extdata", "simulated_peak_data.txt", package = "GCalignR")
## draw chromatograms
x <- draw_chromatogram(data = path, rt_col_name = "rt", show_rt = T, show_num = F, plot = F)
x[["ggplot"]] + geom_line(size = 1.2) + theme(axis.ticks.x = element_blank()) + ggplot2::scale_color_brewer(palette = "Dark2")
path <- system.file("extdata", "simulated_peak_data.txt", package = "GCalignR")
aligned <- align_chromatograms(data = path,
rt_col_name = "rt",
max_linear_shift = 1.6,
max_diff_peak2mean = 0.5,
min_diff_peak2peak = 1,
reference = "A2")

GCalignR creates a Logfile while processing a dataset that allows to trace back the Linear shifts that have been applied to the samples. We can see that the linear shifts of “A3” and “A4” are of the size that we expected, whereas the drift in “A1” was putatively not fully compensated with 0.5 instead of 0.7. The maximum value is -1.4 and therefore the setting max_linear_shift = 1.6 was enough.

print(aligned[["Logfile"]][["LinearShift"]])
>   shift sample
> 1   0.7     A1
> 2   0.0     A2
> 3  -0.7     A3
> 4  -1.4     A4

For a bigger dataset it is more convenient to invoke a histogram that shows the distribution of applied shifts (Figure 8). Here, the horizontal axis shows the range that was considered. As a rule of thumb, a skew towards one or the other end of the axis would indicate a potential underestimation of the drift amplitude.

plot(aligned, which_plot = "shifts")

Using draw_chromatograms again, we can inspect how the linear corrections have changed the peak list.

x <- draw_chromatogram(data = aligned, rt_col_name = "rt", step = "shifted", show_rt = F, show_num = F, plot = F)
x[["ggplot"]] + ggplot2::scale_color_brewer(palette = "Dark2")

We immediately see that the peaks were shifted accordingly and start to cluster as expected (Figure 9). However, there is variation in retention times among the samples within the visually separated clusters. Therefore, we utilise another algorithm that evaluates the observed variance and decides which peaks belong to the same substance. These steps were already executed with the call to align_chromatograms and we can inspect the results by simply defining step = "fully_aligned". This time we also set show_num = T in order to print the number of samples behind each peak. This is helpful, because peaks of the same substance will overlap indicating that the retention time is exactly matched (Figure 10).

x <- draw_chromatogram(data = aligned, rt_col_name = "rt", step = "aligned", show_num = T, plot = F)
x[["ggplot"]] + ggplot2::scale_color_brewer(palette = "Dark2")

We can test that not only “A4” contributes to the peaks by moving each sample to its own panel on the plot (Figure 11) by making use of the facet_wrap function in ggplot2. The data frame that is used to create the plot is accessible in the list that is returned by a call to draw_chromatogram.

## for using ggplot2::facet_wrap we need to get rid of the annotations
x <- draw_chromatogram(data = aligned, rt_col_name = "rt", step = "aligned", show_num = F, plot = F)
> Computing chromatograms ...
x[["ggplot"]] + ggplot2::facet_wrap(~sample, ncol = 1) + ggplot2::scale_color_brewer(palette = "Dark2")

## Remarks

Running align_chromatograms with default settings will be a good starting point for aligning a dataset and we were able to show that parameter settings are generally robust (Ottensmann et al. 2018). However, every dataset has unique features that will require to change one or more of the two parameters max_diff_peak2mean and min_diff_peak2peak. For the example above, we created a dataset by picking peaks pseudo-randomly and adding arbitrary perturbations. Especially the amplitude of linear drift in the range of minutes is not expected in real life applications of chromatography and was used to illustrate the principles. When one works with experimental data we suggest to use the original chromatograms in combination with the draw_chromatogram tool to explore the data carefully, for example by looking at subsets of samples and different time scales. All of the optional parameters that enable options of filtering and preprocessing need to be applied with caution. For example, excluding peaks that are unique for a single sample is adequate for similarity analyses but not helpful for characterising the composition of a sample and is also possible afterwards. In another vignette GCalignR step by step we focus on a empirical dataset and illustrate how GCalignR can be used within a broader workflow.

## References

Ottensmann, Meinolf, Martin A Stoffel, Hazel J Nichols, and Joseph I Hoffman. 2018. “GCalignR: An R Package for Aligning Gas-Chromatography Data for Ecological and Evolutionary Studies.” PloS One 13 (6): e0198311.

“Qualitative and Quantitative Analysis by Gas Chromatography.” 2004. In Modern Practice of Gas Chromatography, 403–60. John Wiley & Sons, Inc. https://doi.org/10.1002/0471651141.ch8.