Getting started

Simulation studies are often used in forensic genetics to study the behaviour of probabilistic genotyping software. In many cases, researchers use simplified models that ignore complexities such as peak height variability and stutters. The goal of this package is to make it straightforward to use probabilistic genotyping models in simulation studies.

For a general introduction to forensic genetics and probabilistic genotyping, see (Buckleton, Bright, and Taylor 2018). See also (Gill et al. 2021) for a recent review of probabilistic genotyping systems.

Basic log normal example

Below, we load allele frequencies provided with the package and demonstrate how mixtures can be sampled using a log normal model for peak heights (refer to (Taylor, Bright, and Buckleton 2013) for details on the model). The template parameters for each contributor are picked uniformly between 50 and 10,000. The degradation parameter is picked from a gamma distribution with shape 2.5 and scale 0.001.

freqs <- read_allele_freqs(system.file("extdata","FBI_extended_Cauc_022024.csv",
                                       package = "simDNAmixtures"))
gf <- gf_configuration()

sampling_parameters <- list(min_template = 50., max_template = 10000.,
                            degradation_shape = 2.5, degradation_scale = 1e-3)

After the setup, a single function call is sufficient to generate a set of samples. We set n=2 to generate two samples. The contributors to the sample are named U1 and U2 which means that each mixed sample has two unrelated contributors.

set.seed(1)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
                            sampling_parameters = sampling_parameters,
                            model_settings = gf$log_normal_bwfw_settings,
                            sample_model = sample_log_normal_model)

The mixtures object contains the simulation results. The sample_mixtures function has an optional results_directory argument. If the path to a directory is provided, then all simulation results will be persisted on disk in file formats that are readily imported in probabilistic genotyping software. We inspect the parameter_summary property of the result of sample_mixtures.

knitr::kable(mixtures$parameter_summary[1:5])
SampleName contributors1 contributors2 model template1
sample_0001_N2_7753_6977_ca U1 U2 log_normal_model 7752.690
sample_0002_N2_5356_7818_bb U1 U2 log_normal_model 5356.177

The first sample is in approximately a 2:1 ratio. We print the first 10 peaks of the profile below.

knitr::kable(head(mixtures$samples[[1]]$mixture, 10))
Locus Allele Height Size
D3S1358 14 1481 117.33
D3S1358 15 22260 121.40
D3S1358 16 6660 125.48
vWA 13 174 164.80
vWA 14 4054 168.84
vWA 17 588 180.95
vWA 18 7710 184.99
vWA 19 4561 189.02
D16S539 8 200 239.58
D16S539 9 5827 243.61

Basic gamma example

We repeat the example above with a gamma distribution for peak heights instead of a log normal one. The mean peak height parameter mu is chosen uniformly between 50 and 5,000. The variance parameter cv parameter (σ in (Bleka, Storvik, and Gill 2016)) is chosen uniformly between 0.05 and 0.35. The degradation parameter is sampled from a Beta distribution with parameters 10 and 1, which means most profile will be only mildly degraded.

gamma_sampling_parameters <- list(min_mu = 50., max_mu = 5e3,
                            min_cv = 0.05, max_cv = 0.35,
                            degradation_shape1 = 10, degradation_shape2 = 1)

We now invoke the sample_mixtures function again to sample two mixtures; both consisting of two unrelated contributors. We do not sample stutters.

set.seed(2)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
                            sampling_parameters = gamma_sampling_parameters,
                            model_settings = gf$gamma_settings_no_stutter,
                            sample_model = sample_gamma_model)

We see that the first sample has μ around 1,236 and is in a 65:35 ratio.

knitr::kable(mixtures$parameter_summary[1:4]) 
SampleName contributors1 contributors2 model
sample_0001_N2_4297_48_52_bb U1 U2 gamma_model
sample_0002_N2_4513_34_66_aa U1 U2 gamma_model

The coefficient of variation for a full heterozygote is about 9% for the first sample.

knitr::kable(mixtures$parameter_summary[c(1,5:7)])
SampleName mixture_proportions1 mixture_proportions2 mu
sample_0001_N2_4297_48_52_bb 0.4822929 0.5177071 4297.329
sample_0002_N2_4513_34_66_aa 0.3395792 0.6604208 4513.154

We print the first 10 peaks of the profile below.

knitr::kable(head(mixtures$samples[[1]]$mixture, 10))
Locus Allele Height Size
D3S1358 15 3604 121.40
D3S1358 17 2812 129.56
D3S1358 18 1978 133.64
vWA 14 2359 168.84
vWA 16 4001 176.91
vWA 18 2151 184.99
D16S539 11 3291 251.67
D16S539 12 2749 255.70
CSF1PO 10 1303 298.34
CSF1PO 11 2973 302.30

References

Bleka, Øyvind, Geir Storvik, and Peter Gill. 2016. “EuroForMix: An Open Source Software Based on a Continuous Model to Evaluate STR DNA Profiles from a Mixture of Contributors with Artefacts.” Forensic Science International: Genetics 21: 35–44. https://doi.org/10.1016/j.fsigen.2015.11.008.
Buckleton, John S, Jo-Anne Bright, and Duncan Taylor. 2018. Forensic DNA Evidence Interpretation. CRC press.
Gill, Peter, Corina Benschop, John Buckleton, Øyvind Bleka, and Duncan Taylor. 2021. “A Review of Probabilistic Genotyping Systems: EuroForMix, DNAStatistX and STRmix™.” Genes 12 (10): 1559. https://doi.org/10.3390/genes12101559.
Taylor, Duncan, Jo-Anne Bright, and John Buckleton. 2013. “The Interpretation of Single Source and Mixed DNA Profiles.” Forensic Science International: Genetics 7 (5): 516–28. https://doi.org/10.1016/j.fsigen.2013.05.011.