Getting started

Simulation studies are often used in forensic genetics to study the behaviour of probabilistic genotyping software. In many cases, researchers use simplified models that ignore complexities such as peak height variability and stutters. The goal of this package is to make it straightforward to use probabilistic genotyping models in simulation studies.

For a general introduction to forensic genetics and probabilistic genotyping, see (Buckleton, Bright, and Taylor 2018). See also (Gill et al. 2021) for a recent review of probabilistic genotyping systems.

Basic log normal example

Below, we load allele frequencies provided with the package and demonstrate how mixtures can be sampled using a log normal model for peak heights (refer to (Taylor, Bright, and Buckleton 2013) for details on the model). The template parameters for each contributor are picked uniformly between 50 and 10,000. The degradation parameter is picked from a gamma distribution with shape 2.5 and scale 0.001.

freqs <- read_allele_freqs(system.file("extdata","FBI_extended_Cauc_022024.csv",
                                       package = "simDNAmixtures"))
gf <- gf_configuration()

sampling_parameters <- list(min_template = 50., max_template = 10000.,
                            degradation_shape = 2.5, degradation_scale = 1e-3)

After the setup, a single function call is sufficient to generate a set of samples. We set n=2 to generate two samples. The contributors to the sample are named U1 and U2 which means that each mixed sample has two unrelated contributors.

set.seed(1)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
                            sampling_parameters = sampling_parameters,
                            model_settings = gf$log_normal_bwfw_settings,
                            sample_model = sample_log_normal_model)

The mixtures object contains the simulation results. The sample_mixtures function has an optional results_directory argument. If the path to a directory is provided, then all simulation results will be persisted on disk in file formats that are readily imported in probabilistic genotyping software. We inspect the parameter_summary property of the result of sample_mixtures.

knitr::kable(mixtures$parameter_summary[1:5])

SampleName	contributors1	contributors2	model	template1
sample_0001_N2_7753_6977_ca	U1	U2	log_normal_model	7752.690
sample_0002_N2_5356_7818_bb	U1	U2	log_normal_model	5356.177

The first sample is in approximately a 2:1 ratio. We print the first 10 peaks of the profile below.

knitr::kable(head(mixtures$samples[[1]]$mixture, 10))

Locus	Allele	Height	Size
D3S1358	14	1481	117.33
D3S1358	15	22260	121.40
D3S1358	16	6660	125.48
vWA	13	174	164.80
vWA	14	4054	168.84
vWA	17	588	180.95
vWA	18	7710	184.99
vWA	19	4561	189.02
D16S539	8	200	239.58
D16S539	9	5827	243.61

Basic gamma example

We repeat the example above with a gamma distribution for peak heights instead of a log normal one. The mean peak height parameter mu is chosen uniformly between 50 and 5,000. The variance parameter cv parameter (σ in (Bleka, Storvik, and Gill 2016)) is chosen uniformly between 0.05 and 0.35. The degradation parameter is sampled from a Beta distribution with parameters 10 and 1, which means most profile will be only mildly degraded.

gamma_sampling_parameters <- list(min_mu = 50., max_mu = 5e3,
                            min_cv = 0.05, max_cv = 0.35,
                            degradation_shape1 = 10, degradation_shape2 = 1)

We now invoke the sample_mixtures function again to sample two mixtures; both consisting of two unrelated contributors. We do not sample stutters.

set.seed(2)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
                            sampling_parameters = gamma_sampling_parameters,
                            model_settings = gf$gamma_settings_no_stutter,
                            sample_model = sample_gamma_model)

We see that the first sample has μ around 1,236 and is in a 65:35 ratio.

knitr::kable(mixtures$parameter_summary[1:4])

SampleName	contributors1	contributors2	model
sample_0001_N2_4297_48_52_bb	U1	U2	gamma_model
sample_0002_N2_4513_34_66_aa	U1	U2	gamma_model

The coefficient of variation for a full heterozygote is about 9% for the first sample.

knitr::kable(mixtures$parameter_summary[c(1,5:7)])

SampleName	mixture_proportions1	mixture_proportions2	mu
sample_0001_N2_4297_48_52_bb	0.4822929	0.5177071	4297.329
sample_0002_N2_4513_34_66_aa	0.3395792	0.6604208	4513.154

We print the first 10 peaks of the profile below.

knitr::kable(head(mixtures$samples[[1]]$mixture, 10))

Locus	Allele	Height	Size
D3S1358	15	3604	121.40
D3S1358	17	2812	129.56
D3S1358	18	1978	133.64
vWA	14	2359	168.84
vWA	16	4001	176.91
vWA	18	2151	184.99
D16S539	11	3291	251.67
D16S539	12	2749	255.70
CSF1PO	10	1303	298.34
CSF1PO	11	2973	302.30

References

Bleka, Øyvind, Geir Storvik, and Peter Gill. 2016. “EuroForMix: An Open Source Software Based on a Continuous Model to Evaluate STR DNA Profiles from a Mixture of Contributors with Artefacts.” Forensic Science International: Genetics 21: 35–44. https://doi.org/10.1016/j.fsigen.2015.11.008.

Buckleton, John S, Jo-Anne Bright, and Duncan Taylor. 2018. Forensic DNA Evidence Interpretation. CRC press.

Gill, Peter, Corina Benschop, John Buckleton, Øyvind Bleka, and Duncan Taylor. 2021. “A Review of Probabilistic Genotyping Systems: EuroForMix, DNAStatistX and STRmix™.” Genes 12 (10): 1559. https://doi.org/10.3390/genes12101559.

Taylor, Duncan, Jo-Anne Bright, and John Buckleton. 2013. “The Interpretation of Single Source and Mixed DNA Profiles.” Forensic Science International: Genetics 7 (5): 516–28. https://doi.org/10.1016/j.fsigen.2013.05.011.