Simulation studies are often used in forensic genetics to study the behaviour of probabilistic genotyping software. In many cases, researchers use simplified models that ignore complexities such as peak height variability and stutters. The goal of this package is to make it straightforward to use probabilistic genotyping models in simulation studies.
For a general introduction to forensic genetics and probabilistic genotyping, see (Buckleton, Bright, and Taylor 2018). See also (Gill et al. 2021) for a recent review of probabilistic genotyping systems.
Below, we load allele frequencies provided with the package and demonstrate how mixtures can be sampled using a log normal model for peak heights (refer to (Taylor, Bright, and Buckleton 2013) for details on the model). The template parameters for each contributor are picked uniformly between 50 and 10,000. The degradation parameter is picked from a gamma distribution with shape 2.5 and scale 0.001.
freqs <- read_allele_freqs(system.file("extdata","FBI_extended_Cauc_022024.csv",
package = "simDNAmixtures"))
gf <- gf_configuration()
sampling_parameters <- list(min_template = 50., max_template = 10000.,
degradation_shape = 2.5, degradation_scale = 1e-3)
After the setup, a single function call is sufficient to generate a
set of samples. We set n=2
to generate two samples. The
contributors to the sample are named U1
and U2
which means that each mixed sample has two unrelated contributors.
set.seed(1)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
sampling_parameters = sampling_parameters,
model_settings = gf$log_normal_bwfw_settings,
sample_model = sample_log_normal_model)
The mixtures
object contains the simulation results. The
sample_mixtures
function has an optional
results_directory
argument. If the path to a directory is
provided, then all simulation results will be persisted on disk in file
formats that are readily imported in probabilistic genotyping software.
We inspect the parameter_summary
property of the result of
sample_mixtures
.
SampleName | contributors1 | contributors2 | model | template1 |
---|---|---|---|---|
sample_0001_N2_7753_6977_ca | U1 | U2 | log_normal_model | 7752.690 |
sample_0002_N2_5356_7818_bb | U1 | U2 | log_normal_model | 5356.177 |
The first sample is in approximately a 2:1 ratio. We print the first 10 peaks of the profile below.
Locus | Allele | Height | Size |
---|---|---|---|
D3S1358 | 14 | 1481 | 117.33 |
D3S1358 | 15 | 22260 | 121.40 |
D3S1358 | 16 | 6660 | 125.48 |
vWA | 13 | 174 | 164.80 |
vWA | 14 | 4054 | 168.84 |
vWA | 17 | 588 | 180.95 |
vWA | 18 | 7710 | 184.99 |
vWA | 19 | 4561 | 189.02 |
D16S539 | 8 | 200 | 239.58 |
D16S539 | 9 | 5827 | 243.61 |
We repeat the example above with a gamma distribution for peak
heights instead of a log normal one. The mean peak height parameter
mu
is chosen uniformly between 50 and 5,000. The variance
parameter cv
parameter (σ in (Bleka,
Storvik, and Gill 2016)) is chosen uniformly between 0.05 and
0.35. The degradation parameter is sampled from a Beta distribution with
parameters 10 and 1, which means most profile will be only mildly
degraded.
gamma_sampling_parameters <- list(min_mu = 50., max_mu = 5e3,
min_cv = 0.05, max_cv = 0.35,
degradation_shape1 = 10, degradation_shape2 = 1)
We now invoke the sample_mixtures
function again to
sample two mixtures; both consisting of two unrelated contributors. We
do not sample stutters.
set.seed(2)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
sampling_parameters = gamma_sampling_parameters,
model_settings = gf$gamma_settings_no_stutter,
sample_model = sample_gamma_model)
We see that the first sample has μ around 1,236 and is in a 65:35 ratio.
SampleName | contributors1 | contributors2 | model |
---|---|---|---|
sample_0001_N2_4297_48_52_bb | U1 | U2 | gamma_model |
sample_0002_N2_4513_34_66_aa | U1 | U2 | gamma_model |
The coefficient of variation for a full heterozygote is about 9% for the first sample.
SampleName | mixture_proportions1 | mixture_proportions2 | mu |
---|---|---|---|
sample_0001_N2_4297_48_52_bb | 0.4822929 | 0.5177071 | 4297.329 |
sample_0002_N2_4513_34_66_aa | 0.3395792 | 0.6604208 | 4513.154 |
We print the first 10 peaks of the profile below.
Locus | Allele | Height | Size |
---|---|---|---|
D3S1358 | 15 | 3604 | 121.40 |
D3S1358 | 17 | 2812 | 129.56 |
D3S1358 | 18 | 1978 | 133.64 |
vWA | 14 | 2359 | 168.84 |
vWA | 16 | 4001 | 176.91 |
vWA | 18 | 2151 | 184.99 |
D16S539 | 11 | 3291 | 251.67 |
D16S539 | 12 | 2749 | 255.70 |
CSF1PO | 10 | 1303 | 298.34 |
CSF1PO | 11 | 2973 | 302.30 |