For simplicity, we set up the computational genome as a perfect beads-on-a-string model with $3\cdot 10^7$ nucleosomes$^1$ in a perfect linear array. The $10^6$ cells comprising the full cellular genome in the calculations can be arranged into a latice as shown in Fig. 1.
Fig. 1 shows a representative example of a browser window 41 nucleosomes wide with contribution from 20 cells. The peaks in the browser are obtained as the ratio of IP'd nucleosomes to input nucleosomes. The lattice shows all nucleosomes and therefore corresponds to the input. Summation down a column in the nucleosome lattice yields the number of input nucleosome fragments that map to the column's position in the genome.
Note that each comlumn contains different amounts of each of the four types of nucleosomes. At a position $x$ in the genome we can write the number of one type, say $i=$orange, as $n^t_i(x)$. When $i=$orange and $x=4$, Fig. 1 corresponds to $n^t_{\text{orange}}(4)=6$. The total number of fragments in the input is then the sum over nucleosome types at $x=4$. Generally we have $n^t(x)=\sum_i n^t_i(x)$ for the total number of input fragments at $x$. Notice also that we could count all fragments of a particular type by writing $n^t_i=\sum_x n^t_i(x)$. Even though we cannot measure it exactly, this would let us write the true distribution of type-$i$ nucleosomes as $o_i(x) = n^t_i(x)/n^t_i$. This function is important later, but is not directly observable.
If we write $n^b_i(x)$ for the average number of type-$i$ fragments that were bound in the IP reaction, then we can write down an expectation for the peaks in the genome browser. The browser track is given by $$t(x)=\frac{\sum_i n^b_i(x)}{\sum_in^t_i(x)} \tag{1}$$ where we write $t(x)$ for the browser track at $x$. The nucleosomal lattice contains a number of nucleosomes that are of an inert type. These nucleosomes do not interact with antibody and do not get captured in IP. Thus, these nucleosomes contribute only to the denominator of the track. Because of these nucleosomes it is likely that 100% efficiency can never be obtained, even with a perfect antibody, unless the cells are completely homogeneous at some values of $x$. ---generally the inert nucleosomes can also be used to model loss of nucleosomal occupancy but we do not pursue that interpretation here.
To be able to simulate how peaks change in response to experimental conditions or perturbations, we need a model for the total number of bound fragments $n^b_i=\sum_x n^b_i(x)$ for all the types of nucleosomes. The number of total particles of any "species" or type of nucleosome can be converted to a concentration as $$S^t_i = \frac{n^t_i}{V\,N_A} \tag{2}$$ where $V$ is the volume in which the particles are solvated and $N_A$ is Avogadro's number. We use the symbol $S^t_i$ to indicate the total concentration of type-$i$ (or species $i$) particles (or nucleosomes). Converting to concentration models the act of solvating the chromatin for the IP reaction. In order to understand how the distribution of bound fragments might change for different IP conditions, we can use the standard conservation of mass arguments that are used universally to determine binding constants.
Once we have converted to total concentrations we can write down the usual Mass Balance for the IP reaction. For fixed total antibody and nucleosome concentrations, the IP reaction obeys the mass balance $$A^t = A^f+\sum_{i=0}^3S_i^fK_{B,i}A^f \,\,\text{subject to} \\ S_i^t = S^f_i+S^f_iK_{B,i}A^f, \,\,\forall i \tag{3}$$ where $A^t$ is the total concentration of antibody, $S^t_i$ is the total concentration of nuclesomes of species $i$, while $A^f$, and $S^f_i$, are free, or unbound, concentrations. For the inert nucleosomes we set the binding constant as $K_{B,3}=0$ to reflect that these do not interact with antibody. The other binding constants are finite, but unknown, and we assume that the "antibody target" is the "target" nucleosome class. We enforce this by making the binding constant for target larger than the binding constants for other species.
There are a number of ways to solve this system of equations. Our solution is to first rewrite these equations as $$A^f = \frac{A^t}{1+\sum_{i=0}^3S_i^fK_{B,i}} \,\,\text{subject to} \\ S_i^f = \frac{S^t_i}{1+K_{B,i}A^f}, \,\,\forall i \tag{4}$$ From here, one can make an initial guess of the values $A^f$ and $S^f_i$ and iterate to convergence. To be certain the initial guess is always physically possible we initialize to $A^f=0$ and $S^f_i = \max[0,S^t_i-A^t/3]$ for all $i<3$, since one of the four species is non-interacting. Plugging these values in to the equation for $A^f$ updates the free antibody concentration and that update is plugged into the equations for each $S^f_i$. After a few iterations, the values converge to a fixed point and further iterations produce no change in values.
Fig. 2 shows the solution to the mass balance for an IP with three species that interact with antibody. Solving the mass balance yields the set of free concentrations $A^f$ and $S^f_i\, \forall i$, allowing the concentration of bound nucleosomes $S^b_i=S^t_i-S^f_i$ to be computed. Fig. 2 plots the outcome of simulating the IP binding reaction in two ways: $$\text{Species capture efficiency}=\frac{S^b_i}{S^t_i}=\frac{n^b_i}{n^t_i} \tag{5}$$ and the more useful but seldom considered $$\text{IP product composition}=\frac{S^b_i}{\sum_i S^b_i}=\frac{n^b_i}{\sum_i n^b_i} \tag{6}$$ The IP product composition is the fraction of the total IP product due to capture of species $i$. This quantity expresses the compositions of fragments that will be sequenced. The figures show that after ten iterations the solution has converged.
The Case 1 and Case 2 buttons below Fig. 2 demonstrate two critical results about IP reactions. First, Case 1 shows that for a 500-fold selective antibody the species capture efficiencies fall very much in-line with expectation: Near 100% capture efficiency for target and less than 30% capture efficiency for the off-target species. However, the IP product distribution is 50% off-target particles. The species capture efficiency does not report on the composition of the IP product. In this example the Off-target-1 concentration is 0.05$\mu$M and the target concentration is 0.012$\mu$M. The antibody concentration is 0.33$\mu$M, which is more than enough to saturate the target and bind to a significant amount of off-target species. The Case 2 button demonstrates that lowering the antibody preferentially sacrifices off-target binding events while minimally impacting the number of target binding events.
The simulation shown in Fig. 2 corresponds to a chromatin that is much larger than that depicted in Fig. 1. The simulation corresponds to a chromatin collected from $10^6$ cells with each cell yielding $3\cdot 10^7$ nucleosomes solvated in 200$\mu$L of buffer. The total pool of nucleosomes was split into four fractions where $\sum_i^3 f_i =1$ and $3\cdot 10^{13} \times f_i$ is the number of nucleosomes of type $i$. $f_0$ is variable, controlled by the Target slider above Fig. 2, and $f_1=0.2$, $f_2=0.05$. Thus, the inert fraction is determined dynamically with $f_3=1-f_0-f_1-f_2$. The maximal inert fraction is $f_3=0.75$ when $f_0=0$, and the minimal fraction $f_0=0.55$ when $f_3=0.2$. The total concentration of chromatin (nucleosomes in this model) for $\mathcal{C}$ cells is $$\frac{\text{Moles Chromatin}}{\text{Liter}} = \Big [ \mathcal{C}\frac{\epsilon \cdot 3\cdot 10^7 \cdot \sum_{i=0}^3f_i}{6.022\cdot 10^{23} \cdot 200\cdot 10^{-6}}\Big] \tag{7}$$ We assume the efficiency of harvesting chromatin from cells is $\epsilon = 1$, but this efficiency will be much lower in practice, but unbiased nonetheless. Making this model more complex does not preclude any of the present calculations, nor do these calculations suggest the model complexity is limited, so we confine ourselves to perfect mononucleosomal chromatin fragments for simplicity here.
We introduced the function $o_i(x)=\frac{n^t_i(x)}{n^t}$ in the context of Fig. 1. Note that this allows us to write $n^t_i(x) =V\,N_A\,o_i(x)S^t_i$. Formally the solutions to the mass balance obey $$S^b_i = S^t_i\frac{A^fK_{B,i}}{1+A^fK_{B,i}} \tag{8}$$ and in combination with $o_i(x)$ this produces $$n^b_i(x)= V\,N_A\,o_i(x)\,S^b_i \tag{9}$$ This result for $n^b_i(x)$ is a central result for this work as it defines an expected outcome for fragments at $x$ generated by IP.
We suppose that fragmented chromatin was suspened in a buffer of volume $V$ and prior to reacting antibody a small sample of volume $v$ was withdrawn from the solvated chromatin. This small sample is the "input" sample. With this modification to the volumes we have $$n^b_i(x)= (V-v)\,N_A\,o_i(x)\,S^b_i \\ n^t(x) = v\,N_A\,o_i(x)\,S^t_i \tag{10}$$ Equation (1) defined the expected sequencing track for the model chromatin in Fig. 1. This expression can now be recast in terms of the outcome of the mass balance where the reaction volume and input sample volume are correclty modelled $$t(x) = \frac{v}{V-v}\frac{\sum_{i=0}^2 n^b_i(x)}{\sum_{i=0}^3 n^t_i(x)} = \frac{\sum_{i=0}^2 o_i(x)S^b_i}{\sum_{i=0}^3 o_i(x)S^t_i} \tag{11} $$ This expression gives us a sequencing track in units of capture efficiency as a function of genomic position. The act of sequencing the input and IP materials gives us direct observations on $\sum_{i=0}^2 n^b_i(x)$ and $\sum_{i=0}^3 n^t_i(x)$ without resolving the individual terms in these sums and without any explicit observation of $o_i(x)$. The outcome of sequencing only yields a convolution of nucleosome types at each position $x$. Equation (11) is accurate when all of the input and IP samples are sequenced. This is not always done in practice, see the article (link article) accompanying this web page for specialization to sequencing fractions of the samples.
In order to plot the genomic peaks as functions of the mass balance outcome, we need to choose representations for the functions $o_i(x)$. In reality these functions are dictated by the distribution of epitopes in a population of cells. For simplicity, we represent $o_i(x)$ with the Normal distribution.
Again for simplicity, let the mock genome be comprised of 75000 identical windows of 400 nucleosomes. The genomic interval visualized here is one of these identical windows. As illustrated in Fig. 1, each position in the interval is evaluated as the superposition of $10^6$ nucleosomes, one from each cell. The number of inert nucleosomes at each position is determined as a constraint that there must be $10^6$ nucleosomes at each position. Each of the antibody-interacting species is distributed in the window according to the Normal distribution $o_i(x)=N(\mu_i,\sigma^2)$, with $\mu_i$ the position of species $i$. The width of the distributions is determined as follows such that no location has more than $10^6$ nucleosomes.
The visualized window (Fig. 3) represents $400\cdot 10^6$ nucleosomes, where $400\cdot 10^6(f_0+f_1+f_2)$ of those nucleosomes interact with antibody. The number of nucleosomes for each species in the window is $400\cdot 10^6\cdot f_i$, in analogy with Fig. 1, and the maximal number for any species $\mathcal{N}^*$ is given when $f_i=0.2$. Thus, we set $$\sigma^2=\frac{\mathcal{N}^*}{\frac{2}{3}\sqrt{2\pi}10^6}$$ where $\frac{2}{3}$ determines the maximal level of homogeneity. At the peak of a distribution $N(\mu_i,\sigma^2)$ no less than one third of the nucleosomes can be inert. When target concentration is reduced, the fraction of inert nucleosomes increases in order to maintain a total of $400\cdot 10^6$ nucleosomes in the wondow, according to the constraing on $f_3$ given previously. This constraint on the number of nucleosomes at each position in the interval means that removal of target epitope does not cause reduction of nucleosomal density.
At this point we can solve the binding model and plot the outcome in a browser track via $t(x)$. Before showing those results, however, it is worth explicitly showing how to include spike-ins in this model.
The above framework already admits treatment of spike-in chromatin. Currently the coordinate $x$ has been a position in the host genome $\mathcal{X}$. If we have exogenous spike-ins that map to a different genome, $\mathcal{Z}$, then we can build the full sample genome as $\mathcal{X}+\mathcal{Z}$ and we can choose any position $x$ in that sample genome. When $x$ maps explicitly to the $\mathcal{Z}$ genome, all of the above results hold.
For semi-synthetic spike-ins$^2$ we are guaranteed certain simplifications. Because these nucleosomes are prepaired synthetically and each nucleosomal modification corresponds to a specific and unique DNA barcode, any choice $x \in \mathcal{Z}$ is free from all heterogeneity. Because of this, each choice of $x\in\mathcal{Z}$ corresponds to a specific species. The values of $x\in\mathcal{Z}$ are therefore in exact correspondence with species specific labels, or barcodes.
For the genome position $x \in \mathcal{Z}$ which corresponds to species $j$, we have the following simplification $$t(x) = \frac{v}{V-v}\frac{\sum_{i=0}^2 n^b_i(x)}{\sum_{i=0}^3 n^t_i(x)} =\frac{v}{V-v}\frac{o_j(l)n^b_j}{o_j(l)n^t_j}=\frac{S^b_j(l)}{S^t_j(l)}\sim \frac{\text{bound species j}}{\text{total species j}}\tag{12}$$ We write $S^t_j(l)$ for the total species $j$ with spike-in label (or barcode) $l$. In the simulations considered here, nucleosomes bear only one PTM at a time. In this case, the spike-in capture efficiency for a species is actually equal to the capture efficiency of that species in bulk. This capture efficiency is not the capture efficiency at a particular genomic location $x\neq l$ because of heterogenity of nucleosomes at $x$. In reality, cellular chromatin may only rarely present a single PTM on a nucleosome. --- real chromatin isn't single PTM and can be more heterogeneous than in this model, so it is a big stretch to assume spike-in is accurate in bulk.---
The chromatin is expected to be significantly more complex than the synthetic spike-ins, and because of heterogeneity, spike-in efficiency will not be realized at any particular genomic coordinate unless that coordinate is 100% homogeneous for the target. Spike-ins are in fact genomic intervals, in our $\mathcal{X}+\mathcal{Z}$ genome, that are fully homogeneious. Nearly no interval in the cellular genome, $\mathcal{X}$, will be homogeneous as this would require every cell to have a nucleosome at the exact same position with the exact same modifications. The complexity and heterogeneity of cellular chromatin undermine application of the spike-in capture efficiency as a proxy for chromatin capture efficiency. Thus, when considering any specific genomic interval in $\mathcal{X}$, the right-hand side of equation (12) is a claim presented in current literature that should be, but likely cannot be, justified.
Equation (12) shows that if $n^t_j$ fragments of spike-in species $j$ were sequenced from volume $v$ of the input sample and $n^b_j$ fragments of spike-in species $j$ were sequenced from volume $V$ of IP sample, the volume weighted ratio of those sequenced fragments is an estimate of the binding efficiency of that species in bulk. Keep in mind that in this numerical experiment all of the sample volumes $V$ and $v$ are sequenced so only volume weighting is required. When that volume weighting is used, the spike-ins can be used to quantify capture efficiency of total species.
In Fig. 2 we demonstrated that capture efficiency and IP composition are not equivilant quantities. That is, an IP can have 100% capture of one species and 10% capture of another yet the bulk of the captured material may be due to the species with 10% efficiency. This was demonstrated with Case 1 in Fig. 2 above. What we have just shown with Equation (12) is that synthetic spike-ins quantify the capture efficiency of the corresponding chromatin species, just as they were designed to do. However, we also have that the spike-in efficiency is an upper bound for specific PTM capture efficiency on any interval of $\mathcal{X}$ because of heterogeneity.
As Fig. 2 Case 1 demonstrates, the target species can be captured with very high efficiency while the total captured IP mass is comprised mostly of off target material. In Case 1, the target concentration is lower than the concentration of off-target-1 and antibody concentration exceeds target concentration. Case 2 demonstrates that reduced antibody concentration causes the IP material to be mostly comprised of target but the target capture efficiency does not respond to altered antibody load. The off-target capture efficiencies suffer with reduced antibody, but target efficiency is not impacted. First, because the capture efficiency does not reflect the composition of IP material, and second, because the capture efficiency is not responsive to massive perturbation (like altered antibody load), the spike-ins cannot be thought of as producing a meaningful normalization for ChIP-seq. Case 1 and Case 2 would produce very different sequencing tracks due to their respectively different compositions in IP'd material, yet the spike-in normalization (capture efficiency) is the same in both cases. Normalizing to spike-ins amounts to doing nothing because the normalization factor is the same in Case 1 and Case 2.
Fig. 3 combines all the model aspects described above into one figure and provides the Case 1 and Case 2 presets from Fig. 2. Notice that when switching from Case 1 to Case 2 the height of the target peak does not change while the height of the off target peaks do change. The spike-in normalization, which is proportional to capture efficiency when volume weights are ignored (as they are ignored in practice), also does not change. Thus, sequencing the Case 1 and Case 2 IP products will produce tracks with some identical and some very different sized peaks but these changes will not be reflected in the spike-in normalization. The spike-in efficiency is both unnecessary and uninformative as a normalization. Notice also that when the spike-in target efficiency is lower, the ChIP-seq outcomes are less contaminated by off-target. Thus, the spike-in efficiency could be used as an indication of potential signal contamination but this use is not discussed in current literature.
Fig. 3 also has presets to mimic the epitope depletion experiment that we performed (Link to paper when paper is published). The model settings are not tuned in anyway to match the actual data, and in order to match the experiment we would actually need to correclty model the chromatin heterogeneity. We have made no attempt to do this. The key model predictions to note are how the spike-in capture efficiencies change when switching from DMSO to EPZ6438. Target spike-in capture efficiency increases. The second point to note is that the percent composition of the IP product changes drastically. Switching from DMSO to EPZ6438 results in an IP product that is mostly off-target chromatin. The simulated sequencing outcome shows that the peak height associated with off-target capture should increase slightly, and we did indeed observe this in our experiment. (Link to the paper again once published)