For the past few weeks, I have been trying to come up with a fast and purely agebraic function to convert DNA bases to their respective complements.
Problem statement
Our goal is rather straightforward. We aim to create a mapping of the characters a
, t
, g
, c
and n
to their respective complements.
We choose to keep our solution one step behind the classical reverse complement which reverses the string after the mapping.
Lastly, we will stick to lowercase characters and not deal with RNA bases for the sake of simplicity.
Here’s what the mapping would look like:
Input | a | t | c | g | n |
Output | t | a | g | c | n |
Kicking the tires
Let us first note that we can’t just use the value a
, t
, c
, etc. as inputs to a function. We must convert them
to a numerical representation.
Although we might be tempted to use a given character’s position in the alphabet, we should prefer using the ASCII character representations since a computer will be used to churn this magic function eventually.
We can easily get the ASCII representations in python as the following:
print(tuple(b'atcgn'))
which gives us (97, 116, 99, 103, 110)
Now, let’s examine the different ways we can approach the problem.
Matrices and polynomials
The crux of this method relies on the existence of a vector such that for each input character and the complement ,
Consider the input character ‘a’ and the corresponding output ’t’. We substitute p and q in the previous expression with the ASCII values 97 and 116 respectively to get the following polynomial.
We can setup a system of polynomials for the remaining mappings in the same way. This gives us 5 equations, which is precisely why we chose 5 coefficients in .
If we had more equations than unknowns, we would have been redundant. Whereas, more unknowns than equations yields infinitely many solutions.
Here’s one more example for c
at 99, mapping to g
at 103.
In general, we have a matrix representing a linear map from to .
Also observe that a linear map would allow us to shift the inputs by some offset and guarantee that the output is also shifted by the same offset.
from sympy import Matrix
mappings = {
'a': 't',
't': 'a',
'g': 'c',
'c': 'g',
'n': 'n',
}
bases = tuple(mappings.keys())
for n in range(-1000, 1000):
mat = []
for dna_base in bases:
dna_base_n = ord(dna_base) - n
dna_compl = mappings[dna_base]
dna_compl_n = ord(dna_compl) - n
row = [dna_base_n ** i for i in range(len(bases))]
row.append(dna_compl_n)
mat.append(row)
coeffs = Matrix(mat).rref()[0][:, -1]
if 0 in coeffs:
print(f'{n} -> {coeffs}')
We prioritize coefficients that contain zeros since they allow us to ignore a term in the resulting polynomial.
Running the experiment yields a single polynomial where one of the coefficients is 0.
110 -> Matrix([[0], [16075/51051], [-72265/204204], [-1721/102102], [235/204204]])
def compl(x: str) -> str:
x = ord(x) - 110
c = (64300 * x -72265 * x**2 -3442 * x**3 + 235 * x**4)//204204 + 110
return chr(c)
Modular arithmetic
In this approach, a given base maps to its complement as
where is some known constant.
We are essentially trying to mend our problem into a Chinese remainder theorem problem. However, all moduli in a CRT problem must be coprime. The numerical representations of our bases aren’t quite coprime.
To deal with this, we double all the entries and add an odd number. More abstractly, for every , must be even. Adding an odd number , then, must make odd.
Varying this can potentially yield some whose elements are coprime.
from itertools import starmap, permutations
import numpy as np
def egcd(a, b):
old_r, r = a, b
old_s, s = 1, 0
while r != 0:
quo = old_r // r
old_r, r = r, old_r % r
old_s, s = s, old_s - quo * s
return (old_r, old_s)
def gcd(a, b):
return egcd(a, b)[0]
def are_coprime(a, b):
return gcd(a, b) == 1
def modinv(a, b):
old_r, old_s = egcd(a, b)
if old_r != 1:
raise ValueError("modular multiplicative inverse is not possible")
return old_s % b
def are_pairwise_coprime(vs):
return all(starmap(are_coprime, permutations(vs, 2)))
u = np.array(tuple(b"atgcn"))
compl = np.array(tuple(b"tacgn"))
# smallest target, i.e., a
min_of_seq = np.min(compl)
compl -= min_of_seq
# the largest value of the complement must be a principal value in (mod N)
ring_enclose = np.max(compl)
ring_enclose += 1 - ring_enclose & 1
print(f"{u} maps to {compl}")
min_answer = float('Infinity')
min_ih = None
for i in range(1, 32):
# -2 * i * min_of_seq so that `vs` is as small as possible
# + field_should_enclose so that `vs` is atleast positive and
# greater than the largest element of `seq`
for h in range(-2 * i * min_of_seq + ring_enclose, 0, 2):
# The linear transform 2iu + h
vs = 2 * i * u + h
if not are_pairwise_coprime(vs):
continue
# apply chinese remainder theorem
pi_n = np.prod(vs)
pi_all_but_ni = pi_n / vs
try:
modinv_ni = np.array(tuple(starmap(modinv, zip(pi_all_but_ni, vs))))
except ValueError:
continue
pieces = pi_all_but_ni * compl * modinv_ni
answer = np.sum(pieces) % pi_n
if answer < min_answer:
min_answer = answer
min_ih = (i, h)
print(f"progress: {min_ih} -> {int(min_answer)}")
min_answer = int(min_answer)
print(f"{min_ih} -> {min_answer}")
We set the upper limit of our experiment as 32 to avoid waiting too long for a reasonable answer.
[ 97 116 103 99 110] maps to [19 0 2 6 13]
progress: (1, -159) -> 47922894
progress: (3, -559) -> 33253051
(3, -559) -> 33253051
With the final solution, we can cook up the following function. Note how we add 97 since we had shifted such that its smallest element was 0.
for base in 'atgcn':
compl = chr(97 + 33253051 % (6*ord(base)-559))
print(f'{base} -> {compl}')
Wrapping up
Those were two ways I could think of mapping DNA nucleobases to their complements. Although this is a contrived example, it was a fun exercise and the modular arithmetic approach is my personal favorite. Let me know how you would have solved this differently by shooting me an email!
Bye now.