For the past few weeks, I have been trying to come up with a fast and purely agebraic function to convert DNA bases to their respective complements.

Problem statement

Our goal is rather straightforward. We aim to create a mapping of the characters a, t, g, c and n to their respective complements. We choose to keep our solution one step behind the classical reverse complement which reverses the string after the mapping.

Lastly, we will stick to lowercase characters and not deal with RNA bases for the sake of simplicity.

Here’s what the mapping would look like:

Inputatcgn
Outputtagcn

Kicking the tires

Let us first note that we can’t just use the value a, t, c, etc. as inputs to a function. We must convert them to a numerical representation.

Although we might be tempted to use a given character’s position in the alphabet, we should prefer using the ASCII character representations since a computer will be used to churn this magic function eventually.

We can easily get the ASCII representations in python as the following:

print(tuple(b'atcgn'))

which gives us (97, 116, 99, 103, 110)

Now, let’s examine the different ways we can approach the problem.

Matrices and polynomials

from sympy import Matrix

mappings = {
    'a': 't',
    't': 'a',
    'g': 'c',
    'c': 'g',
    'n': 'n',
}

bases = tuple(mappings.keys())

for n in range(-1000, 1000):
    mat = []
    for dna_base in bases:

        dna_base_n = ord(dna_base) - n
        dna_compl = mappings[dna_base]
        dna_compl_n = ord(dna_compl) - n

        row = [dna_base_n ** i for i in range(len(bases))]

        row.append(dna_compl_n)
        mat.append(row)

    coeffs = Matrix(mat).rref()[0][:, -1]
    # prioritize zeros
    # if a coefficient is 0, we can skip raising x to some power
    if 0 in coeffs:
        print(f'{n} -> {coeffs}')

Running the above experiment yields us just one polynomial where one of the coefficients is 0.

110 -> Matrix([[0], [16075/51051], [-72265/204204], [-1721/102102], [235/204204]])
def compl(x: str) -> str:
    x = ord(x) - 110
    c = (64300 * x -72265 * x**2 -3442 * x**3 + 235 * x**4)//204204 + 110
    return chr(c)

Modular arithmetic

from itertools import starmap, permutations
import numpy as np


def egcd(a, b):
    old_r, r = a, b
    old_s, s = 1, 0

    while r != 0:
        quo = old_r // r
        old_r, r = r, old_r % r
        old_s, s = s, old_s - quo * s

    # if b != 0:
    #     t = (old_r - old_s * a) // b
    # else:
    #     t = 0

    # return (old_r, old_s, t)
    return (old_r, old_s)


def gcd(a, b):
    return egcd(a, b)[0]


def are_coprime(a, b):
    return gcd(a, b) == 1


def modinv(a, b):
    old_r, old_s = egcd(a, b)
    if old_r != 1:
        raise ValueError("modular multiplicative inverse is not possible")
    return old_s % b


def are_pairwise_coprime(vs):
    return all(starmap(are_coprime, permutations(vs, 2)))


seq = np.array([ord(x) for x in "atgcn"])
compl = np.array([ord(x) for x in "tacgn"])

# smaller numbers
min_of_seq = np.min(compl)
compl -= min_of_seq

# the largest value of the complement must be a principal value in (mod N)
field_should_enclose = np.max(compl)
# but it must also be odd because even number + odd number = odd number
# inside the loop body below ensures that we have some chance of getting
# a set of pairwise coprime numbers after the linear transform on `vs`
field_should_enclose += 1 - field_should_enclose & 1
print(f"{seq} maps to {compl}")

min_answer = float('Infinity')
min_ij = None

for i in range(1, 32):
    # -2 * i * min_of_seq so that `vs` is as small as possible
    # + field_should_enclose so that `vs` is atleast positive and
    # greater than the largest element of `seq`
    for j in range(-2 * i * min_of_seq + field_should_enclose, 0, 2):

        # The linear transform
        vs = 2 * i * seq + j

        if not are_pairwise_coprime(vs):
            continue

        # apply chinese remainder theorem
        pi_n = np.prod(vs)
        pi_all_but_ni = pi_n / vs
        try:
            modinv_ni = np.array(tuple(starmap(modinv, zip(pi_all_but_ni, vs))))
        except ValueError:
            continue

        pieces = pi_all_but_ni * compl * modinv_ni
        answer = np.sum(pieces) % pi_n

        if answer < min_answer:
            min_answer = answer
            min_ij = (i, j)
            print(f"progress: {min_ij} -> {int(min_answer)}")

        # print(f"{i}, {j} -> {vs} -> {answer}")

min_answer = int(min_answer)
print(f"{min_ij} -> {min_answer}")

We set the upper limit of our experiment as 32 to avoid waiting too long for a reasonably decent answer.

[ 97 116 103  99 110] maps to [19  0  2  6 13]
progress: (1, -159) -> 47922894
progress: (3, -559) -> 33253051
(3, -559) -> 33253051

With the last solution, we can cook up the following function. Note how we add 97 (ASCII a) because it was the minimum of the complements

(97, 116, 99, 103, 110)

and we intentionally subtracted it, shifting all the values closes to 0. Thus, the complements become [19 0 2 6 13] instead of b"tacgn".

for base in 'atgcn':
    compl = chr(97 + 33253051 % (6*ord(base)-559))
    print(f'{base} -> {compl}')