Byte Pair Encoding of Protein Amino Acid Sequence

Byte pair encoding (BPE) is an algorithm originally desgined for word compression. This is an example of adopting BPE for tokenziation of protein amino acids sequences to feed it into LLMs, adapted from Asgari et al. This example will use 2500 completely random sequence of uniform size 50 aa. In practice, the sequence length will be vastly different.

Prerequisite

random (optional)

Installation

None

Usage

Import random and set seed for preproducibility

 import random
 SEED = 42
 random.seed(SEED)

Define the amino acid list and the initial vocabulary. Also initilize and define a maximum vocabulary size. (e.g. 1024)

 VOCAB = {
 'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4, 'G': 5,
 'H': 6, 'I': 7, 'K': 8, 'L': 9, 'M': 10, 'N': 11,
 'P': 12, 'Q': 13, 'R': 14, 'S': 15, 'T': 16, 'V': 17,
 'W': 18, 'Y': 19, 'X': 20} 
 MAX_VOCABSIZE = 1024
 AMINOLIST = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'X']

Define housekeeping function to generate a random corpus of sequence and generate random sequences. For this example, a corpus of 2500 random sequence is used to generate the vocabulary.

 def gen_random_seq():
     seq = []
     for i in range(50):
         seq.append(random.choice(AMINOLIST))
     return seq

 def gen_corpus(): 
     corpus = []
     for i in range(2500): #generate 2500 random sequences for the corpus
         corpus_seq = gen_random_seq()
         corpus.append(corpus_seq)
     return corpus

Define function for counting the occurences of each consecutive tokens, and return all the counts.

 def count_dipeptide(corpus):
     pass_counts = {}
     for seq in corpus:
         for i in range(len(seq) - 2 + 1):
             try:
                 dipeptide = seq[i] + seq[i+1]
             except:
                 break
                
             if dipeptide not in pass_counts:
                 pass_counts[dipeptide] = 1
             else:
                 pass_counts[dipeptide] += 1
     return pass_counts

Define the function to update each sequence with the most occuring tokens-pairs and another function to update the corpus with the newly merged sequence.

 def update_sequence(seq, max_dipeptide):
     new_seq = []
     i = 0
     while i < len(seq):
         if i < len(seq) - 1 and seq[i] + seq[i+1] == max_dipeptide:
             new_seq.append(max_dipeptide)
             i += 2
         else:
             new_seq.append(seq[i])
             i += 1
                    
     return new_seq

 def update_corpus(old_corpus, max_dipeptide):
     new_corpus = []
     for seq in old_corpus:
         new_seq = update_sequence(seq, max_dipeptide)
         new_corpus.append(new_seq)
     return new_corpus

Define main BPE logic.

The logic flow is as follow:

In every loop, the the all occurences of combinations of tokens is reported.
Using the the most ocurring token-pair, update all corpus sequences to merge this combination, if any.
Append this token-pair to the vocabulary dictionary.

 corpus = gen_corpus()
 testfa = gen_random_seq()
    
 org_len = len(testfa)

 while len(VOCAB) < MAX_VOCABSIZE:
     counts = count_dipeptide(corpus)
     max_dipeptide, max_count = max(counts.items(), key=lambda x: x[1])
        
     #add to vocab
     VOCAB[max_dipeptide] = len(VOCAB)
        
     #update corpus
     corpus = update_corpus(corpus, max_dipeptide)

Using the vocabulary, compress the test sequence.

 for key, value in VOCAB.items():
     if value >= 21: #skips the 20 base amino acids and X
         testfa = update_sequence(testfa, key)

Compare the original sequence length with the BPE sequence.
```
 print(org_len, len(testfa))
```
The original sequence is 50 aa. The sequence compressed with BPE is 28 tokens long. The compression factor is about 1.8.

Reference

https://www.nature.com/articles/s41598-019-38746-w

back