Create a Word Auto-Completer in C using the Trie Data Structure

Pages: 1 2

So first of all a few words about our basic structure, the trie…

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

For example, in our case, our project involves alphabets with every inner node of the trie data structure representing a letter which means that each node has a unique path from the root (can have up to 26 children as the number of the Latin alphabet) that symbolizes a word.

 

In our project, in order to create our fancy auto-completer we will need three structures. Firstly we need the structure of the

inner nodes that store a character that represents the letter of this node, an array of 26 letters that will navigate through the next letter. Moreover to make our program more accurate we will use an array that stores the top leafs (top words) under this inner node and a leaf pointer that is going to show us that the path from the root until this inner node reflects a word. The leaf structure contains the frequency of this word (which is the parent of the leaf) and a pointer to the last letter of the word.

Trie with English alphabet format

Trie with English alphabet format

Important Remarks: The frequency is used to fill the top-leaf array in each inner node so as to recommend the user the most frequent word for auto-completion. Also keep in mind that for saving memory purposes we extract each word of the dictionary that we load by accessing each node/letter of the path (using a simple loop that navigates strlen(word) letters – the length of the word). We only save the whole word and it’s frequency in the top-leaf array because we need a O(1) presentation of the most frequent words to the user.

struct leaf
{
 int frequency;
 struct node *parent;
};

struct topleaf
{
 int frequency;
 char word[32];
};

struct node 
{
 int num_leafs;
 char letter;
 struct node *array[26];
 struct node *parent;
 struct leaf *fyllo;
 struct topleaf topleafs[N];
};

Now that we have created the basic structures of our program we need some functions to make it work! To be more comprehensive I will write the basic steps of each algorithm and at the end of the post you will find the whole code written in c.

The insert function:

struct node* insert(struct node *h, char *c, int value)
{
 struct leaf *l;
 
 int i, j;
 
 l=NULL;
 
 if (strlen(c) == 0)
 return h;
 if (h == NULL)
 h = new_node(h);
 struct node *p = h;

 for (i = 0; i < strlen(c); ++i)
    {
 
 if (p−>array[c[i] − 'a'] == NULL) 
 {
 p−>array[c[i] − 'a'] = malloc(sizeof (struct node));
 p−>array[c[i] −'a']−>letter = c[i];
 
 p−>array[c[i] − 'a']−>parent = p; //assigning parents to internal nodes
 
 for(j=0; j<N; j++)
 p−>array[c[i] − 'a']−>topleafs[j].frequency = −1;
 }
 p = p−>array[c[i] − 'a'];
 }
 
 p−>num_leafs = 0;
 
 l = malloc(sizeof( struct leaf));//allocate space for leaf
 l−>frequency = −1;//reset frequency
 l−>parent = p;//connect leaf with parent
 p−>fyllo = l;//connect parent with leaf
 
 
 update_top_leafs(p, c);
 
 for(i=0; i<strlen(c); i++)//incrementing how many leafs has every internal node across  the path of the new word
 {
 p−>num_leafs++;
 p = p−>parent;
 }
 
 return h;
}

Important Remark: Basically the insert function initializes each letter if it does not exists and at the end of each word it assigns in this last inner node, a leaf which represents that we have a word. Also at the end the update_top_leafs function that I will demonstrate below gets two parameters. The word that we want to get updated and a pointer to the last letter of this word. Then we climb up each node from this last inner node until the root which is also strlen(c) inner nodes and we update the top leafs array of each inner node. By that I mean that if the word that we examine has a higher frequency compared with one of the existing in the top leafs array then we change them to keep the recommendation words updated.

Here is the code for the function update_top_leafs:



Giannis Kanellopoulos

Giannis Kanellopoulos

Biography to be completed

More Posts