SHILKA (keywords description translator): Introduction

1. Introduction

SHILKA is translator of keywords description into code for fast recognition of keywords and standard identifiers in compilers. SHILKA is implemented with the aid of other components of COCOM toolset.

SHILKA is analogous to GNU package `gperf' but not based on perfect hash tables. SHILKA rather uses minimal pruned O-trie for for keyword recognition. As consequence SHILKA can take the presumable frequency of keyword occurences in the program into accout. Gperf can not make it. Therefore as rule keyword recognition code generated by SHILKA is faster than one generated by Gperf up to 50%.

SHILKA is suitable for fast recognition from few keywords to huge dictionary of words (strings).

What is minimal pruned O-trie? Let us consider what is trie. If we have four keywords: case, char, else, enum. We can recognize the keywords with the following structure called trie.


                             |
                    -----------------
                 c |               e |
                -------          --------- 
             a |     h |       l|        n|
             s |     a |       s|        u|
             e |     r |       e|        m|

The corresponding code for the keywords recognition based on this structure could be


              if (kw[0] == 'c')
                {
                  if (kw[1] == 'a')
                    {
                      if (kw[2] == 's')
                        {
                          if (kw[3] == 'e')
                            {
                              /* we recognize keyword case */
                            }
                          else
                            /* this is not a keyword */
                        }
                      else
                       /* this is not a keyword */
                    }
                  else if (kw[1] == 'h')
                    {
                      if (kw[2] == 'a')
                        {
                          if (kw[3] == 'r')
                            {
                              /* we recognize keyword char */
                            }
                          else
                            /* this is not a keyword */
                        }
                      else
                       /* this is not a keyword */
                    }
                  else
                    /* this is not a keyword */
                }
              else if (kw[0] = 'e')
                {
                  if (kw[1] == 'l')
                    {
                      if (kw[2] == 's')
                        {
                          if (kw[3] == 'e')
                            {
                              /* we recognize keyword else */
                            }
                          else
                            /* this is not a keyword */
                        }
                      else
                       /* this is not a keyword */
                    }
                  else if (kw[1] == 'n')
                    {
                      if (kw[2] == 'u')
                        {
                          if (kw[3] == 'm')
                            {
                              /* we recognize keyword enum */
                            }
                          else
                            /* this is not a keyword */
                        }
                      else
                       /* this is not a keyword */
                    }
                  else
                    /* this is not a keyword */
                }

You can see in the example above that it is not necessary to test all characters of the keywords. Instead of this, we can test only several characters of the keywords and test all kewyord at the end of final decision that given string is a keyword. Such technique results in another structure called pruned trie:


                             |
                    -----------------
                 c |               e |
                -------          --------- 
             a |     h |      l |       n |
               |       |        |         |
             case     char     else      enum

The corresponding code for the keywords recognition based on this structure could be


              if (kw[0] == 'c')
                {
                  if (kw[1] == 'a')
                    {
                      if (strcmp (kw, "case") == 0)
                        /* we recognize keyword case */
                      else
                        /* this is not a keyword */
                    }
                  else if (kw[1] == 'h')
                    {
                      if (strcmp (kw, "char") == 0)
                        /* we recognize keyword char */
                      else
                        /* this is not a keyword */
                    }
                  else
                    /* this is not a keyword */
                }
              else if (kw[0] = 'e')
                {
                  if (kw[1] == 'l')
                    {
                      if (strcmp (kw, "else") == 0)
                        /* we recognize keyword else */
                      else
                        /* this is not a keyword */
                    }
                  else if (kw[1] == 'n')
                    {
                      if (strcmp (kw, "enum") == 0)
                        /* we recognize keyword enum */
                      else
                        /* this is not a keyword */
                    }
                  else
                    /* this is not a keyword */
                }

Probably you found that if we test keywords characters in another order (not in sequential order), we could recognize keywords faster. Using such approach results in another structure called pruned O-trie:


                               |
                   ------------2-------------
                a |      h |       l |     n |
                  |        |         |       |
                case      char     else     enum

Here number on the intersection means what keyword character (1st, 2nd, ...) is tested. The corresponding code for the keywords recognition based on this structure could be


              if (kw[1] == 'a')
                {
                  if (strcmp (kw, "case") == 0)
                    /* we recognize keyword case */
                  else
                    /* this is not a keyword */
                }
              else if (kw[1] == 'h')
                {
                  if (strcmp (kw, "char") == 0)
                    /* we recognize keyword char */
                  else
                    /* this is not a keyword */
                }
              else if (kw[1] == 'l')
                {
                  if (strcmp (kw, "else") == 0)
                    /* we recognize keyword else */
                  else
                    /* this is not a keyword */
                }
              else if (kw[1] == 'n')
                {
                  if (strcmp (kw, "enum") == 0)
                    /* we recognize keyword enum */
                  else
                    /* this is not a keyword */
                }
              else
                /* this is not a keyword */

And finally, minimal in the phrase "minimal pruned O-trie" means that we found pruned O-trie with minimal number of testing the keyword characters. Generally speaking we can introduce notion cost for pruned O-trie and search for prunned O-trie with minimal cost. Shilka takes probability of keyword occurences in program into account for evaluation of the cost of prunned O-trie.

Next Previous Contents