LiPG Documentation

The Lithium Parser Generator is (obviously) a parser generator and is designed to be simple and easy to use. Like lithium, LiPG (pronounced "lie-pee-jee") can be easily molded into the shape (or parser) you need.

Compared to parser generators like Bison that use BNF grammars, the format of LiPG files can more closely resemble real-world language descriptions. In addition, it is impossible to produce an ambiguous grammar in LiPG. It combines aspects of both a lexer and a parser so that concepts of your target domain-specific language (DSL) are not separated. To this end, LiPG takes only one file as input and produces one readable C++ file as output.

Like most parser generators, LiPG is for creating languages, including domain-specific languages. In general, LiPG can take pieces of an input string and perform actions based on which pieces it finds.

Download just the Windows binary here: LiPGv1.0 (windows).exe
Download just the Mac binary here: LiPGv1.0 (mac)
Download just the Linux binary here: LiPGv1.0 (linux)
Download the source code, documentation, and tutorials here: LiPG v1.0.zip

Contents: Using LiPG
                 What is a parser generator?
                 Known bugs/quirks

                 Tutorial 0
                 Phone-number Tutorial 1
                 Phone-number Tutorial 2
                 Phone-number Tutorial 3
                 Tutorial 4 - Simple Integer Calculator
                 Tutorial 5 - Test Parser
                 LiPG file format - Full specification
                     Calling a parse function as a C++ function

What is a parser generator?

Simply put, a parser generator is a program that makes it easy to create a parser. What this means is that, in essence, the parser generator is a compiler for a domain specific programming language geared toward creating parsers. One writes a parser in this special code, then "compiles" it using the parser generator. The output of the parser generator is generally code in a more general programming language. The main function of the parser generator's input language is to make it very easy for you to create complex parsers.

So what types of things are parser generators used for? One main category is creating compilers, interpreters, or translators for programming languages or domain specific languages. Another category might be extracting information from a complex file format. If you are involved with complex languages or formats, you stand the most to gain from using a parser generator. The code will be short and look relatively nice.

However, if the parsing you do is mostly on static formats (for example, file formats whose layout doesn't depend on the data inside the file), then a parser generator is probably overkill. It may look prettier, but it may not reduce the amount of code you have to write.

Using LiPG

LiPG takes a filename either as a single command-line argument or, if you don't supply one, from standard input. On Windows, you can simply drag a file formatted for LiPG onto the executable.

To compile LiPG, all you need is LiPGv1.0.cpp and the three included header files (basicDynamicTypesv080206.h, neccessaryFunctions080106.h, theStrings v080218.h). There is no makefile because compilation requires only the one .cpp file: no linker flags, and no libraries other than the C standard library and the three included headers.

You can also generate LiPGv1.0.cpp from the LiPGv1.0.txt file by running it through LiPGv1.0.exe. LiPG does indeed generate itself. This is the easiest way to modify it, if you so choose.

Known bugs/quirks

  • LiPG may not compile correctly with certain optimizations - some optimizations may cause the program to crash at runtime.

  • anychar[ ] variables have a maximum of 20000 characters. Going over that can cause problems.

  • The 'now' variable is transformed into either (&(input[*NOW])) or (&(input[NOWtemp0])) in the C code.

  • LiPG parse functions use the following variables internally: 'wordL', 'tempLabel', 'zeroOne', 'longest', 'countparens', 'diffTemp', and 'NOWtemp0', 'NOWtemp1', 'NOWtemp2', etc. If you redeclare these variables inside a parse function, it could cause the function to work improperly.

  • LiPG parse functions translate anychar and anychar[ ] variables into "limas_x" where x is the name of the anychar or anychar[ ] variable. Creating an anychar variable of the same name as an input parameter, or global variable won't give an error, but will almost certainly mess up your code.

  • Since LiPG evaluates every wordform and chooses the best one, variables passed in to parse functions that fail may still be modified.

  • LiPG anychar variables should not be expected to retain their value between wordforms.

  • LiPG supports recursion - but not like most BNF parser generators do. Recursion in LiPG is exactly like recursion in C or most programming languages. Recursive parse functions in LiPG need base-cases just like recursive functions do in C. Creating a badly formed recursive parse function will cause a stack overflow. Sadly, there is no portable way to detect and handle stack overflow - so LiPG will simply quit immediately without telling you what happened.

  • Feel like fixing these problems? Feel free. LiPG is free software under version 3 of the GNU General Public License.

Tutorial 0

The code for each tutorial should be in the LiPG v1.0.zip file. If you don't want to download that, a link to the tutorial source code is at the bottom of each tutorial.

One of the great things about LiPG is that there is nothing to set up. Either download a binary from above, or compile LiPG from the C++ source code provided in LiPG v1.0.zip.

To get started, this will just be a very quick demonstration that LiPG works for you.

glob
[  

The glob block ('glob' for 'global') is for global C++ constructs like global variables, #include statements, and global functions - including main. This main declares a string variable that has the string "hello" in it. We will test it in this tutorial.

    #include <stdio.h>
    #include <string.h>
    main()
    {   int diff;
        char exampleInput[] = "hello";

The function 'tutorial0' is a parse function that is defined later in the code. Parse functions take at least three arguments. The first argument is an int* variable that will hold the number of characters parsed. The second argument is the char* string to be parsed. The third argument is the length (int) of the string. The function will return false if it fails.

        tutorial0(&diff, exampleInput, strlen(exampleInput));
        getchar();
    }
]

Here is the parse function named 'tutorial0'. It prints one message if the input it is given begins with "hello", and a different message otherwise.

parse tutorial0
[  

The parse function consists of choice blocks. These always begin with an arrow. The choice block in turn consists of choices. This choice block only has two choices - "hello" and the else choice.

Here, the "hello" is the wordform. A wordform is a potential match to the input string. Here the input string must begin with "hello" to match (it does not have to be exactly "hello", e.g. the string can be "hello asdfasdfasf" and still match).

The wordform can be much more powerful than this, involving variables and functions, but that will be discussed later.

->  "hello"

This next part is the action bracket. This is C++ code that will be run if the above wordform is the longest (or only) match. Here, the code simply prints hello in another form - "H E L L O". It then returns true to indicate success.

    [  printf("H E L L O");
       return true;  // indicate success
    ]

The else choice is analogous to 'else' in an if-else statement. It should appear at the bottom of the set of choices. Its code is executed if no other choice matches. Here, it just prints "goodbye." and returns false for failure.

    else
    [  printf("goodbye.");
       return false; // indicates failure
    ]
]

The output of this code should simply be "H E L L O", after which the program waits for you to press Enter. You can change the value of exampleInput to see what matches and what fails.
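
For comparison, the behaviour of this parse function can be approximated by hand in plain C++. The sketch below is not what LiPG generates: the name 'helloByHand' is hypothetical, it only assumes the three-argument calling convention described above, and what it stores in 'diff' on failure is an assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hand-written approximation of the 'tutorial0' parse function.
// Not LiPG output; 'helloByHand' is a hypothetical name.
bool helloByHand(int* diff, const char* input, int length)
{
    assert(diff != nullptr && input != nullptr);
    const char* word = "hello";
    const int wordLength = static_cast<int>(std::strlen(word));

    // The wordform only has to match the beginning of the input,
    // so "hello asdf" is accepted as well as "hello".
    if (length >= wordLength && std::strncmp(input, word, wordLength) == 0)
    {
        *diff = wordLength;      // number of characters consumed
        std::printf("H E L L O");
        return true;             // indicate success
    }

    // The else choice: nothing matched.
    *diff = 0;                   // assumption: report zero characters on failure
    std::printf("goodbye.");
    return false;                // indicate failure
}
```

Calling helloByHand(&diff, "hello asdf", 10) succeeds and leaves 5 in 'diff'; calling it on "help" fails.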

Download the LiPG code for this tutorial here: tutorial0.txt

Phone-number Tutorial 1

This tutorial will show you how to use LiPG to parse a simple format. The format will simply be a telephone number - i.e. ### ### - #### . This tutorial assumes you have done or at least read through tutorial 0.

This main declares three strings that each have a potential phone number in them. The three calls to 'phoneNumber' test each of them.

glob
[   #include <stdio.h>
    #include <string.h>
    main()
    {   int diff;
        char pn1[] = "650 858-4217";
        char pn2[] = "802 944-1255";
        char pn3[] = "415 92323-5478"; // fails
        phoneNumber(&diff, pn1, strlen(pn1));
        phoneNumber(&diff, pn2, strlen(pn2));
        phoneNumber(&diff, pn3, strlen(pn3)); // doesn't print
    }
]

Here is the parse function named 'phoneNumber'. It prints a message if it is given a valid phone number to parse. An anychar variable (like 'a', 'b', 'c', and 'd') can take on a string of length one or zero. They will be used to describe the wordform.

parse phoneNumber
[   anychar a, b, c, d

This next block is the top block. It allows you to declare C variables for use in the parse function, and allows you to execute code before anything is parsed by the function. Here, we are declaring two temporary string variables that will each store three of the first six digits of the phone number for later use.

    top
    [   // stores parts of the phone number while it
        //  is parsed through
        char tempString1[4], tempString2[4];
    ]

The parse function consists of choice blocks. These always begin with an arrow. The constructs on the same line after the arrow are a wordform and a conditional. The wordform can be a string literal, a parse function, or a string variable (including anychar and anychar[ ] variables). Wordforms that have more than one piece (string literal, parse function, or variable) must be surrounded by parentheses. Here the wordform matches three characters.

After the wordform is the conditional. This conditional has three build blocks. A build block defines how an anychar (or anychar[ ]) variable is built. The build block must return true for every character in the anychar or anychar[ ] variable it applies to. Each build block starts with the name of the anychar variable (a, b, or c here) and has a condition inside brackets.

Here, each anychar variable must be a character between '0' and '9'

->  ( a b c )  a[ '0' <= a&&a <='9' ]
               b[ '0' <= b&&b <='9' ]
               c[ '0' <= c&&c <='9' ]

After the wordform and conditional, the action bracket is written. This is C++ code that will be run if the above wordform is the longest match. Here, the code concatenates the value of each anychar variable into the tempString1 variable - for later access.

    [  tempString1[0]=a;
       tempString1[1]=b;
       tempString1[2]=c;
       tempString1[3]=0;
    ]

After the action bracket, more choices (wordform, conditional, and action bracket) can be written. Here there are no other choices, so we move on to the next choice block. This choice block has just one choice that matches a space and three digits. The choice block after that has a single choice that matches a dash and four digits. When the last set of digits is matched, it prints that a valid phone number was found.

->  ( " " a b c )      a[ '0' <= a&&a <='9' ]
                       b[ '0' <= b&&b <='9' ]
                       c[ '0' <= c&&c <='9' ]
    [  tempString2[0]=a;
       tempString2[1]=b;
       tempString2[2]=c;
       tempString2[3]=0;
    ]
->  ( "-" a b c d )    a[ '0' <= a&&a <='9' ]
                       b[ '0' <= b&&b <='9' ]
                       c[ '0' <= c&&c <='9' ]
                       d[ '0' <= d&&d <='9' ]
    [  printf("Got valid phone number: (%s) %s-%c%c%c%c\n",
                tempString1, tempString2, a, b, c, d);
       return true;
    ]
]

That wasn't that bad, was it? Well if you thought it was, the good news is that it can be much easier than that. The way this tutorial's code was written does not fully take advantage of the symmetry in the code. Instead of using anychar variables, we could use anychar[ ] variables to make the code shorter. That's in the next tutorial.

Download the LiPG code for this tutorial here: tutorial1.txt

Phone-number Tutorial 2

This tutorial will show you how to use LiPG to parse a telephone number ( i.e. ### ### - #### ) using anychar[ ] variables.

glob
[  
    #include <stdio.h>
    #include <string.h>
    main()
    {   int diff;
        char pn1[] = "650 858-4217";
        char pn2[] = "802 944-1255";
        char pn3[] = "415 92323-5478"; // fails

        phoneNumber(&diff, pn1, strlen(pn1));
        phoneNumber(&diff, pn2, strlen(pn2));
        phoneNumber(&diff, pn3, strlen(pn3)); // doesn't print
    }
]

All of the above is the same as last tutorial.

Here is the parse function 'phoneNumber' again. This time, anychar[ ] variables X, Y, and Z will describe the wordform. Anychar[ ] variables can conceptually hold a string of any length - including a length of zero. In addition to the anychar[ ] variables, we must define an anyindex variable that will be used in the build blocks for X, Y, and Z.

parse phoneNumber
[   anychar[] X, Y, Z
    anyindex n

This time, we can use one single wordform to describe the whole telephone number format. This means we don't need those temporary string variables we used last time (in the top block).

In the build blocks for each anychar[ ] variable, the anyindex variable 'n' conceptually accesses every index in the anychar[ ] string. In this way, it can very easily be used to describe what each anychar[ ] variable can be made up of.

At the end of each build block, the condition that n is less than a number will limit the size of the anychar[ ] variable. In this case, X and Y can be up to size 3 and Z can be up to size 4. Note that if you used the condition "n==3" or "n==4" then the parse function would fail - because clearly every index in a string cannot be a single number.

->  ( X " " Y "-" Z )  X[ '0' <= X[n]&&X[n] <='9' && n<3 ]
                       Y[ '0' <= Y[n]&&Y[n] <='9' && n<3 ]
                       Z[ '0' <= Z[n]&&Z[n] <='9' && n<4 ]

The cond block is a general condition for a wordform that need not relate to a specific variable. Here, it is needed to make sure that the lengths of each anychar[ ] variable are exact (e.g. rather than up to 3, the length of X must be exactly 3).

Because of the cond block, the conditions on 'n' in the build blocks are not strictly necessary - but they do make the parser run a bit faster (because when a build block fails it simply stops building the variable, but when a cond block fails, it backtracks and forces the variables to be built again). In general, anything that can be should be done in a build block, rather than a cond block.

                       cond[ strlen(X)==3 && strlen(Y)==3
                             && strlen(Z)==4 ]
    [  printf("Got valid phone number: (%s) %s-%s\n",X,Y,Z);
       return true;
    ]
]

That was significantly shorter than the first time. However, it still doesn't fully take advantage of the symmetry in the code. If other parse functions were used, the code could be made even shorter. That's in the next and final phone-number tutorial.
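
The division of labour between build blocks and the cond block can be illustrated with a hand-written C++ sketch. It is not LiPG output, and the names 'buildDigits' and 'phoneNumberByHand' are hypothetical. The sketch is greedy and does not backtrack the way LiPG is described as doing, which happens to be enough for this particular format.

```cpp
#include <cassert>
#include <cstring>

// Plays the role of a build block such as
//   X[ '0' <= X[n]&&X[n] <='9' && n<3 ]
// Collects leading digits of 'input' into 'out', at most 'maxLen' of them,
// and returns how many were collected (possibly zero).
int buildDigits(const char* input, int maxLen, char* out)
{
    assert(input != nullptr && out != nullptr);
    int n = 0;
    while (n < maxLen && '0' <= input[n] && input[n] <= '9')
    {
        out[n] = input[n];
        n++;
    }
    out[n] = 0;   // terminate the collected string
    return n;
}

// Matches "### ###-####" at the start of 'input'.
// X and Y need room for 4 chars, Z for 5 (digits plus terminator).
bool phoneNumberByHand(const char* input, char* X, char* Y, char* Z)
{
    int pos = buildDigits(input, 3, X);
    if (input[pos] != ' ') return false;   // the " " piece of the wordform
    pos++;
    pos += buildDigits(input + pos, 3, Y);
    if (input[pos] != '-') return false;   // the "-" piece of the wordform
    pos++;
    buildDigits(input + pos, 4, Z);

    // Plays the role of the cond block: the lengths must be exact.
    return std::strlen(X) == 3 && std::strlen(Y) == 3 && std::strlen(Z) == 4;
}
```

With this sketch, "650 858-4217" is accepted, while "415 92323-5478" is rejected because the character after the second group of three digits is not a dash.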

Download the LiPG code for this tutorial here: tutorial2.txt

Phone-number Tutorial 3

This tutorial will show you how to use LiPG to parse a telephone number ( i.e. ### ### - #### ) using lower-level parse functions.

glob
[  
    #include <stdio.h>
    #include <string.h>
    main()
    {   int diff;
        char pn1[] = "650 858-4217";
        char pn2[] = "802 944-1255";
        char pn3[] = "415 92323-5478"; // fails

        phoneNumber(&diff, pn1, strlen(pn1));
        phoneNumber(&diff, pn2, strlen(pn2));
        phoneNumber(&diff, pn3, strlen(pn3)); // doesn't print
    }
]

All of the above is the same as in the last two tutorials.

Here is the parse function 'phoneNumber' yet again. This time, it is minimally short. The parse function 'num' describes the numbers, and puts each number into X, Y, and Z respectively. 'num' takes two inputs, a char* that will hold the number it grabs and an integer representing the length of the number to grab (the number of digits). 'num' is defined below 'phoneNumber' and needs no declaration above it.

parse phoneNumber
[   anychar[] X,Y,Z
->  ( num[X,3] " " num[Y,3] "-" num[Z,4] ) 
    [  printf("Got valid phone number: (%s) %s-%s\n",X,Y,Z);
       return true;
    ]
]

The parse function 'num' has a parameter block (the 'in') that simply allows it to have user-defined parameters like a normal C function. Arguments were passed into these parameters in the 'phoneNumber' function above.

Other than that, this 'num' function is similar to what other tutorials have shown. Before returning true to the 'phoneNumber' function, it copies the contents of W into output so that 'phoneNumber' will have the number.

parse num
[in[char* output, int length]
    anychar[ ] W
    anyindex n
->  W   W[ '0' <= W[n]&&W[n] <='9' && n<length ]
    [   strcpy(output, W); // output = W
        return true;
    ]
]

That code was slightly longer than in the last tutorial, but in anything beyond trivial cases like this one, the technique results in much shorter code. Parse functions that are called by other parse functions make the code reusable: each part of a DSL can be written as its own parse function and called from wherever it is needed in the LiPG code.
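
The same idea of composing small parsers can be sketched by hand in C++. This is an illustration only, not LiPG output: 'numByHand', 'literalByHand', and 'phoneNumberComposed' are hypothetical names, and each helper follows the (diff, input, length, arguments) shape so that the caller can advance through the input by the number of characters each piece consumed. Unlike the tutorial's 'num', which accepts up to 'length' digits, this sketch insists on exactly the requested number.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for 'num': copies exactly 'digits' digits into 'output'.
bool numByHand(int* diff, const char* input, int length, char* output, int digits)
{
    assert(diff != nullptr && input != nullptr && output != nullptr);
    *diff = 0;
    if (length < digits) return false;
    for (int n = 0; n < digits; n++)
    {
        if (input[n] < '0' || input[n] > '9') return false;
        output[n] = input[n];
    }
    output[digits] = 0;
    *diff = digits;              // characters consumed
    return true;
}

// Hypothetical stand-in for a one-character string literal in a wordform.
bool literalByHand(int* diff, const char* input, int length, char expected)
{
    *diff = (length > 0 && input[0] == expected) ? 1 : 0;
    return *diff == 1;
}

// Chains the pieces, advancing by 'diff' after each one.
bool phoneNumberComposed(const char* input, char* X, char* Y, char* Z)
{
    const int length = static_cast<int>(std::strlen(input));
    int pos = 0;
    int diff = 0;

    if (!numByHand(&diff, input + pos, length - pos, X, 3)) return false;
    pos += diff;
    if (!literalByHand(&diff, input + pos, length - pos, ' ')) return false;
    pos += diff;
    if (!numByHand(&diff, input + pos, length - pos, Y, 3)) return false;
    pos += diff;
    if (!literalByHand(&diff, input + pos, length - pos, '-')) return false;
    pos += diff;
    return numByHand(&diff, input + pos, length - pos, Z, 4);
}
```

Note that a failing call may already have written part of its output buffer, which mirrors the quirk listed earlier that variables passed to a failing parse function may still be modified.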

Download the LiPG code for this tutorial here: tutorial3.txt

Tutorial 4 - Simple Integer Calculator

This is a little calculator program that can compute addition, subtraction, multiplication, and division on integers - with the correct order of operations. This calculator demonstrates both left-to-right evaluation and right-to-left evaluation.

In main, a loop gets a line then parses it, and repeats. It quits when the calculator returns false (which is when the line is equal to "exit").

glob
[   #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // gets a line from standard input
    // ('gets' and 'fgets' kept crashing for me...)
    void getline(char* line)  
    {   int n=0;
        while((line[n]=getchar()) != '\n')
        { n++; }
        line[n]=0;
    }

    main()
    {   int diff;
        char line[100];
        while(1) 
        {   getline(line);
            if(false==SICalc(&diff, line, strlen(line)))
            { break;  // when SICalc returns false
            }
        }
    }
]

The main parse function for this calculator simply takes either an expression, or the command "exit". If it receives an expression, it will print the result of the expression.

parse SICalc
[   top
    [    int result;
    ]
->  expression[&result]
    [    printf(":: %d\n", result);
         return true;
    ]
    "exit"
    [    printf("Wasn't that simple?\n");
         return false;
    ]
]

The 'expression' parse function represents a sum of products connected by addition signs or subtraction signs. It consists of a product, and then optionally more products connected by either the plus operator or minus operator.

parse expression
[in[int* result]
    top
    [   int temp;
    ]

The 'every' construct is a set of choices that parse the input before every choice block (except the first one). The 'tween' construct describes just a wordform that parses the input between every piece of every wordform in the parse function. The tween cannot take any conditional or actions.

The bracket after the tween wordform is meant to look like an action bracket (for consistency), but cannot be used like one. The reason for this is that it would have been difficult to program. If you want to make the 'tween' construct work just like the 'every' construct, feel free to fix that yourself.

Here, the 'tween' and the 'every' constructs both just get whitespace - in order to make the calculator ignore whitespace.

every  ws[]    [   // do nothing   ]
tween  ws[]    [     ]
->  product[result]
    [
    ]

The "more" before the arrow is a label that can be jumped back to (analogous to goto labels).

more->

This wordform demonstrates left-to-right evaluation. The addition and subtraction operators are evaluated from left to right. Whenever another operator and operand are found, it immediately executes the operator on the operands.

    ("+" product[&temp])
    [   *result += temp;

The "more" here is a jump command - analogous to a goto statement. If the wordform is the longest match, its action bracket is executed and then jumps to the jump label specified after the action bracket. Here, it is jumping back to "more" where it can get another part of the sum.

    ]>more
    ("-" product[&temp])
    [   *result -= temp;
    ]>more

Whenever you have a loop like "more" creates, you need a way to exit the loop. This else exits the loop (and the parse function) when no more "+" or "-" operators are found.

    else
    [   return true; // end of expression
    ]
]

The 'product' parse function represents a product of integers connected by multiplication signs or division signs. A single integer is also considered a product. It consists of an integer, an integer multiplied by another product, or an integer divided by another product. 

parse product
[in[int* result]
    top
    [   int temp1, temp2;
    ]
 every ws[]    [     ]
 tween ws[]    [     ]
->  integer[result]
    [   return true;
    ]

This wordform demonstrates right-to-left evaluation. The multiplication and division operators are evaluated from right to left. To do this, recursion is used. Since the action bracket is only executed once the wordform is matched, it will execute the action bracket of the last product first.

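For multiplication this doesn't make a difference, but for integer division it does: "8/4/2" is evaluated as 8/(4/2) = 4 rather than (8/4)/2 = 1. The order of evaluation can also be important in more complex languages.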

    (integer[&temp1] "*" product[&temp2])
    [   *result = temp1 * temp2;
        return true;
    ]
    (integer[&temp1] "/" product[&temp2])
    [  *result = temp1 / temp2;
        return true;
    ]
   
]

The 'integer' parse function gets an integer. It doesn't use any techniques not discussed in previous tutorials.

parse integer
[in[int* result]    anychar[] W
                    anychar a
                    anyindex n
->  W    W[  (n==0 && '0' <= W[n]&&W[n] <= '9')
           || (n>0 && '1' <= W[n]&&W[n] <= '9') ]
         cond[a!=0]
    [   *result = atoi(W);
        return true;
    ]
]

The 'ws' parse function gets whitespace. It doesn't use any techniques not discussed in previous tutorials either, but is a very common function if you want to make a parser for a language that ignores whitespace.

parse ws
[   anychar[] Wo
    anyindex an
->  Wo   Wo[ Wo[an]==' ' || Wo[an]=='\t' || Wo[an]=='\n' ]
    [ return true; ]
]

This tutorial demonstrated most of the techniques needed to parse a real language.
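
The two evaluation orders used in this tutorial can be compared in isolation with a small C++ sketch. It is not LiPG output; the function names are hypothetical, and the operands are assumed to be already parsed into a vector. For a chain of integer divisions the two orders give different results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right: fold each operand into the running result as soon as it
// is seen, the way the "more" loop in 'expression' does for "+" and "-".
int divideLeftToRight(const std::vector<int>& operands)
{
    assert(!operands.empty());
    int result = operands[0];
    for (std::size_t i = 1; i < operands.size(); i++)
    {
        assert(operands[i] != 0);        // sketch does not handle division by zero
        result /= operands[i];
    }
    return result;
}

// Right-to-left: recurse on the rest of the chain first and combine
// afterwards, the way 'product' calls itself on its right-hand side.
int divideRightToLeft(const std::vector<int>& operands, std::size_t first = 0)
{
    assert(first < operands.size());
    if (first + 1 == operands.size())
    {
        return operands[first];          // base case: a single integer
    }
    const int right = divideRightToLeft(operands, first + 1);
    assert(right != 0);                  // sketch does not handle division by zero
    return operands[first] / right;
}
```

For the operands 8, 4, 2 the left-to-right version computes (8/4)/2 = 1, while the right-to-left version computes 8/(4/2) = 4.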

Download the LiPG code for this tutorial here: Simple Integer Calculator.txt

Tutorial 5 - Test Parser

This tutorial will show you optional items that were not shown in the previous tutorials, and when run will give you a sense of how the parts of LiPG operate.

This will run the "test" parser just once on a simple test string. 'testParser' takes two non-implicit arguments: a string pointer that will receive a result, and an input integer.

glob
[  
    #include <stdio.h>
    #include <string.h>
    main()
    {   int diff;
        char aString[] = "a b c \t a bollocks";
        char result[100];
        testParser(&diff, aString, strlen(aString), result, 55);
        printf("Got a result of: '%s'\n", result);
        getchar(); // wait for user to press enter
    }

    int globalVar = 4;
]

parse testParser
[in[char* in1, int in2]
    anyindex n
    anychar a,b,c
    anychar[ ] X,Y,Z

    top
    [   printf("Executing the top-block\n");
    ]

The onfail block is a block that executes code if the function fails to find a match in any choice block. Note that the block is not executed if you simply return false.

    onfail
    [   printf("Executing the onfail-block\n");
    ]

The onmismatch block is a block that executes code every time a choice fails to match.

Inside the second 'printf' statement, the variable 'now' is used. This is a special variable that is used to indicate the part of the string that has not been parsed yet. It is a char pointer that points to the next character to be parsed. 'now' can be used in any piece of C code in a parse function, including conditionals, onfail blocks, onmismatch blocks, and top blocks.

    onmismatch
    [   printf("\t-\tExecuting the onmismatch-block\n");
        printf("\t-\tThe next input to parse is '%c%c%c%c'\n",
                           now[0], now[1], now[2], now[3]);
    ]
    every ws[]
    [   printf("Executed the every-block\n");
    ]

This is just a reminder that the tween's action bracket may not contain anything but whitespace. It can't even contain a comment.

    tween " "
    [
    ]

An 'else' must be at the end of a list of choices in a choice block, but if there are no other wordforms, it can be the only choice.

-> else
   [   printf("The function got %d as the input 'in2'\n", in2);
   ]
doItAgain->

A cond block can be used even for wordforms without anychar variables.

   "a" cond[ globalVar==4 ]
   [   printf("Found the letter 'a'\n");
   ]
   a a[ a=='b' ]
   [   printf("Found the letter '%c'\n", a);
   ]
   X X[ X[n]=='c' ]
   [   printf("Found the letters '%s'\n", X);
   ]>doItAgain

A 'noevery' indicates that the every block should not precede the following choice block.  

-> noevery
   (ws[] "d")
   [   printf("Found the letter 'd'\n");
       strcpy(in1, "d");
       return true;
   ]

A 'notween' indicates that the tween block should not be used for the following choice.  

   notween
   (ws[] "b" "o" "l" "l" "o" "c" "k" "s" )
   [   printf("Found 'bollocks'\n");
       strcpy(in1, "bollocks");
       return true;
   ]

By using an empty string as a wordform, you can emulate an else block that allows conditionals.

   "" cond[ globalVar==4 ]
   [   printf("The globalVar is 4\n");
   ]>doItAgain
]

glob
[ // a glob block can appear anywhere a parse function can
]

parse ws
[  anychar[] Wo
   anyindex an
-> Wo Wo[ Wo[an]==' ' || Wo[an]=='\t' || Wo[an]=='\n' ]
   [   printf("\t'ws' found whitespace: '%s'\n", Wo);
       return true;
   ]
]

This tutorial uses all constructs possible in LiPG. Running the program will produce a slightly verbose output that can indicate the exact order each part of the parse functions were executed in.

Download the LiPG code for this tutorial here: tutorial5 - test parser.txt

LiPG file format - Full specification

The following fully specifies the format of files LiPG converts into a C++ parser. Each block has a name in italics (in the upper left corner if possible).

  • Block names with an asterisk (*) after them are optional.

  • Deep purple bars connecting blocks indicate that only one of those blocks may be chosen to fill that spot.

  • The code and other blocks necessary for a block are shown inside it.

  • All blocks can be separated by white-space (tabs, spaces, or newlines) because white-space is ignored.

 

The LiPG file block represents the whole LiPG file. LiPG files consist of a set of parse blocks and glob blocks ("glob" stands for "global"). glob blocks just wrap C++ code (including 'main') and drop the code in the output file wherever they are written in the input file relative to other glob blocks and parse blocks.

parse blocks will be discussed right below.

Parse blocks are the important parts of an LiPG file. They represent a function that can be called in C++ code or other parse functions.

All a parse function really needs is a function name and a choice block. The function name is just a normal alphanumeric word (starts with a letter or underscore and then has any number of letters, underscores, or numbers).

The parameter block (labeled "in" for "input") is a way to pass parameters into a parse function. Calling a parse function will be discussed near the wordform block at the bottom of this section.

anychar, anychar[ ], and anyindex variables are special variables, and will be discussed below.

The top block declares C++ variables and executes C++ code before the function begins parsing.

The onfail block executes C++ code if the function can't find a match at any point.

The onmismatch block executes C++ code if a choice (discussed later) doesn't match.

The every block parses the input with the choices after every normal choice block. Similarly, the tween block parses the input with a single wordform between every piece of a multiform.

After those optional constructs, the rest of the parse function consists of choice blocks. Choice blocks start with an arrow (->) that may have a label in front of it, and consist of a set of choices. The label can be used to jump back to that choice block later - like a goto. If "noevery" is written after the arrow, the parse function will not run the every block for that choice block.

A choice must have a wordform and brackets that may contain C++ code. If "notween" is written before the wordform, the parse function will not run the tween block for that choice. If a choice matches, it can either go directly to the next choice block, or you can have it go to an arbitrary labeled choice block by writing a destination after the end-bracket of the choice. Note that the C++ is executed only if its containing choice is the longest match of all the choices in the choice set. Also note that the C++ code is run *after* code that is run in parse functions that are called in the wordform.

At the end of all the choices, an else choice can be written. The else is executed if no other choice matches. An else cannot be labeled notween and cannot have a conditional. However, it can have a destination.
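
The selection rule described above (only the longest-matching choice runs, and the else runs when nothing matches) can be sketched in plain C++. This is not LiPG's generated code: 'longestMatch' is a hypothetical helper, the choices are restricted to string literals, and keeping the first of several equally long matches is an assumption, since tie-breaking is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Returns the index of the literal that matches the longest prefix of
// 'input', or -1 if no literal matches (the caller would then run its
// else choice).
int longestMatch(const std::vector<std::string>& choices, const char* input)
{
    assert(input != nullptr);
    int best = -1;
    std::size_t bestLength = 0;
    for (std::size_t i = 0; i < choices.size(); i++)
    {
        const std::string& choice = choices[i];
        // strncmp stops at the end of 'input', so a literal longer than
        // the remaining input simply fails to match.
        const bool matches =
            std::strncmp(input, choice.c_str(), choice.size()) == 0;
        // Assumption: on a tie, the earlier choice is kept.
        if (matches && (best == -1 || choice.size() > bestLength))
        {
            best = static_cast<int>(i);
            bestLength = choice.size();
        }
    }
    return best;
}
```

Given the choices "a", "ab", "abc", and "x", the input "abd" selects "ab" (index 1), and the input "zzz" selects nothing, so the else choice would run.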

The wordform and conditional are discussed below.

One thing not mentioned is that C++ code inside a parse function can use a variable 'now' that points to the first unmatched character in the input. This can be used to access the part of the string that the parser will parse next. Note that in the top block and the onfail block, 'now' will point to the string the parse function received to begin with. In the onmismatch block, it will point to the string that will be parsed by the next choice.
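
Because 'now' is an ordinary char pointer into the input, C++ code that looks ahead through it has to take care not to read past the end of the string. A minimal sketch of a safe look-ahead helper follows; 'peek' is a hypothetical name, not part of LiPG, and it assumes the input is zero-terminated as in the tutorials.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns up to 'count' characters starting at 'now', stopping early at the
// terminating zero so that nothing past the end of the input is read.
std::string peek(const char* now, std::size_t count)
{
    assert(now != nullptr);
    std::size_t available = 0;
    while (available < count && now[available] != 0)
    {
        available++;
    }
    return std::string(now, available);
}
```

Inside a block of a parse function, a call such as peek(now, 4) could then be printed with printf("%s", ...c_str()) instead of indexing now[0] through now[3] directly.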

To return successfully from a parse function, return true. To return unsuccessfully from a parse function, return false. Note that if you return false, the onfail block will *not* be run.
 

 

anychar, anychar[ ], and anyindex declarations look exactly like normal C-declarations, except they do not end in a semi-colon.

anychar variables are variables that can hold zero or one character. anychar[] variables can hold any number of characters. An anyindex variable is used in build blocks (explained later) to index the characters of an anychar[ ] variable. There can only be one anyindex variable.

 

A parse function call consists of the name of the function and regular C++ arguments inside brackets. In addition to normal C++ variables, anychar and anychar[ ] variables can be passed as well (as their underlying data-type is a simple char* variable).

Note that this is for calling a parse function inside a wordform. Calling a parse function in C is a little different.

 

The wordform is essentially the pattern to match for a given choice. That pattern can be a single quote, variable, or parse function call, or it can be a multiform: a list of any of those enclosed in parentheses and separated by white-space. A quote is any C-style double-quoted string. A variable can be a normal C-string variable, or can be an anychar or anychar[ ] variable. A parse function call is a powerful way of reusing parse functions.

 

The conditional puts constraints on anychar and anychar[ ] variables.

Build blocks are conditions that must be met for every character in the given anychar or anychar[ ] variable. Build blocks are meant to be specific to a single anychar or anychar[ ] variable. The variable whose build conditions are given by the build block is identified by name. Inside the build block, you can use an anyindex variable to access each character in an anychar[ ] variable. For example, if W was an anychar[ ] variable and N was an anyindex variable, "W[ W[N]=='t' || W[N]=='d' ]" would be a build block that specifies that all characters in W must be either 't' or 'd'. Note that an empty W is valid for this.
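
The "for every character" reading of that example, including the fact that an empty string passes, corresponds to a universally quantified test. A minimal C++ sketch of the same condition follows; 'satisfiesBuildCondition' is a hypothetical name, and the code only checks a finished string rather than building one the way LiPG does.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// True when every character of 'w' satisfies the build condition
//   W[ W[N]=='t' || W[N]=='d' ]
// An empty string passes, because it has no character that could fail.
bool satisfiesBuildCondition(const std::string& w)
{
    return std::all_of(w.begin(), w.end(),
                       [](char c) { return c == 't' || c == 'd'; });
}
```

For example, "tdt" and the empty string satisfy the condition, while "tax" does not.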

The cond block ("cond" for "condition") is a general condition that must be met for the wordform to be accepted. The cond block is more flexible than build blocks, because it can include comparisons between anychar and anychar[ ] variables, as well as any other arbitrary condition. There may only be one cond block per conditional.

Calling a parse function as a C++ function

 name(diff, input, length, parse_arguments)

Calling a parse function from C++ requires some arguments that are implicit parameters in the definition of the function. Here, 'name' is the name of the parse function, 'diff' is an int* variable that will contain the number of characters parsed, 'input' is a char* variable holding the string to be parsed, and 'length' is the length of 'input'. 'parse_arguments' stands for any arguments declared in the function's parameter block.

The function will return true on success and false on failure.

 

Copyright 2008, Billy Tetrud
BillyAtLima@