fsyacc.exe is a LALR parser generator. It follows essentially the same specification as the OCamlYacc parser generator, especially when used with the ''ml compatibility' switch.
Parser generators typically produce Abstract Syntax Trees represented by values in an F# Algebraic Data Type. For example, the following is taken (with permission) from the F# sample samples\fsharp\Parsing:
type expr = | Val of string | Int of int | Float of float | Decr of expr type stmt = | Assign of string * expr | While of expr * stmt | Seq of stmt list | IfThen of expr * stmt | IfThenElse of expr * stmt * stmt | Print of expr type prog = Prog of stmt list
Given that, a typical parser specification is as follows:
%{
open Ast
%}
%start start
%token <string> ID
%token <System.Int32> INT
%token <System.Double> FLOAT
%token DECR LPAREN RPAREN WHILE DO END BEGIN IF THEN ELSE PRINT SEMI ASSIGN EOF
%type < Ast.prog > start
%%
start: Prog { $1 }
Prog: StmtList { Prog(List.rev($1)) }
Expr: ID { Val($1) }
| INT { Int($1) }
| FLOAT { Float($1) }
| DECR LPAREN Expr RPAREN { Decr($3) }
Stmt: ID ASSIGN Expr { Assign($1,$3) }
| WHILE Expr DO Stmt { While($2,$4) }
| BEGIN StmtList END { Seq(List.rev($2)) }
| IF Expr THEN Stmt { IfThen($2,$4) }
| IF Expr THEN Stmt ELSE Stmt { IfThenElse($2,$4,$6) }
| PRINT Expr { Print($2) }
StmtList: Stmt { [$1] }
| StmtList SEMI Stmt { $3 :: $1 }
The above generates a datatype for tokens and a function for each 'start' production. Parsers are typically combined with a lexer generated using fslex. For example, a program AST can be generated from a file using:
let parse() =
let stream = new StreamReader("myfile.txt") in
let myProg =
let lexbuf = Lexing.from_stream_reader stream in
Pars.start Lex.token lexbuf in
myProg
fsyacc <filename> -o Name the output file. -v Produce a listing file. --module Define the F# module name to host the generated parser. --open Add the given module to the list of those to open in both the generated signature and implementation. --ml-compatibility Support the use of the global state from the 'Parsing' module in MLLib. --tokens Simply tokenize the specification file itself. --help Display this list of options
Each action in an fsyacc parser has access to a parseState value through which you can access position information.
type IParseState<'pos> =
interface
abstract StartOfRHS: int -> 'pos
abstract EndOfRHS : int -> 'pos
abstract StartOfLHS: 'pos
abstract EndOfLHS : 'pos
abstract GetData : int -> obj
abstract RaiseError<'b> : unit -> 'b
end
These are fairly self explanatory - RHS relate to the indexes of the items on the right hand side of the current production, the LHS relates to the entire range covered by your production. You shouldn't use "GetData" directly - these is called automatically by $1, $2 etc. You can call RaiseError if you like.
The 'pos position values carried by the lexer and parser can in theory be any type you wish. However in practice people always use the position values defined in the "Lexing" module, i.e. see
http://research.microsoft.com/fsharp/manual/mllib/Microsoft.FSharp.Compatibility.OCaml.Lexing.type__position.html
These have the following somewhat obscure definition (blame OCaml compatibility for how cryptic this is!)
type position =
{
/// Absolute offset of position. Automatically calculated
/// by lexer
pos_cnum : int;
/// Filename, based name given in initial position
pos_fname : string;
/// Line number - must be updated by user lexer actions
/// Must be updated by user lexer actions.
pos_lnum : int;
/// Absolute offset of beginning of line.
/// Must be updated by user lexer actions.
pos_bol : int;
}
You must set the initial position when you create the lexbuf:
let setInitialPos (lexbuf:lexbuf) filename =
lexbuf.EndPos <-{ pos_bol = 0;
pos_fname=filename;
pos_cnum=0;
pos_lnum=1 }
You must also update the position recorded in the lex buffer each time you process what you consider to be a new line:
let newline (lexbuf:lexbuf) =
lexbuf.EndPos <- lexbuf.EndPos.AsNewLinePos()
Likewise if your language includes the ability to mark source code locations (e.g. the #line directive in OCaml and F#) then you must similarly adjust the lexbuf.EndPos according to the information you grok from your input.
Xuhui Li pointed out that OCaml Yacc accepts the following:
%type < context -> context > toplevel
For FsYacc you just add parentheses:
%type < (context -> context)) > toplevel