c# - How do I resolve these issues with my ANTLR grammar for AutoHotkey code? - Stack Overflow


I transpile AutoHotkey v2 code to C# using ANTLR4 and Roslyn. Below is an example that uses only a few grammar elements, which follow these rules:

  1. Statements start at the start of a line; leading whitespace is ignored.
  2. Assignment is defined as singleExpression := singleExpression, for example a := 1 or a := b := 1. Whitespace is optional and newlines are allowed on both sides of the assignment operator, so
    a:=
    1
    
    is valid. (a := 1) := 2 causes 2 to be assigned to a.
  3. A function call statement starts with an identifier or a parenthesized expression (or a member access, which is out of scope for this post), followed by whitespace and then a comma-separated argument list.
  4. Any two single expressions may be implicitly concatenated into a string if there is at least one whitespace character between them. For example, a := 1 2 concatenates the digits to 12 and assigns the result to a. A line break between the operands is allowed only if the concatenation is inside parentheses or brackets.
    a := 1
    hello
    
    would be considered two statements: an assignment of 1 to a and a call to the function hello.
    a := (1
    2)
    
    is considered one statement, and a is assigned 12. Explicit concatenation is also possible with the . operator, in which case there must be whitespace or a newline on both sides of it. If there isn't whitespace on both sides, the . is an object member access.
  5. A hotkey is defined by a key identifier followed by ::, followed by either another hotkey (on a separate line) or a statement. For example a::b := 1 means "create a hotkey for the key a which then assigns 1 to variable b". a::MsgBox triggers a function call for MsgBox.
  6. A remap is a key identifier followed by ::, followed by another key. For example, a::b creates functionality where pressing a sends b instead. A remap takes priority over a hotkey: if the second identifier matches a key name, the line is a remap; otherwise it is a hotkey. a::MsgBox is a hotkey only because no key named MsgBox exists.
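
To make the priority in rule 6 concrete, here is a minimal C# sketch of the decision. The key table and the names RemapOrHotkey and IsRemapTarget are hypothetical and only for illustration; the real set of AutoHotkey key names is much larger, and treating every single character as a key is an assumption that mirrors the HotkeyCharacter fragment in the grammar.

```csharp
using System;
using System.Collections.Generic;

public static class RemapOrHotkey
{
    // Hypothetical, deliberately tiny key table (assumption for illustration).
    private static readonly HashSet<string> KeyNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "F1", "Enter" };

    // Returns true when the text after "::" names a key (rule 6: remap),
    // false when it should be parsed as a hotkey's statement (rule 5).
    public static bool IsRemapTarget(string afterTrigger)
    {
        string target = afterTrigger.Trim();
        // Assumption: any single non-whitespace character counts as a key.
        return target.Length == 1 || KeyNames.Contains(target);
    }
}
```

With this table, IsRemapTarget("b") and IsRemapTarget("Enter") are true, while IsRemapTarget("MsgBox") and IsRemapTarget("b := 1") are false, which matches the a::b, a::MsgBox and a::b := 1 examples above.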

I'm trying to make the grammar performant. AutoHotkey parses and executes the expression statement a := 1 repeated 300,000 times in under 2 seconds, whereas the following simplified grammar takes about 5 seconds in C# just to parse the same input. I'd consider a parse time under 10 seconds acceptable.
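
For reproducing the timing, the input can be generated with something like the sketch below. The class name TestInput and its methods are hypothetical; the file name test.txt matches the one read by the C# example further down, and each statement is terminated with a newline because the grammar requires an EOL after every statement.

```csharp
using System;
using System.IO;
using System.Linq;

public static class TestInput
{
    // Builds `count` copies of the statement, each followed by a newline.
    public static string Build(string statement, int count) =>
        string.Concat(Enumerable.Repeat(statement + "\n", count));

    // Writes the benchmark input: 300,000 lines of "a := 1" by default.
    public static void WriteFile(string path, int count = 300_000) =>
        File.WriteAllText(path, Build("a := 1", count));
}
```

Calling TestInput.WriteFile("test.txt") once before running the parser produces the file used for the measurement.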

Simple.g4:

grammar Simple;

options {
    caseInsensitive = true;
}

program: sourceElements EOF;

sourceElements: sourceElement+;

sourceElement
    : statement EOL
    | hotkey EOL
    | remap EOL
    | EOL+
    ;

hotkey
    : HotkeyTrigger WS? statement
    ;

remap
    : RemapKey
    ;

statement
    : expressionStatement
    | functionStatement
    ;

expressionStatement
    : singleExpression (s? ',' s? singleExpression)*
    ;

singleExpression
    : singleExpression WS singleExpression
    | singleExpression s '.' s singleExpression
    | <assoc = right> singleExpression WS? ':=' WS? singleExpression
    | primaryExpression
    ;

primaryExpression
    : Identifier
    | primaryExpression ('.' primaryExpression)+ // Member access
    | DecimalLiteral
    | '(' singleExpression ')'
    ;

functionStatement
    : primaryExpression
    | primaryExpression WS (singleExpression (WS? ',' WS? singleExpression?)*)
    ;

s: (WS | EOL)+;

RemapKey            : HotkeyCharacter '::' HotkeyCharacter;
HotkeyTrigger       : HotkeyCharacter '::';
OpenParen           : '(';
CloseParen          : ')';
Comma               : ',';
Dot                 : '.';
Assign              : ':=';
DecimalLiteral      : '0' | [1-9] [0-9_]*;
Identifier          : IdentifierStart IdentifierPart*;
WS                  : [\t ]+;
EOL                 : [\r\n]+;
UnexpectedCharacter : . ;

fragment IdentifierPart : IdentifierStart | [\p{Mn}] | [\p{Nd}] | [\p{Pc}] | '\u200C' | '\u200D';
fragment IdentifierStart: [\p{L}] | [$_];
fragment HotkeyCharacter
    : 'F1'
    | 'Enter'
    | ~[`\r\n ]
    ;

Example C#:

using System;
using System.IO;
using System.Text;
using Antlr4.Runtime;
using Antlr4.Runtime.Atn;
using System.Diagnostics;

namespace AntlrCSharp
{
    class Program
    {
        private static void Main(string[] args)
        {
            try
            {
                StringBuilder text = new StringBuilder();

                string filePath = @"test.txt";

                try
                {
                    string fileContent = File.ReadAllText(filePath);
                    text.Append(fileContent);
                }
                catch (FileNotFoundException)
                {
                    Console.WriteLine($"The file at {filePath} was not found.");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"An error occurred: {ex.Message}");
                }
                StartSimpleParser(text);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: " + ex);
            }
        }
        public static void StartSimpleParser(StringBuilder text)
        {
            Console.WriteLine("Start");
            AntlrInputStream inputStream = new AntlrInputStream(text.ToString());
            SimpleLexer simpleLexer = new SimpleLexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(simpleLexer);
            SimpleParser simpleParser = new SimpleParser(commonTokenStream);

            /*
            // Debug aid: dump all tokens, then rewind the lexer before parsing.
            foreach (var token in simpleLexer.GetAllTokens())
            {
                Console.WriteLine($"Token: {simpleLexer.Vocabulary.GetSymbolicName(token.Type)}, Text: '{token.Text}'" + (token.Channel == Lexer.Hidden ? " (hidden)" : ""));
            }
            simpleLexer.Reset();
            */

            simpleParser.ErrorHandler = new BailErrorStrategy();
            simpleParser.AddErrorListener(new DiagnosticErrorListener());
            simpleParser.Interpreter.PredictionMode = PredictionMode.LL_EXACT_AMBIG_DETECTION;

            SimpleParser.ProgramContext programContext = simpleParser.program();

            Console.WriteLine("Parsed");

            MainVisitor visitor = new MainVisitor();
            visitor.Visit(programContext);
            Console.WriteLine("End");
        }
    }
}

This grammar has a few problems:

  1. I'm unable to allow both optional whitespace and end-of-lines in assignment expressions: singleExpression s? ':=' s? singleExpression causes a reportAttemptingFullContext error with LL_EXACT_AMBIG_DETECTION.
  2. I can't figure out how to tokenize the remap syntax into two keys. The current RemapKey definition, HotkeyCharacter '::' HotkeyCharacter, produces a single token, so I have to parse it separately later in the visitor.
  3. The grammar will be littered with EOL and whitespace tokens.
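
For context on problem 2, the separate parsing step in the visitor currently amounts to something like the following sketch. The names RemapKeySplitter and Split are hypothetical. The search for the separator starts at index 1 because HotkeyCharacter also admits ':' as a key, so the token text :::b has to split into the keys : and b.

```csharp
using System;

public static class RemapKeySplitter
{
    // Splits the text of a RemapKey token such as "a::b" or "F1::Enter".
    public static (string Source, string Target) Split(string tokenText)
    {
        if (string.IsNullOrEmpty(tokenText))
            throw new FormatException("Empty remap token.");

        // Start at index 1 so a source key that is itself ':' (":::b")
        // is not mistaken for the start of the separator.
        int separator = tokenText.IndexOf("::", 1, StringComparison.Ordinal);
        if (separator < 0)
            throw new FormatException($"Not a remap: '{tokenText}'");

        return (tokenText.Substring(0, separator),
                tokenText.Substring(separator + 2));
    }
}
```

In the visitor this would presumably be fed the token text, for example from context.RemapKey().GetText(), assuming the accessor ANTLR generates for the remap rule. It works, but it duplicates lexing logic outside the grammar, which is what I'd like to avoid.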

How do I resolve these issues?
