Log in

No account? Create an account

Using formal parsing methods on natural languages. - Silnith’s Lair

Apr. 18th, 2018

06:10 am - Using formal parsing methods on natural languages.

Previous Entry Share

I've been thinking about this for years. I really love compilers, so much that I took the class on it multiple times in school. Possibly my favorite aspect of it is the formal theory built around parsing. The concept of a right-recursive descent parser is beautiful, both elegant and effective. For computer languages, it is unambiguously the best way to parse them for compilation.

But for years I've been thinking about how the concept should be useful for parsing natural languages, too. The basic idea is that you build a grammar for the language, then take a sentence of the language and parse it according to that grammar. The parser first tokenizes the sentence into a string of tokens, then tokens are matched to grammar rules to identify what grammatical construct they represent. Based on that you build a syntax tree representing what the sentence (for lack of a better word) means. The beauty here is that the parser does not try to guess what syntactical construct is coming next. It simply looks at what is there, and matches that against all possible legal grammatical constructs and picks the one that matches.

The core problem, of course, is ambiguity. In a computer language, ambiguity is unacceptable and hence disallowed. A grammar must be unambiguous for a parser to be built that can recognize it. For natural languages, ambiguity is not just allowed, it is prevalent. A lot of effective communication relies on ambiguity, and it is required for a lot of artwork and humor. Therefore in order to parse a natural language, the parser must be able to handle ambiguity gracefully.

In a recursive-descent parser, ambiguity manifests in the form of shift-reduce conflicts and reduce-reduce conflicts. Basically when applying a grammar, the parser finds more than one possible match in the grammar, and it does not know what action to take. In more precise terms, it finds multiple matching grammar rules and must decide (somehow) which one to apply. Again, in computer languages, ambiguity is not acceptable and so no grammar is allowed where such a conflict can happen. In theory. In practice it is a lot of painful work to come up with completely unambiguous grammars so most parser generators allow some ambiguous constructs and rules provided that all resulting conflicts are resolved somehow. A simple example would be, For all shift-reduce conflicts, shift.

What if, instead of statically choosing one grammar rule to use for parsing, we non-deterministically check all possible parses? In the same fashion as a deterministic finite state automata versus a non-deterministic finite state automata, we could create a non-deterministic right-recursive descent parser. In the case of conflict, instead of only applying one rule, we apply all rules. If no grammatical rules match, then of course nothing is applied.

In this way we can do something much more analogous to what people do in their heads when reading; we keep track of multiple possible interpretations of what they are reading and resolve down to one when they reach the end of a sentence. (Oftentimes a human will need to go back and re-read a sentence that is ambiguous to try a different parsing of it. A computer with vastly more short-term memory could do it concurrently.)

There is a piece missing from this. A key part of writing a parser is writing the tokenizer. Grammars are written in terms of streams of tokens. In a computer language, a sequence of characters can be identified and mapped to a type of token before the grammar rules are applied. That is why most languages have reserved words and rules about how numbers must be formatted and the like. In a natural language, different words representing different parts of speech can have the same spelling. For example, the word tastes could be a plural noun or a present-tense third-person singular verb. Leading could be an adjective or a noun. Lead could be a present-tense first-person singular verb, a past-tense first-person singular verb, either of those in third-person form, a noun, or an adjective. Since natural language grammar rules depend on these parts of speech, there is ambiguity in the part of the parser normally separated out as the lexer. One natural solution to this would be to move the identification of part of speech out of the lexer and into the grammar itself. This would solve the ambiguity problem by reducing it to a known problem (an ambiguous grammar) and applying the solution we already have. A major drawback is that it would drastically increase the size of the grammar itself. (Insert mathematician joke here.) Another solution would be to extend the non-determinism into the lexer. Instead of the lexer passing a single definitive token type to the parser, it could pass an ambiguous token that contains a set of all possible interpretations. The grammar parser would then need to try each of these possibilities. This would keep the grammar smaller (and theoretically more pure) but would complicate the implementation of both the parser and lexer, and may require storing more state at runtime.

Non-deterministic algorithms are well-known for being problematic to implement in reality. That is why we have a formal method for converting an NFA into a DFA, so we can actually run it in a reasonable time using a reasonable amount of memory. Since at the time of writing I know of no formal method for converting a non-deterministic LR(n) parser into a deterministic LR(n) parser, we would need to fall back on holding multiple parser states in memory simultaneously. I believe this will not be a problem in practice. When it comes to compiling a computer program, most languages have grammatical rules akin to this: program := statement (semicolon statement)*. Natural languages have very few of these constructs and they are used infrequently. Most natural language sentences are relatively quite short. Even if applied to entire paragraphs at a time, the size of the input is minuscule compared to the size of input most compilers receive on a daily basis. Even exponential algorithms can be executed if the input size is small enough. The parser can prune non-matching states as it executes, so in most cases where the amount of ambiguity is small the actual runtime states kept in memory will also be small.

For the lexer-based approach to assigning parts of speech to each word, even though the size of the problem may at first seem daunting, I believe it is not an issue. We can imagine a database of every word in a natural language that contains mappings of every verb tense conjugation or noun declension. In practice this would be slightly larger than a normal dictionary. In practical terms this is small enough to be considered a trivial problem for any standard database. For efficiently looking up a word by one of its spelling variants, imagine this database as a table where each row is a word and each column is its conjugation or declension. Now imagine creating a database index for each column. At this point I can wave my hands and invoke standard database theory and claim that the issue is solved. In practice this will involve tries or B-trees or some such, since it is a known problem with known solutions I do not need to go into detail.

I'll write more later. Right now Whiny Cat is whining at me to go back to bed, so I'll go nap for a few more hours.

Current Mood: sleepysleepy