MLRegTest: A benchmark for the machine learning of regular languages
Data files

- Nov 17, 2023 version (69.14 GB): data.tar.gz, languages.tar.gz, models.tar.gz, README.md
- Jul 13, 2024 version (106.41 GB): data.tar.gz, languages.tar.gz, models.tar.gz, README.md
Abstract
MLRegTest is a benchmark for machine learning systems on sequence classification, containing training, development, and test sets drawn from 1,800 regular languages. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals they use (string, tier-string, subsequence, or combinations thereof). The logical complexity and the choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies.
README: MLRegTest: A benchmark for the machine learning of regular languages
https://doi.org/10.5061/dryad.dncjsxm4h
MLRegTest provides training and testing data for 1800 regular languages.
This repository contains three gzipped tar archives.
> data.tar.gz (21GB)
> languages.tar.gz (4.5MB)
> models.tar.gz (76GB)
When uncompressed, these yield three directories, described in detail below.
> data (43GB)
> languages (38MB)
> models (87GB)
Languages
Languages are named according to the scheme `Sigma.Tau.class.k.t.i.plebby`, where `Sigma` is a two-digit alphabet size, `Tau` a two-digit number of salient symbols (the 'tier'), `class` the named subregular class, `k` the width of factors used (if applicable), `t` the threshold counted to (if applicable), and `i` a unique identifier. For example, `64.64.SL.2.1.4` names a Strictly Local language with factor width 2 over a 64-symbol alphabet. The table below unabbreviates the class names and shows how many languages of each class there are.
| class | name | amount |
|---|---|---|
| SL | Strictly Local | 90 |
| coSL | co-Strictly Local | 90 |
| TSL | Tier-based Strictly Local | 150 |
| TcoSL | Tier-based co-Strictly Local | 150 |
| SP | Strictly Piecewise | 90 |
| coSP | co-Strictly Piecewise | 90 |
| LT | Locally Testable | 90 |
| TLT | Tier-based Locally Testable | 150 |
| LPT | Locally Piecewise Testable | 90 |
| TLPT | Tier-based Locally Piecewise Testable | 150 |
| PT | Piecewise Testable | 90 |
| LTT | Locally Threshold Testable | 180 |
| TLTT | Tier-based Locally Threshold Testable | 300 |
| SF | Star-Free | 30 |
| Zp | Purely periodic languages with a prime cycle | 30 |
| Reg | Regular | 30 |
What does it mean to say MLRegTest contains a language L belonging to class X? This is an important question because some classes contain other classes. For example, SL is properly contained in LT. Also, some incomparable classes overlap. For example, LT and PT have a non-empty intersection. When we say MLRegTest contains a language L belonging to class X, it means L satisfies the following criteria:

1. L belongs to class X, and
2. L does not belong to any class Y which is a proper subset of, or incomparable with, class X.

Determining (1), that a language L belongs to class X, was straightforward, since we used constructive methods which guarantee membership in X. Determining (2) was more difficult, since it requires showing that a given language does not belong to some class Y. For example, we wanted to verify that none of the LT languages we defined belonged to SL, TSL, or PT.

For this purpose, we used The Language Toolkit and amalgam, by Dakotah Lambert, which contain decision algorithms for all of the subregular classes listed above. In this way, we were able to verify that both properties (1) and (2) hold for all 1800 languages in MLRegTest.
With the exception of the complement languages, each language is provided in three file formats: `.plebby`, `.att`, and `.fst`. These formats are described below. We refer to the non-complement languages as 'non-co' languages.
Automata for the 330 complement languages are not provided. They can be obtained by complementing the languages in the corresponding class. For example, there are 90 SL languages. The coSL languages are the complements of these 90 languages.
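For illustration, here is a minimal sketch of how a complement automaton could be obtained with Pynini (the OpenFst Python interface used elsewhere in this project). The file path and the two-symbol alphabet are hypothetical; the benchmark's languages use the alphabet files shipped in the languages directory.

```python
import pynini

# Hypothetical two-symbol alphabet; the real languages use larger alphabets.
sigma_star = pynini.union("a", "b").closure().optimize()

# Load the minimal DFA for some non-co language (path is illustrative).
lang = pynini.Fst.read("languages/fst_format/04.04.SL.2.1.0.fst")

# The complement language is Sigma* minus L. pynini.difference requires
# the subtrahend to be an unweighted, epsilon-free, deterministic
# acceptor, which a minimal DFA is.
co_lang = pynini.difference(sigma_star, lang).optimize()
```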
plebby

This directory is populated by the `.plebby` files for use with `plebby`, which is included in The Language Toolkit. These files were generated by the Python script `generate-subreglib.py` in the subregular-learning repo.

There is one `.plebby` file per non-co language, plus three defining the alphabets, for 1473 total `.plebby` files.

Each of these files can be used with `plebby` to generate the `.att` file for each language (in `/languages/att_format`).
.att files

These files are text files describing finite-state automata, as used by OpenFst. According to OpenFst's Quick Tour, they follow this format:

> arc format: src dest ilabel olabel [weight]
> final state format: state [weight]
> lines may occur in any order except initial state must be first line
> unspecified weights default to 0.0 (for the library-default Weight type)

Since MLRegTest only contains unweighted acceptors (not weighted transducers), no weights are given and `olabel` always equals `ilabel`.
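For concreteness, here is a small hypothetical `.att` file (not one from the benchmark) for a two-state DFA over {a, b} accepting strings with an even number of a's. The source state of the first line, 0, is the initial state, and the final line marks state 0 as accepting:

```
0 1 a a
0 0 b b
1 0 a a
1 1 b b
0
```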
For each language L, `L.att` is the minimal deterministic finite-state acceptor (DFA) recognizing L. There are 1470 `.att` files, one for each non-co language. The att directory also contains three `.txt` files for the alphabets.
.fst files
These are 1470 binary files which can be processed by OpenFst software.
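Such a file can be loaded directly in Python through Pynini; a minimal sketch, with an illustrative path:

```python
import pynini

# Load a binary OpenFst automaton (path is illustrative).
fst = pynini.Fst.read("languages/fst_format/04.04.SL.2.1.0.fst")

print(fst.num_states())  # states in the minimal DFA
print(fst.arc_type())    # arc/weight type, e.g. "standard"
```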
Data
The data directory contains the training, development (validation), and test data sets for 1800 regular languages. These are organized into three sub-directories: Small, Mid, and Large. Each subdirectory contains 10800 files, since for each language there are six data files, named as follows.
> languagename_Dev.txt
> languagename_TestLA.txt
> languagename_TestLR.txt
> languagename_TestSA.txt
> languagename_TestSR.txt
> languagename_Train.txt
Using OpenFst and its Python interface Pynini, the training and testing data for the non-co languages were prepared as follows.
Training Data
Training data was randomly generated with duplicates allowed. We generated 50k positive and 50k negative strings subject to the following constraints:

- There were equally many strings of each length between 20 and 29.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language (the sketch below illustrates this kind of length-conditioned uniform sampling).
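The sketch below illustrates the idea of length-conditioned uniform sampling on a toy DFA; it is not the authors' pipeline (which built the samplers with OpenFst/Pynini). Uniformity over paths is obtained by counting, for each state, how many accepted strings of the remaining length are reachable, and weighting each arc choice by those counts. The DFA and all names here are hypothetical.

```python
import random
from functools import lru_cache

# Toy DFA over {a, b} accepting strings with an even number of a's.
# DELTA maps (state, symbol) -> next state; state 0 is initial and final.
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
FINALS = {0}
ALPHABET = ["a", "b"]

@lru_cache(maxsize=None)
def count(state, remaining):
    """Number of accepted strings of length `remaining` readable from `state`."""
    if remaining == 0:
        return 1 if state in FINALS else 0
    return sum(count(DELTA[state, sym], remaining - 1) for sym in ALPHABET)

def sample(length, start=0):
    """Draw one accepted string of the given length, uniformly at random."""
    if count(start, length) == 0:
        raise ValueError("no accepted strings of this length")
    state, out = start, []
    for remaining in range(length, 0, -1):
        # Weight each symbol by how many accepted completions it leads to.
        weights = [count(DELTA[state, sym], remaining - 1) for sym in ALPHABET]
        sym = random.choices(ALPHABET, weights=weights)[0]
        out.append(sym)
        state = DELTA[state, sym]
    return "".join(out)

print(sample(20))  # a length-20 string with an even number of a's
```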
Development Data
Development (validation) data was randomly generated with duplicates allowed. We generated 50k positive and 50k negative strings subject to the following constraints:

- There were equally many strings of each length between 20 and 29.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language.
- The development and training sets were disjoint.
Testing Data
For each language, four Test sets were generated as follows.
In the Short Random Test Set, 50k positive and 50k negative strings were randomly generated without duplicates, subject to these constraints:

- There are equally many strings of each length between 20 and 29.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language.
- No duplicate strings are allowed, and these strings are disjoint from the training and validation data.
In the Long Random Test Set, 50k positive and 50k negative strings were randomly generated without duplicates, subject to these constraints:

- There are equally many strings of each length between 31 and 50.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language.
- No duplicate strings are allowed.
In the Short Adversarial Test Set, 50k positive and 50k negative strings were randomly created subject to these constraints:

- There are equally many positive strings of each length between 20 and 29.
- For each string length n, an acyclic finite-state transducer was constructed which mapped every positive string x of length n to all negative strings y such that `string_edit_distance(x,y) == 1` (the sketch after this list illustrates the relation).
- Pairs of positive and negative strings were sampled according to a uniform distribution over the paths in this transducer.
- No duplicate strings are allowed, and these strings are disjoint from the training and validation data.
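To make the adversarial relation concrete, the sketch below enumerates every string at edit distance exactly 1 from a given string x (deletions, substitutions, insertions); the negatives paired with x are those neighbors outside the language. This illustrates the relation the transducer computes, not its finite-state construction, and `in_language` is a hypothetical membership test.

```python
def neighbors_at_distance_one(x, alphabet):
    """All strings y with string_edit_distance(x, y) == 1."""
    out = set()
    for i in range(len(x)):
        out.add(x[:i] + x[i + 1:])              # delete position i
        for a in alphabet:
            if a != x[i]:
                out.add(x[:i] + a + x[i + 1:])  # substitute at position i
    for i in range(len(x) + 1):
        for a in alphabet:
            out.add(x[:i] + a + x[i:])          # insert before position i
    return out

# Candidate negatives for a positive string x (in_language is hypothetical):
# negatives = {y for y in neighbors_at_distance_one(x, "ab") if not in_language(y)}
```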
In the Long Adversarial Test Set, 50k positive and 50k negative strings were randomly created subject to these constraints:

- There are equally many positive strings of each length between 31 and 50.
- For each string length n, an acyclic finite-state transducer was constructed which mapped every positive string x of length n to all negative strings y such that `string_edit_distance(x,y) == 1`.
- Pairs of positive and negative strings were sampled according to a uniform distribution over the paths in this transducer.
- No duplicate strings are allowed.
Data for the complement languages
Data for the complement languages was obtained by changing the labels on the data generated from the corresponding non-complement languages. For example, the positive strings in `data/Small/64.64.coSL.2.1.4_Train.txt` are exactly the negative strings in `data/Small/64.64.SL.2.1.4_Train.txt`, and the negative strings in the former are exactly the positive strings in the latter.
Directory Structure
The files with 100,000 strings are in the Large directory. The Small and Mid directories contain files with fewer strings. Those files were obtained by down-sampling from the Large files.
In the Small directory, each file contains 500 positive strings and 500 negative strings, for a total of 1,000 strings.
In the Mid directory, each file contains 5,000 positive strings and 5,000 negative strings, for a total of 10,000 strings.
In the Large directory, each file contains 50,000 positive strings and 50,000 negative strings, for a total of 100,000 strings.
Models
There are 21600 trained neural network models in this directory. TensorFlow and Keras APIs were used to implement and train all neural networks.

21600 = 1800 languages × 4 neural network types × 3 training sets × 1 development set.
These models are provided so that researchers interested in extracting simpler models (formal grammars, simpler networks) from trained networks have many models to work with.
How they were trained
The neural network types used were `simple_rnn`, `gru`, `lstm`, and `transformer`. Hyperparameters for these networks were set using a limited grid search on a subset of MLRegTest. One network of each type was selected based on the outcome of the hyperparameter grid search. These and other details are provided in the paper.
These four networks were trained on data from the 1800 languages in MLRegTest described above.
The training sets were `Small`, `Mid`, and `Large`.
The development set matched the size of the training set. During training, the development sets were not used to set any parameters or to trigger early stopping; they enter training only in the sense that they allow the loss function to be monitored over the course of training.
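In Keras terms, this corresponds to passing the development set as `validation_data`, which logs the dev loss each epoch, while attaching no callbacks (such as EarlyStopping) that would act on it. A minimal sketch; the architecture, sizes, and random data below are placeholders, not the benchmark's actual settings:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: integer-coded strings of length 29, binary labels.
x_train = np.random.randint(1, 65, size=(1000, 29))
y_train = np.random.randint(0, 2, size=(1000,))
x_dev = np.random.randint(1, 65, size=(1000, 29))
y_dev = np.random.randint(0, 2, size=(1000,))

# Placeholder model; the real architectures come from the grid search.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=65, output_dim=32),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# validation_data only reports dev loss/accuracy per epoch; with no
# EarlyStopping or checkpointing callbacks, it never affects the weights.
model.fit(x_train, y_train, validation_data=(x_dev, y_dev), epochs=5)
```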
Naming conventions
Each network is in a folder named according to the scheme `nn.lg.train`, where `nn` is the network type, `lg` is the language name (see above), and `train` is the size of the training and dev set used for that language.
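Because the language name itself contains dots, a folder name splits unambiguously only from the ends: the first field is the network type and the last is the training size. A small sketch, using an illustrative name composed from the conventions above:

```python
def parse_model_dir(name):
    """Split an nn.lg.train folder name into network, language, size."""
    parts = name.split(".")
    return parts[0], ".".join(parts[1:-1]), parts[-1]

print(parse_model_dir("lstm.64.64.SL.2.1.4.Mid"))
# -> ('lstm', '64.64.SL.2.1.4', 'Mid')
```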
Loading the models
See `model_summary.py` in `src/neural_net/tensorflow/` in the subregular-learning repo for an example of how to load one of the trained networks.
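Assuming the networks are stored in a standard Keras saved format (check `model_summary.py` for the authoritative procedure), loading one reduces to a single call; the path below is illustrative:

```python
import tensorflow as tf

# Illustrative path following the nn.lg.train naming scheme.
model = tf.keras.models.load_model("models/lstm.64.64.SL.2.1.4.Mid")
model.summary()  # prints layer structure and parameter counts
```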
Change Log
- The (T)(P)LT languages were redefined to ensure uniqueness and to correct the accounting of k and j where appropriate.
- The att and fst files for the languages *.Reg.0.0.9 and *.SF.0.0.4 were reconstructed after a bug involving non-determinism in constructions was discovered and fixed in the Language Toolkit. The data for the (T)(P)LT languages and for *.Reg.0.0.9 and *.SF.0.0.4 was regenerated accordingly.
- The models were changed to four types (simple RNN, LSTM, GRU, transformer) whose hyperparameters were tuned in a limited search on a subset of MLRegTest. Each of these four models was then trained on the 5400 datasets (1800 languages × 3 training regimes).
Methods
The languages were generated by constructing finite-state acceptors, and the datasets were generated by sampling from those acceptors. The scripts and software used for these processes are open source and available; see https://github.com/heinz-jeffrey/subregular-learning. Further details are described in the arXiv preprint "MLRegTest: A Benchmark for the Machine Learning of Regular Languages".