MLRegTest: A benchmark for the machine learning of regular languages
Data files

- Nov 17, 2023 version (69.14 GB): data.tar.gz, languages.tar.gz, models.tar.gz, README.md
- Jul 13, 2024 version (106.41 GB): data.tar.gz, languages.tar.gz, models.tar.gz, README.md
Abstract
MLRegTest is a benchmark for machine learning systems on sequence classification, containing training, development, and test sets drawn from 1,800 regular languages. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals they use (string, tier-string, subsequence, or combinations thereof). The logical complexity and the choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies.
README: MLRegTest: A benchmark for the machine learning of regular languages
https://doi.org/10.5061/dryad.dncjsxm4h
MLRegTest provides training and testing data for 1800 regular languages.
This repository contains three gzipped tar archives.
> data.tar.gz (21GB)
> languages.tar.gz (4.5MB)
> models.tar.gz (76GB)
When uncompressed, these yield three directories, described in detail below.
> data (43GB)
> languages (38MB)
> models (87GB)
Languages
Languages are named according to the scheme `Sigma.Tau.class.k.t.i.plebby`, where `Sigma` is a two-digit alphabet size, `Tau` a two-digit number of salient symbols (the 'tier'), `class` the named subregular class, `k` the width of factors used (if applicable), `t` the threshold counted to (if applicable), and `i` a unique identifier. For example, `64.64.SL.2.1.4` names a Strictly Local language with factor width 2 over a 64-symbol alphabet. The table below unabbreviates the class names and shows how many languages of each class there are.
| class | name | amount |
|---|---|---|
| SL | Strictly Local | 90 |
| coSL | co-Strictly Local | 90 |
| TSL | Tier-based Strictly Local | 150 |
| TcoSL | Tier-based co-Strictly Local | 150 |
| SP | Strictly Piecewise | 90 |
| coSP | co-Strictly Piecewise | 90 |
| LT | Locally Testable | 90 |
| TLT | Tier-based Locally Testable | 150 |
| LPT | Locally Piecewise Testable | 90 |
| TLPT | Tier-based Locally Piecewise Testable | 150 |
| PT | Piecewise Testable | 90 |
| LTT | Locally Threshold Testable | 180 |
| TLTT | Tier-based Locally Threshold Testable | 300 |
| SF | Star-Free | 30 |
| Zp | Purely periodic languages with a prime cycle | 30 |
| Reg | Regular | 30 |
What does it mean to say MLRegTest contains a language L belonging to class X? This is an important question because some classes contain other classes. For example, SL is properly contained in LT. Also, some incomparable classes overlap. For example, LT and PT have a non-empty intersection. When we say MLRegTest contains a language L belonging to class X, it means L satisfies the following criteria:

1. L belongs to class X, and
2. L does not belong to any class Y which is a proper subset of, or incomparable with, class X.

Determining (1), that a language L belongs to class X, was straightforward, since we used constructive methods which guarantee membership in X. Determining (2) was more difficult, since it requires showing that a given language does not belong to some class Y. For example, we wanted to verify that none of the LT languages we defined belonged to SL, TSL, or PT.

For this purpose, we used The Language Toolkit and amalgam, by Dakotah Lambert, which contain decision algorithms for all of the subregular classes listed above. In this way, we were able to verify that both properties (1) and (2) hold for all 1800 languages in MLRegTest.
With the exception of the complement languages, each language is provided in three file formats: `.plebby`, `.att`, and `.fst`. These formats are described below. We refer to the non-complement languages as 'non-co' languages.
Automata for the 330 complement languages are not provided. They can be obtained by complementing the languages in the corresponding class. For example, there are 90 SL languages. The coSL languages are the complements of these 90 languages.
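For illustration, here is a minimal sketch of how a complement automaton could be obtained with Pynini (the OpenFst Python interface used elsewhere in this project). The file path and the two-symbol alphabet are hypothetical; the benchmark's languages use the alphabet files shipped in the languages directory.

```python
import pynini

# Hypothetical two-symbol alphabet; the real languages use larger alphabets.
sigma_star = pynini.union("a", "b").closure().optimize()

# Load the minimal DFA for some non-co language (path is illustrative).
lang = pynini.Fst.read("languages/fst_format/04.04.SL.2.1.0.fst")

# The complement language is Sigma* minus L. pynini.difference requires
# the subtrahend to be an unweighted, epsilon-free, deterministic
# acceptor, which a minimal DFA is.
co_lang = pynini.difference(sigma_star, lang).optimize()
```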
plebby

This directory is populated by the `.plebby` files for use with `plebby`, which is included in The Language Toolkit. These files were generated by the Python script `generate-subreglib.py` in the subregular-learning repo.

There is one `.plebby` file per non-co language, plus three defining the alphabets, for 1473 total `.plebby` files.

Each of these files can be used with `plebby` to generate the `.att` file for each language (in `/languages/att_format`).
.att files

These files are text files describing finite-state automata, as used by OpenFst. According to OpenFst's Quick Tour, they follow this format:

> arc format: src dest ilabel olabel [weight]
> final state format: state [weight]
> lines may occur in any order except initial state must be first line
> unspecified weights default to 0.0 (for the library-default Weight type)

Since MLRegTest only contains unweighted acceptors (not weighted transducers), no weights are given and `olabel` always equals `ilabel`.
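For concreteness, here is a small hypothetical `.att` file (not one from the benchmark) for a two-state DFA over {a, b} accepting strings with an even number of a's. The source state of the first line, 0, is the initial state, and the final line marks state 0 as accepting:

```
0 1 a a
0 0 b b
1 0 a a
1 1 b b
0
```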
For each language L, `L.att` is the minimal deterministic finite-state acceptor (DFA) recognizing L. There are 1470 `.att` files, one for each non-co language. The att directory also contains three `.txt` files for the alphabets.
.fst files
These are 1470 binary files which can be processed by OpenFst software.
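Such a file can be loaded directly in Python through Pynini; a minimal sketch, with an illustrative path:

```python
import pynini

# Load a binary OpenFst automaton (path is illustrative).
fst = pynini.Fst.read("languages/fst_format/04.04.SL.2.1.0.fst")

print(fst.num_states())  # states in the minimal DFA
print(fst.arc_type())    # arc/weight type, e.g. "standard"
```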
Data
The data directory contains the training, development (validation), and test data sets for 1800 regular languages. These are organized into three sub-directories: Small, Mid, and Large. Each subdirectory contains 10800 files, since for each language there are six data files, named as follows.
> languagename_Dev.txt
> languagename_TestLA.txt
> languagename_TestLR.txt
> languagename_TestSA.txt
> languagename_TestSR.txt
> languagename_Train.txt
Using OpenFst and its Python interface Pynini, the training and testing data for the non-co languages were prepared as follows.
Training Data
Training data was randomly generated with duplicates allowed. We generated 50k positive and 50k negative strings subject to the following constraints:

- There were equally many strings of each length between 20 and 29.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language (the sketch below illustrates this kind of length-conditioned uniform sampling).
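The sketch below illustrates the idea of length-conditioned uniform sampling on a toy DFA; it is not the authors' pipeline (which built the samplers with OpenFst/Pynini). Uniformity over paths is obtained by counting, for each state, how many accepted strings of the remaining length are reachable, and weighting each arc choice by those counts. The DFA and all names here are hypothetical.

```python
import random
from functools import lru_cache

# Toy DFA over {a, b} accepting strings with an even number of a's.
# DELTA maps (state, symbol) -> next state; state 0 is initial and final.
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
FINALS = {0}
ALPHABET = ["a", "b"]

@lru_cache(maxsize=None)
def count(state, remaining):
    """Number of accepted strings of length `remaining` readable from `state`."""
    if remaining == 0:
        return 1 if state in FINALS else 0
    return sum(count(DELTA[state, sym], remaining - 1) for sym in ALPHABET)

def sample(length, start=0):
    """Draw one accepted string of the given length, uniformly at random."""
    if count(start, length) == 0:
        raise ValueError("no accepted strings of this length")
    state, out = start, []
    for remaining in range(length, 0, -1):
        # Weight each symbol by how many accepted completions it leads to.
        weights = [count(DELTA[state, sym], remaining - 1) for sym in ALPHABET]
        sym = random.choices(ALPHABET, weights=weights)[0]
        out.append(sym)
        state = DELTA[state, sym]
    return "".join(out)

print(sample(20))  # a length-20 string with an even number of a's
```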
Development Data
Development (validation) data was randomly generated with duplicates allowed. We generated 50k positive and 50k negative strings subject to the following constraints:

- There were equally many strings of each length between 20 and 29.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language.
- The development and training sets were disjoint.
Testing Data
For each language, four Test sets were generated as follows.
In the Short Random Test Set, 50k positive and 50k negative strings were randomly generated without duplicates, subject to these constraints:

- There are equally many strings of each length between 20 and 29.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language.
- No duplicate strings are allowed, and these strings are disjoint from the training and validation data.
In the Long Random Test Set, 50k positive and 50k negative strings were randomly generated without duplicates, subject to these constraints:

- There are equally many strings of each length between 31 and 50.
- Positive strings were sampled according to a uniform distribution over the paths in the minimal acyclic DFA for the words of length n belonging to the language, and negative strings were sampled similarly from the complement of the language.
- No duplicate strings are allowed.
In the Short Adversarial Test Set, 50k positive and 50k negative strings were randomly created subject to these constraints:

- There are equally many positive strings of each length between 20 and 29.
- For each string length n, an acyclic finite-state transducer was constructed which mapped every positive string x of length n to all negative strings y such that `string_edit_distance(x,y) == 1` (the sketch after this list illustrates the relation).
- Pairs of positive and negative strings were sampled according to a uniform distribution over the paths in this transducer.
- No duplicate strings are allowed, and these strings are disjoint from the training and validation data.
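To make the adversarial relation concrete, the sketch below enumerates every string at edit distance exactly 1 from a given string x (deletions, substitutions, insertions); the negatives paired with x are those neighbors outside the language. This illustrates the relation the transducer computes, not its finite-state construction, and `in_language` is a hypothetical membership test.

```python
def neighbors_at_distance_one(x, alphabet):
    """All strings y with string_edit_distance(x, y) == 1."""
    out = set()
    for i in range(len(x)):
        out.add(x[:i] + x[i + 1:])              # delete position i
        for a in alphabet:
            if a != x[i]:
                out.add(x[:i] + a + x[i + 1:])  # substitute at position i
    for i in range(len(x) + 1):
        for a in alphabet:
            out.add(x[:i] + a + x[i:])          # insert before position i
    return out

# Candidate negatives for a positive string x (in_language is hypothetical):
# negatives = {y for y in neighbors_at_distance_one(x, "ab") if not in_language(y)}
```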
In the Long Adversarial Test Set, 50k positive and 50k negative strings were randomly created subject to these constraints:

- There are equally many positive strings of each length between 31 and 50.
- For each string length n, an acyclic finite-state transducer was constructed which mapped every positive string x of length n to all negative strings y such that `string_edit_distance(x,y) == 1`.
- Pairs of positive and negative strings were sampled according to a uniform distribution over the paths in this transducer.
- No duplicate strings are allowed.
Data for the complement languages
Data for the complement languages was obtained by changing the labels on the data generated from the corresponding non-complement languages. For example, the positive strings in `data/Small/64.64.coSL.2.1.4_Train.txt` are exactly the negative strings in `data/Small/64.64.SL.2.1.4_Train.txt`, and the negative strings in the former are exactly the positive strings in the latter.
Directory Structure
The files with 100,000 strings are in the Large directory. The Small and Mid directories contain files with fewer strings. Those files were obtained by down-sampling from the Large files.
In the Small directory, each file contains 500 positive strings and 500 negative strings, for a total of 1,000 strings.
In the Mid directory, each file contains 5,000 positive strings and 5,000 negative strings, for a total of 10,000 strings.
In the Large directory, each file contains 50,000 positive strings and 50,000 negative strings, for a total of 100,000 strings.
Models
There are 21600 trained neural network models in this directory. TensorFlow and Keras APIs were used to implement and train all neural networks.

21600 = 1800 languages × 4 neural network types × 3 training sets × 1 development set.
These models are provided so that researchers interested in extracting simpler models (formal grammars, simpler networks) from trained networks have many models to work with.
How they were trained
The neural network types used were `simple_rnn`, `gru`, `lstm`, and `transformer`. Hyperparameters for these networks were set using a limited grid search on a subset of MLRegTest. One network of each type was selected based on the outcome of the hyperparameter grid search. These and other details are provided in the paper.
These four networks were trained on data from the 1800 languages in MLRegTest described above.
The training sets were `Small`, `Mid`, and `Large`.
The development set matched the size of the training set. During training, the development sets were not used to set any parameters or to trigger early stopping; they enter training only in the sense that they allow the loss function to be monitored over the course of training.
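In Keras terms, this corresponds to passing the development set as `validation_data`, which logs the dev loss each epoch, while attaching no callbacks (such as EarlyStopping) that would act on it. A minimal sketch; the architecture, sizes, and random data below are placeholders, not the benchmark's actual settings:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: integer-coded strings of length 29, binary labels.
x_train = np.random.randint(1, 65, size=(1000, 29))
y_train = np.random.randint(0, 2, size=(1000,))
x_dev = np.random.randint(1, 65, size=(1000, 29))
y_dev = np.random.randint(0, 2, size=(1000,))

# Placeholder model; the real architectures come from the grid search.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=65, output_dim=32),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# validation_data only reports dev loss/accuracy per epoch; with no
# EarlyStopping or checkpointing callbacks, it never affects the weights.
model.fit(x_train, y_train, validation_data=(x_dev, y_dev), epochs=5)
```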
Naming conventions
Each network is in a folder named according to the scheme `nn.lg.train`, where `nn` is the network type, `lg` is the language name (see above), and `train` is the size of the training and dev set used for that language.
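Because the language name itself contains dots, a folder name splits unambiguously only from the ends: the first field is the network type and the last is the training size. A small sketch, using an illustrative name composed from the conventions above:

```python
def parse_model_dir(name):
    """Split an nn.lg.train folder name into network, language, size."""
    parts = name.split(".")
    return parts[0], ".".join(parts[1:-1]), parts[-1]

print(parse_model_dir("lstm.64.64.SL.2.1.4.Mid"))
# -> ('lstm', '64.64.SL.2.1.4', 'Mid')
```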
Loading the models
See `model_summary.py` in `src/neural_net/tensorflow/` in the subregular-learning repo for an example of how to load one of the trained networks.
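Assuming the networks are stored in a standard Keras saved format (check `model_summary.py` for the authoritative procedure), loading one reduces to a single call; the path below is illustrative:

```python
import tensorflow as tf

# Illustrative path following the nn.lg.train naming scheme.
model = tf.keras.models.load_model("models/lstm.64.64.SL.2.1.4.Mid")
model.summary()  # prints layer structure and parameter counts
```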
Change Log
- The (T)(P)LT languages were redefined to ensure uniqueness and to correct the accounting of k and j where appropriate.
- The att and fst files for the languages *.Reg.0.0.9 and *.SF.0.0.4 were reconstructed after a bug involving non-determinism in constructions was discovered and fixed in the Language Toolkit. The data for the (T)(P)LT languages and for *.Reg.0.0.9 and *.SF.0.0.4 was regenerated accordingly.
- The models were changed to four types (simple RNN, LSTM, GRU, transformer) whose hyperparameters were tuned in a limited search on a subset of MLRegTest. Each of these four models was then trained on the 5400 datasets (1800 languages × 3 training regimes).
Methods
The languages were generated by constructing finite-state acceptors, and the datasets were generated by sampling from those acceptors. The scripts and software used for these processes are open source and available; see https://github.com/heinz-jeffrey/subregular-learning. Further details are described in the arXiv preprint "MLRegTest: A Benchmark for the Machine Learning of Regular Languages".