--[===[
MODULE "MSPLITTER"
"eo.wiktionary.org/wiki/Modulo:msplitter" <!--2024-Oct-09-->
"id.wiktionary.org/wiki/Modul:msplitter"
Purpose: submodule for "mlawc"
Utilo: submodulo por "mlawc"
Manfaat: submodul untuk "mlawc"
Syfte: submodul foer "mlawc"
Used by templates / Uzata far sxablonoj /
Digunakan oleh templat / Anvaent av mallar:
* none (this module cannot be called from a template)
Required submodules / Bezonataj submoduloj /
Submodul yang diperlukan / Behoevda submoduler:
* none
Incoming: * single table with following content (everything must be
prevalidated by the caller):
* 0 (boo) -- desirability of compound cat:s
* 1 (str) -- pagename AKA input lemma (may NOT be empty)
* 2 (num) -- split strategy (0...5 or 7)
* 3 (tab) -- fragments from "%"-syntax (assisted split)
* 4 (tab) -- fragments from "#"-syntax (assisted split)
* 5 (tab) -- fragments for manual split
* 6 (tab) -- fragments from extra parameter
* 7 (boo) -- true if extra parameter was used
* 8 (tab) -- lng stuff with double-letter indexes
* 9 (boo) -- NR word class
* 10 (boo) -- KA word class
* 15 (boo) -- detrc
- desirability of compound cat:s -- index 0 (we split even if
false but no cat:s then)
- lemma (may NOT be empty) -- index 1
- split control parameter -- index 2 3 4 5
- extra parameter -- index 6 7
- language stuff (code and some variants of language name) -- index 8
- word class (reduced to 2 questions) -- index 9 10
Returned: * single table with following content:
* 0...17 (str) category names
* 20...37 (nil or boo) main page flags
* 50 (str) output lemma wikitext or "//" on error
* 51 (str) debug "qstrtrace"
* 52 (num) status code (ZERO is OK)
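As a hedged illustration of the two tables described above, the following
sketch builds an incoming table for strategy 0 and inspects a returned
table. The helper names "sketchincoming" and "sketchreadresult" and all
sample values are assumptions made for illustration only, the actual entry
point of this module is not shown here.

```lua
-- Hypothetical helpers, NOT part of "msplitter".
local function sketchincoming (strlemma)
  local tabin = {}
  tabin[0] = true          -- compound cat:s are desired
  tabin[1] = strlemma      -- pagename, may NOT be empty
  tabin[2] = 0             -- strategy #S0 automatic multiword split
  tabin[3] = {}            -- must be empty for strategy 0
  tabin[4] = {}            -- must be empty for strategy 0
  tabin[5] = {}            -- no fragments for manual split
  tabin[6] = {}            -- no fragments from extra parameter
  tabin[7] = false         -- extra parameter was not used
  tabin[8] = { LK = 'eo' } -- assumed sample with double-letter index
  tabin[9] = false
  tabin[10] = false
  tabin[15] = false
  return tabin
end

local function sketchreadresult (tabret)
  if (tabret[52]~=0) then
    return nil             -- non-ZERO status code, lemma is "//" then
  end
  return tabret[50]        -- output lemma wikitext
end
```

A caller would pass the table from "sketchincoming" to the splitter and
hand the result to "sketchreadresult" before using the category names.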
The split strategies available are:
- #S0 automatic multiword split
- #S1 assisted split
- #S2 manual split
- #S3 simple root split
- #S4 simple bare root
- #S5 large letter split
- #S6 reserved
- #S7 no split (splitter still may be called and extra parameter is processed)
List of 6+1+1+1 selectable morpheme types:
C circumfix cirkumfikso
I infix infikso (EO: -o- -et- -il- ...)
M standalone root memstara radiko (EO: tri dek post ...)
N nonstandalone root nememstara radiko (EO: fer voj ...)
P prefix prefikso
U suffix sufikso (postfikso, finajxo, EO: -a -j -n)
-------
W word vorto
-------
L same as "N" but changes linking behavior (only in F210)
-------
X only after "&" in the extra parameter (caller converts it for us)
These mortyp:s can be used in the split control parameter before the colon
":" with manual split. They can also be used in the extra parameter, either
after "&" or in fragments before ":" or "!", but "L" is prohibited there
(thus C I M N P U W are left, plus maybe X). See "spec-splitter-en.txt" for
syntax details.
We put only the letter symbol into the category name (except for the type
word) as it otherwise would become unreasonably long. The category name must
contain 3 pieces of information:
- language (consider "-an" in SV and ID)
- mortyp (consider "-an" and "an-" and "an" in SV)
- the morpheme / affix / word itself
The semi-hardcoded configuration in the source code of "mlawc" makes it
possible to deactivate only the compound categories, or to deactivate the
splitter so that the raw lemma is shown without link, or to deactivate
showing the lemma altogether. In both latter cases the splitter is inactive
and this module is not called at all.
The automatic splitter ("numsplyt" = 0 and "lfhsplitaa") is fully
automatic and the 2 tables at index 3 and 4 must be empty then.
No error can occur here, but the split can fail: if no split boundaries
can be applied, then the output is identical to the input.
The assisted splitter ("numsplyt" = 1 and "lfhsplitaa") is
controlled by 2 prevalidated tables.
* Table contains up to 16 values indexed by integers 0 to 15,
value type string "1" means do block, type "nil" means do not
block (the default). Other values should not occur and evaluate to
do not block like "nil" does.
* Table contains up to 16 values indexed by integers 0 to 15, value:
* type string:
* "N" or "I" or "A" (as described in "spec-splitter-en.txt")
* colon ":" followed by the link target (length 1...40 octet:s NOT
checked anymore here)
Beginning char other than "N" or "I" or "A" or ":" should not
occur and evaluates to do nothing unusual like "nil" does.
* type "nil" means do nothing unusual (the default)
No error can occur in the assisted splitter, but the split can fail:
if no split boundaries can be applied, then the output is identical to
the input.
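The reading of single values from the 2 tables above can be sketched as
follows. This is only a hedged illustration: "sketchisblocked" and
"sketchassistentry" are hypothetical helpers, and the meaning of "N" "I" "A"
is left to "spec-splitter-en.txt".

```lua
-- Hypothetical helpers, NOT part of "msplitter".

-- blocking table: only string "1" blocks, everything else does not
local function sketchisblocked (tabblock, numidx)
  return (tabblock[numidx]=='1')
end

-- other table: return kind of entry and link target (or nil)
local function sketchassistentry (varentry)
  if (type(varentry)~='string') then
    return 'default', nil  -- type "nil" means do nothing unusual
  end
  local strhead = string.sub (varentry,1,1)
  if ((strhead=='N') or (strhead=='I') or (strhead=='A')) then
    return strhead, nil
  end
  if (strhead==':') then
    return 'link', string.sub (varentry,2,-1) -- target after the colon
  end
  return 'default', nil    -- other beginning char: nothing unusual
end
```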
The manual splitter ("numsplyt" = 2 and "lfhsplitmn") is controlled by one
prevalidated table. The pagename does not even enter the split process,
only a bool revealing whether it contains at least one space does.
* Table contains 1 to 16 strings indexed by integers 0 to 15,
one string for every fragment. The 5 legal types are:
* F000 : no brackets, no colon, no slash (visible text no link)
* F200 : 2 brackets, no colon, no slash (combo target visible text)
* F201 : 2 brackets, no colon, 1 slash (target / visible text)
* F210 : 2 brackets, 1 colon, no slash (mortyp : combo target visible text)
* F211 : 2 brackets, 1 colon, 1 slash (mortyp : target / visible text)
No error can occur in the manual splitter, and no failure due to lack of
boundaries either, since the "sum check" is part of the prevalidation.
Note that we use slashes and single square brackets "+[I:bug/BUG]"
instead of wikisyntax "[[bug|BUG]]", beware that "[bug|BUG]" would NOT work.
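The bracket, colon and slash counts that distinguish the 5 legal fragment
types can be sketched as follows. This is only an illustration under stated
assumptions: "sketchfragtype" is a hypothetical helper, not the prevalidation
code of the caller, and it checks neither the position of the brackets nor
any leading "+".

```lua
-- Hypothetical helper, NOT part of "msplitter".
-- Returns "F000" "F200" "F201" "F210" "F211" or nil for illegal counts.
local function sketchfragtype (strfrag)
  local _, numbra = string.gsub (strfrag, '[%[%]]', '') -- count "[" and "]"
  local _, numcol = string.gsub (strfrag, ':', '')      -- count ":"
  local _, numsla = string.gsub (strfrag, '/', '')      -- count "/"
  if ((numbra==0) and (numcol==0) and (numsla==0)) then
    return 'F000'          -- visible text, no link
  end
  if ((numbra==2) and (numcol<=1) and (numsla<=1)) then
    return 'F2' .. tostring(numcol) .. tostring(numsla)
  end
  return nil               -- for example wikisyntax with 4 brackets
end
```

With this sketch the fragment "[I:bug/BUG]" is of type F211 and "[bug]" is
of type F200.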
]===]
local exporttable = {}
------------------------------------------------------------------------
---- CONSTANTS [O] ----
------------------------------------------------------------------------
-- uncommentable EO vs ID constant table (categories)
-- syntax of insertion and discarding magic string:
-- "@" followed by 2 uppercase letters and 2 hex numbers
-- otherwise the hit is not processed, but copied as-is instead
-- 2 letters select the insertable item from table supplied by the caller
-- 2 hex numbers control discarding left and right (0...15 char:s)
-- empty item is legal and results in discarding if some number is non-ZERO
-- if uppercasing or other adjustment is needed then the caller must take
-- care of it in the form of 2 or more separate items provided in the table
-- insertable items defined:
-- constant:
-- * LK lng code (unknown "??" legal but take care elsewhere)
-- * LN lng name (unknown legal, for example "dana" or "Ido")
-- * LU lng name uppercased (unknown legal, for example "Dana" or "Ido")
-- * LO lng name not own (empty or nil if own)
-- * LV lng name uppercased not own (empty or nil if own)
-- * LY lng name long (for example "bahasa Swedia")
-- * LZ lng name long not own (empty or nil if own)
-- * SC script code (for example "T", "S", "P" for ZH, "C" "L" for SH)
-- variable (we can have 2 word classes):
-- * WC word class name (for example "substantivo")
-- * WU word class name uppercased (for example "Substantivo")
-- * MT mortyp code (for example "C")
-- * FR fragment (for example "peN-...-an" or "abelujo")
-- see "lfiultiminsert" and "tablngdbl"; use space here and avoid "_"
-- note the malicious false friendship between EO:frazo and ID:frasa
local contabktaoj = {}
contabktaoj[3] = 'Vortgrupo -@LK00- enhavanta (@FR00) @SC10' -- EO only if ("boocatdesir" is true) can be many
-- contabktaoj[3] = 'Frasa @LZ10 mengandung kata @FR00 @SC10' -- ID only if ("boocatdesir" is true) can be many
contabktaoj[4] = 'Frazo -@LK00- enhavanta vorton (@FR00) @SC10' -- EO only if ("boocatdesir" is true) can be many
-- contabktaoj[4] = 'Kalimat @LK00 mengandung kata (@FR00) @SC10' -- ID only if ("boocatdesir" is true) can be many
contabktaoj[5] = 'Vorto -@LK00- enhavanta morfemon @MT00 (@FR00) @SC10' -- EO only if ("boocatdesir" is true) can be many
-- contabktaoj[5] = 'Kata @LK00 mengandung morfem @MT00 (@FR00) @SC10' -- ID only if ("boocatdesir" is true) can be many
------------------------------------------------------------------------
---- SPECIAL STUFF OUTSIDE MAIN [B] ----
------------------------------------------------------------------------
-- SPECIAL VAR:S
local qboodetrc = true
local qstrtrace = '<br>' -- for main & sub:s, debug report sent to caller
local qtabktaoj = {} -- global for compound categories [0]...[52] and ret
------------------------------------------------------------------------
---- DEBUG FUNCTIONS [D] ----
------------------------------------------------------------------------
-- Local function LFDTRACEMSG
-- Enhance upvalue "qstrtrace" with fixed text.
-- for variables the other sub "lfdshowvar" is preferable but in exceptional
-- cases it can be justified to send text with values of variables to this sub
-- no size limit
-- upvalue "qstrtrace" must NOT be type "nil" on entry (is inited to "<br>")
-- uses upvalue "qboodetrc"
local function lfdtracemsg (strshortline)
if (qboodetrc and (type(strshortline)=='string')) then
qstrtrace = qstrtrace .. strshortline .. '.<br>' -- dot added !!!
end--if
end--function lfdtracemsg
------------------------------------------------------------------------
---- MATH FUNCTIONS [E] ----
------------------------------------------------------------------------
local function mathdiv (xdividens, xdivisero)
local resultdiv = 0 -- DIV operator is lacking in Lua :-(
resultdiv = math.floor (xdividens / xdivisero)
return resultdiv
end--function mathdiv
local function mathmod (xdividendo, xdivisoro)
local resultmod = 0 -- MOD operator is "%", but a bitwise AND operator is lacking too
resultmod = xdividendo % xdivisoro
return resultmod
end--function mathmod
------------------------------------------------------------------------
-- Local function MATHBITTEST
-- Find out whether single bit selected by ZERO-based index is "1" / "true".
-- Result has type "boolean".
-- Depends on functions :
-- [E] mathdiv mathmod
local function mathbittest (numincoming, numbitindex)
local boores = false
while true do
if ((numbitindex==0) or (numincoming==0)) then
break -- we have either reached our bit or run out of bits
end--if
numincoming = mathdiv(numincoming,2) -- shift right
numbitindex = numbitindex - 1 -- count down to ZERO
end--while
boores = (mathmod(numincoming,2)==1) -- pick bit
return boores
end--function mathbittest
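-- Hedged usage sketch for "mathbittest" (sample values are assumptions,
-- results traced by hand from the code above, 5 is binary 101):
--   mathbittest (5,0)  --> true   (b0 is set)
--   mathbittest (5,1)  --> false  (b1 is clear)
--   mathbittest (5,2)  --> true   (b2 is set)
--   mathbittest (5,9)  --> false  (ran out of bits before reaching b9)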
------------------------------------------------------------------------
---- LOW LEVEL STRING FUNCTIONS [G] ----
------------------------------------------------------------------------
-- Local function LFGPOKESTRING
-- Replace single octet in a string.
-- Input : * strinpokeout -- empty legal
-- * numpokepoz -- ZERO-based, out of range legal
-- * numpokeval -- new value
-- This is inefficient by design of Lua. The caller is responsible for
-- minimizing the number of invocations of this, in particular, it should
-- not call if the new value is equal to the existing one.
local function lfgpokestring (strinpokeout, numpokepoz, numpokeval)
local numpokelen = 0
numpokelen = string.len(strinpokeout)
if ((numpokelen==1) and (numpokepoz==0)) then
strinpokeout = string.char(numpokeval) -- totally replace
end--if
if (numpokelen>=2) then
if (numpokepoz==0) then
strinpokeout = string.char(numpokeval) .. string.sub (strinpokeout,2,numpokelen)
end--if
if ((numpokepoz>0) and (numpokepoz<(numpokelen-1))) then
strinpokeout = string.sub (strinpokeout,1,numpokepoz) .. string.char(numpokeval) .. string.sub (strinpokeout,(numpokepoz+2),numpokelen)
end--if
if (numpokepoz==(numpokelen-1)) then
strinpokeout = string.sub (strinpokeout,1,(numpokelen-1)) .. string.char(numpokeval)
end--if
end--if (numpokelen>=2) then
return strinpokeout
end--function lfgpokestring
------------------------------------------------------------------------
local function lfgtestuc (numkode)
local booupperc = false
booupperc = ((numkode>=65) and (numkode<=90))
return booupperc
end--function lfgtestuc
local function lfgtestlc (numcode)
local boolowerc = false
boolowerc = ((numcode>=97) and (numcode<=122))
return boolowerc
end--function lfgtestlc
------------------------------------------------------------------------
-- Local function LFGTESTPUNCTURE
-- Test whether char is an ASCII punctuation sign, return type "boolean".
-- punctuation (5 char:s: ! , . ; ?) 21 33 | 2C 44 | 2E 46 | 3B 59 | 3F 63
-- dash "-" and apo "'" do NOT count as punctuation
-- here we do NOT include SPACE in the list
local function lfgtestpuncture (numcorde)
local boopunk = false
boopunk = ((numcorde==33) or (numcorde==44) or (numcorde==46) or (numcorde==59) or (numcorde==63))
return boopunk
end--function lfgtestpuncture
------------------------------------------------------------------------
-- Local function LFIADDTHEDASH
local function lfiaddthedash (strafikso, booaddleft, booaddright)
local numdashlength = 0
local numbuggar = 0
numdashlength = string.len (strafikso)
if (numdashlength~=0) then
numbuggar = string.byte (strafikso,1,1)
if (numbuggar==45) then
booaddleft = false -- avoid "--"...
end--if
numbuggar = string.byte (strafikso,numdashlength,numdashlength)
if (numbuggar==45) then
booaddright = false -- avoid ..."--"
end--if
if (booaddleft) then
strafikso = "-" .. strafikso
end--if
if (booaddright) then
strafikso = strafikso .. "-"
end--if
end--if
return strafikso
end--function lfiaddthedash
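-- Hedged usage sketch for "lfiaddthedash" (sample affixes are assumptions,
-- results traced by hand from the code above):
--   lfiaddthedash ("an",true,false)   --> "-an"   (dash added left)
--   lfiaddthedash ("-an",true,true)   --> "-an-"  (left dash NOT doubled)
--   lfiaddthedash ("peN",false,true)  --> "peN-"  (dash added right)
--   lfiaddthedash ("",true,true)      --> ""      (empty string untouched)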
------------------------------------------------------------------------
-- Local function LFIDEBRACKET
-- Separate bracketed part of a string and return the inner or outer
-- part. On failure the string is returned complete and unchanged.
-- There must be exactly ONE "(" and exactly ONE ")" in correct order.
-- Input : * strde31br, boooutside
-- * numxminlencz -- minimal length of inner part, must be >= 1 !!!
-- Note that for length of hit ZERO ie "()" we have "begg" + 1 = "endd"
-- and for length of hit ONE ie "(x)" we have "begg" + 2 = "endd".
-- Example: "crap (NO)" -> len = 9
-- 123456789
-- "begg" = 6 and "endd" = 9
-- Expected result: "NO" or "crap " (note the trailing space)
-- Example: "(XX) YES" -> len = 8
-- 12345678
-- "begg" = 1 and "endd" = 4
-- Expected result: "XX" or " YES" (note the leading space)
local function lfidebracket (strde31br, boooutside, numxminlencz)
local numindoux = 1 -- ONE-based
local numdlong = 0
local num31wesel = 0
local numbegg = 0 -- ONE-based, ZERO invalid
local numendd = 0 -- ONE-based, ZERO invalid
numdlong = string.len (strde31br)
while true do
if (numindoux>numdlong) then
break -- ONE-based -- if both "numbegg" "numendd" non-ZERO then maybe
end--if
num31wesel = string.byte(strde31br,numindoux,numindoux)
if (num31wesel==40) then -- "("
if (numbegg==0) then
numbegg = numindoux -- pos of "("
else
numbegg = 0
break -- damn: more than 1 "(" present
end--if
end--if
if (num31wesel==41) then -- ")"
if ((numendd==0) and (numbegg~=0) and ((numbegg+numxminlencz)<numindoux)) then
numendd = numindoux -- pos of ")"
else
numendd = 0
break -- damn: more than 1 ")" present or ")" precedes "(" or inner part too short
end--if
end--if
numindoux = numindoux + 1
end--while
if ((numbegg~=0) and (numendd~=0)) then
if (boooutside) then
strde31br = string.sub(strde31br,1,(numbegg-1)) .. string.sub(strde31br,(numendd+1),numdlong)
else
strde31br = string.sub(strde31br,(numbegg+1),(numendd-1)) -- separate substring
end--if
end--if
return strde31br -- same string variable
end--function lfidebracket
------------------------------------------------------------------------
-- Local function LFIREMOVE2BRA
local function lfiremove2bra (strinmedparenteser)
local stroututanparenteser = ''
local numindozux = 1 -- ONE-based
local numparepanjang = 0
local numparechar = 0
numparepanjang = string.len (strinmedparenteser)
while true do
if (numindozux>numparepanjang) then
break
end--if
numparechar = string.byte(strinmedparenteser,numindozux,numindozux)
if ((numparechar~=40) and (numparechar~=41)) then
stroututanparenteser = stroututanparenteser .. string.char(numparechar)
end--if
numindozux = numindozux + 1
end--while
return stroututanparenteser
end--function lfiremove2bra
------------------------------------------------------------------------
---- NUMBER CONVERSION FUNCTIONS [N] ----
------------------------------------------------------------------------
-- Local function LFNONEHEXTOINT
-- Convert single quasi-digit (ASCII HEX "0"..."9" "A"..."F")
-- to integer (0...15, 255 invalid).
-- Only uppercase accepted.
local function lfnonehextoint (numdigit)
local numresult = 255
if ((numdigit>=48) and (numdigit<=57)) then -- "0"..."9"
numresult = numdigit-48
end--if
if ((numdigit>=65) and (numdigit<=70)) then -- "A"..."F"
numresult = numdigit-55
end--if
return numresult
end--function lfnonehextoint
------------------------------------------------------------------------
---- UTF8 FUNCTIONS [U] ----
------------------------------------------------------------------------
-- Local function LFULNUTF8CHAR
-- Evaluate length of a single UTF8 char in octet:s.
-- Input : * numbgoctet -- beginning octet of a UTF8 char
-- Output : * numlen1234x -- unit octet, number 1...4, or ZERO if invalid
-- Does NOT thoroughly check the validity, looks at ONE octet only.
local function lfulnutf8char (numbgoctet)
local numlen1234x = 0
if (numbgoctet<128) then
numlen1234x = 1 -- $00...$7F -- ANSI/ASCII
end--if
if ((numbgoctet>=194) and (numbgoctet<=223)) then
numlen1234x = 2 -- $C2 to $DF
end--if
if ((numbgoctet>=224) and (numbgoctet<=239)) then
numlen1234x = 3 -- $E0 to $EF
end--if
if ((numbgoctet>=240) and (numbgoctet<=244)) then
numlen1234x = 4 -- $F0 to $F4
end--if
return numlen1234x
end--function lfulnutf8char
------------------------------------------------------------------------
-- Local function LFUCASEGENE
-- Adjust (generous) case of a single letter (from ASCII + limited extra
-- set from UTF8 with some common ranges) or longer string. (this is GENE)
-- Input : * strinco7cs : single unicode letter (1 or 2 octet:s) or
-- longer string
-- * booup7cas : for desired output uppercase "true" and for
-- lowercase "false"
-- * boodo7all : "true" to adjust all letters, "false"
-- only beginning letter
-- Output : * strinco7cs
-- Depends on functions : (this is GENE)
-- [U] lfulnutf8char
-- [G] lfgpokestring lfgtestuc lfgtestlc
-- [E] mathdiv mathmod mathbittest
-- This process never changes the length of a string in octet:s. Empty string
-- on input is legal and results in an empty string returned. When case is
-- adjusted, a 1-octet or 2-octet letter is replaced by another letter of same
-- length. Unknown valid char:s (1-octet ... 4-octet) are copied. Broken UTF8
-- stream results in remaining part of the output string (from 1 char to
-- complete length of the incoming string) filled by "Z".
-- * lowercase is usually above uppercase, but not always, letters can be
-- only misaligned (UC even vs UC odd), and rarely even swapped (French "Y")
-- * case delta can be 1 or $20 or $50 or other
-- * case pair distance can span $40-boundary or even $0100-boundary
-- * in the ASCII range lowercase is $20 above uppercase, b5 reveals
-- the case (1 is lower)
-- * the same is valid in $C3-block
-- * this is NOT valid in $C4-$C5-block, lowercase is usually 1 above
-- uppercase, but nothing reveals the case reliably
-- ## $C2-block $0080 $C2,$80 ... $00BF $C2,$BF no letters (OTOH NBSP etc.)
-- ## $C3-block $00C0 $C3,$80 ... $00FF $C3,$BF (SV etc.) delta $20 UC-LC-UC-LC
-- upper $00C0 $C3,$80 ... $00DF $C3,$9F
-- lower $00E0 $C3,$A0 ... $00FF $C3,$BF
-- AA AE EE NN OE UE etc.
-- $D7 $DF $F7 excluded (not letters)
-- $FF excluded (here LC, UC is $0178)
-- ## $C4-$C5-block $0100 $C4,$80 ... $017F $C5,$BF (EO etc.)
-- delta 1 and UC even, but messy with many exceptions
-- EO $0108 ... $016D case delta 1
-- for example SX upper $015C $C5,$9C -- lower $015D $C5,$9D
-- $0138 $0149 $017F excluded (not letters)
-- $0178 excluded (here UC, LC is $FF)
-- $0100 ... $0137 UC even
-- $0139 ... $0148 misaligned (UC odd) note that case delta is NOT reversed
-- $014A ... $0177 UC even again
-- $0179 ... $017E misaligned (UC odd) note that case delta is NOT reversed
-- ## $CC-$CF-block $0300 $CC,$80 ... $03FF $CF,$BF (EL etc.) delta $20
-- EL $0370 ... $03FF (officially)
-- strict EL base range $0391 ... $03C9 case delta $20
-- $0391 $CE,$91 ... $03AB $CE,$AB upper
-- $03B1 $CE,$B1 ... $03CB $CF,$8B lower
-- for example "omega" upper $03A9 $CE,$A9 -- lower $03C9 $CF,$89
-- ## $D0-$D3-block $0400 $D0,$80 ... $04FF $D3,$BF (RU etc.)
-- * delta $20 $50 1
-- * strict RU base range $0410 ... $044F case delta $20 but there
-- is 1 extra char outside !!!
-- * $0410 $D0,$90 ... $042F $D0,$AF upper
-- * $0430 $D0,$B0 ... $044F $D1,$8F lower
-- * for example "CCCP-gamma" upper $0413 $D0,$93 -- lower $0433 $D0,$B3
-- * extra base char and exception is special "E" with horizontal doubledot
-- case delta $50 (upper $0401 $D0,$81 -- lower $0451 $D1,$91)
-- * same applies for ranges $0400 $D0,$80 ... $040F $D0,$8F upper
-- and $0450 $D1,$90 ... $045F $D1,$9F lower
-- * range $0460 $D1,$A0 ... $04FF $D3,$BF (ancient RU, UK, RUE, ...) case
-- delta 1 and UC usually even, but messy with many exceptions $048x
-- $04Cx (combining decorations and misaligned)
-- Variables "numdel7abs" and "numdel7ta" must be at least 16-bit to avoid
-- misevaluation or wrong wrapping when fitting into the range 128...191,
-- even if no deltas exceeding +-127 are supported (there are very few pairs
-- of char:s exceeding this). Also both can be declared unsigned since only
-- addition and subtraction are performed on them.
-- We peek max 2 values per iteration, and change the string in-place, doing
-- so strictly only if there indeed is a change. This is important for LUA
-- where the in-place write access must be emulated by means of a less
-- efficient function.
local function lfucasegene (strinco7cs, booup7cas, boodo7all)
local numlong7den = 0 -- actual length of input string
local numokt7index = 0
local numlong7bor = 0 -- expected length of single char
local numdel7abs = 0 -- at least 16-bit, absolute posi delta
local numdel7ta = 0 -- quasi-signed at least 16-bit, can be negative
local numdel7car = 0 -- quasi-signed 8-bit, can be negative
local numcha7r = 0 -- UINT8 beginning char
local numcha7s = 0 -- UINT8 later char (BIG ENDIAN, lower value here above)
local numcxa7rel = 0 -- UINT8 code relative to beginning of block $00...$FF
local boowan7tlowr = false
local boois7uppr = false
local boois7lowr = false
local boomy7bit0x = false -- single relevant bits picked -- b0
local boomy7bit5x = false -- single relevant bits picked -- b5
local boopen7din = false -- only fake loop
local boodo7adj = true -- preASSume innocence -- continue changing
local boobotch7d = false -- preASSume innocence -- NOT yet botched
local booc3block = false -- $C3 only $00C0...$00FF SV etc. delta 32
local booc4c5blk = false -- $C4 $C5 $0100...$017F EO etc. delta 1
local boocccfblk = false -- $CC $CF $0300...$03FF EL etc. delta 32
local bood0d3blk = false -- $D0 $D3 $0400...$04FF RU etc. delta 32 80 1
booup7cas = not (not booup7cas)
boowan7tlowr = (not booup7cas)
numlong7den = string.len (strinco7cs)
while true do -- genuine loop over incoming string (this is GENE)
if (numokt7index>=numlong7den) then
break -- done complete string
end--if
if ((not boodo7all) and (numokt7index~=0)) then -- loop can skip index ONE
boodo7adj = false
end--if
boois7uppr = false -- preASSume on every iteration
boois7lowr = false -- preASSume on every iteration
numdel7ta = 0 -- preASSume on every iteration
numlong7bor = 1 -- preASSume on every iteration
while true do -- fake loop (this is GENE)
numcha7r = string.byte (strinco7cs,(numokt7index+1),(numokt7index+1))
if (boobotch7d) then
numdel7ta = 90 - numcha7r -- "Z" -- delta must be non-ZERO to write
break -- fill with "Z" char:s
end--if
if (not boodo7adj) then
break -- copy octet after octet
end--if
numlong7bor = lfulnutf8char(numcha7r)
if ((numlong7bor==0) or ((numokt7index+numlong7bor)>numlong7den)) then
numlong7bor = 1 -- reassign to ONE !!!
numdel7ta = 90 - numcha7r -- "Z" -- delta must be non-ZERO to write
boobotch7d = true
break -- truncated char or broken stream
end--if
if (numlong7bor>=3) then
break -- copy UTF8 char, no chance for adjustment
end--if
if (numlong7bor==1) then
boois7uppr = lfgtestuc(numcha7r)
boois7lowr = lfgtestlc(numcha7r)
if (boois7uppr and boowan7tlowr) then
numdel7ta = 32 -- ASCII UPPER->lower
end--if
if (boois7lowr and booup7cas) then
numdel7ta = -32 -- ASCII lower->UPPER
end--if
break -- success with ASCII and one char almost done
end--if
booc3block = (numcha7r==195) -- case delta is 32
booc4c5blk = ((numcha7r==196) or (numcha7r==197)) -- case delta is 1
boocccfblk = ((numcha7r>=204) and (numcha7r<=207)) -- case delta is 32
bood0d3blk = ((numcha7r>=208) and (numcha7r<=211)) -- case delta is 32 80 1
numcha7s = string.byte (strinco7cs,(numokt7index+2),(numokt7index+2)) -- only $80 to $BF
numcxa7rel = (mathmod(numcha7r,4)*64) + (numcha7s-128) -- 4 times 64
boomy7bit0x = ((mathmod(numcxa7rel,2))==1)
boomy7bit5x = mathbittest(numcxa7rel,5)
if (booc3block) then
boopen7din = true -- pending flag
if ((numcxa7rel==215) or (numcxa7rel==223) or (numcxa7rel==247)) then
boopen7din = false -- not a letter, we are done
end--if
if (numcxa7rel==255) then
boopen7din = false -- special LC silly "Y" with horizontal doubledot
if (booup7cas) then
numdel7ta = 121 -- lower->UPPER (distant and reversed order)
end--if
end--if
if (boopen7din) then
boois7lowr = boomy7bit5x -- mostly regular block, look at b5
boois7uppr = not boois7lowr
if (boois7uppr and boowan7tlowr) then
numdel7ta = 32 -- UPPER->lower
end--if
if (boois7lowr and booup7cas) then
numdel7ta = -32 -- lower->UPPER
end--if
end--if (boopen7din) then
break -- to join mark
end--if (booc3block) then
if (booc4c5blk) then
boopen7din = true -- pending flag
if ((numcxa7rel==56) or (numcxa7rel==73) or (numcxa7rel==127)) then
boopen7din = false -- not a letter, we are done
end--if
if (numcxa7rel==120) then
boopen7din = false -- special UC silly "Y" with horizontal doubledot
if (boowan7tlowr) then
numdel7ta = -121 -- UPPER->lower (distant and reversed order)
end--if
end--if
if (boopen7din) then
if (((numcxa7rel>=57) and (numcxa7rel<=73)) or (numcxa7rel>=121)) then
boois7lowr = not boomy7bit0x -- UC odd (misaligned)
else
boois7lowr = boomy7bit0x -- UC even (ordinary align)
end--if
boois7uppr = not boois7lowr
if (boois7uppr and boowan7tlowr) then
numdel7ta = 1 -- UPPER->lower
end--if
if (boois7lowr and booup7cas) then
numdel7ta = -1 -- lower->UPPER
end--if
end--if (boopen7din) then
break -- to join mark
end--if (booc4c5blk) then
if (boocccfblk) then
boois7uppr = ((numcxa7rel>=145) and (numcxa7rel<=171))
boois7lowr = ((numcxa7rel>=177) and (numcxa7rel<=203))
if (boois7uppr and boowan7tlowr) then
numdel7ta = 32 -- UPPER->lower
end--if
if (boois7lowr and booup7cas) then
numdel7ta = -32 -- lower->UPPER
end--if
break -- to join mark
end--if (boocccfblk) then
if (bood0d3blk) then
if (numcxa7rel<=95) then -- messy layout but no exceptions
boois7lowr = (numcxa7rel>=48) -- delta $20 or $50
boois7uppr = not boois7lowr
numdel7abs = 32 -- $20
if ((numcxa7rel<=15) or (numcxa7rel>=80)) then
numdel7abs = 80 -- $50
end--if
end--if
if ((numcxa7rel>=96) and (numcxa7rel<=129)) then -- no exceptions here
boois7lowr = boomy7bit0x -- UC even (ordinary align)
boois7uppr = not boois7lowr
numdel7abs = 1
end--if
if (numcxa7rel>=138) then -- some misaligns here !!!FIXME!!!
boois7lowr = boomy7bit0x -- UC even (ordinary align)
boois7uppr = not boois7lowr
numdel7abs = 1
end--if
if (boois7uppr and boowan7tlowr) then
numdel7ta = numdel7abs -- UPPER->lower
end--if
if (boois7lowr and booup7cas) then
numdel7ta = -numdel7abs -- lower->UPPER
end--if
break -- to join mark
end--if (bood0d3blk) then
break -- finally to join mark -- unknown non-ASCII char is a fact :-(
end--while -- fake loop -- join mark (this is GENE)
if ((numlong7bor==1) and (numdel7ta~=0)) then -- no risk of carry here
strinco7cs = lfgpokestring (strinco7cs,numokt7index,(numcha7r+numdel7ta))
end--if
if ((numlong7bor==2) and (numdel7ta~=0)) then -- no risk of carry here
numdel7car = 0
while true do -- inner genuine loop
if ((numcha7s+numdel7ta)<192) then
break
end--if
numdel7ta = numdel7ta - 64 -- get it down into range 128...191
numdel7car = numdel7car + 1 -- BIG ENDIAN 6 bits with carry
end--while
while true do -- inner genuine loop
if ((numcha7s+numdel7ta)>127) then
break
end--if
numdel7ta = numdel7ta + 64 -- get it up into range 128...191
numdel7car = numdel7car - 1 -- BIG ENDIAN 6 bits with carry
end--while
if (numdel7car~=0) then -- in-place change only if needed
strinco7cs = lfgpokestring (strinco7cs,numokt7index,(numcha7r+numdel7car))
end--if
if (numdel7ta~=0) then -- in-place change only if needed
strinco7cs = lfgpokestring (strinco7cs,(numokt7index+1),(numcha7s+numdel7ta))
end--if
end--if
numokt7index = numokt7index + numlong7bor -- advance in incoming string
end--while -- genuine loop over incoming string (this is GENE)
return strinco7cs
end--function lfucasegene
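-- Hedged usage sketch for "lfucasegene" (sample words are assumptions,
-- results traced by hand from the code above):
--   lfucasegene ("abelujo",true,false)  --> "Abelujo"
--      ASCII "a" (97) gets delta -32, the remaining octets are only copied
--   lfucasegene ("ĉevalo",true,false)   --> "Ĉevalo"
--      $C4,$89 is odd in the $C4-$C5-block thus lowercase, delta -1
--      gives $C4,$88
--   lfucasegene ("ŜAFO",false,true)     --> "ŝafo"
--      $C5,$9C is even thus uppercase, delta 1 gives $C5,$9D, then 3 ASCII
--      letters get delta 32
-- The length in octet:s is the same before and after in all 3 cases.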
------------------------------------------------------------------------
---- HIGH LEVEL STRING FUNCTIONS [I] ----
------------------------------------------------------------------------
-- Local function LFIULTIMINSERT
-- Insert selected substitute strings into request string at given positions
-- with optional discarding if the substitute string is empty. Discarding
-- is protected from access out of range by clamping the distances.
-- Input : * strrekvest -- request string containing placeholders
-- (syntax see below)
-- * tabsubstut -- list with substitute strings using two-letter
-- codes as keys, non-string in the table is safe and
-- has same effect as empty string, still type "nil"
-- or empty string "" are preferred
-- Output : * strhazil
-- Syntax of the placeholder:
-- * "@" followed by 2 uppercase letters and 2 hex numbers, otherwise
-- the hit is not processed, but copied as-is instead
-- * 2 letters select the substitute from table supplied by the caller
-- * 2 hex numbers control discarding left and right (0...15 char:s)
-- Empty item in "tabsubstut" is legal and results in discarding if some of
-- the control numbers is non-ZERO. Left discarding is practically performed
-- on "strhazil" whereas right discarding on "strrekvest" and "numdatainx".
-- If uppercasing or other adjustment is needed, then the caller must
-- take care of it by providing several separate substitute strings with
-- separate names in the table.
-- Depends on functions :
-- [G] lfgtestuc
-- [N] lfnonehextoint
local function lfiultiminsert (strrekvest,tabsubstut)
local varduahuruf = 0
local strhazil = ''
local numdatalen = 0
local numdatainx = 0 -- src index
local numdataoct = 0 -- maybe @
local numdataodt = 0 -- UC
local numdataoet = 0 -- UC
local numammlef = 0 -- hex and discard left
local numammrig = 0 -- hex and discard right
local boogotplejs = false
numdatalen = string.len(strrekvest)
numdatainx = 1 -- ONE-based
while true do -- genuine loop, "numdatainx" is the counter
if (numdatainx>numdatalen) then -- beware of risk of overflow below
break -- done (ZERO iterations possible)
end--if
boogotplejs = false
numdataoct = string.byte(strrekvest,numdatainx,numdatainx)
numdatainx = numdatainx + 1
while true do -- fake loop
if ((numdataoct~=64) or ((numdatainx+3)>numdatalen)) then
break -- no hit here
end--if
numdataodt = string.byte(strrekvest, numdatainx , numdatainx )
numdataoet = string.byte(strrekvest,(numdatainx+1),(numdatainx+1))
if ((not lfgtestuc(numdataodt)) or (not lfgtestuc(numdataoet))) then
break -- no hit here
end--if
numammlef = string.byte(strrekvest,(numdatainx+2),(numdatainx+2))
numammrig = string.byte(strrekvest,(numdatainx+3),(numdatainx+3))
numammlef = lfnonehextoint (numammlef)
numammrig = lfnonehextoint (numammrig)
boogotplejs = ((numammlef~=255) and (numammrig~=255))
break
end--while -- fake loop -- join mark
if (boogotplejs) then
numdatainx = numdatainx + 4 -- consumed 5 char:s, cannot overflow here
varduahuruf = string.char (numdataodt,numdataoet)
varduahuruf = tabsubstut[varduahuruf] -- risk of type "nil"
if (type(varduahuruf)~='string') then
varduahuruf = '' -- type "nil" or invalid type gives empty string
end--if
if (varduahuruf=='') then
numdataoct = string.len(strhazil) - numammlef -- this can underflow
if (numdataoct<=0) then
strhazil = ''
else
strhazil = string.sub(strhazil,1,numdataoct) -- discard left
end--if
numdatainx = numdatainx + numammrig -- discard right this can overflow
else
strhazil = strhazil .. varduahuruf -- insert / expand
end--if
else
strhazil = strhazil .. string.char(numdataoct) -- copy char as-is
end--if (boogotplejs) else
end--while
return strhazil
end--function lfiultiminsert
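------------------------------------------------------------------------
-- Example for LFIULTIMINSERT (inert sketch, not called by the module).
-- The block below re-creates the placeholder expansion with Lua patterns
-- so that it can be pasted into a plain Lua 5.1 console. It is only a
-- sketch under assumptions: "demoinsert" is a hypothetical name, "%u" and
-- "%x" merely stand in for "lfgtestuc" and "lfnonehextoint" (the exact
-- character sets those accept may differ), and the control string
-- "@MT02: @FR00" and the value "eo" are invented for illustration.
--[==[
local function demoinsert (strtemplate, tabsubst)
  local strout = ''
  local numpos = 1 -- ONE-based read position in the template
  while true do
    -- look for "@" + 2 uppercase letters + 2 hex digits
    local numbeg, numend, strkey, strlef, strrig =
      string.find (strtemplate, '@(%u%u)(%x)(%x)', numpos)
    if (numbeg==nil) then
      strout = strout .. string.sub (strtemplate, numpos) -- copy the rest
      break
    end
    strout = strout .. string.sub (strtemplate, numpos, numbeg-1)
    numpos = numend + 1 -- the 5 char:s of the placeholder are consumed
    local varsub = tabsubst[strkey] -- risk of type "nil"
    if ((type(varsub)=='string') and (varsub~='')) then
      strout = strout .. varsub -- insert / expand
    else
      -- nothing to insert: discard char:s left (from the output so far)
      -- and right (from the template), counts given by the 2 hex digits
      local numkeep = string.len (strout) - tonumber (strlef, 16)
      strout = string.sub (strout, 1, math.max (0, numkeep))
      numpos = numpos + tonumber (strrig, 16)
    end
  end
  return strout
end

print (demoinsert ("Kapvorto (@LK00)", {LK="eo"}))      -- Kapvorto (eo)
print (demoinsert ("@MT02: @FR00", {MT="N", FR="sun"})) -- N: sun
print (demoinsert ("@MT02: @FR00", {FR="sun"}))         -- sun
]==]
-- In the last call "MT" is absent, thus the placeholder vanishes together
-- with the 2 char:s ": " to the right of it.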
------------------------------------------------------------------------
-- Local function LFIFINDITEMS
-- Search in string primarily intended for LFIULTIMINSERT.
-- Input : * long string where to search (for example "Kapvorto (@LK00)")
-- * even number of char:s to search for (for example "WCWU")
-- Output : * boolean ("true" if any was found, "false" for our example)
local function lfifinditems (strwhere, strandevenwhat)
local strcxztvaa = ''
local numcxzlen = 0
local numcxzind = 1 -- ONE-based step TWO
local boofoundthecrap = false
numcxzlen = string.len(strandevenwhat)
while true do
if (numcxzind>=numcxzlen) then
break -- not found
end--if
strcxztvaa = '@' .. string.sub(strandevenwhat,numcxzind,(numcxzind+1))
boofoundthecrap = (string.find(strwhere,strcxztvaa,1,true)~=nil)
if (boofoundthecrap) then
break -- found any of them, done
end--if
numcxzind = numcxzind + 2
end--while
return boofoundthecrap
end--function lfifinditems
------------------------------------------------------------------------
---- HIGH LEVEL FUNCTIONS [H0] ----
------------------------------------------------------------------------
-- Local function LFILEFTRIGHT
-- Brew wikilink from 2 elements.
local function lfileftright (strbigleft, strbigright)
local strwikilink = ''
if ((strbigright=='') or (strbigleft==strbigright)) then
strwikilink = strbigleft -- save bloat
else
strwikilink = strbigleft .. '|' .. strbigright -- here genuine wall needed
end--if
strwikilink = '[[' .. strwikilink .. ']]' -- always link
return strwikilink
end--function lfileftright
------------------------------------------------------------------------
-- Local function LFFILLKATON
-- Add one string and maybe one bool to global "qtabktaoj" provided the
-- string is nonempty and not yet in and there is some space left.
-- This function has exclusive write access to "qtabktaoj". Do NOT write
-- to it in any other way except during declaration + initialization.
-- We allow max 16 cat:s from auto split or split control parameter and
-- max 4 cat:s from extra parameter but there is a sum limit of 18.
local function lffillkaton (stritem, boomain)
local numsrchindex = 0
local varpeek = 0
while true do
if (numsrchindex==18) then
break -- no free slot left
end--if
varpeek = qtabktaoj[numsrchindex]
if (varpeek==stritem) then
numsrchindex = 18
break -- already in
end--if
if (varpeek==nil) then
break -- found free slot
end--if
numsrchindex = numsrchindex + 1
end--while
if ((numsrchindex~=18) and (stritem~='')) then -- free slot and nonempty
qtabktaoj[numsrchindex] = stritem
if (boomain) then
qtabktaoj[numsrchindex+20] = true
end--if
end--if
end--function lffillkaton
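------------------------------------------------------------------------
-- Example for LFFILLKATON (inert sketch, not called by the module).
-- The block below shows the slot layout described above: names land in
-- index 0...17, duplicates are ignored, and the main page flag sits 20
-- slots higher. It is a sketch under assumptions: "tabdemo" is a
-- hypothetical stand-in for "qtabktaoj", "demofill" is a hypothetical
-- name, and the category names are invented.
--[==[
local tabdemo = {} -- stand-in for the global result table

local function demofill (stritem, boomain)
  if (stritem=='') then
    return -- empty names are never stored
  end
  for numinx = 0, 17 do
    local varpeek = tabdemo[numinx]
    if (varpeek==stritem) then
      return -- already in
    end
    if (varpeek==nil) then
      tabdemo[numinx] = stritem -- first free slot
      if (boomain) then
        tabdemo[numinx+20] = true -- main page flag
      end
      return
    end
  end
  -- all 18 slots are occupied: the name is silently dropped
end

demofill ('cat A', true)
demofill ('cat B', false)
demofill ('cat A', false) -- duplicate, ignored
print (tabdemo[0], tabdemo[20]) -- cat A   true
print (tabdemo[1], tabdemo[21]) -- cat B   nil
print (tabdemo[2])              -- nil
]==]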
------------------------------------------------------------------------
-- Local function LFHGET345NONIL
-- we read from global "contabktaoj" index 3...5
-- "nummortyyp" mortyp "W" has code 87 and gives index 3 or 4
-- "nummortyyp" mortyp other has code < 87 (ZERO is safe) and gives index 5
-- "boofraazo" can be assigned to "false" if not needed (index 5)
local function lfhget345nonil (nummortyyp, boofraazo)
local strctlstring = ''
local numpiinx = 0 -- temp 3...5
if (nummortyyp==87) then
numpiinx = 3 -- vortgrupo contains "W"
if (boofraazo) then
numpiinx = 4 -- kalimat contains "W"
end--if
else
numpiinx = 5 -- word can contain C I M N P U but obviously not "W"
end--if
strctlstring = contabktaoj[numpiinx] -- pick main data string risk for "nil"
if (type(strctlstring)~='string') then
strctlstring = '' -- fool-proof
end--if
return strctlstring -- can be empty but NOT type "nil"
end--function lfhget345nonil
------------------------------------------------------------------------
---- HIGH LEVEL FUNCTIONS [H5] ----
------------------------------------------------------------------------
-- Local function LFHSPLITAA
-- Perform the automatic multiword split or assisted split controlled
-- by 2 prevalidated tables.
-- Note that the split can sort of fail and return same string, most notably
-- if no split boundaries exist, or some do exist but all are blocked.
-- Counting of the boundaries is tricky. We DO count the suppressed ones but
-- do NOT count multiple consecutive non-letters more than once. Thus the
-- boundaries are between words only and at begin and end, there CANNOT
-- be empty content between 2 boundaries. We usually have 2 faked empty
-- boundaries at begin and end, but they can also be real and count then.
-- For example "AND YES, we !,definit-ely,! can." contains 5 words (that can
-- become 5 output fragments numbered 0...4) and 5 input boundaries
-- (numbered 0...4). In the text "?va?" there are 2 boundaries at begin
-- and end.
-- We need sub "lfiultiminsert" (2 para) and table "contabktaoj"
-- controlling the structure of the cat name. "boomorfium" must be
-- false unless lng in "tabkoudo" is valid and known.
-- Names of the categories are built from "contabktaoj" index 3 (vortgrupo)
-- or 4 (frazo) but here not 5 (vorto, useful for manual split). Categories
-- are brewed only if "boomorfium" is true, the split does not fail, and the
-- individual fragment is not blocked. For example "va" will neither link nor
-- categorize but "va?" will do both. The "#"..."N"-syntax blocks both linking
-- and morpheme categorization (if the latter is enabled otherwise). If
-- linking is blocked for another reason (most notably only 1 fragment
-- generated after the split attempt) then categorization is suppressed too.
-- Input : * "strlemmain" -- input text (pagename)
-- * "tabblokr" -- index 0...15 holes permitted, from "%"
-- * "tablinker" -- index 0...15 holes permitted, from "#"
-- * "boomorfium" -- "true" if compound cat:s are desired
-- * "bookalimat" -- "true" if word class "KA" was specified
-- * "tabkoudo" -- lng stuff ("??" legal but needs "boomorfium")
-- Output : * "stromong" -- wikitext to be sent to screen
-- This function fills global "qtabktaoj" index [0]...[15] with names of
-- morpheme cat:s (index [20]...[35] main page status not used here).
-- Depends on functions :
-- [I] lffillkaton lfiultiminsert
-- [U] lfulnutf8char lfucasegene (generous)
-- [G] lfgpokestring lfgtestuc lfgtestlc lfgtestpuncture
-- [E] mathdiv mathmod mathbittest
-- This sub depends on "HIGH LEVEL FUNCTIONS"\"lfhget345nonil" and
-- "HIGH LEVEL FUNCTIONS"\"lfileftright".
local function lfhsplitaa (strlemmain, tabblokr, tablinker, boomorfium, bookalimat, tabkoudo)
local varrisktabl = 0 -- can be type "nil"
local strfragment = ''
local strfragdext = '' -- right part with visible text (wall not included)
local stromong = '' -- final result
local strkattcty = ''
local strkatoon = '' -- for "lffillkaton"
local numloonginp = 0 -- length of input
local numinxed = 0 -- ZERO-based index of input char:s
local numboundrinp = 0 -- counter of detected boundaries include suppressed
local numoutfrag = 0 -- counter of produced fragments
local numotcot = 0
local numotcet = 0
local numotcuu = 0 -- control code from "tablinker" (ZERO is "nil" ie none)
local boohavechar = false
local booqboueof = false -- combo status: boundary char or end of string
local booprevqbe = false -- previous combo status
local boosuppress = false -- suppress split but still do count the boundary
local boodolnkkat = false -- do link and maybe categorize the fragment
numloonginp = string.len(strlemmain)
while true do
if (numinxed==numloonginp) then
boohavechar = false
booqboueof = true -- copied whole string and end of fragment
boosuppress = false -- last chance, we must output accumulated fragment
else
boohavechar = true -- can be part of word or boundary !!!
numotcot = string.byte (strlemmain,(numinxed+1),(numinxed+1))
numinxed = numinxed + 1 -- ZERO-based
booqboueof = ((numotcot==32) or lfgtestpuncture(numotcot))
boosuppress = (tabblokr[numboundrinp]=="1")
end--if
if (booprevqbe and (booqboueof==false)) then
numboundrinp = numboundrinp + 1 -- count even suppressed boundaries
end--if
booprevqbe = booqboueof -- assign previous status for next round
if (booqboueof and (not boosuppress) and (strfragment~='')) then
strfragdext = strfragment -- visible text right of the wall "|"
boodolnkkat = false -- preassume no link no cat
if ((stromong~='') or boohavechar) then -- avoid selflink to page
varrisktabl = tablinker[numoutfrag] -- can be type "nil"
numotcuu = 0
if (type(varrisktabl)=='string') then
numotcuu = string.byte (varrisktabl,1,1)
end--if
if (numotcuu==73) then -- "I" lowercase
strfragment = lfucasegene (strfragment,false,false)
end--if
if (numotcuu==65) then -- "A" uppercase
strfragment = lfucasegene (strfragment,true,false)
end--if
if (numotcuu==58) then -- ":" explicit replace
strfragment = string.sub (varrisktabl,2,string.len(varrisktabl))
end--if
boodolnkkat = (numotcuu~=78) -- "boodolnkkat" needed below 2 times
end--if ((stromong~='') or boohavechar) then
if (boodolnkkat) then
stromong = stromong .. lfileftright (strfragment,strfragdext) -- wlink
else
stromong = stromong .. strfragment -- add raw fragment no link
end--if
if (boomorfium and boodolnkkat) then
strkattcty = lfhget345nonil (87,bookalimat) -- always "W" thus 5 imposs
numotcet = string.len(strkattcty) -- this is automatic or assisted
if (numotcet>=2) then
tabkoudo["WC"] = nil -- no stupid word class here
tabkoudo["WU"] = nil -- no stupid word class here
tabkoudo["MT"] = nil -- a word does not have any morpheme type
tabkoudo["FR"] = strfragment
strkatoon = lfiultiminsert (strkattcty,tabkoudo)
lffillkaton (strkatoon,false) -- NOT main page -- "qtabktaoj"
end--if (numotcet>=2) then
end--if (boomorfium and boodolnkkat) then
strfragment = ''
numoutfrag = numoutfrag + 1 -- count fragments "lffillkaton" separately
end--if (booqboueof and (not boosuppress) and (strfragment~='')) then
if (boohavechar) then
if (booqboueof and (not boosuppress)) then
stromong = stromong .. string.char(numotcot) -- add non-linkable char
else
strfragment = strfragment .. string.char(numotcot) -- add chr to fragm
end--if
else
break -- done all
end--if
end--while
return stromong
end--function lfhsplitaa
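------------------------------------------------------------------------
-- Example for LFHSPLITAA (inert sketch, not called by the module).
-- The block below shows only the core idea of the automatic multiword
-- split: words are linked, boundary char:s (space and ! , . ; ?) are
-- copied as-is, and a lemma consisting of one single word is not linked
-- to itself. It is a simplified sketch under assumptions: "demoautosplit"
-- is a hypothetical name, and blocking via "%", the control codes from
-- "#" and the category brewing are all left out.
--[==[
local function demoautosplit (strlemma)
  local strout = ''
  local numpos = 1 -- ONE-based
  local numlen = string.len (strlemma)
  while (numpos<=numlen) do
    -- next run of char:s that are neither space nor punctuation
    local numbeg, numend = string.find (strlemma, '[^ !,.;?]+', numpos)
    if (numbeg==nil) then
      strout = strout .. string.sub (strlemma, numpos) -- trailing boundary
      break
    end
    strout = strout .. string.sub (strlemma, numpos, numbeg-1) -- boundary
    local strword = string.sub (strlemma, numbeg, numend)
    if ((numbeg==1) and (numend==numlen)) then
      strout = strout .. strword -- whole lemma is 1 word, avoid selflink
    else
      strout = strout .. '[[' .. strword .. ']]'
    end
    numpos = numend + 1
  end
  return strout
end

print (demoautosplit ('AND YES, we can.')) -- [[AND]] [[YES]], [[we]] [[can]].
print (demoautosplit ('va'))               -- va
print (demoautosplit ('va?'))              -- [[va]]?
]==]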
------------------------------------------------------------------------
-- Local function LFHSPLITMN
-- Perform the manual split controlled by one prevalidated table. Actually
-- the table contains the presplit complete lemma and the pagename is not
-- needed at all. Max 16 fragments can come in, type "F000" does count. We
-- rely on all details being prevalidated (number of fragments, plusses and
-- rectangular brackets, colons and slashes, only valid uppercase letters
-- before colon, legal use of "L:", ...).
-- We need sub "lfiultiminsert" (2 para) and table "contabktaoj"
-- controlling the structure of the cat name. "boomorkat" must be
-- false unless lng in "tabkuodo" is valid and known.
-- Names of the categories are built from "contabktaoj" index 3 (vortgrupo)
-- or 4 (frazo) or 5 (vorto).
-- The source string uses slashes "/" as field separator but the destination
-- string uses walls "|".
-- Omitting deleted characters and dash adding are performed only for
-- fragment type "F210" ie only one field after ":" and no slash "/".
-- Also "L" is permitted for fragment type "F210" only but this is
-- prevalidated. Note that in the early prevalidation step the debracketing
-- for the "sum check" is NOT limited to fragment type "F210".
-- We have to maintain 2 separate fragment counters. For example valid syntax
-- "[M:kung]+a+[M:doeme]" gives 3 input fragments in "tabmnfragoj", but only
-- 2 output fragments in "qtabktaoj", and we want them to have indexes 0
-- and 1, not 0 and 2. The out counter is not explicit, it is the content
-- of "qtabktaoj" processed in "lffillkaton".
-- There is a problem with the wikisyntax, for example "[[no]]pe" will act as
-- "[[no|nope]]" ie the visible link text will continue beyond the bracket
-- and cover the "pe", whereas "[[no]]??" does not trigger such behavior. To
-- prevent this from happening we must add something invisible, and we use
-- "<i></i>".
-- here we DO introduce wikilinks with double brackets and walls
-- here we DO expand "+" to " + " (between fragments)
-- here we DO add dashes to some affixes (fragment type "F210")
-- here we do NOT carry out the "sum check" (done in the prevalidation)
-- Input : * "tabmnfragoj" -- prevalidated presplit table "+[I:bug/BUG]"
-- * "boomorkat" -- "true" if compound cat:s are desired
-- * "bookalymat" -- "true" if word class "KA" was specified
-- * "tabkuodo" -- lng stuff ("??" legal but needs "boomorkat")
-- Output : * "strumung" -- wikitext to be sent to screen
-- This function fills global "qtabktaoj" index [0]...[15] with names of
-- morpheme cat:s (index [20]...[35] main page status not used here).
-- Depends on functions :
-- [H] lfhget345nonil
-- [I] lffillkaton lfiultiminsert
-- [G] lfgtestuc
-- This sub depends on "STRING FUNCTIONS"\"lfidebracket" and
-- "STRING FUNCTIONS"\"lfiremove2bra" and
-- "STRING FUNCTIONS"\"lfiaddthedash" and
-- "HIGH LEVEL FUNCTIONS"\"lfifinditems" and
-- "HIGH LEVEL FUNCTIONS"\"lfileftright".
local function lfhsplitmn (tabmnfragoj, boomorkat, bookalymat, tabkuodo)
local varrysktabl = 0 -- from in table can be type "nil"
local strumung = '' -- final result with links
local strwalzleft = ''
local strwallrght = ''
local strwallcatg = '' -- same as "strwalzleft" unless "L"-trick is used
local strkattctx = ''
local strkatton = '' -- for "lffillkaton"
local numinnfrog = 0 -- counter in "tabmnfragoj" type "F000" does count
local numlenfrago = 0 -- ONE-based last valid index
local numivnxed = 0 -- ONE-based index of char:s inside fragment
local numcuaar = 0
local numcuabr = 0 -- +1
local numcuacr = 0 -- +2
local numcom1of79z = 0 -- 0 | 67 C 73 I 76 L 77 M 78 N 80 P 85 U | 87 W
local booeldtrick = false -- true for the "L"-trick giving type "N"
local booright = false -- false left | true right
local boohavecolon = false
local boo210magic = false -- enhance and strip then
local booneedmor = false
while true do -- outer loop counts fragments in table
booeldtrick = false -- separate verdict for every fragment
boohavecolon = false -- separate verdict for every fragment
boo210magic = false -- separate verdict for every fragment
numcom1of79z = 0 -- default none, separate verdict for every fragment
varrysktabl = tabmnfragoj [numinnfrog] -- can be type "nil" !!!
numinnfrog = numinnfrog + 1
if (type(varrysktabl)~='string') then
break -- give up on "nil"
end--if
numlenfrago = string.len (varrysktabl) -- cannot be empty
numivnxed = 1 -- ONE-based
numcuaar = string.byte (varrysktabl,1,1)
if (numcuaar==43) then
numivnxed = 2 -- ONE-based skip the "+" even for type "F000" far below
strumung = strumung .. ' + ' -- add the spaces here
numcuaar = string.byte (varrysktabl,2,2) -- pick new char cannot be "+"
end--if
if (numcuaar==91) then -- bracketed []-fragment processed char-by-char
numivnxed = numivnxed + 1 -- now at least 2
strwalzleft = ''
strwallrght = ''
booright = false
numcuabr = 0
numcuacr = 0 -- minimal fe "[M:x]" 5 char:s 1...5 or 2...6
if ((numlenfrago-numivnxed)>=3) then
numcuabr = string.byte (varrysktabl,numivnxed,numivnxed)
numcuacr = string.byte (varrysktabl,(numivnxed+1),(numivnxed+1))
end--if
if ((numcuacr==58) and lfgtestuc(numcuabr)) then
numcom1of79z = numcuabr -- "numcuabr" is prevalidated ;-)
numivnxed = numivnxed + 2 -- eat it away too
boohavecolon = true -- fragment type "F210" or "F211"
if (numcom1of79z==76) then
booeldtrick = true -- fe "fer(o)" -> link "fero" and categ "fer"
numcom1of79z = 78 -- "L" -> "N"
end--if
end--if
while true do -- inner loop counts char:s in a bracketed []-fragment
if (numivnxed>=numlenfrago) then
break -- skip trailing ']' guaranteed to exist
end--if
numcuaar = string.byte (varrysktabl,numivnxed,numivnxed)
if (booright) then
strwallrght = strwallrght .. string.char (numcuaar) -- wall NOT po
else
if (numcuaar==47) then
booright = true -- source separating slash "/"
else
strwalzleft = strwalzleft .. string.char (numcuaar)
end--if
end--if
numivnxed = numivnxed + 1
end--while
if (strwallrght=='') then
strwallrght = strwalzleft -- type "F200" or "F210"
boo210magic = boohavecolon -- magic qualifies only if type is F210
end--if
if (boo210magic) then -- try enhance left fe "il" -> "-il-"
if (numcom1of79z==80) then
strwalzleft = lfiaddthedash (strwalzleft,false,true) -- P
end--if
if (numcom1of79z==85) then
strwalzleft = lfiaddthedash (strwalzleft,true,false) -- U
end--if
if (numcom1of79z==73) then
strwalzleft = lfiaddthedash (strwalzleft,true,true) -- I
end--if
end--if
strwallcatg = strwalzleft -- seize after enhancing before stripping
if (boo210magic) then -- always strip but in various ways
strwalzleft = lfiremove2bra (strwalzleft) -- link "kac(o)" -> "kaco"
if (booeldtrick) then -- "L" -> "N"
strwallcatg = lfidebracket (strwallcatg,true,1) -- for the category
else
strwallcatg = lfiremove2bra (strwallcatg) -- for the category
end--if
end--if
strumung = strumung .. lfileftright (strwalzleft,strwallrght) .. '<i></i>' -- always link
if (boomorkat and (numcom1of79z~=0)) then
strkattctx = lfhget345nonil (numcom1of79z,bookalymat) -- 3 or 4 or 5
numcuaar = string.len(strkattctx) -- this is the manual split
if (numcuaar>=2) then
booneedmor = lfifinditems(strkattctx,"MT") -- need it ??
tabkuodo["WC"] = nil -- no stupid word class here
tabkuodo["WU"] = nil -- no stupid word class here
if (booneedmor) then
tabkuodo["MT"] = string.char(numcom1of79z) -- morpheme type
else
tabkuodo["MT"] = nil -- no morpheme type here
end--if
tabkuodo["FR"] = strwallcatg -- fragment or word
strkatton = lfiultiminsert (strkattctx,tabkuodo)
lffillkaton (strkatton,false) -- NOT main page -- "qtabktaoj"
end--if (numcuaar>=2) then
end--if (boomorkat and (numcom1of79z~=0)) then
else
strumung = strumung .. string.sub (varrysktabl,numivnxed,numlenfrago) -- copy type F000 as-is
end--if (numcuaar==91) else
end--while
return strumung
end--function lfhsplitmn
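------------------------------------------------------------------------
-- Example for LFHSPLITMN (inert sketch, not called by the module).
-- The block below shows how one presplit fragment could be turned into
-- wikitext: "+" becomes " + ", slash "/" becomes wall "|", and "<i></i>"
-- is appended after every link. It is a simplified sketch under
-- assumptions: "demofragment" is a hypothetical name, the table layout
-- with ZERO-based indexes and leading "+" is guessed from the comments
-- above, and dash adding, debracketing, the "L"-trick and the category
-- brewing are left out (thus only type "M" is used in the demo).
--[==[
local function demofragment (strfrag)
  local strout = ''
  if (string.sub (strfrag,1,1)=='+') then
    strout = ' + ' -- add the spaces here
    strfrag = string.sub (strfrag,2)
  end
  local strinner = string.match (strfrag, '^%[(.*)%]$')
  if (strinner==nil) then
    return strout .. strfrag, nil -- type "F000" copied as-is
  end
  local strtype, strrest = string.match (strinner, '^(%u):(.*)$')
  if (strtype==nil) then
    strrest = strinner -- no morpheme type given
  end
  local strleft, strright = string.match (strrest, '^([^/]*)/(.*)$')
  if (strleft==nil) then
    strleft = strrest -- no slash, link target and visible text are same
    strright = strrest
  end
  local strlink = strleft
  if ((strright~='') and (strleft~=strright)) then
    strlink = strleft .. '|' .. strright
  end
  return strout .. '[[' .. strlink .. ']]<i></i>', strtype
end

local tabdemo = {[0]='[M:kung]', [1]='+a', [2]='+[M:doeme]'}
local strall = ''
for numinx = 0, 2 do
  local strpiece = demofragment (tabdemo[numinx]) -- 2nd value dropped
  strall = strall .. strpiece
end
print (strall) -- [[kung]]<i></i> + a + [[doeme]]<i></i>
]==]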
------------------------------------------------------------------------
-- Local function LFHSPLITSI
-- Perform the simple root split (3, "$S") or simple bare
-- root (4, "$B") strategy. Pagename is needed.
-- $S simple root split suno -> sun + [-o/o] kat "N!sun" + "U:-o"
-- Suno -> [suno/Sun] + [-o/o] kat "N!sun" + "U:-o"
-- $B simple bare root sun -> sun kat "M!sun"
-- Sun -> [sun/Sun] kat "M!sun"
-- $B simple bare root NR # -> # kat "N!#"
-- ("#" represents a Chinese letter)
-- Note that for $S the mortyp is always "N" (nonstandalone) whereas
-- for $B it can be either "M" (standalone) or "N".
-- We need sub "lfiultiminsert" (2 para) and table "contabktaoj"
-- controlling the structure of the cat name. "bookomdez" must be
-- false unless lng in "tablngbah" is valid and known.
-- Names of the categories are built from "contabktaoj" index 5 (vorto).
-- Input : * "strhalaman" -- input lemma ie pagename
-- * "numkodsplit" -- 3 or 4 for $S or $B
-- * "bookomdez" -- "true" if compound cat:s are desired at all
-- * "boonitro" -- "true" if word class is NR
-- * "tablngbah" -- lng stuff ("??" legal but needs "bookomdez")
-- Output : * "strymyng" -- wikitext to be sent to screen
-- This function fills global "qtabktaoj" index [0]...[15] with names of
-- morpheme cat:s and maybe index [20]...[35] with main page status.
-- In fact only one index (probably [20]) can receive the "true" here.
-- Depends on functions :
-- [H] lfhget345nonil
-- [I] lffillkaton lfiultiminsert
-- [U] lfulnutf8char lfucasegene (generous)
-- [G] lfgpokestring lfgtestuc lfgtestlc
-- [E] mathdiv mathmod mathbittest
local function lfhsplitsi (strhalaman, numkodsplit, bookomdez, boonitro, tablngbah)
local strtakkctx = '' -- contabktaoj[5] index 5 is hardcoded
local strymyng = '' -- screen
local strlover = '' -- brewed from "strhalaman" : "Suno" -> "suno"
local strnolast = '' -- brewed from "strhalaman" : "Suno" -> "Sun"
local strnolaslow = '' -- brewed from "strlover" : "Suno" -> "sun"
local strkatroot = ''
local strcatoton = ''
local nummortyp = 0 -- 77 "M" or 78 "N" only
local numdewsx = 0
local numlasst = 0 -- last char of lemma or ZERO if not separated
local numcauar = 0
local numcaubr = 0
local booindeedlow = false
numdewsx = string.len (strhalaman)
strlover = lfucasegene(strhalaman,false,false)
booindeedlow = (strlover==strhalaman)
numlasst = 0 -- needed far below
nummortyp = 77 -- "M"
if (boonitro or (numkodsplit==3)) then
nummortyp = 78 -- "N"
end--if
if (numkodsplit==3) then
strnolast = string.sub (strhalaman,1,(numdewsx-1)) -- cut off last char
strnolaslow = string.sub (strlover,1,(numdewsx-1)) -- cut off & lowercase
numlasst = string.byte (strhalaman,numdewsx,numdewsx) -- needed far below
if (booindeedlow) then
strymyng = strnolast -- as-is lowercase
else
strymyng = '[[' .. strlover .. '|' .. strnolast .. ']]' -- link
end--if
strymyng = strymyng .. ' + [[-' .. string.char(numlasst) .. '|' .. string.char(numlasst) .. ']]'
strkatroot = strnolaslow -- $S
end--if
if (numkodsplit==4) then
if (booindeedlow) then
strymyng = strhalaman -- as-is lowercase
else
strymyng = '[[' .. strlover .. '|' .. strhalaman .. ']]' -- link
end--if
strkatroot = strlover -- $B
end--if
if (bookomdez) then
strtakkctx = lfhget345nonil (0,false) -- pick main data string 5 hardcoded
numcauar = string.len(strtakkctx) -- simple "strtakkctx" can be used twice
if (numcauar>=2) then
tablngbah["WC"] = nil -- no stupid word class here
tablngbah["WU"] = nil -- no stupid word class here
tablngbah["MT"] = string.char(nummortyp)
tablngbah["FR"] = strkatroot
strcatoton = lfiultiminsert (strtakkctx,tablngbah)
lffillkaton (strcatoton,true) -- YES main page -- "qtabktaoj"
if (numlasst~=0) then
tablngbah["MT"] = 'U' -- last letter is suffix "U"
tablngbah["FR"] = '-' .. string.char(numlasst)
strcatoton = lfiultiminsert (strtakkctx,tablngbah)
lffillkaton (strcatoton,false) -- NOT main page -- "qtabktaoj"
end--if
end--if (numcauar>=2) then
end--if (bookomdez) then
return strymyng
end--function lfhsplitsi
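------------------------------------------------------------------------
-- Example for LFHSPLITSI (inert sketch, not called by the module).
-- The block below reproduces the 4 cases listed in the header comment
-- ("suno", "Suno", "sun", "Sun"). It is a sketch under assumptions:
-- "demosimplesplit" is a hypothetical name, "string.lower" (ASCII only)
-- stands in for "lfucasegene", the last letter is assumed to be one
-- single octet, and the category brewing is left out.
--[==[
local function demosimplesplit (strpage, numstrategy)
  local strlow = string.lower (strpage) -- ASCII only
  local strshown = strpage -- visible text of the root
  local strroot = strlow -- root for the category name
  local strtail = '' -- split off last letter, only for strategy 3
  if (numstrategy==3) then
    strtail = string.sub (strpage,-1)
    strshown = string.sub (strpage,1,-2)
    strroot = string.sub (strlow,1,-2)
  end
  local strout = strshown -- lowercase lemma stays unlinked
  if (strlow~=strpage) then
    strout = '[[' .. strlow .. '|' .. strshown .. ']]' -- link to lowercase
  end
  if (strtail~='') then
    strout = strout .. ' + [[-' .. strtail .. '|' .. strtail .. ']]'
  end
  return strout, strroot
end

local strwiki, strroot = demosimplesplit ('Suno', 3)
print (strwiki) -- [[suno|Sun]] + [[-o|o]]
print (strroot) -- sun
strwiki = demosimplesplit ('suno', 3)
print (strwiki) -- sun + [[-o|o]]
strwiki = demosimplesplit ('Sun', 4)
print (strwiki) -- [[sun|Sun]]
strwiki = demosimplesplit ('sun', 4)
print (strwiki) -- sun
]==]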
------------------------------------------------------------------------
-- Local function LFSPLITLALE
-- Perform the large letter split (5, "$H").
-- The lemma is split into single letters. This is most useful for, but
-- not restricted to, Chinese characters. Note that for this split the
-- mortyp is always "M" (standalone). Use manual split for other cases.
-- We need sub "lfiultiminsert" (2 para) and table "contabktaoj"
-- controlling the structure of the cat name. "bookomdoz" must be
-- false unless lng in "tablngbaih" is valid and known.
-- Names of the categories are built from "contabktaoj" index 5 (vorto).
-- Input : * "strhilaman" -- input lemma ie pagename
-- * "bookomdoz" -- "true" if compound cat:s are desired at all
-- * "tablngbaih" -- lng stuff ("??" legal but needs "bookomdoz")
-- Output : * "strygyng" -- wikitext to be sent to screen
-- This function fills global "qtabktaoj" index [0]...[15] with names of
-- morpheme cat:s (index [20]...[35] main page status not used here).
-- Depends on functions :
-- [H] lfhget345nonil
-- [I] lfiultiminsert
-- [U] lfulnutf8char
-- This sub depends on "HIGH LEVEL FUNCTIONS"\"lffillkaton".
local function lfsplitlale (strhilaman, bookomdoz, tablngbaih)
local strtookctj = '' -- contabktaoj[5] index 5 is hardcoded
local strygyng = '' -- screen
local strbeexess = ''
local strcatatan = ''
local numinwwlen = 0
local numwwindex = 1 -- ONE-based
local numwwchar = 0
local numwwlen = 0
numinwwlen = string.len(strhilaman)
while true do -- genuine loop, counter is "numwwindex" step 1...4
if (numwwindex>numinwwlen) then
break -- done (risk of overflow)
end--if
numwwchar = string.byte (strhilaman,numwwindex,numwwindex)
numwwlen = lfulnutf8char (numwwchar)
if (numwwlen==0) then
strygyng = strhilaman -- this is criminal
break -- some compound cat:s may be left behind :-(
end--if
strbeexess = string.sub (strhilaman,numwwindex,(numwwindex+numwwlen-1))
if (strygyng~='') then
strygyng = strygyng .. ' + '
end--if
strygyng = strygyng .. '[[' .. strbeexess .. ']]'
if (bookomdoz) then
strtookctj = lfhget345nonil (0,false) -- pick main data string 5 hardco
numwwchar = string.len(strtookctj) -- this is large letter split
if (numwwchar>=2) then
tablngbaih['WC'] = nil -- no stupid word class here
tablngbaih['WU'] = nil -- no stupid word class here
tablngbaih['MT'] = 'M'
tablngbaih['FR'] = strbeexess
strcatatan = lfiultiminsert (strtookctj,tablngbaih)
lffillkaton (strcatatan,false) -- NOT main page -- "qtabktaoj"
end--if (numwwchar>=2) then
end--if (bookomdoz) then
numwwindex = numwwindex + numwwlen -- step 1...4 risk of overflow
end--while
return strygyng
end--function lfsplitlale
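------------------------------------------------------------------------
-- Example for LFSPLITLALE (inert sketch, not called by the module).
-- The block below shows the stepping over UTF-8 letters (1...4 octets
-- each) that the large letter split relies on. It is a sketch under
-- assumptions: "demoutf8len" and "demolettersplit" are hypothetical
-- names, "demoutf8len" only guesses at what "lfulnutf8char" does (length
-- from the lead octet, ZERO if invalid), and the category brewing is
-- left out.
--[==[
local function demoutf8len (numbyte)
  if (numbyte<128) then return 1 end -- ASCII
  if ((numbyte>=194) and (numbyte<=223)) then return 2 end
  if ((numbyte>=224) and (numbyte<=239)) then return 3 end
  if ((numbyte>=240) and (numbyte<=244)) then return 4 end
  return 0 -- continuation octet or invalid lead octet
end

local function demolettersplit (strpage)
  local strout = ''
  local numinx = 1 -- ONE-based
  local numlen = string.len (strpage)
  while (numinx<=numlen) do
    local numstep = demoutf8len (string.byte (strpage,numinx))
    if (numstep==0) then
      return strpage -- broken UTF-8, give up and return the lemma as-is
    end
    if (strout~='') then
      strout = strout .. ' + '
    end
    strout = strout .. '[[' .. string.sub (strpage,numinx,numinx+numstep-1) .. ']]'
    numinx = numinx + numstep -- step 1...4
  end
  return strout
end

print (demolettersplit ('abc')) -- [[a]] + [[b]] + [[c]]
-- 2 Chinese letters of 3 octets each (U+4E2D U+6587) give 2 links
print (demolettersplit ('\228\184\173\230\150\135'))
]==]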
------------------------------------------------------------------------
---- VARIABLES [R] ----
------------------------------------------------------------------------
function exporttable.ek (arxframent)
-- general unknown type
local vartamp = 0 -- variable without type
-- special type "args" AKA "arx"
local arxspecial = 0 -- from module
-- general tab in from caller ("qtabktaoj" is elsewhere)
local tabbluck = {} -- from "%"-syntax assi
local tablynx = {} -- from "#"-syntax assi
local tabmnfrags = {} -- for manual split
local tabextfriig = {} -- from extra parameter
local tablngdbl = {} -- double-letter indexes
-- general str ("qstrtrace" is elsewhere)
local strkaatctl = '' -- picked from "contabktaoj" via "lfhget345nonil"
local strlemmain = '' -- lemma in
local strlemmaut = '' -- bold lemma (maybe split) out
local strutmp = '' -- temp
-- general num
local numsplyt = 0 -- split strategy (0 auto 1 assisted 2 manu 3 srs 4 sbr 5 lale 7 none)
local numerr = 0
local numtamp = 0
local numoct = 0
local numodt = 0
local numlindex = 0
-- general boo from caller
local boocatdesir = false
local booexteval = false -- true if we got the extra parameter
local boohavnyrr = false -- true if we got "NR"
local boohavkall = false -- true if we got "KA"
local bootimp = false -- true if the control string asks for "MT"
------------------------------------------------------------------------
---- MAIN [Z] ----
------------------------------------------------------------------------
---- ASSIGN AND BOAST ----
lfdtracemsg ('This is "msplitter" submodule') -- unconditional
---- GET THE ARX ----
arxspecial = arxframent.args
while true do -- fake loop
if (type(arxspecial)~='table') then
lfdtracemsg ('Overall bad data type') -- "qstrtrace"
numerr = 2 -- #E02
break
end--if
boocatdesir = arxspecial[ 0]
strlemmain = arxspecial[ 1]
numsplyt = arxspecial[ 2]
if ((type(boocatdesir)~='boolean') or (type(strlemmain)~='string') or (type(numsplyt)~='number')) then
lfdtracemsg ('Index 0...2 bad data type') -- "qstrtrace"
numerr = 3 -- #E03
break
end--if
tabbluck = arxspecial[ 3]
tablynx = arxspecial[ 4]
tabmnfrags = arxspecial[ 5]
tabextfriig = arxspecial[ 6]
booexteval = arxspecial[ 7] -- boolean between tables !!!
tablngdbl = arxspecial[ 8]
boohavnyrr = arxspecial[ 9] -- NR
boohavkall = arxspecial[10] -- KA
boodetrc = (arxspecial[15]==true)
if ((type(booexteval)~='boolean') or (type(tablngdbl)~='table') or (type(boohavnyrr)~='boolean') or (type(boohavkall)~='boolean')) then
lfdtracemsg ('Index 7...10 bad data type') -- "qstrtrace"
numerr = 4 -- #E04
end--if
break
end--while -- fake loop
if (numerr==0) then
lfdtracemsg ('Incoming table OK')
end--if
---- SPLIT THE LEMMA IF NEEDED ----
-- process from "strlemmain" (already guaranteed to be
-- non-empty) to "strlemmaut" (actually NOT for manual split)
-- "numsplyt" : 0 auto 1 assisted 2 manu 3 srs 4 sbr 5 lale 7 none
-- we skip the split and copy only if:
-- * "numsplyt" is 7 (#S7 no split)
-- punctuation (5 char:s: ! , . ; ?) 21 33 | 2C 44 | 2E 46 | 3B 59 | 3F 63
-- dash "-" and apo "'" do NOT count as punctuation (for auto and assisted)
-- we depend on "boocatdesir" (they can switch off some cat:s)
-- we depend on "boohavkall" (switches between "vortgrupo" and "frazo")
-- "qtabktaoj" is very global
-- 0...17 cat names without "Category:" prefix, unused "nil"
-- 20...37 "true" if main page, otherwise "nil"
-- "lfhsplitaa" and "lfhsplitmn" and "lfhsplitsi" and "lfsplitlale" will fill
-- it (via "lffillkaton") and below more content comes from extra parameter
if (numerr==0) then
if (numsplyt<2) then -- ZERO or ONE -> auto or assisted #S0 #S1
strlemmaut = lfhsplitaa (strlemmain, tabbluck, tablynx, boocatdesir, boohavkall, tablngdbl)
end--if
if (numsplyt==2) then -- 2 -> manu #S2
strlemmaut = lfhsplitmn (tabmnfrags, boocatdesir, boohavkall, tablngdbl)
end--if
if ((numsplyt==3) or (numsplyt==4)) then -- 3 4 -> simple #S3 #S4
strlemmaut = lfhsplitsi (strlemmain, numsplyt, boocatdesir, boohavnyrr, tablngdbl)
end--if
if (numsplyt==5) then -- 5 -> lale #S5
strlemmaut = lfsplitlale (strlemmain, boocatdesir, tablngdbl)
end--if
if (numsplyt==7) then -- 7 -> no split #S7
strlemmaut = strlemmain -- no split, "strlemmaut" needed for visible part
end--if
end--if
---- BREW UP TO 4 EXTRA CATEGORIES ----
-- from extra parameter sent to us in "tabextfriig" and "booexteval"
-- with "booexteval" true prevalidated morphemes are in
-- "tabextfriig" incl prefix fe "C:" or "M!", the caller
-- converts possible "&"-syntax to 1 or 2 fragments
-- with "booexteval" false the extra parameter was empty and
-- we do nothing here
if ((numerr==0) and boocatdesir and booexteval) then
numlindex = 0
while true do
vartamp = tabextfriig[numlindex] -- risk of type "nil"
if (type(vartamp)=='string') then
numoct = string.byte(vartamp,1,1) -- C I M N P U W
numodt = string.byte(vartamp,2,2) -- ":" 58 or "!" 33
numtamp = string.len (vartamp)
strutmp = string.sub (vartamp,3,numtamp) -- prevalidated morpheme string
strkaatctl = lfhget345nonil (numoct,boohavkall) -- pick main data str
numtamp = string.len(strkaatctl) -- this is main brewing 4 extra cat:s
if (numtamp>=2) then
bootimp = lfifinditems(strkaatctl,"MT") -- need it ??
tablngdbl["WC"] = nil -- no stupid word class here
tablngdbl["WU"] = nil -- no stupid word class here
if (bootimp) then
tablngdbl["MT"] = string.char(numoct) -- morpheme type
else
tablngdbl["MT"] = nil -- no morpheme type here
end--if
tablngdbl["FR"] = strutmp
strutmp = lfiultiminsert (strkaatctl,tablngdbl)
lffillkaton (strutmp,(numodt==33)) -- MAYBE main page -- "qtabktaoj"
end--if (numtamp>=2) then
else
break -- abort at type "nil"
end--if (type(vartamp)=='string') else
numlindex = numlindex + 1
end--while
end--if
---- PREPARE RETURN ----
if (numerr~=0) then
strlemmaut = "//" -- still use qtabktaoj [52] to check status
end--if
qtabktaoj [50] = strlemmaut -- unconditionally
qtabktaoj [51] = qstrtrace -- unconditionally, cannot be empty
qtabktaoj [52] = numerr -- unconditionally
---- RETURN THE RESULT TABLE ----
return qtabktaoj
end--function
---- RETURN THE JUNK LUA TABLE ----
return exporttable