Discussion:
[tex-hyphen] Hyphenation patterns for Belarusian
Maksim Salau
2016-08-28 00:29:36 UTC
Hello,

Recently I found hyphenation patterns in the LibreOffice extension [1]. According to discussions on this list, it is possible to use patterns from (Libre|Open)Office since the hyphenation engine is the same (or almost the same).
The steps seem pretty straightforward:
1. Put the patterns in tex/generic/hyph-utf8/patterns/tex/hyph-<lang>.tex
2. Describe the language in source/generic/hyph-utf8/languages.rb
3. Adjust source/generic/hyph-utf8/generate-pattern-loaders.rb if the patterns are UTF-8 only
4. Generate the loader etc.
5. Install and use.

My steps 1-4 can be found on GitHub [2] (last 2 commits).
I'm having trouble with step 5 :(

I tried to create a deb file similar to texlive-lang-cyrillic_2014.20141024-1_all.deb:
* /etc/texmf/hyphen.d/10texlive-lang-belarusian.cnf with the language specification:
name=belarusian file=loadhyph-be.tex patterns=hyph-be.pat.txt lefthyphenmin=2 righthyphenmin=2 exceptions=
* empty /etc/texmf/fmt.d/10texlive-lang-belarusian.cnf
* patterns in /usr/share/texlive/texmf-dist/tex/generic/hyph-utf8
* '10texlive-lang-belarusian' in /var/lib/tex-common/hyphen-cnf/texlive-lang-belarusian.list and /var/lib/tex-common/fmtutil-cnf/texlive-lang-belarusian.list

language.dat is regenerated during installation, but fmtutil-sys is not happy. It complains:

$ sudo fmtutil-sys --all
fmtutil: running `luatex -ini -jobname=luatex -progname=luatex luatex.ini' ...
This is LuaTeX, Version beta-0.79.1 (TeX Live 2015/dev/Debian) (rev 4971) (INITEX)
restricted \write18 enabled.
(/usr/share/texlive/texmf-dist/tex/plain/config/luatex.ini
(/usr/share/texlive/texmf-dist/tex/generic/config/luatexiniconfig.tex)
(/usr/share/texlive/texmf-dist/tex/generic/config/luatex-unicode-letters.tex
loading Unicode properties)
(/usr/share/texlive/texmf-dist/tex/plain/config/pdfetex.ini
(/usr/share/texlive/texmf-dist/tex/generic/config/pdftexconfig.tex
(/var/lib/texmf/tex/generic/config/pdftexconfig-paper.tex))
(/usr/share/texlive/texmf-dist/tex/luatex/hyph-utf8/etex.src
(/usr/share/texlive/texmf-dist/tex/plain/base/plain.tex
Preloading the plain format: codes, registers, parameters, fonts, more fonts,
macros, math definitions, output routines, hyphenation
(/usr/share/texlive/texmf-dist/tex/generic/hyphen/hyphen.tex
[skipping from \patterns to end-of-file...]))
(/usr/share/texlive/texmf-dist/tex/plain/etex/etexdefs.lib
Skipping module "grouptypes"; Loading module "interactionmodes";
Skipping module "nodetypes"; Skipping module "iftypes";)
(/var/lib/texmf/tex/generic/config/language.def
(/usr/share/texlive/texmf-dist/tex/generic/hyphen/hyphen.tex)
(/usr/share/texlive/texmf-dist/tex/generic/hyph-utf8/loadhyph/loadhyph-be.tex
UTF-8 Belarusian hyphenation patterns
(/usr/share/texlive/texmf-dist/tex/generic/hyph-utf8/patterns/tex/hyph-be.tex
! Conflicting pattern ignored.
l.6024 }

?
! Emergency stop.
l.6024 }

! ==> Fatal error occurred, no output PDF file produced!
Transcript written on luatex.log.

Is there any way to make it more verbose? Or debug the issue somehow?

Also, please clarify the usage of quotes for me. There are 3 symbols used in hyph-be.tex: ' ` ’
I suspect this can confuse the engine, since generate-plain-patterns.rb checks only for the first one and converts it to the third one to populate hyph-quote-<lang>.tex
What is the official position on quotes? Should one use only ' and *TeX will do the rest, or are other symbols allowed too?

The third issue with these patterns is the T2A encoding. The U+2019 symbol (the third quote in the list above) makes conversion impossible, since the symbol is not mapped in the converter. I tried to enable it in t2a.dat and regenerate the converter, but it fails with the message: The encoding t2a uses more than two bytes to encode characters.

Thanks in advance,
Maksim Salau.

[1] http://extensions.libreoffice.org/extension-center/belarusian-dictionary-spelling-hyphenation-official-orthography-2008
[2] https://github.com/msalau/hyph-utf8-belarusian/tree/belarusian
Arthur Reutenauer
2016-08-28 14:12:48 UTC
Hi Maksim,

First of all thank you for your efforts, although I would say you’re
trying to do a little too much at this stage; I’ll explain why at the
end.
Post by Maksim Salau
! Conflicting pattern ignored.
l.6024 }
?
! Emergency stop.
l.6024 }
! ==> Fatal error occurred, no output PDF file produced!
Transcript written on luatex.log.
Is there any way to make it more verbose? Or debug the issue somehow?
You can’t really make it more verbose with LuaTeX, but debugging the
issue is easy: conflicting patterns (called “duplicate patterns” by
XeTeX and other engines) are patterns where the underlying character
strings are the same, for example a1b and a2b. If you generate formats
for XeTeX instead of LuaTeX, it gives you the exact line number where
the offending pattern is found -- i. e., the second occurrence, which
should help you find the first one.
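
The check described above can be sketched outside TeX as well. This is only an illustration in Python, not what XeTeX or any library in this thread actually runs; the function name find_conflicts is made up for the example. It strips the digits from each pattern and groups patterns that share the same letter string (exact repeats are grouped too):

```python
import re
from collections import defaultdict


def find_conflicts(patterns):
    """Group patterns by their letters (digits stripped); keep groups of 2 or more."""
    by_letters = defaultdict(list)
    for pattern in patterns:
        letters = re.sub(r"[0-9]", "", pattern)
        by_letters[letters].append(pattern)
    return {letters: group for letters, group in by_letters.items() if len(group) > 1}


# A few patterns of the kind discussed in this thread.
sample = ["а1й", "а8й", "б1ь", "б8ь", "1’2а", "’3а", "д2ж"]
for letters, group in find_conflicts(sample).items():
    print(letters, "->", " ".join(group))
```

Each reported group is a set of patterns that TeX would see as competing definitions for the same letters.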

Using that technique I found a number of conflicts such as б1ь and
б8ь, в1ь and в8ь, as well as а1й and а8й, а1ў and а8ў, and the more
intriguing pairs 1’2а and ’3а, 1’2е and ’3е, etc. This makes me suspect
that the patterns haven’t been developed with great care.
Post by Maksim Salau
Also, please clarify the usage of quotes for me. There are 3 symbols used in hyph-be.tex: ' ` ’
I suspect this can confuse the engine, since generate-plain-patterns.rb checks only for the first one and converts it to the third one to populate hyph-quote-<lang>.tex
What is the official position on quotes? Should one use only ' and *TeX will do the rest, or are other symbols allowed too?
Any symbol is allowed in a hyphenation pattern for TeX as long as you
set its \lccode correctly, which is done in a file called
unicode-letters.def, or later within hyph-utf8. If the characters don’t
have a correct \lccode, you get an error from TeX saying “Non-letter”,
and since you’re not reporting anything like that, your system seems to
be set up correctly from that point of view.

However, TeX won’t treat the different types of apostrophes in any
special way; there are no equivalence tables or anything like that. To
the engine, the different Unicode characters for the apostrophe are
simply that, different characters. We enforce equivalences such as the
one between ' and ’ by duplicating every pattern containing an
apostrophe and putting it in the hyph-quote-* files as you’ve seen, so
in your case we could do that by putting all patterns with ` and ’ in
hyph-quote-be.tex, and the patterns with ' in the main file. We can
update the Ruby scripts to do that.

The reason for having only one type of apostrophe in the main file
(hyph-be.tex) is so that other programs that have a notion of
equivalence won’t get confused; this is not about TeX (at least not
about UTF-8 TeX, see below).
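
As a rough illustration of that split (a sketch only; the real work is done by the hyph-utf8 Ruby scripts, and the function name split_apostrophes is invented here), one could normalise every apostrophe variant to ' for the main file and then derive the ` and ’ copies for the quote file. Patterns that mix two different apostrophe characters collapse to a single form in this simplified version:

```python
ASCII_APOSTROPHE = "'"
OTHER_APOSTROPHES = ["`", "\u2019"]  # grave accent and U+2019


def split_apostrophes(patterns):
    """Return (main, quote): main uses only ', quote holds the ` and U+2019 copies."""
    main, seen = [], set()
    for pattern in patterns:
        normalised = pattern
        for variant in OTHER_APOSTROPHES:
            normalised = normalised.replace(variant, ASCII_APOSTROPHE)
        if normalised not in seen:  # the three variants collapse into one entry
            seen.add(normalised)
            main.append(normalised)
    quote = [
        pattern.replace(ASCII_APOSTROPHE, variant)
        for pattern in main
        if ASCII_APOSTROPHE in pattern
        for variant in OTHER_APOSTROPHES
    ]
    return main, quote


main, quote = split_apostrophes(["б8'", "б8`", "б8\u2019", "а1й"])
print(main)   # candidates for hyph-be.tex
print(quote)  # candidates for hyph-quote-be.tex
```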
Post by Maksim Salau
The third issue with these patterns is the T2A encoding. The U+2019 symbol (the third quote in the list above) makes conversion impossible, since the symbol is not mapped in the converter. I tried to enable it in t2a.dat and regenerate the converter, but it fails with the message: The encoding t2a uses more than two bytes to encode characters.
Yes, of course, in T2A there is only one character slot for the
apostrophe, so you shouldn’t try and map all the different characters
one-to-one. This is precisely where the strategy explained in the
paragraph above helps: if you extract all the different types of
apostrophes to an auxiliary file and keep only one in the main file, you
can work around that problem. That said, do you really need to use the
patterns in an 8-bit encoding?

In conclusion, I think you should try and test the patterns first; you
don’t need any of the machinery that hyph-utf8 provides, but for example
just

---- BEGIN test-hyph-be.tex
\catcode`\{=1
\catcode`\}=2
\input unicode-letters.def
\lccode`\'=`\'
\lccode`\`=`\`
\lccode`\’=`\’
\input hyph-be
% Your text here
---- END test-hyph-be.tex

to be compiled with “xetex -ini -etex test-hyph-be.tex”. We’ll do the
packaging later.

Best,

Arthur
Maksim Salau
2016-08-29 03:50:53 UTC
Hi Arthur,

Thank you for the detailed explanation.

But unfortunately the test script doesn't work for me.
I tried it with TeXLive 2014.20141024-2 without success (unicode-letters.def is not shipped with it) and with the most recent vanilla version:

/usr/local/texlive/2016/bin/x86_64-linux/xetex -ini -etex test-hyph-be.tex
This is XeTeX, Version 3.14159265-2.6-0.99996 (TeX Live 2016) (INITEX)
restricted \write18 enabled.
entering extended mode
(./test-hyph-be.tex
(/usr/local/texlive/2016/texmf-dist/tex/plain/config/unicode-letters.def
! Use of \XeTeXcheck doesn't match its definition.
<inserted text> .9
9996
l.64 ...ifnum\expandafter\XeTeXcheck\XeTeXrevision
.-\relax>996 %
?
! Emergency stop.
<inserted text> .9
9996
l.64 ...ifnum\expandafter\XeTeXcheck\XeTeXrevision
.-\relax>996 %
No pages of output.
Transcript written on test-hyph-be.log.

My understanding of TeX is not sufficient to track down the cause from the message and the sources.
Here is the code XeTeX complains about (starting from line 63):

\def\XeTeXcheck.#1.#2-#3\relax{#1}
\ifnum\expandafter\XeTeXcheck\XeTeXrevision.-\relax>996 %
\def\XeTeXcheck#1{}
\else
\def\XeTeXcheck#1{%
\ifnum"#1>"FFFF %
\long\def\XeTeXcheck##1\endgroup{\endgroup}
\expandafter\XeTeXcheck
\fi
}
\fi

Best regards,
Maksim.

Arthur Reutenauer
2016-08-29 11:42:38 UTC
Post by Maksim Salau
But unfortunately the test script doesn't work for me.
I tried it with TeXLive 2014.20141024-2 without success (unicode-letters.def is not shipped with it)
Try unicode-letters.tex with TeX Live 2014.

Arthur
Arthur Reutenauer
2016-08-29 16:14:23 UTC
Post by Arthur Reutenauer
Using that technique I found a number of conflicts such as б1ь and
б8ь, в1ь and в8ь, as well as а1й and а8й, а1ў and а8ў, and the more
intriguing pairs 1’2а and ’3а, 1’2е and ’3е, etc. This makes me suspect
that the patterns haven’t been developed with great care.
Since I’ve been working on a library for pattern manipulation, I’ve
just extended it to report conflicts in pattern sets, and I find 156 of
them in the LibreOffice pattern file for Belarusian, which I copy in
full because they’re quite interesting:

----
1'2а '3а
1`2а `3а
1’2а ’3а
1'2е '3е
1`2е `3е
1’2е ’3е
1'2ё '3ё
1`2ё `3ё
1’2ё ’3ё
1'2і '3і
1`2і `3і
1’2і ’3і
1'2о '3о
1`2о `3о
1’2о ’3о
1'2у '3у
1`2у `3у
1’2у ’3у
1'2ы '3ы
1`2ы `3ы
1’2ы ’3ы
1'2э '3э
1`2э `3э
1’2э ’3э
1'2ю '3ю
1`2ю `3ю
1’2ю ’3ю
1'2я '3я
1`2я `3я
1’2я ’3я
д1ж д8ж
д1з д8з
а1й а8й
а1ў а8ў
е1й е8й
е1ў е8ў
ё1й ё8й
ё1ў ё8ў
і1й і8й
і1ў і8ў
о1й о8й
о1ў о8ў
у1й у8й
у1ў у8ў
ы1й ы8й
ы1ў ы8ў
э1й э8й
э1ў э8ў
ю1й ю8й
ю1ў ю8ў
я1й я8й
я1ў я8ў
б1ь б8ь
б1' б8'
б1` б8`
б1’ б8’
в1ь в8ь
в1' в8'
в1` в8`
в1’ в8’
г1ь г8ь
г1' г8'
г1` г8`
г1’ г8’
ґ1ь ґ8ь
ґ1' ґ8'
ґ1` ґ8`
ґ1’ ґ8’
д1ь д8ь
д1' д8'
д1` д8`
д1’ д8’
ж1ь ж8ь
ж1' ж8'
ж1` ж8`
ж1’ ж8’
з1ь з8ь
з1' з8'
з1` з8`
з1’ з8’
й1ь й8ь
й1' й8'
й1` й8`
й1’ й8’
к1ь к8ь
к1' к8'
к1` к8`
к1’ к8’
л1ь л8ь
л1' л8'
л1` л8`
л1’ л8’
м1ь м8ь
м1' м8'
м1` м8`
м1’ м8’
н1ь н8ь
н1' н8'
н1` н8`
н1’ н8’
п1ь п8ь
п1' п8'
п1` п8`
п1’ п8’
р1ь р8ь
р1' р8'
р1` р8`
р1’ р8’
с1ь с8ь
с1' с8'
с1` с8`
с1’ с8’
т1ь т8ь
т1' т8'
т1` т8`
т1’ т8’
ў1ь ў8ь
ў1' ў8'
ў1` ў8`
ў1’ ў8’
ф1ь ф8ь
ф1' ф8'
ф1` ф8`
ф1’ ф8’
х1ь х8ь
х1' х8'
х1` х8`
х1’ х8’
ц1ь ц8ь
ц1' ц8'
ц1` ц8`
ц1’ ц8’
ч1ь ч8ь
ч1' ч8'
ч1` ч8`
ч1’ ч8’
ш1ь ш8ь
ш1' ш8'
ш1` ш8`
ш1’ ш8’
ь1ь ь8ь
ь1' ь8'
ь1` ь8`
ь1’ ь8’
'1ь '8ь
'1' '8'
'1` '8`
'1’ '8’
`1ь `8ь
`1' `8'
`1` `8`
`1’ `8’
’1ь ’8ь
’1' ’8'
’1` ’8`
’1’ ’8’
----

You can try the same by installing my library from
https://github.com/hyphenation/hydra and running the following Ruby script
from the top-level directory:

----
require './lib/hydra'
hydra = Hydra.new
hydra.ingest_file '/path/to/pattern/file' # containing only patterns, TeX-style comments allowed
hydra.conflicts # Return the list of conflicts as an array
----

You’ll need to run “bundle” first to install the few dependencies.

Best,

Arthur
Maksim Salau
2016-08-30 05:05:08 UTC
Post by Arthur Reutenauer
Since I’ve been working on a library for pattern manipulation, I’ve
just extended it to report conflicts in pattern sets, and I find 156 of
them in the LibreOffice pattern file for Belarusian, which I copy in
Thank you, Arthur!

I'll review the patterns and come back once xetex no longer complains.

Best regards,
Maksim.
Arthur Reutenauer
2016-08-30 13:52:45 UTC
Post by Maksim Salau
I'll review the patterns and come back once xetex no longer complains.
Actually, making the patterns acceptable to TeX is easy, I can do that
for you. I think it would be more interesting to analyse the logic
behind them, and hopefully fix them, because there seems to be something
seriously wrong.

Apart from a number of very special patterns, I can describe the full
list as follows: a pattern is of the form

ANY 1 CMA
or 1 CMA 2 V
or V 1 V
or A 3 V
or . ANY 8
or 8 ANY .
or the pattern д8ж
or the pattern д8з
or V 8 K
or C 8 MA
or MA 8 MA

where

ANY is any letter of the Belarusian alphabet
C is any consonant
V is any vowel
A is an apostrophe (' ` ’)
K is й or ў
M is the soft sign (ь)
MA is M or an A
CMA is a C or M or A
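
The rules above can be expanded mechanically. The following Python sketch is only an illustration and not the original author's generator: the letter classes are read off the conflict list quoted earlier in the thread, and it assumes that ANY also covers the soft sign and the three apostrophes (which that conflict list suggests). Under those assumptions it should report 156 conflicting letter strings, the same number as in that list.

```python
import re
from collections import defaultdict
from itertools import product

V = list("аеёіоуыэюя")               # vowels
C = list("бвгґджзйклмнпрстўфхцчш")   # consonants, including й and ў
M = ["ь"]                            # soft sign
A = ["'", "`", "\u2019"]             # apostrophe variants
K = ["й", "ў"]
MA = M + A
CMA = C + M + A
ANY = C + V + M + A                  # assumption: apostrophes and ь count as letters


def generate_patterns():
    """Expand the meta-rules, in the order they are listed in the message."""
    patterns = [x + "1" + y for x, y in product(ANY, CMA)]       # ANY 1 CMA
    patterns += ["1" + x + "2" + v for x, v in product(CMA, V)]  # 1 CMA 2 V
    patterns += [v + "1" + w for v, w in product(V, V)]          # V 1 V
    patterns += [a + "3" + v for a, v in product(A, V)]          # A 3 V
    patterns += ["." + x + "8" for x in ANY]                     # . ANY 8
    patterns += ["8" + x + "." for x in ANY]                     # 8 ANY .
    patterns += ["д8ж", "д8з"]
    patterns += [v + "8" + k for v, k in product(V, K)]          # V 8 K
    patterns += [c + "8" + m for c, m in product(C, MA)]         # C 8 MA
    patterns += [m + "8" + n for m, n in product(MA, MA)]        # MA 8 MA
    return patterns


def find_conflicts(patterns):
    """Map each letter string to its patterns when more than one pattern uses it."""
    by_letters = defaultdict(list)
    for pattern in patterns:
        by_letters[re.sub(r"[0-9]", "", pattern)].append(pattern)
    return {k: v for k, v in by_letters.items() if len(v) > 1}


patterns = generate_patterns()
conflicts = find_conflicts(patterns)
print(len(patterns), "patterns,", len(conflicts), "conflicting letter strings")
```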

Using that formalism (it’s a context-free grammar if you’re familiar
with that), I can generate exactly the patterns contained in the
LibreOffice package ... except for a list of almost 2000 patterns that
seem rather strange (between the rule 8 ANY . and the pattern д8ж).
They're all anchored at the beginning or the end of the word (i. e.,
they start or end with a dot), they have the digit 8 in every position,
and they contain only consonants and different apostrophes. The first
ones seem somewhat reasonable, but they then get increasingly strange
and we then have patterns such as (8’s omitted for legibility) .бльгг,
.бррр, .брьггв, .дззззз, and (my personal favourite) .нннннн

I would really have a hard time accepting that these patterns make
much sense. I can imagine a few explanations for why they’re there, but
I think it would be good to try and understand what the original author
meant with that.

Best,

Arthur
Werner LEMBERG
2016-08-30 14:06:31 UTC
The first ones seem somewhat reasonable, but they then get
increasingly strange and we then have patterns such as (8’s omitted
for legibility) .бльгг, .бррр, .брьггв, .дззззз, and (my personal
favourite) .нннннн
Eeeeee, you never use these words? Dzzzzz :-)


Werner
Claudio Beccari
2016-08-30 14:34:00 UTC
They appear to be "onomatopoeias" and "interjections" of the kind that
should never appear in a typeset book, except, maybe, in books for
children. Any language uses such "sounds" with different spellings and
alphabets, but as far as I can see, no language pattern file contains
any pattern to deal with them.
Claudio
Arthur Reutenauer
2016-08-30 14:51:11 UTC
They appear to be "onomatopoeias" and "interjections" of the kind that should
never appear in a typeset book, except, maybe, in books for children. Any
language uses such "sounds" with different spellings and alphabets, but as
far as I can see, no language pattern file contains any pattern to deal with
them.
We can very well include patterns to deal with onomatopoeia in any
language, the problem is that they should be input as hyphenation
exceptions, not as .д8з8з8з8з8з8 and I also doubt there would be 1842 of
them.

Best,

Arthur
Claudio Beccari
2016-08-30 15:31:04 UTC
Permalink
You are perfectly right
Claudio
Maksim Salau
2016-08-30 15:42:42 UTC
Post by Arthur Reutenauer
Actually, making the patterns acceptable to TeX is easy, I can do that
for you. I think it would be more interesting to analyse the logic
behind them, and hopefully fix them, because there seems to be something
seriously wrong.
Thanks a lot. I hope I can do it myself. My understanding of the problems with those patterns is that the author specified the groups of letters incorrectly (e.g. made 'ь' a consonant, which is incorrect), and this led to conflicts, since there are special rules for ь ' й ў.
BTW, I found another variant of patterns in OpenOffice [1]
It doesn't include impossible combinations (mostly) and even has some exceptions to the general rules.

There are still some duplicates and conflicts.
E.g.:
ь8ь ь1ь % 'ьь' is an impossible combination anyway
.пад1ж д8ж
.пад1з д8з

Actually .пад1ж should override д8ж
I.e.:
.пад7ж д4ж
.пад7з д4з

Is this correct?

It seems to me this is a better starting point.

And one more question:
If I need to prohibit hyphenation before й or ў, can I write 8й 8ў ?
Or do I need to write all possible combinations of vowel + й|ў ?

Thanks and regards,
Maksim.

[1] https://gist.github.com/msalau/21bebeaf87a5b22a8020b37dc8afaf21
Arthur Reutenauer
2016-08-30 16:14:57 UTC
Post by Maksim Salau
Thanks a lot. I hope I can do it myself. My understanding of the problems with those patterns is that the author specified the groups of letters incorrectly (e.g. made 'ь' a consonant, which is incorrect), and this led to conflicts, since there are special rules for ь ' й ў.
It’s a safe bet that the author(s) didn’t really know what they were
doing.
Post by Maksim Salau
BTW, I found another variant of patterns in OpenOffice [1]
Interesting, where did you find them exactly?
Post by Maksim Salau
It doesn't include impossible combinations (mostly) and even has some exceptions to the general rules.
Yes, they look much more reasonable, but I still think it could be
worth making contact with Alex Buloichik to ask him what these
combinations were supposed to stand for. They didn’t just come out of
nowhere. If you do, you can ask him if he would agree to change the
licence to the MIT licence (https://opensource.org/licenses/MIT), this
makes it easier to share the patterns across projects since the patterns
are potentially useful to all typesetting systems, hyphenation
libraries, and even Web browsers nowadays.
Post by Maksim Salau
There are still some duplicates and conflicts.
ь8ь ь1ь % 'ьь' is an impossible combination anyway
Forget about it, then. Clearly this has been automatically generated
from a set of (meta-)patterns.
Post by Maksim Salau
.пад1ж д8ж
.пад1з д8з
Note that these are not actually conflicts in TeX’s view, they work
exactly the way hyphenation patterns are intended; obviously they
specify the opposite of what’s correct, but that’s another problem ;-)
Post by Maksim Salau
Actually .пад1ж should override д8ж
.пад7ж д4ж
.пад7з д4з
Is this correct?
That’s correct, but actually I would just write

д2ж
д2з
.пад3

Using lower numbers to begin with makes it easier to refine later.

That being said, is пад really always a prefix?
Post by Maksim Salau
It seems to me this is a better starting point.
Yes, indeed.
Post by Maksim Salau
If I need to prohibit hyphenation before й or ў, can I write 8й 8ў ?
Or do I need to write all possible combinations of vowel + й|ў ?
No, 8й 8ў is perfectly valid and expresses what you want. Or even
4й 4ў, for that matter -- using the even number greater or equal to any
other number in the patterns.

Best,

Arthur
Maksim Salau
2016-08-31 02:54:57 UTC
Hi Arthur,
Post by Arthur Reutenauer
Post by Maksim Salau
BTW, I found another variant of patterns in OpenOffice [1]
Interesting, where did you find them exactly?
Here it is http://extensions.services.openoffice.org/en/project/dict-be-official
Version 1.1
The file itself is in cp1251 and needs conversion to UTF-8
iconv -f cp1251 -t UTF-8 < ./hyph_be_BY.dic > ./hyph_be_BY.txt
+ some hand editing to put the content inside \patterns{}
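
The conversion and the wrapping could also be done in one go. This Python sketch is only an illustration of those two steps; the function name dic_to_tex and the header_lines parameter are made up, and whether the .dic file starts with a header line (for example the encoding name) is an assumption to check against the actual file. Any syntax that is specific to the dictionary format would still need hand editing.

```python
def dic_to_tex(raw, source_encoding="cp1251", header_lines=0):
    """Decode the raw .dic bytes and wrap the non-empty lines in \\patterns{...}."""
    text = raw.decode(source_encoding)
    lines = text.splitlines()[header_lines:]  # header_lines: hypothetical header to skip
    body = [line.strip() for line in lines if line.strip()]
    return "\\patterns{\n" + "\n".join(body) + "\n}\n"


# Stand-in for the real file; with the actual dictionary one would pass
# pathlib.Path("hyph_be_BY.dic").read_bytes() instead.
sample = "а1\n8й\n8ў\n".encode("cp1251")
print(dic_to_tex(sample), end="")
```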
Post by Arthur Reutenauer
Yes, they look much more reasonable, but I still think it could be
worth making contact with Alex Buloichik to ask him what these
combinations were supposed to stand for. They didn’t just come out of
nowhere.
According to the comment on line 1414, the intention behind such awkward patterns
was to prohibit hyphenation of any part that is composed solely of consonants.
Post by Arthur Reutenauer
If you do, you can ask him if he would agree to change the
licence to the MIT licence (https://opensource.org/licenses/MIT), this
makes it easier to share the patterns across projects since the patterns
are potentially useful to all typesetting systems, hyphenation
libraries, and even Web browsers nowadays.
Ok, I'll ask.
Post by Arthur Reutenauer
That’s correct, but actually I would just write
д2ж
д2з
.пад3
Using lower numbers to begin with makes it easier to refine later.
That being said, is пад really always a prefix?
This would make life too easy :) In some words it is a part of the root and is hyphenated differently.
E.g.: па-да-ру-нак, па-дзел, вы-па-дак, па-да-плё-ка.
Post by Arthur Reutenauer
No, 8й 8ў is perfectly valid and expresses what you want. Or even
4й 4ў, for that matter -- using the even number greater or equal to any
other number in the patterns.
Hyphenation right before й or ў is prohibited at all times, no exceptions. So 8 will be just right, I believe.
Thanks, this will make the list of patterns much shorter.

Best regards,
Maksim.
Arthur Reutenauer
2016-08-31 14:38:27 UTC
Post by Maksim Salau
Here it is http://extensions.services.openoffice.org/en/project/dict-be-official
Thanks.
Post by Maksim Salau
The file itself is in cp1251 and needs conversion to UTF-8
iconv -f cp1251 -t UTF-8 < ./hyph_be_BY.dic > ./hyph_be_BY.txt
+ some hand editing to put the content inside \patterns{}
Thanks, I know how to do that :-)
Post by Maksim Salau
According to the comment on line 1414, the intention behind such awkward patterns
was to prohibit hyphenation of any part that is composed solely of consonants.
There’s something odd anyway. I still suspect the actual list of
patterns does not reflect the intention of the author.
Post by Maksim Salau
Ok, I'll ask.
Thanks. I don’t mind being copied on the conversation, even if it is
in Belarusian. You should contact Sviatlana Liasovich as well, since
she’s mentioned as having made corrections; in fact I think it would be
accurate to consider her as the sole author of the OpenOffice file,
since I can’t discern any trace of the original patterns.
Post by Maksim Salau
Post by Arthur Reutenauer
That’s correct, but actually I would just write
д2ж
д2з
.пад3
Using lower numbers to begin with makes it easier to refine later.
That being said, is пад really always a prefix?
This would make life too easy :) In some words it is a part of the root and is hyphenated differently.
E.g.: па-да-ру-нак, па-дзел, вы-па-дак, па-да-плё-ка.
OK, that’s what I suspected :-) In that case it’s probably safer to
stick to

д2ж
д2з
.па2д3ж
.па2д3з

and input падзел as an exception: \hyphenation{па-дзел}.

You need an even number after .па because of patterns of the type CVn,
with n an odd number to allow a break; the OpenOffice patterns have C8V3,
but I would recommend CV1.
Post by Maksim Salau
Hyphenation right before й or ў is prohibited at all times, no exceptions. So 8 will be just right, I believe.
That sounds right. It’s of course all right to use 8 when a break is
really prohibited, but the current files use far too many of them.

Best,

Arthur
Maksim Salau
2016-09-02 07:36:20 UTC
Hi Arthur,

Thank you for your help!

I've created a script to generate patterns and here is the result:
script: https://github.com/msalau/hyph-be/blob/master/hyph-be.py
patterns: https://github.com/msalau/hyph-be/blob/master/hyph-be.tex

XeTeX is happy with these patterns, and now is the time to verify them :)
Is there any simple way to feed a bunch of words to *TeX and get hyphenated words back?

Also, is there any easy way to prohibit hyphenation of consonant-only endings/beginnings of a word?
I can remember a word with 3 consonants at the end.
Is generation of .ccc8 8ccc. patterns the only way to go? (patterns for 2 consonants are already in place)
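
To give an idea of what that brute-force approach would involve, here is a small Python sketch (not taken from hyph-be.py; the consonant list, with й and ў counted as consonants, and the function name edge_patterns are assumptions made for the example). For clusters of three consonants it yields 2 × 22³ = 21296 patterns:

```python
from itertools import product

# Assumption: the same 22 consonants as in the conflict list earlier in the thread.
CONSONANTS = "бвгґджзйклмнпрстўфхцчш"


def edge_patterns(length):
    """Yield .ccc8 and 8ccc. style patterns for consonant clusters of the given length."""
    for cluster in product(CONSONANTS, repeat=length):
        letters = "".join(cluster)
        yield "." + letters + "8"  # no break right after a word-initial cluster
        yield "8" + letters + "."  # no break right before a word-final cluster


print(sum(1 for _ in edge_patterns(3)))
```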

The licensing question is still open. I failed to reach Sviatlana, and Alex gave no answer about switching to the MIT license.

Best regards,
Maksim.
Claudio Beccari
2016-09-02 07:46:04 UTC
Post by Maksim Salau
Is there any simple way to feed a bunch of words to *TeX and get hyphenated words back?
The easy part is to use the testhyphens.sty small package, that is
included in any TeX system full installation.

The difficult part is to find a bunch of words to feed it; a simple,
trivial way is to feed it with text copied and pasted from any
source containing Belarusian text. Do not exaggerate with the length of
such text; alternatively use the checkhyphens environment (defined in
that package) within a multicolumn environment. When I test the patterns I
prepare, I use a 4 or 5 column setting for the multicolumn environment.
The above small package works with pdflatex, xelatex, lualatex.

Claudio
Maksim Salau
2016-09-02 16:17:50 UTC
Hi Claudio,

Thanks for directions!

You can always get a lot of words from a spelling dictionary :)
E.g.: unmunch /usr/share/hunspell/be_BY.dic /usr/share/hunspell/be_BY.aff
will give you 765411 words

Regards,
Maksim.

Maksim Salau
2016-09-03 00:33:55 UTC
Hi,
Post by Claudio Beccari
Post by Maksim Salau
Is there any simple way to feed a bunch of words to *TeX and get hyphenated words back?
The easy part is to use the testhyphens.sty small package, that is
included in any TeX system full installation.
Do you know a way that doesn't involve system-wide installation of patterns and generation of a full-blown *TeX document?
I had in mind something like this:
% -- BEGIN --
\catcode`\{=1
\catcode`\}=2
\input unicode-letters
\lccode`\'=`\'
\input hyph-be
% Some TeX magic
\input word-list.txt
% Some more TeX magic
% -- END --

And result printed to standard output.

I got such a workflow with https://github.com/hunspell/hyphen
But I'm not sure whether this library produces the same output as *TeX does.

And here is my first result, which does not work as expected:
left hyphen min = right hyphen min = 2
а б а б ' ю
.а8
а1
а1
8'1
ю1
8ю.

.а8ба1б8'8ю.

So I expect to get аба=б'ю, but the library says there are no valid hyphenation points in the word.
Adding the rule .аба3б doesn't help. The only explanation I can imagine is that the library doesn't treat the quote as part of the alphabet and hyphenates the word as two separate words. The word абаб'юц=ца proves that theory.
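
For comparison, the pattern evaluation worked out by hand above can be written as a short Python sketch. It is a simplified, textbook-style version of the pattern-matching idea (take the highest digit at every position, allow a break where the result is odd) with the apostrophe treated as an ordinary letter; it is not the code of TeX or of the library, and the function names are made up.

```python
def parse_pattern(pattern):
    """Split e.g. 'а1' into the letters 'а' and the inter-letter levels [0, 1]."""
    letters, levels = "", [0]
    for ch in pattern:
        if ch in "0123456789":
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    return letters, levels


def hyphenate(word, patterns, leftmin=2, rightmin=2, mark="="):
    """Insert `mark` at every position whose highest pattern level is odd."""
    parsed = dict(parse_pattern(p) for p in patterns)
    padded = "." + word + "."
    levels = [0] * (len(padded) + 1)  # levels[i] applies just before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            chunk = padded[start:end]
            if chunk in parsed:
                for offset, level in enumerate(parsed[chunk]):
                    levels[start + offset] = max(levels[start + offset], level)
    out = []
    for i, ch in enumerate(word):
        # padded[i + 1] is word[i], so levels[i + 1] is the level before word[i]
        if leftmin <= i <= len(word) - rightmin and levels[i + 1] % 2 == 1:
            out.append(mark)
        out.append(ch)
    return "".join(out)


patterns = [".а8", "а1", "8'1", "ю1", "8ю."]
print(hyphenate("абаб'ю", patterns))  # prints аба=б'ю
```

With the apostrophe counted as a letter, this evaluation gives the аба=б'ю expected above.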

Does anyone have any experience with the library?

Best regards,
Maksim.
Arthur Reutenauer
2016-09-16 20:28:29 UTC
Hi Maksim,
Post by Maksim Salau
Do you know a way that doesn't involve system-wide installation of patterns and generation of a full-blown *TeX document?
% -- BEGIN --
\catcode`\{=1
\catcode`\}=2
\input unicode-letters
\lccode`\'=`\'
\input hyph-be
% Some TeX magic
\input word-list.txt
% Some more TeX magic
% -- END --
And result printed to standard output.
I would use my library Hydra:

-- BEGIN --
require 'hydra'

hydra = Hydra.new
hydra.setlefthyphenmin(2)
hydra.setrighthyphenmin(2)
hydra.ingest_file('/path/to/patterns.txt')
File.read('/path/to/wordlist.txt').each_line do |line|
  puts hydra.showhyphens(line.strip)
end
-- END --
Post by Maksim Salau
I got such work-flow with https://github.com/hunspell/hyphen
But I'm not sure if this library produces the same output as *TeX does.
I'm sure it doesn't ;-) It was designed to provide a functionality
equivalent to TeX’s hyphenation routine but has a slightly different
implementation; in the general case you can’t assume that the same set
of patterns will yield the same hyphenation points with TeX and
libhyphen. (One could even say it uses a slightly different algorithm
based on the same ideas.) In addition there are slight variations in
Post by Maksim Salau
left hyphen min = right hyphen min = 2
а б а б ' ю
.а8
а1
а1
8'1
ю1
8ю.
.а8ба1б8'8ю.
So I expect to get аба=б'ю, but the library says there are no valid hyphenation points in the word.
Adding rule .аба3б doesn't help. The only explanation I can imagine is that the library doesn't treat quote as part of the alphabet and hyphenates the word as two separate words. The word абаб'юц=ца proves that theory.
That's a very likely explanation indeed, and I would call this kind of
problem a configuration issue; in addition the fact that the patterns
need “preparation” for use with libhyphen does mean that you can’t use
your patterns with it and expect the same hyphenation as TeX. See
README.hyphen in the libhyphen repository for details.
Post by Maksim Salau
Does anyone have any experience with the library?
Not much more than the theoretical knowledge I summarise above, to be
honest (and one test case that made the problem evident), but I’d still
recommend using my library, and reporting any potential difference you
notice, because then it’s a bug and I’d like to fix it :-) You can of
course report any other issue, if applicable.

Best,

Arthur
Arthur Reutenauer
2016-09-16 20:34:33 UTC
Permalink
Hi again,
Post by Maksim Salau
Also, is there any easy way to prohibit hyphenation of consonant-only endings/beginnings of a word?
I can remember a word with 3 consonants at the end.
Is generation of .ccc8 8ccc. patterns the only way to go? (patterns for 2 consonants are already in place)
Yes, and I would restrict that to lists of three consonants that
actually do occur in Belarusian.
Post by Maksim Salau
Licensing question is still open. I failed to contact Sviatlana and Alex answered nothing about switching to the MIT license.
I’ve seen you’ve made progress in the mean time from your private
emails; however I’d like to mention that from what I see in your working
repository, you have actually reimplemented the whole file from the
specifications of the Belarusian Academy. It is thus almost certain
that you can rightfully call yourself the only copyright holder of the
file; the only caveat is the list of 23 words you’ve copied from the
OpenOffice file (whose author must be Sviatlana Liasovich since they are
not in the LibreOffice file by Alex Buloichik). However it is doubtful
that one can really hold a copyright on a list of 23 words or substrings ...

That said, it is always courteous to acknowledge the contribution of
previous developers, but I wouldn’t put their names in the copyright
line.

Best,

Arthur
Maksim Salau
2016-09-16 21:49:54 UTC
Permalink
Hi Arthur,
Post by Maksim Salau
-- BEGIN --
require 'hydra'
hydra = Hydra.new
hydra.setlefthyphenmin(2)
hydra.setrighthyphenmin(2)
hydra.ingest_file('/path/to/patterns.txt')
File.read('/path/to/wordlist.txt').each_line do |line|
puts hydra.showhyphens(line.strip)
end
-- END --
Thanks! I've decided to use your library and ended up with almost the same script
https://github.com/msalau/hyph-be/blob/master/showhyphens.rb
It accepts words either as arguments or from stdin.
Post by Maksim Salau
Post by Maksim Salau
Also, is there any easy way to prohibit hyphenation of consonant-only endings/beginnings of a word?
I can remember a word with 3 consonants at the end.
Is generation of .ccc8 8ccc. patterns the only way to go? (patterns for 2 consonants are already in place)
Yes, and I would restrict that to lists of three consonants that
actually do occur in Belarusian.
This is the hard part :) All combinations (both possible and impossible) take a really huge amount of space. I'm considering parsing the hunspell dictionary to get only the possible combinations.
Post by Maksim Salau
Post by Maksim Salau
Licensing question is still open. I failed to contact Sviatlana and Alex answered nothing about switching to the MIT license.
I’ve seen you’ve made progress in the mean time from your private
emails; however I’d like to mention that from what I see in your working
repository, you have actually reimplemented the whole file from the
specifications of the Belarusian Academy. It is thus almost certain
that you can rightfully call yourself the only copyright holder of the
file; the only caveat is the list of 23 words you’ve copied from the
OpenOffice file (whose author must be Sviatlana Liasovich since they are
not in the LibreOffice file by Alex Buloichik). However it is doubtful
that one can really hold a copyright on a list of 23 words or substrings ...
That said, it is always courteous to acknowledge the contribution of
previous developers, but I wouldn’t put their names in the copyright
line.
Thanks. I'll reconsider that part.

Recently I tried patterns with real TeX documents: https://github.com/msalau/hyph-be/tree/master/test-doc
I have 2 variants: for T2A encoding (compiled with pdflatex) and for UTF-8 (compiled with xelatex).

Here is what I've got:

1. document.t2a.tex is a UTF-8 document, uses babel (Belarusian is supported) and the T2A encoding, and is compiled with pdflatex.
Post by Maksim Salau
[] \T2A/cmr/m/n/10 ��-�-���� ��-��-����!
2. document.tex is a UTF-8 document, uses polyglossia, and is compiled with xelatex.
Polyglossia doesn't support Belarusian.
Hyphenation doesn't work, but the output from \showhyphens{} is readable.
Post by Maksim Salau
Package polyglossia Warning: File gloss-belarusian.ldf does not exist!
(polyglossia) I will nevertheless try to use hyphenation patterns for belarusian. on input line 7.
Underfull \hbox (badness 10000) in paragraph at lines 10--10
[] \EU1/DejaVuSans(0)/m/n/10 Тэставы дакумент
Package babel Warning: No input encoding specified for Belarusian language on input line 146.
))
Underfull \hbox (badness 10000) in paragraph at lines 10--10
[] \EU1/DejaVuSans(0)/m/n/10 Тэставы дакумент
Polyglossia is actually aware of the presence of hyphenation patterns for Belarusian, since it doesn't complain much.
Post by Maksim Salau
Package polyglossia Warning: File gloss-foo-bar.ldf does not exist!
(polyglossia) I will nevertheless try to use hyphenation patterns for foo-bar. on input line 7.
Package polyglossia Warning: \setlocalhyphenmin useless for unknown language foo-bar on input line 7.
Package polyglossia Warning: No hyphenation patterns were loaded for `Foo-bar'
Does anyone have any ideas about polyglossia? As far as I can see, polyglossia can access the hyphenation patterns
(it knows that they exist), but fails to load them for an unknown reason.

Best regards,
Maksim.
Arthur Reutenauer
2016-09-18 16:56:44 UTC
Permalink
Hi Maksim,
Post by Maksim Salau
Thanks! I've decided to use your library and ended up with almost the same script
https://github.com/msalau/hyph-be/blob/master/showhyphens.rb
It accepts words either as arguments or from stdin.
Good, I saw that in the mean time :-)
Post by Maksim Salau
This is the hard part :) All combinations (both possible and impossible) take really huge amount of space. I'm considering parsing the hunspell dictionary to get only possible combinations.
Yes, that’s what I was suggesting.

Regarding experiments with real documents, you need a
gloss-belarusian.ldf if you want to test with Polyglossia, and there are
various minor issues that mean your document should work with only a
small amount of change; see document-arthur.tex in my fork of your
repository (https://github.com/reutenauer/hyph-be) and
gloss-belarusian.ldf in the same directory.

Best,

Arthur
Maksim Salau
2016-09-21 05:55:18 UTC
Permalink
Hi Arthur,

Many thanks for the sample ldf file!

I looked at your https://github.com/reutenauer/hyph-be/blob/master/three-consonants.rb
It lists 3 consonants in a row, but such a cluster is not an issue if it is in the middle of a word.
I meant only those at the end of a word, e.g.: /[#{cons}]{3}$/
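
A rough Python sketch of that idea: collect the three-consonant clusters that actually occur at the beginning or end of words and emit .ccc8 / 8ccc. patterns only for those. Everything here is an assumption made for illustration: the function name boundary_patterns, the consonant set, and the input being a plain list of words. A hunspell .dic file would first need its flags stripped and its affixes expanded.

```python
import re

# Assumption: a rough set of Belarusian consonant letters, for illustration only.
CONSONANTS = "бвгджзйклмнпрстўфхцчш"


def boundary_patterns(words, n=3):
    """Return sorted patterns forbidding a break next to n-consonant word edges."""
    head = re.compile(rf"^[{CONSONANTS}]{{{n}}}")  # cluster at the start of a word
    tail = re.compile(rf"[{CONSONANTS}]{{{n}}}$")  # cluster at the end of a word
    patterns = set()
    for word in words:
        word = word.strip().lower()
        match = head.search(word)
        if match:
            patterns.add(f".{match.group()}8")  # no break right after the cluster
        match = tail.search(word)
        if match:
            patterns.add(f"8{match.group()}.")  # no break right before the cluster
    return sorted(patterns)


if __name__ == "__main__":
    print(boundary_patterns(["страх", "тэкст", "мова"]))  # ['.стр8', '8кст.']
```

Only clusters seen in the input produce a pattern, so the size of the result depends on the word list rather than on the number of possible letter combinations.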

Also I've made some progress in determining if hyphenation in the middle of дж/дз is allowed.
Here is the script https://github.com/msalau/hyph-be/blob/master/list-dz.py
And output https://github.com/msalau/hyph-be/blob/master/list-dz.txt
I started with empty PATTERNS and added patterns until all words are covered.
There are still 95 words (7 patterns) to be determined, but the overall picture is already clear:
hyphenation is allowed in 579 words (39 patterns) and is prohibited in 1280 words (69 patterns).
So I can conclude that hyphenation of дж/дз is an exception.
I'll try to find someone to review the list.
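
The workflow described above (start with no rules, add rules until every word is covered) could look roughly like the following Python sketch. This is not the actual list-dz.py; the function name classify is hypothetical, and the rules and sample words are placeholders that have not been reviewed linguistically.

```python
import re


def classify(words, allow, forbid):
    """Sort words containing дж/дз into allowed, forbidden and undecided groups."""
    result = {"allow": [], "forbid": [], "undecided": []}
    for word in words:
        if not re.search("д[жз]", word):
            continue  # words without дж/дз are irrelevant here
        if any(re.search(rule, word) for rule in allow):
            result["allow"].append(word)
        elif any(re.search(rule, word) for rule in forbid):
            result["forbid"].append(word)
        else:
            result["undecided"].append(word)  # no rule covers this word yet
    return result


if __name__ == "__main__":
    # Placeholder rules (regular expressions), purely for illustration.
    allow_rules = [r"^пад[жз]"]
    forbid_rules = [r"^д[жз]"]
    sample = ["падземны", "дзень", "джаз", "мова", "аддзел"]
    for group, members in classify(sample, allow_rules, forbid_rules).items():
        print(group, len(members), members)
```

Rules are added until the undecided group is empty; the sizes of the other two groups then give counts of the kind quoted above.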

There is also an alternative and 100% correct way: prohibit hyphenation in the middle of дж/дз and right before it.
E.g.: 8д8ж 8д8з
This will be valid for all cases :)

Best regards,
Maksim.
Post by Arthur Reutenauer
Post by Maksim Salau
This is the hard part :) All combinations (both possible and impossible) take really huge amount of space. I'm considering parsing the hunspell dictionary to get only possible combinations.
Yes, that’s what I was suggesting.
Arthur Reutenauer
2016-09-26 20:03:51 UTC
Permalink
Hi Maksim,
Post by Maksim Salau
Many thanks for sample ldf-file!
You’re welcome :-) About the consonant clusters, my view was actually
that it can be worth including all the possible clusters as
unhyphenatable patterns at word boundaries – regardless of where the
clusters themselves were found – because it’s still not that many and it
makes sense to be a little prudent; for example, is it possible that
someone make an abbreviation by stopping the word right after the
3-consonant cluster? Just a thought.
Post by Maksim Salau
Also I've made some progress in determining if hyphenation in the middle of дж/дз is allowed.
Here is the script https://github.com/msalau/hyph-be/blob/master/list-dz.py
And output https://github.com/msalau/hyph-be/blob/master/list-dz.txt
I started with empty PATTERNS and added patterns until all words are covered.
hyphenation is allowed in 579 words (39 patterns) and is prohibited in 1280 words (69 patterns).
So I can conclude that hyphenation of дж/дз is an exception.
I'll try to find someone to review the list.
This is sound. I wouldn’t quite call hyphenation of дж and дз an
“exception” as it occurs in one third of the words, but it’s clearly the
pragmatic choice to prohibit it by default and allow it for the words
where it is allowed.
Post by Maksim Salau
There is also a alternative and 100% correct way: prohibit hyphenation in the middle of дж/дз and right before it.
E.g.: 8д8ж 8д8з
This will be valid for all cases :)
Of course :-)

Best,

Arthur
