Discussion:
[tex-hyphen] extending `hyph-zh-latn-pinyin.tex'
Werner LEMBERG
2018-11-21 19:31:43 UTC
Folks,


I think it would be nice to extend `hyph-zh-latn-pinyin.tex' by
covering Chinese syllables with tone marks – it's trivial to extend,
say, pattern

a1b

with

ā1b
á1b
ǎ1b
à1b

and ditto for all other patterns. However, such an extended file
would only be usable by XeTeX and LuaTeX. I now wonder what route
should be taken to stay compatible with pdfTeX and classical TeX.
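
The expansion described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used for the file: the names `TONES` and `tone_variants` are made up here, the handling of `ü` is an assumption, and the sketch simply replaces one vowel at a time with each of its four tone-marked forms.

```ruby
# Tone-marked forms (tones 1-4) of each pinyin vowel.
# Including "ü" is an assumption; the thread does not say how the
# existing pattern file represents it.
TONES = {
  "a" => %w[ā á ǎ à],
  "e" => %w[ē é ě è],
  "i" => %w[ī í ǐ ì],
  "o" => %w[ō ó ǒ ò],
  "u" => %w[ū ú ǔ ù],
  "ü" => %w[ǖ ǘ ǚ ǜ],
}.freeze

# Return the pattern itself followed by every variant in which exactly
# one vowel has been replaced by one of its tone-marked forms.
def tone_variants(pattern)
  variants = [pattern]
  pattern.each_char.with_index do |ch, i|
    next unless TONES.key?(ch)

    TONES[ch].each do |toned|
      variants << pattern[0...i] + toned + pattern[(i + 1)..-1]
    end
  end
  variants
end

puts tone_variants("a1b")
```

For `a1b` this prints the original pattern followed by the four tone-marked variants listed above. Whether patterns containing several vowels should receive a tone mark on each vowel in turn is a question the sketch does not settle.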

(1) Another file.

This solution I rather dislike.

(2) Some conditional code within \patterns to append the extended
patterns.

This would be my choice. However, it clutters the patterns with
some TeX commands.

(3) Two \patterns blocks (one for XeTeX/LuaTeX, another one for
pdfTeX/classical TeX) enclosed by conditionals.

A variant of (2) which might be preferable.


What do you think? What do you suggest? What TeX macros or
primitives should be used?


Werner
Mojca Miklavec
2018-11-21 20:57:40 UTC
Dear Werner,
Post by Werner LEMBERG
Folks,
I think it would be nice to extend `hyph-zh-latn-pinyin.tex' by
covering Chinese syllables with tone marks – it's trivial to extend,
say, pattern
a1b
with
ā1b
á1b
ǎ1b
à1b
and ditto for all other patterns. However, such an extended file
would only be usable by XeTeX and luatex. I now wonder what route
should be taken to stay compatible with pdfTeX and classical TeX.
(1) Another file.
This solution I rather dislike.
This is what I would go for.

But I would create a simple script in any programming language (lua,
ruby, python, ...) and generate two pattern files out of it.

You can see the folder source/generic/hyph-utf8/languages with some examples.


There are some cases like all the languages with quotation marks in
patterns which do the following (provide just additional patterns in a
separate file):

\ifx\secondarg\empty
  % Unicode-aware engine (such as XeTeX or LuaTeX) only sees a
  % single (2-byte) argument
  \message{UTF-8 Italian hyphenation patterns}
  \input hyph-it.tex
  \input hyph-quote-it.tex
\else
  % 8-bit engine (such as TeX or pdfTeX)
  \message{ASCII Italian hyphenation patterns}
  \input hyph-it.tex
\fi

but that's a super small subset of patterns. When the additional
patterns make up the vast majority, I really see no advantage in
providing just a partial file with the additions.
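
Why the two kinds of engine see different things can be illustrated outside of TeX. The sketch below only counts characters and bytes of a UTF-8 string; it does not reproduce how the loader sets `\secondarg`, and the helper name `engine_views` is made up for this example.

```ruby
# Compare how many units a Unicode-aware engine (one per character)
# and an 8-bit engine (one per byte) would read for the same input.
def engine_views(char)
  {
    characters: char.chars.length, # what XeTeX/LuaTeX would see
    bytes: char.bytes.length,      # what TeX/pdfTeX would see
  }
end

%w[a ā ǎ].each do |ch|
  puts "#{ch}: #{engine_views(ch)}"
end
```

A plain `a` is one character and one byte, while a tone-marked vowel such as `ā` is one character but two bytes in UTF-8, which is why patterns containing it are not directly usable by an 8-bit engine.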
Post by Werner LEMBERG
(2) Some conditional code within \patterns to append the extended
patterns.
This would be my choice. However, it clutters the patterns with
some TeX commands.
No, please don't use that. We parse those files (outside of TeX) to
generate plain-text patterns, and TeX code would only cause us trouble
there.
Post by Werner LEMBERG
(3) Two \pattern blocks (one for XeTeX/luatex, another one for
pdftex/classical TeX) enclosed by conditionals.
A variant of (2) which might be preferable.
While slightly cleaner, this is hardly better than (2). We also ship
separate pattern files for monotonic and polytonic Greek. Or two sets
of patterns for Serbian (in different scripts). While none of the
situations is the same as for pinyin, I don't see any issue with
multiple files. The only disadvantage of two files is that one might
change one, but not the other. However, this is where using a script to
generate both comes in handy.
Post by Werner LEMBERG
What TeX macros or primitives should be used?
None :)
Android cannot parse TeX primitives :)

Mojca
Arthur Reutenauer
2018-11-21 22:10:36 UTC
Post by Mojca Miklavec
Post by Werner LEMBERG
(1) Another file.
This solution I rather dislike.
This is what I would go for.
But I would create a simple script in any programming language (lua,
ruby, python, ...) and generate two pattern files out of it.
As Mojca said, without the shadow of a doubt. Just use a single file
as the source, and generate both versions with a simple script. In this
particular case the source file can be the full UTF-8 pattern set, so
the file for pdfTeX can be ignored by applications not interested in it.
Or we can use the current file and generate the UTF-8 file with tone
marks out of it.
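
The first direction (full UTF-8 set as the source, toneless set derived from it) might look roughly like this. It is a sketch under assumptions: `strip_tones` and `split_pattern_sets` are hypothetical names, the patterns are taken to be available as an array of strings, and how `ü` should be written in the 8-bit file is left open.

```ruby
# Tone-marked vowels and, position by position, their toneless forms.
TONED = "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ"
PLAIN = "aaaaeeeeiiiioooouuuuüüüü"

# Replace every tone-marked vowel in a pattern by its toneless form.
def strip_tones(pattern)
  pattern.tr(TONED, PLAIN)
end

# Given the full UTF-8 pattern list, return it together with the
# toneless list derived from it (duplicates removed, order preserved).
def split_pattern_sets(utf8_patterns)
  toneless = utf8_patterns.map { |p| strip_tones(p) }.uniq
  [utf8_patterns, toneless]
end

utf8, toneless = split_pattern_sets(%w[a1b ā1b á1b ǎ1b à1b])
puts "UTF-8 patterns:    #{utf8.join(' ')}"
puts "toneless patterns: #{toneless.join(' ')}"
```

With a single source, a fix to one pattern propagates to both generated files, which is the point of generating them rather than editing them separately.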

Best,

Arthur
Werner LEMBERG
2018-11-22 08:43:57 UTC
Post by Mojca Miklavec
Post by Werner LEMBERG
(1) Another file.
This solution I rather dislike.
This is what I would go for.
OK.
Post by Mojca Miklavec
But I would create a simple script in any programming language (lua,
ruby, python, ...) and generate two pattern files out of it.
Not necessary – the stuff is so simple, and the number of syllables is
closed which means there won't be any changes except bug fixes. A
simple search and replace did the job in a few minutes; see attached
file.

Please rename it as you like. You can also change the license as you
like (maybe MIT is preferable), and I guess some other minor
adjustments are necessary to distinguish the language tag from the
toneless pinyin version.


Werner
Arthur Reutenauer
2018-11-22 10:01:34 UTC
Not necessary – the stuff is so simple, and the number of syllables is
closed which means there won't be any changes except bug fixes. A
simple search and replace did the job in a few minutes; see attached
file.
Thanks, will install. Speaking of bugfixes, I am actually wondering
about some patterns in the file.
Please rename it as you like. You can also change the license as you
like (maybe MIT is preferable)
I’ll change it to MIT since you don’t mind.

Best,

Arthur
Werner LEMBERG
2018-11-23 19:57:21 UTC
Speaking of bugfixes, I am actually wondering about some patterns in
the file.
Please elaborate.


Werner
Mojca Miklavec
2018-11-23 22:32:11 UTC
Post by Mojca Miklavec
But I would create a simple script in any programming language (lua,
ruby, python, ...) and generate two pattern files out of it.
Not necessary – the stuff is so simple, and the number of syllables is
closed which means there won't be any changes except bug fixes. A
simple search and replace did the job in a few minutes; see attached
file.
I still find it useful to have a level of abstraction, *in particular*
when the rules are really simple. This is what is done in Turkish for
example (it's not a complete set of patterns):

vowels = %w{a â e ı i î o ö u ü û}
consonants = %w{b c ç d f g ğ h j k l m n p r s ş t v y z}

vowels.each do |vowel|
  puts "2#{vowel}1"
end
consonants.each do |cons|
  puts "1#{cons}1"
end
consonants.each do |c1|
  consonants.each do |c2|
    puts "2#{c1}#{c2}"
  end
end

I'm not saying that it has to be done, just that it is nice to have
something like this. (It's really easy to make a typo when assembling
such a list manually.)

Mojca
Werner LEMBERG
2018-11-24 07:33:03 UTC
Post by Mojca Miklavec
Post by Mojca Miklavec
But I would create a simple script in any programming language
(lua, ruby, python, ...) and generate two pattern files out of
it.
Not necessary – the stuff is so simple, and the number of syllables
is closed which means there won't be any changes except bug fixes.
A simple search and replace did the job in a few minutes; see
attached file.
I still find it useful to have a level of abstraction, *in
particular* when the rules are really simple. This is what is done
in Turkish for example (it's not a complete set of patterns): [...]
Well, yes. If someone is going to do that work, I certainly won't
object :-)

BTW, here's the original program that creates the `word list' used to
derive the patterns:

http://git.savannah.gnu.org/gitweb/?p=cjk.git;a=tree;f=utils/pyhyphen


Werner
