Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stable and compact operator dictionary #104

Closed
fred-wang opened this issue Aug 5, 2021 · 25 comments
Closed

Stable and compact operator dictionary #104

fred-wang opened this issue Aug 5, 2021 · 25 comments
Labels

Comments

@fred-wang
Copy link
Contributor

In the past, I explained that there are strong requirements for browser implementations. In previous MathML versions, the operator dictionary was constantly changing and was not really covered by exhaustive non-regression testing... something not compatible with modern W3C spec practice. It was also huge and that's why I asked for more normalization & compactification before being able to upstream it to Chromium.

One year later, we are again in a state where the operator dictionary has changed and is out of sync with WPT tests and with what we upstreamed in Chromium. And the non-normative version is out-of-sync with the compact form too (see below).

We should ensure that the operator dictionary is stable and compact ; I really hope we can do it. If people disagree with that then I believe MathML Core should have its own compact & stable version for browser implemention & WPT, separate from the one used by MathML Full / xml-entities specs.

I'm proposing December 2021 for a freeze of the MathML Core dictionary. After that, I'll check again carefully the compactness of dictionary and update the WPT and browser implementations. Some things to notice:

  • Unicode 14 is intended to be released on September 14 2021. I really hope we don't add more entries though.

  • We should review the stretch orientation ( issue review operators stretching along the inline axis #102 )

  • 9e11468 modified the assert to prevent unintentional addition of unknown non-BMP characters. Now we have many new supplemental arrows-C in the non-normative version that are not handled by https://w3c.github.io/mathml-core/#operator-dictionary ; do we really need to add all these arrows ? If so, we can add try to add yet-another-special-case but I'd prefer to avoid that.

fred-wang added a commit that referenced this issue Aug 7, 2021
This allows to add this info to the non-normative human-readable table.
The estimated storage size for this information is also added.

#104
@davidcarlisle
Copy link
Collaborator

@fred-wang I think there is broad consensus that the op dict will need to be stable. The only disagreement if it should have been frozen begore even the first public working draft of mathml core, which seemed premature, but freezing in the timescale you suggest seems perfectly reasonable.

On stretch, I agree we should review and probably explicitly distinguish h and v stretch

supplemental arrows C I added as it seemed somehow the right thing to defer to Unicode here but they are marginal really and could just as easily be considered non mathematical sign writing symbols. So if the group has consensus to remove the operator dictionary entries for this block I wouldn't object (should be same for core and full)

On Unicode 14 there is an updated Unicode 14 unicode.xml in the u14 branch at https://github.com/w3c/xml-entities/tree/u14 but it has no diffs for the operator dictionary. I do hope that (somehow) we can incorporate the script/calliographic alphabets that are added in Unicode 14, but I don't think that should affect core it's just character data for an mi at that level.

@davidcarlisle
Copy link
Collaborator

Note since this issue was raised the Unicode 14 update has been merged in to main, with no changes to the operator dictionary, the Supplemental arrows c block is

https://w3c.github.io/xml-entities/1F8.html

If this is causing issues we could drop the operator dictionary entries (and drop this 1F8 page suggesting mathematical use) I don't think that would really lose anything, any character can be used in principle even if it has no math-specific spacing.

@fred-wang
Copy link
Contributor Author

As I said, I'd really prefer we don't add new entries for these arrows.

@davidcarlisle
Copy link
Collaborator

Yes OK, unless someone argues for them in next day or so, I'll lose them.

@davidcarlisle
Copy link
Collaborator

I updated unicode.xml in xml-entities at

w3c/xml-entities@1a70505

dropping the suppemental arrows C block (U+1F800) from the operator dictionary.

The python scripts here would need to be re-run to generate the versions of the operator dictionary data here. I could do that but leaving that with @fred-wang to coordinate with other changes, unless you ask me to run them.

@fred-wang
Copy link
Contributor Author

Thanks @davidcarlisle ; we will resume the mathml work in january. I'll check and coordinate the mathml op dict changes everywhere.

@davidcarlisle
Copy link
Collaborator

@fred-wang thanks: sounds good, and have a good and safe festive period.

@fred-wang
Copy link
Contributor Author

Because of step 4 (fallback when not explicit) of https://w3c.github.io/mathml-core/#dfn-algorithm-for-determining-the-properties-of-an-embellished-operator the following entries are problematic:

U+007C is in infixEntriesWithDefaultValues but also in ['prefixEntriesWithSpacing0AndStretchySymmetric', 'postfixEntriesWithSpacing0AndStretchySymmetric']
U+2234 is in infixEntriesWithDefaultValues but also in ['prefixEntriesWithLspace0Rspace0']
U+2235 is in infixEntriesWithDefaultValues but also in ['prefixEntriesWithLspace0Rspace0']
U+223C is in infixEntriesWithDefaultValues but also in ['prefixEntriesWithLspace0Rspace0']

Indeed, the current algo works as if infixEntriesWithDefaultValues was empty. Probably we'd need an extra step in the algo to handle that. But if some of these are essential, it would be good to simplify a bit.

I see it's important for U+007C and it's covered by mathml/presentation-markup/operators/mo-stretch-properties-001.html ; but AFAIK not for the others.

fred-wang added a commit that referenced this issue Jan 27, 2022
@fred-wang
Copy link
Contributor Author

CL to update in Chromium:
https://chromium-review.googlesource.com/c/chromium/src/+/3419439/1#message-f180e975c56b66947e855dd85100a1c3084a6714

fred-wang added a commit that referenced this issue Jan 28, 2022
… to one of the existing categories with a strong Python assertion.

#104
@fred-wang
Copy link
Contributor Author

fred-wang commented Jan 28, 2022

I uploaded a new patch on https://chromium-review.googlesource.com/c/chromium/src/+/3419439 ; trying to fix the remaining failures:

(1) mathml/presentation-markup/operators/mo-stretch-properties-001.html

This one is testing U+007C in different (implicit) forms. As I mentioned on #104 (comment) the current algo of the spec is broken for operators that use the default values in the infix forms but that have different values in other forms.

That's also true for U+2234, U+2235 and U+223C ; although I don't know if that's intentional for these three ones. If not, it would be better to avoid this situation. Otherwise, we should add WPT tests for them too I guess.

I added an assert to the Python script operator-dictionary.py in order to avoid having too many operators with this situation. Once we agree on a small list of exceptions, we can tweak the script to accept the corresponding code points.

In any case, since we definitely need that for U+007C, the spec should be changed to be able to handle these cases:

https://w3c.github.io/mathml-core/#operator-dictionary
"This compact form removes about 800 entries from the original operator dictionary that actually correspond to default values. They are not necessary since they are handled by the fallback case of 3.2.4.2 Dictionary-based attributes anyway." => this is true except for the exceptions mentioned above.
"For categories that don't have an encoding in Figure 26 (namely K, M) ..." => we need to handle the cases mentioned above by returning a "ForceDefault" status or similar.

https://w3c.github.io/mathml-core/#ref-for-dfn-algorithm-for-determining-the-properties-of-an-embellished-operator-1
"If no entry is found and the form of embellished operator was not explicitly specified as an attribute on its core operator, then user agents must try other dictionary entries for different values of F in the following order: infix, prefix, postfix;" => we need to skip this fallback for a "ForceDefault" status.

(I'll also probably tweak a bit the spec to clarify the execution and results of the algorithm retrieving the properties from the dictionary.)

(2) external/wpt/mathml/presentation-markup/operators/operator-dictionary-combining.html
This is a bug in the WPT test and was already failing (with different metrics) in the past. The operators.woff font is missing glyphs for the combining characters, fixed by
https://chromium-review.googlesource.com/c/chromium/src/+/3419439/2/third_party/blink/web_tests/external/wpt/mathml/tools/operator-dictionary.py

(3) external/wpt/mathml/presentation-markup/operators/operator-dictionary-spacing-001.html
external/wpt/mathml/presentation-markup/operators/operator-dictionary-spacing-002.html
external/wpt/mathml/presentation-markup/operators/operator-dictionary-spacing-004.html
external/wpt/mathml/presentation-markup/operators/operator-dictionary-spacing-005.html

These failures happened because the multi char ".." was removed from Operators_2_ascii_chars. I'm personally fine with removing exceptions. But it would be good to check whether that was intentional.

(4) external/wpt/mathml/presentation-markup/operators/operator-dictionary-spacing-001.html
This failure happened because the char U+2982 was added in category M which is handled specially. Again, I don't know whether that was intentional and less exception is always better for me. But if it's really justified I guess it's OK.

(5) Internal MathOperatorDictionaryTest
This failure happened because categories K and L got switched somehow. The script automatically orders categories by size, this was probably be due to new operators added/removed. I updated my CL to fix that and also pushed cf9221c

@NSoiffer
Copy link
Contributor

I think U+2234 (∴) and U+2235 (∵) are almost always prefix, so the infix versions can be dropped. I need to do a little research verify this though.

U+223C(∼) is similar to the ASCII "~". Probably both are used in interchangeable manners. There really are both prefix and infix usages. I don't think one usages is overwhelmingly common, so it is important to keep both. Again, I need to do some research on usage to check this.

@NSoiffer
Copy link
Contributor

I just looked at the algorithm and I don't understand why the entries you list are problematic. Step 4 is

If no entry is found and the form of embellished operator was not explicitly specified as an attribute on its core operator, then user agents must try other dictionary entries for different values of F in the following order: infix, prefix, postfix;

Can you explain why there is a problem with those entries being both prefix and infix given that the algorithm says to look for them (the wording probably should be tweak to say "...in the following order: infix, prefix, postfix and ignoring the value found above")? "+" and "-" have similar properties. I think I must be missing what the point/problem there is with those entries.

@bkardell
Copy link
Collaborator

Discussed on the call today, resolved to close this issue adding a comment to open a new issue for Step 4, if anyone feels that's needed.

@fred-wang fred-wang reopened this Jan 31, 2022
@fred-wang
Copy link
Contributor Author

Reopening since the work I mentioned is not complete.

Can you explain why there is a problem with those entries being both prefix and infix given that the algorithm says to look for them (the wording probably should be tweak to say "...in the following order: infix, prefix, postfix and ignoring the value found above")? "+" and "-" have similar properties. I think I must be missing what the point/problem there is with those entries.

One of the trick of the compact dictionary is to remove all entries that would use the default properties, since they are essentially redundant with the fallback.

However, people don't always group things with <mrow> properly so e.g. in
<mi>a</mi><mo>+</mo><mo>(</mo><mfrac><mi>x</mi><mi>y</mi></mfrac><mo>)</mo>
The inner <mo>(</mo> would be considered infix.

So to work around that kind of issue, step 4 says we should try the different forms as a fallback.

But now if an operator has an infix entry with default properties and another form with non-default properties, the non-default would now be used (since the infix one is not in the compact dictionary anymore).

So to keep making this work, we need to tweak the spec with something like as explained above and that I implemented in the CL.

The Python script checks operators having that issue and there are only four problematic operators: #104 (comment)

Note that the infix forms of + and - don't use the default properties, so we don't have this issue. Actually moving the infix forms of the problematic operators to an existing infix category could be an alternative way to resolve the problem.

fred-wang added a commit that referenced this issue Feb 1, 2022
... and the algorithm for determining the properties of an embellished
operator. There should be no behavior change. However section B.1 now
describes two algorithms:

- to determine the category of an operator
- to set the properties of an operator from its category

which are used by the algorithm to set the properties of an embellished
operator. This makes it more convenient for implementations and will
allow to work around the issue with multiple forms tried on operators
that can be both in the Default category for the infix form and in a
different category for a non-infix form.

#104
@fred-wang
Copy link
Contributor Author

fred-wang commented Feb 1, 2022

I rewrote the section for the operator dictionary.

This is the TODO list before the issue can be closed:

  • Check whether U+2982 Z NOTATION TYPE COLON was added on purpose in category M (infix, lspace=0, rspace=0.16666666666666666em, no boolean properties). That seems consistent with U+003A COLON.
  • Check whether ".." was removed on purpose from the list of 2-ASCII-char operators.
  • Fix issue with U+007C, U+2234, U+2235, U+223C having an infix form with default values but a non-infix form with different values.
  • Update spec and tests (and chromium implementation).

For the issue with U+007C, U+2234, U+2235, U+223C, there are several points raised:

  • Use non-default values for the infix form?
  • U+2234 and U+2235 are almost always prefix, so the infix versions can be dropped.
  • U+007C and U+223C have infix/prefix usages. But U+223C TILDE OPERATOR is currently not consistent with U+007E TILDE.
  • Introduced a small Forced Default category that would be handled as categories L and M and with the same properties as the Default category. This will make step 4 automatically skipped in the "algorithm for determining the properties of an embellished operator".

@davidcarlisle
Copy link
Collaborator

I think this is probably in error it is \mathrel not \mathpunct in unicode-math for latex, and there is discussion and links to the relevant Z specifications at this stackexchange answer https://tex.stackexchange.com/a/315327

I propose we revert this to be symmetric 5 5 (@NSoiffer ?)

  • .. was removed in the same commit We could put back as infix, (@NSoiffer ?)

  • issue with having infix matching the defaults seemed to be mainly an issue with the pruning of the operator dictionary which should not drop the entry in these cases were there is a non default entry with a different form. @fred-wang do I understand you are proposing a Forced Default category here to address this (it seems that way from the above comment but I don't see it in the current core draft, so I am not sure?

  • agree 2234 and 2235 can be prefix-only so we can drop the entries for infix.

  • 007E and 223C ~ and ∼ do not have the same spacing in LaTeX unicode-math either, the ascii ~ is a mathord so gets no space, \sim 223C is a \mathrel so has 5 left and right.

@fred-wang
Copy link
Contributor Author

fred-wang commented Mar 1, 2022

issue with having infix matching the defaults seemed to be mainly an issue with the pruning of the operator dictionary which should not drop the entry in these cases were there is a non default entry with a different form. @fred-wang do I understand you are proposing a Forced Default category here to address this (it seems that way from the above comment but I don't see it in the current core draft, so I am not sure?

Correct. I've rewritten a bit the spec, but I'm still waiting the final decision before updating the values and adding the Forced Default category. These changes are also performed in the pending CL https://chromium-review.googlesource.com/c/chromium/src/+/3419439

@davidcarlisle
Copy link
Collaborator

@fred-wang thanks for the clarification.

So to be specific I think we should make the following changes to unicode.xml

  1. revert the change to Z colon, making it \mathrel again
  2. delete infix because and therefore

I think I'd propose not adding .. as either infix or postfx as it's quite rare and even when used probably doesn't want extra space in many cases eg if you show ranges as 1..5 Also I propose not changing U+223C keeping at \mathrel and keeping ~ as \mathord.

so the following change in the xml-entities repository

diff --git a/unicode.xml b/unicode.xml
index 224eacf..75142d6 100644
--- a/unicode.xml
+++ b/unicode.xml
@@ -43593,7 +43593,6 @@ Barbara Beeton for the STIX project
             <desc>alias ISOTECH there4</desc>
          </entity>
          <font name="hlcry" pos="144"/>
-         <operator-dictionary priority="70" form="infix" lspace="5" rspace="5"/>
          <operator-dictionary priority="70" form="prefix" lspace="0" rspace="0"/>
          <description unicode="1.1">THEREFORE</description>
       </character>
@@ -43625,7 +43624,6 @@ Barbara Beeton for the STIX project
             <desc>alias ISOTECH becaus</desc>
          </entity>
          <font name="hlcry" pos="145"/>
-         <operator-dictionary priority="70" form="infix" lspace="5" rspace="5"/>
          <operator-dictionary priority="70" form="prefix" lspace="0" rspace="0"/>
          <description unicode="1.1">BECAUSE</description>
       </character>
@@ -59418,7 +59416,7 @@ Barbara Beeton for the STIX project
       <character id="U02982" dec="10626" type="other" mode="unknown">
          <unicodedata category="Sm" combclass="0" bidi="ON" mirror="N" mathclass="F"/>
          <mathlatex set="unicode-math">\typecolon</mathlatex>
-         <operator-dictionary form="infix" lspace="0" rspace="3" priority="60"/>
+         <operator-dictionary form="infix" lspace="5" rspace="5" priority="60"/>
          <description unicode="3.2">Z NOTATION TYPE COLON</description>
       </character>
       <character id="U02983" dec="10627" mode="math" type="opening">

Then @fred-wang can handle the forced default changes assuming those implementation changes land?

@NSoiffer does that look OK?

@fred-wang
Copy link
Contributor Author

@davidcarlisle So do is there any more change to decide for the op dict before we go ahead updating everything ?

@davidcarlisle
Copy link
Collaborator

@fred-wang not as far as I know, but we have a WG call on Thursday, we could put it on the agenda to get sign off on this then? (@NSoiffer )

As you probably saw I re-normalised the priority values but that is purely mechanical and doesn't affect the data used by core.

@NSoiffer
Copy link
Contributor

Looking at the z-notation colon

There aren't many examples of its usage, but here are two from the formal spec draft that seem to use relational operator spacing:
image

image

I'm pretty sure at one point I spent a awhile on the concrete syntax to figure out a few of the Z notation values, but my brain doesn't seem to see now what I saw earlier.

Given the examples, I agree with @davidcarlisle to go with relational op spacing.

@fred-wang
Copy link
Contributor Author

OK, I understand no more changes are planned, so will proceed to the update next week

@davidcarlisle
Copy link
Collaborator

@fred-wang yes I thought I left a comment here after the working group meeting on Thursday, but apparently I didn't post it, sorry.

@fred-wang
Copy link
Contributor Author

OK thanks. I pushed 3b54b10

@NSoiffer
Copy link
Contributor

This was merged into the spec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants