One of the most useful discussions at our working group F2F last week was the result of a question from Takeshi Kanai about how we calculate character offsets such as those used by Text Position Selector in the draft model.

> Any comment/suggestion welcome (I've cross-posted intentionally, please remove recipients if not appropriate.)
> TAG members - has the issue of dealing with symbols vs characters/codepoints come up in TAG discussion?
> Please feel free to come back again here or contact the I18N WG.
> what the Character Model has to say about this:
> I'd suggest we schedule a discussion of this issue in an upcoming call.
> Unfortunately for us, both considerations apply in the annotation use
> b) "user interaction is a primary concern" - in which case grapheme
> "code unit strings" (I presume interop with existing DOM APIs would also
> a) there are performance considerations that would predicate the use of
> clear that recommendation is to use character strings (i.e.
> The character model lays out the problems more clearly than I have.
> Thanks for this reference, Martin, and thanks for passing this to TAG,
> When transferring data, it is important that the other implementation counts offsets the same way.
> On my wishlist, I would hope that the new Annotation standard would include a normative list (SHOULD not MUST) of string counting functions for all major programming languages and other standards like SPARQL to tackle interoperability.
> (yes, we consider moving to a W3C community group for further improvement)
> - the "definition of string" section in the NIF spec:
> However, NFD is not in wide use and the annotation of diacritics is probably out of scope.
> in NFD you can annotate the code point for the diacritic separately.
> However, if people wish to annotate diacritics independently.
> There is a problem with Unicode Normal Form (NF).
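The normal-form problem raised in the thread can be illustrated with a small Python 3 sketch. It assumes code-point counting; the helper names `code_point_length` and `offset_of` are hypothetical and are not taken from any annotation specification. It shows that the same visible text yields different offsets depending on whether it is normalised to NFC or NFD.

```python
import unicodedata


def code_point_length(text: str, form: str) -> int:
    """Count code points after normalising `text` to the given normal form."""
    return len(unicodedata.normalize(form, text))


def offset_of(text: str, target: str, form: str) -> int:
    """Code-point offset of `target` in `text` after normalisation."""
    return unicodedata.normalize(form, text).index(target)


word = "Z\u00fcrich"  # "Zürich", written with an escape to avoid source-encoding issues

# NFC keeps 'ü' as one precomposed code point; NFD splits it into
# 'u' followed by U+0308 COMBINING DIAERESIS.
print(code_point_length(word, "NFC"), code_point_length(word, "NFD"))

# Every offset after the 'ü' shifts by one between the two forms.
print(offset_of(word, "r", "NFC"), offset_of(word, "r", "NFD"))
```

Under these assumptions the word has 6 code points in NFC and 7 in NFD, so an annotation on "r" sits at offset 2 or 3 depending on the form. Two implementations that disagree on the normal form would therefore disagree on offsets even if both count code points.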
Personally, I think a byte offset for text is unnecessary, simply because code points are better, i.e.
> Anyhow, I wouldn't know a single use case for using Code Units for annotation.
> It was quite difficult to work with the byte offset given that the original formats were HTML, txt, PDFs and docx.
> For the NLP2RDF project we converted these 30 million annotations to RDF:
> Python, len() in combination with decode(): len("ä".decode("UTF-8")) = 1
> Any deviation will lead to side effects such as "ä" having the length 2:
> Regarding annotation, using code points or Character Strings is definitely the best practice.
> On the (serialized) web, UTF-8 is predominant, which is really not the question here as the choice between graphemes, code points and units is orthogonal to encoding.
> Maybe some DOM parsers rely on UTF-16 internally too, but still count Code Points
> C/C++ has a datatype widechar using 16 bits as it is easier to allocate memory for variables.
> This means that you can use byte offsets easily to jump to certain positions in the text.
> While UTF-8 has a variable length of one to four bytes per code point, UTF-16 and 32 have the advantage of a fixed length.
> you can encode the same code point in UTF-8, UTF-16 and UTF-32 which will definitely change the number of code units and bytes needed.
> UTF-16 is the encoding of the string and is independent of code points, units and graphemes, i.e.
> From my understanding the example in is not good:
> I am a bit puzzled why is renaming Unicode Code Points (a clearly defined thing) to Character String.

Here I show it in python (note the u'xxx' is a UTF-16):
That you can calculate an offset of a character is not true.
For example characters of the use 4 bytes
Like UTF-8 it can use up to 4 bytes.
Cases where UTF-16 uses 4 bytes may be 'pathological', but the assumption
UTF-16 is **not** a fixed length encoding.
> UTF-16 and 32 have the advantage of a fixed length.
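The disagreement above about fixed-length encodings can be checked with a short sketch. It assumes Python 3, where `len()` on a `str` counts code points (the thread itself uses Python 2 syntax), and the helper name `lengths` is hypothetical. It compares code points, UTF-16 code units and UTF-8 bytes for the same text.

```python
def lengths(text: str) -> dict:
    """Return the size of `text` in three different units."""
    return {
        # Python 3 strings are sequences of code points.
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; "-le" avoids counting a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


samples = {
    "ASCII letter": "a",
    "a with diaeresis": "\u00e4",
    "musical G clef (outside the BMP)": "\U0001D11E",
}

for label, sample in samples.items():
    print(label, lengths(sample))
```

Each sample is a single code point, yet "ä" takes 2 bytes in UTF-8 and the G clef takes 2 code units (4 bytes) in UTF-16. That is the sense in which UTF-16 is a variable-length encoding, and why offsets counted in code units or bytes diverge from offsets counted in code points.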
> While UTF-8 has a variable length of one to four bytes per code point,

To: Sebastian Hellmann, Public TAG List
CC: W3C Public Annotation List, nlp2rdf

The offsetByCodePoints() method returns the index within this sequence that is offset from the given index by codePointOffset code points.

Declaration

Following is the declaration for the offsetByCodePoints() method:

public int offsetByCodePoints(int index, int codePointOffset)

codePointOffset − This is the offset in code points.

This method returns the index within this sequence.

IndexOutOfBoundsException − if index is negative or larger than the length of this sequence, or if codePointOffset is positive and the subsequence starting with index has fewer than codePointOffset code points, or if codePointOffset is negative and the subsequence before index has fewer than the absolute value of codePointOffset code points.

The following example shows the usage of the offsetByCodePoints() method.

StringBuilder str = new StringBuilder("abcdefg");
// returns the index within this sequence
int retval = str.
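The translation from a code-point offset to an index in a UTF-16 sequence can be sketched in Python 3. This is an illustrative analogue, not the Java method: the helper name `utf16_index` is hypothetical, and it only covers the simplest case of a non-negative offset counted from the start of the text.

```python
def utf16_index(text: str, code_point_offset: int) -> int:
    """Return the UTF-16 code unit index reached after `code_point_offset` code points."""
    if not 0 <= code_point_offset <= len(text):
        raise IndexError("code point offset out of range")
    # Code points above U+FFFF are stored as a surrogate pair,
    # so they occupy two UTF-16 code units instead of one.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:code_point_offset])


text = "a\U0001D11Eb"  # 'a', musical G clef (outside the BMP), 'b'

for offset in range(len(text) + 1):
    print(offset, utf16_index(text, offset))
```

In this sketch the code-point offset 2 (the position of "b") maps to code unit index 3, because the G clef before it occupies two code units. A consumer that stores offsets in code points but indexes a UTF-16 string would need this kind of conversion to land on the right character.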