I'm using this code to generate U+10FFFC:

var s = ...(new byte[] { ... });

I know it's for private-use and such, but it does display a single character as I'd expect when displaying it. The problems come when manipulating this Unicode character. When I loop over the string and print each char, instead of it printing just the single character, it prints two characters (i.e. the string is apparently composed of two characters). If I alter my loop to add these characters back to an empty string, like so:

string tmp = "";
foreach (var ch in s)
    tmp += ch;

then at the end of this, tmp will print just a single character.

What exactly is going on here? I thought that a char contains a single Unicode character and that I never had to worry about how many bytes a character is unless I'm doing conversion to bytes. My real use case is that I need to be able to detect when very large Unicode characters are used in a string. Currently I have something like this, checking each char inside the loop:

foreach (var ch in s) { ... }

However, because of this splitting of very large characters, this doesn't work.

U+10FFFC is one Unicode code point, but string's interface does not expose a sequence of Unicode code points directly. Its interface exposes a sequence of UTF-16 code units. It is quite unfortunate that such a low-level view of text was grafted onto the most obvious and intuitive interface available. I'll try not to rant much about how I don't like this design, and just say that no matter how unfortunate, it is simply a (sad) fact you have to live with.

First off, I will suggest using char.ConvertFromUtf32 to get your initial string. Much simpler, much more readable:

var s = char.ConvertFromUtf32(0x10FFFC);

So, this string's Length is not 1 because, as I said, the interface deals in UTF-16 code units, not Unicode code points. U+10FFFC uses two UTF-16 code units, so s.Length is 2. All code points above U+FFFF require two UTF-16 code units for their representation.

You should note that ConvertFromUtf32 doesn't return a char: char is a UTF-16 code unit, not a Unicode code point. To be able to return all Unicode code points, that method cannot return a single char; sometimes it needs to return two, and that's why it returns a string. Sometimes you will find APIs dealing in ints instead of chars, because an int can hold any code point too (that's what ConvertFromUtf32 takes as an argument, and what ConvertToUtf32 produces as a result).

String implements IEnumerable<char>, which means that when you iterate over a string you get one UTF-16 code unit per iteration. That's why iterating over your string and printing it out yields broken output with two "things" in it. Those are the two UTF-16 code units that make up the representation of U+10FFFC. The first one is a high/lead surrogate and the second one is a low/trail surrogate. When you print them individually they do not produce meaningful output, because lone surrogates are not even valid in UTF-16, and they are not considered Unicode characters either.

When you append those two surrogates back to a string in the loop, you effectively reconstruct the surrogate pair, and printing that pair later as one unit gets you the right output.

And on the ranting front, note how nothing complains that you used a malformed UTF-16 sequence in that loop. It creates a string with a lone surrogate, and yet everything carries on as if nothing happened: the string type is not even the type of well-formed UTF-16 code unit sequences, but the type of any UTF-16 code unit sequence.
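To see these mechanics in one place, here is a small self-contained sketch. It relies only on the char and string members discussed in this answer; the class name and the expected outputs noted in the comments are illustrative:

using System;

class SurrogatePairDemo
{
    static void Main()
    {
        string s = char.ConvertFromUtf32(0x10FFFC);

        Console.WriteLine(s.Length);                   // 2: two UTF-16 code units
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True: lead surrogate
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True: trail surrogate

        // Appending the code units one by one quietly rebuilds the surrogate pair,
        // even though the intermediate one-char string is malformed UTF-16.
        string tmp = "";
        foreach (var ch in s)
            tmp += ch;

        // Recombining the pair gives back the original code point.
        Console.WriteLine(char.ConvertToUtf32(tmp, 0).ToString("X")); // 10FFFC
    }
}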
The char structure provides static methods to deal with surrogates: IsHighSurrogate, IsLowSurrogate, IsSurrogatePair, ConvertToUtf32, and ConvertFromUtf32. If you want, you can write an iterator that iterates over Unicode characters instead of UTF-16 code units, as an extension method of the form:

static IEnumerable<int> AsCodePoints(this string s)

with a for (int i = 0; i < s.Length; ...) loop inside that yields one value per code point (a complete sketch is given below). If you prefer to get each code point as a string instead, change the return type to IEnumerable<string> and the yield line to:

yield return char.ConvertFromUtf32(char.ConvertToUtf32(s, i));
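The loop body is easy to fill in with the surrogate helpers listed above. Here is one minimal, self-contained sketch of such an extension method, together with the "detect very large characters" use case from the question; the wrapper class name, the demo program, and the 0x100000 threshold are illustrative assumptions, not part of the original code:

using System;
using System.Collections.Generic;

static class StringCodePointExtensions
{
    // Iterate Unicode code points rather than UTF-16 code units.
    public static IEnumerable<int> AsCodePoints(this string s)
    {
        for (int i = 0; i < s.Length; ++i)
        {
            yield return char.ConvertToUtf32(s, i);
            // A code point above U+FFFF occupies two code units (a surrogate pair);
            // skip the low/trail surrogate so it is not visited twice.
            if (char.IsHighSurrogate(s, i))
                i++;
        }
    }
}

class Demo
{
    static void Main()
    {
        var s = "a" + char.ConvertFromUtf32(0x10FFFC);

        // The original goal: detect when very large characters are used in a string.
        foreach (int codePoint in s.AsCodePoints())
        {
            if (codePoint >= 0x100000)
                Console.WriteLine($"Found U+{codePoint:X}");
        }
    }
}

Note that char.ConvertToUtf32 throws on a lone surrogate, so this sketch assumes the input string is well-formed UTF-16.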