string

java.lang.Object
- zen.lang.string

```
public class string
extends java.lang.Object
```
A string in its simplest form is just a "cookie".
0. Contents
1. Introduction
2. Understanding characters
3. Some Java "characters"
4. More about Java's limitations and this class
5. Further Reading
6. UTF-8 Club Rules

TLDR:
Use UTF-8 strings (byte[]) wherever possible, especially between different code bases and when communicating with filesystems, and as the preferred encoding for text files. Index the UTF-8 string (byte[]) to speed up locating code points and graphemes within it (Java doesn't do this, so we've work to do).
Refer to http://utf8everywhere.org/
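A minimal sketch of the indexing idea, assuming the text is held as a well-formed UTF-8 byte[]. The class name `Utf8Index` and its methods are hypothetical and are not part of this class; the sketch only records where each code point starts, so that code point N can be located without rescanning from the start.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: byte offsets of code point starts in well-formed UTF-8. */
final class Utf8Index {

    /** Returns the offset of each code point's first byte, plus a final entry equal to utf8.length. */
    static int[] codePointOffsets(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) count++;   // not a continuation byte (10xxxxxx)
        }
        int[] offsets = new int[count + 1];
        int n = 0;
        for (int i = 0; i < utf8.length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) offsets[n++] = i;
        }
        offsets[n] = utf8.length;              // sentinel: end of the last code point
        return offsets;
    }

    /** Decodes the code point at code point index cpindex using the prebuilt index. */
    static int codePointAt(byte[] utf8, int[] offsets, int cpindex) {
        int start = offsets[cpindex];
        int length = offsets[cpindex + 1] - start;
        return new String(utf8, start, length, StandardCharsets.UTF_8).codePointAt(0);
    }
}
```

Building the index is one O(n) pass; each lookup afterwards is a constant-time array access followed by decoding at most 4 bytes.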
1. UTF-8 strings can be safely scanned for ASCII characters (the 7-bit set) which includes all normal punctuation and in particular directory separators and path separators in all known operating systems.
2. UTF-8 is also self-synchronizing in the face of one or more missing bytes from the stream; only those characters with missing or corrupted bytes will be in error and all other characters are read correctly - this is not the case for example with UTF-16 encoded strings.
These two features alone (along with Unix and Linux compatibility) make UTF-8 the most desirable text encoding to use in almost all cases!
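Both features can be sketched in a few lines, assuming well-formed UTF-8 input. `Utf8Scan` and its method names are hypothetical. The separator scan is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a byte equal to '/' can only ever be a real '/'.

```java
/** Hypothetical helper illustrating ASCII scanning and resynchronization on UTF-8 bytes. */
final class Utf8Scan {

    /** Returns the index of the last '/' byte, or -1 if there is none. No decoding is needed. */
    static int lastSeparator(byte[] utf8) {
        for (int i = utf8.length - 1; i >= 0; i--) {
            if (utf8[i] == '/') return i;
        }
        return -1;
    }

    /** From an arbitrary byte offset, skips continuation bytes to reach the next code point start. */
    static int resync(byte[] utf8, int from) {
        int i = from;
        while (i < utf8.length && (utf8[i] & 0xC0) == 0x80) {
            i++;                               // 10xxxxxx: middle of a multi-byte sequence
        }
        return i;
    }
}
```

A reader that lands mid-sequence, for example after dropped bytes, loses at most the damaged character before `resync` finds the next lead byte.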
1. Introduction
A deficient string is any string not amenable to counting its characters, i.e. "characters as the user perceives them" or "graphemes".
java.lang.String is a deficient string class, no more than a lowly cookie, incapable of counting the characters it stores for display, at least when it comes to Unicode text.
This class is the beginnings of a proper Java string class, which might be thought of as a UTF-8 encoded "grapheme string" - Java as of version 8.0 only has APIs to work with Unicode code points (codepoints) and not "visual or displayed characters" such as those with multiple accents (Unicode "combining characters"). Surprised? I was, thus this class.
The term "UTF-8 grapheme string" is a misnomer since UTF-8 defines a byte sequence encoding of Unicode/UCS code points which include graphemes as well as non-graphemes.
One or more (thankfully only adjacent) Unicode code points may constitute a grapheme.
Cover your ears, I'm about to shout. This string class simplifies working with THE THING WE OFTEN NEED OUTSIDE OF THE STRING ITSELF, that is graphemes. Graphemes are those things a user perceives visually as a character or "single letter" for the purposes of cursor movement, deletion, insertion, selection, cutting and pasting.
This class' current limitations are due to relying upon java.text.BreakIterator and its getCharacterInstance(Locale) method, which only provides Unicode code point breaks and not "visual" character (grapheme) breaks.
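To see for yourself which boundaries the iterator reports, something like the following can be used. `CharacterBreaks` is a hypothetical name; the calls are the standard java.text.BreakIterator API. The boundaries reported for combining sequences may differ between JDK releases, so this sketch makes no claim about them beyond the first and last index.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: lists the boundaries BreakIterator reports for a string. */
final class CharacterBreaks {

    /** Returns the char (UTF-16 code unit) indices of each reported "character" boundary. */
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }
}
```

Calling `CharacterBreaks.boundaries("A\u0301", Locale.ROOT)` on your own JDK shows whether the accent is kept with its base letter.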
Java as of version 8.0 does not provide Character methods such as isCombiningCharacter(codepoint) nor isUnicodeControlCharacter(codepoint); TODO - try using icu4j instead
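Until something like icu4j is wired in, the Unicode general category from `Character.getType(int)` can stand in for the missing methods. This is only a sketch: `CombiningMarks` and its method names are hypothetical, and treating the mark categories as "combining" and the Cc/Cf categories as "control" is an assumption, not a full grapheme algorithm.

```java
/** Hypothetical stand-ins for the missing Character methods, based on general category. */
final class CombiningMarks {

    /** True for code points in the mark categories (Mn, Mc, Me). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** True for control (Cc) and format (Cf) code points, the BOM included. */
    static boolean isControlOrFormat(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.CONTROL || type == Character.FORMAT;
    }

    /** Rough approximation only: counts the code points that are not marks. */
    static int countNonMarks(String s) {
        return (int) s.codePoints().filter(cp -> !isCombiningMark(cp)).count();
    }
}
```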
For comparison with Java, see Swift's Strings and Characters; Swift is much younger, better and certainly has the benefit of hindsight. Perl 6 also seems to be getting closer to competency, see Graphemes, code points, characters and bytes.
2. Understanding Characters
1. Programming language types such as char, byte[] and int store arbitrary values. In a program such values may for example have context specific meaning such as a mapping to ASCII characters, Unicode code points or to Unicode UTF-16 surrogate halves ("paired 16 bit code units" halves, see below) or even to graphemes, grapheme clusters, glyphs or any sequence of such or other things.
2. The Unicode standard specifies Unicode code points, various code point encodings, encoding specific storage 'code units' and code unit sequences (how code points are represented in an encoding), combining characters and their normalization forms, the optional byte order mark (BOM), how sequences of code points map (algorithmically) to graphemes and more.
3. A Unicode code point is any value in the Unicode codespace. It represents one of a number of things, including characters of the world's languages/scripts, Unicode control characters, the BOM, specials (I'm sure they are), locals, combining characters and more, see code point type.
4. Code points are not graphemes though some code points map to graphemes depending on their context in a code point sequence.
5. Code points were originally defined to be 31 bits but since ISO 10646-2 in 2001 are promised to be at most 21 bits in size in the future, the consequence being all code points will be representable in a UTF-8 byte stream with at most 4 bytes (4 * 8-bits).
6. A Unicode combining character (sometimes called a combining mark) modifies another character; examples include the diacritical marks, International Phonetic Alphabet (IPA) symbols, IPA diacritics, the combining grapheme joiner, double diacritics which are combined with (and are placed across) two characters, and more. Multiple combinations of combining characters are valid to use in sequence. See Unicode's Characters and Combining Marks.
7. A Unicode grapheme represents a user perceived character which the user might move over with a text cursor using a "single character" cursor movement or might delete with a single press of the <BACKSPACE> or <DELETE> key.
8. A grapheme cluster is the ideal way to represent "characters as the user perceives them" and we who had enjoyed Java's hegemony pine with grief at Swift's competency in this regard.
9. At the time of typing this, the author is unable to clearly differentiate between graphemes and grapheme clusters.
10. A grapheme is represented by a sequence of one or more code points, such as LATIN CAPITAL LETTER A (U+0041, "A") followed by COMBINING ACUTE ACCENT (U+0301, " ́" - this may not display as one might expect).
11. Some letter with combining character combinations are also included as single code points, such as LATIN CAPITAL LETTER A WITH ACUTE (U+00C1, "Á").
12. So some code points do not constitute, and are not part of the representation of, any grapheme.
13. Unicode specifies various normalization forms where combining characters are used where possible or avoided where possible, in a code point stream; this can provide for binary equivalence comparison. See also java.text.Normalizer.
14. A glyph is an element of a font which is displayed to the user within a textual or graphical display environment.
15. One or more glyphs may be needed to correctly display a single grapheme, for example with a glyph corresponding to a combining character (represented by one code point) where the combining character follows and is intended to be overlaid visually on ("combined" with) the preceding one or two characters (each represented by their own code points) in the code point sequence.
16. A font specifies how graphemes, or visual characters as defined by some other standard, map to glyphs. Only some fonts include glyphs for all displayable code points. Only some font engines or display environments support combining characters.
17. A Java char is not a grapheme, but some graphemes can be represented in a char instance or "variable".
18. A Java char is not even a code point, although:
  a) it can represent some code points namely those in the Unicode BMP or "basic multilingual plane" which is the first 16-bit group or Unicode "plane" of code points, and
  b) it can also hold UTF-16 surrogate code point halves ("surrogates") for those code points outside the BMP.
  If you're not impressed with Java's chars, neither am I - they're not very impressive at all. byte - now there's a powerful type!
19. A well formed surrogate is one half of a "paired 16 bit code units" code point, whereas an ill formed surrogate is unpaired, i.e. without its matching half, and is an error condition.
  Refer to the Unicode FAQ What are surrogates? and How do I convert an unpaired UTF-16 surrogate to UTF-8?
20. A Unicode encoding scheme defines a bit format for storing, transmitting or otherwise representing code points in a digital medium.
21. Encodings encode code points into a sequence of one or more Unicode code units. A code unit is a "storage unit" (8, 16 or 32 bit value) in an encoding. Code unit format and sequencing are specific to an encoding.
22. Unicode encodings include for example UTF-8, UTF-16LE and UTF-32. Some encodings cannot normally represent all unicode code points, for example ASCII. UTF-8 can represent all code points.
23. Due to endian issues, anything other than UTF-8 is not self describing without a BOM or byte order mark also known originally as "zero-width non-breaking space" (this whitespace usage is still valid but since deprecated in Unicode 3.2 oh happy joy). The BOM messes up UTF-8 data streams where it's not expected or not otherwise handled correctly and the joy only increases.
24. So where at all possible in UTF-8 streams (which we should almost always be using) never use a BOM.
25. In Java e.g. at java.lang.String#codePointAt(index) the index points at a char value which the javadoc names as a "Unicode code unit". Beware as this is neither a code point, character, grapheme nor glyph - nothing but a lowly, miserable little char! "Code unit" as used here in Java is just a fancy name for "I'm so damn cool I can spel UTF-16BE BMP code point in a char and UTF-16LE non-BMP code point surrogate half "code unit" in a char take that! Haha try and figure out which." Resist the temptation to figure out the riddle. Don't be fooled friends - it's useless for anything much at all except reminding us what a bad design decision Java's String class was (in hindsight).
26. Next, java.text.BreakIterator#getCharacterInstance(Locale) returns break indices for (Unicode) code point boundaries and not for graphemes, characters, glyphs, chars or anything useful really. This iterator, better than nothing I guess, is merely a code point surrogate-pair extractor for Java's UTF-16 strings, slapping all comers in the face with its irreverent disregard for invisibles, controls, specials, even combining characters and in fact anything remotely resembling graphemes, let alone those hallowed unreachable glyphs! The use of the term "character" in this method's name is both misleading and wildly overloaded and horrendously overrated, especially in Java context!
27. But wait it gets better - if you want to determine graphemes (let's not even talk about glyphs), nope that can't be done (at least as of Java 8).
28. Java's String class ostensibly stores a UTF-16 encoded code point sequence in a char[]. There is nothing in principle stopping a UTF-8 byte[] implementation (which is evidently indicated in the hindsight of history) and this ought be nothing more than a piddly implementation detail, notwithstanding any legacy yet-to-be-converted API's and their snivelling performance issues - which would likely act as a rather practical and motivating todo list. Righto - let's get cracking then shall we? In the meantime, without handling graphemes, java.lang.String defeats its own purpose surprisingly well.
29. Finally to highlight unequivocally the problems we programmers face, some graphemes are normally displayed using wide glyphs, some normally with narrow glyphs and others still can be either, according to the font designer's æsthetic vehemence (ok, ok, "or lack thereof" - happy?).
30. In conclusion the appearance of a displayed grapheme (and in particular for programs targeting a text based environment) depends on the particular textual or graphical environment in use including the specific font and the characteristics of the grapheme's corresponding glyph (or combination of glyphs) in the font, being used in the environment to display that grapheme.
  For example an xterm terminal is a textual display environment in which Java provides no API for determining glyph widths and we can't even determine grapheme boundaries yet - Java evidently has a way to go on the basic text processing front.
  In Java we need to get to first base - determining the graphemes in a string! Java just manages to figure out single code points but peters out before anything useful to users.
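Several of the points above (chars versus code points, surrogate halves, combining sequences and normalization) can be checked with nothing but the JDK. The class name `UnitsDemo` is hypothetical; every call is standard java.lang or java.text API.

```java
import java.text.Normalizer;

/** Hypothetical demo of chars, code points and normalization on small strings. */
final class UnitsDemo {

    /** LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT: two code points, one grapheme. */
    static final String DECOMPOSED = "A\u0301";

    /** LATIN CAPITAL LETTER A WITH ACUTE: the same grapheme as a single code point. */
    static final String COMPOSED = "\u00C1";

    /** A code point outside the BMP, stored by Java as a surrogate pair. */
    static final String SUPPLEMENTARY = new String(Character.toChars(0x1F600));

    /** Number of Java chars, i.e. UTF-16 code units. */
    static int chars(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Normalization form C: composes combining sequences where a composed form exists. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}
```

Here `DECOMPOSED` and `COMPOSED` look identical on screen yet are not `equals` until normalized, and `SUPPLEMENTARY` has a `length()` of 2 for a single code point. Neither number is the grapheme count.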
3. Some Java "Characters"
Some of the "character" concepts a Java programmer may have to handle with JNI or otherwise:
1. 7-bit ASCII characters.
2. Java's char type or "native character" which as we now know can only store a BMP code point or UTF-16 code point surrogate half.
3. Unicode UTF-8 "code units" (8 bit values).
4. Unicode UTF-16 BMP code point code units (16 bit values), mildly analogous to ASCII values.
5. Unicode UTF-16 code point surrogate halves (code units).
6. Unicode UTF-32 code point code units.
7. C and C++ types char, byte, int, wchar_t, char16_t, char32_t, wint_t and more still.
8. Unicode code points as represented in a Java int.
9. Unicode code points as represented in a Java char[] of length 1 or 2.
10. Unicode code points as distinguished by char boundaries as returned by BreakIterator.getCharacterInstance(Locale)#next().
11. Unicode non character code points such as control characters, which must be avoided when needing to work with graphemes.
12. Graphemes or Unicode code point sequences which correspond to user perceived characters - good luck with that.
13. Glyphs or those elements of a font which correspond in some way to code points and can be combined to display a representation of graphemes and are displayed to a user in a textual and or graphical display environment.
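Items 5, 8 and 9 in the list above are the ones most often confused, so here is a small sketch of moving one code point between its int form and its char[] form. `CodePointForms` is a hypothetical name; the conversions are the standard java.lang.Character methods.

```java
/** Hypothetical helper: one code point in its int and char[] representations. */
final class CodePointForms {

    /** Item 9: a char[] of length 1 for a BMP code point, or length 2 (a surrogate pair) otherwise. */
    static char[] asChars(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** Item 8: reads the code point that starts at index 0 of chars back into an int. */
    static int asInt(char[] chars) {
        return Character.codePointAt(chars, 0);
    }

    /** Item 5: true if the char is one half of a surrogate pair rather than a code point. */
    static boolean isSurrogateHalf(char c) {
        return Character.isSurrogate(c);
    }
}
```

For U+1F600 the two halves are 0xD83D and 0xDE00; neither means anything by itself.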
4. More about Java's limitations and this class
This Java class is an "alternative string" rough draft. There's plenty it does not (yet) do which needs to be done, although the first step of understanding the problem has hopefully been contributed to with the documentation here.
Certainly this class ought be built on something like java.nio.ByteBuffer or a byte[] (which it is not currently).
Information about the "current" width of "ordinarily wide" and "possibly wide" and even "in the current font, unexpectedly wide" graphemes, determination of grapheme boundaries and more is yet to be implemented.
For starters some API is needed for textual display environments to access the width of each glyph in the current environment, notification of environment changes (such as a new font being applied to the environment) so that glyph width or other calculations can be redone if needed, and to what extent the current environment supports combining, potentially wide and other interesting characters.
Glyph widths might for example in a text environment be in some suitable environment units, with 'normal' characters being width 1 and 'wide' characters being width 2. This is presently an exercise for the excessively curious and diligent.
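For the excessively curious, a very rough sketch of the width-1/width-2 idea follows. Everything here is an assumption: `CellWidth` is hypothetical, marks and control/format characters are given width 0, and "wide" is guessed from the Unicode script (Han, Hiragana, Katakana, Hangul) rather than from real East Asian Width data or from the font in use. It will be wrong for halfwidth forms, emoji and many other cases.

```java
/** Hypothetical, deliberately naive estimate of terminal cell widths. */
final class CellWidth {

    /** Estimated width of one code point in terminal cells: 0, 1 or 2. */
    static int cellWidth(int codePoint) {
        int type = Character.getType(codePoint);
        if (type == Character.NON_SPACING_MARK || type == Character.ENCLOSING_MARK
                || type == Character.CONTROL || type == Character.FORMAT) {
            return 0;                          // assumed to take no cell of its own
        }
        switch (Character.UnicodeScript.of(codePoint)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
                return 2;                      // assumed 'wide'
            default:
                return 1;                      // assumed 'normal'
        }
    }

    /** Sum of the estimated widths of all code points in s. */
    static int textWidth(String s) {
        return s.codePoints().map(CellWidth::cellWidth).sum();
    }
}
```

A real implementation would have to ask the display environment, which is exactly the API that is missing.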
Note: There's no point being upset with Java's UTF-16 String class and char primitive, since at the time these were chosen for Java, it was erroneously believed by those making that choice that 16 bits would be enough to encode all the world's characters and therefore "should be enough for anyone, even a programming language."
BUT there IS a point in complaining that, once Java 1.0 was released and before Java 1.1 was released, Java should have had a serious String enhancement plan of action, due to the fundamental change in Unicode with which Java has forever been fundamentally incompatible (practically, as in for regular day to day programming, or for anything short of herculean efforts on the programmer's part anyway). Swift gets it right whilst Java is 20 years late to the party.
WARNING: Java's current usage of bastardized UTF-8 ("modified UTF-8") in some cases such as JNI ought be eliminated, removed, stopped, stamped out and completely extricated from a future version of Java. Bring on the deprecations.
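The difference is easy to observe from pure Java, because `DataOutputStream.writeUTF` is documented to write modified UTF-8; it is used here as a stand-in for the JNI case. `ModifiedUtf8` is a hypothetical name. The sketch strips the 2-byte length prefix that `writeUTF` adds, so the two encodings can be compared directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper comparing modified UTF-8 with standard UTF-8. */
final class ModifiedUtf8 {

    /** Modified UTF-8 bytes as written by writeUTF, without its 2-byte length prefix. */
    static byte[] modified(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Standard UTF-8 bytes. */
    static byte[] standard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Two differences show up: U+0000 becomes the two bytes C0 80 instead of a single zero byte, and a code point outside the BMP becomes 6 bytes (each surrogate half encoded separately) instead of 4.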
Note: It is probably in the interests of the Java language to have a class possibly similar to this one as a "new string" type perhaps named java.lang.string (note the lower case s in string). Such strings should be stored in UTF-8 byte arrays as canonical storage format and may require some language (read, compiler) support for auto casting or more.
Or it may be that the best approach is to simply replace Java's String class (and any and all dependent library APIs) so that it is built on a "sane" UTF-8 encoded byte[] and deprecate all API (in all the Java libraries) that violate this assumption for "performance" or any other grounds.
It is now self evident that Java's char type is actually not relevant and not particularly useful outside of Java String's deformed birth due to the false assumptions made when a Java class named String was impregnated with chars.
In hindsight if char is for programming convenience, what might be more useful is a grapheme type say a UTF-8 byte[] or a code point int[], perhaps with language support to cast to and from a "new string". Until this is tidied up in Java we unfortunately have a bit of a mess on the string front and nothing even approaching a sexy Grapheme String; I can't help wondering how such a verbose name as java.lang.GraphemeString might be shortened. This class started out named "Graphemes" until it became obvious it's just a "sane string".
There are plenty of API thoughts to think, and some questions are in the code comments in this class. Definitive answers would be nice. Swift appears to get it basically right.
In case it was unclear note also that string literals as they appear in source files are in the charset/encoding of the source file itself, whatever that happens to be (of course we do insist on UTF-8).
5. Further reading
For further information see the following:
- http://utf8everywhere.org/ - UTF-8. And why.
- http://unicode.org/glossary/ - Unicode Glossary
- http://unicode.org/reports/tr15/ - Unicode normalization forms.
- http://unicode.org/faq/utf_bom.html#utf16-2 - Unicode, What are surrogates?
- http://unicode.org/faq/utf_bom.html#utf8-5 - Unicode, How do I convert an unpaired UTF-16 surrogate to UTF-8?
- http://www.oracle.com/us/technologies/java/supplementary-142654.html - Supplementary Characters in the Java Platform
- http://userguide.icu-project.org/ - Introduction to ICU
- http://en.wikipedia.org/wiki/Unicode
- http://en.wikipedia.org/wiki/Combining_character
- http://en.wikipedia.org/wiki/Diacritics
- http://en.wikipedia.org/wiki/Specials_(Unicode_block)
- http://en.wikipedia.org/wiki/Unicode_control_characters
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/char_combmark.html - Characters and Combining Marks
- http://unicode.org/cldr/utility/character.jsp?a=0301 - Unicode character property viewer
- http://www.regular-expressions.info/unicode.html - Unicode Regular Expressions
- http://stackoverflow.com/questions/3956734/why-does-the-java-char-primitive-take-up-2-bytes-of-memory
- http://www.cl.cam.ac.uk/~mgk25/unicode.html - UTF-8 and Unicode FAQ for Unix/Linux
- http://unix.stackexchange.com/questions/139493/how-can-i-make-unicode-symbols-and-truetype-fonts-work-in-xterm-uxterm
- https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html - Swift
- http://www.cattlegrid.info/blog/2014/12/graphemes-code-points-characters-and-bytes.html - Perl 6
- http://illegalargumentexception.blogspot.co.nz/2009/05/java-rough-guide-to-character-encoding.html - Java, fails to properly determine visible graphemes from combining chars etc.
- http://stackoverflow.com/questions/6828076/how-to-correctly-compute-the-length-of-a-string-in-java - Java
- https://docs.oracle.com/javase/tutorial/i18n/text/boundaryintro.html - still more limited Java
6. UTF-8 Club Rules
Rule 1. UTF-8 is the only charset encoding.
Rule 2. See rule 1.
See Also:

java.nio.charset.StandardCharsets, java.text.BreakIterator, java.text.Normalizer, java.util.Locale

Constructor Summary

Constructors
Constructor and Description
`string()` Create an empty string.
`string(java.lang.Object text)` Initialize a new string using `text` and default locale.
`string(java.lang.Object text, java.util.Locale locale)` Initialize a new string using `text` and `locale`.

Method Summary

All Methods Static Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`appendCodePointsTo(java.lang.StringBuilder sb, int cpistart, int cpiend)` Append code points in the range `(cpistart, cpiend)` into `sb`; `cpiend` may be less than `cpistart` in which case the code points are appended in reverse order.
`void`	`appendGraphemesTo(int flags, java.lang.StringBuilder sb)` Append all graphemes to `sb`.
`void`	`appendGraphemesTo(int flags, java.lang.StringBuilder sb, int gistart, int giend)` Not implemented yet; throws UnsupportedOperationException.
`void`	`appendTo(java.lang.Object bb)` Appends all of this string to a ByteBuilder (such class does not yet exist).
`void`	`appendTo(java.lang.StringBuilder sb)` Appends all of this string to a StringBuilder.
`byte[]`	`codePointsToBytes(int cpindex)` Returns a byte[] containing the chosen code point encoded as UTF-8.
`byte[]`	`codePointsToBytes(int cpistart, int cpiend)` Returns a byte[] containing the chosen code points encoded as UTF-8.
`char[]`	`codePointsToChars(int cpistart, int cpiend)` Returns a UTF-16 encoded char[] range (reversible) of code points.
`int[]`	`codePointsToInts()` Returns this string as code points in an int[].
`int[]`	`codePointsToInts(int cpistart, int cpiend)` Returns the code points in the range `(cpistart, cpiend)`.
`char[]`	`codePointToChars(int cpindex)` Returns a UTF-16 encoded char[] of the code point at `cpindex`.
`int`	`codePointToInt(int cpindex)` Returns the code point at `cpindex`.
`int`	`countCodePoints()` Returns the number of unicode code points in this string.
`int`	`countGraphemes(int flags)` Returns the number of graphemes in this string; Not implemented yet; throws UnsupportedOperationException.
`int`	`countUTF16Units()` Count the number of Unicode UTF-16 code units required to UTF-16 encode this string.
`int`	`countUTF8Units()` Count the number of Unicode UTF-8 code units required to UTF-8 encode this string.
`int[]`	`getCodePointUTF16Indices()` Returns the grapheme boundary cpcindices into the utf-16 char[] used internally; this array of indices is used internally.
`byte[]`	`graphemesToBytes(int flags, int gindex)` Not implemented yet; throws UnsupportedOperationException.
`byte[]`	`graphemesToBytes(int flags, int gistart, int giend)` Not implemented yet; throws UnsupportedOperationException.
`char[]`	`graphemesToChars(int flags, int gindex)` Returns a char[] containing the chosen grapheme encoded as UTF-16.
`char[]`	`graphemesToChars(int flags, int gistart, int giend)` Not implemented yet; throws UnsupportedOperationException.
`boolean`	`hasBOM()` Returns true if this string contains one or more Unicode "Byte Order Mark" characters; Not implemented yet; throws UnsupportedOperationException.
`boolean`	`hasStartingBOM()` Returns true if this string starts with the Unicode "Byte Order Mark" character; Not implemented yet; throws UnsupportedOperationException.
`void`	`insertCodePointsInto(int sbdest, java.lang.StringBuilder sb, int cpistart, int cpiend)` Inserts a range (reversible) of code points into a StringBuilder.
`void`	`insertGraphemesInto(int flags, int sbdest, java.lang.StringBuilder sb)` Insert graphemes into sb.
`void`	`insertGraphemesInto(int flags, int sbdest, java.lang.StringBuilder sb, int gistart, int giend)` Not implemented yet; throws UnsupportedOperationException.
`void`	`insertInto(int sbdest, java.lang.Object bb)` Inserts all of this string into a ByteBuilder (such class to be implemented).
`void`	`insertInto(int sbdest, java.lang.StringBuilder sb)` Inserts all of this string into `sb`.
`static void`	`main(java.lang.String[] args)` tests
`void`	`setText(java.lang.Object text)` Resets this grapheme string with `text` using default locale.
`void`	`setText(java.lang.Object text, java.util.Locale locale)` Resets this grapheme instance with `text` and specified `Locale`.
`byte[]`	`toBytes()` Returns all of this string encoded as a UTF-8 byte[].
`char[]`	`toChars()` Returns all of this string encoded as a UTF-16 char[].
`java.lang.String`	`toDebugString()` Returns a verbosely descriptive java.lang.String of this string.
`java.lang.String`	`toString()` Returns this string converted into a `java.lang.String`.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
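As a rough illustration of what the three count methods in the summary above measure, here is how the same numbers can be obtained from a plain java.lang.String. This is not this class's implementation; `Counts` is a hypothetical name, and the UTF-8 count assumes well-formed UTF-16 input (no unpaired surrogates).

```java
/** Hypothetical helper: the three unit counts computed from a java.lang.String. */
final class Counts {

    /** UTF-16 code units: simply the number of chars. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Code points, which is also the number of UTF-32 code units. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** UTF-8 code units, counted in one O(n) pass without allocating a byte[]. */
    static int utf8Units(String s) {
        int units = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            units += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            i += Character.charCount(cp);      // advance 1 or 2 chars
        }
        return units;
    }
}
```

Note that `String.codePointCount` itself walks the string; a constant-time count needs an index computed up front.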

- Constructor Detail
  - string
```
public string()
```
    Create an empty string.
  - string
```
public string(java.lang.Object text)
```
    Initialize a new string using text and default locale.
    
    See Also:
    
    setText(Object)
  - string
```
public string(java.lang.Object text,
              java.util.Locale locale)
```
    Initialize a new string using text and locale.
    
    See Also:
    
    setText(Object,Locale)
- Method Detail
  - countUTF8Units
```
public int countUTF8Units()
```
    Count the number of Unicode UTF-8 code units required to UTF-8 encode this string.
    This implementation operates in O(n) time.
  - countUTF16Units
```
public int countUTF16Units()
```
    Count the number of Unicode UTF-16 code units required to UTF-16 encode this string.
    This implementation operates in O(1) time.
  - countCodePoints
```
public int countCodePoints()
```
    Returns the number of unicode code points in this string.
    This is the same as the number of Unicode UTF-32 code units required to UTF-32 encode this string.
    This implementation operates in O(1) time.
    
    Returns:
    
    The code point count.
  - countGraphemes
```
public int countGraphemes(int flags)
```
    Returns the number of graphemes in this string; Not implemented yet; throws UnsupportedOperationException.
    This implementation operates in O(?) time.
    
    Parameters:
    
    flags - Grapheme types to include in the count.
    
    Returns:
    
    The grapheme count.
  - hasStartingBOM
```
public boolean hasStartingBOM()
```
    Returns true if this string starts with the Unicode "Byte Order Mark" character; Not implemented yet; throws UnsupportedOperationException.
    This implementation operates in O(1) time.
  - hasBOM
```
public boolean hasBOM()
```
    Returns true if this string contains one or more Unicode "Byte Order Mark" characters; Not implemented yet; throws UnsupportedOperationException.
    
    See Also:
    
    hasStartingBOM()
  - setText
```
public void setText(java.lang.Object text)
```
    Resets this grapheme string with text using default locale.
  - setText
```
public void setText(java.lang.Object text,
                    java.util.Locale locale)
```
    Resets this grapheme instance with text and specified Locale.
    Currently text may be a java.lang.String or char[].
  - getCodePointUTF16Indices
```
public int[] getCodePointUTF16Indices()
```
    Returns the grapheme boundary cpcindices into the utf-16 char[] used internally; this array of indices is used internally.
    Boundaries are calculated with BreakIterator.getCharacterInstance(Locale). There are length() + 1 cpcindices, where each index is the location in the utf-16 char[] of a grapheme boundary.
  - codePointToInt
```
public int codePointToInt(int cpindex)
```
    Returns the code point at cpindex.
  - codePointsToInts
```
public int[] codePointsToInts(int cpistart,
                              int cpiend)
```
    Returns the code points in the range (cpistart, cpiend). Not implemented yet; throws UnsupportedOperationException. Should be easy to add to the current implementation.
  - codePointsToInts
```
public int[] codePointsToInts()
```
    Returns this string as code points in an int[].
  - codePointToChars
```
public char[] codePointToChars(int cpindex)
```
    Returns a UTF-16 encoded char[] of the code point at cpindex.
    Code points are indexed from zero to countCodePoints() - 1.
    
    Parameters:
    
    cpindex - The code point index of the code point to return.
    
    Returns:
    
    A char[] of the UTF-16 encoded code point at cpindex.
    
    Throws:
    
    java.lang.IndexOutOfBoundsException - if cpindex is negative or greater than countCodePoints() - 1.
  - codePointsToChars
```
public char[] codePointsToChars(int cpistart,
                                int cpiend)
```
    Returns a UTF-16 encoded char[] range (reversible) of code points.
    Code points are indexed from zero to countCodePoints() - 1.
    
    Parameters:
    
    cpistart - The code point index inclusive of the first code point to return. Valid range is (0, this.countCodePoints() - 1).
    
    cpiend - The code point index exclusive of the last code point to return. Valid range is (-1, this.countCodePoints()).
    
    Returns:
    
    A char[] of the UTF-16 encoded code points in the range requested. If cpiend < cpistart then the code point sequence is reversed. If cpistart == cpiend then an empty char[] is returned.
    
    Throws:
    
    java.lang.IndexOutOfBoundsException - if cpistart < 0 or cpistart >= length() or cpiend < -1 or cpiend > length().
  - codePointsToBytes
```
public byte[] codePointsToBytes(int cpindex)
```
    Returns a byte[] containing the chosen code point encoded as UTF-8. Not implemented yet; throws UnsupportedOperationException.
    Code points are indexed from zero to countCodePoints() - 1.
    
    Parameters:
    
    cpindex - The code point index of the code point to return.
  - codePointsToBytes
```
public byte[] codePointsToBytes(int cpistart,
                                int cpiend)
```
    Returns a byte[] containing the chosen code points encoded as UTF-8. Not implemented yet; throws UnsupportedOperationException.
    Code points are indexed from zero to countCodePoints() - 1.
    
    Parameters:
    
    cpistart - The first code point index inclusive of the range of code points to return.
    
    cpiend - The last code point index exclusive of the range of code points to return.
  - graphemesToChars
```
public char[] graphemesToChars(int flags,
                               int gindex)
```
    Returns a char[] containing the chosen grapheme encoded as UTF-16. Not implemented yet; throws UnsupportedOperationException.
    Graphemes are indexed from zero to countGraphemes(flags) - 1.
    
    Parameters:
    
    gindex - The grapheme index of the grapheme to return.
  - graphemesToChars
```
public char[] graphemesToChars(int flags,
                               int gistart,
                               int giend)
```
    Not implemented yet; throws UnsupportedOperationException.
  - graphemesToBytes
```
public byte[] graphemesToBytes(int flags,
                               int gindex)
```
    Not implemented yet; throws UnsupportedOperationException.
  - graphemesToBytes
```
public byte[] graphemesToBytes(int flags,
                               int gistart,
                               int giend)
```
    Not implemented yet; throws UnsupportedOperationException.
  - toBytes
```
public byte[] toBytes()
```
    Returns all of this string encoded as a UTF-8 byte[].
  - toChars
```
public char[] toChars()
```
    Returns all of this string encoded as a UTF-16 char[].
    
    Returns:
    
    The char[] used internally, stored when setText was called.
  - toString
```
public java.lang.String toString()
```
    Returns this string converted into a java.lang.String. For dignity this method ought be named toJavaString, but Object.toString is a little engrained in the Java world... Needs correctness tuning as this class is enhanced.
    
    Overrides:
    
    toString in class java.lang.Object
  - appendTo
```
public void appendTo(java.lang.Object bb)
```
    Appends all of this string to a ByteBuilder (such class does not yet exist). Not implemented yet; throws UnsupportedOperationException.
  - insertInto
```
public void insertInto(int sbdest,
                       java.lang.Object bb)
```
    Inserts all of this string into a ByteBuilder (such class to be implemented). Not implemented yet; throws UnsupportedOperationException.
  - appendTo
```
public void appendTo(java.lang.StringBuilder sb)
```
    Appends all of this string to a StringBuilder.
  - insertInto
```
public void insertInto(int sbdest,
                       java.lang.StringBuilder sb)
```
    Inserts all of this string into sb.
  - appendGraphemesTo
```
public void appendGraphemesTo(int flags,
                              java.lang.StringBuilder sb)
```
    Appends all graphemes to sb. Needs correctness tuning; currently ignores flags.
    In general, minimize the number of overloaded or similar methods.
    
    Parameters:
    
    flags - Working with graphemes needs flags, such as whether to include various space types (non-breaking, zero-width, etc), tabs, various newlines, sentence and paragraph ending characters and more.
  - insertGraphemesInto
```
public void insertGraphemesInto(int flags,
                                int sbdest,
                                java.lang.StringBuilder sb)
```
    Inserts all graphemes into sb. Needs correctness tuning; currently ignores flags.
    
    Parameters:
    
    flags - Working with graphemes needs flags, such as whether to include various space types (non-breaking, zero-width, etc), tabs, various newlines, sentence and paragraph ending characters and more.
  - appendGraphemesTo
```
public void appendGraphemesTo(int flags,
                              java.lang.StringBuilder sb,
                              int gistart,
                              int giend)
```
    Not implemented yet; throws UnsupportedOperationException.
  - insertGraphemesInto
```
public void insertGraphemesInto(int flags,
                                int sbdest,
                                java.lang.StringBuilder sb,
                                int gistart,
                                int giend)
```
    Not implemented yet; throws UnsupportedOperationException.
  - appendCodePointsTo
```
public void appendCodePointsTo(java.lang.StringBuilder sb,
                               int cpistart,
                               int cpiend)
```
    Appends the code points in the range from cpistart (inclusive) to cpiend (exclusive) to sb; cpiend may be less than cpistart, in which case the code points are appended in reverse order.
    
    Parameters:
    
    cpistart - The first code point (index inclusive) to append to sb. Valid range is from 0 to (this.countCodePoints() - 1).
    
    cpiend - The last code point (index, exclusive) to append to sb. Valid range is from -1 to this.countCodePoints(). If cpiend is less than cpistart, code points are appended in reverse order.
    
    Throws:
    
    java.lang.IndexOutOfBoundsException - if cpistart or cpiend is out of valid range.
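    The reversible range can be sketched as follows. This is an illustration over a plain java.lang.String, not this class's implementation; CodePointRangeSketch and appendCodePoints are hypothetical names, and the reading of the reversed case (start at cpistart and walk down to, but not including, cpiend) is an assumption based on the parameter descriptions above.

```java
// Hypothetical helper for illustration only; not part of zen.lang.string.
final class CodePointRangeSketch {

    /**
     * Appends code points from cpistart (inclusive) to cpiend (exclusive) to sb.
     * If cpiend is less than cpistart, the code points are appended in reverse order.
     */
    static void appendCodePoints(String text, StringBuilder sb, int cpistart, int cpiend) {
        int[] cps = text.codePoints().toArray(); // one int per code point (Java 8+)
        if (cpistart < 0 || cpistart >= cps.length || cpiend < -1 || cpiend > cps.length) {
            throw new IndexOutOfBoundsException(
                "cpistart=" + cpistart + ", cpiend=" + cpiend + ", count=" + cps.length);
        }
        if (cpiend >= cpistart) {
            // Forward order: cpistart, cpistart + 1, ..., cpiend - 1.
            for (int i = cpistart; i < cpiend; i++) {
                sb.appendCodePoint(cps[i]);
            }
        } else {
            // Reverse order: cpistart, cpistart - 1, ..., cpiend + 1.
            for (int i = cpistart; i > cpiend; i--) {
                sb.appendCodePoint(cps[i]);
            }
        }
    }
}
```

    Under that assumption, a string of three code points is copied forward with (0, 3) and fully reversed with (2, -1). Because whole code points are appended, a surrogate pair stays intact when the order is reversed, which a naive reversal of the UTF-16 chars would not guarantee.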
  - insertCodePointsInto
```
public void insertCodePointsInto(int sbdest,
                                 java.lang.StringBuilder sb,
                                 int cpistart,
                                 int cpiend)
```
    Inserts a range (reversible) of code points into a StringBuilder.
    Code points are indexed from zero to countCodePoints() - 1.
    If cpiend < cpistart then the code point sequence is reversed.
    This method might be one or two too many. It is perhaps better to simplify (reduce) the API rather than eke out only a little extra performance, and only for the measly StringBuilder class. What we need to start chugging with is ByteBuffer!
    
    Parameters:
    
    sbdest - Destination in sb in which to insert the chosen code points (as UTF-16 chars).
    
    cpistart - The index (inclusive) of the first code point to insert into sb. Valid range is from 0 to this.countCodePoints() - 1.
    
    cpiend - The index (exclusive) of the last code point to insert into sb. Valid range is from -1 to this.countCodePoints().
    
    Throws:
    
    java.lang.IndexOutOfBoundsException - if cpistart < 0 or cpistart >= countCodePoints() or cpiend < -1 or cpiend > countCodePoints() or sbdest is not a valid location in sb.
  - toDebugString
```
public java.lang.String toDebugString()
```
    Returns a verbosely descriptive java.lang.String of this string.
  - main
```
public static void main(java.lang.String[] args)
```
    Runs the tests for this class.
