public class string
extends java.lang.Object
A string in its simplest form is just a "cookie".
0. Contents
1. Introduction
2. Understanding characters
3. Some Java "characters"
4. More about Java's limitations and this class
5. Further Reading
6. UTF-8 Club Rules
TLDR:
Use UTF-8 strings (byte[]) wherever possible, especially between different code bases and when communicating with filesystems, and always as the preferred encoding for text files. Index the UTF-8 string (byte[]) to speed up locating code points and graphemes within the string (Java doesn't do this, so we have work to do).
Refer to http://utf8everywhere.org/
These two features alone (along with Unix and Linux compatibility) make UTF-8 the most desirable text encoding to use in almost all cases!
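The indexing idea from the TLDR can be illustrated with a minimal sketch. It assumes well-formed UTF-8 input and relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx; the class and method names (Utf8Index, codePointOffsets) are hypothetical and are not part of this class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Index {
    /**
     * Returns the byte offset at which each code point starts.
     * Assumes the input is well-formed UTF-8; no validation is done.
     */
    static int[] codePointOffsets(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++; // every byte that is not a continuation byte starts a code point
            }
        }
        int[] offsets = new int[count];
        int next = 0;
        for (int i = 0; i < utf8.length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                offsets[next++] = i;
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // One code point each of 1, 2, 3 and 4 UTF-8 bytes.
        byte[] utf8 = "a\u00E9\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(codePointOffsets(utf8))); // prints [0, 1, 3, 6]
    }
}
```

With such an index, locating the n-th code point is an array lookup rather than a scan from the start of the byte[].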
1. Introduction
A deficient string is any string not amenable to counting its characters, i.e. "characters as the user perceives them" or "graphemes".
java.lang.String is a deficient string class, no more than a lowly cookie, incapable of counting the characters it stores for display, at least when it comes to Unicode text.
This class is the beginnings of a proper Java string class, which might be thought of as a UTF-8 encoded "grapheme string" - Java as of version 8.0 only has APIs to work with Unicode code points (codepoints) and not "visual or displayed characters" such as those with multiple accents (Unicode "combining characters"). Surprised? I was, thus this class.
The term "UTF-8 grapheme string" is a misnomer since UTF-8 defines a byte sequence encoding of Unicode/UCS code points which include graphemes as well as non-graphemes.
One or more (thankfully only adjacent) Unicode code points may constitute a grapheme.
Cover your ears, I'm about to shout. This string class simplifies working with THE THING WE OFTEN NEED OUTSIDE OF THE STRING ITSELF, that is graphemes. Graphemes are those things a user perceives visually as a character or "single letter" for the purposes of cursor movement, deletion, insertion, selection, cutting and pasting.
This class's current limitations are due to relying upon java.text.BreakIterator and its getCharacterInstance(Locale) method, which only provides Unicode code point breaks and not "visual" character (grapheme) breaks.
Java as of version 8.0 does not provide Character methods such as isCombiningCharacter(codepoint) nor isUnicodeControlCharacter(codepoint); TODO - try using icu4j instead
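Although methods with those names do not exist, Character.getType(int) does expose the Unicode general category of a code point, which may be enough to approximate them. The sketch below is an assumption about what such helpers could look like: the names isCombiningMark and isControl are hypothetical, and treating the categories Mn, Mc and Me as "combining" is a simplification rather than a full grapheme rule.

```java
public class CombiningMarks {
    /** True if the code point is in general category Mn, Mc or Me. */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** True if the code point is in general category Cc (control). */
    static boolean isControl(int codePoint) {
        return Character.getType(codePoint) == Character.CONTROL;
    }

    public static void main(String[] args) {
        System.out.println(isCombiningMark(0x0301)); // COMBINING ACUTE ACCENT, prints true
        System.out.println(isCombiningMark('A'));    // prints false
        System.out.println(isControl(0x0009));       // CHARACTER TABULATION, prints true
    }
}
```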
For comparison to Java, see Swift's Strings and Characters, which is much younger, better and certainly has the benefit of hindsight. Perl 6 also seems to be getting closer to competency; see Graphemes, code points, characters and bytes.
2. Understanding characters
char, byte[] and int store arbitrary values. In a program such values may for example have context specific meaning such as a mapping to ASCII characters, Unicode code points or to Unicode UTF-16 surrogate halves ("paired 16 bit code units" halves, see below) or even to graphemes, grapheme clusters, glyphs or any sequence of such or other things.
<BACKSPACE> or <DELETE> key.
LATIN CAPITAL LETTER A (U+0041, "A") followed by COMBINING ACUTE ACCENT (U+0301, " ́" - this may not display as one might expect).
LATIN CAPITAL LETTER A WITH ACUTE (U+00C1, "Á").
java.text.Normalizer.
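The two spellings listed above (U+0041 followed by U+0301, and the single code point U+00C1) can be converted into one another with java.text.Normalizer. The following is a minimal sketch; the class name NormalizeDemo and its helper methods are only for illustration.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    /** Composes combining sequences where a precomposed code point exists (NFC). */
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Decomposes precomposed code points into base plus combining marks (NFD). */
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0301";  // two code points
        String precomposed = "\u00C1";  // one code point

        System.out.println(decomposed.equals(precomposed));        // prints false
        System.out.println(toNfc(decomposed).equals(precomposed)); // prints true
        System.out.println(toNfd(precomposed).equals(decomposed)); // prints true
    }
}
```

Both forms should display as the same grapheme, yet a plain equals() comparison treats them as different strings unless they are normalized first.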
char is not a grapheme, but some graphemes can be represented in a char instance or "variable".
char is not even a code point, although:
chars, neither am I - they're not very impressive at all. byte - now there's a powerful type!
In java.lang.String#codePointAt(index) the index points at a char value which the javadoc names as a "Unicode code unit". Beware as this is neither a code point, character, grapheme nor glyph - nothing but a lowly, miserable little char! "Code unit" as used here in Java is just a fancy name for "I'm so damn cool I can spel UTF-16BE BMP code point in a char and UTF-16LE non-BMP code point surrogate half "code unit" in a char take that! Haha try and figure out which." Resist the temptation to figure out the riddle. Don't be fooled friends - it's useless for anything much at all except reminding us what a bad design decision Java's String class was (in hindsight).
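The difference between a char index and a code point index can be seen with a short sketch using only java.lang.String methods. The helper name codePointAtCodePointIndex is hypothetical.

```java
public class CodePointAtDemo {
    /** Looks up a code point by code point index rather than by char index. */
    static int codePointAtCodePointIndex(String s, int cpindex) {
        int charIndex = s.offsetByCodePoints(0, cpindex); // walks the string, O(n)
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (stored as a surrogate pair), then 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(Integer.toHexString(s.codePointAt(1))); // prints 1f600
        // Index 2 lands in the middle of the pair: the lone low surrogate comes back.
        System.out.println(Integer.toHexString(s.codePointAt(2))); // prints de00
        // The third code point is 'b', which sits at char index 3.
        System.out.println(Integer.toHexString(codePointAtCodePointIndex(s, 2))); // prints 62
    }
}
```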
java.text.BreakIterator#getCharacterInstance(Locale) returns break indices for (Unicode) code point boundaries and not for graphemes, characters, glyphs, chars or anything useful really. This iterator, better than nothing I guess, is merely a code point surrogate-pair extractor for Java's UTF-16 strings, slapping all comers in the face with its irreverent disregard for invisibles, controls, specials, even combining characters and in fact anything remotely resembling graphemes, let alone those hallowed unreachable glyphs! The use of the term "character" in this method's name is both misleading and wildly overloaded and horrendously overrated, especially in a Java context!
char[]. There is nothing in principle stopping a UTF-8 byte[] implementation (which is evidently indicated in the hindsight of history) and this ought to be nothing more than a piddly implementation detail, notwithstanding any legacy yet-to-be-converted APIs and their snivelling performance issues - which would likely act as a rather practical and motivating todo list. Righto - let's get cracking then shall we? In the meantime, without handling graphemes, java.lang.String defeats its own purpose surprisingly well.
xterm terminal is a textual display environment in which Java provides no API for determining glyph widths and we can't even determine grapheme boundaries yet - Java evidently has a way to go on the basic text processing front.
3. Some Java "characters"
Some of the "character" concepts a Java programmer may have to handle with JNI or otherwise:
char type or "native character" which as we now know can only store a BMP code point or UTF-16 code point surrogate half.
char, byte, int, wchar_t, char16_t, char32_t, wint_t and more still.
int.
char[] of length 1 or 2.
char boundaries as returned by BreakIterator.getCharacterInstance(Locale)#next().
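Two of the forms in the list above, a code point held in an int and the same code point held in a char[] of length 1 or 2, can be converted with java.lang.Character. A minimal sketch follows; the class name CodePointForms and the helper toUtf16 are only for illustration.

```java
public class CodePointForms {
    /** Encodes one code point as UTF-16: one char for the BMP, two otherwise. */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] bmp = toUtf16(0x0041);     // LATIN CAPITAL LETTER A
        char[] astral = toUtf16(0x1F600); // outside the BMP

        System.out.println(bmp.length);                            // prints 1
        System.out.println(astral.length);                         // prints 2
        System.out.println(Character.isHighSurrogate(astral[0]));  // prints true
        System.out.println(Character.isLowSurrogate(astral[1]));   // prints true
        // Going back from the surrogate halves to the int form:
        System.out.println(Character.toCodePoint(astral[0], astral[1]) == 0x1F600); // prints true
    }
}
```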
4. More about Java's limitations and this class
This Java class is an "alternative string" rough draft. There's plenty it does not (yet) do which needs to be done, although the first step of understanding the problem has hopefully been contributed to with the documentation here.
Certainly this class ought to be built on something like java.nio.ByteBuffer or a byte[] (which it is not currently).
Information about the "current" width of "ordinarily wide" and "possibly wide" and even "in the current font, unexpectedly wide" graphemes, determination of grapheme boundaries and more is yet to be implemented.
For starters some API is needed for textual display environments to access the width of each glyph in the current environment, notification of environment changes (such as a new font being applied to the environment) so that glyph width or other calculations can be redone if needed, and to what extent the current environment supports combining, potentially wide and other interesting characters.
Glyph widths might for example in a text environment be in some suitable environment units, with 'normal' characters being width 1 and 'wide' characters being width 2. This is presently an exercise for the excessively curious and diligent.
Note: There's no point being upset with Java's UTF-16 String class and char primitive, since at the time these were chosen for Java, it was erroneously believed by those making that choice that 16 bits would be enough to encode all the world's characters and therefore "should be enough for anyone, even a programming language."
BUT, there IS a point in complaining that between the releases of Java 1.0 and Java 1.1, Java should have adopted a serious String enhancement plan of action in response to the fundamental change in Unicode, with which Java has been fundamentally incompatible ever since (practically speaking, that is, for regular day-to-day programming or anything short of herculean efforts on the part of the programmer). Swift gets it right whilst Java is 20 years late to the party.
WARNING: Java's current usage of bastardized UTF-8 ("modified UTF-8") in some cases such as JNI ought to be eliminated, removed, stopped, stamped out and completely extricated from a future version of Java. Bring on the deprecations.
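How "modified UTF-8" differs from standard UTF-8 can be observed without JNI, since java.io.DataOutputStream.writeUTF is documented to write the same variant. The sketch below is illustrative only; the class name ModifiedUtf8Demo and its helpers are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    /** Encodes s as modified UTF-8 via writeUTF, dropping the 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\u0000";          // U+0000
        String astral = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        System.out.println(standardUtf8(nul).length);    // prints 1
        System.out.println(modifiedUtf8(nul).length);    // prints 2 (bytes C0 80)
        System.out.println(standardUtf8(astral).length); // prints 4
        System.out.println(modifiedUtf8(astral).length); // prints 6 (each surrogate as 3 bytes)
    }
}
```

Neither of the modified forms is valid standard UTF-8, which is why byte sequences produced this way should not be handed to code that expects real UTF-8.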
Note: It is probably in the interests of the Java language to have a class possibly similar to this one as a "new string" type perhaps named java.lang.string (note the lower case s in string). Such strings should be stored in UTF-8 byte arrays as canonical storage format and may require some language (read, compiler) support for auto casting or more.
Or it may be that the best approach is to simply replace Java's String class (and any and all dependent library APIs) so that it is built on a "sane" UTF-8 encoded byte[] and deprecate all API (in all the Java libraries) that violate this assumption for "performance" or any other grounds.
It is now self evident that Java's char type is actually not relevant and not particularly useful outside of Java String's deformed birth due to the false assumptions made when a Java class named String was impregnated with chars.
In hindsight if char is for programming convenience, what might be more useful is a grapheme type, say a UTF-8 byte[] or a code point int[], perhaps with language support to cast to and from a "new string". Until this is tidied up in Java we unfortunately have a bit of a mess on the string front and nothing even approaching a sexy Grapheme String; I can't help wondering how such a verbose name as java.lang.GraphemeString might be shortened. This class started out named "Graphemes" until it became obvious it's just a "sane string".
There are plenty of API thoughts to think, and some questions are in the code comments below in this class. Definitive answers would be nice. Swift appears to get it basically right.
In case it was unclear note also that string literals as they appear in source files are in the charset/encoding of the source file itself, whatever that happens to be (of course we do insist on UTF-8).
5. Further Reading
For further information see the following:
6. UTF-8 Club Rules
Rule 1. UTF-8 is the only charset encoding.
Rule 2. See rule 1.
See Also: java.nio.charset.StandardCharsets, java.text.BreakIterator, java.text.Normalizer, java.util.Locale
Constructor and Description |
---|
string() Create an empty string. |
string(java.lang.Object text) Initialize a new string using text and default locale. |
string(java.lang.Object text, java.util.Locale locale) Initialize a new string using text and locale. |
Modifier and Type | Method and Description |
---|---|
void | appendCodePointsTo(java.lang.StringBuilder sb, int cpistart, int cpiend) Append code points in the range (cpistart, cpiend) into sb; cpiend may be less than cpistart in which case the code points are appended in reverse order. |
void | appendGraphemesTo(int flags, java.lang.StringBuilder sb) Append all graphemes to sb. |
void | appendGraphemesTo(int flags, java.lang.StringBuilder sb, int gistart, int giend) Not implemented yet; throws UnsupportedOperationException. |
void | appendTo(java.lang.Object bb) Appends all of this string to a ByteBuilder (such class does not yet exist). |
void | appendTo(java.lang.StringBuilder sb) Appends all of this string to a StringBuilder. |
byte[] | codePointsToBytes(int cpindex) Returns a byte[] containing the chosen code point encoded as UTF-8. |
byte[] | codePointsToBytes(int cpistart, int cpiend) Returns a byte[] containing the chosen code points encoded as UTF-8. |
char[] | codePointsToChars(int cpistart, int cpiend) Returns a UTF-16 encoded char[] range (reversible) of code points. |
int[] | codePointsToInts() Returns this string as code points in an int[]. |
int[] | codePointsToInts(int cpistart, int cpiend) Returns the code points in the range (cpistart, cpiend). |
char[] | codePointToChars(int cpindex) Returns a UTF-16 encoded char[] of the code point at cpindex. |
int | codePointToInt(int cpindex) Returns the code point at cpindex. |
int | countCodePoints() Returns the number of Unicode code points in this string. |
int | countGraphemes(int flags) Returns the number of graphemes in this string; not implemented yet; throws UnsupportedOperationException. |
int | countUTF16Units() Count the number of Unicode UTF-16 code units required to UTF-16 encode this string. |
int | countUTF8Units() Count the number of Unicode UTF-8 code units required to UTF-8 encode this string. |
int[] | getCodePointUTF16Indices() Returns the grapheme boundary cpcindices into the UTF-16 char[] used internally; this array of indices is used internally. |
byte[] | graphemesToBytes(int flags, int gindex) Not implemented yet; throws UnsupportedOperationException. |
byte[] | graphemesToBytes(int flags, int gistart, int giend) Not implemented yet; throws UnsupportedOperationException. |
char[] | graphemesToChars(int flags, int gindex) Returns a char[] containing the chosen grapheme encoded as UTF-16. |
char[] | graphemesToChars(int flags, int gistart, int giend) Not implemented yet; throws UnsupportedOperationException. |
boolean | hasBOM() Returns true if this string contains one or more Unicode "Byte Order Mark" characters; not implemented yet; throws UnsupportedOperationException. |
boolean | hasStartingBOM() Returns true if this string starts with the Unicode "Byte Order Mark" character; not implemented yet; throws UnsupportedOperationException. |
void | insertCodePointsInto(int sbdest, java.lang.StringBuilder sb, int cpistart, int cpiend) Inserts a range (reversible) of code points into a StringBuilder. |
void | insertGraphemesInto(int flags, int sbdest, java.lang.StringBuilder sb) Insert graphemes into sb. |
void | insertGraphemesInto(int flags, int sbdest, java.lang.StringBuilder sb, int gistart, int giend) Not implemented yet; throws UnsupportedOperationException. |
void | insertInto(int sbdest, java.lang.Object bb) Inserts all of this string into a ByteBuilder (such class to be implemented). |
void | insertInto(int sbdest, java.lang.StringBuilder sb) Inserts all of this string into sb. |
static void | main(java.lang.String[] args) tests |
void | setText(java.lang.Object text) Resets this grapheme string with text using default locale. |
void | setText(java.lang.Object text, java.util.Locale locale) Resets this grapheme instance with text and specified Locale. |
byte[] | toBytes() Returns all of this string encoded as a UTF-8 byte[]. |
char[] | toChars() Returns all of this string encoded as a UTF-16 char[]. |
java.lang.String | toDebugString() Returns a verbosely descriptive java.lang.String of this string. |
java.lang.String | toString() Returns this string converted into a java.lang.String. |
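public string()
Create an empty string.
public string(java.lang.Object text)
Initialize a new string using text and default locale.
See Also: setText(Object)
public string(java.lang.Object text, java.util.Locale locale)
Initialize a new string using text and locale.
See Also: setText(Object,Locale)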
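public int countUTF8Units()
Count the number of Unicode UTF-8 code units required to UTF-8 encode this string.
This implementation operates in O(n) time.
public int countUTF16Units()
Count the number of Unicode UTF-16 code units required to UTF-16 encode this string.
This implementation operates in O(1) time.
public int countCodePoints()
Returns the number of Unicode code points in this string.
This is the same as the number of Unicode UTF-32 code units required to UTF-32 encode this string.
This implementation operates in O(1) time.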
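For comparison, the same three counts can be obtained from a plain java.lang.String. This is a minimal sketch and not this class's implementation; the class name UnitCounts and its methods are hypothetical, and on java.lang.String the code point count requires a scan rather than the O(1) lookup described above.

```java
import java.nio.charset.StandardCharsets;

public class UnitCounts {
    /** Number of bytes needed to encode s as UTF-8. */
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of chars (UTF-16 code units) in s. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // One code point each of 1, 2, 3 and 4 UTF-8 bytes.
        String s = "a\u00E9\u20AC\uD83D\uDE00";
        System.out.println(utf8Units(s));  // prints 10
        System.out.println(utf16Units(s)); // prints 5
        System.out.println(codePoints(s)); // prints 4
    }
}
```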
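public int countGraphemes(int flags)
Returns the number of graphemes in this string; not implemented yet; throws UnsupportedOperationException.
This implementation operates in O(?) time.
flags - Grapheme types to include in the count.
public boolean hasStartingBOM()
Returns true if this string starts with the Unicode "Byte Order Mark" character; not implemented yet; throws UnsupportedOperationException.
This implementation operates in O(1) time.
public boolean hasBOM()
Returns true if this string contains one or more Unicode "Byte Order Mark" characters; not implemented yet; throws UnsupportedOperationException.
See Also: hasStartingBOM()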
public void setText(java.lang.Object text)
Resets this grapheme string with text using default locale.
public void setText(java.lang.Object text, java.util.Locale locale)
Resets this grapheme instance with text and specified Locale.
Currently text may be a java.lang.String or char[].
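public int[] getCodePointUTF16Indices()
Returns the grapheme boundary cpcindices into the UTF-16 char[] used internally; this array of indices is used internally.
Boundaries are calculated with BreakIterator.getCharacterInstance(Locale). There are length() + 1 cpcindices, where each index is the location in the UTF-16 char[] of a grapheme boundary.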
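Collecting such boundary indices with BreakIterator might look like the sketch below. It is an illustration under assumptions, not this class's source: the class name BoundaryIndices and the method boundaries are hypothetical, and only inputs whose boundaries are unambiguous are shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class BoundaryIndices {
    /** Returns every boundary index reported by the character BreakIterator. */
    static int[] boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);

        List<Integer> found = new ArrayList<>();
        for (int i = it.first(); i != BreakIterator.DONE; i = it.next()) {
            found.add(i); // index into the UTF-16 chars of the text
        }

        int[] result = new int[found.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = found.get(k);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three BMP characters give four boundaries (three elements plus one).
        System.out.println(Arrays.toString(boundaries("abc", Locale.ROOT))); // prints [0, 1, 2, 3]
        // A surrogate pair is kept together, so there is no boundary at index 2.
        System.out.println(Arrays.toString(boundaries("a\uD83D\uDE00", Locale.ROOT))); // prints [0, 1, 3]
    }
}
```

The result always has one more entry than the number of elements found, which matches the "+ 1" in the description above.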
public int codePointToInt(int cpindex)
Returns the code point at cpindex.
public int[] codePointsToInts(int cpistart, int cpiend)
Returns the code points in the range (cpistart, cpiend).
Not implemented yet; throws UnsupportedOperationException.
In current implementation, easy to implement.
public int[] codePointsToInts()
Returns this string as code points in an int[].
public char[] codePointToChars(int cpindex)
Returns a UTF-16 encoded char[] of the code point at cpindex.
Code points are indexed from zero to countCodePoints() - 1.
cpindex - The code point index of the code point to return.
java.lang.IndexOutOfBoundsException - if index is negative or greater than this.length() or end is less than -1 or greater than this.length().
public char[] codePointsToChars(int cpistart, int cpiend)
Returns a UTF-16 encoded char[] range (reversible) of code points.
Code points are indexed from zero to countCodePoints() - 1.
cpistart - The code point index inclusive of the first code point to return. Valid range is (0, this.countCodePoints() - 1).
cpiend - The code point index exclusive of the last code point to return. Valid range is (-1, this.countCodePoints()).
If cpiend < cpistart then the code point sequence is reversed. If cpistart == cpiend then an empty char[] is returned.
java.lang.IndexOutOfBoundsException - if cpistart < 0 or cpistart >= length() or cpiend < -1 or cpiend > length().
public byte[] codePointsToBytes(int cpindex)
Returns a byte[] containing the chosen code point encoded as UTF-8.
Code points are indexed from zero to countCodePoints() - 1.
cpindex - The code point index of the code point to return.
public byte[] codePointsToBytes(int cpistart, int cpiend)
Returns a byte[] containing the chosen code points encoded as UTF-8.
Code points are indexed from zero to countCodePoints() - 1.
cpistart - The first code point index inclusive of the range of code points to return.
cpiend - The last code point index exclusive of the range of code points to return.
public char[] graphemesToChars(int flags, int gindex)
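Returns a char[] containing the chosen grapheme encoded as UTF-16.
Graphemes are indexed from zero to countGraphemes(flags) - 1.
gindex - The grapheme index of the grapheme to return.
public char[] graphemesToChars(int flags, int gistart, int giend)
Not implemented yet; throws UnsupportedOperationException.
public byte[] graphemesToBytes(int flags, int gindex)
Not implemented yet; throws UnsupportedOperationException.
public byte[] graphemesToBytes(int flags, int gistart, int giend)
Not implemented yet; throws UnsupportedOperationException.
public byte[] toBytes()
Returns all of this string encoded as a UTF-8 byte[].
public char[] toChars()
Returns all of this string encoded as a UTF-16 char[].
public java.lang.String toString()
Returns this string converted into a java.lang.String.
For dignity this method ought to be named toJavaString, but Object.toString is a little ingrained in the Java world...
Needs correctness tuning as this class is enhanced.
Overrides: toString in class java.lang.Object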
public void appendTo(java.lang.Object bb)
Appends all of this string to a ByteBuilder (such class does not yet exist).
public void insertInto(int sbdest, java.lang.Object bb)
Inserts all of this string into a ByteBuilder (such class to be implemented).
public void appendTo(java.lang.StringBuilder sb)
Appends all of this string to a StringBuilder.
public void insertInto(int sbdest, java.lang.StringBuilder sb)
Inserts all of this string into sb.
public void appendGraphemesTo(int flags, java.lang.StringBuilder sb)
Append all graphemes to sb.
Needs correctness tuning, currently ignores flags.
In general, minimize the number of overloaded/similar methods.
flags - Working with graphemes needs flags, such as whether to include various space types (non-breaking, zero-width, etc), tabs, various newlines, sentence and paragraph ending characters and more.
public void insertGraphemesInto(int flags, int sbdest, java.lang.StringBuilder sb)
Insert graphemes into sb.
flags - Working with graphemes needs flags, such as whether to include various space types (non-breaking, zero-width, etc), tabs, various newlines, sentence and paragraph ending characters and more.
public void appendGraphemesTo(int flags, java.lang.StringBuilder sb, int gistart, int giend)
Not implemented yet; throws UnsupportedOperationException.
public void insertGraphemesInto(int flags, int sbdest, java.lang.StringBuilder sb, int gistart, int giend)
Not implemented yet; throws UnsupportedOperationException.
public void appendCodePointsTo(java.lang.StringBuilder sb, int cpistart, int cpiend)
Append code points in the range (cpistart, cpiend) into sb; cpiend may be less than cpistart in which case the code points are appended in reverse order.
cpistart - The first code point (index inclusive) to append to sb. Valid range is from 0 to (this.countCodePoints() - 1).
cpiend - The last code point (index, exclusive) to append to sb. Valid range is from -1 to this.countCodePoints(). If cpiend is less than cpistart, code points are appended in reverse order.
java.lang.IndexOutOfBoundsException - if cpistart or cpiend are out of valid range.
public void insertCodePointsInto(int sbdest, java.lang.StringBuilder sb, int cpistart, int cpiend)
Inserts a range (reversible) of code points into a StringBuilder.
Code points are indexed from zero to countCodePoints() - 1.
If cpiend < cpistart then the code point sequence is reversed.
This method might be one or two too many - perhaps better to simplify (reduce) the API rather than eke out only a little extra performance and only for the measly StringBuilder class - what we need to start chugging with is ByteBuffer!
sbdest - Destination in sb in which to insert the chosen code points (as UTF-16 chars).
cpistart - The code point index inclusive of the first code point to insert into sb. Valid range is (0, this.countCodePoints() - 1).
cpiend - The code point index exclusive of the last code point to insert into sb. Valid range is (-1, this.countCodePoints()).
java.lang.IndexOutOfBoundsException - if cpistart < 0 or cpistart >= length() or cpiend < -1 or cpiend > length() or sbdest is not a valid location in sb.
public java.lang.String toDebugString()
Returns a verbosely descriptive java.lang.String of this string.
public static void main(java.lang.String[] args)
tests
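The "reversible" code point ranges documented for codePointsToChars, appendCodePointsTo and insertCodePointsInto can be sketched over a plain java.lang.String as follows. This is one reading of the documented semantics (cpistart inclusive, cpiend exclusive, cpiend < cpistart meaning reverse order), not the class's actual source; the class name ReversibleRange and the method range are hypothetical.

```java
public class ReversibleRange {
    /**
     * Returns the code points from cpistart (inclusive) towards cpiend (exclusive)
     * as UTF-16 chars; when cpiend is less than cpistart the order is reversed.
     */
    static char[] range(String s, int cpistart, int cpiend) {
        int[] cps = s.codePoints().toArray(); // Java 8+
        int n = cps.length;
        if (cpistart < 0 || cpistart >= n || cpiend < -1 || cpiend > n) {
            throw new IndexOutOfBoundsException(
                    "cpistart=" + cpistart + ", cpiend=" + cpiend + ", count=" + n);
        }
        int step = (cpiend < cpistart) ? -1 : 1;
        StringBuilder sb = new StringBuilder();
        for (int i = cpistart; i != cpiend; i += step) {
            sb.appendCodePoint(cps[i]); // keeps surrogate pairs in the right order
        }
        return sb.toString().toCharArray();
    }

    public static void main(String[] args) {
        System.out.println(new String(range("abcd", 0, 4)));  // prints abcd
        System.out.println(new String(range("abcd", 3, -1))); // prints dcba
        System.out.println(range("abcd", 2, 2).length);       // prints 0
    }
}
```

Reversing by code point rather than by char matters for text outside the BMP: reversing the chars of a surrogate pair would produce an invalid sequence.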