Detailed explanation of the implementation of Java character encoding and decoding

Author：Eve Cole Update Time：2024-11-24 09:48:01

Character set basics:

Character set
A collection of characters, that is, symbols with particular semantics. The letter "A" is a character. "%" is also a character. A character has no intrinsic numerical value and no direct connection to ASCII, Unicode, or even computers. Symbols existed long before computers.
Coded character set
A character set in which each character is assigned a numeric value. Assigning codes to characters lets them be represented numerically under a specific coded character set. Other coded character sets can assign different values to the same character. Coded character sets are usually defined by standards organizations; examples include US-ASCII, ISO 8859-1, Unicode (ISO 10646-1), and JIS X0201.
Character-encoding scheme
A mapping from the members of a coded character set to octets (8-bit bytes). An encoding scheme defines how a sequence of character codes is expressed as a sequence of bytes. The character code values need not equal the encoded bytes, and the relationship need not be one-to-one or one-to-many. In principle, character encoding and decoding are analogous to object serialization and deserialization.

Character data is usually encoded for network transmission or file storage. An encoding scheme is not a character set; it is a mapping. But because the two are closely related, most encodings are associated with a single character set. UTF-8, for example, is used only to encode the Unicode character set. Nonetheless, one encoding scheme can handle multiple character sets. For example, EUC can encode characters for several Asian languages.
Figure 6-1 is a graphical representation of the UTF-8 encoding scheme turning a Unicode character sequence into a byte sequence. UTF-8 encodes character code values less than 0x80 as single-byte values (standard ASCII). All other Unicode characters are encoded as multibyte sequences: 2 to 6 bytes under RFC 2279 (http://www.ietf.org/rfc/rfc2279.txt), and 2 to 4 bytes under its successor, RFC 3629.
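The variable-length behavior of UTF-8 can be observed directly from Java. The following is a minimal sketch, not part of the original examples; the class name Utf8LengthDemo and the helper utf8Length are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    /** Returns how many bytes UTF-8 uses for a single Unicode code point. */
    static int utf8Length(int codePoint) {
        // Character.toChars yields one char, or a surrogate pair above U+FFFF
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'A', n with tilde, euro sign, and a code point outside the BMP
        int[] samples = { 0x41, 0xF1, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

Code points below 0x80 come out as one byte, and the other samples take two, three, and four bytes respectively.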

Charset
The term charset is defined in RFC 2278 (http://ietf.org/rfc/rfc2278.txt). It is the combination of a coded character set and a character-encoding scheme. The central class of the java.nio.charset package is Charset, which encapsulates the charset abstraction.
Unicode began as a 16-bit character encoding, and a Java char is still a 16-bit value. It attempts to unify the character sets of all the world's languages into a single, comprehensive mapping. It has earned its place, but there are many other character encodings in widespread use today.
Most operating systems are still byte-oriented in terms of I/O and file storage, so no matter what encoding is used, Unicode or other encodings, there is still a need to convert between byte sequences and character set encodings.
The classes in the java.nio.charset package meet this need. This is not the first time that the Java platform has dealt with character set encoding, but it is the most systematic, comprehensive, and flexible solution. The java.nio.charset.spi package provides a Service Provider Interface (SPI) so that encoders and decoders can be plugged in as needed.

Default character set: The default is determined at JVM startup and depends on the underlying operating system environment, locale, and/or JVM configuration. If you need a specific character set, the safest bet is to name it explicitly. Don't assume that the default in deployment is the same as in your development environment. Character set names are not case-sensitive, that is, uppercase and lowercase letters are considered the same when comparing character set names. The Internet Assigned Numbers Authority (IANA) maintains the registry of official character set names.
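A short sketch of the advice above: look up the charset by name rather than relying on the platform default. The class ExplicitCharsetDemo and its helper methods are hypothetical names introduced only for this illustration; Charset.defaultCharset() assumes Java 5 or later.

```java
import java.nio.charset.Charset;

class ExplicitCharsetDemo {
    /** Number of bytes produced when text is encoded with the named charset. */
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    /** Canonical name for a charset looked up by any spelling of its name. */
    static String canonicalName(String charsetName) {
        return Charset.forName(charsetName).name();
    }

    public static void main(String[] args) {
        // Varies between machines, which is why it should not be relied upon
        System.out.println("Platform default: " + Charset.defaultCharset());
        // Lookup ignores case; name() reports the canonical spelling
        System.out.println(canonicalName("utf-8"));
        // The same character needs a different number of bytes per charset
        System.out.println(encodedLength("\u00f1", "UTF-8"));
        System.out.println(encodedLength("\u00f1", "iso-8859-1"));
    }
}
```

Passing the charset explicitly keeps the byte output identical on every machine, whatever the default happens to be.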

Example 6-1 demonstrates how to translate characters into byte sequences using different Charset implementations.

Example 6-1. Using standard character set encoding

package com.ronsoft.books.nio.charset;

import java.nio.charset.Charset;
import java.nio.ByteBuffer;

/**
 * Charset encoding test. Run the same input string, which contains some
 * non-ascii characters, through several Charset encoders and dump out the hex
 * values of the resulting byte sequences.
 *
 * @author Ron Hitchens
 */
public class EncodeTest {
    public static void main(String[] argv) throws Exception {
        // This is the character sequence to encode
        String input = "\u00bfMa\u00f1ana?";
        // the list of charsets to encode with
        String[] charsetNames = { "US-ASCII", "ISO-8859-1", "UTF-8",
                "UTF-16BE", "UTF-16LE", "UTF-16" // , "X-ROT13"
        };
        for (int i = 0; i < charsetNames.length; i++) {
            doEncode(Charset.forName(charsetNames[i]), input);
        }
    }

    /**
     * For a given Charset and input string, encode the chars and print out the
     * resulting byte encoding in a readable form.
     */
    private static void doEncode(Charset cs, String input) {
        ByteBuffer bb = cs.encode(input);
        System.out.println("Charset: " + cs.name());
        System.out.println(" Input: " + input);
        System.out.println("Encoded: ");
        for (int i = 0; bb.hasRemaining(); i++) {
            int b = bb.get();
            int ival = ((int) b) & 0xff;
            char c = (char) ival;
            // Keep tabular alignment pretty
            if (i < 10)
                System.out.print(" ");
            // Print index number
            System.out.print(" " + i + ": ");
            // Better formatted output is coming someday...
            if (ival < 16)
                System.out.print("0");
            // Print the hex value of the byte
            System.out.print(Integer.toHexString(ival));
            // If the byte seems to be the value of a
            // printable character, print it. No guarantee
            // it will be.
            if (Character.isWhitespace(c) || Character.isISOControl(c)) {
                System.out.println("");
            } else {
                System.out.println(" (" + c + ")");
            }
        }
        System.out.println("");
    }
}

result:

Charset: US-ASCII
Input: ?Ma?ana?
Encoded:
0: 20
1: 3f (?)
2: 4d (M)
3: 61 (a)
4: 3f (?)
5: 61 (a)
6: 6e (n)
7: 61 (a)
8: 3f (?)

Charset: ISO-8859-1
Input: ?Ma?ana?
Encoded:
0: 20
1: bf (?)
2: 4d (M)
3: 61 (a)
4: f1 (?)
5: 61 (a)
6: 6e (n)
7: 61 (a)
8: 3f (?)

Charset: UTF-8
Input: ?Ma?ana?
Encoded:
0: 20
1: c2 (?)
2: bf (?)
3: 4d (M)
4: 61 (a)
5: c3 (?)
6: b1 (±)
7: 61 (a)
8: 6e (n)
9: 61 (a)
10: 3f (?)

Charset: UTF-16BE
Input: ?Ma?ana?
Encoded:
0: 00
1: 20
2: 00
3: bf (?)
4: 00
5: 4d (M)
6: 00
7: 61 (a)
8: 00
9: f1 (?)
10: 00
11: 61 (a)
12: 00
13: 6e (n)
14: 00
15: 61 (a)
16: 00
17: 3f (?)

Charset: UTF-16LE
Input: ?Ma?ana?
Encoded:
0: 20
1: 00
2: bf (?)
3: 00
4: 4d (M)
5: 00
6: 61 (a)
7: 00
8: f1 (?)
9: 00
10: 61 (a)
11: 00
12: 6e (n)
13: 00
14: 61 (a)
15: 00
16: 3f (?)
17: 00

Charset: UTF-16
Input: ?Ma?ana?
Encoded:
0: fe (?)
1: ff (?)
2: 00
3: 20
4: 00
5: bf (?)
6: 00
7: 4d (M)
8: 00
9: 61 (a)
10: 00
11: f1 (?)
12: 00
13: 61 (a)
14: 00
15: 6e (n)
16: 00
17: 61 (a)
18: 00
19: 3f (?)

Character set class:

package java.nio.charset;
public abstract class Charset implements Comparable
{
    public static boolean isSupported (String charsetName)
    public static Charset forName (String charsetName)
    public static SortedMap availableCharsets()
    public final String name()
    public final Set aliases()
    public String displayName()
    public String displayName (Locale locale)
    public final boolean isRegistered()
    public boolean canEncode()
    public abstract CharsetEncoder newEncoder();
    public final ByteBuffer encode (CharBuffer cb)
    public final ByteBuffer encode (String str)
    public abstract CharsetDecoder newDecoder();
    public final CharBuffer decode (ByteBuffer bb)
    public abstract boolean contains (Charset cs);
    public final boolean equals (Object ob)
    public final int compareTo (Object ob)
    public final int hashCode()
    public final String toString()
}

A Charset object must satisfy several conditions:

- The canonical name of the character set should match the name registered with IANA.
- If IANA registers multiple names for the same character set, the canonical name returned by the object should match the MIME-preferred name in the IANA registration.
- If a character set name is removed from the registry, the current canonical name should be retained as an alias.
- If the character set is not registered with IANA, its canonical name must start with "X-" or "x-".

Most of the time, only JVM vendors need to pay attention to these rules. However, if you plan to ship your own character set as part of your application, it helps to know what not to do. Name your character set starting with "X-", so that isRegistered() returns false for it.
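The naming rules can be checked against the charsets a JVM already ships. This is a small sketch under the assumption that the JVM accepts "UTF8" and "ASCII" as aliases, which common JDKs do but which is not guaranteed by the rules above; CharsetNamesDemo and canonicalNameOf are illustrative names.

```java
import java.nio.charset.Charset;

class CharsetNamesDemo {
    /** Resolves a canonical name or alias to the charset's canonical name. */
    static String canonicalNameOf(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        // "UTF8" is assumed to be an alias; the canonical name is "UTF-8"
        Charset cs = Charset.forName("UTF8");
        System.out.println("Canonical name:  " + cs.name());
        // The alias set differs between JVM vendors and versions
        System.out.println("Aliases:         " + cs.aliases());
        System.out.println("IANA registered: " + cs.isRegistered());
    }
}
```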

Character set comparison:

public abstract class Charset implements Comparable
{
    // This is a partial API listing
    public abstract boolean contains (Charset cs);
    public final boolean equals (Object ob)
    public final int compareTo (Object ob)
    public final int hashCode()
    public final String toString()
}

Recall that a charset consists of a coded character set and an encoding scheme for that set. Like ordinary sets, one charset may be a subset of another. One charset (C1) contains another (C2) when every character that can be represented in C2 can also be represented in C1. Every charset is considered to contain itself. If this containment relationship holds, then any stream you can encode in C2 (the contained subset) can also be encoded in C1 without any replacement.
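As an illustration of containment, the sketch below asks a few standard charsets about each other. ContainsDemo and its contains helper are hypothetical names; the results noted in the comments are what common JDK implementations report.

```java
import java.nio.charset.Charset;

class ContainsDemo {
    /** True if the charset named outer reports that it contains inner. */
    static boolean contains(String outer, String inner) {
        return Charset.forName(outer).contains(Charset.forName(inner));
    }

    public static void main(String[] args) {
        // Every US-ASCII character is also representable in UTF-8
        System.out.println(contains("UTF-8", "US-ASCII"));   // true
        // The reverse does not hold
        System.out.println(contains("US-ASCII", "UTF-8"));   // false
        // Every charset contains itself
        System.out.println(contains("UTF-8", "UTF-8"));      // true
    }
}
```

An implementation is allowed to answer false conservatively, so a false result does not prove that containment fails.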

Character set encoder: A character set is composed of an encoded character set and an associated encoding scheme. The CharsetEncoder and CharsetDecoder classes implement conversion schemes.

float averageBytesPerChar()
Returns the average number of bytes that will be produced for each character of input.
boolean canEncode(char c)
Tells whether or not this encoder can encode the given character.
boolean canEncode(CharSequence cs)
Tells whether or not this encoder can encode the given character sequence.
Charset charset()
Returns the charset that created this encoder.
ByteBuffer encode(CharBuffer in)
Convenience method that encodes the remaining content of a single input character buffer into a newly-allocated byte buffer.
CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer.
protected abstract CoderResult encodeLoop(CharBuffer in, ByteBuffer out)
Encodes one or more characters into one or more bytes.
CoderResult flush(ByteBuffer out)
Flushes this encoder.
protected CoderResult implFlush(ByteBuffer out)
Flushes this encoder.
protected void implOnMalformedInput(CodingErrorAction newAction)
Reports a change to this encoder's malformed-input action.
protected void implOnUnmappableCharacter(CodingErrorAction newAction)
Reports a change to this encoder's unmappable-character action.
protected void implReplaceWith(byte[] newReplacement)
Reports a change to this encoder's replacement value.
protected void implReset()
Resets this encoder, clearing any charset-specific internal state.
boolean isLegalReplacement(byte[] repl)
Tells whether or not the given byte array is a legal replacement value for this encoder.
CodingErrorAction malformedInputAction()
Returns this encoder's current action for malformed-input errors.
float maxBytesPerChar()
Returns the maximum number of bytes that will be produced for each character of input.
CharsetEncoder onMalformedInput(CodingErrorAction newAction)
Changes this encoder's action for malformed-input errors.
CharsetEncoder onUnmappableCharacter(CodingErrorAction newAction)
Changes this encoder's action for unmappable-character errors.
byte[] replacement()
Returns this encoder's replacement value.
CharsetEncoder replaceWith(byte[] newReplacement)
Changes this encoder's replacement value.
CharsetEncoder reset()
Resets this encoder, clearing any internal state.
CodingErrorAction unmappableCharacterAction()
Returns this encoder's current action for unmappable-character errors.

The CharsetEncoder object is a stateful transformation engine: characters in, bytes out. Several calls to the encoder may be needed to complete a conversion, and the encoder retains its state between calls.

One note about the CharsetEncoder API: the simpler form of encode() is a convenience method. It encodes the CharBuffer you provide into a newly allocated ByteBuffer, performing the whole encoding operation in one call. This is the method that is ultimately invoked when you call encode() directly on the Charset class.

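The full form of encode() reports one of four conditions through its CoderResult return value:

Underflow
The input buffer has been consumed as far as possible. More input is needed, if there is any.

Overflow
The output buffer is full and must be drained before encoding can continue.

Malformed input
The input contains a sequence that is not valid, for example an unpaired surrogate.

Unmappable character
The input character is valid but has no representation in the target charset.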

While encoding, if the encoder encounters malformed or unmappable input, a result object describing the problem is returned. You can also test individual characters, or sequences of characters, in advance to determine whether they can be encoded. Here's how to check if encoding is possible:

package java.nio.charset;
public abstract class CharsetEncoder
{
    // This is a partial API listing
    public boolean canEncode (char c)
    public boolean canEncode (CharSequence cs)
}
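A minimal sketch of such a check, using a freshly created encoder each time. The class CanEncodeDemo and the helper canEncode(String, CharSequence) are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class CanEncodeDemo {
    /** Asks a fresh encoder of the named charset whether it can encode text. */
    static boolean canEncode(String charsetName, CharSequence text) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String word = "Ma\u00f1ana";
        // U+00F1 has no US-ASCII representation
        System.out.println(canEncode("US-ASCII", word));
        // ISO-8859-1 and UTF-8 can both represent it
        System.out.println(canEncode("ISO-8859-1", word));
        System.out.println(canEncode("UTF-8", word));
    }
}
```

A new encoder is used per call because canEncode() may not be invoked while an encoding operation is already in progress on the same encoder.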

CodingErrorAction defines three public fields:

REPORT
The default behavior of a newly created CharsetEncoder. It indicates that coding errors should be reported by returning a CoderResult object, as mentioned earlier.

IGNORE
Indicates that coding errors should be ignored: the erroneous input is dropped and encoding continues with what follows.

REPLACE
Coding errors are handled by dropping the erroneous input and writing the current replacement byte sequence defined for this CharsetEncoder to the output.

Remember, character set encoding converts characters into a sequence of bytes in preparation for later decoding. If the replacement sequence cannot be decoded into a valid character sequence, the encoded byte sequence becomes invalid.
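The three actions can be compared side by side. The following sketch encodes a string containing a character that US-ASCII cannot represent; ErrorActionDemo and encodeAscii are hypothetical names, and the helper turns the exception raised by the convenience encode() method into a marker string purely to keep the demo compact.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ErrorActionDemo {
    /** Encodes text as US-ASCII with the given action for bad input. */
    static String encodeAscii(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            // Decode again only so the result is easy to print and compare
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // With REPORT, the convenience encode() throws instead of
            // returning a CoderResult
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String input = "Ma\u00f1ana";
        System.out.println(encodeAscii(input, CodingErrorAction.REPLACE));
        System.out.println(encodeAscii(input, CodingErrorAction.IGNORE));
        System.out.println(encodeAscii(input, CodingErrorAction.REPORT));
    }
}
```

With REPLACE the unmappable character becomes the encoder's default replacement byte, a question mark; with IGNORE it disappears; with REPORT the call fails.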

CoderResult class: CoderResult objects are returned by CharsetEncoder and CharsetDecoder objects:

package java.nio.charset;
public class CoderResult {
    public static final CoderResult OVERFLOW
    public static final CoderResult UNDERFLOW
    public boolean isUnderflow()
    public boolean isOverflow()
    public boolean isError()
    public boolean isMalformed()
    public boolean isUnmappable()
    public int length()
    public static CoderResult malformedForLength (int length)
    public static CoderResult unmappableForLength (int length)
    public void throwException() throws CharacterCodingException
}
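To show how a caller might react to these results, here is a sketch that encodes through a deliberately small output buffer and drains it whenever the encoder reports overflow. CoderResultDemo, encodeInSmallSteps, and drain are illustrative names, and the minimum buffer size of 4 is an assumption chosen so that any single UTF-8 sequence fits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class CoderResultDemo {
    /** Encodes text as UTF-8 through an output buffer of bufferSize bytes. */
    static byte[] encodeInSmallSteps(String text, int bufferSize) {
        if (bufferSize < 4) {
            throw new IllegalArgumentException("buffer must hold one UTF-8 sequence");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        while (true) {
            CoderResult result = encoder.encode(in, out, true);
            drain(out, sink);
            if (result.isUnderflow()) {
                break; // all input has been consumed
            }
            if (result.isError()) {
                // malformed or unmappable input under the default REPORT action
                throw new IllegalArgumentException(result.toString());
            }
            // otherwise OVERFLOW: the buffer was drained, so encode again
        }
        while (encoder.flush(out).isOverflow()) {
            drain(out, sink);
        }
        drain(out, sink);
        return sink.toByteArray();
    }

    /** Moves the bytes written so far into sink and empties the buffer. */
    static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        sink.write(out.array(), out.position(), out.remaining());
        out.clear();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeInSmallSteps("Ma\u00f1ana", 4);
        System.out.println(bytes.length + " bytes");
    }
}
```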

Character Set Decoder: A character set decoder is the inverse of an encoder. It converts bytes that were encoded with a specific encoding scheme into a sequence of 16-bit Unicode characters. Like CharsetEncoder, CharsetDecoder is a stateful transformation engine. Neither is thread-safe, because calling their methods changes their state, and that state is retained between calls.

float averageCharsPerByte()
Returns the average number of characters that will be produced for each byte of input.
Charset charset()
Returns the charset that created this decoder.
CharBuffer decode(ByteBuffer in)
Convenience method that decodes the remaining content of a single input byte buffer into a newly-allocated character buffer.
CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
Decodes as many bytes as possible from the given input buffer, writing the results to the given output buffer.
protected abstract CoderResult decodeLoop(ByteBuffer in, CharBuffer out)
Decodes one or more bytes into one or more characters.
Charset detectedCharset()
Retrieves the charset that was detected by this decoder (optional operation).
CoderResult flush(CharBuffer out)
Flushes this decoder.
protected CoderResult implFlush(CharBuffer out)
Flushes this decoder.
protected void implOnMalformedInput(CodingErrorAction newAction)
Reports a change to this decoder's malformed-input action.
protected void implOnUnmappableCharacter(CodingErrorAction newAction)
Reports a change to this decoder's unmappable-character action.
protected void implReplaceWith(String newReplacement)
Reports a change to this decoder's replacement value.
protected void implReset()
Resets this decoder, clearing any charset-specific internal state.
boolean isAutoDetecting()
Tells whether or not this decoder implements an auto-detecting charset.
boolean isCharsetDetected()
Tells whether or not this decoder has yet detected a charset (optional operation).
CodingErrorAction malformedInputAction()
Returns this decoder's current action for malformed-input errors.
float maxCharsPerByte()
Returns the maximum number of characters that will be produced for each byte of input.
CharsetDecoder onMalformedInput(CodingErrorAction newAction)
Changes this decoder's action for malformed-input errors.
CharsetDecoder onUnmappableCharacter(CodingErrorAction newAction)
Changes this decoder's action for unmappable-character errors.
String replacement()
Returns this decoder's replacement value.
CharsetDecoder replaceWith(String newReplacement)
Changes this decoder's replacement value.
CharsetDecoder reset()
Resets this decoder, clearing any internal state.
CodingErrorAction unmappableCharacterAction()
Returns this decoder's current action for unmappable-character errors.

How to actually complete the decoding:

package java.nio.charset;
public abstract class CharsetDecoder
{
    // This is a partial API listing
    public final CharsetDecoder reset()
    public final CharBuffer decode (ByteBuffer in)
        throws CharacterCodingException
    public final CoderResult decode (ByteBuffer in, CharBuffer out,
        boolean endOfInput)
    public final CoderResult flush (CharBuffer out)
}

The decoding process is similar to encoding and involves the same basic steps:

1. Reset the decoder by calling reset() to put the decoder in a known state ready to receive input.

2. Call decode() zero or more times with endOfInput set to false to supply bytes to the decoding engine. As decoding proceeds, characters will be added to the given CharBuffer.

3. Set endOfInput to true and call decode() once to notify the decoder that all input has been provided.

4. Call flush() to ensure that all decoded characters have been sent to the output.
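The four steps can be sketched with input that arrives in two chunks, where a multibyte character may be split across the chunk boundary. DecodeStepsDemo and decodeChunks are hypothetical names, and for brevity this sketch ignores the CoderResult values that real code would check.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class DecodeStepsDemo {
    /** Decodes UTF-8 bytes that arrive as two separate chunks. */
    static String decodeChunks(byte[] first, byte[] second) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        int total = first.length + second.length;
        ByteBuffer in = ByteBuffer.allocate(total);
        // UTF-8 never yields more chars than it has bytes
        CharBuffer out = CharBuffer.allocate(total);

        decoder.reset();                    // step 1: known initial state

        in.put(first);
        in.flip();
        decoder.decode(in, out, false);     // step 2: more input will follow
        in.compact();                       // keep any undecoded partial sequence

        in.put(second);
        in.flip();
        decoder.decode(in, out, true);      // step 3: this is the last input

        decoder.flush(out);                 // step 4: emit any pending output

        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // 'M', then U+00F1 (0xC3 0xB1) split across the two chunks
        byte[] first = { 0x4d, (byte) 0xc3 };
        byte[] second = { (byte) 0xb1 };
        System.out.println(decodeChunks(first, second));
    }
}
```

The compact() call matters: after the first decode() the lead byte of the split character is still unread in the input buffer, and it has to be kept in front of the bytes from the second chunk.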

Example 6-2 illustrates how to decode a byte stream that represents character data in a given encoding.

Example 6-2. Character set decoding

package com.ronsoft.books.nio.charset;

import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;
import java.io.*;

/**
 * Test charset decoding.
 *
 * @author Ron Hitchens
 */
public class CharsetDecode {
    /**
     * Test charset decoding in the general case, detecting and handling buffer
     * under/overflow and flushing the decoder state at end of input. This code
     * reads from stdin and decodes the byte stream to chars, using ISO-8859-1
     * unless told otherwise. The decoded chars are written to stdout. This is
     * effectively a 'cat' for input text files, and another charset encoding
     * can be used by simply specifying it on the command line.
     */
    public static void main(String[] argv) throws IOException {
        // Default charset is ISO-8859-1 (Latin-1)
        String charsetName = "ISO-8859-1";
        // Charset name can be specified on the command line
        if (argv.length > 0) {
            charsetName = argv[0];
        }
        // Wrap a Channel around stdin, wrap a Writer around stdout,
        // find the named Charset and pass them to the decode method.
        // If the named charset is not valid, an exception of type
        // UnsupportedCharsetException will be thrown.
        decodeChannel(Channels.newChannel(System.in), new OutputStreamWriter(
                System.out), Charset.forName(charsetName));
    }

    /**
     * General purpose static method which reads bytes from a Channel, decodes
     * them according to the given Charset, and writes the decoded chars to
     * the given Writer.
     *
     * @param source
     *            A ReadableByteChannel object which will be read to EOF as a
     *            source of encoded bytes.
     * @param writer
     *            A Writer object to which decoded chars will be written.
     * @param charset
     *            A Charset object, whose CharsetDecoder will be used to do the
     *            character set decoding.
     */
    public static void decodeChannel(ReadableByteChannel source, Writer writer,
            Charset charset) throws UnsupportedCharsetException, IOException {
        // Get a decoder instance from the Charset
        CharsetDecoder decoder = charset.newDecoder();
        // Tell decoder to replace bad chars with default mark
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Allocate radically different input and output buffer sizes
        // for testing purposes
        ByteBuffer bb = ByteBuffer.allocateDirect(16 * 1024);
        CharBuffer cb = CharBuffer.allocate(57);
        // Buffer starts empty; indicate input is needed
        CoderResult result = CoderResult.UNDERFLOW;
        boolean eof = false;
        while (!eof) {
            // Input buffer underflow; decoder wants more input
            if (result == CoderResult.UNDERFLOW) {
                // decoder consumed all input, prepare to refill
                bb.clear();
                // Fill the input buffer; watch for EOF
                eof = (source.read(bb) == -1);
                // Prepare the buffer for reading by decoder
                bb.flip();
            }
            // Decode input bytes to output chars; pass EOF flag
            result = decoder.decode(bb, cb, eof);
            // If output buffer is full, drain output
            if (result == CoderResult.OVERFLOW) {
                drainCharBuf(cb, writer);
            }
        }
        // Flush any remaining state from the decoder, being careful
        // to detect output buffer overflow(s)
        while (decoder.flush(cb) == CoderResult.OVERFLOW) {
            drainCharBuf(cb, writer);
        }
        // Drain any chars remaining in the output buffer
        drainCharBuf(cb, writer);
        // Close the channel; push out any buffered data to stdout
        source.close();
        writer.flush();
    }

    /**
     * Helper method to drain the char buffer and write its content to the given
     * Writer object. Upon return, the buffer is empty and ready to be refilled.
     *
     * @param cb
     *            A CharBuffer containing chars to be written.
     * @param writer
     *            A Writer object to consume the chars in cb.
     */
    static void drainCharBuf(CharBuffer cb, Writer writer) throws IOException {
        cb.flip(); // Prepare buffer for draining
        // This writes the chars contained in the CharBuffer but
        // doesn't actually modify the state of the buffer.
        // If the char buffer was being drained by calls to get( ),
        // a loop might be needed here.
        if (cb.hasRemaining()) {
            writer.write(cb.toString());
        }
        cb.clear(); // Prepare buffer to be filled again
    }
}

Character Set Service Provider Interface: The pluggable SPI architecture is used in many different contexts throughout the Java environment. The 1.4 JDK has eight packages named spi, and several others play the same role under other names. Pluggability is a powerful design technique and one of the cornerstones on which Java's portability and adaptability are built.

Before browsing the API, it's important to explain how the Charset SPI works. The java.nio.charset.spi package contains only one abstract class, CharsetProvider. Concrete implementations of this class provide information about the Charset objects they make available. To define a custom character set, you must first create concrete implementations of Charset, CharsetEncoder, and CharsetDecoder from the java.nio.charset package. You then create a custom subclass of CharsetProvider that will provide those classes to the JVM.

Create a custom character set:

At a minimum, you must create a subclass of java.nio.charset.Charset, provide concrete implementations of its three abstract methods, and write a constructor. The Charset class has no default, parameterless constructor. This means that your custom character set class must declare a constructor, even if it accepts no parameters, because you must call Charset's constructor at instantiation time (by calling super() at the beginning of your constructor) to supply your charset's canonical name and aliases. Doing this lets the methods in the Charset class handle the name-related work for you.
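A bare-bones subclass might look like the sketch below. Everything here is hypothetical: the class DemoCharset, the name "X-DEMO", and the alias "x-demo-alias" are invented, and to stay short the sketch hands out UTF-8's own encoder and decoder instead of implementing custom ones as a real charset would.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical charset that reuses UTF-8's coders under a private "X-" name. */
class DemoCharset extends Charset {
    DemoCharset() {
        // Charset has no parameterless constructor, so the canonical
        // name and aliases must be passed up explicitly
        super("X-DEMO", new String[] { "x-demo-alias" });
    }

    @Override
    public boolean contains(Charset cs) {
        // Conservative answer: only claim to contain ourselves
        return cs instanceof DemoCharset;
    }

    @Override
    public CharsetEncoder newEncoder() {
        // Shortcut for this sketch; the returned encoder reports UTF-8 as
        // its charset, which a real implementation would avoid
        return StandardCharsets.UTF_8.newEncoder();
    }

    @Override
    public CharsetDecoder newDecoder() {
        return StandardCharsets.UTF_8.newDecoder();
    }

    public static void main(String[] args) {
        Charset cs = new DemoCharset();
        System.out.println(cs.name() + " " + cs.aliases());
        System.out.println("registered: " + cs.isRegistered());
    }
}
```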

Likewise, you need to provide concrete implementations of CharsetEncoder and CharsetDecoder. Recall that a character set is a collection of encoded characters and encoding/decoding schemes. As we saw before, encoding and decoding are almost symmetrical at the API level. A brief discussion of what is needed to implement an encoder is given here: the same applies to building a decoder.

Similar to Charset, CharsetEncoder does not have a default constructor, so you need to call super() in the concrete class constructor, providing the required parameters.

To provide your own CharsetEncoder implementation, you must at least supply a concrete encodeLoop() method. For simple encoding algorithms, the default implementations of the other methods should work fine. Note that encodeLoop() takes parameters similar to those of encode(), without the boolean flag. The encode() method delegates the actual encoding to encodeLoop(), which only needs to consume characters from the CharBuffer parameter and write the encoded bytes to the provided ByteBuffer.

Now that we have seen how to implement custom character sets, including the associated encoders and decoders, let's look at how to connect them to the JVM so that we can run code using them.

Provide your custom character set:

To make your own Charset implementation available to the JVM runtime environment, you must create concrete subclasses of the CharsetProvider class in java.nio.charset.spi, each with a parameterless constructor. The parameterless constructor is important because your CharsetProvider class is located by reading its fully qualified class name from a configuration file. This class name string is then passed to Class.newInstance() to instantiate your provider, which only works through the parameterless constructor.

The configuration file that the JVM reads to locate character set providers is named java.nio.charset.spi.CharsetProvider. It is located in a resource directory (META-INF/services) on the JVM classpath. Each Java Archive (JAR) has a META-INF directory that contains information about the classes and resources in that JAR. A directory named META-INF can also be placed at the top of a regular directory on the JVM classpath.

The CharsetProvider API is almost trivial. The actual work of providing a custom character set lies in creating the custom Charset, CharsetEncoder, and CharsetDecoder classes. The CharsetProvider is simply a facilitator between your character set and the runtime environment.
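To give a feel for how small a provider can be, here is a sketch that serves a single placeholder charset. DemoCharsetProvider, PassThroughCharset, "X-PASSTHROUGH", and "x-pass" are all invented for illustration; a real provider would be a public class in its own file and would return a charset with its own encoder and decoder.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

/**
 * Hypothetical provider. To be discovered, its fully qualified class name
 * would be listed on a line of its own in the classpath resource
 * META-INF/services/java.nio.charset.spi.CharsetProvider
 */
class DemoCharsetProvider extends CharsetProvider {
    private final Charset charset = new PassThroughCharset();

    /** The parameterless constructor is what the runtime instantiates. */
    public DemoCharsetProvider() {
    }

    @Override
    public Charset charsetForName(String charsetName) {
        if (charsetName.equalsIgnoreCase(charset.name())) {
            return charset;
        }
        for (String alias : charset.aliases()) {
            if (alias.equalsIgnoreCase(charsetName)) {
                return charset;
            }
        }
        return null; // not one of ours; other providers will be asked
    }

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    /** Placeholder charset that simply borrows UTF-8's coders. */
    private static final class PassThroughCharset extends Charset {
        PassThroughCharset() {
            super("X-PASSTHROUGH", new String[] { "x-pass" });
        }

        @Override
        public boolean contains(Charset cs) {
            return cs instanceof PassThroughCharset;
        }

        @Override
        public CharsetEncoder newEncoder() {
            return StandardCharsets.UTF_8.newEncoder();
        }

        @Override
        public CharsetDecoder newDecoder() {
            return StandardCharsets.UTF_8.newDecoder();
        }
    }
}
```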

Example 6-3 demonstrates the implementation of a custom Charset and CharsetProvider, including sample code that illustrates character set usage, encoding and decoding, and the Charset SPI.

Example 6-3. Customized Rot13 character set

package com.ronsoft.books.nio.charset;

import java.nio.CharBuffer;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.util.Map;
import java.util.Iterator;
import java.io.Writer;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.io.OutputStreamWriter;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileReader;

/**
 * A Charset implementation which performs Rot13 encoding. Rot-13 encoding is a
 * simple text obfuscation algorithm which shifts alphabetical characters by 13
 * so that 'a' becomes 'n', 'o' becomes 'b', etc. This algorithm was popularized
 * by the Usenet discussion forums many years ago to mask naughty words, hide
 * answers to questions, and so on. The Rot13 algorithm is symmetrical, applying
 * it to text that has been scrambled by Rot13 will give you the original
 * unscrambled text.
 *
 * Applying this Charset encoding to an output stream will cause everything you
 * write to that stream to be Rot13 scrambled as it's written out. And applying
 * it to an input stream causes data read to be Rot13 descrambled as it's read.
 *
 * @author Ron Hitchens
 */
public class Rot13Charset extends Charset {
    // the name of the base charset encoding we delegate to
    private static final String BASE_CHARSET_NAME = "UTF-8";
    // Handle to the real charset we'll use for transcoding between
    // characters and bytes. Doing this allows us to apply the Rot13
    // algorithm to multibyte charset encodings. But only the
    // ASCII alpha chars will be rotated, regardless of the base encoding.
    Charset baseCharset;

    /**
     * Constructor for the Rot13 charset. Call the superclass constructor to
     * pass along the name(s) we'll be known by. Then save a reference to the
     * delegate Charset.
     */
    protected Rot13Charset(String canonical, String[] aliases) {
        super(canonical, aliases);
        // Save the base charset we're delegating to
        baseCharset = Charset.forName(BASE_CHARSET_NAME);
    }

    // ----------------------------------------------------------
    /**
     * Called by users of this Charset to obtain an encoder. This implementation
     * instantiates an instance of a private class (defined below) and passes it
     * an encoder from the base Charset.
     */
    public CharsetEncoder newEncoder() {
        return new Rot13Encoder(this, baseCharset.newEncoder());
    }

    /**
     * Called by users of this Charset to obtain a decoder. This implementation
     * instantiates an instance of a private class (defined below) and passes it
     * a decoder from the base Charset.
     */
    public CharsetDecoder newDecoder() {
        return new Rot13Decoder(this, baseCharset.newDecoder());
    }

    /**
     * This method must be implemented by concrete Charsets. We always say no,
     * which is safe.
     */
    public boolean contains(Charset cs) {
        return (false);
    }

    /**
     * Common routine to rotate all the ASCII alpha chars in the given
     * CharBuffer by 13. Note that this code explicitly compares for upper and
     * lower case ASCII chars rather than using the methods
     * Character.isLowerCase and Character.isUpperCase. This is because the
     * rotate-by-13 scheme only works properly for the alphabetic characters of
     * the ASCII charset and those methods can return true for non-ASCII Unicode
     * chars.
     */
    private void rot13(CharBuffer cb) {
        for (int pos = cb.position(); pos < cb.limit(); pos++) {
            char c = cb.get(pos);
            char a = '\u0000';
            // Is it lowercase alpha?
            if ((c >= 'a') && (c <= 'z')) {
                a = 'a';
            }
            // Is it uppercase alpha?
            if ((c >= 'A') && (c <= 'Z')) {
                a = 'A';
            }
            // If either, roll it by 13
            if (a != '\u0000') {
                c = (char) ((((c - a) + 13) % 26) + a);
                cb.put(pos, c);
            }
        }
    }

//------------------------------------------------ --------
/**
* The encoder implementation for the Rot13 Chars et. This class, and the
* matching decoder class below, should also override the "impl" methods,
* such as implOnMalformedInput( ) and make passthrough calls to the
* baseEncoder object. That is left as an exercise for the hacker.
*/
private class Rot13Encoder extends CharsetEncoder {
private CharsetEncoder baseEncoder;

/**
* Constructor, call the superclass constructor with the Charset object
* and the encodings sizes from the delegate encoder.
*/
Rot13Encoder(Charset cs, CharsetEncoder baseEncoder) {
super(cs, baseEncoder.averageBytesPerChar(), baseEncoder
.maxBytesPerChar());
this.baseEncoder = baseEncoder;
}

/**
* Implementation of the encoding loop. First, we apply the Rot13
* scrambling algorithm to the CharBuffer, then reset the encoder for
* the base Charset and call it's encode( ) method to do the actual
* encoding. This may not work properly for non-Latin charsets. The
* CharBuffer passed in may be read -only or re-used by the caller for
* other purposes so we duplicate it and apply the Rot13 encoding to the
* copy. We DO want to advance the position of the input buffer to
* reflect the chars consumed.
*/
protected CoderResult encodeLoop(CharBuffer cb, ByteBuffer bb) {
CharBuffer tmpcb = CharBuffer.allocate(cb.remaining());
while (cb.hasRemaining()) {
tmpcb.put(cb.get());
}
tmpcb.rewind();
rot13(tmpcb);
baseEncoder.reset();
CoderResult cr = baseEncoder.encode(tmpcb, bb, true);
// If error or output overflow, we need to adjust
// the position of the input buffer to match what
// was really consumed from the temp buffer. If
// underflow (all input consumed), this is a no-op.
cb.position(cb.position() - tmpcb.remaining());
return(cr);
}
}

//--------------------------------------------------------
/**
* The decoder implementation for the Rot13 Charset.
*/
private class Rot13Decoder extends CharsetDecoder {
private CharsetDecoder baseDecoder;

/**
* Constructor, call the superclass constructor with the Charset object
* and pass along the chars/byte values from the delegate decoder.
*/
Rot13Decoder(Charset cs, CharsetDecoder baseDecoder) {
super(cs, baseDecoder.averageCharsPerByte(), baseDecoder
.maxCharsPerByte());
this.baseDecoder = baseDecoder;
}

/**
* Implementation of the decoding loop. First, we reset the decoder for
* the base charset, then call it to decode the bytes into characters,
* saving the result code. The CharBuffer is then de-scrambled with the
* Rot13 algorithm and the result code is returned. This may not work
* properly for non-Latin charsets.
*/
protected CoderResult decodeLoop(ByteBuffer bb, CharBuffer cb) {
baseDecoder.reset();
CoderResult result = baseDecoder.decode(bb, cb, true);
rot13(cb);
return (result);
}
}

//--------------------------------------------------------
/**
* Unit test for the Rot13 Charset. This main( ) will open and read an input
* file if named on the command line, or stdin if no args are provided, and
* write the contents to stdout via the X-ROT13 charset encoding. The
* "encryption" implemented by the Rot13 algorithm is symmetrical. Feeding
* in a plain-text file, such as Java source code for example, will output a
* scrambled version. Feeding the scrambled version back in will yield the
* original plain-text document.
*/
public static void main(String[] argv) throws Exception {
BufferedReader in;
if (argv.length > 0) {
// Open the named file
in = new BufferedReader(new FileReader(argv[0]));
} else {
// Wrap a BufferedReader around stdin
in = new BufferedReader(new InputStreamReader(System.in));
}
// Create a PrintStream that uses the Rot13 encoding
PrintStream out = new PrintStream(System.out, false, "X-ROT13");
String s = null;
// Read all input and write it to the output.
// As the data passes through the PrintStream,
// it will be Rot13-encoded.
while ((s = in.readLine()) != null) {
out.println (s);
}
out.flush();
}
}

To use this Charset and its encoder and decoder, it must be made available to the Java runtime environment. This is done with the CharsetProvider class (Example 6-4).

Example 6-4. Customized character set provider

The code is as follows:

package com.ronsoft.books.nio.charset;

import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.HashSet;
import java.util.Iterator;

/**
* A CharsetProvider class which makes available the charsets provided by
* Ronsoft. Currently there is only one, namely the X-ROT13 charset. This is
* not a registered IANA charset, so its name begins with "X-" to avoid name
* clashes with official charsets.
*
* To activate this CharsetProvider, it's necessary to add a file to the
* classpath of the JVM runtime at the following location:
* META-INF/services/java.nio.charset.spi.CharsetProvider
*
* That file must contain a line with the fully qualified name of this class on
* a line by itself: com.ronsoft.books.nio.charset.RonsoftCharsetProvider
*
* See the javadoc page for java.nio.charset.spi.CharsetProvider for full
* details.
*
* @author Ron Hitchens ([email protected])
*/
public class RonsoftCharsetProvider extends CharsetProvider {
// The name of the charset we provide
private static final String CHARSET_NAME = "X-ROT13";
// A handle to the Charset object
private Charset rot13 = null;

/**
* Constructor, instantiate a Charset object and save the reference.
*/
public RonsoftCharsetProvider() {
this.rot13 = new Rot13Charset(CHARSET_NAME, new String[0]);
}

/**
* Called by Charset static methods to find a particular named Charset. If
* it's the name of this charset (we don't have any aliases) then return the
* Rot13 Charset, else return null.
*/
public Charset charsetForName(String charsetName) {
if (charsetName.equalsIgnoreCase(CHARSET_NAME)) {
return (rot13);
}
return (null);
}

/**
* Return an Iterator over the set of Charset objects we provide.
*
* @return An Iterator object containing references to all the Charset
* objects provided by this class.
*/
public Iterator<Charset> charsets() {
HashSet<Charset> set = new HashSet<Charset>(1);
set.add(rot13);
return (set.iterator());
}
}

For the JVM runtime environment to see this charset provider, a file named META-INF/services/java.nio.charset.spi.CharsetProvider must exist in one of the directories or JAR files on the classpath. The content of that file must be:
com.ronsoft.books.nio.charset.RonsoftCharsetProvider
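
Whether the provider was actually picked up can be checked from ordinary code. The following is a minimal sketch, not part of the original example: the class name CharsetLookupDemo and the helper describe( ) are hypothetical, and the result for X-ROT13 depends on whether the provider file and classes are really on the classpath.

```java
import java.nio.charset.Charset;

class CharsetLookupDemo {
    // Report whether the named charset can be resolved by this JVM.
    // Charset.isSupported() consults the built-in charsets and any
    // CharsetProvider implementations the runtime has located.
    static String describe(String name) {
        if (Charset.isSupported(name)) {
            Charset cs = Charset.forName(name);
            return cs.name() + " is available, aliases=" + cs.aliases();
        }
        return name + " is not available";
    }

    public static void main(String[] args) {
        // Without the provider file on the classpath, the first
        // line reports that X-ROT13 is not available.
        System.out.println(describe("X-ROT13"));
        System.out.println(describe("UTF-8"));
    }
}
```

If the first line reports that the charset is not available, the usual suspects are a misspelled file name under META-INF/services or a class name in the file that does not match the provider class.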

Adding X-ROT13 to the charset list in Example 6-1 produces this additional output:

Charset: X-ROT13
Input: ¿Mañana?
Encoded:
0: C2 (Â)
1: BF (¿)
2: 5A (Z)
3: 6E (n)
4: C3 (Ã)
5: B1 (±)
6: 6E (n)
7: 61 (a)
8: 6E (n)
9: 3F (?)
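
The byte values above can be reproduced without registering the charset at all. The sketch below is an illustration only: it assumes the delegate charset is UTF-8 (which the two-byte sequences C2 BF and C3 B1 suggest), and the class name Rot13BytesDemo and its helpers are hypothetical. It rotates the ASCII letters by 13 places and then encodes the result with UTF-8, which is the same order of operations the encoder uses.

```java
import java.nio.charset.StandardCharsets;

class Rot13BytesDemo {
    // Rotate ASCII letters by 13 places; other characters pass through.
    static String rot13(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'a' && c <= 'z') {
                c = (char) (((c - 'a' + 13) % 26) + 'a');
            } else if (c >= 'A' && c <= 'Z') {
                c = (char) (((c - 'A' + 13) % 26) + 'A');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Scramble the text, encode it as UTF-8 (assumed base charset),
    // and return the bytes as space-separated hex pairs.
    static String encodedHex(String text) {
        byte[] bytes = rot13(text).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input is the same string as in the listing above,
        // written with escapes for U+00BF and U+00F1.
        // Prints: C2 BF 5A 6E C3 B1 6E 61 6E 3F
        System.out.println(encodedHex("\u00BFMa\u00F1ana?"));
    }
}
```

Only the four ASCII letters change; the inverted question mark and the n with tilde fall outside the a-z and A-Z ranges, so they are encoded unchanged as two-byte UTF-8 sequences.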

Summary: Many Java programmers will never need to deal with character set transcoding, and most will never create a custom charset. But for those who do, the classes in java.nio.charset and java.nio.charset.spi provide a powerful and flexible mechanism for character handling.

Charset (character set class)
Encapsulates a coded character set and the encoding scheme used to represent sequences of characters from that set as byte sequences.

CharsetEncoder (character set encoder class)
An encoding engine that converts a character sequence into a byte sequence. The byte sequence can later be decoded to reconstruct the original character sequence.

CharsetDecoder (character set decoder class)
A decoding engine that converts an encoded byte sequence into a character sequence.

CharsetProvider SPI (character set supplier SPI)
Locates Charset implementations through the service provider mechanism and makes them available to the runtime environment.
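
How the first three classes fit together can be sketched with a built-in charset. This is a minimal illustration rather than code from the examples above: the class name CharsetRoundTripDemo and the helper roundTrip( ) are hypothetical, and wrapping the checked CharacterCodingException in an IllegalArgumentException is just one possible design choice.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetRoundTripDemo {
    // Encode text to bytes with the given charset, then decode it back.
    // REPORT makes the coders fail on bad input instead of substituting
    // a replacement character.
    static String roundTrip(String text, Charset cs) {
        try {
            CharsetEncoder encoder = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return decoder.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    cs.name() + " cannot represent the input", e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00BFMa\u00F1ana?";
        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
        try {
            // US-ASCII has no mapping for U+00BF or U+00F1.
            roundTrip(text, StandardCharsets.US_ASCII);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A custom charset registered through a CharsetProvider could be passed to roundTrip( ) in the same way once Charset.forName( ) is able to resolve it.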