Java Character Encoding and Decoding: Implementation Details

Author: Eve Cole  Updated: 2024-11-24 09:48:01

Character set basics:

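Character set
A set of characters, that is, symbols with special semantic meaning. The letter "A" is a character. "%" is also a character. Characters have no intrinsic numeric value and no direct connection to ASCII, Unicode, or even computers. Symbols existed long before computers did.
Coded character set
An assignment of numeric values to a set of characters. Assigning codes to characters lets them be represented digitally using a specific coded character set. Other coded character sets may assign different numeric values to the same character. Character set mappings are usually defined by standards bodies; examples are US-ASCII, ISO 8859-1, Unicode (ISO 10646-1), and JIS X0201.
Character-encoding scheme
A mapping of the members of a coded character set to octets (8-bit bytes). The encoding scheme defines how a sequence of character codes is expressed as a sequence of bytes. The numeric values of the character codes need not be the same as the encoded bytes, nor does the relationship need to be one-to-one or one-to-many. In principle, encoding and decoding a character set can be thought of as roughly analogous to serializing and deserializing objects.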

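Character data is usually encoded for network transmission or file storage. An encoding scheme is not a character set; it is a mapping. But because the two are so closely tied, most encodings are associated with a single character set. UTF-8, for example, is used only to encode the Unicode character set. Nevertheless, one encoding scheme can handle more than one character set. EUC, for example, can encode characters for several Asian languages.
Figure 6-1 illustrates how a sequence of Unicode characters is encoded as a sequence of bytes using the UTF-8 encoding scheme. UTF-8 encodes character code values below 0x80 as single-byte values (standard ASCII). All other Unicode characters are encoded as multibyte sequences: 2 to 6 bytes under the original definition (http://www.ietf.org/rfc/rfc2279.txt), and 2 to 4 bytes under the later RFC 3629, which limits UTF-8 to code points up to U+10FFFF.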
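The variable-length behavior described above can be checked directly. The following is a minimal sketch; the class name Utf8Lengths and the helper utf8Length are hypothetical names chosen for illustration, and the sample characters are arbitrary picks from each length class.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    /** Returns the number of bytes UTF-8 needs to encode the given text. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, below 0x80: 1 byte
        System.out.println(utf8Length("\u00f1"));       // U+00F1: 2 bytes
        System.out.println(utf8Length("\u20ac"));       // U+20AC: 3 bytes
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600 (surrogate pair): 4 bytes
    }
}
```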

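Charset
The term charset is defined in RFC 2278 (http://ietf.org/rfc/rfc2278.txt). It is the combination of a coded character set and a character-encoding scheme. The central class of the java.nio.charset package is Charset, which encapsulates the charset abstraction.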
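Unicode was originally designed as a 16-bit character encoding; it has since grown beyond 16 bits, and a Java char holds one 16-bit UTF-16 code unit. Unicode attempts to unify the character sets of all the world's languages into a single, comprehensive mapping. It has earned its place, but many other character encodings remain in wide use today.
Most operating systems are still byte-oriented when it comes to I/O and file storage, so whatever encoding is used, Unicode or otherwise, there is still a need to convert between byte sequences and character set encodings.
The classes in the java.nio.charset package address this need. This is not the first time the Java platform has dealt with character set encoding, but it is the most systematic, comprehensive, and flexible solution. The java.nio.charset.spi package provides a service provider interface (SPI) so that encoders and decoders can be plugged in as needed.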
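To make the byte/character conversion concrete, here is a minimal round-trip sketch using the convenience methods Charset.encode() and Charset.decode(). The class name RoundTrip and its helper are hypothetical. Note that these convenience methods substitute a replacement for characters the charset cannot represent, so a round trip through US-ASCII loses the accented characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    /** Encodes text to bytes with the given charset, then decodes it back. */
    static String roundTrip(String text, Charset cs) {
        ByteBuffer bytes = cs.encode(text);   // chars -> bytes
        CharBuffer chars = cs.decode(bytes);  // bytes -> chars
        return chars.toString();
    }

    public static void main(String[] args) {
        String input = "\u00bfMa\u00f1ana?";
        // UTF-8 can represent every character, so the text survives intact
        System.out.println(roundTrip(input, StandardCharsets.UTF_8).equals(input));
        // US-ASCII cannot; unmappable characters come back as '?'
        System.out.println(roundTrip(input, StandardCharsets.US_ASCII));
    }
}
```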

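Character sets: the default charset is determined at JVM startup and depends on the underlying operating system environment, the locale, and/or the JVM configuration. If you need a specific charset, the safest course is to name it explicitly. Do not assume that the default in deployment is the same as in your development environment. Charset names are case-insensitive; that is, uppercase and lowercase letters are treated as equivalent when charset names are compared. The Internet Assigned Numbers Authority (IANA) maintains all officially registered charset names.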
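A short sketch of these points follows: it prints the platform default, shows that lookup by name ignores case, and names the charset explicitly when converting. The class name CharsetNames and the helper sameCharset are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetNames {
    /** True if both names resolve to the same Charset. */
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        // Whatever the platform chose at startup; varies between machines
        System.out.println("Default: " + Charset.defaultCharset());
        // Names are compared without regard to case
        System.out.println(sameCharset("utf-8", "UTF-8"));
        // Safer: name the charset explicitly instead of relying on the default
        byte[] bytes = "text".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);
    }
}
```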

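Example 6-1 demonstrates how characters are translated to byte sequences by different Charset implementations.

Example 6-1. Encoding with the standard charsets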

package com.ronsoft.books.nio.charset;

import java.nio.charset.Charset;
import java.nio.ByteBuffer;

/**
* Charset encoding test. Run the same input string, which contains some
* non-ascii characters, through several Charset encoders and dump out the hex
* values of the resulting byte sequences.
*
* @author Ron Hitchens
*/
public class EncodeTest {
public static void main(String[] argv) throws Exception {
// This is the character sequence to encode. Unicode escapes need a
// backslash: \u00bf is the inverted question mark, \u00f1 is n-tilde.
String input = " \u00bfMa\u00f1ana?";
// the list of charsets to encode with
String[] charsetNames = { "US-ASCII", "ISO-8859-1", "UTF-8",
"UTF-16BE", "UTF-16LE", "UTF-16" // , "X-ROT13"
};
for (int i = 0; i < charsetNames.length; i++) {
doEncode(Charset.forName(charsetNames[i]), input);
}
}

/**
* For a given Charset and input string, encode the chars and print out the
* resulting byte encoding in a readable form.
*/
private static void doEncode(Charset cs, String input) {
ByteBuffer bb = cs.encode(input);
System.out.println("Charset: " + cs.name());
System.out.println(" Input: " + input);
System.out.println("Encoded: ");
for (int i = 0; bb.hasRemaining(); i++) {
int b = bb.get();
int ival = ((int) b) & 0xff;
char c = (char) ival;
// Keep tabular alignment pretty
if (i < 10)
System.out.print(" ");
// Print index number
System.out.print(" " + i + ": ");
// Better formatted output is coming someday...
if (ival < 16)
System.out.print("0");
// Print the hex value of the byte
System.out.print(Integer.toHexString(ival));
// If the byte seems to be the value of a
// printable character, print it. No guarantee
// it will be.
if (Character.isWhitespace(c) || Character.isISOControl(c)) {
System.out.println("");
} else {
System.out.println(" (" + c + ")");
}
}
System.out.println("");
}
}

Output:

Charset: US-ASCII
Input: ?Ma?ana?
Encoded:
0: 20
1: 3f (?)
2: 4d (M)
3: 61 (a)
4: 3f (?)
5: 61 (a)
6: 6e (n)
7: 61 (a)
8: 3f (?)

Charset: ISO-8859-1
Input: ?Ma?ana?
Encoded:
0: 20
1: bf (?)
2: 4d (M)
3: 61 (a)
4: f1 (?)
5: 61 (a)
6: 6e (n)
7: 61 (a)
8: 3f (?)

Charset: UTF-8
Input: ?Ma?ana?
Encoded:
0: 20
1: c2 (?)
2: bf (?)
3: 4d (M)
4: 61 (a)
5: c3 (?)
6: b1 (±)
7: 61 (a)
8: 6e (n)
9: 61 (a)
10: 3f (?)

Charset: UTF-16BE
Input: ?Ma?ana?
Encoded:
0: 00
1: 20
2: 00
3: bf (?)
4: 00
5: 4d (M)
6: 00
7: 61 (a)
8: 00
9: f1 (?)
10: 00
11: 61 (a)
12: 00
13: 6e (n)
14: 00
15: 61 (a)
16: 00
17: 3f (?)

Charset: UTF-16LE
Input: ?Ma?ana?
Encoded:
0: 20
1: 00
2: bf (?)
3: 00
4: 4d (M)
5: 00
6: 61 (a)
7: 00
8: f1 (?)
9: 00
10: 61 (a)
11: 00
12: 6e (n)
13: 00
14: 61 (a)
15: 00
16: 3f (?)
17: 00

Charset: UTF-16
Input: ?Ma?ana?
Encoded:
0: fe (?)
1: ff (?)
2: 00
3: 20
4: 00
5: bf (?)
6: 00
7: 4d (M)
8: 00
9: 61 (a)
10: 00
11: f1 (?)
12: 00
13: 61 (a)
14: 00
15: 6e (n)
16: 00
17: 61 (a)
18: 00
19: 3f (?)

The Charset class:

package java.nio.charset;
public abstract class Charset implements Comparable
{
public static boolean isSupported (String charsetName)
public static Charset forName (String charsetName)
public static SortedMap availableCharsets()
public final String name()
public final Set aliases()
public String displayName()
public String displayName (Locale locale)
public final boolean isRegistered()
public boolean canEncode()
public abstract CharsetEncoder newEncoder();
public final ByteBuffer encode (CharBuffer cb)
public final ByteBuffer encode (String str)
public abstract CharsetDecoder newDecoder();
public final CharBuffer decode (ByteBuffer bb)
public abstract boolean contains (Charset cs);
public final boolean equals (Object ob)
public final int compareTo (Object ob)
public final int hashCode()
public final String toString()
}

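A Charset object is expected to satisfy several conditions:

• The canonical name of a charset should match the name registered with IANA.
• If IANA has registered multiple names for the same charset, the canonical name returned by the object should match the MIME-preferred name in the IANA registry.
• If a charset name is removed from the registry, the current canonical name should be retained as an alias.
• If a charset is not registered with IANA, its canonical name must begin with "X-" or "x-".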

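For the most part, only JVM vendors need to be concerned with these rules. However, if you plan to ship your own charset as part of an application, it helps to know what not to do. Your charset should report false from isRegistered() and should be given a name beginning with "X-". Because isRegistered() is declared final in the listing above, the name is what you control.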
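The naming rules can be observed on a standard charset. This sketch looks up ISO-8859-1 through one of its aliases and prints its canonical name, alias set, and registration status. The class name CanonicalNames and the helper canonicalName are hypothetical; the alias "latin1" is assumed to be present, as it is in common JDK builds.

```java
import java.nio.charset.Charset;

class CanonicalNames {
    /** Resolves a canonical name or alias to the charset's canonical name. */
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("latin1"); // lookup by alias
        System.out.println("Canonical:  " + latin1.name());
        System.out.println("Aliases:    " + latin1.aliases());
        System.out.println("Registered: " + latin1.isRegistered());
    }
}
```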

Comparing charsets:

public abstract class Charset implements Comparable
{
// This is a partial API listing
public abstract boolean contains (Charset cs);
public final boolean equals (Object ob)
public final int compareTo (Object ob)
public final int hashCode()
public final String toString()
}

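Recall that a charset consists of a coded character set and an encoding scheme for that set. As with ordinary sets, one charset may be a subset of another. One charset (C1) contains another (C2) if every character that can be represented in C2 can be represented identically in C1. Every charset is considered to contain itself. If this containment relation holds, then any stream you can encode in C2 (the contained subset) is guaranteed to be encodable in C1 as well, without any need for substitutions.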
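The containment relation can be queried with Charset.contains(). Below is a small sketch using the standard charsets; the class name ContainsDemo and the helper contains are hypothetical wrappers added for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContainsDemo {
    /** True if every character representable in inner is representable in outer. */
    static boolean contains(Charset outer, Charset inner) {
        return outer.contains(inner);
    }

    public static void main(String[] args) {
        // UTF-8 can represent everything US-ASCII can
        System.out.println(contains(StandardCharsets.UTF_8, StandardCharsets.US_ASCII));
        // The reverse does not hold
        System.out.println(contains(StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
        // Every charset contains itself
        System.out.println(contains(StandardCharsets.UTF_8, StandardCharsets.UTF_8));
    }
}
```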

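Charset encoders: a charset consists of a coded character set and an associated encoding scheme. The CharsetEncoder and CharsetDecoder classes implement the conversion scheme.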

float averageBytesPerChar()
Returns the average number of bytes that will be produced for each character of input.
boolean canEncode(char c)
Tells whether or not this encoder can encode the given character.
boolean canEncode(CharSequence cs)
Tells whether or not this encoder can encode the given character sequence.
Charset charset()
Returns the charset that created this encoder.
ByteBuffer encode(CharBuffer in)
Convenience method that encodes the remaining content of a single input character buffer into a newly-allocated byte buffer.
CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer.
protected abstract CoderResult encodeLoop(CharBuffer in, ByteBuffer out)
Encodes one or more characters into one or more bytes.
CoderResult flush(ByteBuffer out)
Flushes this encoder.
protected CoderResult implFlush(ByteBuffer out)
Flushes this encoder.
protected void implOnMalformedInput(CodingErrorAction newAction)
Reports a change to this encoder's malformed-input action.
protected void implOnUnmappableCharacter(CodingErrorAction newAction)
Reports a change to this encoder's unmappable-character action.
protected void implReplaceWith(byte[] newReplacement)
Reports a change to this encoder's replacement value.
protected void implReset()
Resets this encoder, clearing any charset-specific internal state.
boolean isLegalReplacement(byte[] repl)
Tells whether or not the given byte array is a legal replacement value for this encoder.
CodingErrorAction malformedInputAction()
Returns this encoder's current action for malformed-input errors.
float maxBytesPerChar()
Returns the maximum number of bytes that will be produced for each character of input.
CharsetEncoder onMalformedInput(CodingErrorAction newAction)
Changes this encoder's action for malformed-input errors.
CharsetEncoder onUnmappableCharacter(CodingErrorAction newAction)
Changes this encoder's action for unmappable-character errors.
byte[] replacement()
Returns this encoder's replacement value.
CharsetEncoder replaceWith(byte[] newReplacement)
Changes this encoder's replacement value.
CharsetEncoder reset()
Resets this encoder, clearing any internal state.
CodingErrorAction unmappableCharacterAction()
Returns this encoder's current action for unmappable-character errors.

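A CharsetEncoder object is a stateful transformation engine: characters go in and bytes come out. Several calls to the encoder may be needed to complete a conversion, and the encoder retains the state of the conversion between calls.

A note about the CharsetEncoder API: the simpler form of encode() is a convenience method that performs the entire encoding of the CharBuffer you supply into a newly allocated ByteBuffer. This is the method that is ultimately invoked when you call encode() directly on the Charset class.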

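The three-argument form of encode() returns a CoderResult that describes one of the following conditions:

Underflow
A normal condition: the input has been fully consumed, or more input is needed before encoding can continue.

Overflow
The output buffer is full and must be drained before encoding can continue.

Malformed input
The input contains a sequence that is not legal for this charset, for example an unpaired surrogate.

Unmappable character
The input character is valid but has no representation in the target charset.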
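The result conditions listed above drive the usual encoding loop. The following is a minimal sketch that encodes a string to UTF-8 through a deliberately small output buffer, draining it on every overflow. The class SmallBufferEncode, its methods, and the use of a ByteArrayOutputStream as the sink are assumptions made for illustration, not part of the original example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class SmallBufferEncode {
    /** Encodes text as UTF-8 using an output buffer of only bufferSize bytes. */
    static byte[] encodeWithSmallBuffer(String text, int bufferSize) {
        if (bufferSize < 4) {
            // A single character may need up to 4 bytes in UTF-8
            throw new IllegalArgumentException("bufferSize must be at least 4");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CoderResult result;
        do {
            // All input is already in the buffer, so endOfInput is true
            result = encoder.encode(in, out, true);
            if (result.isError()) {
                // Malformed or unmappable input (default action is REPORT)
                throw new IllegalArgumentException("cannot encode: " + result);
            }
            drain(out, sink);
        } while (result.isOverflow()); // overflow: output was full, go again
        // Underflow: input exhausted. Flush any state held by the encoder.
        while (encoder.flush(out).isOverflow()) {
            drain(out, sink);
        }
        drain(out, sink);
        return sink.toByteArray();
    }

    /** Copies the bytes written so far to the sink and empties the buffer. */
    private static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        sink.write(out.array(), out.arrayOffset(), out.remaining());
        out.clear();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithSmallBuffer("\u00bfMa\u00f1ana?", 4);
        System.out.println(bytes.length + " bytes");
    }
}
```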

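During encoding, if the encoder encounters malformed or unmappable input, a result object is returned. You can also test individual characters, or sequences of characters, to determine whether they can be encoded. These are the methods that check whether encoding is possible: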

package java.nio.charset;
public abstract class CharsetEncoder
{
// This is a partial API listing
public boolean canEncode (char c)
public boolean canEncode (CharSequence cs)
}
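A brief sketch of these checks follows. The class name CanEncodeCheck and the helper fits are hypothetical; the charsets and sample strings are arbitrary choices.

```java
import java.nio.charset.Charset;

class CanEncodeCheck {
    /** True if every character of text can be encoded in the named charset. */
    static boolean fits(String charsetName, CharSequence text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fits("US-ASCII", "Manana"));        // plain ASCII
        System.out.println(fits("US-ASCII", "Ma\u00f1ana"));   // n-tilde is not ASCII
        System.out.println(fits("ISO-8859-1", "Ma\u00f1ana")); // but it is Latin-1
        System.out.println(fits("ISO-8859-1", "\u20ac"));      // euro sign is not
    }
}
```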

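CodingErrorAction defines three public fields:

REPORT
The default behavior when a CharsetEncoder is created. This action indicates that coding errors should be reported by returning a CoderResult object, as mentioned earlier.

IGNORE
Indicates that coding errors should be ignored and that any erroneous input should be dropped, as if it had not been there.

REPLACE
Handles coding errors by dropping the erroneous input and emitting the replacement byte sequence currently defined for this CharsetEncoder.

Remember that charset encoding converts characters into byte sequences in preparation for later decoding. If the replacement sequence cannot be decoded into a valid character sequence, the encoded byte sequence becomes invalid.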
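The effect of the three actions can be compared on the same input. This is a minimal sketch: the class ErrorActions and its helper encodeAscii are hypothetical, and decoding the result back to a String is done only to make the output easy to read.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ErrorActions {
    /** Encodes text as US-ASCII under the given error action and shows the result. */
    static String encodeAscii(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // With REPORT, the convenience encode() turns the error into an exception
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String input = "Ma\u00f1ana"; // contains one character ASCII cannot map
        System.out.println(encodeAscii(input, CodingErrorAction.REPLACE));
        System.out.println(encodeAscii(input, CodingErrorAction.IGNORE));
        System.out.println(encodeAscii(input, CodingErrorAction.REPORT));
    }
}
```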

The CoderResult class: CoderResult objects are returned by CharsetEncoder and CharsetDecoder objects:

package java.nio.charset;
public class CoderResult {
public static final CoderResult OVERFLOW
public static final CoderResult UNDERFLOW
public boolean isUnderflow()
public boolean isOverflow()
public boolean isError()
public boolean isMalformed()
public boolean isUnmappable()
public int length()
public static CoderResult malformedForLength (int length)
public static CoderResult unmappableForLength (int length)
public void throwException() throws CharacterCodingException
}

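Charset decoders: a charset decoder is the inverse of an encoder. It converts bytes encoded with a particular encoding scheme into a sequence of 16-bit Unicode characters (Java chars). Like CharsetEncoder, CharsetDecoder is a stateful transformation engine. Neither is thread-safe, because calling their methods changes their state, and that state is retained between calls.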

float averageCharsPerByte()
Returns the average number of characters that will be produced for each byte of input.
Charset charset()
Returns the charset that created this decoder.
CharBuffer decode(ByteBuffer in)
Convenience method that decodes the remaining content of a single input byte buffer into a newly-allocated character buffer.
CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
Decodes as many bytes as possible from the given input buffer, writing the results to the given output buffer.
protected abstract CoderResult decodeLoop(ByteBuffer in, CharBuffer out)
Decodes one or more bytes into one or more characters.
Charset detectedCharset()
Retrieves the charset that was detected by this decoder (optional operation).
CoderResult flush(CharBuffer out)
Flushes this decoder.
protected CoderResult implFlush(CharBuffer out)
Flushes this decoder.
protected void implOnMalformedInput(CodingErrorAction newAction)
Reports a change to this decoder's malformed-input action.
protected void implOnUnmappableCharacter(CodingErrorAction newAction)
Reports a change to this decoder's unmappable-character action.
protected void implReplaceWith(String newReplacement)
Reports a change to this decoder's replacement value.
protected void implReset()
Resets this decoder, clearing any charset-specific internal state.
boolean isAutoDetecting()
Tells whether or not this decoder implements an auto-detecting charset.
boolean isCharsetDetected()
Tells whether or not this decoder has yet detected a charset (optional operation).
CodingErrorAction malformedInputAction()
Returns this decoder's current action for malformed-input errors.
float maxCharsPerByte()
Returns the maximum number of characters that will be produced for each byte of input.
CharsetDecoder onMalformedInput(CodingErrorAction newAction)
Changes this decoder's action for malformed-input errors.
CharsetDecoder onUnmappableCharacter(CodingErrorAction newAction)
Changes this decoder's action for unmappable-character errors.
String replacement()
Returns this decoder's replacement value.
CharsetDecoder replaceWith(String newReplacement)
Changes this decoder's replacement value.
CharsetDecoder reset()
Resets this decoder, clearing any internal state.
CodingErrorAction unmappableCharacterAction()
Returns this decoder's current action for unmappable-character errors.

The methods that actually perform decoding:

package java.nio.charset;
public abstract class CharsetDecoder
{
// This is a partial API listing
public final CharsetDecoder reset()
public final CharBuffer decode (ByteBuffer in)
throws CharacterCodingException
public final CoderResult decode (ByteBuffer in, CharBuffer out,
boolean endOfInput)
public final CoderResult flush (CharBuffer out)
}

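Decoding is similar to encoding and involves the same basic steps:

1. Reset the decoder by calling reset(), which puts it in a known state, ready to accept input.

2. Call decode() zero or more times with endOfInput set to false to feed bytes into the decoding engine. As decoding proceeds, characters are added to the given CharBuffer.

3. Call decode() once with endOfInput set to true to tell the decoder that all the input has been provided.

4. Call flush() to make sure that all decoded characters have been sent to the output.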
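The four steps above can be sketched as follows. This is an illustrative example only: the class StepwiseDecode and the method decodeInTwoChunks are hypothetical, and UTF-8 is an arbitrary choice of charset. The input arrives in two chunks, and a multibyte sequence may be split across them; compact() keeps the unconsumed bytes for the next call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class StepwiseDecode {
    /** Decodes UTF-8 input that arrives as two separate chunks of bytes. */
    static String decodeInTwoChunks(byte[] first, byte[] second) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(first.length + second.length);
        // UTF-8 never yields more chars than input bytes, so this cannot overflow
        CharBuffer out = CharBuffer.allocate(first.length + second.length);

        decoder.reset();                               // step 1
        in.put(first);
        in.flip();
        check(decoder.decode(in, out, false));         // step 2: more input to come
        in.compact();                                  // keep any partial sequence
        in.put(second);
        in.flip();
        check(decoder.decode(in, out, true));          // step 3: that was everything
        check(decoder.flush(out));                     // step 4

        out.flip();
        return out.toString();
    }

    /** Fails fast on malformed or unmappable input (default action is REPORT). */
    private static void check(CoderResult result) {
        if (result.isError()) {
            throw new IllegalArgumentException("cannot decode: " + result);
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes of an inverted question mark followed by "Ma", n-tilde,
        // and "ana?", split in the middle of the two-byte n-tilde sequence
        byte[] first = { (byte) 0xc2, (byte) 0xbf, 0x4d, 0x61, (byte) 0xc3 };
        byte[] second = { (byte) 0xb1, 0x61, 0x6e, 0x61, 0x3f };
        System.out.println(decodeInTwoChunks(first, second).length() + " chars");
    }
}
```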

Example 6-2 shows how to decode a stream of bytes that represents a charset encoding.

Example 6-2. Charset decoding

package com.ronsoft.books.nio.charset;

import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;
import java.io.*;

/**
* Test charset decoding.
*
* @author Ron Hitchens
*/
public class CharsetDecode {
/**
* Test charset decoding in the general case, detecting and handling buffer
* under/overflow and flushing the decoder state at end of input. This code
* reads from stdin and decodes the byte stream to chars, treating it as
* ISO-8859-1 (a superset of ASCII) by default. The decoded chars are
* written to stdout. This is effectively a 'cat' for input text files;
* another charset encoding can be used by simply specifying it on the
* command line.
*/
public static void main(String[] argv) throws IOException {
// Default charset is ISO-8859-1 (Latin-1), a superset of ASCII
String charsetName = "ISO-8859-1";
// Charset name can be specified on the command line
if (argv.length > 0) {
charsetName = argv[0];
}
// Wrap a Channel around stdin, wrap a Writer around stdout,
// find the named Charset and pass them to the decode method.
// If the named charset is not valid, an exception of type
// UnsupportedCharsetException will be thrown.
decodeChannel(Channels.newChannel(System.in), new OutputStreamWriter(
System.out), Charset.forName(charsetName));
}

/**
* General purpose static method which reads bytes from a Channel, decodes
* them according to the given Charset, and writes the resulting chars to
* a Writer.
*
* @param source
* A ReadableByteChannel object which will be read to EOF as a
* source of encoded bytes.
* @param writer
* A Writer object to which decoded chars will be written.
* @param charset
* A Charset object, whose CharsetDecoder will be used to do the
* character set decoding.
*/
public static void decodeChannel(ReadableByteChannel source, Writer writer,
Charset charset) throws UnsupportedCharsetException, IOException {
// Get a decoder instance from the Charset
CharsetDecoder decoder = charset.newDecoder();
// Tell decoder to replace bad chars with default mark
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
// Allocate radically different input and output buffer sizes
// for testing purposes
ByteBuffer bb = ByteBuffer.allocateDirect(16 * 1024);
CharBuffer cb = CharBuffer.allocate(57);
// Start with an empty input buffer in read mode, so that the
// first compact() below leaves the whole buffer free for filling
bb.flip();
// Buffer starts empty; indicate input is needed
CoderResult result = CoderResult.UNDERFLOW;
boolean eof = false;
while (!eof) {
// Input buffer underflow; decoder wants more input
if (result == CoderResult.UNDERFLOW) {
// Keep any bytes the decoder left unconsumed (such as a
// partial multibyte sequence) and prepare to refill
bb.compact();
// Fill the input buffer; watch for EOF
eof = (source.read(bb) == -1);
// Prepare the buffer for reading by decoder
bb.flip();
}
// Decode input bytes to output chars; pass EOF flag
result = decoder.decode(bb, cb, eof);
// If output buffer is full, drain output
if (result == CoderResult.OVERFLOW) {
drainCharBuf(cb, writer);
}
}
// Flush any remaining state from the decoder, being careful
// to detect output buffer overflow(s)
while (decoder.flush(cb) == CoderResult.OVERFLOW) {
drainCharBuf(cb, writer);
}
// Drain any chars remaining in the output buffer
drainCharBuf(cb, writer);
// Close the channel; push out any buffered data to stdout
source.close();
writer.flush();
}

/**
* Helper method to drain the char buffer and write its content to the given
* Writer object. Upon return, the buffer is empty and ready to be refilled.
*
* @param cb
* A CharBuffer containing chars to be written.
* @param writer
* A Writer object to consume the chars in cb.
*/
static void drainCharBuf(CharBuffer cb, Writer writer) throws IOException {
cb.flip(); // Prepare buffer for draining
// This writes the chars contained in the CharBuffer but
// doesn't actually modify the state of the buffer.
// If the char buffer was being drained by calls to get( ),
// a loop might be needed here.
if (cb.hasRemaining()) {
writer.write(cb.toString());
}
cb.clear(); // Prepare buffer to be filled again
}
}

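The charset service provider interface: the pluggable SPI architecture is used throughout the Java environment in many different contexts. In the 1.4 JDK there are eight packages named spi, and others that go by different names. Pluggability is a powerful design technique and one of the cornerstones on which Java's portability and adaptability are built.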

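Before looking at the API, it is worth explaining how the Charset SPI works. The java.nio.charset.spi package contains only one abstract class, CharsetProvider. Concrete implementations of this class supply information about the Charset objects they provide. To define a custom charset, you must first create concrete implementations of Charset, CharsetEncoder, and CharsetDecoder from the java.nio.charset package. You then create a custom subclass of CharsetProvider, which provides those classes to the JVM.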

Creating a custom charset:

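At a minimum, you must create a subclass of java.nio.charset.Charset, provide concrete implementations of its three abstract methods, and supply a constructor. The Charset class has no default, no-argument constructor. This means your custom charset class must have a constructor, even if it takes no arguments, because you must invoke Charset's constructor at instantiation time (by calling super() at the start of your constructor) to supply it with your charset's canonical name and aliases. Doing so lets the methods of the Charset class handle the name-related work for you, so it is a good thing.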
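As a rough illustration of this minimum, the sketch below defines a charset whose only original content is its name. Everything in it is hypothetical: the class DemoCharset, the name "X-DEMO", and the alias are invented for this example, and the three abstract methods simply delegate to UTF-8 rather than implementing a real encoding. The coders it returns therefore report UTF-8 as their charset(); a real implementation would supply its own coder classes, as Example 6-3 does.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class DemoCharset extends Charset {
    /** A constructor is required: Charset has no no-argument constructor. */
    DemoCharset() {
        // Canonical name and aliases; the "X-" prefix marks it as unregistered
        super("X-DEMO", new String[] { "X-DEMO-ALIAS" });
    }

    /** Abstract method 1: claim to contain only ourselves, which is safe. */
    @Override
    public boolean contains(Charset cs) {
        return cs instanceof DemoCharset;
    }

    /** Abstract method 2: borrowed from UTF-8 purely for illustration. */
    @Override
    public CharsetEncoder newEncoder() {
        return StandardCharsets.UTF_8.newEncoder();
    }

    /** Abstract method 3: borrowed from UTF-8 purely for illustration. */
    @Override
    public CharsetDecoder newDecoder() {
        return StandardCharsets.UTF_8.newDecoder();
    }

    public static void main(String[] args) {
        Charset cs = new DemoCharset();
        // The name-related methods are handled by the Charset base class
        System.out.println(cs.name() + " " + cs.aliases() + " " + cs.isRegistered());
    }
}
```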

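Likewise, you need to provide concrete implementations of CharsetEncoder and CharsetDecoder. Recall that a charset is the combination of a coded character set and an encoding/decoding scheme. As we have seen, encoding and decoding are nearly symmetrical at the API level. A brief discussion of what is needed to implement an encoder is given here; the same applies to building a decoder.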

Like Charset, CharsetEncoder has no default constructor, so you need to call super() in your concrete class's constructor and supply the required parameters.

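To supply your own CharsetEncoder implementation, you must at a minimum provide a concrete encodeLoop() method. For simple encoding algorithms, the default implementations of the other methods should work fine. Note that encodeLoop() takes arguments similar to those of encode(), without the boolean flag. The encode() method delegates the actual encoding to encodeLoop(), which only needs to consume characters from the CharBuffer argument and write the encoded bytes to the supplied ByteBuffer.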

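Now that we have seen how to implement a custom charset, including the associated encoder and decoder, let's see how to hook them into the JVM so that running code can make use of them.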

Providing your custom charset:

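To make your own Charset implementation available to the JVM runtime environment, you must create a concrete subclass of the CharsetProvider class from java.nio.charset.spi, one with a no-argument constructor. The no-argument constructor is important because your CharsetProvider class is located by reading its fully qualified name from a configuration file. That class name string is then used to instantiate your provider via Class.newInstance(), which works only through the no-argument constructor.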

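The configuration file that the JVM reads to locate charset providers is named java.nio.charset.spi.CharsetProvider. It is located in a resource directory (META-INF/services) on the JVM classpath. Every Java Archive (JAR) file has a META-INF directory that can contain information about the classes and resources in that JAR. A directory named META-INF can also be placed at the top of a regular directory on the JVM classpath.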

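The CharsetProvider API is almost trivial. The real work of providing custom charsets lies in creating the custom Charset, CharsetEncoder, and CharsetDecoder classes. CharsetProvider is merely a facilitator that connects your charset to the runtime environment.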

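Example 6-3 demonstrates an implementation of a custom Charset and CharsetProvider, including sample code that illustrates charset use, encoding and decoding, and the Charset SPI. Example 6-3 implements a custom Charset.

Example 6-3. The custom Rot13 charset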

package com.ronsoft.books.nio.charset;

import java.nio.CharBuffer;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.util.Map;
import java.util.Iterator;
import java.io.Writer;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.io.OutputStreamWriter;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileReader;

/**
* A Charset implementation which performs Rot13 encoding. Rot-13 encoding is a
* simple text obfuscation algorithm which shifts alphabetical characters by 13
* so that 'a' becomes 'n', 'o' becomes 'b', etc. This algorithm was popularized
* by the Usenet discussion forums many years ago to mask naughty words, hide
* answers to questions, and so on. The Rot13 algorithm is symmetrical, applying
* it to text that has been scrambled by Rot13 will give you the original
* unscrambled text.
*
* Applying this Charset encoding to an output stream will cause everything you
* write to that stream to be Rot13 scrambled as it's written out. And applying
* it to an input stream causes data read to be Rot13 descrambled as it's read.
*
* @author Ron Hitchens
*/
public class Rot13Charset extends Charset {
// the name of the base charset encoding we delegate to
private static final String BASE_CHARSET_NAME = "UTF-8";
// Handle to the real charset we'll use for transcoding between
// characters and bytes. Doing this allows us to apply the Rot13
// algorithm to multibyte charset encodings. But only the
// ASCII alpha chars will be rotated, regardless of the base encoding.
Charset baseCharset;

/**
* Constructor for the Rot13 charset. Call the superclass constructor to
* pass along the name(s) we'll be known by. Then save a reference to the
* delegate Charset.
*/
protected Rot13Charset(String canonical, String[] aliases) {
super(canonical, aliases);
// Save the base charset we're delegating to
baseCharset = Charset.forName(BASE_CHARSET_NAME);
}

// ----------------------------------------------------------
/**
* Called by users of this Charset to obtain an encoder. This implementation
* instantiates an instance of a private class (defined below) and passes it
* an encoder from the base Charset.
*/
public CharsetEncoder newEncoder() {
return new Rot13Encoder(this, baseCharset.newEncoder());
}

/**
* Called by users of this Charset to obtain a decoder. This implementation
* instantiates an instance of a private class (defined below) and passes it
* a decoder from the base Charset.
*/
public CharsetDecoder newDecoder() {
return new Rot13Decoder(this, baseCharset.newDecoder());
}

/**
* This method must be implemented by concrete Charsets. We always say no,
* which is safe.
*/
public boolean contains(Charset cs) {
return (false);
}

/**
* Common routine to rotate all the ASCII alpha chars in the given
* CharBuffer by 13. Note that this code explicitly compares for upper and
* lower case ASCII chars rather than using the methods
* Character.isLowerCase and Character.isUpperCase. This is because the
* rotate-by-13 scheme only works properly for the alphabetic characters of
* the ASCII charset and those methods can return true for non-ASCII Unicode
* chars.
*/
private void rot13(CharBuffer cb) {
for (int pos = cb.position(); pos < cb.limit(); pos++) {
char c = cb.get(pos);
// '\u0000' marks "not an ASCII letter"
char a = '\u0000';
// Is it lowercase alpha?
if ((c >= 'a') && (c <= 'z')) {
a = 'a';
}
// Is it uppercase alpha?
if ((c >= 'A') && (c <= 'Z')) {
a = 'A';
}
// If either, roll it by 13
if (a != '\u0000') {
c = (char) ((((c - a) + 13) % 26) + a);
cb.put(pos, c);
}
}
}

// --------------------------------------------------------
/**
* The encoder implementation for the Rot13 Charset. This class, and the
* matching decoder class below, should also override the "impl" methods,
* such as implOnMalformedInput( ) and make passthrough calls to the
* baseEncoder object. That is left as an exercise for the hacker.
*/
private class Rot13Encoder extends CharsetEncoder {
private CharsetEncoder baseEncoder;

/**
* Constructor, call the superclass constructor with the Charset object
* and the encodings sizes from the delegate encoder.
*/
Rot13Encoder(Charset cs, CharsetEncoder baseEncoder) {
super(cs, baseEncoder.averageBytesPerChar(), baseEncoder
.maxBytesPerChar());
this.baseEncoder = baseEncoder;
}

/**
* Implementation of the encoding loop. First, we apply the Rot13
* scrambling algorithm to the CharBuffer, then reset the encoder for
* the base Charset and call its encode( ) method to do the actual
* encoding. This may not work properly for non-Latin charsets. The
* CharBuffer passed in may be read-only or re-used by the caller for
* other purposes so we duplicate it and apply the Rot13 encoding to the
* copy. We DO want to advance the position of the input buffer to
* reflect the chars consumed.
*/
protected CoderResult encodeLoop(CharBuffer cb, ByteBuffer bb) {
CharBuffer tmpcb = CharBuffer.allocate(cb.remaining());
while (cb.hasRemaining()) {
tmpcb.put(cb.get());
}
tmpcb.rewind();
rot13(tmpcb);
baseEncoder.reset();
CoderResult cr = baseEncoder.encode(tmpcb, bb, true);
// If error or output overflow, we need to adjust
// the position of the input buffer to match what
// was really consumed from the temp buffer. If
// underflow (all input consumed), this is a no-op.
cb.position(cb.position() - tmpcb.remaining());
return (cr);
}
}

// --------------------------------------------------------
/**
* The decoder implementation for the Rot13 Charset.
*/
private class Rot13Decoder extends CharsetDecoder {
private CharsetDecoder baseDecoder;

/**
* Constructor, call the superclass constructor with the Charset object
* and pass along the chars/byte values from the delegate decoder.
*/
Rot13Decoder(Charset cs, CharsetDecoder baseDecoder) {
super(cs, baseDecoder.averageCharsPerByte(), baseDecoder
.maxCharsPerByte());
this.baseDecoder = baseDecoder;
}

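/**
* Implementation of the decoding loop. First, we reset the decoder for
* the base charset, then call it to decode the bytes into characters,
* saving the result code. The newly decoded chars are then de-scrambled
* with the Rot13 algorithm and the result code is returned. This may
* not work properly for non-Latin charsets.
*/
protected CoderResult decodeLoop(ByteBuffer bb, CharBuffer cb) {
// Remember where the decoded chars will start
int start = cb.position();
baseDecoder.reset();
CoderResult result = baseDecoder.decode(bb, cb, true);
// After decoding, cb's position is just past the new chars, so
// rotate a view covering only the range [start, position)
CharBuffer decoded = cb.duplicate();
decoded.limit(cb.position());
decoded.position(start);
rot13(decoded);
return (result);
}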
}

// --------------------------------------------------------
/**
* Unit test for the Rot13 Charset. This main( ) will open and read an input
* file if named on the command line, or stdin if no args are provided, and
* write the contents to stdout via the X-ROT13 charset encoding. The
* "encryption" implemented by the Rot13 algorithm is symmetrical. Feeding
* in a plain-text file, such as Java source code for example, will output a
* scrambled version. Feeding the scrambled version back in will yield the
* original plain-text document.
*/
public static void main(String[] argv) throws Exception {
BufferedReader in;
if (argv.length > 0) {
// Open the named file
in = new BufferedReader(new FileReader(argv[0]));
} else {
// Wrap a BufferedReader around stdin
in = new BufferedReader(new InputStreamReader(System.in));
}
// Create a PrintStream that uses the Rot13 encoding
PrintStream out = new PrintStream(System.out, false, "X-ROT13");
String s = null;
// Read all input and write it to the output.
// As the data passes through the PrintStream,
// it will be Rot13-encoded.
while ((s = in.readLine()) != null) {
out.println(s);
}
out.flush();
}
}

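To use this Charset and its encoder and decoder, it must be made available to the Java runtime environment. This is done with the CharsetProvider class (Example 6-4).

Example 6-4. Custom charset provider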

package com.ronsoft.books.nio.charset;

import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.HashSet;
import java.util.Iterator;

/**
* A CharsetProvider class which makes available the charsets provided by
* Ronsoft. Currently there is only one, namely the X-ROT13 charset. This is
* not a registered IANA charset, so its name begins with "X-" to avoid name
* clashes with official charsets.
*
* To activate this CharsetProvider, it's necessary to add a file to the
* classpath of the JVM runtime at the following location:
* META-INF/services/java.nio.charset.spi.CharsetProvider
*
* That file must contain a line with the fully qualified name of this class on
* a line by itself: com.ronsoft.books.nio.charset.RonsoftCharsetProvider
*
* See the javadoc page for java.nio.charset.spi.CharsetProvider for full
* details.
*
* @author Ron Hitchens
*/
public class RonsoftCharsetProvider extends CharsetProvider {
// the name of the charset we provide
private static final String CHARSET_NAME = "X-ROT13";
// a handle to the Charset object
private Charset rot13 = null;

/**
* Constructor, instantiate a Charset object and save the reference.
*/
public RonsoftCharsetProvider() {
this.rot13 = new Rot13Charset(CHARSET_NAME, new String[0]);
}

/**
* Called by Charset static methods to find a particular named Charset. If
* it's the name of this charset (we don't have any aliases) then return the
* Rot13 Charset, else return null.
*/
public Charset charsetForName(String charsetName) {
if (charsetName.equalsIgnoreCase(CHARSET_NAME)) {
return (rot13);
}
return (null);
}

/**
* Return an Iterator over the set of Charset objects we provide.
*
* @return An Iterator object containing references to all the Charset
* objects provided by this class.
*/
public Iterator<Charset> charsets() {
HashSet<Charset> set = new HashSet<Charset>(1);
set.add(rot13);
return (set.iterator());
}
}

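For this charset provider to be seen by the JVM runtime environment, a file named META-INF/services/java.nio.charset.spi.CharsetProvider must exist in one of the JARs or in a directory on the classpath. The content of that file must be: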
com.ronsoft.books.nio.charset.RonsoftCharsetProvider
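One way to confirm that the provider file was picked up is to ask the runtime whether the charset name resolves. The sketch below is an illustration only: the class CharsetLookup and its lookup helper are hypothetical, and falling back to UTF-8 is an arbitrary choice. On a JVM where the provider is not on the classpath, the lookup simply returns the fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetLookup {
    /** Returns the named charset if the runtime can find it, else the fallback. */
    static Charset lookup(String name, Charset fallback) {
        return Charset.isSupported(name) ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        // Resolves to X-ROT13 only if a provider for it is on the classpath
        Charset cs = lookup("X-ROT13", StandardCharsets.UTF_8);
        System.out.println("Using charset: " + cs.name());
    }
}
```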

Adding X-ROT13 to the list of charsets in Example 6-1 produces this additional output:

Charset: X-ROT13
Input: ¿Mañana?
Encoded:
0: c2 (Â)
1: bf (¿)
2: 5a (Z)
3: 6e (n)
4: c3 (Ã)
5: b1 (±)
6: 6e (n)
7: 61 (a)
8: 6e (n)
9: 3f (?)

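Summary: many Java programmers will never need to deal with character set transcoding issues, and most will never create a custom charset. But for those who do, the classes in java.nio.charset and java.nio.charset.spi provide a powerful and flexible mechanism for character handling.

Charset
Encapsulates a coded character set together with the encoding scheme used to represent a sequence of characters from that set as a sequence of bytes.

CharsetEncoder
An encoding engine that converts a sequence of characters into a sequence of bytes. The byte sequence can later be decoded to reconstruct the original character sequence.

CharsetDecoder
A decoding engine that converts an encoded sequence of bytes into a sequence of characters.

CharsetProvider SPI
Locates Charset implementations through the service provider mechanism and makes them available for use in the runtime environment.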