Original URL: http://www.blogwind.com/Wuvist/42999.shtml
In the .NET Framework, the encoding a StreamReader uses must be specified in the constructor; CurrentEncoding is read-only, so it cannot be changed midway.
Under normal circumstances this causes no problems. When a file is read from disk, the encoding within a single file is generally uniform. Even if you discover that the text was decoded incorrectly, you can close the StreamReader and start reading again with the correct encoding.
I recently needed to change the encoding midway, and my program had no opportunity to close the reader and re-read. The BaseStream of the StreamReader I am using is a NetworkStream, which I cannot close, yet the data arriving over it may use different encodings: GB2312, Big5, UTF8, ISO-8859-1, and so on. The encoding information arrives first and the actual content follows, but if the StreamReader was created with the wrong encoding, the content it reads can never be recovered; characters are lost.
I also cannot create a new StreamReader once I have the encoding information, because the actual content has already been buffered by the original StreamReader.
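The read-ahead behaviour described above can be observed with a small experiment. This is only a sketch: the class name ReadAheadDemo, the helper BytesConsumedByFirstLine, and the "charset=big5" header line are made up for illustration, and a MemoryStream stands in for the NetworkStream.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadAheadDemo
{
    // Hypothetical helper: reads one line through a StreamReader and reports
    // how far the underlying stream was advanced to serve that single call.
    public static long BytesConsumedByFirstLine(byte[] payload)
    {
        var stream = new MemoryStream(payload);
        var reader = new StreamReader(stream, Encoding.ASCII);
        reader.ReadLine();
        return stream.Position;
    }

    public static void Main()
    {
        // First line: an invented header naming the encoding. Second line: the body.
        byte[] payload = Encoding.ASCII.GetBytes("charset=big5\nbody line\n");
        long consumed = BytesConsumedByFirstLine(payload);
        Console.WriteLine("payload bytes: " + payload.Length + ", consumed by first ReadLine: " + consumed);
    }
}
```

With a payload this small, the reader is expected to pull all of it into its own buffer on the first ReadLine, so a second StreamReader created on the same stream afterwards would find nothing left to read.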
The only solution is to implement a StreamReader whose CurrentEncoding property can be changed.
Writing one entirely from scratch was impractical, so I first obtained the Mono source code and made my changes on top of Mono's StreamReader implementation.
StreamReader is actually quite simple. It has two internal buffers: an input buffer and a decoded buffer. The former caches the raw bytes read from the base stream; the latter caches the characters decoded from those bytes. Once you understand the ReadBuffer method in Mono's implementation, changing CurrentEncoding dynamically is not too difficult.
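The two-buffer arrangement can be sketched in a few lines. This is a simplified illustration, not Mono's ReadBuffer: the class TwoBufferDemo and the method DecodeOnce are hypothetical names, the 128-byte buffer size is arbitrary, and the sketch performs a single read with no byte-order-mark handling.

```csharp
using System;
using System.IO;
using System.Text;

public static class TwoBufferDemo
{
    // One fill-and-decode step: raw bytes go into inputBuffer,
    // the decoder turns them into chars in decodedBuffer.
    public static string DecodeOnce(Stream baseStream, Encoding encoding)
    {
        byte[] inputBuffer = new byte[128];
        char[] decodedBuffer = new char[encoding.GetMaxCharCount(inputBuffer.Length)];
        Decoder decoder = encoding.GetDecoder();

        int byteCount = baseStream.Read(inputBuffer, 0, inputBuffer.Length);
        int charCount = decoder.GetChars(inputBuffer, 0, byteCount, decodedBuffer, 0);
        return new string(decodedBuffer, 0, charCount);
    }

    public static void Main()
    {
        var stream = new MemoryStream(Encoding.UTF8.GetBytes("h\u00e9llo\n"));
        Console.WriteLine(DecodeOnce(stream, Encoding.UTF8));
    }
}
```

The point to notice is that the raw bytes remain available in the input buffer after decoding, which is what makes re-decoding with a different encoding possible at all.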
The network protocol I need to handle is line-based. My program only calls the ReadLine method of StreamReader and never uses the two Read overloads, which makes changing the encoding dynamically much easier.
What I do is this: every time ReadLine is called, I advance not only the cursor of the decoded buffer (pos) but also a new cursor in the input buffer (pos_input). The approach is simple. ReadLine calls FindNextEOL to advance the cursor to the next newline, so I add one line at the top of FindNextEOL:
int FindNextEOL()
{
    // Added line: advance pos_input in the input buffer in step with pos.
    FindNextInputEOL();
    ....
The new function FindNextInputEOL is a near-exact copy of FindNextEOL, except that it scans the input buffer whereas FindNextEOL scans the decoded buffer.
This way, after every ReadLine I know which raw bytes in the input buffer have not yet been consumed by the caller.
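To illustrate the idea of tracking line boundaries on the raw bytes, here is a minimal sketch. It is not the FindNextInputEOL from the listing: the class ByteEolScanner and the method NextLineStart are hypothetical, and the sketch ignores the case where a CR is the last byte of one buffer and its LF arrives in the next. It also assumes an ASCII-compatible encoding in which the bytes 0x0D and 0x0A never occur inside a multi-byte character, which as far as I know holds for UTF-8, ISO-8859-1, GB2312, and Big5 but not for UTF-16.

```csharp
using System;
using System.Text;

public static class ByteEolScanner
{
    // Returns the index of the first byte after the next line terminator
    // (LF, CR, or CR LF) found in buffer[start..end), or -1 if there is none.
    public static int NextLineStart(byte[] buffer, int start, int end)
    {
        for (int i = start; i < end; i++)
        {
            if (buffer[i] == (byte)'\n')
                return i + 1;
            if (buffer[i] == (byte)'\r')
            {
                // Treat CR LF as a single terminator when both bytes are present.
                if (i + 1 < end && buffer[i + 1] == (byte)'\n')
                    return i + 2;
                return i + 1;
            }
        }
        return -1;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("ab\r\ncd\n");
        int posInput = NextLineStart(data, 0, data.Length);
        Console.WriteLine("second line starts at byte " + posInput);
    }
}
```

The returned index plays the role of pos_input: everything from that offset onward is raw data the caller has not yet seen.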
Next, add a set accessor to the CurrentEncoding property:
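set
{
    encoding = value;
    decoder = encoding.GetDecoder();
    // Re-decode the bytes not yet consumed (from pos_input to the end of the data)
    // and overwrite the decoded buffer starting at pos.
    decoded_count = pos + decoder.GetChars (input_buffer, pos_input, cbEncoded - pos_input, decoded_buffer, pos);
}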
When a new encoding is set, the program re-decodes the raw bytes that have not yet been read, starting at the input-buffer cursor (pos_input), and overwrites the contents of the decoded buffer from pos onward.
And that's it. ReadLine itself needs no changes at all, apart from promoting the cbEncoded variable from a local in ReadBuffer to a field.
However, this change makes the two Read overloads unusable: calling either one would leave the input-buffer and decoded-buffer cursors out of sync. The complete code is attached below. I hope someone can help me work out how to handle the two Read overloads. Thanks in advance.
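The re-decoding step can be tried in isolation. The sketch below is an assumption-laden illustration rather than the setter itself: RedecodeDemo and RedecodeRemainder are hypothetical names, the "charset=iso-8859-1" header line is invented, and the helper decodes into a fresh array instead of overwriting decoded_buffer in place as the setter does.

```csharp
using System;
using System.Text;

public static class RedecodeDemo
{
    // Decodes inputBuffer[posInput..byteCount) with a fresh decoder for newEncoding.
    public static string RedecodeRemainder(byte[] inputBuffer, int posInput, int byteCount, Encoding newEncoding)
    {
        int remaining = byteCount - posInput;
        Decoder decoder = newEncoding.GetDecoder();
        char[] decoded = new char[newEncoding.GetMaxCharCount(remaining)];
        int charCount = decoder.GetChars(inputBuffer, posInput, remaining, decoded, 0);
        return new string(decoded, 0, charCount);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] header = Encoding.ASCII.GetBytes("charset=iso-8859-1\n");
        byte[] body = latin1.GetBytes("caf\u00e9\n");

        // Simulate an input buffer that already holds the header and the body.
        byte[] buffer = new byte[header.Length + body.Length];
        Array.Copy(header, 0, buffer, 0, header.Length);
        Array.Copy(body, 0, buffer, header.Length, body.Length);

        // Decoding the body with the initial (wrong) encoding damages it.
        string wrong = Encoding.UTF8.GetString(buffer, header.Length, body.Length);
        // Re-decoding the same raw bytes from posInput recovers it.
        string right = RedecodeRemainder(buffer, header.Length, buffer.Length, latin1);

        Console.WriteLine("wrong == right: " + (wrong == right));
        Console.WriteLine("right: " + right.TrimEnd());
    }
}
```

Because the raw bytes are kept, nothing is lost by the first, incorrect decode; only the characters already handed to the caller are beyond repair.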
//
// System.IO.StreamReader.cs
//
// Author:
// Dietmar Maurer
// Miguel de Icaza
//
// (C) Ximian, Inc. http://www.ximian.com
// Copyright (C) 2004 Novell ( http://www.novell.com )
//
//
// Copyright (C) 2004 Novell, Inc ( http://www.novell.com )
//
// Permission is hereby granted, free of charge, to any person obtaining
// a copy of this software and associated documentation files (the
// "Software"), to deal in the Software without restriction, including
// without limitation the rights to use, copy, modify, merge, publish,
// distribute, sublicense, and/or sell copies of the Software, and to
// permit persons to whom the Software is furnished to do so, subject to
// the following conditions:
//
// The above copyright notice and this permission notice shall be
// included in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
// MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
// LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
// OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
// WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
//
using System;
using System.Text;
using System.Runtime.InteropServices;
namespace System.IO
{
[Serializable]
public class DynamicStreamReader : TextReader
{
const int DefaultBufferSize = 1024;
const int DefaultFileBufferSize = 4096;
const int MinimumBufferSize = 128;
//
//The input buffer
//
byte [] input_buffer;
//
// The decoded buffer from the above input buffer
//
char [] decoded_buffer;
//
// Number of decoded chars in decoded_buffer.
//
int decoded_count;
//
// Current position in the decoded_buffer
//
int pos;
//
// Current position in the input_buffer
//
int pos_input;
//
// The buffer size that we are using
//
int buffer_size;
int do_checks;
Encoding encoding;
Decoder decoder;
Stream base_stream;
bool mayBlock;
StringBuilder line_builder;
private class NullStreamReader : DynamicStreamReader
{
public override int Peek ()
{
return -1;
}
public override int Read ()
{
return -1;
}
public override int Read ([In, Out] char[] buffer, int index, int count)
{
return 0;
}
public override string ReadLine ()
{
return null;
}
public override string ReadToEnd ()
{
return String.Empty;
}
public override Stream BaseStream
{
get { return Stream.Null; }
}
public override Encoding CurrentEncoding
{
get { return Encoding.Unicode; }
}
}
public new static readonly DynamicStreamReader Null = (DynamicStreamReader)(new NullStreamReader());
internal DynamicStreamReader() {}
public DynamicStreamReader(Stream stream)
: this (stream, Encoding.UTF8, true, DefaultBufferSize) { }
public DynamicStreamReader(Stream stream, bool detect_encoding_from_bytemarks)
: this (stream, Encoding.UTF8, detect_encoding_from_bytemarks, DefaultBufferSize) { }
public DynamicStreamReader(Stream stream, Encoding encoding)
: this (stream, encoding, true, DefaultBufferSize) { }
public DynamicStreamReader(Stream stream, Encoding encoding, bool detect_encoding_from_bytemarks)
: this (stream, encoding, detect_encoding_from_bytemarks, DefaultBufferSize) { }
public DynamicStreamReader(Stream stream, Encoding encoding, bool detect_encoding_from_bytemarks, int buffer_size)
{
Initialize (stream, encoding, detect_encoding_from_bytemarks, buffer_size);
}
public DynamicStreamReader(string path)
: this (path, Encoding.UTF8, true, DefaultFileBufferSize) { }
public DynamicStreamReader(string path, bool detect_encoding_from_bytemarks)
: this (path, Encoding.UTF8, detect_encoding_from_bytemarks, DefaultFileBufferSize) { }
public DynamicStreamReader(string path, Encoding encoding)
: this (path, encoding, true, DefaultFileBufferSize) { }
public DynamicStreamReader(string path, Encoding encoding, bool detect_encoding_from_bytemarks)
: this (path, encoding, detect_encoding_from_bytemarks, DefaultFileBufferSize) { }
public DynamicStreamReader(string path, Encoding encoding, bool detect_encoding_from_bytemarks, int buffer_size)
{
if (null == path)
throw new ArgumentNullException("path");
if (String.Empty == path)
throw new ArgumentException("Empty path not allowed");
if (path.IndexOfAny (Path.InvalidPathChars) != -1)
throw new ArgumentException("path contains invalid characters");
if (null == encoding)
throw new ArgumentNullException ("encoding");
if (buffer_size <= 0)
throw new ArgumentOutOfRangeException ("buffer_size", "The minimum size of the buffer must be positive");
string DirName = Path.GetDirectoryName(path);
if (DirName != String.Empty && !Directory.Exists(DirName))
throw new DirectoryNotFoundException ("Directory '" + DirName + "' not found.");
if (!File.Exists(path))
throw new FileNotFoundException("File not found.", path);
Stream stream = (Stream) File.OpenRead (path);
Initialize (stream, encoding, detect_encoding_from_bytemarks, buffer_size);
}
internal void Initialize (Stream stream, Encoding encoding, bool detect_encoding_from_bytemarks, int buffer_size)
{
if (null == stream)
throw new ArgumentNullException ("stream");
if (null == encoding)
throw new ArgumentNullException ("encoding");
if (!stream.CanRead)
throw new ArgumentException ("Cannot read stream");
if (buffer_size <= 0)
throw new ArgumentOutOfRangeException ("buffer_size", "The minimum size of the buffer must be positive");
if (buffer_size < MinimumBufferSize)
buffer_size = MinimumBufferSize;
base_stream = stream;
input_buffer = new byte [buffer_size];
this.buffer_size = buffer_size;
this.encoding = encoding;
decoder = encoding.GetDecoder ();
byte [] preamble = encoding.GetPreamble ();
do_checks = detect_encoding_from_bytemarks ? 1 : 0;
do_checks += (preamble.Length == 0) ? 0 : 2;
decoded_buffer = new char [encoding.GetMaxCharCount (buffer_size)];
decoded_count = 0;
pos = 0;
pos_input =0;
}
public virtual Stream BaseStream
{
get
{
return base_stream;
}
}
public virtual Encoding CurrentEncoding
{
get
{
if (encoding == null)
throw new Exception ();
return encoding;
}
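set
{
if (value == null)
throw new ArgumentNullException ("value");
encoding = value;
decoder = encoding.GetDecoder ();
// Re-decode the raw bytes that ReadLine has not consumed yet (from pos_input
// to the end of the buffered data) and overwrite decoded_buffer from pos onward.
decoded_count = pos + decoder.GetChars (input_buffer, pos_input, cbEncoded - pos_input, decoded_buffer, pos);
}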
}
public override void Close ()
{
Dispose(true);
}
protected override void Dispose (bool disposing)
{
if (disposing && base_stream != null)
base_stream.Close ();
input_buffer = null;
decoded_buffer = null;
encoding = null;
decoder = null;
base_stream = null;
base.Dispose (disposing);
}
//
// Provides auto-detection of the encoding, as well as skipping over
// byte marks at the beginning of a stream.
//
int DoChecks (int count)
{
if ((do_checks & 2) == 2)
{
byte [] preamble = encoding.GetPreamble ();
int c = preamble.Length;
if (count >= c)
{
int i;
for (i = 0; i < c; i++)
if (input_buffer [i] != preamble [i])
break;
if (i == c)
return i;
}
}
if ((do_checks & 1) == 1)
{
if (count < 2)
return 0;
if (input_buffer [0] == 0xfe && input_buffer [1] == 0xff)
{
this.encoding = Encoding.BigEndianUnicode;
return 2;
}
if (input_buffer [0] == 0xff && input_buffer [1] == 0xfe)
{
this.encoding = Encoding.Unicode;
return 2;
}
if (count < 3)
return 0;
if (input_buffer [0] == 0xef && input_buffer [1] == 0xbb && input_buffer [2] == 0xbf)
{
this.encoding = Encoding.UTF8;
return 3;
}
}
return 0;
}
public void DiscardBufferedData ()
{
pos = decoded_count = 0;
mayBlock = false;
// Discard internal state of the decoder too.
decoder = encoding.GetDecoder ();
}
int cbEncoded;
int parse_start;
// the buffer is empty, fill it again
private int ReadBuffer ()
{
pos = 0;
pos_input = 0;
cbEncoded = 0;
// keep looping until the decoder gives us some chars
decoded_count = 0;
parse_start = 0;
do
{
cbEncoded = base_stream.Read (input_buffer, 0, buffer_size);
if (cbEncoded == 0)
return 0;
mayBlock = (cbEncoded < buffer_size);
if (do_checks > 0)
{
Encoding old = encoding;
parse_start = DoChecks (cbEncoded);
if (old != encoding)
{
decoder = encoding.GetDecoder ();
}
do_checks = 0;
cbEncoded -= parse_start;
}
decoded_count += decoder.GetChars (input_buffer, parse_start, cbEncoded, decoded_buffer, 0);
parse_start = 0;
} while (decoded_count == 0);
return decoded_count;
}
public override int Peek ()
{
if (base_stream == null)
throw new ObjectDisposedException ("StreamReader", "Cannot read from a closed StreamReader");
if (pos >= decoded_count && (mayBlock || ReadBuffer () == 0))
return -1;
return decoded_buffer [pos];
}
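public override int Read ()
{
// Read would advance pos without advancing pos_input, so it is disabled.
throw new NotSupportedException ("DynamicStreamReader supports ReadLine only.");
}
public override int Read ([In, Out] char[] dest_buffer, int index, int count)
{
// Same restriction as Read (): only ReadLine keeps the two cursors in sync.
throw new NotSupportedException ("DynamicStreamReader supports ReadLine only.");
}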
bool foundCR_input;
int FindNextInputEOL()
{
char c = '