Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian) or UTF-8 files
with a Byte Order Mark (BOM) are correctly discovered. For UTF-8 files with no BOM, if the buffer
is large enough, the charset should also be discovered.
A 4 KB byte buffer is used to guess the encoding.
Usage:
CharsetToolkit toolkit = new CharsetToolkit(file);
// guess the encoding
Charset guessedCharset = toolkit.getCharset();
// create a reader with the correct charset
BufferedReader reader = toolkit.getReader();
// read the file content
String line;
while ((line = reader.readLine()) != null)
{
System.out.println(line);
}
- Author(s):
- Guillaume Laforge
Constructor of the CharsetToolkit utility class.
- Parameters:
file the file whose encoding we want to determine.
byte[] bytes = new byte[4096];
int bytesRead = input.read(bytes);
else if (bytesRead < 4096) {
byte[] bytesToGuess = new byte[bytesRead];
System.arraycopy(bytes, 0, bytesToGuess, 0, bytesRead);
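The fragment above reads at most 4096 bytes and trims the array when fewer were read. A minimal sketch of that step follows. The class name `BufferHeadSketch` and the method `readHead` are hypothetical, not part of CharsetToolkit, and the sketch assumes that a single `read` call is enough, which holds for in-memory streams but is not guaranteed for every stream type.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class BufferHeadSketch {
    static final int BUFFER_SIZE = 4096;

    // Reads up to BUFFER_SIZE bytes with one read call and returns an array
    // whose length equals the number of bytes actually read.
    static byte[] readHead(InputStream input) {
        try {
            byte[] bytes = new byte[BUFFER_SIZE];
            int bytesRead = input.read(bytes);
            if (bytesRead <= 0) {
                return new byte[0]; // empty stream: nothing to guess from
            }
            // Trim the buffer so that trailing zero bytes are not analysed.
            return bytesRead < BUFFER_SIZE ? Arrays.copyOf(bytes, bytesRead) : bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] head = readHead(new ByteArrayInputStream(new byte[] {104, 105}));
        System.out.println(head.length); // prints 2
    }
}
```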
Defines the default Charset used in case the buffer represents an 8-bit Charset.
- Parameters:
defaultCharset the default Charset to be returned
if an 8-bit Charset is encountered.
if (defaultCharset != null)
If US-ASCII is recognized, enforce returning the default encoding rather than US-ASCII.
The file may currently contain no special character in the range 128-255, yet it may be, or may later become,
a file encoded with the default charset rather than US-ASCII.
- Parameters:
enforce true to return the default charset in place of US-ASCII, false to allow US-ASCII.
Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
- Returns:
- true if the default charset is returned in place of US-ASCII.
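The effect of the enforce8Bit flag on a buffer that contains only 7-bit bytes can be sketched as follows. The class `Enforce8BitSketch` and the method `resolvePureAscii` are hypothetical names used for illustration, and the sketch covers only the pure-ASCII case described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Enforce8BitSketch {
    // Decides what to report for a buffer in which no byte had its high-order bit set.
    static Charset resolvePureAscii(boolean enforce8Bit, Charset defaultCharset) {
        // With the flag set, the default 8-bit charset wins over US-ASCII.
        return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
    }

    public static void main(String[] args) {
        System.out.println(resolvePureAscii(true, StandardCharsets.ISO_8859_1));
        System.out.println(resolvePureAscii(false, StandardCharsets.ISO_8859_1));
    }
}
```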
Retrieves the default Charset
Guesses the encoding of the provided buffer.
If a Byte Order Mark (BOM) is encountered at the beginning of the buffer, we immediately
return the charset implied by this BOM. A file that began with those bytes without being
encoded that way would not be a human-readable text file.
If there is no BOM, this method tries to discern whether the file is UTF-8 or not.
If it is not UTF-8, we assume the encoding is the default system encoding
(it might be any 8-bit charset, but an 8-bit charset is usually the default one).
UTF-8 can be discerned from the bit patterns of its multi-byte sequences.
UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
With UTF-8, 0xFE and 0xFF never appear.
- Returns:
- the Charset recognized.
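The UTF-8 test described above can be sketched as a scan that reads each lead byte, derives the number of continuation bytes it announces, and checks that they all have the form 10xxxxxx. This is an illustrative sketch, not the class's actual implementation: `GuessEncodingSketch`, `trailingBytes` and `guess` are hypothetical names, the BOM check is left out, and treating a sequence cut off by the end of the buffer as acceptable is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GuessEncodingSketch {
    // Number of continuation bytes announced by a lead byte,
    // or -1 if the byte cannot start a sequence.
    static int trailingBytes(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 0;               // 0xxxxxxx
        if (v >= 0xC0 && v <= 0xDF) return 1; // 110xxxxx
        if (v >= 0xE0 && v <= 0xEF) return 2; // 1110xxxx
        if (v >= 0xF0 && v <= 0xF7) return 3; // 11110xxx
        if (v >= 0xF8 && v <= 0xFB) return 4; // 111110xx
        if (v >= 0xFC && v <= 0xFD) return 5; // 1111110x
        return -1; // stray continuation byte, or 0xFE / 0xFF
    }

    static Charset guess(byte[] buffer, Charset defaultCharset) {
        boolean highOrderBit = false;
        int i = 0;
        while (i < buffer.length) {
            int trailing = trailingBytes(buffer[i]);
            if (trailing < 0) return defaultCharset;
            if (trailing > 0) {
                highOrderBit = true;
                // Assumption: a sequence truncated by the end of the buffer is tolerated.
                if (i + trailing >= buffer.length) break;
                for (int k = 1; k <= trailing; k++) {
                    if ((buffer[i + k] & 0xC0) != 0x80) return defaultCharset;
                }
            }
            i += trailing + 1;
        }
        // No byte above 127 seen: the content is plain US-ASCII.
        return highOrderBit ? StandardCharsets.UTF_8 : StandardCharsets.US_ASCII;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(utf8, StandardCharsets.ISO_8859_1)); // prints UTF-8
    }
}
```

A Latin-1 buffer such as the bytes of "héllo" in ISO-8859-1 fails the continuation check, so the sketch returns the default charset for it.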
return Charset.forName("UTF-16LE");
return Charset.forName("UTF-16BE");
boolean highOrderBit = false;
boolean validU8Char = true;
return Charset.forName("US-ASCII");
If the byte has the form 10xxxxxx, then it is a continuation byte of a multi-byte character.
- Parameters:
b a byte.
- Returns:
- true if it is a continuation byte.
return -128 <= b && b <= -65;
If the byte has the form 110xxxxx, then it is the first byte of a two-byte sequence.
- Parameters:
b a byte.
- Returns:
- true if it is the first byte of a two-byte sequence.
return -64 <= b && b <= -33;
If the byte has the form 1110xxxx, then it is the first byte of a three-byte sequence.
- Parameters:
b a byte.
- Returns:
- true if it is the first byte of a three-byte sequence.
return -32 <= b && b <= -17;
If the byte has the form 11110xxx, then it is the first byte of a four-byte sequence.
- Parameters:
b a byte.
- Returns:
- true if it is the first byte of a four-byte sequence.
return -16 <= b && b <= -9;
If the byte has the form 111110xx, then it is the first byte of a five-byte sequence.
- Parameters:
b a byte.
- Returns:
- true if it is the first byte of a five-byte sequence.
return -8 <= b && b <= -5;
If the byte has the form 1111110x, then it is the first byte of a six-byte sequence.
- Parameters:
b a byte.
- Returns:
- true if it is the first byte of a six-byte sequence.
return -4 <= b && b <= -3;
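The signed range comparisons above work because a Java byte is signed, so 0x80-0xBF is read as -128 to -65, 0xC0-0xDF as -64 to -33, and so on. The sketch below expresses the same six tests as bit masks and checks, over all 256 byte values, that both forms agree. The class and method names are hypothetical.

```java
class ByteClassSketch {
    static boolean isContinuation(byte b)  { return (b & 0xC0) == 0x80; } // 10xxxxxx
    static boolean isTwoByteLead(byte b)   { return (b & 0xE0) == 0xC0; } // 110xxxxx
    static boolean isThreeByteLead(byte b) { return (b & 0xF0) == 0xE0; } // 1110xxxx
    static boolean isFourByteLead(byte b)  { return (b & 0xF8) == 0xF0; } // 11110xxx
    static boolean isFiveByteLead(byte b)  { return (b & 0xFC) == 0xF8; } // 111110xx
    static boolean isSixByteLead(byte b)   { return (b & 0xFE) == 0xFC; } // 1111110x

    // Compares each mask test with the signed range test quoted in the documentation.
    static boolean masksAgreeWithRanges() {
        for (int v = -128; v <= 127; v++) {
            byte b = (byte) v;
            if (isContinuation(b)  != (-128 <= b && b <= -65)) return false;
            if (isTwoByteLead(b)   != (-64 <= b && b <= -33)) return false;
            if (isThreeByteLead(b) != (-32 <= b && b <= -17)) return false;
            if (isFourByteLead(b)  != (-16 <= b && b <= -9)) return false;
            if (isFiveByteLead(b)  != (-8 <= b && b <= -5)) return false;
            if (isSixByteLead(b)   != (-4 <= b && b <= -3)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(masksAgreeWithRanges()); // prints true
    }
}
```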
Retrieves the default charset of the system.
- Returns:
- the default Charset.
Checks whether the buffer has a Byte Order Mark for UTF-8 (used by Microsoft's Notepad and other editors).
- Returns:
- true if the buffer has a BOM for UTF-8.
Checks whether the buffer has a Byte Order Mark for UTF-16 Little Endian
(ucs-2le, ucs-4le, and ucs-16le).
- Returns:
- true if the buffer has a BOM for UTF-16 Little Endian.
Checks whether the buffer has a Byte Order Mark for UTF-16 Big Endian
(utf-16 and ucs-2).
- Returns:
- true if the buffer has a BOM for UTF-16 Big Endian.
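The three BOM checks compare the first bytes of the buffer with fixed signatures: EF BB BF for UTF-8, FF FE for UTF-16 little endian, and FE FF for UTF-16 big endian. A minimal sketch, with a hypothetical class name and with the buffer passed as a parameter rather than held in a field:

```java
class BomSketch {
    static boolean hasUTF8Bom(byte[] buffer) {
        return buffer.length >= 3
                && buffer[0] == (byte) 0xEF
                && buffer[1] == (byte) 0xBB
                && buffer[2] == (byte) 0xBF;
    }

    static boolean hasUTF16LEBom(byte[] buffer) {
        return buffer.length >= 2
                && buffer[0] == (byte) 0xFF
                && buffer[1] == (byte) 0xFE;
    }

    static boolean hasUTF16BEBom(byte[] buffer) {
        return buffer.length >= 2
                && buffer[0] == (byte) 0xFE
                && buffer[1] == (byte) 0xFF;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(hasUTF8Bom(sample)); // prints true
    }
}
```

The casts to `byte` are needed because literals such as `0xEF` are ints outside the signed byte range.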
Gets a BufferedReader (in fact a LineNumberReader) for the File
specified in the constructor of CharsetToolkit, using the discovered charset, or the default
charset if an 8-bit Charset is encountered.
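Opening a reader with an explicit charset can be sketched as below. The names `ReaderSketch`, `open` and `firstLineOfTempFile` are hypothetical, the charset is passed in rather than guessed, and any BOM handling the real method may perform is not shown.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReaderSketch {
    // Wraps the file in a LineNumberReader that decodes with the given charset.
    static BufferedReader open(File file, Charset charset) throws FileNotFoundException {
        return new LineNumberReader(new InputStreamReader(new FileInputStream(file), charset));
    }

    // Round trip through a temporary file: write the content, read the first line back.
    static String firstLineOfTempFile(String content, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-sketch", ".txt");
            try {
                Files.write(tmp, content.getBytes(charset));
                try (BufferedReader reader = open(tmp.toFile(), charset)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLineOfTempFile("first\nsecond", StandardCharsets.UTF_8)); // prints first
    }
}
```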
Retrieves all the available Charsets on the platform, including the default charset.
- Returns:
- an array of Charsets.
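One way to build such an array is from the standard `Charset.availableCharsets()` map, as in this sketch. The class and method names are hypothetical, and the actual method may be implemented differently.

```java
import java.nio.charset.Charset;
import java.util.Collection;

class AvailableCharsetsSketch {
    // Copies the values of the JVM's charset map into an array.
    static Charset[] availableCharsets() {
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    public static void main(String[] args) {
        Charset[] charsets = availableCharsets();
        System.out.println(charsets.length + " charsets, default is " + Charset.defaultCharset());
    }
}
```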