Java I/O流Streams

Published on 2017 - 04 - 25

I/O in Java is built on streams. Input streams read data; output streams write data. Different stream classes, like java.io.FileInputStream and sun.net.TelnetOutputStream, read and write particular sources of data. However, all output streams have the same basic methods to write data and all input streams use the same basic methods to read data. After a stream is created, you can often ignore the details of exactly what it is you’re reading or writing.

Filter streams can be chained to either an input stream or an output stream. Filters can modify the data as it’s read or written—for instance, by encrypting or compressing it—or they can simply provide additional methods for converting the data that’s read or written into other formats. For instance, the java.io.DataOutputStream class provides a method that converts an int to four bytes and writes those bytes onto its underlying output stream.

Readers and writers can be chained to input and output streams to allow programs to read and write text (i.e., characters) rather than bytes. Used properly, readers and writers can handle a wide variety of character encodings, including multibyte character sets such as SJIS and UTF-8.

Output Streams

Java’s basic output class is java.io.OutputStream:

public abstract class OutputStream

This class provides the fundamental methods needed to write data. These are:

public abstract void write(int b) throws IOException
public void write(byte[] data) throws IOException
public void write(byte[] data, int offset, int length)
    throws IOException
public void flush() throws IOException
public void close() throws IOException

Subclasses of OutputStream use these methods to write data onto particular media. For instance, a FileOutputStream uses these methods to write data into a file. A TelnetOutputStream uses these methods to write data onto a network connection. A ByteArrayOutputStream uses these methods to write data into an expandable byte array. But whichever medium you’re writing to, you mostly use only these same five methods. Sometimes you may not even know exactly what kind of stream you’re writing onto. For instance, you won’t find TelnetOutputStream in the Java class library documentation. It’s deliberately hidden inside the sun packages. It’s returned by various methods in various classes in java.net, like the getOutputStream() method of java.net.Socket. However, these methods are declared to return only OutputStream, not the more specific subclass TelnetOutputStream. That’s the power of polymorphism. If you know how to use the superclass, you know how to use all the subclasses, too.

OutputStream’s fundamental method is write(int b). This method takes an integer from 0 to 255 as an argument and writes the corresponding byte to the output stream. This method is declared abstract because subclasses need to change it to handle their particular medium. For instance, a ByteArrayOutputStream can implement this method with pure Java code that copies the byte into its array. However, a FileOutputStream will need to use native code that understands how to write data in files on the host platform.

Take note that although this method takes an int as an argument, it actually writes an unsigned byte. Java doesn’t have an unsigned byte data type, so an int has to be used here instead. The only real difference between an unsigned byte and a signed byte is the interpretation. They’re both made up of eight bits, and when you write an int onto a network connection using write(int b), only eight bits are placed on the wire. If an int outside the range 0–255 is passed to write(int b), the least significant byte of the number is written and the remaining three bytes are ignored. (This is the effect of casting an int to a byte.)

For example, the character-generator protocol defines a server that sends out ASCII text. The most popular variation of this protocol sends 72-character lines containing printable ASCII characters. (The printable ASCII characters are those between 33 and 126 inclusive that exclude the various whitespace and control characters.) The first line contains characters 33 through 104, sorted. The second line contains characters 34 through 105. The third line contains characters 35 through 106. This continues through line 29, which contains characters 55 through 126. At that point, the characters wrap around so that line 30 contains characters 56 through 126 followed by character 33 again. Lines are terminated with a carriage return (ASCII 13) and a linefeed (ASCII 10). The output looks like this:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghi
#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghij
$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijk
%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijkl
&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklm
'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmn

Because ASCII is a 7-bit character set, each character is sent as a single byte. Consequently, this protocol is straightforward to implement using the basic write() methods, as the next code fragment demonstrates:

public static void generateCharacters(OutputStream out)
    throws IOException {

   int firstPrintableCharacter     = 33;
   int numberOfPrintableCharacters = 94;
   int numberOfCharactersPerLine   = 72;

   int start = firstPrintableCharacter;
   while (true) { /* infinite loop */
     for (int i = start; i < start + numberOfCharactersPerLine; i++) {
       out.write((
           (i - firstPrintableCharacter) % numberOfPrintableCharacters)
            + firstPrintableCharacter);
     }
     out.write('\r'); // carriage return
     out.write('\n'); // linefeed
     start = ((start + 1) - firstPrintableCharacter)
         % numberOfPrintableCharacters + firstPrintableCharacter;
   }
}

An OutputStream is passed to the generateCharacters() method in the out argument. Bytes are written onto out one at a time. These bytes are given as integers in a rotating sequence from 33 to 126. Most of the arithmetic here is to make the loop rotate in that range. After each 72-character chunk is written, a carriage return and a linefeed are written onto the output stream. The next start character is calculated and the loop repeats. The entire method is declared to throw IOException. That’s important because the character-generator server will terminate only when the client closes the connection. The Java code will see this as an IOException.

Writing a single byte at a time is often inefficient. For example, every TCP segment contains at least 40 bytes of overhead for routing and error correction. If each byte is sent by itself, you may be stuffing the network with 41 times more data than you think you are! Add overhead of the host-to-network layer protocol, and this can be even worse. Consequently, most TCP/IP implementations buffer data to some extent. That is, they accumulate bytes in memory and send them to their eventual destination only when a certain number have accumulated or a certain amount of time has passed. However, if you have more than one byte ready to go, it’s not a bad idea to send them all at once. Using write(byte[] data) or write(byte[] data, int offset, int length) is normally much faster than writing all the components of the data array one at a time. For instance, here’s an implementation of the generateCharacters() method that sends a line at a time by packing a complete line into a byte array:

public static void generateCharacters(OutputStream out)
    throws IOException {

  int firstPrintableCharacter = 33;
  int numberOfPrintableCharacters = 94;
  int numberOfCharactersPerLine = 72;
  int start = firstPrintableCharacter;
  byte[] line = new byte[numberOfCharactersPerLine + 2];
  // the +2 is for the carriage return and linefeed

  while (true) { /* infinite loop */
    for (int i = start; i < start + numberOfCharactersPerLine; i++) {
      line[i - start] = (byte) ((i - firstPrintableCharacter)
          % numberOfPrintableCharacters + firstPrintableCharacter);
    }
    line[72] = (byte) '\r'; // carriage return
    line[73] = (byte) '\n'; // line feed
    out.write(line);
    start = ((start + 1) - firstPrintableCharacter)
        % numberOfPrintableCharacters + firstPrintableCharacter;
  }
}

The algorithm for calculating which bytes to write when is the same as for the previous implementation. The crucial difference is that the bytes are packed into a byte array before being written onto the network. Also, notice that the int result of the calculation must be cast to a byte before being stored in the array. This wasn’t necessary in the previous implementation because the single-byte write() method is declared to take an int as an argument.

Streams can also be buffered in software, directly in the Java code as well as in the network hardware. Typically, this is accomplished by chaining a BufferedOutputStream or a BufferedWriter to the underlying stream, a technique we’ll explore shortly. Consequently, if you are done writing data, it’s important to flush the output stream. For example, suppose you’ve written a 300-byte request to an HTTP 1.1 server that uses HTTP Keep-Alive. You generally want to wait for a response before sending any more data. However, if the output stream has a 1,024-byte buffer, the stream may be waiting for more data to arrive before it sends the data out of its buffer. No more data will be written onto the stream until the server response arrives, but the response is never going to arrive because the request hasn’t been sent yet! Figure 1 illustrates this catch-22. The flush() method breaks the deadlock by forcing the buffered stream to send its data even if the buffer isn’t yet full.

It’s important to flush your streams whether you think you need to or not. Depending on how you got hold of a reference to the stream, you may or may not know whether it’s buffered. (For instance, System.out is buffered whether you want it to be or not.) If flushing isn’t necessary for a particular stream, it’s a very low-cost operation. However, if it is necessary, it’s very necessary. Failing to flush when you need to can lead to unpredictable, unrepeatable program hangs that are extremely hard to diagnose if you don’t have a good idea of what the problem is in the first place. As a corollary to all this, you should flush all streams immediately before you close them. Otherwise, data left in the buffer when the stream is closed may get lost.

Finally, when you’re done with a stream, close it by invoking its close() method. This releases any resources associated with the stream, such as file handles or ports. If the stream derives from a network connection, then closing the stream terminates the connection. Once an output stream has been closed, further writes to it throw IOExceptions. However, some kinds of streams may still allow you to do things with the object. For instance, a closed ByteArrayOutputStream can still be converted to an actual byte array and a closed DigestOutputStream can still return its digest.

Failure to close a stream in a long-running program can leak file handles, network ports, and other resources. Consequently, in Java 6 and earlier, it’s wise to close the stream in a finally block. To get the right variable scope, you have to declare the stream variable outside the try block but initialize it inside the try block. Furthermore, to avoid NullPointerExceptions you need to check whether the stream variable is null before closing it. Finally, you usually want to ignore or at most log any exceptions that occur while closing the stream. For example:

OutputStream out = null;
try {
  out = new FileOutputStream("/tmp/data.txt");
  // work with the output stream...
} catch (IOException ex) {
  System.err.println(ex.getMessage());
} finally {
  if (out != null) {
    try {
      out.close();
    } catch (IOException ex) {
      // ignore
    }
  }
}

This technique is sometimes called the dispose pattern; and it’s common for any object that needs to be cleaned up before it’s garbage collected. You’ll see it used not just for streams, but also for sockets, channels, JDBC connections and statements, and more.

Java 7 introduces the try with resources construct to make this cleanup neater. Instead of declaring the stream variable outside the try block, you declare it inside an argument list of the try block. For instance, the preceding fragment now becomes the much simpler:

try (OutputStream out = new FileOutputStream("/tmp/data.txt")) {
  // work with the output stream...
} catch (IOException ex) {
  System.err.println(ex.getMessage());
}

The finally clause is no longer needed. Java automatically invokes close() on any AutoCloseable objects declared inside the argument list of the try block.

Input Streams

Java’s basic input class is java.io.InputStream:

public abstract class InputStream

This class provides the fundamental methods needed to read data as raw bytes. These are:

public abstract int read() throws IOException
public int read(byte[] input) throws IOException
public int read(byte[] input, int offset, int length) throws IOException
public long skip(long n) throws IOException
public int available() throws IOException
public void close() throws IOException

Concrete subclasses of InputStream use these methods to read data from particular media. For instance, a FileInputStream reads data from a file. A TelnetInputStream reads data from a network connection. A ByteArrayInputStream reads data from an array of bytes. But whichever source you’re reading, you mostly use only these same six methods. Sometimes you don’t know exactly what kind of stream you’re reading from. For instance, TelnetInputStream is an undocumented class hidden inside the sun.net package. Instances of it are returned by various methods in the java.net package (e.g., the openStream() method of java.net.URL). However, these methods are declared to return only InputStream, not the more specific subclass TelnetInputStream. That’s polymorphism at work once again. The instance of the subclass can be used transparently as an instance of its superclass. No specific knowledge of the subclass is required.

The basic method of InputStream is the noargs read() method. This method reads a single byte of data from the input stream’s source and returns it as an int from 0 to 255. End of stream is signified by returning –1. The read() method waits and blocks execution of any code that follows it until a byte of data is available and ready to be read. Input and output can be slow, so if your program is doing anything else of importance, try to put I/O in its own thread.

The read() method is declared abstract because subclasses need to change it to handle their particular medium. For instance, a ByteArrayInputStream can implement this method with pure Java code that copies the byte from its array. However, a TelnetInputStream needs to use a native library that understands how to read data from the network interface on the host platform.

The following code fragment reads 10 bytes from the InputStream in and stores them in the byte array input. However, if end of stream is detected, the loop is terminated early:

byte[] input = new byte[10];
for (int i = 0; i < input.length; i++) {
  int b = in.read();
  if (b == -1) break;
  input[i] = (byte) b;
}

Although read() only reads a byte, it returns an int. Thus, a cast is necessary before storing the result in the byte array. Of course, this produces a signed byte from –128 to 127 instead of the unsigned byte from 0 to 255 returned by the read() method. However, as long as you’re clear about which one you’re working with, this is not a major problem. You can convert a signed byte to an unsigned byte like this:

int i = b >= 0 ? b : 256 + b;

Reading a byte at a time is as inefficient as writing data one byte at a time. Consequently, there are two overloaded read() methods that fill a specified array with multiple bytes of data read from the stream read(byte[] input) and read(byte[] input, int offset, int length). The first method attempts to fill the specified array input. The second attempts to fill the specified subarray of input, starting at offset and continuing for length bytes.

Notice I said these methods attempt to fill the array, not that they necessarily succeed. An attempt may fail in several ways. For instance, it’s not unheard of that while your program is reading data from a remote web server over DSL, a bug in a switch at a phone company central office will disconnect you and several hundred of your neighbors from the rest of the world. This would cause an IOException. More commonly, however, a read attempt won’t completely fail but won’t completely succeed either. Some of the requested bytes may be read, but not all of them. For example, you may try to read 1,024 bytes from a network connection, when only 512 have actually arrived from the server; the rest are still in transit. They’ll arrive eventually, but they aren’t available at this moment. To account for this, the multibyte read methods return the number of bytes actually read. For example, consider this code fragment:

byte[] input  = new byte[1024];
int bytesRead = in.read(input);

It attempts to read 1,024 bytes from the InputStream in into the array input. However, if only 512 bytes are available, that’s all that will be read, and bytesRead will be set to 512. To guarantee that all the bytes you want are actually read, place the read in a loop that reads repeatedly until the array is filled. For example:

int bytesRead   = 0;
int bytesToRead = 1024;
byte[] input    = new byte[bytesToRead];
while (bytesRead < bytesToRead) {
  bytesRead += in.read(input, bytesRead, bytesToRead - bytesRead);
}

This technique is especially crucial for network streams. Chances are that if a file is available at all, all the bytes of a file are also available. However, because networks move much more slowly than CPUs, it is very easy for a program to empty a network buffer before all the data has arrived. In fact, if one of these two methods tries to read from a temporarily empty but open network buffer, it will generally return 0, indicating that no data is available but the stream is not yet closed. This is often preferable to the behavior of the single-byte read() method, which blocks the running thread in the same circumstances.

All three read() methods return –1 to signal the end of the stream. If the stream ends while there’s still data that hasn’t been read, the multibyte read() methods return the data until the buffer has been emptied. The next call to any of the read() methods will return –1. The –1 is never placed in the array. The array only contains actual data. The previous code fragment had a bug because it didn’t consider the possibility that all 1,024 bytes might never arrive (as opposed to not being immediately available). Fixing that bug requires testing the return value of read() before adding it to bytesRead. For example:

int bytesRead = 0;
int bytesToRead = 1024;
byte[] input = new byte[bytesToRead];
while (bytesRead < bytesToRead) {
  int result = in.read(input, bytesRead, bytesToRead - bytesRead);
  if (result == -1) break; // end of stream
  bytesRead += result;
}

If you do not want to wait until all the bytes you need are immediately available, you can use the available() method to determine how many bytes can be read without blocking. This returns the minimum number of bytes you can read. You may in fact be able to read more, but you will be able to read at least as many bytes as available() suggests. For example:

int bytesAvailable = in.available();
byte[] input = new byte[bytesAvailable];
int bytesRead = in.read(input, 0, bytesAvailable);
// continue with rest of program immediately...

In this case, you can expect that bytesRead is exactly equal to bytesAvailable. You cannot, however, expect that bytesRead is greater than zero. It is possible that no bytes were available. On end of stream, available() returns 0. Generally, read(byte[] input, int offset, int length) returns –1 on end of stream; but if length is 0, then it does not notice the end of stream and returns 0 instead.

On rare occasions, you may want to skip over data without reading it. The skip() method accomplishes this task. It’s less useful on network connections than when reading from files. Network connections are sequential and normally quite slow, so it’s not significantly more time consuming to read data than to skip over it. Files are random access so that skipping can be implemented simply by repositioning a file pointer rather than processing each byte to be skipped.

As with output streams, once your program has finished with an input stream, it should close it by invoking its close() method. This releases any resources associated with the stream, such as file handles or ports. Once an input stream has been closed, further reads from it throw IOExceptions. However, some kinds of streams may still allow you to do things with the object. For instance, you generally won’t retrieve the message digest from a java.security.DigestInputStream until after the data has been read and the stream closed.

Marking and Resetting

The InputStream class also has three less commonly used methods that allow programs to back up and reread data they’ve already read. These are:

public void mark(int readAheadLimit)
public void reset() throws IOException
public boolean markSupported()

In order to reread data, mark the current position in the stream with the mark() method. At a later point, you can reset the stream to the marked position using the reset() method. Subsequent reads then return data starting from the marked position. However, you may not be able to reset as far back as you like. The number of bytes you can read from the mark and still reset is determined by the readAheadLimit argument to mark(). If you try to reset too far back, an IOException is thrown. Furthermore, there can be only one mark in a stream at any given time. Marking a second location erases the first mark.

Marking and resetting are usually implemented by storing every byte read from the marked position on in an internal buffer. However, not all input streams support this. Before trying to use marking and resetting, check whether the markSupported() method returns true. If it does, the stream supports marking and resetting. Otherwise, mark() will do nothing and reset() will throw an IOException.

In my opinion, this demonstrates very poor design. In practice, more streams don’t support marking and resetting than do. Attaching functionality to an abstract superclass that is not available to many, probably most, subclasses is a very poor idea. It would be better to place these three methods in a separate interface that could be implemented by those classes that provided this functionality. The disadvantage of this approach is that you couldn’t then invoke these methods on an arbitrary input stream of unknown type; but in practice, you can’t do that anyway because not all streams support marking and resetting. Providing a method such as markSupported() to check for functionality at runtime is a more traditional, non-object-oriented solution to the problem. An object-oriented approach would embed this in the type system through interfaces and classes so that it could all be checked at compile time.

The only input stream classes in java.io that always support marking are BufferedInputStream and ByteArrayInputStream. However, other input streams such as TelnetInputStream may support marking if they’re chained to a buffered input stream first.

Filter Streams

InputStream and OutputStream are fairly raw classes. They read and write bytes singly or in groups, but that’s all. Deciding what those bytes mean—whether they’re integers or IEEE 754 floating-point numbers or Unicode text—is completely up to the programmer and the code. However, there are certain extremely common data formats that can benefit from a solid implementation in the class library. For example, many integers passed as parts of network protocols are 32-bit big-endian integers. Much of the text sent over the Web is either 7-bit ASCII, 8-bit Latin-1, or multibyte UTF-8. Many files transferred by FTP are stored in the ZIP format. Java provides a number of filter classes you can attach to raw streams to translate the raw bytes to and from these and other formats.

The filters come in two versions: the filter streams, and the readers and writers. The filter streams still work primarily with raw data as bytes: for instance, by compressing the data or interpreting it as binary numbers. The readers and writers handle the special case of text in a variety of encodings such as UTF-8 and ISO 8859-1.

Filters are organized in a chain, as shown in Figure 2. Each link in the chain receives data from the previous filter or stream and passes the data along to the next link in the chain. In this example, a compressed, encrypted text file arrives from the local network interface, where native code presents it to the undocumented TelnetInputStream. A BufferedInputStream buffers the data to speed up the entire process. A CipherInputStream decrypts the data. A GZIPInputStream decompresses the deciphered data. An InputStreamReader converts the decompressed data to Unicode text. Finally, the text is read into the application and processed.

Every filter output stream has the same write(), close(), and flush() methods as java.io.OutputStream. Every filter input stream has the same read(), close(), and available() methods as java.io.InputStream. In some cases, such as BufferedInputStream and BufferedOutputStream, these may be the only methods the filter has. The filtering is purely internal and does not expose any new public interface. However, in most cases, the filter stream adds public methods with additional purposes. Sometimes these are intended to be used in addition to the usual read() and write() methods, like the unread() method of PushbackInputStream. At other times, they almost completely replace the original interface. For example, it’s relatively rare to use the write() method of PrintStream instead of one of its print() and println() methods.

Chaining Filters Together

Filters are connected to streams by their constructors. For example, the following code fragment buffers input from the file data.txt. First, a FileInputStream object fin is created by passing the name of the file as an argument to the FileInputStream constructor. Then, a BufferedInputStream object bin is created by passing fin as an argument to the BufferedInputStream constructor:

FileInputStream     fin = new FileInputStream("data.txt");
BufferedInputStream bin = new BufferedInputStream(fin);

From this point forward, it’s possible to use the read() methods of both fin and bin to read data from the file data.txt. However, intermixing calls to different streams connected to the same source may violate several implicit contracts of the filter streams. Most of the time, you should only use the last filter in the chain to do the actual reading or writing. One way to write your code so that it’s at least harder to introduce this sort of bug is to deliberately overwrite the reference to the underlying input stream. For example:

InputStream in = new FileInputStream("data.txt");
in = new BufferedInputStream(in);

After these two lines execute, there’s no longer any way to access the underlying file input stream, so you can’t accidentally read from it and corrupt the buffer. This example works because it’s not necessary to distinguish between the methods of InputStream and those of BufferedInputStream given that BufferedInputStream is simply used polymorphically as an instance of InputStream. In cases where it is necessary to use the additional methods of the filter stream not declared in the superclass, you may be able to construct one stream directly inside another. For example:

DataOutputStream dout = new DataOutputStream(new BufferedOutputStream(
    new FileOutputStream("data.txt")));

Although these statements can get a little long, it’s easy to split the statement across several lines, like this:

DataOutputStream dout = new DataOutputStream(
                         new BufferedOutputStream(
                          new FileOutputStream("data.txt")
                         )
                        );

Connection is permanent. Filters cannot be disconnected from a stream.

There are times when you may need to use the methods of multiple filters in a chain. For instance, if you’re reading a Unicode text file, you may want to read the byte order mark in the first three bytes to determine whether the file is encoded as big-endian UCS-2, little-endian UCS-2, or UTF-8, and then select the matching Reader filter for the encoding. Or, if you’re connecting to a web server, you may want to read the header the server sends to find the Content-encoding and then use that content encoding to pick the right Reader filter to read the body of the response. Or perhaps you want to send floating-point numbers across a network connection using a DataOutputStream and then retrieve a MessageDigest from the DigestOutputStream that the DataOutputStream is chained to. In all these cases, you need to save and use references to each of the underlying streams. However, under no circumstances should you ever read from or write to anything other than the last filter in the chain.

Buffered Streams

The BufferedOutputStream class stores written data in a buffer (a protected byte array field named buf) until the buffer is full or the stream is flushed. Then it writes the data onto the underlying output stream all at once. A single write of many bytes is almost always much faster than many small writes that add up to the same thing. This is especially true of network connections because each TCP segment or UDP packet carries a finite amount of overhead, generally about 40 bytes’ worth. This means that sending 1 kilobyte of data 1 byte at a time actually requires sending 40 kilobytes over the wire, whereas sending it all at once only requires sending a little more than 1K of data. Most network cards and TCP implementations provide some level of buffering themselves, so the real numbers aren’t quite this dramatic. Nonetheless, buffering network output is generally a huge performance win.

The BufferedInputStream class also has a protected byte array named buf that serves as a buffer. When one of the stream’s read() methods is called, it first tries to get the requested data from the buffer. Only when the buffer runs out of data does the stream read from the underlying source. At this point, it reads as much data as it can from the source into the buffer, whether it needs all the data immediately or not. Data that isn’t used immediately will be available for later invocations of read(). When reading files from a local disk, it’s almost as fast to read several hundred bytes of data from the underlying stream as it is to read one byte of data. Therefore, buffering can substantially improve performance. The gain is less obvious on network connections where the bottleneck is often the speed at which the network can deliver data rather than the speed at which the network interface delivers data to the program or the speed at which the program runs. Nonetheless, buffering input rarely hurts and will become more important over time as network speeds increase.

BufferedInputStream has two constructors, as does BufferedOutputStream:

public BufferedInputStream(InputStream in)
public BufferedInputStream(InputStream in, int bufferSize)
public BufferedOutputStream(OutputStream out)
public BufferedOutputStream(OutputStream out, int bufferSize)

The first argument is the underlying stream from which unbuffered data will be read or to which buffered data will be written. The second argument, if present, specifies the number of bytes in the buffer. Otherwise, the buffer size is set to 2,048 bytes for an input stream and 512 bytes for an output stream. The ideal size for a buffer depends on what sort of stream you’re buffering. For network connections, you want something a little larger than the typical packet size. However, this can be hard to predict and varies depending on local network connections and protocols. Faster, higher-bandwidth networks tend to use larger packets, although TCP segments are often no larger than a kilobyte.

BufferedInputStream does not declare any new methods of its own. It only overrides methods from InputStream. It does support marking and resetting. The two multibyte read() methods attempt to completely fill the specified array or subarray of data by reading from the underlying input stream as many times as necessary. They return only when the array or subarray has been completely filled, the end of stream is reached, or the underlying stream would block on further reads. Most input streams do not behave like this. They read from the underlying stream or data source only once before returning.

BufferedOutputStream also does not declare any new methods of its own. You invoke its methods exactly as you would in any output stream. The difference is that each write places data in the buffer rather than directly on the underlying output stream. Consequently, it is essential to flush the stream when you reach a point at which the data needs to be sent.

PrintStream

The PrintStream class is the first filter output stream most programmers encounter because System.out is a PrintStream. However, other output streams can also be chained to print streams, using these two constructors:

public void print(boolean b)
public void print(char c)
public void print(int i)
public void print(long l)
public void print(float f)
public void print(double d)
public void print(char[] text)
public void print(String s)
public void print(Object o)
public void println()
public void println(boolean b)
public void println(char c)
public void println(int i)
public void println(long l)
public void println(float f)
public void println(double d)
public void println(char[] text)
public void println(String s)
public void println(Object o)

Each print() method converts its argument to a string in a predictable fashion and writes the string onto the underlying output stream using the default encoding. The println() methods do the same thing, but they also append a platform-dependent line separator to the end of the line they write. This is a linefeed (\n) on Unix (including Mac OS X), a carriage return (\r) on Mac OS 9, and a carriage return/linefeed pair (\r\n) on Windows.

PrintStream is evil and network programmers should avoid it like the plague!

The first problem is that the output from println() is platform dependent. Depending on what system runs your code, lines may sometimes be broken with a linefeed, a carriage return, or a carriage return/linefeed pair. This doesn’t cause problems when writing to the console, but it’s a disaster for writing network clients and servers that must follow a precise protocol. Most network protocols such as HTTP and Gnutella specify that lines should be terminated with a carriage return/linefeed pair. Using println() makes it easy to write a program that works on Windows but fails on Unix and the Mac. Although many servers and clients are liberal in what they accept and can handle incorrect line terminators, there are occasional exceptions.

The second problem is that PrintStream assumes the default encoding of the platform on which it’s running. However, this encoding may not be what the server or client expects. For example, a web browser receiving XML files will expect them to be encoded in UTF-8 or UTF-16 unless the server tells it otherwise. However, a web server that uses PrintStream may well send the files encoded in CP1252 from a U.S.-localized Windows system or SJIS from a Japanese-localized system, whether the client expects or understands those encodings or not. PrintStream doesn’t provide any mechanism for changing the default encoding. This problem can be patched over by using the related PrintWriter class instead. But the problems continue.

The third problem is that PrintStream eats all exceptions. This makes PrintStream suitable for textbook programs such as HelloWorld, because simple console output can be taught without burdening students with first learning about exception handling and all that implies. However, network connections are much less reliable than the console. Connections routinely fail because of network congestion, phone company misfeasance, remote systems crashing, and many other reasons. Network programs must be prepared to deal with unexpected interruptions in the flow of data. The way to do this is by handling exceptions. However, PrintStream catches any exceptions thrown by the underlying output stream. Notice that the declaration of the standard five OutputStream methods in PrintStream does not have the usual throws IOException declaration:

public abstract void write(int b)
public void write(byte[] data)
public void write(byte[] data, int offset, int length)
public void flush()
public void close()

Instead, PrintStream relies on an outdated and inadequate error flag. If the underlying stream throws an exception, this internal error flag is set. The programmer is relied upon to check the value of the flag using the checkError() method:

public boolean checkError()

To do any error checking at all on a PrintStream, the code must explicitly check every call. Furthermore, once an error has occurred, there is no way to unset the flag so further errors can be detected. Nor is any additional information available about the error. In short, the error notification provided by PrintStream is wholly inadequate for unreliable network connections.

Data Streams

The DataInputStream and DataOutputStream classes provide methods for reading and writing Java’s primitive data types and strings in a binary format. The binary formats used are primarily intended for exchanging data between two different Java programs through a network connection, a datafile, a pipe, or some other intermediary. What a data output stream writes, a data input stream can read. However, it happens that the formats are the same ones used for most Internet protocols that exchange binary numbers. For instance, the time protocol uses 32-bit big-endian integers, just like Java’s int data type. The controlled-load network element service uses 32-bit IEEE 754 floating-point numbers, just like Java’s float data type. (This is correlation rather than causation. Both Java and most network protocols were designed by Unix programmers, and consequently both tend to use the formats common to most Unix systems.) However, this isn’t true for all network protocols, so check the details of any protocol you use.

The DataOutputStream class offers these 11 methods for writing particular Java data types:

public final void writeBoolean(boolean b) throws IOException
public final void writeByte(int b) throws IOException
public final void writeShort(int s) throws IOException
public final void writeChar(int c) throws IOException
public final void writeInt(int i) throws IOException
public final void writeLong(long l) throws IOException
public final void writeFloat(float f) throws IOException
public final void writeDouble(double d) throws IOException
public final void writeChars(String s) throws IOException
public final void writeBytes(String s) throws IOException
public final void writeUTF(String s) throws IOException

All data is written in big-endian format. Integers are written in two’s complement in the minimum number of bytes possible. Thus, a byte is written as one byte, a short as two bytes, an int as four bytes, and a long as eight bytes. Floats and doubles are written in IEEE 754 form in four and eight bytes, respectively. Booleans are written as a single byte with the value 0 for false and 1 for true. Chars are written as two unsigned bytes.

The last three methods are a little trickier. The writeChars() method simply iterates through the String argument, writing each character in turn as a two-byte, big-endian Unicode character (a UTF-16 code point, to be absolutely precise). The writeBytes() method iterates through the String argument but writes only the least significant byte of each character. Thus, information will be lost for any string with characters from outside the Latin-1 character set. This method may be useful on some network protocols that specify the ASCII encoding, but it should be avoided most of the time.

Neither writeChars() nor writeBytes() encodes the length of the string in the output stream. As a result, you can’t really distinguish between raw characters and characters that make up part of a string. The writeUTF() method does include the length of the string. It encodes the string itself in a variant of the UTF-8 encoding of Unicode. Because this variant is subtly incompatible with most non-Java software, it should be used only for exchanging data with other Java programs that use a DataInputStream to read strings. For exchanging UTF-8 text with all other software, you should use an InputStreamReader with the appropriate encoding. (There wouldn’t be any confusion if Sun had just called this method and its partner writeString() and readString() rather than writeUTF() and readUTF().)

Along with these methods for writing binary numbers and strings, DataOutputStream of course has the usual write(), flush(), and close() methods any OutputStream class has.

DataInputStream is the complementary class to DataOutputStream. Every format that DataOutputStream writes, DataInputStream can read. In addition, DataInputStream has the usual read(), available(), skip(), and close() methods, as well as methods for reading complete arrays of bytes and lines of text.

There are 9 methods to read binary data that match the 11 methods in DataOutputStream (there’s no exact complement for writeBytes() or writeChars(); these are handled by reading the bytes and chars one at a time):

public final boolean readBoolean() throws IOException
public final byte readByte() throws IOException
public final char readChar() throws IOException
public final short readShort() throws IOException
public final int readInt() throws IOException
public final long readLong() throws IOException
public final float readFloat() throws IOException
public final double readDouble() throws IOException
public final String readUTF() throws IOException

In addition, DataInputStream provides two methods to read unsigned bytes and unsigned shorts and return the equivalent int. Java doesn’t have either of these data types, but you may encounter them when reading binary data written by a C program:

public final int readUnsignedByte() throws IOException
public final int readUnsignedShort() throws IOException

DataInputStream has the usual two multibyte read() methods that read data into an array or subarray and return the number of bytes read. It also has two readFully() methods that repeatedly read data from the underlying input stream into an array until the requested number of bytes have been read. If enough data cannot be read, then an IOException is thrown. These methods are especially useful when you know in advance exactly how many bytes you have to read. This might be the case when you’ve read the Content-length field out of an HTTP header and thus know how many bytes of data there are:

public final int read(byte[] input) throws IOException
public final int read(byte[] input, int offset, int length)
    throws IOException
public final void readFully(byte[] input) throws IOException
public final void readFully(byte[] input, int offset, int length)
    throws IOException

Finally, DataInputStream provides the popular readLine() method that reads a line of text as delimited by a line terminator and returns a string:

public final String readLine() throws IOException

However, this method should not be used under any circumstances, both because it is deprecated and because it is buggy. It’s deprecated because it doesn’t properly convert non-ASCII characters to bytes in most circumstances. That task is now handled by the readLine() method of the BufferedReader class. However, that method and this one share the same insidious bug: they do not always recognize a single carriage return as ending a line. Rather, readLine() recognizes only a linefeed or a carriage return/linefeed pair. When a carriage return is detected in the stream, readLine() waits to see whether the next character is a linefeed before continuing. If it is a linefeed, the carriage return and the linefeed are thrown away and the line is returned as a String. If it isn’t a linefeed, the carriage return is thrown away, the line is returned as a String, and the extra character that was read becomes part of the next line. However, if the carriage return is the last character in the stream, then readLine() hangs, waiting for the last character, which isn’t forthcoming.

This problem isn’t obvious when reading files because there will almost certainly be a next character: –1 for end of stream, if nothing else. However, on persistent network connections such as those used for FTP and late-model HTTP, a server or client may simply stop sending data after the last character and wait for a response without actually closing the connection. If you’re lucky, the connection may eventually time out on one end or the other and you’ll get an IOException, although this will probably take at least a couple of minutes, and cause you to lose the last line of data from the stream. If you’re not lucky, the program will hang indefinitely.

Readers and Writers

Many programmers have a bad habit of writing code as if all text were ASCII or at least in the native encoding of the platform. Although some older, simpler network protocols, such as daytime, quote of the day, and chargen, do specify ASCII encoding for text, this is not true of HTTP and many other more modern protocols, that allow a wide variety of localized encodings, such as KOI8-R Cyrillic, Big-5 Chinese, and ISO 8859-9 for Turkish. Java’s native character set is the UTF-16 encoding of Unicode. When the encoding is no longer ASCII, the assumption that bytes and chars are essentially the same things also breaks down. Consequently, Java provides an almost complete mirror of the input and output stream class hierarchy designed for working with characters instead of bytes.

In this mirror image hierarchy, two abstract superclasses define the basic API for reading and writing characters. The java.io.Reader class specifies the API by which characters are read. The java.io.Writer class specifies the API by which characters are written. Wherever input and output streams use bytes, readers and writers use Unicode characters. Consequently, Java provides an almost complete mirror of the input and output stream class hierarchy designed for working with characters instead of bytes.

In this mirror image hierarchy, two abstract superclasses define the basic API for reading and writing characters. The java.io.Reader class specifies the API by which characters are read. The java.io.Writer class specifies the API by which characters are written. Wherever input and output streams use bytes, readers and writers use Unicode characters. Concrete subclasses of Reader and Writer allow particular sources to be read and targets to be written. Filter readers and writers can be attached to other readers and writers to provide additional services or interfaces.

The most important concrete subclasses of Reader and Writer are the InputStreamReader and the OutputStreamWriter classes. An InputStreamReader contains an underlying input stream from which it reads raw bytes. It translates these bytes into Unicode characters according to a specified encoding. An OutputStreamWriter receives Unicode characters from a running program. It then translates those characters into bytes using a specified encoding and writes the bytes onto an underlying output stream.

In addition to these two classes, the java.io package provides several raw reader and writer classes that read characters without directly requiring an underlying input stream, including:

FileReader
FileWriter
StringReader
StringWriter
CharArrayReader
CharArrayWriter

The first two classes in this list work with files and the last four work inside Java, so they aren’t of great use for network programming. However, aside from different constructors, these classes have pretty much the same public interface as all other reader and writer classes.

Writers

The Writer class mirrors the java.io.OutputStream class. It’s abstract and has two protected constructors. Like OutputStream, the Writer class is never used directly; instead, it is used polymorphically, through one of its subclasses. It has five write() methods as well as a flush() and a close() method:

protected Writer()
protected Writer(Object lock)
public abstract void write(char[] text, int offset, int length)
    throws IOException
public void write(int c) throws IOException
public void write(char[] text) throws IOException
public void write(String s) throws IOException
public void write(String s, int offset, int length) throws IOException
public abstract void flush() throws IOException
public abstract void close() throws IOException

The write(char[] text, int offset, int length) method is the base method in terms of which the other four write() methods are implemented. A subclass must override at least this method as well as flush() and close(), although most override some of the other write() methods as well in order to provide more efficient implementations. For example, given a Writer object w, you can write the string “Network” like this:

char[] network = {'N', 'e', 't', 'w', 'o', 'r', 'k'};
w.write(network, 0, network.length);

The same task can be accomplished with these other methods, as well:

w.write(network);
for (int i = 0;  i < network.length;  i++) w.write(network[i]);
w.write("Network");
w.write("Network", 0, 7);

All of these examples are different ways of expressing the same thing. The one you choose to use in any given situation is mostly a matter of convenience and taste. However, how many and which bytes are written by these lines depends on the encoding w uses. If it’s using big-endian UTF-16, it will write these 14 bytes (shown here in hexadecimal) in this order:

00 4E 00 65 00 74 00 77 00 6F 00 72 00 6B

On the other hand, if w uses little-endian UTF-16, this sequence of 14 bytes is written:

4E 00 65 00 74 00 77 00 6F 00 72 00 6B 00

If w uses Latin-1, UTF-8, or MacRoman, this sequence of seven bytes is written:

4E 65 74 77 6F 72 6B

Other encodings may write still different sequences of bytes. The exact output depends on the encoding.

Writers may be buffered, either directly by being chained to a BufferedWriter or indirectly because their underlying output stream is buffered. To force a write to be committed to the output medium, invoke the flush() method:

w.flush();

The close() method behaves similarly to the close() method of OutputStream. close() flushes the writer, then closes the underlying output stream and releases any resources associated with it:

public abstract void close() throws IOException

After a writer has been closed, further writes throw IOExceptions.

OutputStreamWriter

OutputStreamWriter is the most important concrete subclass of Writer. An OutputStreamWriter receives characters from a Java program. It converts these into bytes according to a specified encoding and writes them onto an underlying output stream. Its constructor specifies the output stream to write to and the encoding to use:

public OutputStreamWriter(OutputStream out, String encoding)
    throws UnsupportedEncodingException

Valid encodings are listed in the documentation for Sun’s native2ascii tool included with the JDK. If no encoding is specified, the default encoding for the platform is used. In 2013, the default encoding is UTF-8 on the Mac and more often than not on Linux. However, on Linux it can vary if the local operating system is configured to use some other character set by default. On Windows, it varies depending on country and configuration, but in the United States it’s Windows-1252, a.k.a CP1252 more often than not. Default character sets can cause unexpected problems at unexpected times. You’re generally almost always better off explicitly specifying the character set rather than letting Java pick one for you. For example, this code fragment writes the first few words of Homer’s Odyssey in the CP1253 Windows Greek encoding:

OutputStreamWriter w = new OutputStreamWriter(
    new FileOutputStream("OdysseyB.txt"), "Cp1253");
w.write("ἦμος δ΄ ἠριγένεια φάνη ῥοδοδάκτυλος Ἠώς");

Other than the constructors, OutputStreamWriter has only the usual Writer methods (which are used exactly as they are for any Writer class) and one method to return the encoding of the object:

public String getEncoding()

Readers

The Reader class mirrors the java.io.InputStream class. It’s abstract with two protected constructors. Like InputStream and Writer, the Reader class is never used directly, only through one of its subclasses. It has three read() methods, as well as skip(), close(), ready(), mark(), reset(), and markSupported() methods:

protected Reader()
protected Reader(Object lock)
public abstract int read(char[] text, int offset, int length)
    throws IOException
public int read() throws IOException
public int read(char[] text) throws IOException
public long skip(long n) throws IOException
public boolean ready()
public boolean markSupported()
public void mark(int readAheadLimit) throws IOException
public void reset() throws IOException
public abstract void close() throws IOException

The read(char[] text, int offset, int length) method is the fundamental method through which the other two read() methods are implemented. A subclass must override at least this method as well as close(), although most will override some of the other methods as well in order to provide more efficient implementations.

Most of these methods are easily understood by analogy with their InputStream counterparts. The exception to the rule of similarity is ready(), which has the same general purpose as available() but not quite the same semantics, even modulo the byte-to-char conversion. Whereas available() returns an int specifying a minimum number of bytes that may be read without blocking, ready() only returns a boolean indicating whether the reader may be read without blocking. The problem is that some character encodings, such as UTF-8, use different numbers of bytes for different characters. Thus, it’s hard to tell how many characters are waiting in the network or filesystem buffer without actually reading them out of the buffer.

InputStreamReader is the most important concrete subclass of Reader. An InputStreamReader reads bytes from an underlying input stream such as a FileInputStream or TelnetInputStream. It converts these into characters according to a specified encoding and returns them. The constructor specifies the input stream to read from and the encoding to use:

public InputStreamReader(InputStream in)
public InputStreamReader(InputStream in, String encoding)
    throws UnsupportedEncodingException

If no encoding is specified, the default encoding for the platform is used. If an unknown encoding is specified, an UnsupportedEncodingException is thrown.

For example, this method reads an input stream and converts it all to one Unicode string using the MacCyrillic encoding:

public static String getMacCyrillicString(InputStream in)
    throws IOException {

  InputStreamReader r = new InputStreamReader(in, "MacCyrillic");
  StringBuilder sb = new StringBuilder();
  int c;
  while ((c = r.read()) != -1) sb.append((char) c);
  return sb.toString();
}

Filter Readers and Writers

The InputStreamReader and OutputStreamWriter classes act as decorators on top of input and output streams that change the interface from a byte-oriented interface to a character-oriented interface. Once this is done, additional character-oriented filters can be layered on top of the reader or writer using the java.io.FilterReader and java.io.FilterWriter classes. As with filter streams, there are a variety of subclasses that perform specific filtering, including:

BufferedReader
BufferedWriter
LineNumberReader
PushbackReader
PrintWriter

The BufferedReader and BufferedWriter classes are the character-based equivalents of the byte-oriented BufferedInputStream and BufferedOutputStream classes. Where BufferedInputStream and BufferedOutputStream use an internal array of bytes as a buffer, BufferedReader and BufferedWriter use an internal array of chars.

When a program reads from a BufferedReader, text is taken from the buffer rather than directly from the underlying input stream or other text source. When the buffer empties, it is filled again with as much text as possible, even if not all of it is immediately needed, making future reads much faster. When a program writes to a BufferedWriter, the text is placed in the buffer. The text is moved to the underlying output stream or other target only when the buffer fills up or when the writer is explicitly flushed, which can make writes much faster than would otherwise be the case.

BufferedReader and BufferedWriter have the usual methods associated with readers and writers, like read(), ready(), write(), and close(). They each have two constructors that chain the BufferedReader or BufferedWriter to an underlying reader or writer and set the size of the buffer. If the size is not set, the default size of 8,192 characters is used:

public BufferedReader(Reader in, int bufferSize)
public BufferedReader(Reader in)
public BufferedWriter(Writer out)
public BufferedWriter(Writer out, int bufferSize)

For example, the earlier getMacCyrillicString() example was less than efficient because it read characters one at a time. Because MacCyrillic is a 1-byte character set, it also read bytes one at a time. However, it’s straightforward to make it run faster by chaining a BufferedReader to the InputStreamReader, like this:

public static String getMacCyrillicString(InputStream in)
    throws IOException {

  Reader r = new InputStreamReader(in, "MacCyrillic");
  r = new BufferedReader(r, 1024);
  StringBuilder sb = new StringBuilder();
  int c;
  while ((c = r.read()) != -1) sb.append((char) c);
  return sb.toString();
}

All that was needed to buffer this method was one additional line of code. None of the rest of the algorithm had to change, because the only InputStreamReader methods used were the read() and close() methods declared in the Reader superclass and shared by all Reader subclasses, including BufferedReader.

The BufferedReader class also has a readLine() method that reads a single line of text and returns it as a string:

public String readLine() throws IOException

This method replaces the deprecated readLine() method in DataInputStream, and it has mostly the same behavior as that method. The big difference is that by chaining a BufferedReader to an InputStreamReader, you can correctly read lines in character sets other than the default encoding for the platform.

The BufferedWriter() class adds one new method not included in its superclass, called newLine(), also geared toward writing lines:

public void newLine() throws IOException

This method inserts a platform-dependent line-separator string into the output. The line.separator system property determines exactly what the string is: probably a linefeed on Unix and Mac OS X, and a carriage return/linefeed pair on Windows. Because network protocols generally specify the required line terminator, you should not use this method for network programming. Instead, explicitly write the line terminator the protocol requires. More often than not, the required terminator is a carriage return/linefeed pair.

PrintWriter

The PrintWriter class is a replacement for Java 1.0’s PrintStream class that properly handles multibyte character sets and international text. Sun originally planned to deprecate PrintStream in favor of PrintWriter but backed off when it realized this step would invalidate too much existing code, especially code that depended on System.out. Nonetheless, new code should use PrintWriter instead of PrintStream.

Aside from the constructors, the PrintWriter class has an almost identical collection of methods to PrintStream. These include:

public PrintWriter(Writer out)
public PrintWriter(Writer out, boolean autoFlush)
public PrintWriter(OutputStream out)
public PrintWriter(OutputStream out, boolean autoFlush)
public void flush()
public void close()
public boolean checkError()
public void write(int c)
public void write(char[] text, int offset, int length)
public void write(char[] text)
public void write(String s, int offset, int length)
public void write(String s)
public void print(boolean b)
public void print(char c)
public void print(int i)
....

Most of these methods behave the same for PrintWriter as they do for PrintStream. The exceptions are the four write() methods, which write characters rather than bytes.

Reference