Java URI and URL

Published on 2017-04-27

URIs

A Uniform Resource Identifier (URI) is a string of characters in a particular syntax that identifies a resource. The resource identified may be a file on a server; but it may also be an email address, a news message, a book, a person’s name, an Internet host, the current stock price of Oracle, or something else.

A resource is a thing that is identified by a URI. A URI is a string that identifies a resource. Yes, it is exactly that circular. Don’t spend too much time worrying about what a resource is or isn’t, because you’ll never see one anyway. All you ever receive from a server is a representation of a resource, which comes in the form of bytes. However, a single resource may have different representations. For instance, https://www.un.org/en/documents/udhr/ identifies the Universal Declaration of Human Rights; but there are representations of the declaration in plain text, XML, PDF, and other formats. There are also representations of this resource in English, French, Arabic, and many other languages. Some of these representations may themselves be resources. For instance, https://www.un.org/en/documents/udhr/ identifies specifically the English version of the Universal Declaration of Human Rights.

The syntax of a URI is composed of a scheme and a scheme-specific part, separated by a colon, like this:

scheme:scheme-specific-part

The syntax of the scheme-specific part depends on the scheme being used. Current schemes include:

data
Base64-encoded data included directly in a link; see RFC 2397

file
A file on a local disk

ftp
An FTP server

http
A World Wide Web server using the Hypertext Transfer Protocol

mailto
An email address

magnet
A resource available for download via peer-to-peer networks such as BitTorrent

telnet
A connection to a Telnet-based service

urn
A Uniform Resource Name

In addition, Java makes heavy use of nonstandard custom schemes such as rmi, jar, jndi, and doc for various purposes.

There is no specific syntax that applies to the scheme-specific parts of all URIs. However, many have a hierarchical form, like this:

//authority/path?query

The authority part of the URI names the authority responsible for resolving the rest of the URI. For instance, the URI http://www.ietf.org/rfc/rfc3986.txt has the scheme http, the authority www.ietf.org, and the path /rfc/rfc3986.txt (initial slash included). This means the server at www.ietf.org is responsible for mapping the path /rfc/rfc3986.txt to a resource. This URI does not have a query part. The URI http://www.powells.com/cgi-bin/biblio?inkey=62-1565928709-0 has the scheme http, the authority www.powells.com, the path /cgi-bin/biblio, and the query inkey=62-1565928709-0. The URI urn:isbn:156592870 has the scheme urn but doesn’t follow the hierarchical //authority/path?query form for scheme-specific parts.
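The decomposition described above can be checked with the java.net.URI class. The following is a minimal sketch rather than a definitive recipe; the class name UriPartsDemo and the helper describe() are made up for illustration, and the three URIs are the ones used in the preceding paragraph.

```java
import java.net.URI;

final class UriPartsDemo {

  // Returns "scheme | authority | path | query" for a hierarchical URI,
  // or "scheme | scheme-specific part" for an opaque one such as a URN.
  static String describe(String text) {
    URI uri = URI.create(text);
    if (uri.isOpaque()) {
      return uri.getScheme() + " | " + uri.getSchemeSpecificPart();
    }
    return uri.getScheme() + " | " + uri.getAuthority()
        + " | " + uri.getPath() + " | " + uri.getQuery();
  }

  public static void main(String[] args) {
    // No query part, so the last field is null.
    System.out.println(describe("http://www.ietf.org/rfc/rfc3986.txt"));
    System.out.println(describe(
        "http://www.powells.com/cgi-bin/biblio?inkey=62-1565928709-0"));
    // Not hierarchical: only a scheme and a scheme-specific part.
    System.out.println(describe("urn:isbn:156592870"));
  }
}
```

A URI whose scheme-specific part does not begin with a slash, such as the urn example, is reported by isOpaque() as opaque, which is why it gets the shorter description.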

URLs

A URL is a URI that, as well as identifying a resource, provides a specific network location for the resource that a client can use to retrieve a representation of that resource. By contrast, a generic URI may tell you what a resource is, but not actually tell you where or how to get that resource. In the physical world, it’s the difference between the title “Harry Potter and The Deathly Hallows” and the library location “Room 312, Row 28, Shelf 7”. In Java, it’s the difference between the java.net.URI class that only identifies resources and the java.net.URL class that can both identify and retrieve resources.

The network location in a URL usually includes the protocol used to access a server (e.g., FTP, HTTP), the hostname or IP address of the server, and the path to the resource on that server. A typical URL looks like http://www.ibiblio.org/javafaq/javatutorial.html. This specifies that there is a file called javatutorial.html in a directory called javafaq on the server www.ibiblio.org, and that this file can be accessed via the HTTP protocol.

The syntax of a URL is:

protocol://userInfo@host:port/path?query#fragment

Here the protocol is another word for what was called the scheme of the URI. (Scheme is the word used in the URI RFC. Protocol is the word used in the Java documentation.) In a URL, the protocol part can be file, ftp, http, https, magnet, telnet, or various other strings (though not urn).

The host part of a URL is the name of the server that provides the resource you want. It can be a hostname such as www.oreilly.com or utopia.poly.edu or an IP address, such as 204.148.40.9 or 128.238.3.21.

The userInfo is optional login information for the server. If present, it contains a username and, rarely, a password.

The port number is also optional. It’s not necessary if the service is running on its default port (port 80 for HTTP servers).

Together, the userInfo, host, and port constitute the authority.

The path points to a particular resource on the specified server. It often looks like a filesystem path such as /forum/index.php. However, it may or may not actually map to a filesystem on the server. If it does map to a filesystem, the path is relative to the document root of the server, not necessarily to the root of the filesystem on the server. As a rule, servers that are open to the public do not show their entire filesystem to clients. Rather, they show only the contents of a specified directory. This directory is called the document root, and all paths and filenames are relative to it. Thus, on a Unix server, all files that are available to the public might be in /var/public/html, but to somebody connecting from a remote machine, this directory looks like the root of the filesystem.

The query string provides additional arguments for the server. It’s commonly used only in http URLs, where it contains form data for input to programs running on the server.

Finally, the fragment references a particular part of the remote resource. If the remote resource is HTML, the fragment identifier names an anchor in the HTML document. If the remote resource is XML, the fragment identifier is an XPointer. Some sources refer to the fragment part of the URL as a “section”. Java rather unaccountably refers to the fragment identifier as a “Ref”. Fragment identifier targets are created in an HTML document with an id attribute, like this:

<h3 id="xtocid1902914">Comments</h3>

This tag identifies a particular point in a document. To refer to this point, a URL includes not only the document’s filename but the fragment identifier separated from the rest of the URL by a #:

http://www.cafeaulait.org/javafaq.html#xtocid1902914

The URL Class

The java.net.URL class is an abstraction of a Uniform Resource Locator such as http://www.lolcats.com/ or ftp://ftp.redhat.com/pub/. It extends java.lang.Object, and it is a final class that cannot be subclassed. Rather than relying on inheritance to configure instances for different kinds of URLs, it uses the strategy design pattern. Protocol handlers are the strategies, and the URL class itself forms the context through which the different strategies are selected.

Although storing a URL as a string would be trivial, it is helpful to think of URLs as objects with fields that include the scheme (a.k.a. the protocol), hostname, port, path, query string, and fragment identifier (a.k.a. the ref), each of which may be set independently. Indeed, this is almost exactly how the java.net.URL class is organized, though the details vary a little between different versions of Java.

URLs are immutable. After a URL object has been constructed, its fields do not change. This has the side effect of making them thread safe.

Creating New URLs

Unlike the InetAddress objects in Chapter 4, you can construct instances of java.net.URL. The constructors differ in the information they require:

public URL(String url) throws MalformedURLException
public URL(String protocol, String hostname, String file)
    throws MalformedURLException
public URL(String protocol, String host, int port, String file)
    throws MalformedURLException
public URL(URL base, String relative) throws MalformedURLException

Which constructor you use depends on the information you have and the form it’s in. All these constructors throw a MalformedURLException if you try to create a URL for an unsupported protocol or if the URL is syntactically incorrect.

Exactly which protocols are supported is implementation dependent. The only protocols that have been available in all virtual machines are http and file, and the latter is notoriously flaky. Today, Java also supports the https, jar, and ftp protocols. Some virtual machines support mailto and gopher as well as some custom protocols like doc, netdoc, systemresource, and verbatim used internally by Java.
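Because protocol support is implementation dependent, one way to find out what a particular virtual machine offers is simply to try constructing a URL for each scheme. The sketch below assumes a hypothetical host name, www.example.com, and a made-up class name; it uses the three-argument constructor so that only the lookup of the protocol handler is exercised, not any scheme-specific parsing. The output will differ from one Java runtime to another.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class ProtocolSupportDemo {

  // Returns true if this JVM can find a protocol handler for the scheme.
  static boolean isSupported(String scheme) {
    try {
      new URL(scheme, "www.example.com", "/"); // hypothetical host
      return true;
    } catch (MalformedURLException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    String[] schemes = {
      "http", "https", "file", "ftp", "jar", "mailto", "gopher", "nosuchscheme"
    };
    for (String scheme : schemes) {
      System.out.println(scheme + ": "
          + (isSupported(scheme) ? "supported" : "not supported"));
    }
  }
}
```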

Other than verifying that it recognizes the URL scheme, Java does not check the correctness of the URLs it constructs. The programmer is responsible for making sure that URLs created are valid. For instance, Java does not check that the hostname in an HTTP URL does not contain spaces or that the query string is x-www-form-urlencoded. It does not check that a mailto URL actually contains an email address. You can create URLs for hosts that don’t exist and for hosts that do exist but that you won’t be allowed to connect to.
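To see how little checking the URL constructor does, compare it with the stricter java.net.URI class. This is a small sketch under the assumption of a hypothetical URL containing unencoded spaces; the class and method names are invented for the example.

```java
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

final class UncheckedUrlDemo {

  // True if the URL constructor accepts the text.
  static boolean urlAccepts(String text) {
    try {
      new URL(text);
      return true;
    } catch (MalformedURLException ex) {
      return false;
    }
  }

  // True if the text also survives conversion to a java.net.URI,
  // which rejects characters such as unencoded spaces.
  static boolean uriAccepts(String text) {
    try {
      new URL(text).toURI();
      return true;
    } catch (MalformedURLException | URISyntaxException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Hypothetical URL with spaces in the path and the query string.
    String sloppy = "http://www.example.com/my file.html?q=two words";
    System.out.println("URL accepts it: " + urlAccepts(sloppy));
    System.out.println("URI accepts it: " + uriAccepts(sloppy));
  }
}
```

If validation matters, converting with toURI() before using a URL is one inexpensive way to catch this kind of mistake early.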

Constructing a URL from a string

The simplest URL constructor just takes an absolute URL in string form as its single argument:

public URL(String url) throws MalformedURLException

Like all constructors, this may only be called after the new operator, and like all URL constructors, it can throw a MalformedURLException. The following code constructs a URL object from a String, catching the exception that might be thrown:

try {
  URL u = new URL("http://www.audubon.org/");
} catch (MalformedURLException ex)  {
  System.err.println(ex);
}

Constructing a URL from its component parts

You can also build a URL by specifying the protocol, the hostname, and the file:

public URL(String protocol, String hostname, String file)
    throws MalformedURLException

This constructor sets the port to -1 so the default port for the protocol will be used. The file argument should begin with a slash and include a path, a filename, and optionally a fragment identifier. Forgetting the initial slash is a common mistake, and one that is not easy to spot. Like all URL constructors, it can throw a MalformedURLException. For example:

try {
  URL u = new URL("http", "www.eff.org", "/blueribbon.html#intro");
} catch (MalformedURLException ex)  {
  throw new RuntimeException("shouldn't happen; all VMs recognize http", ex);
}

This creates a URL object that points to http://www.eff.org/blueribbon.html#intro, using the default port for the HTTP protocol (port 80). The file specification includes a reference to a named anchor. The code catches the exception that would be thrown if the virtual machine did not support the HTTP protocol. However, this shouldn’t happen in practice.
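The missing-slash mistake is easier to recognize once you have seen what it produces. The sketch below builds the same URL with and without the leading slash and prints both results; the class name and the build() helper are invented for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class MissingSlashDemo {

  // Builds a URL from its parts and returns its string form.
  static String build(String protocol, String host, String file) {
    try {
      return new URL(protocol, host, file).toExternalForm();
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    // Correct: the file argument begins with a slash.
    System.out.println(build("http", "www.eff.org", "/blueribbon.html#intro"));
    // Mistake: no slash, so the hostname and filename run together.
    System.out.println(build("http", "www.eff.org", "blueribbon.html#intro"));
  }
}
```

No exception is thrown in the second case. The constructor simply concatenates the pieces, which is why the error tends to surface only later, when a connection attempt fails.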

For the rare occasions when the default port isn’t correct, the next constructor lets you specify the port explicitly as an int. The other arguments are the same. For example, this code fragment creates a URL object that points to http://fourier.dur.ac.uk:8000/~dma3mjh/jsci/, specifying port 8000 explicitly:

try {
  URL u = new URL("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");
} catch (MalformedURLException ex)  {
  throw new RuntimeException("shouldn't happen; all VMs recognize http", ex);
}

Constructing relative URLs

This constructor builds an absolute URL from a relative URL and a base URL:

public URL(URL base, String relative) throws MalformedURLException

For instance, you may be parsing an HTML document at http://www.ibiblio.org/javafaq/index.html and encounter a link to a file called mailinglists.html with no further qualifying information. In this case, you use the URL to the document that contains the link to provide the missing information. The constructor computes the new URL as http://www.ibiblio.org/javafaq/mailinglists.html. For example:

try {
  URL u1 = new URL("http://www.ibiblio.org/javafaq/index.html");
  URL u2 = new URL (u1, "mailinglists.html");
} catch (MalformedURLException ex) {
  System.err.println(ex);
}

The filename is removed from the path of u1 and the new filename mailinglists.html is appended to make u2. This constructor is particularly useful when you want to loop through a list of files that are all in the same directory. You can create a URL for the first file and then use this initial URL to create URL objects for the other files by substituting their filenames.
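The looping idea just described might look like the following sketch. Only mailinglists.html comes from the text above; the other link names are hypothetical, as are the class and method names. The last link begins with a slash, so it is resolved against the server root rather than against the directory of the base document.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

final class RelativeUrlDemo {

  // Resolves each relative link against the base document's URL.
  static List<String> resolveAll(String base, String... links) {
    try {
      URL baseUrl = new URL(base);
      List<String> resolved = new ArrayList<>();
      for (String link : links) {
        resolved.add(new URL(baseUrl, link).toExternalForm());
      }
      return resolved;
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    List<String> urls = resolveAll(
        "http://www.ibiblio.org/javafaq/index.html",
        "mailinglists.html",   // from the text
        "books.html",          // hypothetical sibling file
        "/images/logo.gif");   // hypothetical path from the server root
    for (String url : urls) {
      System.out.println(url);
    }
  }
}
```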

Retrieving Data from a URL

Naked URLs aren’t very exciting. What’s interesting is the data contained in the documents they point to. The URL class has several methods that retrieve data from a URL:

public InputStream openStream() throws IOException
public URLConnection openConnection() throws IOException
public URLConnection openConnection(Proxy proxy) throws IOException
public Object getContent() throws IOException
public Object getContent(Class[] classes) throws IOException

The most basic and most commonly used of these methods is openStream(), which returns an InputStream from which you can read the data. If you need more control over the download process, call openConnection() instead, which gives you a URLConnection which you can configure, and then get an InputStream from it. Finally, you can ask the URL for its content with getContent() which may give you a more complete object such as String or an Image. Then again, it may just give you an InputStream anyway.

public final InputStream openStream() throws IOException

The openStream() method connects to the resource referenced by the URL, performs any necessary handshaking between the client and the server, and returns an InputStream from which data can be read. The data you get from this InputStream is the raw (i.e., uninterpreted) content the URL references: ASCII if you’re reading an ASCII text file, raw HTML if you’re reading an HTML file, binary image data if you’re reading an image file, and so forth. It does not include any of the HTTP headers or any other protocol-related information. You can read from this InputStream as you would read from any other InputStream. For example:

try {
  URL u = new URL("http://www.lolcats.com");
  InputStream in = u.openStream();
  int c;
  while ((c = in.read()) != -1) System.out.write(c);
  in.close();
} catch (IOException ex) {
  System.err.println(ex);
}

The preceding code fragment catches an IOException, which also catches the MalformedURLException that the URL constructor can throw, since MalformedURLException subclasses IOException.

As with most network streams, reliably closing the stream takes a bit of effort. In Java 6 and earlier, we use the dispose pattern: declare the stream variable outside the try block, set it to null, and then close it in the finally block if it’s not null. For example:

InputStream in = null;
try {
  URL u = new URL("http://www.lolcats.com");
  in = u.openStream();
  int c;
  while ((c = in.read()) != -1) System.out.write(c);
} catch (IOException ex) {
  System.err.println(ex);
} finally {
  try {
    if (in != null) {
      in.close();
    }
  } catch (IOException ex) {
    // ignore
  }
}

Java 7 makes this somewhat cleaner by using a nested try-with-resources statement:

try {
  URL u = new URL("http://www.lolcats.com");
  try (InputStream in = u.openStream()) {
    int c;
    while ((c = in.read()) != -1) System.out.write(c);
  }
} catch (IOException ex) {
  System.err.println(ex);
}
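The fragments above need a live web server. For experimenting offline, the same openStream() call works on a file URL. The following sketch writes a temporary file, then reads it back through its URL in chunks rather than one byte at a time. The class name, helper names, and temporary file are all assumptions made for this example, and it relies on UncheckedIOException, which requires Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class OpenStreamDemo {

  // Reads everything the URL's stream delivers and decodes it as UTF-8.
  static String readAll(URL url) {
    try (InputStream in = url.openStream()) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
      }
      return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  // Writes text to a temporary file and returns a file: URL for it.
  static URL tempFileUrl(String text) {
    try {
      Path file = Files.createTempFile("openstream-demo", ".txt");
      file.toFile().deleteOnExit();
      Files.write(file, text.getBytes(StandardCharsets.UTF_8));
      return file.toUri().toURL();
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.print(readAll(tempFileUrl("hello from a file: URL\n")));
  }
}
```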

public URLConnection openConnection() throws IOException

The openConnection() method creates a URLConnection object for the specified URL. A URLConnection represents a connection to a network resource; the actual network connection is normally not established until you call connect() or a method that requires it, such as getInputStream(). If the call fails, openConnection() throws an IOException. For example:

try {
  URL u = new URL("https://news.ycombinator.com/");
  try {
    URLConnection uc = u.openConnection();
    InputStream in = uc.getInputStream();
    // read from the connection...
  } catch (IOException ex) {
    System.err.println(ex);
  }
} catch (MalformedURLException ex) {
  System.err.println(ex);
}

You should use this method when you want to communicate directly with the server. The URLConnection gives you access to everything sent by the server: in addition to the document itself in its raw form (e.g., HTML, plain text, binary image data), you can access all the metadata specified by the protocol. For example, if the scheme is HTTP or HTTPS, the URLConnection lets you access the HTTP headers as well as the raw HTML. The URLConnection class also lets you write data to as well as read from a URL—for instance, in order to send email to a mailto URL or post form data.
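As a rough illustration of reading metadata rather than content, the sketch below asks a URLConnection for the content length it reports. To stay runnable without a network, it points at a temporary file through a file URL; with an http URL the same call would typically reflect the Content-Length header. All class, method, and file names here are assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ConnectionMetadataDemo {

  // Opens a connection, asks it for the content length it reports,
  // and closes the underlying stream again.
  static long reportedLength(URL url) {
    try {
      URLConnection connection = url.openConnection();
      try (InputStream in = connection.getInputStream()) {
        return connection.getContentLengthLong();
      }
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  // Writes text to a temporary file and returns a file: URL for it.
  static URL tempFileUrl(String text) {
    try {
      Path file = Files.createTempFile("connection-demo", ".txt");
      file.toFile().deleteOnExit();
      Files.write(file, text.getBytes(StandardCharsets.UTF_8));
      return file.toUri().toURL();
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    URL url = tempFileUrl("twelve bytes");
    System.out.println("Reported length: " + reportedLength(url));
  }
}
```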

An overloaded variant of this method specifies the proxy server to pass the connection through:

public URLConnection openConnection(Proxy proxy) throws IOException

This overrides any proxy server set with the usual socksProxyHost, socksProxyPort, http.proxyHost, http.proxyPort, http.nonProxyHosts, and similar system properties. If the protocol handler does not support proxies, the argument is ignored and the connection is made directly if possible.
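A sketch of how the proxy variant might be called follows. The proxy host proxy.example.com, its port, and the target URL are hypothetical, as are the class and helper names. The proxy address is created unresolved so that building it does not trigger a DNS lookup, and nothing is sent over the network because the connection is never opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

final class ProxyDemo {

  // Describes an HTTP proxy without resolving its host name yet.
  static Proxy httpProxy(String host, int port) {
    return new Proxy(Proxy.Type.HTTP,
        InetSocketAddress.createUnresolved(host, port));
  }

  // Creates, but does not connect, a URLConnection routed via the proxy.
  static URLConnection viaProxy(String url, Proxy proxy) {
    try {
      return new URL(url).openConnection(proxy);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    Proxy proxy = httpProxy("proxy.example.com", 8080); // hypothetical proxy
    URLConnection connection = viaProxy("http://www.example.com/", proxy);
    System.out.println(connection.getClass().getName() + " via " + proxy);
  }
}
```

Passing Proxy.NO_PROXY instead requests a direct connection regardless of the system properties.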

public final Object getContent() throws IOException

The getContent() method is the third way to download data referenced by a URL. The getContent() method retrieves the data referenced by the URL and tries to make it into some type of object. If the URL refers to some kind of text such as an ASCII or HTML file, the object returned is usually some sort of InputStream. If the URL refers to an image such as a GIF or a JPEG file, getContent() usually returns a java.awt.ImageProducer. What unifies these two disparate classes is that they are not the thing itself but a means by which a program can construct the thing:

URL u = new URL("http://mesola.obspm.fr/");
Object o = u.getContent();
// cast the Object to the appropriate type
// work with the Object...

getContent() operates by looking at the Content-type field in the header of the data it gets from the server. If the server does not use MIME headers or sends an unfamiliar Content-type, getContent() returns some sort of InputStream with which the data can be read. An IOException is thrown if the object can’t be retrieved. Example 3 demonstrates this.

import java.io.*;
import java.net.*;

public class ContentGetter {

  public static void main (String[] args) {

    if  (args.length > 0) {
      // Open the URL for reading
      try {
        URL u = new URL(args[0]);
        Object o = u.getContent();
        System.out.println("I got a " + o.getClass().getName());
      } catch (MalformedURLException ex) {
        System.err.println(args[0] + " is not a parseable URL");
      } catch (IOException ex) {
        System.err.println(ex);
      }
    }
  }
}

Here’s the result of trying to get the content of http://www.oreilly.com:

% java ContentGetter http://www.oreilly.com/
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

The exact class may vary from one version of Java to the next (in earlier versions, it’s been java.io.PushbackInputStream or sun.net.www.http.KeepAliveStream) but it should be some form of InputStream.

Here’s what you get when you try to load a header image from that page:

% java ContentGetter http://www.oreilly.com/graphics_new/animation.gif
I got a sun.awt.image.URLImageSource

Here’s what happens when you try to load a Java applet using getContent():

% java ContentGetter http://www.cafeaulait.org/RelativeURLTest.class
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

Here’s what happens when you try to load an audio file using getContent():

% java ContentGetter http://www.cafeaulait.org/course/week9/spacemusic.au
I got a sun.applet.AppletAudioClip

The last result is the most unusual because it is as close as the Java core API gets to a class that represents a sound file. It’s not just an interface through which you can load the sound data.

This example demonstrates the biggest problem with using getContent(): it’s hard to predict what kind of object you’ll get. You could get some kind of InputStream or an ImageProducer or perhaps an AudioClip, although it’s easy to check which one using the instanceof operator. This information should be enough to let you read a text file or display an image.

public final Object getContent(Class[] classes) throws IOException

A URL’s content handler may provide different views of a resource. This overloaded variant of the getContent() method lets you choose which class you’d like the content to be returned as. The method attempts to return the URL’s content in the first available format. For instance, if you prefer an HTML file to be returned as a String, but your second choice is a Reader and your third choice is an InputStream, write:

URL u = new URL("http://www.nwu.org");
Class<?>[] types = new Class[3];
types[0] = String.class;
types[1] = Reader.class;
types[2] = InputStream.class;
Object o = u.getContent(types);

If the content handler knows how to return a string representation of the resource, then it returns a String. If it doesn’t know how to return a string representation of the resource, then it returns a Reader. And if it doesn’t know how to present the resource as a reader, then it returns an InputStream. You have to test for the type of the returned object using instanceof. For example:

if (o instanceof String) {
  System.out.println(o);
} else if (o instanceof Reader) {
  int c;
  Reader r = (Reader) o;
  while ((c = r.read()) != -1) System.out.print((char) c);
  r.close();
} else if (o instanceof InputStream) {
  int c;
  InputStream in = (InputStream) o;
  while ((c = in.read()) != -1) System.out.write(c);
  in.close();
} else {
  System.out.println("Error: unexpected type " + o.getClass());
}

Splitting a URL into Pieces

URLs are composed of five pieces:

  • The scheme, also known as the protocol
  • The authority
  • The path
  • The fragment identifier, also known as the section or ref
  • The query string

For example, in the URL http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc, the scheme is http, the authority is www.ibiblio.org, the path is /javafaq/books/jnp/index.html, the fragment identifier is toc, and the query string is isbn=1565922069. However, not all URLs have all these pieces. For instance, the URL http://www.faqs.org/rfcs/rfc3986.html has a scheme, an authority, and a path, but no fragment identifier or query string.

The authority may further be divided into the user info, the host, and the port. For example, in the URL http://admin@www.blackstar.com:8080/, the authority is admin@www.blackstar.com:8080. This has the user info admin, the host www.blackstar.com, and the port 8080.

Read-only access to these parts of a URL is provided by nine public methods: getFile(), getHost(), getPort(), getProtocol(), getRef(), getQuery(), getPath(), getUserInfo(), and getAuthority().

public String getProtocol()

The getProtocol() method returns a String containing the scheme of the URL (e.g., “http”, “https”, or “file”). For example, this code fragment prints https:

URL u = new URL("https://xkcd.com/727/");
System.out.println(u.getProtocol());

public String getHost()

The getHost() method returns a String containing the hostname of the URL. For example, this code fragment prints xkcd.com:

URL u = new URL("https://xkcd.com/727/");
System.out.println(u.getHost());

public int getPort()

The getPort() method returns the port number specified in the URL as an int. If no port was specified in the URL, getPort() returns -1 to signify that the URL does not specify the port explicitly, and will use the default port for the protocol. For example, if the URL is http://www.userfriendly.org/, getPort() returns -1; if the URL is http://www.userfriendly.org:80/, getPort() returns 80. The following code prints -1 for the port number because it isn’t specified in the URL:

URL u = new URL("http://www.ncsa.illinois.edu/AboutUs/");
System.out.println("The port part of " + u + " is " + u.getPort());

public int getDefaultPort()

The getDefaultPort() method returns the default port used for this URL’s protocol when none is specified in the URL. If no default port is defined for the protocol, then getDefaultPort() returns -1. For example, if the URL is http://www.userfriendly.org/, getDefaultPort() returns 80; if the URL is ftp://ftp.userfriendly.org:8000/, getDefaultPort() returns 21.
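In practice, getPort() and getDefaultPort() are often combined to find the port a client would actually connect to. The following is a minimal sketch of that idiom; the class and method names are invented for the example, and the URLs are the ones used in the surrounding text.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class EffectivePortDemo {

  // The explicit port if the URL has one, otherwise the protocol's
  // default port (which may itself be -1 if none is defined).
  static int effectivePort(String text) {
    try {
      URL url = new URL(text);
      return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(effectivePort("http://www.userfriendly.org/"));
    System.out.println(effectivePort("ftp://ftp.userfriendly.org:8000/"));
    System.out.println(effectivePort("https://xkcd.com/727/"));
  }
}
```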

public String getFile()

The getFile() method returns a String that contains the path portion of a URL; remember that Java does not break a URL into separate path and file parts. Everything from the first slash (/) after the hostname until the character preceding the # sign that begins a fragment identifier is considered to be part of the file. For example:

URL page = this.getDocumentBase();
System.out.println("This page's path is " + page.getFile());

If the URL does not have a file part, Java sets the file to the empty string.

public String getPath()

The getPath() method is a near synonym for getFile(); that is, it returns a String containing the path and file portion of a URL. However, unlike getFile(), it does not include the query string in the String it returns, just the path.
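The difference between the two methods shows up only when the URL has a query string. The sketch below uses URLs that appear elsewhere in this article; the class and helper names are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class FileVersusPathDemo {

  // Returns { getFile(), getPath() } for the given URL.
  static String[] fileAndPath(String text) {
    try {
      URL url = new URL(text);
      return new String[] { url.getFile(), url.getPath() };
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    String[] parts = fileAndPath(
        "http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc");
    System.out.println("file: " + parts[0]); // includes ?isbn=1565922069
    System.out.println("path: " + parts[1]); // stops before the query string

    // No file part at all: both values are the empty string.
    String[] bare = fileAndPath("http://www.oreilly.com");
    System.out.println("file: [" + bare[0] + "] path: [" + bare[1] + "]");
  }
}
```

Neither method includes the fragment identifier; that is available separately through getRef().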

public String getRef()

The getRef() method returns the fragment identifier part of the URL. If the URL doesn’t have a fragment identifier, the method returns null. In the following code, getRef() returns the string xtocid1902914:

URL u = new URL(
    "http://www.ibiblio.org/javafaq/javafaq.html#xtocid1902914");
System.out.println("The fragment ID of " + u + " is " + u.getRef());

public String getQuery()

The getQuery() method returns the query string of the URL. If the URL doesn’t have a query string, the method returns null. In the following code, getQuery() returns the string category=Piano:

URL u = new URL(
    "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano");
System.out.println("The query string of " + u + " is " + u.getQuery());

public String getUserInfo()

Some URLs include usernames and occasionally even password information. This information comes after the scheme and before the host; an @ symbol delimits it. For instance, in the URL http://elharo@java.oreilly.com/, the user info is elharo. Some URLs also include passwords in the user info. For instance, in the URL ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/, the user info is mp3:secret. However, most of the time, including a password in a URL is a security risk. If the URL doesn’t have any user info, getUserInfo() returns null.

Mailto URLs may not behave like you expect. In a URL like mailto:elharo@ibiblio.org, “elharo@ibiblio.org” is the path, not the user info and the host. That’s because the URL specifies the remote recipient of the message rather than the username and host that’s sending the message.
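Here is a small sketch of both behaviors. Since mailto support in the URL class is implementation dependent, the mailto address is parsed with java.net.URI instead, which treats it as an opaque URI with no user info or host. That substitution, and the class and method names, are assumptions made for this example.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class UserInfoDemo {

  // Returns the user info of a URL, or null if it has none.
  static String userInfo(String text) {
    try {
      return new URL(text).getUserInfo();
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(userInfo("http://elharo@java.oreilly.com/"));
    System.out.println(
        userInfo("ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/"));
    System.out.println(userInfo("http://www.oreilly.com/")); // null

    // A mailto address has neither user info nor a host; the whole
    // address is the scheme-specific part.
    URI mail = URI.create("mailto:elharo@ibiblio.org");
    System.out.println(mail.getUserInfo());
    System.out.println(mail.getHost());
    System.out.println(mail.getSchemeSpecificPart());
  }
}
```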

public String getAuthority()

Between the scheme and the path of a URL, you’ll find the authority. This part of the URI indicates the authority that resolves the resource. In the most general case, the authority includes the user info, the host, and the port. For example, in the URL ftp://mp3:mp3@138.247.121.61:21000/c%3a/, the authority is mp3:mp3@138.247.121.61:21000, the user info is mp3:mp3, the host is 138.247.121.61, and the port is 21000. However, not all URLs have all parts. For instance, in the URL http://conferences.oreilly.com/java/speakers/, the authority is simply the hostname conferences.oreilly.com. The getAuthority() method returns the authority as it exists in the URL, with or without the user info and port.
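The relationship between the authority and its parts can be sketched by reassembling the authority from getUserInfo(), getHost(), and getPort() and comparing the result with getAuthority(). The class and helper names below are invented for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class AuthorityDemo {

  // Returns { getAuthority(), authority rebuilt from its parts }.
  static String[] authorityAndRebuilt(String text) {
    try {
      URL url = new URL(text);
      StringBuilder rebuilt = new StringBuilder();
      if (url.getUserInfo() != null) {
        rebuilt.append(url.getUserInfo()).append('@');
      }
      rebuilt.append(url.getHost());
      if (url.getPort() != -1) {
        rebuilt.append(':').append(url.getPort());
      }
      return new String[] { url.getAuthority(), rebuilt.toString() };
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    String[] urls = {
      "ftp://mp3:mp3@138.247.121.61:21000/c%3a/",
      "http://conferences.oreilly.com/java/speakers/"
    };
    for (String text : urls) {
      String[] pair = authorityAndRebuilt(text);
      System.out.println(pair[0] + " / rebuilt: " + pair[1]);
    }
  }
}
```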

Example 4 uses these methods to split URLs entered on the command line into their component parts.

import java.net.*;

public class URLSplitter {

  public static void main(String args[]) {

    for (int i = 0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        System.out.println("The URL is " + u);
        System.out.println("The scheme is " + u.getProtocol());
        System.out.println("The user info is " + u.getUserInfo());

        String host = u.getHost();
        if (host != null) {
          int atSign = host.indexOf('@');
          if (atSign != -1) host = host.substring(atSign+1);
          System.out.println("The host is " + host);
        } else {
          System.out.println("The host is null.");
        }

        System.out.println("The port is " + u.getPort());
        System.out.println("The path is " + u.getPath());
        System.out.println("The ref is " + u.getRef());
        System.out.println("The query string is " + u.getQuery());
      } catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand.");
      }
      System.out.println();
    }
  }
}

Here’s the result of running this against several of the URL examples:

% java URLSplitter    \
ftp://mp3:mp3@138.247.121.61:21000/c%3a/                 \
http://www.oreilly.com                                   \
http://www.ibiblio.org/nywc/compositions.phtml?category=Piano \
http://admin@www.blackstar.com:8080/

The URL is ftp://mp3:mp3@138.247.121.61:21000/c%3a/
The scheme is ftp
The user info is mp3:mp3
The host is 138.247.121.61
The port is 21000
The path is /c%3a/
The ref is null
The query string is null

The URL is http://www.oreilly.com
The scheme is http
The user info is null
The host is www.oreilly.com
The port is -1
The path is
The ref is null
The query string is null

The URL is http://www.ibiblio.org/nywc/compositions.phtml?category=Piano
The scheme is http
The user info is null
The host is www.ibiblio.org
The port is -1
The path is /nywc/compositions.phtml
The ref is null
The query string is category=Piano

The URL is http://admin@www.blackstar.com:8080/
The scheme is http
The user info is admin
The host is www.blackstar.com
The port is 8080
The path is /
The ref is null
The query string is null

Equality and Comparison

The URL class contains the usual equals() and hashCode() methods. These behave almost as you’d expect. Two URLs are considered equal if and only if both URLs point to the same resource on the same host, port, and path, with the same fragment identifier and query string. However, there is one surprise here. The equals() method actually tries to resolve the host with DNS so that, for example, it can tell that http://www.ibiblio.org/ and http://ibiblio.org/ are the same.

This means that equals() on a URL is potentially a blocking I/O operation! For this reason, you should avoid storing URLs in data structures that depend on equals(), such as java.util.HashMap. Prefer java.net.URI for this, and convert back and forth from URIs to URLs when necessary.

On the other hand, equals() does not go so far as to actually compare the resources identified by two URLs. For example, http://www.oreilly.com/ is not equal to http://www.oreilly.com/index.html; and http://www.oreilly.com:80 is not equal to http://www.oreilly.com/.

Example 5 creates URL objects for http://www.ibiblio.org/ and http://ibiblio.org/ and tells you if they’re the same using the equals() method.

import java.net.*;

public class URLEquality {

  public static void main (String[] args) {
    try {
      URL www = new URL ("http://www.ibiblio.org/");
      URL ibiblio = new URL("http://ibiblio.org/");
      if (ibiblio.equals(www)) {
        System.out.println(ibiblio + " is the same as " + www);
      } else {
        System.out.println(ibiblio + " is not the same as " + www);
      }
    } catch (MalformedURLException ex) {
      System.err.println(ex);
    }
  }
}

When you run this program, you discover:

% java URLEquality
http://ibiblio.org/ is the same as http://www.ibiblio.org/

URL does not implement Comparable.

The URL class also has a sameFile() method that checks whether two URLs point to the same resource:

public boolean sameFile(URL other)

The comparison is essentially the same as with equals(), DNS queries included, except that sameFile() does not consider the fragment identifier. Thus sameFile() returns true when comparing http://www.oreilly.com/index.html#p1 and http://www.oreilly.com/index.html#q2, while equals() would return false.

Here’s a fragment of code that uses sameFile() to compare two URLs:

URL u1 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS");
URL u2 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD");
if (u1.sameFile(u2)) {
  System.out.println(u1 + " is the same file as \n" + u2);
} else {
  System.out.println(u1 + " is not the same file as \n" + u2);
}

The output is:

http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS is the same file as
http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD

Conversion

URL has three methods that convert an instance to another form: toString(), toExternalForm(), and toURI().

Like all good classes, java.net.URL has a toString() method. The String produced by toString() is always an absolute URL, such as http://www.cafeaulait.org/javatutorial.html. It’s uncommon to call toString() explicitly. Print statements call toString() implicitly. Outside of print statements, it’s more proper to use toExternalForm() instead:

public String toExternalForm()

The toExternalForm() method converts a URL object to a string that can be used in an HTML link or a web browser’s Open URL dialog.

The toExternalForm() method returns a human-readable String representing the URL. It is identical to the toString() method. In fact, all the toString() method does is return toExternalForm().

Finally, the toURI() method converts a URL object to an equivalent URI object:

public URI toURI() throws URISyntaxException

We’ll take up the URI class shortly. In the meantime, the main thing you need to know is that the URI class provides much more accurate, specification-conformant behavior than the URL class. For operations like absolutization and encoding, you should prefer the URI class where you have the option. You should also prefer the URI class if you need to store URLs in a hashtable or other data structure, since its equals() method is not blocking. The URL class should be used primarily when you want to download content from a server.
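The three conversions can be sketched together as follows. This is only an illustration: the class and the roundTrip() helper are hypothetical names, and the checked exceptions are wrapped in an IllegalArgumentException to keep the example short.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class URLConversions {

  // Hypothetical helper: text -> URL -> URI -> URL, returning the external form
  public static String roundTrip(String address) {
    try {
      URL url = new URL(address);
      URI uri = url.toURI();   // the form to store, compare, and parse
      URL back = uri.toURL();  // the form to download from
      return back.toExternalForm();
    } catch (MalformedURLException | URISyntaxException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) throws MalformedURLException {
    URL url = new URL("http://www.cafeaulait.org/javatutorial.html");
    // toString() simply returns toExternalForm(), so this prints true
    System.out.println(url.toString().equals(url.toExternalForm()));
    System.out.println(roundTrip("http://www.cafeaulait.org/javatutorial.html"));
  }
}
```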

The URI Class

A URI is a generalization of a URL that includes not only Uniform Resource Locators but also Uniform Resource Names (URNs). Most URIs used in practice are URLs, but most specifications and standards such as XML are defined in terms of URIs. In Java, URIs are represented by the java.net.URI class. This class differs from the java.net.URL class in three important ways:

  • The URI class is purely about identification of resources and parsing of URIs. It provides no methods to retrieve a representation of the resource identified by its URI.
  • The URI class is more conformant to the relevant specifications than the URL class.
  • A URI object can represent a relative URI. The URL class absolutizes all URIs before storing them.

In brief, a URL object is a representation of an application layer protocol for network retrieval, whereas a URI object is purely for string parsing and manipulation. The URI class has no network retrieval capabilities. The URL class has some string parsing methods, such as getFile() and getRef(), but many of these are broken and don’t always behave exactly as the relevant specifications say they should. Normally, you should use the URL class when you want to download the content at a URL and the URI class when you want to use the URL for identification rather than retrieval, for instance, to represent an XML namespace. When you need to do both, you may convert from a URI to a URL with the toURL() method, and from a URL to a URI using the toURI() method.

Constructing a URI

URIs are built from strings. You can either pass the entire URI to the constructor in a single string, or the individual pieces:

public URI(String uri) throws URISyntaxException
public URI(String scheme, String schemeSpecificPart, String fragment)
    throws URISyntaxException
public URI(String scheme, String host, String path, String fragment)
    throws URISyntaxException
public URI(String scheme, String authority, String path, String query,
    String fragment) throws URISyntaxException
public URI(String scheme, String userInfo, String host, int port,
    String path, String query, String fragment) throws URISyntaxException

Unlike the URL class, the URI class does not depend on an underlying protocol handler. As long as the URI is syntactically correct, Java does not need to understand its protocol in order to create a representative URI object. Thus, unlike the URL class, the URI class can be used for new and experimental URI schemes.

The first constructor creates a new URI object from any convenient string. For example:

URI voice = new URI("tel:+1-800-9988-9938");
URI web   = new URI("http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc");
URI book  = new URI("urn:isbn:1-565-92870-9");

If the string argument does not follow URI syntax rules—for example, if the URI begins with a colon—this constructor throws a URISyntaxException. This is a checked exception, so either catch it or declare that the method where the constructor is invoked can throw it. However, one syntax rule is not checked. In contradiction to the URI specification, the characters used in the URI are not limited to ASCII. They can include other Unicode characters, such as ø and é. Syntactically, there are very few restrictions on URIs, especially once the need to encode non-ASCII characters is removed and relative URIs are allowed. Almost any string can be interpreted as a URI.
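Here is a small sketch of how that exception might be handled. The isParsable() helper and the sample strings are assumptions made for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class URISyntaxCheck {

  // Hypothetical helper: true if Java can parse the text as a URI reference
  public static boolean isParsable(String text) {
    try {
      new URI(text);
      return true;
    } catch (URISyntaxException ex) {
      // getInput() is the offending string, getReason() explains the failure
      System.err.println(ex.getInput() + ": " + ex.getReason());
      return false;
    }
  }

  public static void main(String[] args) {
    // No protocol handler is needed, so made-up schemes parse fine
    System.out.println(isParsable("x-experimental:anything"));
    // Non-ASCII characters such as é (\u00e9) are accepted
    System.out.println(isParsable("http://www.example.com/caf\u00e9"));
    // A URI may not begin with a colon
    System.out.println(isParsable(":no-scheme"));
    // An unescaped space is one of the few characters that is rejected
    System.out.println(isParsable("http://www.example.com/a b"));
  }
}
```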

The second constructor, which takes a scheme-specific part, is mostly used for nonhierarchical URIs. The scheme is the URI’s protocol, such as http, urn, tel, and so forth. It must be composed exclusively of ASCII letters and digits and the three punctuation characters +, -, and . (period). It must begin with a letter. Passing null for this argument omits the scheme, thus creating a relative URI. For example:

URI absolute = new URI("http", "//www.ibiblio.org" , null);
URI relative = new URI(null, "/javafaq/index.shtml", "today");

The scheme-specific part depends on the syntax of the URI scheme; it’s one thing for an http URL, another for a mailto URL, and something else again for a tel URI. Because the URI class encodes illegal characters with percent escapes, there’s effectively no syntax error you can make in this part.

Finally, the third argument contains the fragment identifier, if any. Again, characters that are forbidden in a fragment identifier are escaped automatically. Passing null for this argument simply omits the fragment identifier.
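To see the automatic escaping at work, consider this sketch. The mailto() helper and the address with a subject line are hypothetical; the point is only that the space in the scheme-specific part comes out as %20:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class OpaqueURIBuilder {

  // Hypothetical helper: builds a mailto URI from an unencoded scheme-specific part
  public static URI mailto(String schemeSpecificPart) {
    try {
      return new URI("mailto", schemeSpecificPart, null);
    } catch (URISyntaxException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    URI u = mailto("elharo@ibiblio.org?subject=Java URIs");
    // The constructor escaped the space:
    // mailto:elharo@ibiblio.org?subject=Java%20URIs
    System.out.println(u);
    // The decoded getter gives back the original text
    System.out.println(u.getSchemeSpecificPart());
  }
}
```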

The third constructor is used for hierarchical URIs such as http and ftp URLs. The host and path together (separated by a /) form the scheme-specific part for this URI. For example:

URI today = new URI("http", "www.ibiblio.org", "/javafaq/index.html", "today");

This produces the URI http://www.ibiblio.org/javafaq/index.html#today.

If the constructor cannot form a legal hierarchical URI from the supplied pieces—for instance, if there is a scheme so the URI has to be absolute but the path doesn’t start with /—then it throws a URISyntaxException.

The fourth constructor is basically the same as the third, with the addition of a query string. For example:

URI today = new URI("http", "www.ibiblio.org", "/javafaq/index.html",
    "referrer=cnet&date=2014-02-23",  "today");

As usual, any unescapable syntax errors cause a URISyntaxException to be thrown and null can be passed to omit any of the arguments.

The fifth constructor is the master hierarchical URI constructor that the previous two invoke. It divides the authority into separate user info, host, and port parts, each of which has its own syntax rules. For example:

URI styles = new URI("ftp", "anonymous:elharo@ibiblio.org",
    "ftp.oreilly.com",  21, "/pub/stylesheet", null, null);

However, the resulting URI still has to follow all the usual rules for URIs; and again null can be passed for any argument to omit it from the result.

If you’re sure your URIs are legal and do not violate any of the rules, you can use the static factory URI.create() method instead. Unlike the constructors, it does not throw a URISyntaxException. For example, this invocation creates a URI for anonymous FTP access using an email address as password:

URI styles = URI.create(
    "ftp://anonymous:elharo%40ibiblio.org@ftp.oreilly.com:21/pub/stylesheet");

If the URI does prove to be malformed, then an IllegalArgumentException is thrown by this method. This is a runtime exception, so you don’t have to explicitly declare it or catch it.
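A brief sketch of both sides of URI.create(), assuming a hypothetical constant HOME and a hypothetical rejects() helper. The factory is convenient for constants you have checked by eye; for untrusted input, the unchecked exception still has to be dealt with somewhere:

```java
import java.net.URI;

public class CreateDemo {

  // A constant known to be legal, so no try/catch or throws clause is needed
  static final URI HOME = URI.create("http://www.ibiblio.org/javafaq/");

  // Hypothetical helper: true if URI.create() refuses the text
  public static boolean rejects(String text) {
    try {
      URI.create(text);
      return false;
    } catch (IllegalArgumentException ex) {
      // The cause is the URISyntaxException the constructor would have thrown
      System.err.println(ex.getCause());
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(HOME);
    System.out.println(rejects(":broken"));                     // true
    System.out.println(rejects("http://www.example.com/a b"));  // true
    System.out.println(rejects("urn:isbn:1-565-92870-9"));      // false
  }
}
```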

The Parts of the URI

A URI reference has up to three parts: a scheme, a scheme-specific part, and a fragment identifier. The general format is:

scheme:scheme-specific-part#fragment

If the scheme is omitted, the URI reference is relative. If the fragment identifier is omitted, the URI reference is a pure URI. The URI class has getter methods that return these three parts of each URI object. The getRawFoo() methods return the encoded forms of the parts of the URI, while the equivalent getFoo() methods first decode any percent-escaped characters and then return the decoded part:

public String getScheme()
public String getSchemeSpecificPart()
public String getRawSchemeSpecificPart()
public String getFragment()
public String getRawFragment()

These methods all return null if the particular URI object does not have the relevant component: for example, a relative URI without a scheme or an http URI without a fragment identifier.

A URI that has a scheme is an absolute URI. A URI without a scheme is relative. The isAbsolute() method returns true if the URI is absolute, false if it’s relative:

public boolean isAbsolute()
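The difference between the raw and decoded getters is easiest to see side by side. In this sketch the parts() helper and the two sample URIs are made up for illustration; %20 is an escaped space:

```java
import java.net.URI;

public class URIParts {

  // Returns {scheme, raw scheme-specific part, decoded scheme-specific part,
  // raw fragment, decoded fragment}; missing components are null
  public static String[] parts(String text) {
    URI u = URI.create(text);
    return new String[] {
      u.getScheme(),
      u.getRawSchemeSpecificPart(),
      u.getSchemeSpecificPart(),
      u.getRawFragment(),
      u.getFragment()
    };
  }

  public static void main(String[] args) {
    String[] samples = {"http://www.example.com/a%20b#sec%201", "images/logo.png"};
    for (String sample : samples) {
      System.out.println(sample + " (absolute: "
          + URI.create(sample).isAbsolute() + ")");
      for (String part : parts(sample)) {
        System.out.println("  " + part);
      }
    }
  }
}
```

For the first URI the raw fragment is sec%201 and the decoded fragment is sec 1. For the second, relative URI the scheme and both fragment values are null.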

The details of the scheme-specific part vary depending on the type of the scheme. For example, in a tel URL, the scheme-specific part has the syntax of a telephone number. However, in many useful URIs, including the very common file and http URLs, the scheme-specific part has a particular hierarchical format divided into an authority, a path, and a query string. The authority is further divided into user info, host, and port. The isOpaque() method returns false if the URI is hierarchical, true if it’s not hierarchical—that is, if it’s opaque:

public boolean isOpaque()

If the URI is opaque, all you can get is the scheme, scheme-specific part, and fragment identifier. However, if the URI is hierarchical, there are getter methods for all the different parts of a hierarchical URI:

public String getAuthority()
public String getFragment()
public String getHost()
public String getPath()
public int getPort()
public String getQuery()
public String getUserInfo()

These methods all return the decoded parts; in other words, percent escapes, such as %3C, are changed into the characters they represent, such as <. If you want the raw, encoded parts of the URI, there are five parallel getRawFoo() methods:

public String getRawAuthority()
public String getRawFragment()
public String getRawPath()
public String getRawQuery()
public String getRawUserInfo()

Remember the URI class differs from the URI specification in that non-ASCII characters such as é and ü are never percent escaped in the first place, and thus will still be present in the strings returned by the getRawFoo() methods unless the strings originally used to construct the URI object were encoded.

In the event that the specific URI does not contain this information (for instance, the URI http://www.example.com has no user info, port, or query string), the relevant methods return null. getPort() is the single exception. Since it’s declared to return an int, it can’t return null. Instead, it returns -1 to indicate an omitted port.

For various technical reasons that don’t have a lot of practical impact, Java can’t always initially detect syntax errors in the authority component. The immediate symptom of this failing is normally an inability to return the individual parts of the authority: port, host, and user info. In this event, you can call parseServerAuthority() to force the authority to be reparsed:

public URI parseServerAuthority() throws URISyntaxException

The original URI does not change (URI objects are immutable), but the URI returned will have separate authority parts for user info, host, and port. If the authority cannot be parsed, a URISyntaxException is thrown.

Example 6 uses these methods to split URIs entered on the command line into their component parts. It’s similar to Example 4 but works with any syntactically correct URI, not just the ones Java has a protocol handler for.

import java.net.*;

public class URISplitter {
  public static void main(String[] args) {

    for (int i = 0; i < args.length; i++) {
      try {
        URI u = new URI(args[i]);
        System.out.println("The URI is " + u);
        if (u.isOpaque()) {
          System.out.println("This is an opaque URI.");
          System.out.println("The scheme is " + u.getScheme());
          System.out.println("The scheme specific part is "
              + u.getSchemeSpecificPart());
          System.out.println("The fragment ID is " + u.getFragment());
        } else {
          System.out.println("This is a hierarchical URI.");
          System.out.println("The scheme is " + u.getScheme());
          try {
            u = u.parseServerAuthority();
            System.out.println("The host is " + u.getHost());
            System.out.println("The user info is " + u.getUserInfo());
            System.out.println("The port is " + u.getPort());
          } catch (URISyntaxException ex) {
            // Must be a registry based authority
            System.out.println("The authority is " + u.getAuthority());
          }
          System.out.println("The path is " + u.getPath());
          System.out.println("The query string is " + u.getQuery());
          System.out.println("The fragment ID is " + u.getFragment());
        }
      } catch (URISyntaxException ex) {
        System.err.println(args[i] + " does not seem to be a URI.");
      }
      System.out.println();
    }
  }
}

Here’s the result of running this against three of the URI examples in this section:

% java URISplitter tel:+1-800-9988-9938 \
  http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc \
  urn:isbn:1-565-92870-9
The URI is tel:+1-800-9988-9938
This is an opaque URI.
The scheme is tel
The scheme specific part is +1-800-9988-9938
The fragment ID is null

The URI is http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc
This is a hierarchical URI.
The scheme is http
The host is www.xml.com
The user info is null
The port is -1
The path is /pub/a/2003/09/17/stax.html
The query string is null
The fragment ID is id=_hbc

The URI is urn:isbn:1-565-92870-9
This is an opaque URI.
The scheme is urn
The scheme specific part is isbn:1-565-92870-9
The fragment ID is null

Resolving Relative URIs

The URI class has three methods for converting back and forth between relative and absolute URIs:

public URI resolve(URI uri)
public URI resolve(String uri)
public URI relativize(URI uri)

The resolve() methods compare the uri argument to this URI and use it to construct a new URI object that wraps an absolute URI. For example, consider these three lines of code:

URI absolute = new URI("http://www.example.com/");
URI relative = new URI("images/logo.png");
URI resolved = absolute.resolve(relative);

After they’ve executed, resolved contains the absolute URI http://www.example.com/images/logo.png.

If the invoking URI does not contain an absolute URI itself, the resolve() method resolves as much of the URI as it can and returns a new relative URI object as a result. For example, take these two statements:

URI top = new URI("javafaq/books/");
URI resolved = top.resolve("jnp3/examples/07/index.html");

After they’ve executed, resolved now contains the relative URI javafaq/books/jnp3/examples/07/index.html with no scheme or authority.

It’s also possible to reverse this procedure; that is, to go from an absolute URI to a relative one. The relativize() method creates a new URI object from the uri argument that is relative to the invoking URI. The argument is not changed. For example:

URI absolute = new URI("http://www.example.com/images/logo.png");
URI top = new URI("http://www.example.com/");
URI relative = top.relativize(absolute);

The URI object relative now contains the relative URI images/logo.png.
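A few more cases, sketched with hypothetical resolve() and relativize() wrappers around the real methods, show how the base path is handled. The example.com addresses are placeholders:

```java
import java.net.URI;

public class ResolveDemo {

  // Thin wrappers so the results can be handled as strings
  public static String resolve(String base, String reference) {
    return URI.create(base).resolve(reference).toString();
  }

  public static String relativize(String base, String target) {
    return URI.create(base).relativize(URI.create(target)).toString();
  }

  public static void main(String[] args) {
    String base = "http://www.example.com/javafaq/books/index.html";

    // The last segment of the base path (index.html) is dropped before merging
    System.out.println(resolve(base, "jnp3/examples.html"));
    // .. climbs one directory: http://www.example.com/javafaq/images/logo.png
    System.out.println(resolve(base, "../images/logo.png"));
    // A leading slash replaces the whole path: http://www.example.com/index.html
    System.out.println(resolve(base, "/index.html"));
    // An absolute argument is returned as is
    System.out.println(resolve(base, "ftp://ftp.example.com/pub/"));

    // Prints books/index.html
    System.out.println(relativize("http://www.example.com/javafaq/", base));
    // A target on another host cannot be relativized and comes back unchanged
    System.out.println(relativize("http://www.example.com/javafaq/",
        "http://www.ibiblio.org/javafaq/"));
  }
}
```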

Equality and Comparison

URIs are tested for equality pretty much as you’d expect. It’s not quite direct string comparison. Equal URIs must be either both hierarchical or both opaque. The scheme and authority parts are compared without considering case. That is, http and HTTP are the same scheme, and www.example.com is the same authority as www.EXAMPLE.com. The rest of the URI is case sensitive, except for hexadecimal digits used to escape illegal characters. Escapes are not decoded before comparing, so http://www.example.com/A and http://www.example.com/%41 are unequal URIs.

The hashCode() method is consistent with equals. Equal URIs do have the same hash code and unequal URIs are fairly unlikely to share the same hash code.
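These rules can be checked with a short sketch. The same() helper is a hypothetical convenience, and the URIs are placeholders:

```java
import java.net.URI;

public class URIEquality {

  // Hypothetical helper: parses both strings and compares the URIs
  public static boolean same(String a, String b) {
    return URI.create(a).equals(URI.create(b));
  }

  public static void main(String[] args) {
    // Scheme and host ignore case: true
    System.out.println(same("HTTP://www.EXAMPLE.com/", "http://www.example.com/"));
    // The path is case sensitive: false
    System.out.println(same("http://www.example.com/A", "http://www.example.com/a"));
    // Hex digits in an escape ignore case: true
    System.out.println(same("http://www.example.com/%7e", "http://www.example.com/%7E"));
    // Escapes are not decoded before comparing: false
    System.out.println(same("http://www.example.com/A", "http://www.example.com/%41"));
  }
}
```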

URI implements Comparable, and thus URIs can be ordered. The ordering is based on string comparison of the individual parts, in this sequence:

  1. If the schemes are different, the schemes are compared, without considering case.
  2. Otherwise, if the schemes are the same, a hierarchical URI is considered to be less than an opaque URI with the same scheme.
  3. If both URIs are opaque URIs, they’re ordered according to their scheme-specific parts.
  4. If both the scheme and the opaque scheme-specific parts are equal, the URIs are compared by their fragments.
  5. If both URIs are hierarchical, they’re ordered according to their authority components, which are themselves ordered according to user info, host, and port, in that order. Hosts are case insensitive.
  6. If the schemes and the authorities are equal, the path is used to distinguish them.
  7. If the paths are also equal, the query strings are compared.
  8. If the query strings are equal, the fragments are compared.

URIs are not comparable to any type except themselves. Comparing a URI to anything except another URI causes a ClassCastException.
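As a rough illustration of the ordering, this sketch sorts a handful of sample URIs with Collections.sort(), which relies on compareTo(). The sorted() helper and the sample values are assumptions:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class URISort {

  // Hypothetical helper: parses the strings and returns them in URI order
  public static List<URI> sorted(String... texts) {
    List<URI> uris = new ArrayList<>();
    for (String text : texts) {
      uris.add(URI.create(text));
    }
    Collections.sort(uris);  // uses URI.compareTo()
    return uris;
  }

  public static void main(String[] args) {
    List<URI> result = sorted(
        "urn:isbn:1-565-92870-9",
        "http://www.example.com/b",
        "HTTP://www.example.com/a",
        "mailto:elharo@ibiblio.org",
        "ftp://ftp.example.com/");
    // Schemes sort first, ignoring case (ftp, http, mailto, urn);
    // the two http URIs share an authority, so their paths decide
    for (URI u : result) {
      System.out.println(u);
    }
  }
}
```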

String Representations

Two methods convert URI objects to strings, toString() and toASCIIString():

public String toString()
public String toASCIIString()

The toString() method returns an unencoded string form of the URI (i.e., characters like é and \ are not percent escaped). Therefore, the result of calling this method is not guaranteed to be a syntactically correct URI, though it is in fact a syntactically correct IRI. This form is sometimes useful for display to human beings, but usually not for retrieval.

The toASCIIString() method returns an encoded string form of the URI. Characters like é and \ are always percent escaped whether or not they were originally escaped. This is the string form of the URI you should use most of the time. Even if the form returned by toString() is more legible for humans, they may still copy and paste it into areas that are not expecting an illegal URI. toASCIIString() always returns a syntactically correct URI.
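A short sketch of the difference, using a placeholder URI that contains é. The character is written as the escape \u00e9 so the source file itself stays ASCII, and the ascii() helper is a hypothetical name:

```java
import java.net.URI;

public class URIStrings {

  // Hypothetical helper: the encoded, ASCII-only form of a URI string
  public static String ascii(String text) {
    return URI.create(text).toASCIIString();
  }

  public static void main(String[] args) {
    URI cafe = URI.create("http://www.example.com/caf\u00e9.html");
    // Keeps the é as is; how it displays depends on your console's encoding
    System.out.println(cafe.toString());
    // Prints http://www.example.com/caf%C3%A9.html
    System.out.println(cafe.toASCIIString());
  }
}
```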

x-www-form-urlencoded

One of the challenges faced by the designers of the Web was dealing with the differences between operating systems. These differences can cause problems with URLs: for example, some operating systems allow spaces in filenames; some don’t. Most operating systems won’t complain about a # sign in a filename; but in a URL, a # sign indicates that the filename has ended, and a fragment identifier follows. Other special characters, nonalphanumeric characters, and so on, all of which may have a special meaning inside a URL or on another operating system, present similar problems. Furthermore, Unicode was not yet ubiquitous when the Web was invented, so not all systems could handle characters such as é and 本. To solve these problems, characters used in URLs must come from a fixed subset of ASCII, specifically:

  • The capital letters A–Z
  • The lowercase letters a–z
  • The digits 0–9
  • The punctuation characters - _ . ! ~ * ' ( and )

The characters : / & ? @ # ; $ + = and % may also be used, but only for their specified purposes. If these characters occur as part of a path or query string, they and all other characters should be encoded.

The encoding is very simple. Any characters that are not ASCII numerals, letters, or the punctuation marks specified earlier are converted into bytes and each byte is written as a percent sign followed by two hexadecimal digits. Spaces are a special case because they’re so common. Besides being encoded as %20, they can be encoded as a plus sign (+). The plus sign itself is encoded as %2B. The / # = & and ? characters should be encoded when they are used as part of a name, and not as a separator between parts of the URL.
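To make the byte-by-byte rule concrete, here is a hand-rolled sketch. It is not how any Java class implements the encoding: PercentEncoder is a hypothetical class, it assumes UTF-8 for the conversion to bytes, it assumes the safe punctuation is - _ . ! ~ * ' ( ), and it always writes a space as %20 rather than a plus sign.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncoder {

  // Assumed set of punctuation that may appear unencoded
  private static final String SAFE_PUNCTUATION = "-_.!~*'()";

  public static String encode(String s) {
    StringBuilder out = new StringBuilder();
    // Convert to bytes first; a non-ASCII character becomes several bytes
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      char c = (char) (b & 0xFF);
      boolean safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9') || SAFE_PUNCTUATION.indexOf(c) >= 0;
      if (safe) {
        out.append(c);
      } else {
        // A percent sign followed by two hexadecimal digits
        out.append('%').append(String.format("%02X", b & 0xFF));
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(encode("I/O"));                // I%2FO
    System.out.println(encode("caf\u00e9 au lait"));  // caf%C3%A9%20au%20lait
    System.out.println(encode("a+b"));                // a%2Bb
  }
}
```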

The URL class does not encode or decode automatically. You can construct URL objects that use illegal ASCII and non-ASCII characters and/or percent escapes. Such characters and escapes are not automatically encoded or decoded when output by methods such as getPath() and toExternalForm(). You are responsible for making sure all such characters are properly encoded in the strings used to construct a URL object.

Luckily, Java provides URLEncoder and URLDecoder classes to encode and decode strings in this format.

URLEncoder

To URL encode a string, pass the string and the character set name to the URLEncoder.encode() method. For example:

String encoded = URLEncoder.encode("This*string*has*asterisks", "UTF-8");

URLEncoder.encode() returns a copy of the input string with a few changes. Any nonalphanumeric characters are converted into % sequences (except the space, underscore, hyphen, period, and asterisk characters). It also encodes all non-ASCII characters. The space is converted into a plus sign. This method is a little overaggressive; it also converts tildes, single quotes, exclamation points, and parentheses to percent escapes, even though they don’t absolutely have to be. However, this change isn’t forbidden by the URL specification, so web browsers deal reasonably with these excessively encoded URLs.

Although this method allows you to specify the character set, the only such character set you should ever pick is UTF-8. UTF-8 is compatible with the IRI specification, the URI class, modern web browsers, and more other software than any other encoding you could choose.
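If you are running Java 10 or later, URLEncoder also offers an encode() overload that takes a java.nio.charset.Charset instead of a character set name, so there is no UnsupportedEncodingException to catch. A minimal sketch, assuming such a JDK; the class name is hypothetical:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetEncoderDemo {

  // Encodes with UTF-8 without any checked exception (Java 10 and later)
  public static String encode(String s) {
    return URLEncoder.encode(s, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // \u00e9 is é; prints caf%C3%A9+au+lait
    System.out.println(encode("caf\u00e9 au lait"));
  }
}
```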

Example 7 is a program that uses URLEncoder.encode() to print various encoded strings.

import java.io.*;
import java.net.*;

public class EncoderTest {

  public static void main(String[] args) {

    try {
      System.out.println(URLEncoder.encode("This string has spaces",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This*string*has*asterisks",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This%string%has%percent%signs",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This+string+has+pluses",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This/string/has/slashes",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This\"string\"has\"quote\"marks",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This:string:has:colons",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This~string~has~tildes",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This(string)has(parentheses)",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This.string.has.periods",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This=string=has=equals=signs",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("This&string&has&ampersands",
                                              "UTF-8"));
      System.out.println(URLEncoder.encode("Thiséstringéhasénon-ASCII characters",
                                              "UTF-8"));
    } catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
  }
}

Here is the output (note that the code needs to be saved in something other than ASCII, and the encoding chosen should be passed as an argument to the compiler to account for the non-ASCII characters in the source code):

% javac -encoding UTF8 EncoderTest.java
% java EncoderTest
This+string+has+spaces
This*string*has*asterisks
This%25string%25has%25percent%25signs
This%2Bstring%2Bhas%2Bpluses
This%2Fstring%2Fhas%2Fslashes
This%22string%22has%22quote%22marks
This%3Astring%3Ahas%3Acolons
This%7Estring%7Ehas%7Etildes
This%28string%29has%28parentheses%29
This.string.has.periods
This%3Dstring%3Dhas%3Dequals%3Dsigns
This%26string%26has%26ampersands
This%C3%A9string%C3%A9has%C3%A9non-ASCII+characters

Notice in particular that this method encodes the forward slash, the ampersand, the equals sign, and the colon. It does not attempt to determine how these characters are being used in a URL. Consequently, you have to encode URLs piece by piece rather than encoding an entire URL in one method call. This is an important point, because the most common use of URLEncoder is preparing query strings for communicating with server-side programs that use GET. For example, suppose you want to encode this URL for a Google search:

https://www.google.com/search?hl=en&as_q=Java&as_epq=I/O

This code fragment encodes it:

String query = URLEncoder.encode(
    "https://www.google.com/search?hl=en&as_q=Java&as_epq=I/O", "UTF-8");
System.out.println(query);

Unfortunately, the output is:

https%3A%2F%2Fwww.google.com%2Fsearch%3Fhl%3Den%26as_q%3DJava%26as_epq%3DI%2FO

The problem is that URLEncoder.encode() encodes blindly. It can’t distinguish between special characters used as part of the URL or query string, like / and =, and characters that need to be encoded. Consequently, URLs need to be encoded a piece at a time like this:

String url = "https://www.google.com/search?";
url += URLEncoder.encode("hl", "UTF-8");
url += "=";
url += URLEncoder.encode("en", "UTF-8");
url += "&";
url += URLEncoder.encode("as_q", "UTF-8");
url += "=";
url += URLEncoder.encode("Java", "UTF-8");
url += "&";
url += URLEncoder.encode("as_epq", "UTF-8");
url += "=";
url += URLEncoder.encode("I/O", "UTF-8");

System.out.println(url);

The output of this is what you actually want:

https://www.google.com/search?hl=en&as_q=Java&as_epq=I%2FO

In this case, you could have skipped encoding several of the constant strings such as “Java” because you know from inspection that they don’t contain any characters that need to be encoded. However, in general, these values will be variables, not constants; and you’ll need to encode each piece to be safe.

Example 8 is a QueryString class that uses URLEncoder to encode successive name and value pairs in a Java object, which will be used for sending data to server-side programs. To add name-value pairs, call the add() method, which takes two strings as arguments and encodes them. The getQuery() method returns the accumulated list of encoded name-value pairs.

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryString {

  private StringBuilder query = new StringBuilder();

  public QueryString() {
  }

  public synchronized void add(String name, String value) {
    // Separate pairs with & but don't begin the query string with one
    if (query.length() > 0) {
      query.append('&');
    }
    encode(name, value);
  }

  private synchronized void encode(String name, String value) {
    try {
      query.append(URLEncoder.encode(name, "UTF-8"));
      query.append('=');
      query.append(URLEncoder.encode(value, "UTF-8"));
    } catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
  }

  public synchronized String getQuery() {
    return query.toString();
  }

  @Override
  public String toString() {
    return getQuery();
  }
}

Using this class, we can now encode the previous example:

QueryString qs = new QueryString();
qs.add("hl", "en");
qs.add("as_q", "Java");
qs.add("as_epq", "I/O");
String url = "https://www.google.com/search?" + qs;
System.out.println(url);

URLDecoder

The corresponding URLDecoder class has a static decode() method that decodes strings encoded in x-www-form-urlencoded format. That is, it converts all plus signs to spaces and all percent escapes to their corresponding character:

public static String decode(String s, String encoding)
    throws UnsupportedEncodingException

If you have any doubt about which encoding to use, pick UTF-8. It’s more likely to be correct than anything else.

An IllegalArgumentException should be thrown if the string contains a percent sign that isn’t followed by two hexadecimal digits or decodes into an illegal sequence.

Since URLDecoder does not touch unescaped characters other than the plus sign, you can pass an entire URL to it rather than splitting it into pieces first. For example:

String input = "https://www.google.com/" +
    "search?hl=en&as_q=Java&as_epq=I%2FO";
String output = URLDecoder.decode(input, "UTF-8");
System.out.println(output);
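The following sketch exercises both the normal case and the malformed-escape case. DecoderTest and its two helpers are hypothetical names, and the input strings are made up:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecoderTest {

  // Decodes as UTF-8, hiding the checked exception that UTF-8 never triggers
  public static String decode(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
  }

  // True if the decoder rejects the string as malformed
  public static boolean isMalformed(String s) {
    try {
      decode(s);
      return false;
    } catch (IllegalArgumentException ex) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(decode("I%2FO"));              // I/O
    System.out.println(decode("caf%C3%A9+au+lait"));  // plus signs become spaces
    System.out.println(isMalformed("100%"));          // true: incomplete escape
    System.out.println(isMalformed("%zz"));           // true: not hexadecimal
    System.out.println(isMalformed("100%25"));        // false: decodes to 100%
  }
}
```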

Reference