A Guide to URLs
This document is intended to describe Uniform Resource
Locators, widely used on the World Wide Web and other media for referencing
documents. This document was written to be an understandable, comprehensive,
and accurate resource on URLs. However, some information may become obsolete,
as this document will not be updated to keep pace with developments beyond
1996.
A URL is a Uniform Resource Locator, a standard way developed to specify
the location of a resource available electronically. URLs are defined by
RFC 1738, which you can consult for more definitive, technical information.
URLs make it possible to direct both people and software applications to
a variety of information, available from a number of different Internet protocols.
Most commonly, you will run into URLs when using a World Wide Web (WWW) client,
as that medium uses URLs to link WWW pages together. In your WWW browser's
"location" box, the item that generally starts with "http:" is a URL. Files
available over protocols besides HTTP, such as FTP and Gopher, can also be
referenced by URLs. Even Telnet sessions to remote hosts on the
Internet and someone's Internet e-mail address can be referred to by a URL.
A URL is like your complete mailing address: it specifies all the
information necessary for someone to address an envelope to you. However,
URLs are much more than that, since they can refer to a variety of very
different types of resources. A more fitting analogy would be a system
for specifying your mailing address, your phone number, or the location
of the book you just read from the public library, all in the same
format.
In short, a URL is a very convenient and succinct way to direct people
and applications to a file or other electronic resource. Learning how to
interpret, use, and construct URLs will greatly assist your exploration of
the Internet.
URLs have a very specific syntax, as defined by RFC 1738.
They all follow the format:
<scheme>:<scheme-dependent-information>
Examples of various schemes are "http", "gopher", "ftp", and
"news". These schemes and others are explained below. The scheme tells you or the
application using the URL what type of resource we are trying to reach and/or what
mechanism to use to obtain that resource.
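As a minimal sketch of this general format, the following Python splits a URL at its first colon into the scheme and the scheme-dependent information. The helper name split_scheme is my own invention for illustration, and the lower-casing reflects RFC 1738's advice that programs treat upper-case scheme names as equivalent to lower-case ones.

```python
def split_scheme(url: str) -> tuple[str, str]:
    """Split a URL into (scheme, scheme-dependent information)."""
    # Everything before the first colon is the scheme.
    scheme, colon, rest = url.partition(":")
    if not colon or not scheme:
        raise ValueError(f"no scheme found in {url!r}")
    # Normalize the scheme to lower case before comparing it to known schemes.
    return scheme.lower(), rest


print(split_scheme("http://www.netspace.org/users/dwb/url-guide.html"))
print(split_scheme("mailto:dwb@netspace.org"))
```

Nothing here interprets the second half; how to read it depends entirely on the scheme.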
The scheme dependent information is detailed below with each separate
scheme. However, most schemes include two different types of information: the
Internet machine making the file available and the "path" to that file. With
these types of schemes, we generally see the scheme separated from the Internet
address of the machine with two slashes (//), and then the Internet address
separated from the full path to the file with one slash (/). FTP, HTTP, and Gopher
URLs generally appear in this fashion:
scheme://machine.domain/full-path-of-file
As an exercise, let's look at this file's URL:
http://www.netspace.org/users/dwb/url-guide.html
The scheme for this URL is "http" for the HyperText Transfer Protocol. The
Internet address of the machine is "www.netspace.org", and the path to the file
is "users/dwb/url-guide.html". When working with the WWW, most URLs will
closely resemble this one in overall structure.
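The same exercise can be done by software. This is a small sketch using the urlsplit function from Python's standard urllib.parse module; note that Python reports the path with its leading slash, whereas the description above leaves the separating slash out.

```python
from urllib.parse import urlsplit

# Break this guide's URL into the pieces discussed above.
parts = urlsplit("http://www.netspace.org/users/dwb/url-guide.html")

print(parts.scheme)  # http
print(parts.netloc)  # www.netspace.org
print(parts.path)    # /users/dwb/url-guide.html
```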
Note that when using FTP, HTTP, and Gopher URLs, the "full-path-of-file" will
sometimes end in a slash. This indicates that the URL is pointing not to a specific
file, but a directory. In this case, the server generally returns the "default
index" of that directory. This might be just a listing of the files available
within that directory, or a default file that the server automatically looks for
in the directory. With HTTP servers, this default index file is generally called
"index.html", but is frequently seen as "homepage.html", "home.html",
"welcome.html", or "default.html".
When you encounter a URL, you will need to know what to do with it. Some
systems and applications allow you to just double-click on the URL, and if
your machine is properly configured, the appropriate client will be launched
and will obtain the resource. Other times, you can just copy the URL, and then
paste it into the application which you use to get to the resource. For instance,
most WWW browsers have a "location" or "go to" box, in which you can paste a URL,
hit return, and you will go to that object.
However, sometimes your clients might not support a certain URL scheme, and
you will have to manually decode it for yourself. In these instances, first start
with the scheme, and look at the descriptions provided below. For instance, if I
come across a URL that starts with "mailto:", I look below and find that this
indicates an Internet email address. Next, figure out what application you should
use to utilize this URL. In this example, I use Eudora to email people, so I
launch that application. Then, use the scheme description to determine what
information the scheme-dependent portion of the URL is providing. In this case,
the Mailto URL lists an email address of "dwb@netspace.org". Lastly,
figure out how to give this information to your client. For Eudora, I know that
I should create a new message, and then fill in "dwb@netspace.org" within
the To: mail field.
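The manual decoding steps for a Mailto URL can be sketched in a few lines of Python. The function name mailto_address is hypothetical; the sketch only pulls out the address and leaves it to you to hand that address to your mail program.

```python
from urllib.parse import unquote, urlsplit


def mailto_address(url: str) -> str:
    """Return the email address carried by a mailto: URL."""
    parts = urlsplit(url)
    # Step 1: check the scheme to learn what kind of resource this is.
    if parts.scheme != "mailto":
        raise ValueError(f"not a mailto URL: {url!r}")
    # Step 2: the scheme-dependent part is the address; undo any escape sequences.
    return unquote(parts.path)


print(mailto_address("mailto:dwb@netspace.org"))
```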
Constructing a URL can be tricky. Often, a client will show you the
URL of the document to which you wish to refer people. For example, most WWW
browsers display the current document's URL within the "location" box. In these
cases, merely copying down the information shown should be sufficient. However,
if this information is not available to you, you will have to construct the URL
for yourself. Basically, the trick is to work backwards from the process previously
mentioned for interpreting URLs.
For example, say that I want to tell someone how to obtain the Mac WWW browser
created by the Netscape Communications Corporation. I've obtained the browser
for myself by using my favorite FTP client. I used that client to contact the
site "ftp.mcom.com". I then changed to the "netscape" directory, and then went
into the "mac" directory. Finally, I got the file called "netscape.sea.hqx".
Looking at the description of the FTP scheme, I construct the following URL:
ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx
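Working backwards like this is easy to mechanize. The sketch below assembles an FTP URL from the site, the directories visited, and the file name; build_ftp_url is a hypothetical helper of my own, and it escapes each piece with the standard urllib.parse.quote so that a slash or other special character inside a name cannot be mistaken for URL syntax.

```python
from urllib.parse import quote


def build_ftp_url(host: str, directories: list[str], filename: str) -> str:
    """Build an anonymous FTP URL from the steps taken in an FTP client."""
    # Escape every directory and the file name separately, then join with slashes.
    segments = [quote(segment, safe="") for segment in [*directories, filename]]
    return f"ftp://{host}/" + "/".join(segments)


print(build_ftp_url("ftp.mcom.com", ["netscape", "mac"], "netscape.sea.hqx"))
```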
Now, when presenting this URL to people, there is a general syntax that ought
to be used to avoid confusion. Many people place the URL on its own line, separated
from text below and above by whitespace. Most people consider this sufficient.
However, a more precise syntax is recommended by RFC 1738, and it distinguishes
URLs from other Uniform Resource Identifiers (URIs). (URLs are a subset of URIs.)
This syntax is to preface the
URL by "<URL:" and terminate it with ">". Thus, when sending the above URL to
obtain Netscape for Mac via email, I use the syntax:
<URL:ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx>
Other recommendations exist, but this is the format used within this document and the
RFC which defines URLs.
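A short sketch of this wrapper convention follows. The names wrap_url and extract_urls are hypothetical, and the pattern is a simple one of my own rather than anything the RFC prescribes; it discards whitespace found inside the brackets, since long URLs are often broken across lines in email.

```python
import re

# Matches "<URL:" up to the next ">", including any line breaks in between.
URL_WRAPPER = re.compile(r"<URL:([^>]*)>")


def wrap_url(url: str) -> str:
    """Present a URL in the <URL:...> form."""
    return f"<URL:{url}>"


def extract_urls(text: str) -> list[str]:
    """Find wrapped URLs in text, dropping whitespace introduced by line breaks."""
    return ["".join(match.split()) for match in URL_WRAPPER.findall(text)]


message = "Get it from <URL:ftp://ftp.mcom.com/netscape/mac/\nnetscape.sea.hqx> today."
print(extract_urls(message))
```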
Also note, when constructing URLs, that certain characters are reserved or unsafe.
To use these characters, you will need to encode them with "escape sequences." These
sequences are listed in the section entitled Appendix A: Escape Sequences.
Sometimes you will try to follow a URL to its destination and won't meet with
success. If the remote machine refuses the connection, it's quite possible that
the site is very busy, and many popular sites can't be contacted during peak
hours. If the file can't be found, check how you spelled the URL to ensure that
you are correctly specifying it. Try removing the file name, and referencing the
directory in which the file supposedly resides. Perhaps the file was misspelled
when given to you, and you might find the correct spelling in the index. If that
doesn't work, maybe the file was moved, and you could try looking up the hierarchy
by sequentially removing the last directory in the path listed until you come to
the root directory for that site's server. Hopefully, you will find the file on
your own without having to ask the person who directed you there for help.
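The "look up the hierarchy" step can be sketched as a function that lists each parent directory in turn, ending at the root of the server. The name parent_urls is hypothetical, and the sketch only produces the URLs to try; it does not contact any server.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url: str) -> list[str]:
    """List the directory URLs above a file, nearest first, ending at the root."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    parents = []
    while segments:
        # Drop the last file or directory name and rebuild the path.
        segments.pop()
        path = "/" + "".join(segment + "/" for segment in segments)
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


for candidate in parent_urls("ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"):
    print(candidate)
```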
Below are listed the most common URL schemes and their syntax. Generally, you will
only run across the HTTP, FTP, Gopher, News, and Mailto schemes, but the others
are included for completeness.
HTTP is the Internet protocol specifically designed for use with the World Wide
Web, and thus will be the most common scheme you are likely to use. Its syntax
is:
http://<host>:<port>/<path>?<searchpart>
The host is the Internet address of the WWW server, and the
port is the port number to connect to. In most cases, the port
can be omitted (along with the preceding colon), and it defaults to the standard
"80". The path tells the WWW server which file you want, and if omitted,
indicates that you want the "home page" for the system. The searchpart
may be used to pass information to the server, often to an executable CGI script,
but for most WWW documents is not used. Generally, this part of the URL is omitted,
along with the preceding question-mark.
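To make the optional parts concrete, here is a sketch that fills in the defaults described above. The function http_components is hypothetical, and the second URL in the example (www.example.com with port 8080 and a searchpart) is made up purely for illustration.

```python
from urllib.parse import urlsplit


def http_components(url: str) -> dict[str, object]:
    """Split an HTTP URL into host, port, path, and searchpart."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # An omitted port defaults to the standard 80.
        "port": parts.port if parts.port is not None else 80,
        # An omitted path asks for the system's home page.
        "path": parts.path or "/",
        # The searchpart is usually absent.
        "searchpart": parts.query or None,
    }


print(http_components("http://www.netspace.org/users/dwb/url-guide.html"))
print(http_components("http://www.example.com:8080/cgi-bin/search?topic=urls"))
```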
Another character that may be frequently encountered when browsing the WWW is the
pound sign (#), which can be used to point to a named anchor. An author of an HTML
document can allow browsers to point to a specific section of a document by
creating a named anchor within that document. Then, a URL with a pound sign and the
anchor's name appended will reference that specific section. Named anchors are used
throughout this document, and as an example, the following URL points directly to
the section "What are URLs?":
http://www.netspace.org/users/dwb/url-guide.html#what
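Separating the anchor name from the rest of the URL is a one-line job with the standard urllib.parse.urldefrag function, as this brief sketch shows.

```python
from urllib.parse import urldefrag

# Split the URL at the pound sign into the document URL and the anchor name.
document, anchor = urldefrag("http://www.netspace.org/users/dwb/url-guide.html#what")

print(document)  # http://www.netspace.org/users/dwb/url-guide.html
print(anchor)    # what
```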
FTP is a well-used means for transmitting files over the Internet. While there are
many advantages to using HTTP instead, many systems don't offer full support of HTTP
and clients are not as well developed as they are for FTP. Thus, many times files
are distributed via FTP. Its syntax is:
ftp://<user>:<password>@<host>:<port>/<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>
If contacting a site which provides general FTP access, the user and
password can be omitted, including the colon between them and the at-symbol
afterwards. The host is the Internet address of the FTP site. The
port and its preceding colon can be omitted as well. The portion of
"<cwd1>/<cwd2>/.../<cwdN>" refers to the
series of "change directory" commands a client must use to move
to the directory in which the desired file resides. The name is the
file name of the desired file. The construction ";type=<typecode>" allows
for a transmission method (e.g. ascii vs. binary) to be specified, but I haven't found
any clients which support this syntax, and in fact, most incorrectly assume that it
is part of the filename. For now, avoid using the typecode.
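The following sketch takes an FTP URL apart along the lines of this syntax. The function ftp_components is hypothetical; the default port of 21 is the standard FTP port rather than something stated above; and the second example URL (guest, secret, ftp.example.com) is invented for illustration.

```python
from urllib.parse import unquote, urlsplit


def ftp_components(url: str) -> dict[str, object]:
    """Split an FTP URL into the parts named in the syntax above."""
    parts = urlsplit(url)
    # Separate an optional ";type=<typecode>" suffix from the path.
    path, _, typecode = parts.path.partition(";type=")
    # The slash after the host is only a separator; the last segment is the name.
    *cwds, name = path.removeprefix("/").split("/")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port if parts.port is not None else 21,
        "cwds": [unquote(cwd) for cwd in cwds],
        "name": unquote(name),
        "typecode": typecode or None,
    }


print(ftp_components("ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"))
print(ftp_components("ftp://guest:secret@ftp.example.com:2121/pub/readme.txt;type=a"))
```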
The Gopher protocol syntax is very similar to FTP and HTTP:
gopher://<host>:<port>/<gopher-path>
The host indicates the Internet address of the Gopher server, while
the port, as in the previous cases, can generally be omitted along with
its preceding colon. The gopher-path specifies the type of Gopher
resource, a selector string, and perhaps other information. A detailed discussion
of Gopher queries is not within the scope of this document, but generally you
can determine a document's gopher-path from information provided by
your browser.
The Mailto URL scheme is different from the previous three schemes, and it does
not identify a file available over the Internet, but rather the email address of
someone that can be reached via the Internet. The syntax is:
mailto:<account@site>
The account@site is the Internet email address of the person you wish
to contact, as defined by RFC 822. Note that
when encoded in WWW documents, some WWW browsers may not understand the Mailto scheme.
Support for Mailto is increasing, but for now, one can switch to a different
browser or interpret the Mailto URL manually.
The News URL scheme allows for the referencing of Usenet newsgroups or specific
articles. The syntax is either of the following:
news:<newsgroup-name>
news:<message-id>
The newsgroup-name is the Usenet newsgroup name (e.g.
comp.infosystems.www.providers) and generally will tell the browser to retrieve
the titles of all the available articles within that newsgroup. If the
newsgroup-name is "*", the URL refers to "all available newsgroups."
The message-id corresponds to the Message-ID of the specific article
to obtain, and can be found within the article's header information.
Note that the News URL does not specify how a client is to obtain this
information. A client must be properly configured to know where to obtain Usenet
newsgroups and articles, generally from a specific NNTP server.
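Telling the two forms of News URL apart can be sketched as follows. This assumes, as RFC 1738 specifies, that a message-id always contains an at-sign while a newsgroup name never does; the function name describe_news_url and the sample message-id are hypothetical.

```python
def describe_news_url(url: str) -> tuple[str, str]:
    """Classify a news: URL as all newsgroups, one newsgroup, or one article."""
    scheme, _, rest = url.partition(":")
    if scheme.lower() != "news":
        raise ValueError(f"not a news URL: {url!r}")
    if rest == "*":
        return ("all newsgroups", rest)
    # Assumption: only a message-id contains an at-sign.
    if "@" in rest:
        return ("article", rest)
    return ("newsgroup", rest)


print(describe_news_url("news:comp.infosystems.www.providers"))
print(describe_news_url("news:12345@example.com"))
```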
The Telnet URL designates an interactive session to a remote host on the Internet
via the Telnet protocol. Its syntax is:
telnet://<user>:<password>@<host>:<port>/
The user and password tokens can be omitted, and are
included only for advisory purposes. The host refers to the site to
connect to, and port can be omitted, defaulting to the standard "23".
The TN3270 URL scheme is for telnetting to systems which require 3270 terminal
emulation, such as IBM mainframes. This is not a scheme defined by RFC 1738,
but is a proposed addition. It is almost identical to the Telnet URL, and has the
syntax:
tn3270://<user>:<password>@<host>:<port>/
The WAIS URL refers to WAIS databases, searches, or documents on a WAIS database.
The WAIS URL scheme has one of the three following forms:
wais://<host>:<port>/<database>
wais://<host>:<port>/<database>?<search>
wais://<host>:<port>/<database>/<wtype>/<wpath>
The host and port (which can be omitted) are the
same constructs as in the previously described schemes. The first syntax indicates a
specific WAIS database, the second a particular search, and the third a specific
document.
The File URL scheme indicates a file which can be obtained by the client machine.
In many sources, this scheme is confused with the FTP scheme. FTP refers to a
specific protocol for file transmission, while the File URL leaves the retrieval
method up to the client, which in some circumstances might be the FTP
protocol. When the file is intended to be obtained via FTP, I recommend using the
FTP URL scheme instead. The syntax for the File scheme is:
file://<host>/<path>
The host is the fully qualified domain name of the system, and the
path is the hierarchical directory path of the form
"directory/directory/.../filename". The host can be left as an empty string
or "localhost" to refer to local files on the client on which the URL is being
interpreted.
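A client can apply the empty-host and "localhost" rule with a check like the one sketched here. The function name is_local_file_url and the sample paths are hypothetical.

```python
from urllib.parse import urlsplit


def is_local_file_url(url: str) -> bool:
    """Report whether a file: URL refers to a file on the interpreting machine."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"not a file URL: {url!r}")
    # An empty host is reported as None; "localhost" means the same thing.
    return parts.hostname in (None, "localhost")


print(is_local_file_url("file:///etc/motd"))
print(is_local_file_url("file://localhost/etc/motd"))
print(is_local_file_url("file://www.netspace.org/users/dwb/url-guide.html"))
```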
The NNTP URL scheme is an alternative method to the News scheme for referencing
Usenet articles and newsgroups. It has the syntax of:
nntp://<host>:<port>/<newsgroup-name>/<article-number>
The items within this syntax are all as described in previous schemes. Generally,
it is better to use the News scheme and trust that the client knows how to
obtain Usenet items. The NNTP scheme specifies that the NNTP protocol is used, and
also specifies a specific NNTP server, designated by the host, to be
used; most NNTP servers do not provide universal access. Thus, use News
whenever possible.
The Prospero URL scheme allows resources available via the Prospero Directory
Service to be designated. It has the syntax:
prospero://<host>:<port>/<hsoname>;<field>=<value>
See Neuman, B., and S. Augart, "The Prospero Protocol", USC/Information
Sciences Institute, June 1993, <URL:ftp://prospero.isi.edu/pub/prospero/doc/prospero-protocol.PS.Z> for
information on Prospero.
Certain characters are either determined to be unsafe or
reserved, and thus may need to be encoded by escape sequences before
a URL can be specified. These escape sequences consist of a percent sign (%)
followed by the two-digit hexadecimal value of the character's US-ASCII code.
Unsafe characters are designated as such
for a variety of reasons. All unsafe characters should be encoded when
constructing a URL, and these characters with their escape sequences are
listed below:
SPACE %20
< %3C
> %3E
# %23
% %25
{ %7B
} %7D
| %7C
\ %5C
^ %5E
~ %7E
[ %5B
] %5D
` %60
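The table above can be applied mechanically. This sketch encodes exactly the unsafe characters listed in this guide and nothing else; the names UNSAFE and escape_unsafe are hypothetical. It should only be applied to raw text, since running it on an already-encoded string would encode the percent signs a second time.

```python
# The unsafe characters from the table above, starting with the space.
UNSAFE = " <>#%{}|\\^~[]`"


def escape_unsafe(text: str) -> str:
    """Replace each unsafe character with '%' plus its two-digit hex ASCII code."""
    return "".join(
        f"%{ord(ch):02X}" if ch in UNSAFE else ch
        for ch in text
    )


print(escape_unsafe("my file~1.txt"))
```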
Reserved characters are characters which have special meaning within
specific schemes, and must be encoded when used in such schemes if they are to be
used for a purpose other than that meaning. The escape sequences are listed
below:
; %3B
/ %2F
? %3F
: %3A
@ %40
= %3D
& %26
Thus, the tilde (~) which designates the directory within which this document
resides should be encoded to produce the URL:
http://www.netspace.org/users/dwb/url-guide.html
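Reserved characters can be handled with the quote and unquote functions from Python's standard urllib.parse module, as sketched below. The file name is made up for illustration: it contains a slash and a question mark that are part of the name itself, so both must be encoded. Recent Python versions leave the tilde alone, because later specifications treat it as safe.

```python
from urllib.parse import quote, unquote

# A hypothetical file name containing reserved characters used literally.
filename = "notes/1996?.txt"

# safe="" tells quote to encode the slash too, which it would otherwise keep.
encoded = quote(filename, safe="")
decoded = unquote(encoded)

print(encoded)  # notes%2F1996%3F.txt
print(decoded)  # notes/1996?.txt
```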
As was previously mentioned, the portion of the FTP URL of the syntax
"<cwd1>/<cwd2>/.../<cwdN>", as specified by RFC 1738,
refers to the series of "change directory" commands a client must use to move
to the directory in which the file desired resides. However, many clients assume
that the FTP server is UNIX-compatible, and issue a single change directory
command or attempt to retrieve the file of "path/filename".
For some FTP servers, such as many VM systems, this is an incorrect assumption
and these clients may be unable to retrieve such files.
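The difference between the two behaviors can be sketched by listing the FTP commands each one would send. CWD and RETR are the standard FTP commands for changing directory and retrieving a file; the function names are hypothetical, and the sketch only builds the command strings without connecting anywhere.

```python
from urllib.parse import unquote, urlsplit


def ftp_command_sequence(url: str) -> list[str]:
    """Commands for a client following RFC 1738: one CWD per directory, then RETR."""
    path = urlsplit(url).path.removeprefix("/")
    *cwds, name = path.split("/")
    return [f"CWD {unquote(cwd)}" for cwd in cwds] + [f"RETR {unquote(name)}"]


def single_path_shortcut(url: str) -> list[str]:
    """Commands for a client assuming a UNIX-compatible server: one combined path."""
    path = urlsplit(url).path.removeprefix("/")
    return [f"RETR {unquote(path)}"]


url = "ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"
print(ftp_command_sequence(url))
print(single_path_shortcut(url))
```

On a server whose directory names are not joined by slashes, only the first sequence is meaningful.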
Here are a variety of resources available on some of the topics discussed:
- Uniform Resource Locators (RFC 1738)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1738.html>
- Universal Resource Identifiers (RFC 1630)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1630.html>
- HyperText Transfer Protocol
  <URL:http://www.w3.org/hypertext/WWW/Protocols/Overview.html>
- File Transfer Protocol (RFC 959)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc959.html>
- Gopher Protocol (RFC 1436)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1436.html>
- Internet Text Messages (RFC 822)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc822.html>
- Usenet Messages (RFC 1036)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1036.html>
- Network News Transfer Protocol (RFC 977)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc977.html>
- WAIS over Z39.50-1988 (RFC 1625)
  <URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1625.html>
- The Prospero Protocol
  <URL:ftp://prospero.isi.edu/pub/prospero/doc/prospero-protocol.PS.Z>
Copyright 1996, David W. Baker.