A Guide to URLs

Courtesy of David W. Baker
1996

This document is intended to describe Uniform Resource Locators, widely used on the World Wide Web and other media for referencing documents. This document was written to be an understandable, comprehensive, and accurate resource on URLs. However, some information may become obsolete, as this document will not be updated to keep pace with developments beyond 1996.

What are URLs?

A URL is a Uniform Resource Locator, a standard way developed to specify the location of a resource available electronically. URLs are defined by RFC 1738, to which you can look for more definitive, technical information.

URLs make it possible to direct both people and software applications to a variety of information, available through a number of different Internet protocols. Most commonly, you will run into URLs when using a World Wide Web (WWW) client, as that medium uses URLs to link WWW pages together. In your WWW browser's "location" box, the item that generally starts with "http:" is a URL. Files available over protocols besides HTTP, such as FTP and Gopher, can also be referenced by URLs. Even Telnet sessions to remote hosts on the Internet and someone's Internet e-mail address can be referred to by a URL.

A URL is like your complete mailing address: it specifies all the information necessary for someone to address an envelope to you. However, they are much more than that, since URLs can refer to a variety of very different types of resources. A more fitting analogy would be a system for specifying your mailing address, your phone number, or the location of the book you just read from the public library, all in the same format.

In short, a URL is a very convenient and succinct way to direct people and applications to a file or other electronic resource. Learning how to interpret, use, and construct URLs will greatly assist your exploration of the Internet.

General URL Syntax

URLs have a very specific syntax, as defined by RFC 1738. They all follow the format of:

<scheme>:<scheme-dependent-information>

Examples of various schemes are "http", "gopher", "ftp", and "news". These schemes and others are explained below. The scheme tells you or the application using the URL what type of resource we are trying to reach and/or what mechanism to use to obtain that resource.
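As a rough illustration of the general syntax, the following Python sketch splits a URL at its first colon into the scheme and the scheme-dependent information. The function name split_scheme is made up for this example, and lower-casing the scheme is a convenience assumed here, not something this guide prescribes.

```python
def split_scheme(url):
    """Split a URL into (scheme, scheme-dependent information).

    Hypothetical helper: the split happens at the first colon only,
    so colons later in the URL (such as a port separator) are kept.
    """
    scheme, separator, rest = url.partition(":")
    if not separator:
        raise ValueError("no scheme found in %r" % url)
    # Assumption: treat the scheme as case-insensitive.
    return scheme.lower(), rest


print(split_scheme("http://www.netspace.org/users/dwb/url-guide.html"))
print(split_scheme("mailto:dwb@netspace.org"))
```

Everything after the colon is left untouched, since how to read it depends entirely on the scheme.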

The scheme dependent information is detailed below with each separate scheme. However, most schemes include two different types of information: the Internet machine making the file available and the "path" to that file. With these types of schemes, we generally see the scheme separated from the Internet address of the machine with two slashes (//), and then the Internet address separated from the full path to the file with one slash (/). FTP, HTTP, and Gopher URLs generally appear in this fashion:

scheme://machine.domain/full-path-of-file

As an exercise, let's look at this file's URL:

http://www.netspace.org/users/dwb/url-guide.html

The scheme for this URL is "http" for the HyperText Transfer Protocol. The Internet address of the machine is "www.netspace.org", and the path to the file is "users/dwb/url-guide.html". When working with the WWW, most URLs will appear very similar to this one in overall structure.
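The same exercise can be carried out in software. Here is a minimal sketch using Python's standard urllib.parse module; the helper name describe is an assumption of this example. Note that the library reports the path with its leading slash, whereas the description above leaves the separator out.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (scheme, machine, path) for a scheme://machine/path URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path


scheme, machine, path = describe("http://www.netspace.org/users/dwb/url-guide.html")
print(scheme)   # http
print(machine)  # www.netspace.org
print(path)     # /users/dwb/url-guide.html
```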

Note that when using FTP, HTTP, and Gopher URLs, the "full-path-of-file" will sometimes end in a slash. This indicates that the URL is pointing not to a specific file, but a directory. In this case, the server generally returns the "default index" of that directory. This might be just a listing of the files available within that directory, or a default file that the server automatically looks for in the directory. With HTTP servers, this default index file is generally called "index.html", but is frequently seen as "homepage.html", "home.html", "welcome.html", or "default.html".

Using URLs

When you encounter a URL, you will need to know what to do with it. Some systems and applications allow you to just double-click on the URL, and if your machine is properly configured, the appropriate client will be launched and will obtain the resource. Other times, you can just copy the URL, and then paste it into the application which you use to get to the resource. For instance, most WWW browsers have a "location" or "go to" box, in which you can paste a URL, hit return, and you will go to that object.

However, sometimes your clients might not support a certain URL scheme, and you will have to manually decode it for yourself. In these instances, first start with the scheme, and look at the descriptions provided below. For instance, if I come across a URL that starts with "mailto:", I look below and find that this indicates an Internet email address. Next, figure out what application you should use to utilize this URL. In this example, I use Eudora to email people, so I launch that application. Then, use the scheme description to determine what information the scheme-dependent portion of the URL is providing. In this case, the Mailto URL lists an email address of "dwb@netspace.org". Lastly, figure out how to give this information to your client. For Eudora, I know that I should create a new message, and then fill in "dwb@netspace.org" within the To: mail field.

Constructing URLs

Constructing a URL can be tricky. Often, a client will show you the URL of the document to which you wish to refer people. For example, most WWW browsers display the current document's URL within the "location" box. In these cases, merely copying down the information shown should be sufficient. However, if this information is not available to you, you will have to construct the URL for yourself. Basically, the trick is to work backwards from the process previously mentioned for interpreting URLs.

For example, say that I want to tell someone how to obtain the Mac WWW browser created by the Netscape Communications Corporation. I've obtained the browser for myself by using my favorite FTP client. I used that client to contact the site "ftp.mcom.com". I then changed to the "netscape" directory, and then went into the "mac" directory. Finally, I got the file called "netscape.sea.hqx". Looking at the description of the FTP scheme, I construct the following URL:

ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx
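The construction just described can be sketched in Python as follows. The function build_ftp_url is a hypothetical helper, and escaping each piece with urllib.parse.quote is an assumption added here so that names containing unsafe or reserved characters would still produce a usable URL.

```python
from urllib.parse import quote


def build_ftp_url(host, directories, filename):
    """Assemble an FTP URL from a site, a list of directories, and a file name.

    Hypothetical helper: each directory and the file name is escaped
    separately so that a "/" inside a name is not mistaken for a separator.
    """
    segments = [quote(segment, safe="") for segment in list(directories) + [filename]]
    return "ftp://" + host + "/" + "/".join(segments)


url = build_ftp_url("ftp.mcom.com", ["netscape", "mac"], "netscape.sea.hqx")
print(url)  # ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx
```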

Now, when presenting this URL to people, there is a general syntax that ought to be used to avoid confusion. Many people place the URL on its own line, separated from text below and above by whitespace. Most people consider this sufficient. However, a more precise syntax is recommended by RFC 1738, and distinguishes URLs from other Uniform Resource Identifiers (URIs). (URLs are a subset of URIs.) This syntax is to preface the URL by "<URL:" and terminate it with ">". Thus, when sending the above URL to obtain Netscape for Mac via email, I use the syntax:

<URL:ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx>

Other recommendations exist, but this is the format used within this document and the RFC which defines URLs.
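One benefit of the "<URL:" wrapper is that software can pick URLs out of ordinary text. The Python sketch below shows one way this might be done; the names wrap_url and extract_urls are invented for this example. It assumes, following the recommendation in RFC 1738, that whitespace inside the wrapper exists only to break a long URL across lines and can be discarded.

```python
import re

# Matches "<URL:" followed by anything up to the closing ">".
URL_WRAPPER = re.compile(r"<URL:([^>]*)>")


def wrap_url(url):
    """Present a URL in the <URL:...> form."""
    return "<URL:" + url + ">"


def extract_urls(text):
    """Find wrapped URLs in text, dropping any whitespace inside the wrapper."""
    return ["".join(match.split()) for match in URL_WRAPPER.findall(text)]


message = "Get it from <URL:ftp://ftp.mcom.com/netscape/mac/\nnetscape.sea.hqx> today."
print(extract_urls(message))
print(wrap_url("ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"))
```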

Also note, when constructing URLs, that certain characters are reserved or unsafe. To use these characters, you will need to encode them with "escape sequences." These sequences are mentioned in the section entitled Appendix A: Escape Sequences.

Troubleshooting URLs

Sometimes you will try to follow a URL to its destination and won't meet with success. If the remote machine refuses the connection, it's quite possible that the site is very busy, and many popular sites can't be contacted during peak hours. If the file can't be found, check how you spelled the URL to ensure that you are correctly specifying it. Try removing the file name, and referencing the directory in which the file supposedly resides. Perhaps the file was misspelled when given to you, and you might find the correct spelling in the index. If that doesn't work, maybe the file was moved, and you could try looking up the hierarchy by sequentially removing the last directory in the path listed until you come to the root directory for that site's server. Hopefully, you will find the file on your own without having to ask the person who directed you there for help.
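The "looking up the hierarchy" step can be written down mechanically. This Python sketch, with the made-up function name parent_urls, lists the URLs one would try in order: first the directory holding the file, then each parent directory, ending at the root of the server. It only builds the candidate URLs and makes no attempt to fetch them.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """List the directory URLs above a file, nearest first, ending at the root."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    candidates = []
    while segments:
        segments.pop()  # drop the file name, then one directory per pass
        path = "/" + "/".join(segments)
        if segments:
            path += "/"  # a trailing slash marks a directory
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in parent_urls("ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"):
    print(candidate)
```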

The URL Schemes

Below are listed the most common URL schemes and their syntax. Generally, you will only run across the HTTP, FTP, Gopher, News, and Mailto schemes, but the others are included for completeness.

HyperText Transfer Protocol (HTTP)

HTTP is the Internet protocol specifically designed for use with the World Wide Web, and thus will be the most common scheme you are likely to use. Its syntax is:

http://<host>:<port>/<path>?<searchpart>

The host is the Internet address of the WWW server, and the port is the port number to connect to. In most cases, the port can be omitted (along with the preceding colon), and it defaults to the standard "80". The path tells the WWW server which file you want, and if omitted, indicates that you want the "home page" for the system. The searchpart may be used to pass information to the server, often to an executable CGI script, but for most WWW documents is not used. Generally, this part of the URL is omitted, along with the preceding question-mark.
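To make the optional parts concrete, here is a small Python sketch that pulls an HTTP URL apart and fills in the defaults described above. The function http_parts and the address www.example.com are placeholders chosen for this example; treating an omitted path as "/" is how this sketch represents the request for the "home page".

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80


def http_parts(url):
    """Return (host, port, path, searchpart) for an HTTP URL."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("not an HTTP URL: %r" % url)
    # An omitted port falls back to the standard port 80.
    port = parts.port if parts.port is not None else DEFAULT_HTTP_PORT
    # An omitted path is represented here as "/".
    path = parts.path or "/"
    return parts.hostname, port, path, parts.query


print(http_parts("http://www.netspace.org"))
print(http_parts("http://www.example.com:8080/cgi-bin/search?urls"))
```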

Another character that may be frequently encountered when browsing the WWW is the pound sign (#), which can be used to point to a named anchor. An author of an HTML document can allow browsers to point to a specific section of a document by creating a named anchor within that document. Then, a URL with a pound sign and the anchor's name appended will reference that specific section. Named anchors are used throughout this document, and as an example, the following URL points directly to the section "What are URLs?":

http://www.netspace.org/users/dwb/url-guide.html#what
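A client separates the anchor name from the rest of the URL before asking the server for the document. A minimal sketch of that separation, using Python's standard urllib.parse.urldefrag function:

```python
from urllib.parse import urldefrag

# Split the URL at the pound sign into the document's URL and the anchor name.
document_url, anchor = urldefrag("http://www.netspace.org/users/dwb/url-guide.html#what")

print(document_url)  # http://www.netspace.org/users/dwb/url-guide.html
print(anchor)        # what
```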

File Transfer Protocol (FTP)

FTP is a well-used means for transmitting files over the Internet. While there are many advantages to using HTTP instead, many systems don't offer full support of HTTP and clients are not as well developed as they are for FTP. Thus, many times files are distributed via FTP. Its syntax is:

ftp://<user>:<password>@<host>:<port>/<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>

If contacting a site which provides general FTP access, the user and password can be omitted, including the colon between them and the at-symbol afterwards. The host is the Internet address of the FTP site. The port and its preceding colon can be omitted as well. The portion of "<cwd1>/<cwd2>/.../<cwdN>" refers to the series of "change directory" commands a client must use to move to the directory in which the desired file resides. The name is the filename of the desired file. The construction ";type=<typecode>" allows for a transmission method (e.g. ascii vs. binary) to be specified, but I haven't found any clients which support this syntax, and in fact, most incorrectly assume that it is part of the filename. For now, avoid using the typecode.
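The pieces of an FTP URL can be separated as in the Python sketch below. The function parse_ftp_url and the sample address ftp.example.com are inventions for this example. Two defaults are assumptions drawn from outside this guide: the user name "anonymous" for general access (the convention in RFC 1738) and the standard FTP port 21.

```python
from urllib.parse import urlsplit, unquote


def parse_ftp_url(url):
    """Break an FTP URL into its user, password, host, port, directories, name, and typecode."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError("not an FTP URL: %r" % url)
    # Separate the optional ";type=<typecode>" suffix from the path.
    path, _, typecode = parts.path.partition(";type=")
    if path.startswith("/"):
        path = path[1:]  # drop the slash that separates the host from the path
    *directories, name = path.split("/")
    return {
        "user": parts.username or "anonymous",   # assumed default for general access
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port if parts.port is not None else 21,  # standard FTP port
        "cwd": [unquote(d) for d in directories],
        "name": unquote(name),
        "typecode": typecode or None,
    }


print(parse_ftp_url("ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"))
print(parse_ftp_url("ftp://guest:secret@ftp.example.com:2121/pub/docs/readme.txt;type=a"))
```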

Gopher Protocol (Gopher)

The Gopher protocol syntax is very similar to FTP and HTTP:

gopher://<host>:<port>/<gopher-path>

The host indicates the Internet address of the Gopher server, while the port, as in the previous cases, can generally be omitted along with its preceding colon. The gopher-path specifies the type of Gopher resource, a selector string, and perhaps other information. A detailed discussion of Gopher queries is not within the scope of this document, but generally you can determine a document's gopher-path from information provided by your browser.

Electronic Mail (Mailto)

The Mailto URL scheme differs from the previous three schemes in that it does not identify a file available over the Internet, but rather the email address of someone who can be reached via the Internet. The syntax is:

mailto:<account@site>

The account@site is the Internet email address of the person you wish to contact, as defined by RFC 822. Note that when encoded in WWW documents, some WWW browsers may not understand the Mailto scheme. Support for Mailto is increasing, but for now, one can switch to a different browser or interpret the Mailto URL manually.
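Interpreting a Mailto URL by hand amounts to removing the scheme and reading off the address. A minimal Python sketch of that step follows; the function name mailto_address is made up for this example, and undoing escape sequences with unquote is an added assumption.

```python
from urllib.parse import urlsplit, unquote


def mailto_address(url):
    """Return the email address named by a Mailto URL."""
    parts = urlsplit(url)
    if parts.scheme != "mailto":
        raise ValueError("not a Mailto URL: %r" % url)
    # Everything after "mailto:" is the address; decode any escape sequences.
    return unquote(parts.path)


print(mailto_address("mailto:dwb@netspace.org"))  # dwb@netspace.org
```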

Usenet News (News)

The News URL scheme allows for the referencing of Usenet newsgroups or specific articles. The syntax is either of the following:

news:<newsgroup-name>
news:<message-id>

The newsgroup-name is the Usenet newsgroup name (e.g. comp.infosystems.www.providers) and generally will tell the browser to retrieve the titles of all the available articles within that newsgroup. If the newsgroup-name is "*", the URL refers to "all available newsgroups." The message-id corresponds to the Message-ID of the specific article to obtain, and can be found within the article's header information.
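Since both forms of the News URL look alike, a client needs some way to tell them apart. The Python sketch below relies on a rule taken from RFC 1738 and not spelled out in this guide: a message-id always contains an at-symbol (@), while a newsgroup name does not. The function name classify_news_url and the sample message-id are hypothetical.

```python
def classify_news_url(url):
    """Return (kind, value), where kind is "all-newsgroups", "article", or "newsgroup"."""
    scheme, separator, rest = url.partition(":")
    if not separator or scheme.lower() != "news":
        raise ValueError("not a News URL: %r" % url)
    if rest == "*":
        return "all-newsgroups", rest
    # Assumption from RFC 1738: only a message-id contains "@".
    if "@" in rest:
        return "article", rest
    return "newsgroup", rest


print(classify_news_url("news:comp.infosystems.www.providers"))
print(classify_news_url("news:12345.abc@news.example.com"))  # hypothetical message-id
print(classify_news_url("news:*"))
```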

Note that the News URL does not specify how a client is to obtain this information. A client must be properly configured to know where to obtain Usenet newsgroups and articles, generally from a specific NNTP server.

Telnet to Remote Host (Telnet)

The Telnet URL designates an interactive session to a remote host on the Internet via the Telnet protocol. Its syntax is:

telnet://<user>:<password>@<host>:<port>/

The user and password tokens can be omitted, and are included only for advisory purposes. The host refers to the site to connect to, and port can be omitted, defaulting to the standard "23".

Telnet to Remote Hosts Requiring 3270 Emulation (TN3270)

The TN3270 URL scheme is for telnetting to systems which require 3270 terminal emulation, such as IBM mainframes. This is not a scheme defined by RFC 1738, but is a proposed addition. It is almost identical to the Telnet URL, and has the syntax:

tn3270://<user>:<password>@<host>:<port>/

Wide Area Information Search (WAIS)

The WAIS URL refers to WAIS databases, searches, or documents on a WAIS database. The WAIS URL scheme has one of the three following forms:

wais://<host>:<port>/<database>
wais://<host>:<port>/<database>?<search>
wais://<host>:<port>/<database>/<wtype>/<wpath>

The host and port (which can be omitted) describe the same constructs in previously described schemes. The first syntax indicates a specific WAIS database, the second a particular search, and the third a specific document.
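The three forms can be told apart by shape alone: a question mark indicates a search, a single path component a database, and three components a document. Here is a rough Python sketch of that test; the function name classify_wais_url and the host wais.example.com are placeholders, and the database, search, and document names are invented.

```python
from urllib.parse import urlsplit


def classify_wais_url(url):
    """Return "database", "search", or "document" for a WAIS URL."""
    parts = urlsplit(url)
    if parts.scheme != "wais":
        raise ValueError("not a WAIS URL: %r" % url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if parts.query and len(segments) == 1:
        return "search"      # <database>?<search>
    if len(segments) == 1:
        return "database"    # <database>
    if len(segments) == 3:
        return "document"    # <database>/<wtype>/<wpath>
    raise ValueError("unrecognized WAIS URL form: %r" % url)


print(classify_wais_url("wais://wais.example.com/catalog"))
print(classify_wais_url("wais://wais.example.com/catalog?urls"))
print(classify_wais_url("wais://wais.example.com:210/catalog/TEXT/doc-17"))
```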

Host-Specific File Names (File)

The File URL scheme indicates a file which can be obtained by the client machine. In many sources, this scheme is confused with the FTP scheme. FTP refers to a specific protocol for file transmission, whereas the File URL leaves the retrieval method up to the client, which in some circumstances might be FTP. When the file is intended to be obtained via FTP, I recommend designating that URL scheme. The syntax for the File scheme is:

file://<host>/<path>

The host is the fully qualified domain name of the system, and the path is the hierarchical directory path of the form "directory/directory/.../filename". The host can be left as an empty string or "localhost" to refer to local files on the client on which the URL is being interpreted.
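The rule for local files can be sketched in Python as follows. The function name file_url_parts and the sample host and paths are assumptions of this example.

```python
from urllib.parse import urlsplit


def file_url_parts(url):
    """Return (host, path, is_local) for a File URL."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a File URL: %r" % url)
    host = parts.hostname or ""
    # An empty host or "localhost" refers to the machine interpreting the URL.
    is_local = host in ("", "localhost")
    return host, parts.path, is_local


print(file_url_parts("file:///etc/motd"))
print(file_url_parts("file://localhost/etc/motd"))
print(file_url_parts("file://files.example.com/pub/readme.txt"))
```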

USENET News Using NNTP Access (NNTP)

The NNTP URL scheme is an alternative method to the News scheme for referencing Usenet articles and newsgroups. It has the syntax of:

nntp://<host>:<port>/<newsgroup-name>/<article-number>

The items within this syntax are all as described in previous schemes. Generally, it is better to use the News scheme and trust that the client knows how to obtain Usenet items. The NNTP scheme specifies that the NNTP protocol is used, and also specifies a specific NNTP server, designated by the host, to be used; most NNTP servers do not provide universal access. Thus, use News whenever possible.

Prospero Directory Service (Prospero)

The Prospero URL scheme allows resources available via the Prospero Directory Service to be designated. It has the syntax:

prospero://<host>:<port>/<hsoname>;<field>=<value>

See Neuman, B., and S. Augart, "The Prospero Protocol", USC/Information Sciences Institute, June 1993, <URL:ftp://prospero.isi.edu/pub/prospero/doc/prospero-protocol.PS.Z> for information on Prospero.

Appendix A: Escape Sequences

Certain characters are determined to be either unsafe or reserved, and thus may need to be encoded by escape sequences before a URL can be specified. An escape sequence consists of a percent sign (%) followed by the two-digit hexadecimal value of the character's US-ASCII code. Unsafe characters are designated as such for a variety of reasons. All unsafe characters should be encoded when constructing a URL, and these characters with their escape sequences are listed below:

     SPACE      %20
     <          %3C
     >          %3E
     #          %23
     "          %22
     %          %25
     {          %7B
     }          %7D
     |          %7C
     \          %5C
     ^          %5E
     ~          %7E
     [          %5B
     ]          %5D
     `          %60
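The encoding of unsafe characters is mechanical enough to sketch in a few lines of Python. The function name escape_unsafe is made up for this example. The character set is the table above, plus the double quote, which RFC 1738 also lists as unsafe. The sketch handles US-ASCII text only.

```python
# The unsafe characters: space < > " # % { } | \ ^ ~ [ ] `
UNSAFE = ' <>"#%{}|\\^~[]`'


def escape_unsafe(text):
    """Replace each unsafe character with "%" plus its two-digit hexadecimal code."""
    return "".join(
        "%%%02X" % ord(character) if character in UNSAFE else character
        for character in text
    )


print(escape_unsafe("my file~1.txt"))  # my%20file%7E1.txt
print(escape_unsafe("100%"))           # 100%25
```

Note that the percent sign itself is unsafe, so text that has already been escaped should not be passed through a second time.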

Reserved characters are characters which have special meaning within specific schemes, and must be encoded when used in such schemes if they are to be used for a purpose other than that meaning. The escape sequences are listed below:

     ;          %3B
     /          %2F
     ?          %3F
     :          %3A
     @          %40
     =          %3D
     &          %26
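When a reserved character is meant as ordinary data, for instance as part of a file name, it has to be escaped before being placed in the URL. The Python sketch below uses the standard urllib.parse.quote function with no characters exempted; the helper name escape_segment, the file name, and the address www.example.com are invented for this example.

```python
from urllib.parse import quote, unquote


def escape_segment(segment):
    """Escape one path component so reserved characters lose their special meaning."""
    return quote(segment, safe="")


name = "AT&T report?.txt"
escaped = escape_segment(name)
print(escaped)                                    # AT%26T%20report%3F.txt
print("http://www.example.com/docs/" + escaped)   # hypothetical location
print(unquote(escaped))                           # AT&T report?.txt
```

Without the escaping, the question mark in this file name would be read as the start of a searchpart.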

Thus, the tilde (~) which designates the directory within which this document resides should be encoded to produce the URL:

http://www.netspace.org/users/dwb/url-guide.html

Appendix B: A Note on FTP URL Negotiation

As was previously mentioned, the portion of the FTP URL of the syntax "<cwd1>/<cwd2>/.../<cwdN>", as specified by RFC 1738, refers to the series of "change directory" commands a client must use to move to the directory in which the file desired resides. However, many clients assume that the FTP server is UNIX-compatible, and issue a single change directory command or attempt to retrieve the file of "path/filename".

For some FTP servers, such as many VM systems, this is an incorrect assumption and these clients may be unable to retrieve such files.
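The difference between the two approaches can be seen by listing the FTP commands each implies. In this Python sketch, CWD and RETR are the standard FTP commands for changing directory and retrieving a file (RFC 959); the function names are hypothetical, and no connection is made to any server.

```python
from urllib.parse import urlsplit, unquote


def split_path(url):
    """Return the directory names and the file name from an FTP URL."""
    path = urlsplit(url).path
    if path.startswith("/"):
        path = path[1:]
    *directories, name = [unquote(segment) for segment in path.split("/")]
    return directories, name


def commands_per_rfc(url):
    """One CWD per directory, then RETR: the sequence RFC 1738 describes."""
    directories, name = split_path(url)
    return ["CWD " + d for d in directories] + ["RETR " + name]


def commands_unix_shortcut(url):
    """A single RETR of "path/filename", which assumes a UNIX-style server."""
    directories, name = split_path(url)
    return ["RETR " + "/".join(directories + [name])]


url = "ftp://ftp.mcom.com/netscape/mac/netscape.sea.hqx"
print(commands_per_rfc(url))
print(commands_unix_shortcut(url))
```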

Appendix C: Other Resources

Here are a variety of resources available on some of the topics discussed:

Uniform Resource Locators (RFC 1738)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1738.html>
Universal Resource Identifiers (RFC 1630)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1630.html>
HyperText Transfer Protocol
<URL:http://www.w3.org/hypertext/WWW/Protocols/Overview.html>
File Transfer Protocol (RFC 959)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc959.html>
Gopher Protocol (RFC 1436)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1436.html>
Internet Text Messages (RFC 822)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc822.html>
Usenet Messages (RFC 1036)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1036.html>
Network News Transfer Protocol (RFC 977)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc977.html>
WAIS over Z39.50-1988 (RFC 1625)
<URL:http://www.cis.ohio-state.edu/htbin/rfc/rfc1625.html>
The Prospero Protocol
<URL:ftp://prospero.isi.edu/pub/prospero/doc/prospero-protocol.PS.Z>

Copyright 1996, David W. Baker.

 
