How Does the Internet Work?
Teddy Kang (HLS ’01)
March 1, 2001
In the course of emailing your best friend or surfing the Internet, have you ever thought to yourself: How does the Internet work? Is there one computer or group of computers to which everyone connects in order to access the Internet? If so, who controls those computers? In reality, no single computer or group of computers controls the Internet. Instead, the Internet (which is actually a contraction of the phrase interconnected network) is a collection of individual computers, grouped together in clusters called local area networks (LANs). A LAN that an organization sets up for its own internal use is sometimes called an intranet. Each of these LANs is a self-sufficient, independent entity that works together with other LANs to transmit data. LANs are found at many institutions. Private companies have their own LANs – for example, there is probably a LAN at the company where you work. Universities and government agencies have their own LANs, as do Internet service providers (ISPs) and other online services, such as America Online (AOL). Together, these networks and the system of passing information along them make up what has commonly come to be known as the Internet.
Therefore, the Internet operates as a sort of loose democracy of individual computers and LANs. However, you might still wonder: If there is no centralized control of the Internet, how can there be any agreement on the proper way to use the Internet? Although control is mostly based on mutual cooperation, there are some existing groups that help establish standards and procedures for proper use of the Internet. One such group is the Internet Society, a private, non-profit group that serves as the organizational home for the Internet’s standards bodies. The Internet Architecture Board (IAB), which operates under the Internet Society, provides architectural oversight and relies on the Internet Engineering Task Force (IETF) to develop and issue new standards. The IETF is managed by the Internet Engineering Steering Group (IESG) and is organized into Areas and Working Groups, where new specifications are discussed and new standards are proposed. Another group is the World Wide Web Consortium (W3C), which develops standards for the rapidly growing World Wide Web (WWW) (discussed in more detail in a later section). Along with these large groups and consortiums, smaller private companies take a role in helping ensure the Internet’s efficient operation. For example, such companies oversee the registration of Internet domain names, such as www.yahoo.com.
Now that we have learned that the Internet is not centrally managed, but is actually a collection of local area networks, we next tackle the question of how information actually travels across the Internet. How does a particular piece of information – like an email to your mother or an order for a new sweater through an online catalog – reach its intended destination?
Let’s say that you are a student at the University of Michigan (U-M) and you’ve just finished writing an email to a friend who is studying at Oxford University (Oxford) in England. The first thing that your computer does when you click the “send” button is to break down your email into small packets of information, called datagrams. These datagrams will all be sent off to your intended destination (your friend’s email account) along the paths that are most efficient, i.e., that have the least amount of Internet “traffic.” Because traffic on the Internet changes constantly, the datagrams may travel along different paths and may well arrive out of order. The destination computer (i.e., the computer belonging to your friend in England) will then collect the datagrams and reassemble them into the original, unified form that you submitted.
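The break-down and reassembly just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the 8-character datagram size, and the use of a shuffle to imitate different routes are all made up for the example and are not part of any real Internet software.

```python
import random

def split_into_datagrams(message: str, size: int) -> list[tuple[int, str]]:
    """Break a message into (sequence_number, chunk) pairs."""
    return [(seq, message[start:start + size])
            for seq, start in enumerate(range(0, len(message), size))]

def reassemble(datagrams: list[tuple[int, str]]) -> str:
    """Sort the datagrams by sequence number and join the chunks."""
    return "".join(chunk for _, chunk in sorted(datagrams))

message = "Hello from Michigan! How is Oxford?"
datagrams = split_into_datagrams(message, size=8)

# Imitate datagrams taking different routes and arriving out of order.
arrived = datagrams[:]
random.Random(0).shuffle(arrived)

restored = reassemble(arrived)
print(restored == message)  # True
```

The sequence number carried with each chunk is what lets the receiving side restore the original order no matter how the datagrams arrive.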
For
the purpose of explaining some terminology, the system of breaking down and
reassembling that was just described is called a packet-switched network.
This type of network, in which there is no single, unbroken connection
between sender and receiver, is the Internet’s key feature. In contrast to the Internet’s packet-switched
network, the telephone system uses a circuit-switched
network. In this type of network,
after a connection is made, that part of the network is dedicated only to that
single connection. Therefore, when you
speak on a telephone, your voice does not get broken down into packets of data
and reassembled at the receiver’s end; instead, your voice travels as a single,
unbroken stream.
Let’s
go back a minute and trace how the email message got from your computer to your
friend’s computer. As was described in
the preceding section, individual computers are hooked up to LANs in order to
access the Internet. The U-M and Oxford
University each have their own LANs.
Therefore, when you transmitted your email message, datagrams went from
your personal computer to the U-M’s LAN.
From there, a piece of Internet hardware called a router sent the datagrams from the U-M LAN to a regional network. A regional network is a cluster of LANs that
are located in a common geographic area (connections between LANs and regional
networks are formed via leased lines, such as telephone lines or fiber-optic
cables). Routers are the directors of
Internet traffic. Their job is to point
datagrams in the right direction, so that they all arrive at their intended
destination.[1]
If the intended destination is within the same regional network (but not in the same LAN), routers will direct datagrams to the proper LAN. However, in this case, your friend is located in England, which is probably not in the same regional network as the U-M. What happens now? When the intended destination lies in a different regional network, routers will send the datagrams to a Network Access Point (NAP). NAPs can be likened to train stations, where datagrams wait to be picked up by the Internet “train.” To finish the analogy, the “train” is referred to as an Internet backbone. These backbones are high-capacity lines that carry enormous amounts of Internet traffic. As a result of this capacity, datagrams can travel very quickly over backbones. One such backbone, called the very high-speed Backbone Network Service (vBNS), transfers data at 155 million bits per second. That is almost 3,000 times the rate at which a standard 56k modem can transfer data! Government agencies, such as NASA, and large private corporations pay for the construction and maintenance of many of these backbones. When data travels over one of these backbones, the long distance may weaken the data signal. If this signal fades too much, there is a possibility of transmission failure. Therefore, backbones (and all other Internet lines, for that matter) have hardware called repeaters to solve this problem. Repeaters amplify the data signal at intervals so that it will not fade too much.
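To make the speed comparison concrete, the arithmetic can be worked through as follows. The one-megabyte file size is a hypothetical figure chosen for illustration, and the calculation assumes ideal conditions (no protocol overhead and no congestion), so real transfers would be slower.

```python
BACKBONE_BPS = 155_000_000   # vBNS rate quoted in the text, in bits per second
MODEM_BPS = 56_000           # nominal rate of a 56k modem, in bits per second

def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Ideal transfer time, ignoring protocol overhead and congestion."""
    return size_bytes * 8 / bits_per_second

speedup = BACKBONE_BPS / MODEM_BPS
one_megabyte = 1_000_000  # hypothetical file size for illustration

print(f"speed ratio: {speedup:.0f}x")
print(f"modem:    {transfer_seconds(one_megabyte, MODEM_BPS):.1f} s")
print(f"backbone: {transfer_seconds(one_megabyte, BACKBONE_BPS):.3f} s")
```

The ratio works out to roughly 2,768, which is the “almost 3,000 times” figure: a file that would take well over two minutes by modem crosses the backbone in a small fraction of a second.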
To
summarize, here is the path that your email message will take from your
computer at the U-M to your friend’s computer at Oxford University in
England. First, your computer will
break down the email message into packets of information called datagrams. Second, the datagrams will be sent from your
personal computer to the U-M’s LAN.
Third, because your friend’s email account is not on the U-M’s LAN,
routers will direct the datagrams from the U-M LAN to the regional network that
the U-M is a part of. Fourth, because
Oxford’s LAN is not on the same regional network as the U-M, routers will
direct the datagrams to the regional network’s NAP. Fifth, from the NAP, routers will send the datagrams along a backbone to the regional network where Oxford’s LAN is located. Along the way, repeaters will amplify the data signal so that it does not weaken to the point of transmission failure. Sixth, routers will direct the datagrams from the regional network to Oxford’s LAN.
Seventh, Oxford’s LAN will hold on to the datagrams until your friend
checks his email account. And finally,
when your friend checks his email, his computer’s hardware and software will
reassemble the datagrams into a single, unified form, enabling your friend to
read your message.
So now that you have an
understanding as to how information gets sent over the Internet, perhaps you
are wondering: How did my email message get broken down into datagrams and then
reassembled? To answer this question,
we need to discuss Transmission Control
Protocol (TCP) and the Internet
Protocol (IP) – otherwise known as TCP/IP. TCP breaks down and reassembles the
datagrams, whereas IP plays a key role in ensuring that the datagrams are sent
to the proper destination.
In order for your computer to be capable of breaking down and reassembling datagrams, as well as routing datagrams along the Internet to the proper destination, it needs software that implements the Internet’s TCP/IP protocols. For Windows PCs, the software is called Winsock; for Macintosh computers, it is called MacTCP.
So via your computer’s TCP/IP software, TCP breaks the original information (i.e., the email to your friend) into datagrams. As this process occurs, a header is assigned to each datagram. A header is simply a string of numbers that contains information crucial to the transmission and reassembly of the datagrams. This information includes such things as the order in which the datagrams should be reassembled with respect to each other, a value called a checksum that is used to determine whether an error occurred during data transmission, the Internet address of the intended destination, and the amount of time the datagram should be kept before discarding it.
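A header of this kind can be pictured as a small record attached to each piece of data. The sketch below is a simplification: the `Datagram` class and its field names are invented for illustration, the destination address is a made-up example, and real TCP/IP divides these fields between separate TCP and IP headers and computes the checksum over more than just the payload. The checksum function follows the general 16-bit ones’ complement approach used by the Internet protocols.

```python
from dataclasses import dataclass

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement sum of the data, then complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

@dataclass
class Datagram:
    sequence: int       # position of this piece in the original message
    destination: str    # Internet address of the intended recipient
    time_to_live: int   # how long to keep the datagram before discarding it
    checksum: int       # error check computed over the payload
    payload: bytes

def make_datagram(sequence: int, destination: str, payload: bytes,
                  time_to_live: int = 64) -> Datagram:
    """Build a datagram, computing the checksum from the payload."""
    return Datagram(sequence, destination, time_to_live,
                    internet_checksum(payload), payload)

def is_intact(d: Datagram) -> bool:
    """Recompute the checksum and compare it with the one in the header."""
    return internet_checksum(d.payload) == d.checksum

d = make_datagram(0, "192.0.2.10", b"Hello, Oxford!")
print(is_intact(d))            # True
d.payload = b"Hellp, Oxford!"  # imitate corruption during transmission
print(is_intact(d))            # False
```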
Once information is broken down, the IP function takes over. Each datagram is placed into its own IP folder. Attached to each folder is an addressing label, which is read by routers for delivery to the intended destination. Along with the addressing label, the folders contain the headers assigned by TCP.
When
the datagrams arrive at their intended destination, TCP checks to see if there
has been any data corruption during transmission. If there has been corruption, TCP discards the datagram and
requests that the original datagram be re-transmitted. When all the non-corrupt datagrams are
received by the computer, TCP reassembles them into their original, unified
form.
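The receiving side’s behavior (check each datagram, discard damaged ones, ask for them again, and reassemble once everything is clean) might be sketched as below. The `receive` function and the tuple layout are hypothetical, and `zlib.crc32` is used only as a convenient stand-in for an error check; it is not the checksum that TCP actually uses.

```python
import zlib

def receive(datagrams):
    """Check each (sequence, checksum, payload) datagram.

    Returns (message, []) when every datagram is intact, or
    (None, sequence_numbers) listing the datagrams to re-request.
    """
    good = {}
    to_resend = []
    for sequence, checksum, payload in datagrams:
        if zlib.crc32(payload) == checksum:
            good[sequence] = payload
        else:
            to_resend.append(sequence)  # discard and ask for a fresh copy
    if to_resend:
        return None, sorted(to_resend)
    message = b"".join(good[s] for s in sorted(good))
    return message, []

pieces = [b"Dear ", b"friend, ", b"hello!"]
sent = [(i, zlib.crc32(p), p) for i, p in enumerate(pieces)]

damaged = list(sent)
damaged[1] = (1, sent[1][1], b"frienD, ")  # payload corrupted in transit
print(receive(damaged))     # (None, [1])
print(receive(sent[::-1]))  # (b'Dear friend, hello!', [])
```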
In
the preceding section, we discussed how IP functions by placing datagrams into
individual IP folders and attaching to these folders an addressing label, which
tells routers where to send the datagrams.
In this section, we will go over these IP addresses in more detail.
IP addresses are identifiers for delivery of information over the Internet. An IP address is written as four numbers separated by periods. Each number can range from zero to 255. For example, 1.160.10.240 could be an IP address. An IP address consists of two parts. The first part of the address, called the network number, identifies a network on the Internet. The remainder, called the host ID, identifies an individual host on that network. An individual host refers to an individual computer on the network (for example, if I were a student at the University of Michigan and I registered my computer through the university, my computer would be assigned a host ID).
Historically, three classes of IP addresses have been defined. In Class A IP addresses, only the first field identifies the network, and the number in the first field must be in the range 1 through 126. Class A networks are very large. The remaining three fields identify the host; host numbers 0.0.0 and 255.255.255 are reserved, so there can be almost 17 million hosts (16,777,214) in a class A network. All 126 class A network numbers have been allocated. As an example, the IP address 26.4.0.1 for a Class A network would signify host number 4.0.1 on network number 26.
For Class B IP addresses, the first two fields identify the network, and the number in the first field must be in the range 128 to 191. Class B networks are large. Host numbers 0.0 and 255.255 are reserved, so there can be up to 65,534 hosts in a class B network. Most of the 16,382 class B addresses have been allocated. As an example, the IP address 128.89.0.26 for a Class B network would signify host number 0.26 on network number 128.89.
For Class C IP addresses, the first three fields identify the network, and the number in the first field must be in the range 192 to 223. Class C networks are relatively small. Host numbers 0 and 255 are reserved, so there can be up to 254 hosts in a class C network. Most LANs are class C networks. There can be over 2 million class C networks on the Internet. As an example, the IP address 192.15.28.16 for a Class C network would signify host number 16 on network number 192.15.28.
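The class rules above amount to a simple test on the first field of the address. The following sketch applies them; the `classify` function is a made-up helper for illustration, and it covers only the historical classes A, B, and C described here.

```python
def classify(address: str) -> tuple[str, str, str]:
    """Return (class, network number, host number) for a dotted address."""
    fields = address.split(".")
    if len(fields) != 4 or not all(
        f.isascii() and f.isdigit() and 0 <= int(f) <= 255 for f in fields
    ):
        raise ValueError(f"not a valid IP address: {address!r}")
    first = int(fields[0])
    if 1 <= first <= 126:
        letter, network_fields = "A", 1
    elif 128 <= first <= 191:
        letter, network_fields = "B", 2
    elif 192 <= first <= 223:
        letter, network_fields = "C", 3
    else:
        raise ValueError(f"{address!r} is not in class A, B, or C")
    network = ".".join(fields[:network_fields])
    host = ".".join(fields[network_fields:])
    return letter, network, host

for example in ("26.4.0.1", "128.89.0.26", "192.15.28.16"):
    print(example, classify(example))
```

Run on the three example addresses from the text, this reports host 4.0.1 on network 26 (class A), host 0.26 on network 128.89 (class B), and host 16 on network 192.15.28 (class C).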
As you might have already discovered, remembering these numerical addresses would be very difficult. Furthermore, numeric IP addresses sometimes change, which makes the use of the numerical addresses even more impractical. In response to this problem, the Domain Name System (DNS) was developed in the early 1980s, principally by Paul Mockapetris at the University of Southern California’s Information Sciences Institute, as an easier way to keep track of IP addresses. You are probably already familiar with the names that DNS makes possible – your email address contains one. An email address is made up of two major components that are separated by an @ sign. The part of the address that is located to the left of the @ sign is the username, which usually identifies the individual who holds the Internet account. The part of the address that is to the right of the @ sign is the hostname and domain name. For example, if the address is jsmith@umich.edu, “jsmith” is the username, “umich” is the hostname, and “.edu” is the domain name.
The hostname signifies the entity or organization where the user has an account. In our preceding example, “umich” signifies the University of Michigan. Other hostnames include “yahoo” (the Yahoo company), “aol” (America Online), and “usdoj” (United States Department of Justice). The domain name represents the broad category that the hostname is part of. There are seven domain names that are used in the United States: .com (commercial entities), .edu (educational institutions), .gov (governmental entities), .org (organizations), .mil (military), .net (networks), and .int (international organizations). In our example of jsmith@umich.edu, the “.edu” domain name is used because the University of Michigan is an educational institution. Yahoo and America Online, because they are commercial entities, would use the .com domain name. And the Department of Justice would use the .gov domain name. Outside the United States, only two letters are used to identify the domain names – for example, .au for Australia, .ca for Canada, .uk for the United Kingdom, and .fr for France.[2]
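The three parts of an address can be pulled apart mechanically. This short sketch uses the terminology of this section (username, hostname, domain name); the `split_address` function is a hypothetical helper, and it assumes the simple user@host.domain form used in the examples.

```python
def split_address(address: str) -> dict[str, str]:
    """Split user@host.domain into the three parts named in the text."""
    username, _, rest = address.partition("@")
    hostname, _, domain = rest.rpartition(".")
    if not (username and hostname and domain):
        raise ValueError(f"expected user@host.domain, got {address!r}")
    return {"username": username, "hostname": hostname, "domain": "." + domain}

print(split_address("jsmith@umich.edu"))
# {'username': 'jsmith', 'hostname': 'umich', 'domain': '.edu'}
```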
You might be wondering how the translation from the domain name system to IP numeric addresses takes place. After all, to the Internet’s routers, domain names mean nothing – they can only interpret IP numeric addresses. This translation process is done by the domain name server. Each LAN has a domain name server. Therefore, if I have an account at the University of Michigan LAN, and I submit an email to the address tk@harvard.edu, the University of Michigan’s domain name server will first take the harvard.edu portion of the address and translate it into an IP numeric address; let’s say it is 124.56.78.34. Once this translation process takes place, the LAN knows where to send the information, and the email can be sent.
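In spirit, a domain name server performs a table lookup from names to numbers. The toy version below shows only that idea: the table, the `resolve` function, and the addresses are all hypothetical (the numbers come from a range reserved for documentation, so they are not the real addresses of these institutions), and a real server consults other servers rather than a fixed table.

```python
# Hypothetical name-to-address table; the addresses come from a range
# reserved for documentation and are not the real ones for these names.
NAME_TABLE = {
    "harvard.edu": "192.0.2.34",
    "umich.edu": "192.0.2.77",
}

def resolve(email_address: str) -> str:
    """Look up the numeric address for the part after the @ sign."""
    name = email_address.rpartition("@")[2].lower()
    try:
        return NAME_TABLE[name]
    except KeyError:
        raise LookupError(f"no address known for {name!r}") from None

print(resolve("tk@harvard.edu"))  # 192.0.2.34
```

In an actual program, a call such as `socket.gethostbyname` from Python’s standard library would hand the name to the operating system’s resolver instead of a local table.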
Suppose now that instead of sending an email, you go on your web browser[3] (such as Internet Explorer or Netscape Navigator), and you type in www.yahoo.com. What exactly is your computer doing? When you type in www.yahoo.com, which is known as a uniform resource locator (URL), your computer is requesting that the Yahoo website send back information to your computer. When you click on a hyperlink on a web page, you have asked for information at a different URL; therefore, your computer is requesting that different information be relayed back to your computer. A hyperlink usually appears on the screen as an underlined word or phrase and is sometimes rendered in a different color from other text.
So when you type in www.yahoo.com or click on a hyperlink, your computer makes a request for information. This request gets sent to the LAN, which then uses the domain name server to translate the URL into an IP numeric address. Once this translation is complete, the request is broken down into datagrams – in the same way as your email transmission was broken down – and gets sent to the intended destination. When the intended recipient (the host) receives your request for information, the host computer will process the request and send back that information to your computer. Once the information gets relayed back to your computer, you are able to view the website that you typed in or hyperlinked to via your web browser. Incredibly, this whole back and forth exchange of information takes place in a matter of seconds!
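Before any request can be sent, the browser has to work out which host to contact and which page to ask for. A rough sketch of that first step, using Python’s standard `urllib.parse` module, is shown below. The `describe` function is a made-up helper, and the choice to assume “http” when no scheme is typed is an assumption that mirrors what browsers of this era generally do.

```python
from urllib.parse import urlsplit

def describe(url: str) -> dict[str, str]:
    """Split a URL into the parts a browser needs to make its request."""
    if "://" not in url:
        url = "http://" + url  # assume HTTP when no scheme is typed
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname or "",  # the name handed to the domain name server
        "path": parts.path or "/",     # the page requested from that host
    }

print(describe("www.yahoo.com"))
# {'scheme': 'http', 'host': 'www.yahoo.com', 'path': '/'}
```

The host part is what gets translated into an IP numeric address; the path tells the host computer which information to send back.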
Intranets
One concept that we are now very familiar with is the local area network (LAN). Educational institutions, government agencies, companies, and online services or Internet service providers will each have their own LAN. The connection of all of these networks forms what we know as the Internet (interconnected network). In this section, we will learn about a special type of LAN that some businesses use – the intranet (also known as an internal network).
Succinctly defined, an intranet is an in-house website that serves the employees of a given company or business. Companies use intranets for a variety of purposes – email, group brainstorming, group scheduling, access to corporate databases and documents, and videoconferencing. Due to the often-confidential nature of material that gets distributed over the intranet, companies want their intranet to be secure from outsiders’ viewing. Therefore, although intranet pages may link to the Internet, an intranet is not a site that can be accessed by the general public. Only employees of the company or other authorized users can view the intranet website. Quite often, the intranet website is designed only to allow access when the user is at one of the company’s computer terminals.
The ability to restrict outsiders from gaining access to the company’s intranet is achieved through the use of a firewall. A firewall is a set of related programs that protects the resources of an intranet from users of other networks, i.e., outsiders. Although the precise mechanics are complex and are not covered here, you should know that firewall software controls access to a network and enforces a security policy by means of a pair of mechanisms – one to block traffic (i.e., to restrict access from outsiders), and one to permit traffic and access to network data (i.e., to grant access to authorized users).
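The pair of mechanisms, one to block and one to permit, can be illustrated with a very small rule check. This is a sketch of the idea only, not how any particular firewall product works: the address ranges, the blocked machine, and the `permits` function are all hypothetical, and real firewalls examine far more than the sender’s address.

```python
from ipaddress import ip_address, ip_network

# Hypothetical policy: the company's own address range is trusted.
ALLOWED_NETWORKS = [ip_network("10.0.0.0/8")]
BLOCKED_ADDRESSES = {ip_address("10.9.9.9")}  # e.g. a machine barred by policy

def permits(source: str) -> bool:
    """Apply the block rule first, then the permit rule; deny by default."""
    addr = ip_address(source)
    if addr in BLOCKED_ADDRESSES:
        return False
    return any(addr in network for network in ALLOWED_NETWORKS)

print(permits("10.1.2.3"))     # True: inside the company range
print(permits("203.0.113.5"))  # False: an outsider
print(permits("10.9.9.9"))     # False: explicitly blocked
```

Denying anything that no rule explicitly permits is a common design choice, since it means an overlooked case fails safely.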
To summarize, an intranet is a type of LAN that is set up as an in-house website. Many companies use intranets to provide a channel of confidential communication among employees and other authorized users. Authorized users can access the Internet through the company’s intranet, but outsiders cannot gain access to a company’s intranet. This ability to restrict access is accomplished through the use of firewalls, which are combinations of software and hardware.
Cookies and Related Privacy Issues
As we will be learning in a later part of this course, Internet privacy is a growing concern for computer engineers, lawmakers, and all individuals who use the Internet on a regular basis. Although the Internet provides many conveniences – from email and online shopping to stock quotes at your fingertips and instantaneous weather reports – the potential abuses and invasions of privacy via the Internet are also myriad. It is clear that Internet users can obtain a seemingly limitless amount of information over the Internet. What is less understood is that a great deal of information about users is also collected over the Internet. Who uses this information and how it is used form the basis of many Internet privacy concerns. In this section, we will discuss some of the techniques by which information about users can be collected and how this information can potentially be misused.
Some of the most important Internet privacy issues are linked to the potential abuse of cookies. A cookie is a small piece of information written to the hard drive (long-term storage) of an Internet user’s computer when he or she visits a website that offers cookies. Cookies can contain a variety of information, including the name of the website that issued them, where on the site the user visited, passwords, and even user names and credit card numbers that have been supplied via forms. Cookies are retrievable only by the site that issued them, and they link the information gathered to a unique ID number assigned to the cookie so that the information is available from one session to another.
As originally designed, cookies were to be of benefit to the user. Online organizations like the New York Times, which require user IDs and passwords, could store this information in the form of a cookie. This way, repeat visitors to a site could avoid having to fill out form information on each visit. Likewise, some online search engines such as Infoseek use cookies to “remember” users and offer them customized news and services based on their prior use. Cookies were thus intended to be a time-saving device for computer users. For example, instead of requiring the user to send a credit card number over the Internet multiple times, an online vendor could read the user’s cookie and match it to a stored profile that contains that information. Or, on a more general note, cookies that traced user activity on websites could also enable web designers to determine which of their pages were the most successful and plan their updates accordingly.
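The mechanics of issuing a cookie and reading it back can be sketched with Python’s standard `http.cookies` module. The cookie name `visitor_id` and its value are invented for illustration; the example stores only an ID, which a site could then match against a profile kept on its own computers, as in the online vendor example above.

```python
from http.cookies import SimpleCookie

# What a site might send: a hypothetical ID that the site can later match
# against a profile stored on its own server.
outgoing = SimpleCookie()
outgoing["visitor_id"] = "a1b2c3"
outgoing["visitor_id"]["path"] = "/"
print(outgoing.output())  # a Set-Cookie header line for the browser to store

# What the browser sends back on a later visit, as the site would parse it.
incoming = SimpleCookie()
incoming.load("visitor_id=a1b2c3")
print(incoming["visitor_id"].value)  # a1b2c3
```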
With the increasing commercial applications of the Internet, it was probably inevitable that cookies would quickly be utilized for advertising purposes. Since cookies can be matched to the profile of a user’s interests and browsing habits, cookies are a natural tool for the “targeting” of advertisements to individual users. Marketing consultants such as DoubleClick Inc. and MatchLogic quickly began to utilize cookies to increase the efficiency of the placing of advertisements on websites. Their intent was to target advertisements such as changing banner ads to users whose profiles match those of likely consumers of the advertised products. For example, DoubleClick was retained by the 3M Corporation to help target Internet banner advertising for an expensive multi-media projector to those users who would be most likely to purchase it. DoubleClick made use of the information cookies provided about user browsing habits to match the banners with users who had a history of selecting high-technology sites.
Internet privacy advocates object to cookies for a wide variety of reasons. First among them is that the cookie is stored in the user's computer without her consent or knowledge. Before the upgrades of web browsers like Netscape Navigator and Microsoft Internet Explorer, cookies were placed anonymously and without alerting the user. Next, information from the cookie was transmitted to the website, again without the user's knowledge. With browser upgrades users may be alerted to when they are being offered a cookie, but the formatting of the information still tells the user little about what is actually being stored.
The safety of personal information stored on the user’s hard drive has also been of concern with respect to cookies and related privacy issues. Concerns have been raised about the possibility of cookies being written that would allow access to other information that the user has stored. One of the most recent upgrades of the popular Internet browser, Netscape Communicator, was plagued with a bug that would allow a website access to the information that was passed between that site and the cookie file, including credit card numbers and passwords that had been entered into files. While this bug has been fixed and did not allow access to the user’s hard drive, it was still a serious breach of cookie security.
The most pressing issue concerning cookies, more than possible hardware invasions and general unease with the placing of files on user hard drives by third parties, is user privacy and the potential for abuse. Advertisers and webmasters (people who maintain individual websites) are currently using cookies to develop detailed profiles of users and their browsing habits. Each click on a particular type of advertisement or page in a website is added to the profile kept by whoever issued the cookie. For the time being this information is primarily used for website design and the placement of banner advertisements, but the possibility also exists for these profiles to be sold and resold to other commercial interests. This could lead to deeper incursions into personal privacy, because if any one of the cookie-maintainers links a user identity to their cookie ID, then that information could also be resold.
While this might at first seem to be only a nuisance, which would probably lead only to a serious increase in “targeted” junk paper mail or e-mail, there are more serious concerns for potential abuse. One scenario, albeit an extreme one, was put forth by David Christle:
If you visited a number of sites that advertise alcohol...and you end up on a list that your insurance company purchases. The list compiled from a variety of Internet sites shows your name as someone who frequents sites that promote alcohol, or at least as someone who is a prime prospect for alcohol sales. They raise your premiums on a profile that has been built about you based upon the sites you visit on the Internet.[4]
This is an extreme scenario, but it does put forth the potential for abuse of cookies.
Due in part to these privacy issues, web browsers have now been designed to give users control over whether to accept cookies, to reject cookies, or to be asked each time a cookie is about to be placed on the hard disk.
[1] Note that if data is being transferred within a common LAN, routers are oftentimes unnecessary, since the network itself can handle its internal traffic. So, for example, if you are a student at the University of Michigan (which, let’s suppose, has all its students’ email accounts on one LAN), and you send an email to another student at the University of Michigan, then a router may not be needed.
[2] An exception to the two-letter domain names for sites outside the United States is .com, which is now used universally. Therefore, even if a commercial entity had a site based in Australia, it would use the .com domain name, not the .au domain name.
[3] Web browsers are software programs that display information on your computer by interpreting the Hypertext Markup Language (HTML) that is used to build websites on the Internet. These browsers can also display applications, programs, animations, and similar material that are integrated with the website.
[4] David Christle, “Cookies in the Oven,” http://www.windowatch.com/cookies.html, 1996.