CIS 506 Course Notes
2002 spring
Thulin
You may wish to review an article about the Linux OS.
See http://www.linuxresources.com/
and http://www2.linuxjournal.com/main.html
Suggestions for papers to review:
CACM Vol. 33, No. 10 (Oct. 1990) special issue
on
simulation
NEST: a network simulation and prototyping testbed
Alexander Dupuy, Jed Schwartz, Yechiam Yemini and
David Bacon
Commun. ACM 33 10 (Oct. 1990), Pages 63-74
Parallel discrete event simulation
Richard M. Fujimoto
Commun. ACM 33 10 (Oct. 1990), Pages 30-53
ACM:
http://www.acm.org/
Student Membership:
http://www.acm.org/membership/student/benefits.html
The Client-Server Model
-
server processes
-
client processes
-
may be on same or different machines
-
assume all machines run same (micro) kernel
-
simplified protocol: only levels 1 (physical), 2 (data
link), 5
level 5 (session) becomes Request / Reply (connectionless
protocol)
Addressing
-
machine
-
machine.process
-
machine.local-id
See p.57 fig. 2.10.
-
Hard code machine.process
-
Random process addresses located by broadcasting
-
Server names looked up by name server
-
Special hardware; random process addresses stored in network
interface chip
Primitives
-
blocking (synchronous)
-
nonblocking (asynchronous)
new thread may be started to disguise interrupt
single thread choices:
-
Blocking send, receive (normally preferred)
-
Nonblocking send with copy, receive with wait or test
-
Nonblocking send, receive with interrupt
-
buffered: mailbox address
-
unbuffered: process address
-
reliable p.64 fig. 2-13
-
unreliable
Remote Procedure Call (RPC)
Basic RPC Operation
Parameters
-
call by value
-
usually no call by reference; replaced by call by
copy/restore
-
parameter marshaling
-
little, big endian concerns (p.74):
network canonical forms; format byte
-
stub compiler; server specification language
-
passing pointers
-
copy/restore improvement for in or out parameters
Dynamic Binding
-
used since hard coding server address is inflexible
-
stubs linked to client, server (p.78)
-
At initialization server exports interface to binder, registering
server via message containing name, version #, (32
bit) unique id, handle (ethernet address, X.500 address, sparse pid or
?), maybe authentication &c.
-
Server may deregister when it decides to serve no longer.
-
binder interface (p.79)
-
Upon first client RPC, client stub imports server interface
via message to binder which responds with message
containing handle and unique id (call fails if no server
available). Then client stub marshals parameters; sends
message addressed to handle.
-
Dynamic binding is highly flexible but costs time for extra
message passing; binder may be bottleneck.
RPC and Failures
-
client cannot locate server
-
lost request message
-
lost reply message
-
idempotent request, nonidempotent e.g. money transfer
-
sequence numbers; retransmission bit
-
server crash (p.82)
-
at least once
-
at most once
-
exactly once (desirable but not possible)
-
client crash
-
orphans
-
extermination (log, grandorphans)
-
reincarnation
-
gentle reincarnation
-
expiration
Implementation
-
Success of a distributed system depends on performance
which depends on communication speed which depends on
implementation.
RPC protocols
-
connection oriented or connectionless
-
standard or specific for RPC
-
message length (larger better for RPC)
Acknowledgments
-
stop and wait
-
blast
-
selective repeat often desirable
on WANs, not LANs
-
flow control
-
overrun errors much more serious than noise
-
cannot happen with stop-and-wait
-
sender can do busy waiting between packets
-
minimizing acks may depend on hardware timing
-
protocol may need to be tuned to hardware
-
server may save reply and set timer; discard reply
after ack arrives or timer expires
Critical path (pp. 89-91): sequence of instructions executed
for every RPC
-
DEC Firefly RPC example
-
user and kernel share address space
Copying
-
needed when user and kernel address spaces disjoint
-
number of copies: 1 - c.8
-
best case: network chip copies via DMA from client stub address
space to network to server kernel in real time; mapped to server address
space
-
worst case: copy to client run stack to client stub
message buffer to kernel buffer to network board buffer
to server network board to server kernel buffer to
server stub to server run stack
-
scatter-gather (hardware)
-
reducing copying easier for sender than receiver
-
can map pages to reduce copies
-
not worth it for small packets
Timer management
-
build data structure and insert in ordered linked list, or:
-
store timeouts in PCBs; use sweep algorithm (p.95)
-
could use priority queue
Problems
-
RPC using client-server model basis for many DOSs
-
transparency not yet fully achieved
-
pipelines
-
read-driven, write-driven do not work
-
RPC not good fit
-
may use temporary files on file server (more overhead)
-
terminal server & user break
-
RPC frequently good fit but not perfect
Group Communication
-
Introduction
-
RPC sends to one receiver; many receivers may be needed
-
one-to-many, point-to-point (p.99)
-
Groups are dynamic.
-
multicasting, broadcasting, unicasting
Design issues
-
closed groups (typical for parallel processing)
-
open groups (typical for replicated servers) (p.101)
-
peer groups (symmetric, no single point of failure)
-
hierarchical groups (decisions made easily) (p.102)
-
group membership
-
group server
-
distributed membership management
-
member crashes
-
leaving and joining synchronous with messages
-
membership too low for functionality; rebuild
-
group addressing
-
1. address for each group (p.104 fig. 2-33)
-
2. sender gives list of all destinations
-
3. predicate addressing (each message contains a predicate)
-
send and receive primitives
-
RPC not good for group communication
-
group_send, group_receive
-
atomicity (atomic broadcast)
-
all members of group or none get message
-
implementation difficult
-
message ordering
-
inconsistency (p.108 fig. 2-34)
-
global time ordering (impossible)
-
consistent time ordering
-
overlapping groups
-
inconsistency (p.109 fig. 2-35)
-
want well defined time ordering among groups
-
frequently difficult to implement
-
scalability
-
gateways cause problems (p.110)
ISIS
-
distributed application tool kit; runs over Unix (or
other OS)
-
synchronous system (impossible)
-
loose synchrony (like consistent time ordering)
-
virtual synchrony (p.111)
-
causally related
-
concurrent (not causally related)
-
ABCAST, GBCAST
-
2 phase commit, timestamps
-
ordered delivery for all messages
-
complex and expensive
-
CBCAST (pp. 113, 114)
-
ordered delivery for messages causally related
-
vectors; pass message if Vj = Lj
+ 1 and Vi £ Li,
i ¹ j
Ch. 3: Synchronization
semaphores, monitors not suited to distributed systems
Clock synchronization
distributed algorithms
1. information scattered on multiple machines
2. process makes decisions on local information
3. single point of failure to be avoided
4. no precise global time
Logical clocks
-
timer (quartz crystal, counter, holding register)
-
clock tick
-
clock skew
-
logical clocks, physical clocks
-
Lamport's algorithm
-
happens-before (a®b)
-
1. a®b if a precedes
b in same process
-
2. a®b if a = sending
a message and b = receiving it
-
happens-before is transitive (if a®b
and b®c then a®c)
-
a, b concurrent if not a®b
and not b®a
-
define C so that if a®b then
C(a) < C(b)
-
time corrections made by adding positive value
-
message contains sender's time, receiver resets its time
to sender's + 1 when its time £ sender's
(p.123)
-
between any two events in same process clock must tick at
least once
-
process # may be attached to time value yielding:
-
1. if a®b in same process
C(a) < C(b)
-
2. if a = sending message and b = receiving it C(a)
< C(b)
-
3. for all events a, b C(a) ¹
C(b)
Physical clocks
Subject:
Master Clock
Date:
Fri, 25 Sep
1998 06:30:01 UT
From:
The Learning
Kingdom <fact@LearningKingdom.com>
To:
Cool Fact
List <fact@tlk-lists.com>
----------------------------------------------------------------------
The Learning Kingdom's Cool Fact of the Day for
September 25, 1998
----------------------------------------------------------------------
What clock keeps time for the whole planet?
----------------------------------------------------------------------
The source clock for the planet is the US Naval Observatory's
Master
Clock. It's a network of dozens of extremely accurate
clocks located
all over the planet!
These clocks communicate by electronic messages every
100 seconds.
Together, they make a very stable time standard.
Their time signal is
used for many purposes, including the satellite-based
Global
Positioning System and many kinds of electronic navigation
systems.
Do you have a self-setting clock? Self-setting clocks
receive a radio
signal that originally comes from the USNO Master Clock,
and reset
themselves as needed!
USNO's Master Clock web site:
http://tycho.usno.navy.mil/time.html
Pollastri's Time of the Internet:
http://www.cstv.to.cnr.it/toi/uk/toi.html
American Watchmakers-Clockmakers Institute:
http://www.awi-net.org
----------------------------------------------------------------------
Cool Fact of the Day list membership: 33,513
----------------------------------------------------------------------
To subscribe, visit http://www.tlk-lists.com/join/
To unsubscribe,
visit http://www.tlk-lists.com/change/
To become a sponsor, visit http://www.tlk-lists.com/sponsor/
----------------------------------------------------------------------
Copyright (c) 1998, The Learning Kingdom, Inc.
http://www.LearningKingdom.com
----------------------------------------------------------------------
http://tycho.usno.navy.mil/time.html
, http://www.cstv.to.cnr.it/toi/uk/toi.html
-
mean solar day = 86,400 mean solar seconds (p.125).
-
atomic second = 9,192,631,770 Cs133 transitions.
-
BIH in Paris averages c. 50 atomic clocks to produce:
-
International Atomic Time (TAI) =
-
(mean # ticks of Cs133 clocks since 1958/1/1 00:00:00) /
9,192,631,770.
-
86,400 TAI seconds = mean solar day - 3 ms (approx.).
-
About 30 leap seconds have been introduced (p.126).
-
Universal Coordinated Time (UTC) is based on TAI with leap
seconds.
-
NIST broadcasts UTC on WWV (±1ms); GEOS (±.5ms).
Clock synchronization algorithms
-
Synchronize all clocks to UTC receiver or just synchronize
among all machines.
-
Timer interrupts H times / sec; Cp(t) is the time
reported by machine p's clock at UTC time t.
-
1 - r £ dC/dt £
1 + r; r = maximum
drift rate (p.128).
-
2 clocks may be as much as 2rDt
apart.
-
Clocks must resynchronize every d/(2r)
seconds to insure they are never apart by more than d.
2rDt=d => Dt=d/(2r).
-
Christian's algorithm:
-
Time server sends CUTC on request (at least one request every
d/(2r) seconds for
each machine).
-
Cannot set time to message time since time cannot go backwards
and it takes time for message to propagate.
-
Speed up or slow down local clock to correct gradually.
-
Estimate proagation time by (T1-T0)/2.
-
Improve estimate with I = time server's interrupt handling
time: (T1-T0-I)/2 (p.129)
-
Make a series of measurements of T1-T0
and average all but large values (or take minimum).
-
Berkeley algorithm: (p.130)
-
(in BSD Unix; no machine with WWV receiver)
-
Time daemon periodically polls machines
-
Averaging algorithms:
-
Above algorithms are centralized and have the usual disadvantages.
-
One type of decentralized algorithm divides time into resynchronization
intervals [T0+iR, T0+(i+1)R). Each machine
broadcasts its time according to its clock at T0+iR. It
then collects other time broadcasts for S seconds and computes time from
them. Average or average after discarding highest and lowest m values.
Possibly estimate propagation time by probes or network topology.
Use of synchronized Clocks
-
At-most-once message delivery
-
G = CurrentTime-MaxLifetime-MaxClockSkew
-
Current time written to disk every Dt
-
Clock-based cache consistency
3.2. Mutual Exclusion
-
centralized algorithm (p.134)
-
coordinator
-
request, grant, release
-
fair
-
no starvation
-
single point of failure
-
coordinator may be bottleneck
-
distributed algorithm (p.136)
-
Ricart & Agrawala's improvement of Lamport's
-
reliable timestamped requests & OKs
-
token ring algorithm (p.139)
-
comparison (p.140)
3.3. Election algorithms
-
Bully algorithm (pp. 141-142)
-
ring algorithm (p.143)
3.4. Atomic transactions
Introduction
The transaction model
Stable storage (p.146)
Transaction primitives
-
E.g. BEGIN; END; ABORT; READ; WRITE
Properties of transactions (ACID)
-
atomic
-
consistent
-
isolated (serializable; schedules p.149)
-
durable
Nested transactions (only top level durable)
Implementation
-
Private workspace
-
prohibitive cost but optimizations make feasible
-
shadow blocks (p.151)
-
Writeahead log (intentions list) used for rollback (p.152)
-
Two-phase commit protocol (p.153)
Concurrency control algorithms
-
Locking
-
granularity
-
two-phase locking (p. 155)
-
growing phase
-
shrinking phase
-
all interleaved schedules serializable
-
widely used
-
strict two-phase locking
-
value read has been written by committed transaction
-
lock acquisitions and releases handled by system
-
eliminates cascaded aborts
-
deadlock possible
-
Optimistic concurrency control
-
Timestamps (p. 156 ff.)
Transactions complex hence low performance
3.5. Deadlocks in distributed systems
four methods for handling (p. 159)
-
ignore problem
-
detection
-
prevention
-
difficult
-
less so with transactions
-
avoidance
Detection
-
killing process less consequential with transactions than
without
-
centralized (p. 160)
-
distributed (p. 162)
-
Chandy-Misra-Haas algorithm
-
theory and practice greatly divergent in this area
Prevention
-
wait-die (p. 164)
-
wound-wait (p. 165)
Ch. 4: Processes and processors in distributed systems
4.1. Threads
Introduction
lightweight processes (pp. 170-171)
-
running
-
blocked
-
ready
-
terminated
Thread usage
-
dispatcher/worker, team & pipeline models (p.172)
-
server models
-
threads
-
single thread
-
finite automaton
-
threads for clients: one for each file replica; handle signals
-
some apps easier to program with threads
-
threads and multiprocessors
Design issues for threads packages
-
static - dynamic
-
mutex, condition variable (p.176)
-
globals; private globals (p.177)
Implementing a threads package
-
implementing threads in user space
-
user-level; kernel-managed (p.179)
-
jacket; spin lock; busy waiting
-
implementing threads in the kernel
Threads and RPC
-
speed up local RPCs with special identifiers
-
implicit receive; pop-up thread (p.185)
4.2. System models
The workstation model
diskless workstations
diskful (disky) workstations (pp. 187-189)
1. paging, temp files
2. paging, temp files, system binaries
3. paging, temp files, system binaries, file caching
4. complete local file system
Using idle workstations
1. How is an idle workstation found?
2. How can a remote process be run transparently?
3. What if the owner comes back?
-
registry (p.191)
-
home workstation
The processor pool model (p.194)
queuing system (p.195)
-
l = # requests / sec
-
m = rate at which server can process
them
-
T = mean time between request and response
-
one processor:
-
n-processor pool gives:
-
T1 = 1 / (nm - nl)
= T / n
A hybrid model
4.3. Processor allocation
Allocation models
-
nonmigratory; migratory
-
maximize CPU utilization, or
-
minimize mean response time (p.199) or response ratio
Design issues for processor allocation algorithms
1. deterministic - heuristic
2. centralized - distributed
3. optimal - suboptimal
heuristic distributed suboptimal most used
4. local - global
5. sender initiated - receiver initiated (p.201)
Implementation issues
-
load determination
-
process count - not useful
-
CPU utilization - hard to determine accurately
-
overhead - hard to deal with
-
complexity e.g. algorithms 1, 2, 3 (p. 203)
-
stability; nonequilibrium problems
Example algorithms
-
graph-theoretic deterministic (p.204)
-
centralized heuristic: up-down (p.206)
-
coordinator maintains usage table
-
hierarchical e.g. MICROS (p.207)
-
sender-initiated distributed heuristic
-
receiver-initiated distributed heuristic
-
bidding
4.4. Scheduling in distributed systems
p.210 co-scheduling
-
attempt to schedule all members of a group to run simultaneously
4.5. Fault tolerance
Component faults
fault = malfunction
-
transient
-
intermittent
-
permanent
¥
S kp(1-p)k-1
= mean time to failure = 1/p
k=1
System failures
processor or process faults
1. fail-silent faults (fail-stop)
2. byzantine faults
Synchronous vs. asynchronous
-
synchronous: known finite bound on message response time
Use of redundancy
information, time, physical
Fault tolerance using active replication
active replication (state machine approach)
TMR (p.216)
k fault tolerance
-
k+1 replicas needed for fail silent faults
-
2k+1 replicas needed for byzantine faults
atomic broadcast problem
-
for finite state machine model to be relevant all requests
must arrive at all servers in the same order
-
global number server
-
Lamport's logical clocks
-
but server does not know if earlier requests have been sent
Fault tolerance using primary backup
p.218
Agreement in faulty systems
two-army problem
byzantine generals problem (p.221&222)
-
more than 2/3 loyal OK for synchronous
-
all loyal needed for asynchronous
4.6. Real-time distributed systems
fly-by-wire
periodic, aperiodic, sporadic stimuli
pp. 224, 225
soft, hard; 3 myths
Design issues
-
clock synchronization
-
event-triggered, time-triggered
-
predictability
-
fault tolerance
-
fail-safe
-
language support
Real-time communication
TDMA
Ch. 5: Distributed file systems
file service - file server
5.1. Distributed file system design
The file service interface
file:
sequence of bytes (usu. in distributed system)
sequence of keyed records
attributes: owner, size, date, access (rwx)
immutable files: only create, read
capability: user ticket for object access
access control list: who may access and how
upload/download model (p.247)
remote access model
The directory service interface
type: file extension or explicit attribute
hierarchical file system: trees, graphs (p.249)
clients have a) same b) different view of file system
(p.250)
global root directory? /server/path
Naming transparency
location transparency, location independence
1. machine+path naming
2. mount remote file system on local hierarchy
3. single name space same on all machines
1, 2 easy; 3 difficult but needed for single-system image
Two-level naming
symbolic names - binary names
binary name may be server+path, or use symbolic link
capabilities as binary names
symbolic name --> several binary names
Semantics of file sharing
UNIX semantics (see p.253)
session semantics
immutable files
transactions (see p.256)
5.2. Distributed file system implementation
File usage (Satyanarayanan 1981; university setting):
-
most files < 10K
-
reading much more common than writing
-
sequential access usual, random access rare
-
most files have short lifetime
-
file sharing is unusual
-
average process uses only a few files
-
there are distinct file classes with different properties
System structure
-
iterative vs. automatic lookup (p.260)
-
stateless vs. stateful servers (p.261)
Caching (pp. 262, 264)
-
exact LRU using liked lists
-
client and/or server caching
Cache consistency (p.267)
Replication (p.269)
-
lazy: background
-
group: simultaneous
Update protocols
-
primary copy replication
-
voting; read quorum, write quorum (p.271)
-
voting with ghosts (dummy server throws out writes)
Sun's NFS
-
remote mounting
-
protocols
-
file handle (structure)
-
static remote mount on boot
-
automounting
-
stateless, no open or close
-
NIS (yellow pages)
NFS layer structure (p.276)
-
v-node, r-node
-
read ahead
Lessons (p.279)
5.3. Trends
Ch. 6: Distributed shared memory
-
multiprocessors: easy to program, hard to build
-
multicomputers: easy to build, hard to program
-
DSM attempts to make multicomputer resemble multiprocessor
and hence gain ease of programming.
6.1. Introduction
DSM (Li 1986, Li & Hudak 1989)
-
simplest variant:
-
single shared paged virtual address space
-
paging over network
-
may share only variables etc. needed by more than 1 process
-
may replicate shared variables
-
may use objects rather than simple variables
6.2. What is shared memory?
Examples from multiprocessors:
-
bus (pp. 294, 5, 7)
-
ring (p.299)
-
switched (p.302) Dash (Directory Architecture for Shared
memory, p. 304)
-
NUMA (NonUniform Memory Access, p. 309)
-
comparison (pp. 312, 314)
6.3. Consistency models
-
strict (p.315)
-
sequential (p.318)
-
causal (p.322)
-
PRAM (p.323)
-
weak (p.325)
-
release (p. 329)
-
entry (p. 331)
summary of consistency models (p. 332)
6.4. Paged DSM
NORMA (NO Remote Memory Access)
-
basic design
-
replication (p. 335)
-
granularity
-
achieving sequential consistency
-
finding the owner (pp. 340-341)
-
finding the copies (p. 342)
-
page replacement
-
synchronization
6.5. Shared variable DSM
6.6. Object-based DSM
-
objects (pp. 356, 357)
-
Linda
-
Orca