Download PDF with some graph. design-of-operating-system-outline
Design of Operating System
OS protection, modes/privileges
User Mode, Kernel Mode
a register of flag to record what mode the CPU is in
In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC.
In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode.
Transitioning between User and Kernel mode is expensive.
protect programs from each other
◆ Base and limit registers
◆ Page table pointers, page protection, TLB
◆ Virtual memory
Events(Fault, interrupt, trap)
OS handle unnatural change, that is event
Synchronous: when executing instructions
Asynchronous: external event
Unexpected and deliberate (scheduled by OS or application)
Fault: may handle unrecoverable faults by killing the user process. Kernel faults: Dereference NULL, divide by zero, undefined instruction.
System call: trap to syscall à handle syscall à kernel routine à restore routine, set to user mode
Interrupt Handler: part of the operating system, running in kernel mode
Two flavors of interrupts
◆ Precise: CPU transfers control only on instruction boundaries
◆ Imprecise: CPU transfers control in the middle of instruction execution
Timer interrupt: OS reclaim control
Synchronization: interrupt anytime so guarantee short instruction sequence execute atomically
Components, PCB, process control block
All the state for a program in execution.
Stack, heap, static data, code.
Running, ready, waiting
State queue: wait queue, ready queue. PCB are those in the queue.
Process and Thread
PROCESS represented by PCB
◆ An address space
◆ The code for the executing program
◆ The data for the executing program
◆ A set of operating system resources » Open files, network connections, etc.
◆ An execution stack encapsulating the state of procedure calls
◆ The program counter (PC) indicating the next instruction
◆ A set of general-purpose registers with current values
◆ Current execution state (Ready/Running/Waiting)
Deadlock exists among a set of processes if every process is waiting for an event that can be caused only by another process in the set.
- Mutual exclusion – At least one resource must be held in a non-sharable mode
- Hold and wait – There must be one process holding one resource and waiting for another resource
- No preemption – Resources cannot be preempted (critical sections cannot be aborted externally)
- Circular wait – There must exist a set of processes [P1, P2, P3, …,Pn] such that P1 is waiting for P2, P2 for P3, etc.
In computer programming, a mutex is a program object that allows multiple program threads to share the same resource, such as file access, but not simultaneously. When a program is started, a mutex is created with a unique name.
In computing, preemption is the act of temporarily interrupting a task being carried out by a computer system, without requiring its cooperation, and with the intention of resuming the task at a later time. Such changes of the executed task are known as context switches. It is normally carried out by a privileged task or part of the system known as a preemptive scheduler, which has the power to preempt, or interrupt, and later resume, other tasks in the system.
In software engineering, a spinlock is a lock which causes a thread trying to acquire it to simply wait in a loop (“spin”) while repeatedly checking if the lock is available. Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread being waited on (that which holds the lock) blocks, or “goes to sleep”.
A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing. This term was defined formally at some time during the 1970s—an early sighting in the published literature is in Babich’s 1979 article on program correctness. Livelock is a special case of resource starvation; the general definition only states that a specific process is not progressing.
Livelock is a risk with some algorithms that detect and recover from deadlock. If more than one process takes action, the deadlock detection algorithm can be repeatedly triggered. This can be avoided by ensuring that only one process (chosen arbitrarily or by priority) takes action.
◆ Job throughput (# jobs/unit time)
◆ Turnaround time (Tfinish – Tstart)
◆ Waiting time (Avg(Twait): avg time spent on wait queues)
◆ Response time (Avg(Tready): avg time spent on ready queue)
Starvation (side effect of schedule)
a process is prevented from making progress because some other process has the resource it requires
Schedule mode, First in first out, Shortest job first, Round Robin
Round Robin: each task get resource for a fixed period of time
◆ Starvation – low priority jobs can wait indefinitely
◆ “Age” processes
» Increase priority as a function of waiting time
» Decrease priority as a function of CPU consumption
Multi-level feedback queue
Read and write semaphore
A monitor is a module that encapsulates
◆ Shared data structures
◆ Procedures that operate on the shared data structures
◆ Synchronization between concurrent threads that invoke the procedures
guarantee one thread in the monitor, other threads are in waiting queue.
- memory segment
- virtual memory:
- fixed partitions
- variable partitions
- paging (easy to allocate, simplify protection;; still have internal fragmentation, memory reference overhead(time consuming), page table is too large)
VAS: virtual address space
PTE: page table entry
VPN: virtual page number
PFN: page frame number
Page fault and Page hit: whether the referenced VM word is present in memory(if there’s a page fault, the memory is on the disk)
Work set locality: At any point in time, programs tend to access a set of active virtual pages called the working set
Programs with better temporal locality will have smaller working sets
If (working set size < main memory size): Good performance for one process after compulsory misses
If ( SUM(working set sizes) > main memory size )Thrashing: Performance meltdown where pages are swapped (copied) in and out continuously
Page Table problem Equation:
Page size: size of each virtual page.
Page size total = page table entry size * number of page entries
Two level page table case:
2-level page table » Level 1 table: each PTE points to a page table (always memory resident) » Level 2 table: each PTE points to a page (paged in and out like any other data)
Top level page table: number of PTE* size of PTE
Bottom level page table: number of pages * size of pages
Address space = number of pages * 2^vpn
Page size = 2^vpn * PTE
Sharing memory: share memory when fork, avoiding intensive memory copy. Some data can be copied on write.
Page table translation:
TLB: speed up the translation, which functions like L1 Cache
TLB hit and TLB miss
The CPU generates the logical address, which contains the page number and the page offset.
The page number is used to index into the page table, to get the corresponding page frame number, and once we have the page frame of the physical memory(also called main memory), we can apply the page offset to get the right word of memory.
Why TLB(Translation Look Aside Buffer)
The thing is that page table is stored in physical memory, and sometimes can be very large, so to speed up the translation of logical address to physical address , we sometimes use TLB, which is made of expensive and faster associative memory, So instead of going into page table first, we go into the TLB and use page number to index into the TLB, and get the corresponding page frame number and if it is found, we completely avoid page table( because we have both the page frame number and the page offset) and form the physical address.
If we don’t find the page frame number inside the TLB, it is called a TLB miss only then we go to the page table to look for the corresponding page frame number.
If we find the page frame number in TLB, its called TLB hit, and we don’t need to go to page table.
Occurs when we have formed the physical address, using TLB or page table, it does not matter and we do not find it in the main memory.
Page Replacement Algorithm, Thrashing
When there’s a page fault and OS must replace the page in physical memory
- Belady’s algorithm
- Replace the page that will not be used for the longest time in future
- Useful for determine the improvement of algorithm
- FIFO: first in first out
- In computer storage, Bélády’s anomaly is the phenomenon in which increasing the number of page frames results in an increase in the number of page faults for certain memory access patterns. This phenomenon is commonly experienced when using the First in First Out (FIFO) page replacement algorithm. László Bélády demonstrated this in 1969.
- LRU: least recently used
- cannot do precisely, so we have do approximate it
- periodically set REF to 0
- LRU clock
- Victim buffer.
- Throw those will be deleted into buffer
The “working set” is short hand for “parts of memory that the current algorithm is using” and is determined by which parts of memory the CPU just happens to access. It is totally automatic to you. If you are processing an array and storing the results in a table, the array and the table are your working set.
Dynamic locality of process’s memory usage
Working set size is the number of pages in the working set
PFF: page fault frequency
Monitor the fault rate of each process and then decide to increase or reduce memory
File seek policy
FCFS: do nothing
SSTF: shortest seek time first
C-SCAN: elevator only in one direction
Optimized File systems
Fast File System, FFS
|FFS Main features:
Cylinder groups to improve locality.
Copy of superblock in each cylinder group (on different platters) to improve fault tolerance.
Free block bitmap per cylinder group,
supports allocation of consecutive blocks.
Larger blocks (multiple of 4kB) for improved throughput.
Clustering of I/O operations to further improve throughput[Pea88].
Fragments (1/2 – 1/8 blocks) to reduce fragmentation.
Pre-allocated inodes scattered throughout cylinder groups,
Bitmap free list for fast allocation.
Ability to rebuild free lists after crash (fsck).
If possible, file direct-blocks and index blocks are allocated from same CG as inode,
up to a maximum (25% of CG capacity) to avoid squeezing out small files.
Attempt to allocate consecutive logical blocks rotationally optimal.
Aided by bitmap free list.
Attempt to allocate (plain) files’ inodes in same CG as directory.
Spread directories across CGs.
Last block of small file is actually array of consecutive fragments.
Fragments only allowed for files not using indirect blocks.
Fragment array is unaligned but must not span blocks.
try allocating extra fragment from same block,
Allocation supported by using fragment granularity in free block bitmap.
Try to allocate logically contiguous blocks contiguously (up to 64kB).
Delay writes until ready to write whole cluster (if possible).
Read ahead full cluster for sequentially accessed files.
Reallocation of modified blocks to improve locality.
Separate cluster map to help finding clusters.
Variant: Linux Extended-2 File System (Ext2FS)
(Logical) block groups rather than cylinder groups.
Inode array in each block group.
Store (short) symbolic links in inode.
Preallocates contiguous blocks.
Ext2FS Disk layout
FFS Problem: Synchronous updates
Need to be able to recover consistent state of FS after crash.
Synchronous writes of metadata for consistency on disk:
On file creation:
Write inode before updating directory.
On file deletion:
Write updated directory before deallocating inode.
Write updated inode before freeing blocks in bitmap.
On file extension:
Write data block before inode (for security).
Write updated inode/indirect block before updating bitmap.
Similarly on file truncation.
Running fsck after crash rebuilds consistent bitmaps.
Synchronous I/O limiting when creating/deleting many files.
Delayed writes/soft updates[GP94]:
Consistency does not require synchronous writes.
Must ensure that dependent writes are kept in sequence.
E.g., ensure that on creation inode is written before bitmap.
Unrelated reads or writes can still occur in arbitrary order.
enhance disk scheduler to honour write dependencies,
use copy-on-write on source blocks to avoid blocking further modifications.
Slightly more loss of user data during crash.
Much more complicated disk scheduling and interrupt processing.
Disks can reorder write requests internally
JFS, LFS, RAID,
1. Explain the difference between internal and external
Internal fragmentation is the wasted space within each allocated block
because of rounding up from the actual requested allocation to the
allocation granularity. External fragmentation is the various free
spaced holes that are generated in either your memory or disk space.
External fragmented blocks are available for allocation, but may be
too small to be of any use.