Saturday, January 30, 2016

Spam Filter Challenge

Adaptation of Adversaries [1]

  • The adversaries are motivated to transform the test data to reduce the learner's effectiveness. 
  • Spam filter designers
    • Attempt to learn good filters by training their algorithms on Spam (and legitimate) email messages received in the recent past. 
  • Spammers
    • Are motivated to reverse-engineer existing Spam filters and use this knowledge to generate messages which are different enough from the (inferred) training data to circumvent the filters. 

Solutions


  • Increase the robustness of the learning algorithm to generic training/test data differences via standard methods such as regularization or minimization of worst-case loss [1]
    • However, these techniques do not account for the adversarial nature of the training/test set discrepancies and may be overly conservative.
  • Predictive analytics to anticipate and counter the adversaries [1]
    • For example, predictions can be made using extrapolation or game-theoretic considerations, and can be employed to transform training instances so that they become similar to (future) test data and therefore provide a more appropriate basis for learning.
  • Time-varying posture to increase uncertainty [1]
    • Pros
      • This approach is flexible, scalable, easy to implement, and hard to reverse-engineer.

Reference

[1] Moving Target Defense for Adaptive Adversaries, by Richard Colbaugh and Kristin Glass, in ISI 2013.

Monday, January 25, 2016

Covert Channel Tutorial


High-Speed Covert Channel Attacks in the Cloud 





  • The challenges in conducting covert channel attacks
    • scheduling uncertainty
    • address uncertainty
    • cache physical limitations
      • Different VMs may not share cache
      • This can be overcome by exploiting the atomic operation implementation in the system
        • i.e., the memory bus is locked when an atomic operation operates on data spanning cache lines, an effect visible to all cores regardless of cache sharing
  • Memory-bus based covert channel attacks (a minimal sketch follows)
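
A minimal sketch of the bus-locking trick (assuming x86-64 Linux and GCC's __sync builtins; the buffer layout and iteration count are illustrative): an atomic read-modify-write on a value straddling two cache lines locks the memory bus, and a co-resident receiver can observe the resulting memory-latency spike.

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_LINE 64

    int main(void) {
        /* Place a 4-byte value so it straddles a cache-line boundary
           (2 bytes before the end of a 64-byte-aligned line). */
        char *buf = aligned_alloc(CACHE_LINE, 2 * CACHE_LINE);
        volatile uint32_t *straddler = (volatile uint32_t *)(buf + CACHE_LINE - 2);

        /* Sender: a burst of locked atomic operations signals a "1";
           staying idle signals a "0". */
        for (int i = 0; i < 100000; i++)
            __sync_fetch_and_add(straddler, 1);

        free(buf);
        return 0;
    }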

Friday, January 22, 2016

Unix Command

Procedure

  • For example, "rm file.txt"
    • The shell would search for a file called rm, load it into memory, and execute it with the parameter file.txt.

User Mode and Kernel Mode

Mode Bit

  • A bit, called the mode bit, is added to the hardware of the computer to indicate the current mode.
    • 0: kernel mode
      • When a task is executed on behalf of the operating system
    • 1: user mode
      • When a task is executed on behalf of the user

Switch

  • Whenever a trap or interrupt occurs, the hardware switches from user mode to kernel mode
    • i.e., change mode bit to 0
  • When a user application requests a service from the operating system (via a system call)
    • It must transition from user to kernel mode to fulfill the request
  • Privileged Instructions
    • The hardware allows privileged instructions to be executed only in kernel mode. 
    • The instruction to switch to kernel mode is an example of a privileged instruction. Some other examples include I/O control, timer management, and interrupt management. 

Privileges

Escalate Privileges

  • For example, on Unix, the setuid attribute on a program causes that program to run with the user ID of the owner of the file, rather than the current user's ID.
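
A small illustration of the effect (standard POSIX calls; the chmod usage in the comment is the conventional way to set the bit): the real UID stays the caller's while the effective UID becomes the file owner's.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* On a binary owned by root with the setuid bit set (chmod u+s),
           the effective UID is 0 while the real UID remains the caller's. */
        printf("real uid: %d, effective uid: %d\n",
               (int)getuid(), (int)geteuid());
        return 0;
    }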


Virtual Memory

Advantage

  • It allows the execution of a process that is not completely in memory, which enables users to run programs that are larger than actual physical memory.
  • In addition, it abstracts main memory into a large, uniform array of storage, separating logical memory as viewed by the user from physical memory. This arrangement frees programmers from concern over memory-storage limitations. 


Thursday, January 21, 2016

Driver

Device Controller

  • A device controller is in charge of the devices. 
  • It maintains 
    • some local buffer
    • and a set of special purpose registers
  • It is responsible for moving the data between the peripheral devices that it controls and its local buffer storage.

Device Driver

  • The operating system has a device driver for each device controller
  • This device driver understands the device controller and presents a uniform interface to the device to the rest of the operating system


Interrupt-Driven I/O

  • Procedure
    • The device driver loads the appropriate registers within the device controller.
    • The device controller, in turn, examines the contents of these registers to determine what action to take.
    • The controller starts the transfer of the data from the device to its local buffer.
    • Once the transfer of data is complete, the device controller informs the device driver via an interrupt that it has finished its operation.
    • The device driver then returns control to the operating system, possibly returning the data or a pointer to the data if the operation was a read.
  • Drawback
    • High overhead when used for bulk data movement such as disk I/O.
    • To solve this problem, direct memory access (DMA) is used.

Direct Memory Access (DMA)

  • After setting up buffers, pointers, and counters for the I/O device, the device controller transfers an entire block of data directly to or from its own buffer storage to memory, with no intervention by the CPU.

Interrupt

Introduction


The occurrence of an event is usually signaled by an interrupt from either the hardware or the software.
  • Hardware
    • Hardware may trigger an interrupt at any time by sending a signal to the CPU, usually by way of the system bus.
  • Software (called a trap)
    • Software may trigger an interrupt by executing a special operation called a system call.
    • A trap can also be caused by an error, e.g., division by zero or an invalid memory access.


Return Address

  • Before the interrupt
    • Before entering the interrupt handler, the return address is stored on the system stack.
  • After the interrupt
    • After the interrupt is serviced, the saved return address is loaded into the program counter, and the interrupted computation resumes as though the interrupt had not occurred.

Bootstrap

Bootstrap


When a computer starts running, i.e., when it is powered up or rebooted, it needs an initial program to run.

This initial program, or bootstrap program, tends to be simple. Typically, it is stored in read-only memory (ROM) or electrically erasable programmable read-only memory (EEPROM), known by the general term firmware, within the computer hardware.

Computer Storage Unit

Bit

A bit is the basic unit of computer storage. It can contain one of two values, zero or one.


Byte

A byte is 8 bits, and on most computers it is the smallest convenient chunk of storage. 
  • For example, most computers don't have an instruction to move a bit but do have one to move a byte

Word

A word is generally made up of one or more bytes. 
  • For example, a computer may have instructions to move 64-bit (8-byte) words.

KiloByte

  • 1024 Bytes

MegaByte

  • (1024)^2 Bytes
    • Usually rounded off to 1 million bytes


GigaByte

  • (1024)^3 Bytes
    • Usually rounded off to 1 billion bytes

Wednesday, January 20, 2016

Session

Definition

A session is a semi-permanent interactive information interchange between two or more communication devices.


Implementation

The session is implemented by means of a database with information about the state of each session.

The reason for not implementing it with multiple processes or threads is the large resource overhead that would involve.


Cluster of Servers

When a client may connect to any server in a cluster of servers, it is important to maintain consistency.

The client must be either directed to the same server for the duration of the session, or the servers must transmit server-side session information via a shared file system or database.


Client Side Web Sessions

Client-side sessions use cookies and cryptographic techniques to maintain state without storing as much data on the server. 

When presenting a dynamic web page, the server sends the current state data to the client (web browser) in the form of a cookie. The client saves the cookie in memory or on disk. 

With each successive request, the client sends the cookie back to the server, and the server uses the data to "remember" the state of the application for that client and generate an appropriate response.


Issues

  • Browsers limit the number and size of cookies that a web site may store.


HTTP Session Token

  • A session token is a unique identifier that is generated and sent from a server to a client to identify the current interaction session. The client usually stores and sends the token as an HTTP cookie and/or sends it as a parameter in GET or POST queries. 
The reasons for using session tokens are that
  • The client only has to handle the identifier
  • All session data is stored on the server (usually in a database).





Completely Fair Scheduler


Introduction

CFS uses a concept called "sleeper fairness",

  • which considers sleeping or waiting tasks equivalent to those on the runqueue.
  • As a result, interactive tasks which spend most of their time waiting for user input or other events get a comparable share of CPU time when they need it.

Algorithm

  • The data structure used for the scheduling algorithm is a red-black tree in which the nodes are scheduler-specific structures, entitled "sched_entity". 
    • These are derived from the general task_struct process descriptor, with added scheduler elements. These nodes are indexed by processor execution time in nanoseconds. 
    • Ideal Processor
      • An "ideal processor" would equally share processing power among all processes
    • Maximum execution time
      • A maximum execution time is also calculated for each process, based upon the idea that an "ideal processor" would equally share processing power among all processes.
      • Thus the maximum execution time is the time the process has been waiting to run, divided by the total number of processes; in other words, it is the time the process would have expected to run on an "ideal processor"
  • When the scheduler is invoked to run a new process, the operation of the scheduler is as follows (a sketch follows the list)
    • The leftmost node of the scheduling tree is chosen (as it will have the lowest spent execution time), and sent for execution
    • If the process simply completes execution, it is removed from the system and the scheduling tree
    • If the process reaches its maximum execution time or is otherwise stopped (voluntarily or via interrupt), it is reinserted into the scheduling tree based on its new spent execution time
    • The new leftmost node will then be selected from the tree, repeating the iteration. 
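
A toy sketch of that loop (not the kernel's actual code; the red-black tree is replaced by a linear scan, and the 1 ms slice is illustrative):

    #include <stdio.h>

    struct sched_entity {
        const char *name;
        unsigned long vruntime;   /* spent execution time, in nanoseconds */
    };

    /* Stand-in for picking the leftmost node of the red-black tree. */
    static struct sched_entity *pick_next(struct sched_entity *tasks, int n) {
        struct sched_entity *leftmost = &tasks[0];
        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < leftmost->vruntime)
                leftmost = &tasks[i];
        return leftmost;
    }

    int main(void) {
        struct sched_entity tasks[] = { {"A", 0}, {"B", 0}, {"C", 0} };
        for (int tick = 0; tick < 6; tick++) {
            struct sched_entity *t = pick_next(tasks, 3);
            printf("tick %d: run %s (vruntime=%lu)\n", tick, t->name, t->vruntime);
            t->vruntime += 1000000;   /* charge a 1 ms slice, then "reinsert" */
        }
        return 0;
    }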


Reference

[1] https://en.wikipedia.org/wiki/Completely_Fair_Scheduler

CPU Components

Components in a CPU


General purpose Registers

  • Register
    • A synonym for memory in computer science
  • General purpose register
    • A memory cell
  • Each general purpose register has a unique name
  • It is used to 
    • store (and recall) intermediate result of complex computations

Arithmetic and Logic Unit (ALU)

  • It is a complex electrical circuit that can perform mathematical (+, -, ×, /) and logical (<, <=, >, >=, and, or) operations
  • The output (result) of the computation (obtained by the ALU) is often stored in a general purpose register

Instruction Register (IR)

  • Contains the current instruction being executed by the CPU
  • The CPU will perform the operation indicated by the instruction code contained in the instruction register

Program Counter (PC)

  • The program counter is a register (memory cell)!
  • This register contains the address (location in memory) of the next instruction after the CPU finishes executing the current instruction in the instruction register
  • The value in the program counter will be increased after the CPU finishes executing one instruction

Processor Status Register (PSR)

  • This register contains the various information about the CPU.
  • Among the information contained in the PSR is
    • The result of a comparison operation
  • When the CPU compares two numbers a and b, the outcome of the comparison is stored in the PSR
  • The outcome of a compare operation allows the CPU to determine the relation between a and b (e.g., whether a < b, a = b, or a > b)

Reference

[1] http://www.mathcs.emory.edu/~cheung/Courses/170/Syllabus/01/intro-computer2.html

Process Context Switch

Context Switch

  • Whenever an interrupt arrives, the CPU must do a state-save of the currently running process, then switch into kernel mode to handle the interrupt, and then do a state-restore of the interrupted process. 
  • Context switch happens when
    • the time slice for one process has expired and a new process is to be loaded from the ready queue.
    • This will be instigated by a timer interrupt, which will then cause the current process' state to be saved and the new process's state to be restored.
    • Saving and restoring states involves saving and restoring all registers and program counter(s), as well as the process control blocks.
    • Context switches happen VERY VERY frequently, and the overhead of doing the switching is just lost CPU time, so context switches (state saves and restores) need to be as fast as possible. Some hardware has specific provisions for speeding this up, such as a single machine instruction for saving or restoring all registers at once. 

Registers

  • The operating system needs to store a copy of the CPU registers to memory.
  • When it is time for the process to give up the processor so another process can run, its current state needs to be saved.
  • Equally, we need to be able to restore this state when the process is given more time to run on the CPU. To do this, the operating system stores a copy of the CPU registers to memory.
  • When it is time for the process to run again, the operating system copies the register values back from memory to the CPU registers, and the process is right back where it left off. 

Reference

[1] http://www.bottomupcs.com/elements_of_a_process.xhtml

Process Heap

Introduction


The heap is an area of memory that is managed by the process for on-the-fly memory allocation.

This is for variables whose memory requirements are not known at compile time.



  • brk
    • The bottom of the heap is known as the brk, so called for the system call which modifies it.
    • By using the brk call to grow the area downward, the process can request that the kernel allocate more memory for it to use.
  • malloc
    • The heap is mostly managed by the malloc library call.
    • This makes managing the heap easy for the programmer, by allowing them to simply allocate and free (via the free call) heap memory.
    • Malloc can use schemes like a buddy allocator to manage the heap memory for the user.
    • Malloc can also be smarter about allocation and potentially use anonymous mmaps for extra process memory.
    • That is, instead of mmaping a file into the process memory, it directly maps an area of system RAM.
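
A small illustration (Linux/glibc assumed; sbrk(0) reports the current program break): a small malloc typically moves the break, while large allocations may be served by anonymous mmap instead.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        void *before = sbrk(0);   /* current program break */
        char *p = malloc(4096);   /* small allocation, served from the heap */
        void *after = sbrk(0);    /* break has moved to grow the heap */

        printf("break before: %p\nbreak after:  %p\n", before, after);
        free(p);
        return 0;
    }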

Process Stack

Stacks usually grow down, i.e., the stack starts at a higher address in memory and progressively gets lower.




Stack Pointer


  • To keep track of current growth of the stack, the hardware defines a register as the stack pointer. 
  • The compiler (or the programmer, when writing in assembler) uses this register to keep track of the current top of the stack. 
  • Store
    • It is stored in the register ESP.


Call Stack

  • A call stack is used for several related purposes, but the main reason for having one is to keep track of the point to which each active subroutine should return control when it finishes executing.
  • Active subroutine
    • An active subroutine is one that has been called but is yet to complete execution, after which control should be handed back to the point of the call. 

How it works?

  • Return address
    • Store the return address somewhere, so that we can continue executing the current function after completion of the called function.
  • Other info
    • Saving other information about the current function. 
  • Parameters
    • Providing the callee with the parameters
  • Providing the called function some space to store its automatic variables
  • Return value
    • Providing some mechanism to get the return value from the called function. 
    • It is implemented using the EAX register: the called function stores its return value in EAX.

Execution in Steps

  1. Push the parameters in reverse order (right to left)
  2. Call the function
    1. This implicitly pushes the return address onto the stack
  3. Push the current EBP
    1. i.e., push the Frame Pointer (FP) of the calling function onto the stack. We need to do this to continue with the execution of the calling function after returning from the called function.
  4. Copy the ESP to EBP
    1. This location will be the new Frame Pointer
  5. Allocate space on the stack for local variables
    1. This is done by decrementing ESP
  6. Put the value to be returned in EAX
  7. Copy the current EBP to ESP
    1. This reduces the stack size. Now the old FP is at the top of the stack.
  8. ...
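
These steps are exactly what the standard x86 function prologue and epilogue do; a minimal sketch in C (32-bit cdecl assumed), with comments tying each line back to the numbered steps:

    int add(int a, int b) {   /* prologue: push ebp (step 3), mov ebp, esp
                                 (step 4), sub esp, N for locals (step 5) */
        int sum = a + b;      /* 'sum' lives in the newly allocated frame */
        return sum;           /* step 6: return value placed in EAX; epilogue:
                                 mov esp, ebp (step 7), pop ebp, then ret pops
                                 the return address pushed in step 2 */
    }

    int caller(void) {
        return add(1, 2);     /* step 1: arguments pushed right to left;
                                 step 2: call pushes the return address */
    }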

Stack Frame

  • The stack frame generally includes the following components
    • The return address
    • Argument variables passed on the stack
    • Local variables
    • Saved copies of any registers modified by the subprogram that need to be restored.



Stack Overflow


  • Introduction
  • Suggestion
    • If you are a programmer accepting arbitrary input into a stack variable (say, reading from the keyboard or over the network), you need to check how big that data is going to be
    • OS
      • The OS can make the stack not executable. 

Reference

[1] http://articles.manugarg.com/stack.html

Process Communication


Shared Memory Systems


  • In general, the memory to be shared in a shared-memory system is initially within the address space of a particular process, which needs to make system calls in order to make that memory publicly available to one or more other processes. 
  • Other processes which wish to use the shared memory must then make their own system calls to attach the shared memory area onto their address space. 
  • Generally a few messages must be passed back and forth between the cooperating processes first in order to set up and coordinate the shared memory access. 

Example" POSIX Shared Memory


  • Step 1: one of the processes involved allocates some shared memory, using shmget
    • int segment_id = shmget(IPC_PRIVATE, size, S_IRUSR | S_IWUSR)
      • The first parameter 
        • specifies the key (identifier) of the segment. IPC_PRIVATE creates a new shared memory segment.
      • The second parameter
        • How big the shared memory segment is to be, in bytes.
      • The third parameter
        • A set of bitwise ORed flags. In this case, the segment is being created for reading and writing.
      • The return value
        • an integer identifier
  • Step 2: Any process which wishes to use the shared memory must attach the shared memory to its address space, using shmat
    • char* shared_memory = (char *) shmat(segment_id, NULL, 0)
      • The first parameter
        • Specifies the key of the segment that the process wishes to attach to its address space
      • The second parameter
        • Indicates where the process wishes to have the segment attached. NULL indicates that the system should decide.
      • The third parameter
        • A flag for read-only operation. Zero indicates read-write; SHM_RDONLY indicates read-only.
      • The return value
        • The return value of shmat is a void *, which the process can use (type cast) as appropriate. In this example, it is being used as a character pointer. 
  • Step 3: The process may access the memory using the pointer returned by shmat. For example, using sprintf
    • sprintf(shared_memory, "Writing to shared memory\n");
  • Step 4: When a process no longer needs a piece of shared memory, it can be detached using shmdt
    • shmdt(shared_memory)
  • Step 5: And finally the process that originally allocated the shared memory can remove it from the system using shmctl
    • shmctl(segment_id, IPC_RMID, NULL)
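
Putting the five steps together, a minimal runnable sketch (System V shared memory; error checking omitted for brevity):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/stat.h>

    int main(void) {
        const int size = 4096;
        int segment_id = shmget(IPC_PRIVATE, size, S_IRUSR | S_IWUSR); /* step 1 */
        char *shared_memory = (char *) shmat(segment_id, NULL, 0);     /* step 2 */

        sprintf(shared_memory, "Writing to shared memory\n");          /* step 3 */
        printf("%s", shared_memory);

        shmdt(shared_memory);                                          /* step 4 */
        shmctl(segment_id, IPC_RMID, NULL);                            /* step 5 */
        return 0;
    }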



Message-Passing


  • Message passing systems must support, at a minimum, system calls for "send message" and "receive message"
  • A communication link must be established between the cooperating processes before messages can be sent 
  • There are three key issues to be resolved in message passing systems
    • Direct or indirect communication (naming)
    • Synchronous or asynchronous communication
    • Automatic or explicit buffering

Naming

  • Direct communication
    • With direct communication, the sender must know the name of the receiver to which it wishes to send a message
  • Indirect communication
    • Multiple processes can share the same mailbox or boxes.
    • Only one process can read any given message in a mailbox. 
    • The OS must provide system calls to create and delete mailboxes, and to send and receive messages to/from mailboxes.

Synchronization

  • Either the sending or receiving of messages (or neither or both) may be either blocking or non-blocking.

Buffering

  • Messages are passed via queues, which may have one of three capacity configurations
    • Zero capacity
      • Messages cannot be stored in the queue, so senders must block until receivers accept the messages.
    • Bounded capacity
      • There is a certain pre-determined finite capacity in the queue. Senders must block if the queue is full, until space becomes available in the queue, but may be either blocking or non-blocking otherwise. 
    • Unbounded capacity
      • The queue has a theoretical infinite capacity, so senders are never forced to block. 



Reference

[1] https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/3_Processes.html

Process Memory, State, and Scheduling

Process Concept

  • A process is an instance of a program in execution


Process Memory




  • Text: 
    • comprises the compiled program code, read in from non-volatile storage when the program is launched. 
  • Data
    • Stores global and static variables, allocated and initialized prior to executing main.
  • Heap
    • Used for dynamic memory allocation, is managed via calls to new, delete, malloc, free etc. 
  • Stack
    • Used for local variables
    • Space on the stack is reserved for local variables when they are declared (at function entrance or elsewhere, depending on the language), and the space is freed up when the variables go out of scope
    • The stack is also used for function return values, and the exact mechanisms for stack management may be language specific. 
  • When processes are swapped out of memory and later restored, additional information must also be stored and restored. Key among them are the program counter and the value of all program registers

Process State


  • New
    • The process is in the stage of being created.
  • Ready
    • The process has all the resources available that it needs to run, but the CPU is not currently working on this process's instruction.
  • Running
    • The CPU is working on this process's instruction.
  • Waiting
    • The process cannot run at the moment, because it is waiting for some resource to become available or for some event to occur. For example, the process may be waiting for keyboard input, a disk access to complete, inter-process messages, a timer to go off, or a child process to finish.
  • Terminated
    • The process has completed.

Process Control Block

For each process, there is a Process Control Block (PCB), which stores the following (types of) process-specific information.



  • Process State
    • Running, waiting etc. 
  • Process ID
    • and also parent process ID
  • CPU registers and Program Counter
    • These need to be saved and restored when swapping processes in and out of the CPU
  • CPU scheduling information
    • Such as priority information and pointers to scheduling queues
  • Memory-Management information
    • e.g., page tables or segment tables
  • Accounting information
    • User and kernel CPU time consumed, account numbers, limits, etc. 
  • I/O Status information
    • Devices allocated, open file tables etc. 
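
A rough sketch of what such a structure might hold (field names and sizes are illustrative, not from any particular kernel):

    struct pcb {
        enum { NEW, READY, RUNNING, WAITING, TERMINATED } state; /* process state */
        int pid;                              /* process ID */
        int ppid;                             /* parent process ID */
        unsigned long registers[16];          /* saved CPU registers */
        unsigned long program_counter;        /* saved program counter */
        int priority;                         /* CPU scheduling information */
        void *page_table;                     /* memory-management information */
        unsigned long user_time, kernel_time; /* accounting information */
        int open_files[64];                   /* I/O status information */
        struct pcb *next;                     /* scheduling queue linkage */
    };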


Process Scheduling

  • The two main objectives of the process scheduling system are to keep the CPU busy at all times and to deliver "acceptable" response times for all programs, particularly interactive ones. 
  • The process scheduler must meet these objectives by implementing suitable policies for swapping processes in and out of the CPU.

Scheduling Queues

  • Job queue
    • All processes are stored in the job queue.
  • Ready queue
    • Processes in the Ready state are placed in the ready queue
  • Device queue
    • Processes waiting for a device to be available or to deliver data are placed on device queues. There is generally a separate device queue for each device.



Scheduler

  • Long-term scheduler
    • Typical of a batch system or a very heavily loaded system. It runs infrequently (such as when one process ends, selecting one or more to be loaded in from disk in its place), and can afford to take the time to implement intelligent and advanced scheduling algorithms
  • Short-term scheduler, or CPU scheduler
    • Runs very frequently, on the order of 100 milliseconds, and must very quickly swap one process out of the CPU and swap in another one.
  • Medium-term scheduler
    • Some systems also employ a medium-term scheduler.
    • When system loads get high, this scheduler will swap one or more processes out of the ready queue for a few seconds, in order to allow smaller, faster jobs to finish up quickly and clear the system. 




Reference

[1] https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/3_Processes.html

Monday, January 18, 2016

Online Hiring Problem

Problem

Suppose there are n candidates, and we do not want to interview all of them to find the best one. We also do not wish to hire and fire as we find better and better candidates.

Instead, we are willing to settle for a candidate who is close to the best, in exchange for hiring exactly once.

For each interview we must either immediately offer the position to the applicant or immediately reject the applicant.

What is the trade-off between minimizing the amount of interviewing and maximizing the quality of the candidate hired?

Code

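The code block is missing from the original post; a minimal sketch following CLRS's ONLINE-MAXIMUM(k, n) (0-indexed, as the analysis below notes; the score array is an assumed input):

    /* Reject the first k candidates, remembering the best score seen;
       then hire the first later candidate who beats that benchmark. */
    int online_maximum(const double *score, int n, int k) {
        double best_seen = score[0];
        for (int i = 1; i < k; i++)
            if (score[i] > best_seen)
                best_seen = score[i];
        for (int i = k; i < n; i++)
            if (score[i] > best_seen)
                return i;           /* hire candidate i */
        return n - 1;               /* no one beat the benchmark: hire the last */
    }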

Analysis

In the following, we analyze the problem to determine the value of k that maximizes the probability that we hire the best candidate. In the analysis, we assume the index starts with 1 (rather than 0 as shown in the code).

Let $B_i$ be the event that the best candidate is the $i$-th candidate.
Let $O_i$ be the event that none of the candidate in position $k+1$ to $i-1$ is chosen.
Let $S_i$ be the event that we hire the best candidate when the best candidate is in the $i$-th position.

We have $Pr(S_i) = Pr(B_i \cap O_i) = Pr(B_i) \cdot Pr(O_i)$ since $B_i$ and $O_i$ are independent events. Here $Pr(B_i) = 1/n$, and $Pr(O_i) = k/(i-1)$ because none of the candidates in positions $k+1$ to $i-1$ is chosen exactly when the best of the first $i-1$ candidates falls within the first $k$ positions.

$Pr(S) = \sum^n_{i=k+1} Pr(S_i)$
$ = \sum^n_{i=k+1} \frac{1}{n} \cdot \frac{k}{i-1}$
$ = \frac{k}{n} \sum^n_{i=k+1} \frac{1}{i-1}$
$ \leq \frac{k}{n} \cdot (\ln n - \ln k)$

Differentiating $\frac{k}{n} (\ln n - \ln k)$ with respect to $k$ gives $\frac{1}{n} (\ln n - \ln k - 1)$. Setting this derivative equal to 0, we see that the probability is maximized when $k = n/e$, where it is at least $1/e$.


Reference
[1] Introduction to Algorithms, CLRS


Instruction Set Randomization

Motivation

Attackers can inject instructions into executables. 

Definition

Instead of executing the executable directly, we add randomization to the executable, and de-randomize it at execution time.





Randomization

  • XOR
    • Use a key to XOR with the instructions, and XOR again at execution time (a sketch follows)
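
A toy sketch of the XOR scheme (purely illustrative; real ISR systems randomize at load time and de-randomize in the fetch/decode path): injected code never went through the encoding step, so de-randomizing it yields garbage instructions that crash rather than execute.

    #include <stdio.h>
    #include <stddef.h>

    /* The same XOR routine both randomizes and de-randomizes. */
    static void xor_bytes(unsigned char *code, size_t len, unsigned char key) {
        for (size_t i = 0; i < len; i++)
            code[i] ^= key;
    }

    int main(void) {
        unsigned char key = 0xA5;
        unsigned char program[] = { 0x90, 0x90, 0xC3 };   /* pretend instructions */

        xor_bytes(program, sizeof program, key);   /* randomize at load time */
        xor_bytes(program, sizeof program, key);   /* de-randomize at fetch time */

        printf("%02x %02x %02x\n", program[0], program[1], program[2]);
        return 0;
    }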



Reference

[1] https://www.youtube.com/watch?v=ZgNBjwXTrqA

Buffer Overflow

Definition

A buffer overflow occurs when a program or process tries to store more data in a buffer than it was intended to hold. Since buffers are created to contain a finite amount of data, the extra information -- which has to go somewhere -- can overflow into adjacent buffers, corrupting or overwriting the valid data held in them. 

In buffer overflow attacks, the extra data may contain code designed to trigger specific actions.
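
A classic illustration (the function names are hypothetical): copying unchecked input into a fixed-size stack buffer overwrites adjacent stack data, including the saved return address.

    #include <string.h>

    void vulnerable(const char *input) {
        char buf[16];
        strcpy(buf, input);        /* no bounds check: input longer than 16
                                      bytes overflows into adjacent memory */
    }

    void safer(const char *input) {
        char buf[16];
        strncpy(buf, input, sizeof(buf) - 1);  /* bounded copy */
        buf[sizeof(buf) - 1] = '\0';           /* ensure termination */
    }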


Reference

[1] http://searchsecurity.techtarget.com/definition/buffer-overflow

Address Space Layout Randomization (ASLR)

Definition

Address space layout randomization (ASLR) is a memory-protection technique for operating systems that guards against buffer-overflow attacks by randomizing the locations where system executables are loaded into memory.

Objective

The success of many cyberattacks, particularly zero-day exploits, relies on the hacker's ability to know or guess the position of processes and functions in memory.

ASLR is able to put address space targets in unpredictable locations. If an attacker attempts an exploit against an incorrect address space location, the target application will crash, stopping the attack and alerting the system.


Current Deployments

ASLR was created by the PaX Project as a Linux patch in 2001. 
  • It was integrated into the Windows operating system beginning with Vista in 2007. Prior to ASLR, the memory locations of files and applications were either known or easily determined. 
  • Adding ASLR to Vista increased the number of possible address space locations to 256, meaning attackers have only a 1 in 256 chance of finding the correct location to execute code.
  • Apple began including ASLR in Mac OS X 10.5 Leopard, and Apple iOS and Google Android both added ASLR in 2011.

Reference

[1] http://searchsecurity.techtarget.com/definition/address-space-layout-randomization-ASLR

Sunday, January 17, 2016

HoneyPots

Introduction

[1] Deception is a mechanism that attempts to distort or mislead an attacker into taking a course of action that is more suited to the goals of the defender.

A common deception defense is the use of network honeypots.

A honeypot is a computer system that is designed to be a trap for unauthorized accesses.

Honeypots are deployed within a network to appear like normal, active systems to an outsider.


How to build honeypots


  • Mimicking
    • One of the deception techniques is mimicking. A honeypot attempts to mimic a real system to fool the adversary into probing and/or attacking it. 
    • The honeypot responds to queries with information that represents a plausible system within the infrastructure, but unlike a normal system, it maintains very detailed logs of all interactions. From these detailed logs, administrators can gain insight into an attacker's goals and methods, as well as put other measures in place in hopes of preventing an attack. 

Reference

[1] Probabilistic Performance Analysis of Moving Target and Deception Reconnaissance Defenses, by Michael Crouse, in MTD15

Wednesday, January 13, 2016

Cache in Virtualized Environments

System Model for a Multi-core Processor




Cache Hierarchy

Reference [1]

Because of the long access time of main memory compared to fast processors, smaller but faster memories, called caches, are used to reduce the effective memory access time as seen by a processor.

Modern processors feature a hierarchy of caches.

"Higher-level" caches, which are closer the processor core are smaller but faster than lower-level caches, which are closer to main memory.

L1 caches 

Each core typically has two private top level caches

  • 1) one for data
  • 2) one for instructions
A typical L1 cache size is 32KB with 4-cycle access time, as in Intel Core and Xeon families. 

LLC

The LLC is shared among all cores of a multicore chip and is a unified cache, i.e., it holds both data and instructions. [1]

LLC sizes are measured in megabytes, and access latencies are on the order of 40 cycles.


L2 caches

Modern x86 processors typically also support core-private, unified L2 caches of intermediate size and latency. 



How it works

Any memory access first accesses the L1 cache; on a miss, the request is sent down the hierarchy until it hits in a cache or accesses main memory. 

The L1 is typically indexed by virtual address, while all other caches are indexed by physical address. 


Cache Access

Reference [1]

To exploit spatial locality, caches are organized in fixed-size lines, which are the units of allocation and transfer down the cache hierarchy. 

A typical line size B is 64 bytes.

The log2(B) lowest-order bits of the address, called the line offset, are used to locate a datum in the cache line. 



Set-Associative Caches


Caches today are usually set-associative, i.e., organized as S sets of W lines each, called a W-way set-associative cache, as shown in the following figure. [1]

When the cache is accessed, the set index field of the address, the log2(S) consecutive bits starting from bit log2(B), is used to locate a cache set. The remaining high-order bits are used as a tag for each cache line. 

After locating the cache set, the tag field of the address is matched against the tags of the W lines in the set to determine whether the access is a cache hit. 
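
A small sketch of this address decomposition (the values B = 64 and S = 1024 are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6    /* log2(64):   line-offset bits */
    #define SET_BITS 10    /* log2(1024): set-index bits   */

    int main(void) {
        uint64_t addr = 0x7f1234567890;
        uint64_t offset = addr & ((1ull << LINE_BITS) - 1);
        uint64_t set = (addr >> LINE_BITS) & ((1ull << SET_BITS) - 1);
        uint64_t tag = addr >> (LINE_BITS + SET_BITS);
        printf("tag=0x%llx set=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);
        return 0;
    }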


Cache Replacement


As memory is much larger than the cache, more than W memory lines may map to the same cache set, potentially resulting in cache contention. If an access misses in the cache, and all lines of the matching set are in use, one cache line must be evicted to free a slot for the new cache line being fetched from the next level of cache, or from main memory in the case of the LLC. The cache's replacement policy determines the line to evict. Typical replacement policies are approximations of least-recently-used (LRU). [1]
Traditional Cache



Per-Core Slice Cache

[1] Modern Intel processors, starting with the Sandy Bridge microarchitecture, use a more complex architecture for the LLC, to improve its performance.

The LLC is divided into per-core slices, which are connected by a ring bus. Slices can be accessed concurrently and are effectively separate caches, although the bus ensures that each core can access the full LLC (with higher latency for remote slices).

Sliced Cache

Ring bus architecture and sliced LLC




Reference



[1] Last-Level Cache Side-Channel Attacks are Practical, by Fangfei Liu et al, in Security&Privacy 2015

Virtual Address Space in Virtualized Environments

Page

Reference [1] 
  • A process executes in its private virtual address space, composed of pages, each representing a contiguous range of addresses.
  • The typical page size is 4KB
  • Each page is mapped to an arbitrary frame in physical memory


Address Mapping

Reference [1]

In virtualized environments there are two levels of address-space virtualization. 

1) the virtual addresses of a process -> guest's notion of physical address, i.e., the VM's emulated physical memory

2) guest physical addresses -> physical addresses of the processor


Page Translation

Reference [1]

Translations from virtual pages to physical frames are stored in page tables. 

Processors cache recently used page table entries in the translation lookaside buffer (TLB).

The TLB is a scarce processor resource with a small number of entries. 

Large pages use the TLB more efficiently, since fewer entries are needed to map a particular region of memory. As a result, the performance of applications with large memory footprints, such as Oracle databases or high-performance computing applications, can benefit from using large pages. 

For the same reason, VMMs, such as VMware ESX and Xen HVM, also use large pages for mapping guest physical memory. 



Reference

[1] Last-Level Cache Side-Channel Attacks are Practical, by Fangfei Liu et al, in Security&Privacy 2015

Covert Channel Attack

Introduction

[1] Covert channels and side channels are two types of information leakage channels. A covert channel uses mechanisms that are not intended for communications, e.g., writing and checking if a file is locked to convey a "1" or "0". 

In a covert channel, an insider process leaks information to an outsider process not normally allowed to access that information. The insider (sending) process could be a Trojan horse program previously inserted stealthily into the computer. An outsider (receiver) process needs only to be an unprivileged process. 


Reference

[1] Covert and Side Channels due to Processor Architecture, by Zhenghong Wang and Ruby B. Lee, in ACSAC 2006

Tuesday, January 12, 2016

Network Topology in the Cloud -- ToR

Top of Rack





[1] demonstrates that most of the topology in the Amazon cloud follows the ToR model.
  • All servers in a rack are first connected to a separate Top of Rack (ToR) switch, and then the ToR switch is connected to aggregation switches. 
  • Such a topology has currently become a mainstream network topology in data centers.







Reference

[1] A Measurement Study on Co-residence Threat inside the Cloud, by Zhang Xu, Haining Wang and Zhenyu Wu, in USENIX Security 2015

Virtual Private Cloud (VPC)

VPC Introduction

[1] VPC is a logically isolated networking environment with a separate private IP space and routing configuration. 




Characteristics

  • After creating a VPC, a customer can launch instances into the VPC, instead of the large EC2 network pool. 
  • The customer can also divide a VPC into multiple subnets, where each subnet can have a preferred availability zone to place instances.
  • The private IP address of an instance in a VPC is only known to its owner. It cannot be detected by other users. Thus, it significantly reduces the threat of co-residence. 



Reference

[1] A Measurement Study on Co-residence Threat inside the Cloud, by Zhang Xu, Haining Wang and Zhenyu Wu, in USENIX Security 2015

IP Scan

Tools

  • ZMap

Ports

  • FTP: 20,21
  • SSH: 22
  • Telnet: 23
  • SMTP: 25, 587
  • WHOIS: 43
  • DNS: 53
  • DHCP: 67, 68
  • Finger Protocol: 79
  • HTTP: 80
  • SQL: 118
  • HTTPS: 443
  • MySQL: 3306

Sunday, January 10, 2016

Transactions and Replications

One Copy Serializability

  • Replicated transactional service
    • Each replica manager provides concurrency control and recovery of its own data items in the same way as it would for non-replicated data
  • Effects of transactions performed by various clients on replicated data items are the same as if they had been performed one at a time on a single data item
  • Additional complications
    • Failures should be serialized w.r.t. transactions, i.e., any failure observed by a transaction must appear to have happened before a transaction started

Replication Scheme

  • Primary copy
    • All client requests are directed to a single primary server
  • Read one -- write all
    • cannot handle network partitions
    • Each write operation sets a write lock to each replica manager
    • Each read sets a read lock at one replica manager
  • Schemes that can handle network partitions
    • Available copies with validation
    • Quorum consensus
    • Virtual Partition

Available Copies Replication

  • Can handle the case when some replica managers are unavailable, because they have failed or due to communication failure
  • Reads can be performed by any available replica manager, but writes must be performed by all available replica managers
  • Normal case is like read one/write all
    • As long as the set of available replica managers does not change during a transaction
  • RM failure case
    • One copy serializability requires that failures and recovery be serialized w.r.t transactions
    • This is not achieved when different transactions make conflicting failure observations
    • Examples show that local concurrency control is not enough
    • Additional concurrency control procedure (called local validation) has to be performed to ensure correctness
  • Available copies with local validation assume no network partition
    • i.e., functioning replica managers can communicate with one another.
  • Local validation
    • Before a transaction commits it checks for failures (and recoveries) of replica managers of data items it has accessed

Handling Network Partitions

  • Network partitions separate replica managers into two or more subgroups, in such a way that the members of a group can communicate with one another but members of different subgroups cannot communicate
  • Optimistic approaches
    • Available copies with validation
  • Pessimistic approaches
    • Quorum consensus

Available Copies with Validation

  • Available copies algorithm applied within each partition
    • Maintains availability for read operations
  • When partition is repaired, possibly conflicting transactions in separate partitions are validated
    • The effects of a committed transaction that is now aborted on validation will have to be undone
      • Only feasible for applications where such compensating actions can be taken
  • Validation
    • Versions vector (write-write conflicts)
    • Precedence graphs (each partition maintains a log of data items affected by the Read and Write operations of transactions)
    • Log used to construct a precedence graph whose nodes are transactions and whose edges represent conflicts between Read and Write operations
      • No cycle in graph corresponding to each partition
    • If there are cycles in the graph, validation fails.

Quorum Consensus

  • A quorum is a subgroup of replica managers whose size gives it the right to carry out operations
  • Majority voting is one instance of a quorum consensus scheme (see the sketch below)
    • R + W > total number of votes in the group
    • W > half of the total votes
    • Ensures that each read quorum intersects a write quorum, and that two write quorums will intersect
  • Each replica has a version number that is used to detect whether the replica is up to date. 
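
A tiny sketch of the two quorum rules (the vote counts in main are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    /* Check the two intersection rules for read/write quorum sizes. */
    static bool valid_quorums(int r, int w, int total_votes) {
        return (r + w > total_votes)    /* every read quorum meets a write quorum */
            && (2 * w > total_votes);   /* any two write quorums intersect */
    }

    int main(void) {
        printf("%d\n", valid_quorums(2, 2, 3));   /* 1: R=2, W=2 of 3 is valid */
        printf("%d\n", valid_quorums(1, 1, 3));   /* 0: quorums too small */
        return 0;
    }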

Virtual Partition Scheme

  • Combines available copies and quorum consensus
  • Virtual partition = set of replica managers that have a read and write quorum
  • If a virtual partition can be formed, available copies is used
    • Improve performance of Reads
  • If a failure occurs, and virtual partition changes during a transaction, it is aborted
  • Have to ensure virtual partitions do not overlap.


CAP Conjecture

  • Is it possible to achieve consistency, availability, and partition tolerance?
  • Classic distributed systems: focused on ACID semantics
    • Atomic
    • Consistent
    • Isolated
    • Durable
  • Modern internet system: focused on BASE
    • Basically Available
    • Soft-state (or scalable)
    • Eventually consistent
  • ACID v.s. BASE

Why the Divide

  • What goals might you want for a shared-data system
    • consistency
    • availability
    • partition tolerance
  • Strong consistency
    • all clients see the same view, even in the presence of updates
  • High availability
    • All clients can find some replica of the data, even in the presence of failures
  • Partition-tolerance
    • The system properties hold even when the system is partitioned. 

CAP Conjecture

  • You can only have two out of these three properties
  • The choice of which feature to discard determines the nature of your system. 

Consistency and Availability

  • Comment
    • Providing transaction semantics requires all nodes to be in contact with each other
  • Example
    • Single-site clustered databases
  • Typical features
    • Two-phase commit
    • Cache invalidation protocol
    • Classic distributed system style

Consistency and Partition Tolerance

  • Comment
    • If one is willing to tolerate system-wide blocking, then one can provide consistency even when there are temporary partitions
  • Examples
    • Distributed databases
    • Distributed locking
    • Quorum (majority) protocols
  • Typical features
    • Pessimistic locking
    • Minority partitions unavailable
    • Also common distributed system
      • voting vs. primary replicas

Partition-Tolerance and Availability

  • Comment
    • sacrifice consistency
  • Examples
    • DNS
    • Web cache
    • Coda
    • Bayou
  • Typical features
    • TTLs and lease cache management
    • Optimistic updating with conflict resolution

Techniques

  • Expiration-based caching: AP
  • Quorum/majority algorithm: PC
  • Two phase commit: AC






Saturday, January 9, 2016

Cache-based Covert Channel

Background

Cache-based Covert Channel 

[1] Cache load measurements create very effective covert channels between cooperating processes running in different VMs. 

In practice, this is not a major threat for current deployments, since in most cases the cooperating processes can simply talk to each other over a network. 

However, covert channels become significant when communication is forbidden by information flow control (IFC) mechanisms such as sandboxing and IFC kernels. The latter is a promising emerging approach to improving security (e.g., web-server functionality). 

[1] explains more on the covert-channel in Section 8.1.


Reference

[1]  Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds, by Restenpart, T. et al., in CCS09

How to Measure Cache Usage


Prime + Trigger + Probe


[1] demonstrates how to utilize the Prime + Probe technique to measure the cache activity, and extend it to the following Prime + Trigger + Probe measurement to support the setting of time-shared virtual machines.

  • The probing instance first allocates a contiguous buffer B of b bytes
    • Here b should be large enough that a significant portion of the cache is filled by B.
  • Let s be the cache line size, in bytes.
  • Then the probing instance performs the following steps to generate each load sample
    1. Prime: Read B at s-byte offsets in order to ensure it is cached. 
    2. Trigger: Busy-loop until the CPU's cycle counter jumps by a large value
      • This means our VM was preempted by the Xen scheduler, hopefully in favor of the sender VM. 
    3. Probe: Measure the time it takes to again read B at s-byte offsets. 
When reading b/s memory locations in B, we use a pseudorandom order, and the pointer-chasing technique to prevent the CPU's hardware prefetcher from hiding the latencies. 

The time of the final step's read is the load sample, measured in number of CPU cycles. These load samples will be strongly correlated with use of the cache during the trigger step, since that usage will evict some portion of the buffer and thereby drive up the read time during the probe phase. 
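
A rough sketch of the prime and probe steps (x86 with GCC assumed; __builtin_ia32_rdtsc reads the cycle counter; the trigger step, pseudorandom access order, and pointer chasing from the paper are omitted for brevity):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define S 64                    /* assumed cache line size, in bytes */
    #define B (4 * 1024 * 1024)     /* buffer big enough to cover the cache */

    int main(void) {
        volatile char *buf = malloc(B);

        for (size_t i = 0; i < B; i += S)   /* prime: pull the buffer into cache */
            (void)buf[i];

        /* ... trigger: busy-loop until preempted in favor of the sender VM ... */

        uint64_t t0 = __builtin_ia32_rdtsc();
        for (size_t i = 0; i < B; i += S)   /* probe: time the re-read */
            (void)buf[i];
        uint64_t load_sample = __builtin_ia32_rdtsc() - t0;

        printf("load sample: %llu cycles\n", (unsigned long long)load_sample);
        free((void *)buf);
        return 0;
    }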



Reference 


[1] Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds, by Restenpart, T. et al., in CCS09

Threat of Covert Channel Attacks

Summary

  • Attackers can build various side channels to circumvent the logical isolation in cloud physical machines, and obtain sensitive information from co-resident VMs
    • Coarse-grained, e.g., workloads and web traffic rates
      • Since the cache utilization rate has a large impact on the execution time of the cache read operation, attackers can infer the victim's cache usage and workload information, by applying the Prime+Probe technique. 
      • Similarly, they can estimate the victim's web traffic rate, which also has a strong correlation with the execution time of cache operations. [2]
      • [1] demonstrates a clear correlation between a victim's web traffic rate and the load sample. 
    • Fine-grained, e.g., cryptographic keys
      • Attackers can exploit shared hardware resources, such as the instruction cache, to extract cryptographic keys. Specifically, the following challenges are overcome
        • Dealing with core migrations and determining if an observation is associated with the victim
        • Filtering out hardware and software noise, and regaining access to the target CPU core with sufficient frequency
  • For clever attackers, even seemingly innocuous information like workload statistics can be useful.
    • For example, such data can be used to identify when the system is most vulnerable, i.e., the time to launch further attacks, such as Denial of Service attacks. [9]






Reference
[2] Using Virtual Machine Allocation Policies to Defend against Co-resident Attacks in Cloud Computing, by Yi Han et al, in Transactions on Dependable and Secure Computing

Web Applications Client & Server





Figure 1 [1] illustrates the web application architecture on the server side and the client side.

Server Side

  • Logic Layer
    • Implements the application business logic using high-level programming languages, such as Java, PHP, or Python. 
  • Web Server Layer
    • Receives HTTP requests and passes each request to the appropriate server-side program, e.g., Apache web server, Windows IIS, or Nginx.
  • Data Storage Layer
    • Stores the web application state and user data. Popular data storage systems are traditional SQL databases, which include MySQL, PostgreSQL, and MSSQL
  • Infrastructure Layer
    • Runs the operating system. An infrastructure could be a physical machine or a virtualization platform which manages multiple virtual machines. 

Client Side

The client side receives the HTTP response from the server side, and the job of the client is to convert the HTML contained in the HTTP response into a graphical interface for the user. 


  • Logic Layer (Presentation Layer)
    • It is written in a combination of HTML, CSS, and JavaScript, with JavaScript providing a way for the server-side code to execute application logic on the client

  • Browser 
    • Retrieves the presentation layer code from the server, interprets it, and presents it as a graphic interface to the user. 

  • Storage Layer
    • For the presentation layer code to store data. Available storage methods include cookies, local storage, IndexedDB, and File APIs.
  • Operating System Layer
    • Runs the browser

Reference
[1] Toward a Moving Target Defense for Web Applications, by Marthony Taguinod, in International Conference on Information Reuse and Integration 2015

Web Application Moving Target Defense

Motivation

  • Web applications remain the most popular way for businesses to provide services over the Internet. They are the "front door" of many companies, and therefore their security is of paramount importance.
  • Example [1]
    • JPMorgan Chase breach in 2014 affected 76 million US households
    • Bloomberg reported that the hackers exploited an overlooked flaw in one of the bank's websites.
  • The efforts in discovering and fixing vulnerabilities are not enough to protect web applications for many reasons [1]
    • The increasing complexity of modern web applications brings inevitable risks that cannot be fully mitigated in the process of web application development and deployment
    • Attackers can take their time, to understand the web application's functionality and technology stack, before launching an attack

Objective

  • Ref [1]
  • The key issue is to design MTD mechanisms that 
    • Prevent or disrupt a vulnerability or exploit
    • While still providing the identical functionality



Issues

  • Choose what component to move in a web application
  • Decide the optimal time and how often to move components


Proposals

  • Change the server-side language used in a web application 
    • automatically translating server-side web application code to another language in order to prevent Code Injection exploits
  • Shift the database used in a web application
    • transform the backend SQL database into different implementations that speak different dialects in order to prevent SQL Injection exploits.




Reference

[1] Toward a Moving Target Defense for Web Applications, by Marthony Taguinod et al., in International Conference on Information Reuse and Integration 2015

Friday, January 8, 2016

Fault Tolerance

Definitions

  • Availability
    • Probability the system operates correctly at any given moment
  • Reliability
    • Ability to run correctly for a long interval of time
  • Safety
    • Failure to operate correctly does not lead to catastrophic failures
  • Maintainability
    • Ability to "easily" repair a failed system

Failure Models

  • Different types of failures



Two-Army Problem



Byzantine Agreement 

  • [Lamport et al. (1982)]
  • Goal
    • Each process learn the true values sent by correct processes
  • Assumption
    • Every message that is sent is delivered correctly
    • The receiver knows who sent the message
    • Message delivery time is bounded
  • Byzantine Agreement Result
    • In a system with m faulty processes, agreement can be achieved only if 2m+1 processes are functioning correctly.
    • Note
      • This result only guarantees that each process receives the true values sent by correct processes, but it does not identify the correct processes!


Byzantine General Problem

  • Phase 1: Generals announce their troop strengths to each other




  • Phase 2: Each general constructs a vector with all troop strengths


  • Phase 3: Generals send their vectors to each other and compute the majority vote

Reliable Group Communication

  • Reliable Multicast
    • All nonfaulty processes which do not join/leave during communication receive the message

  • Atomic Multicast
    • All messages are delivered in the same order to all processes

View Delivery

  • A view reflects current membership of group
  • A view is delivered when a membership change occurs and the application is notified of the change
    • Receiving a view is different from delivering a view
      • All members have to agree to the delivery of a view
  • View synchronous group communication
    • The delivery of a new view draws a conceptual line across the system and every message is either delivered on one side or the other of that line


Atomic Multicast

  • All messages are delivered in the same order to "all" processes
  • Group view
    • The set of processes known by the sender when it multicasts the message
  • Virtual synchronous multicast
    • A message multicast to a group view G is delivered to all nonfaulty processes in G
      • If the sender fails after sending the message, the message may be delivered to no one

Virtual Synchrony Implementation
  • Only stable messages are delivered
  • Stable message
    • A message received by all processes in the message's group view
  • Assumptions (can be ensured by using TCP)
    • Point-to-point communication is reliable
    • Point-to-point communication ensures FIFO-ordering


Message Ordering

  • Total ordering does not imply causality or FIFO!




[Hackerrank] Grid Challenge

Problem

Given a square grid G of size N in which each cell contains a lowercase letter. Denote the character in the ith row and jth column as G[i][j].
You can perform one operation as many times as you like: swap two column-adjacent characters in the same row, G[i][j] and G[i][j+1], for any valid i, j.
Is it possible to rearrange the grid such that the following condition is true?
G[i][1] <= G[i][2] <= ... <= G[i][N] for 1 <= i <= N, and 
G[1][j] <= G[2][j] <= ... <= G[N][j] for 1 <= j <= N
In other words, is it possible to rearrange the grid such that every row and every column is lexicographically sorted?
Note: c1 <= c2 if letter c1 is equal to c2 or comes before c2 in the alphabet.
Input Format
The first line contains T, the number of testcases. In each testcase you will be given N. The following N lines contain N lowercase English letters each, describing the grid.
Output Format
Print T lines. On the ith line print YES if it is possible to rearrange the grid in the ith testcase or NO otherwise.
Constraints 
1 <= T <= 100 
1 <= N <= 100 
G[i][j] will be a lowercase letter
Sample Input
1
5
ebacd
fghij
olmkn
trpqs
xywuv
Sample Output
YES



Analysis


  • Sort each row
  • Then check each column to verify that the letters are in non-decreasing order

Code
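
The code block is missing from the original post; a straightforward sketch of the approach above, reading the input format described earlier:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_char(const void *a, const void *b) {
        return *(const char *)a - *(const char *)b;
    }

    int main(void) {
        int t;
        scanf("%d", &t);
        while (t--) {
            int n;
            scanf("%d", &n);
            char grid[100][101];
            for (int i = 0; i < n; i++) {
                scanf("%100s", grid[i]);
                qsort(grid[i], n, 1, cmp_char);   /* sort each row */
            }
            int ok = 1;                           /* check each column */
            for (int j = 0; j < n && ok; j++)
                for (int i = 1; i < n && ok; i++)
                    if (grid[i][j] < grid[i - 1][j])
                        ok = 0;
            puts(ok ? "YES" : "NO");
        }
        return 0;
    }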