Dan Tsafrir
Dept. of Computer Science
Technion - Israel Institute of Technology
 

2023

  • ShRing: networking with shared receive rings
    Boris Pismenny, Adam Morrison, Dan Tsafrir
    OSDI '23: USENIX Symposium on Operating Systems Design and Implementation
    July, 2023, Boston, MA, pages 949–968
    Abstract, BibTeX, PDF, Definitive
    Multicore systems parallelize to accommodate incoming Ethernet traffic, allocating one receive (Rx) ring with ≥1Ki entries per core by default. This ring size is sufficient to absorb packet bursts of single-core workloads. But the combined size of all Rx buffers (pointed to by all Rx rings) can exceed the size of the last-level cache. We observe that, in this case, NIC and CPU memory accesses are increasingly served by main memory, which might incur nonnegligible overheads when scaling to hundreds of incoming gigabits per second.

    To alleviate this problem, we propose "shRing," which shares each Rx ring among several cores when networking memory bandwidth consumption is high. ShRing thus adds software synchronization costs, but this overhead is offset by the smaller memory footprint. We show that, consequently, shRing increases the throughput of NFV workloads by up to 1.27x, and that it reduces their latency by up to 38x. The substantial latency reduction occurs when shRing shortens the per-packet processing time to a value smaller than the packet interarrival time, thereby preventing overload conditions.
    @InProceedings{pismenny23-shring,
    author = {Boris Pismenny and Adam Morrison and Dan Tsafrir},
    title = {{ShRing}: networking with shared receive rings},
    booktitle = {USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
    year = 2023,
    month = {July},
    pages = {949--968},
    address = {Boston, MA}
    }
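
    A minimal C sketch of shRing's central trade-off, assuming a simplified ring interface (all names are illustrative, not the paper's implementation): several cores claim completions from one shared Rx ring, paying an atomic operation per packet in exchange for a smaller combined Rx-buffer footprint.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                  /* entries (power of two) */

    struct rx_desc { void *buf; size_t len; };

    struct shared_rx_ring {
        struct rx_desc desc[RING_SIZE];
        atomic_uint head;                   /* next entry to consume (shared)  */
        atomic_uint tail;                   /* last entry completed by the NIC */
    };

    /* Any core may pop; the CAS loop replaces per-core ring ownership.
     * (Producer wrap-around and buffer recycling are omitted.) */
    static bool shring_pop(struct shared_rx_ring *r, struct rx_desc *out)
    {
        unsigned h = atomic_load(&r->head);
        do {
            if (h == atomic_load(&r->tail))
                return false;               /* no completed packets */
        } while (!atomic_compare_exchange_weak(&r->head, &h, h + 1));
        *out = r->desc[h % RING_SIZE];
        return true;
    }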

  • Degrading data to save the planet
    Aviad Zuck, Donald E. Porter, Dan Tsafrir
    HotOS '23: ACM Workshop on Hot Topics in Operating Systems
    June, 2023, Providence, RI, pages 61–69
    Abstract, BibTeX, PDF, Definitive
    Storage capacity demand is projected to grow exponentially in the coming decade and so will its contribution to the overall carbon footprint of computing devices. In recent years, cloud providers and device vendors have substantially reduced their carbon impact through improved power consumption and product distribution. However, by 2030, the manufacturing of flash-based storage devices will account for 1.7% of carbon emissions in the world. Therefore, reducing production-related carbon emissions of storage is key to sustainability in computing devices.

    We present Sustainability-Oriented Storage (SOS), a new host-device co-design for personal storage devices, which opportunistically improves storage sustainability by: (1) targeting widely-produced flash-based personal storage devices; (2) reducing hardware production through optimizing bit density in existing materials by up to 50%; and (3) extending the effective lifespan of personal devices and of their underlying flash.

    SOS automatically stores low-priority files, which occupy most of the capacity of personal storage devices, on high-density flash memories currently designated for nearline storage. To avoid data loss, low-priority files are allowed to slightly degrade in quality over time. Switching to high-density memories, which maximize production material utilization, reduces the overall carbon footprint of personal storage devices.
    @InProceedings{zuck23-ssdcarbon,
    author = {Aviad Zuck and Donald E. Porter and Dan Tsafrir},
    title = {Degrading data to save the planet},
    booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
    year = 2023,
    month = {June},
    pages = {61--69},
    address = {Providence, RI}
    }

2022

  • Optimizing storage performance with calibrated interrupts
    Amy Tai, Igor Smolyar, Michael Wei, Dan Tsafrir
    TOS: ACM Transactions on Storage
    February, 2022, pages 3:1–3:32, volume 18, number 1
    Abstract, BibTeX, PDF, Definitive
    After request completion, an I/O device must decide whether to minimize latency by immediately firing an interrupt or to optimize for throughput by delaying the interrupt, anticipating that more requests will complete soon and help amortize the interrupt cost. Devices employ adaptive interrupt coalescing heuristics that try to balance between these opposing goals. Unfortunately, because devices lack the semantic information about which I/O requests are latency-sensitive, these heuristics can sometimes lead to disastrous results.

    Instead, we propose addressing the root cause of the heuristics problem by allowing software to explicitly specify to the device if submitted requests are latency-sensitive. The device then “calibrates” its interrupts to completions of latency-sensitive requests. We focus on NVMe storage devices and show that it is natural to express these semantics in the kernel and the application and only requires a modest two-bit change to the device interface. Calibrated interrupts increase throughput by up to 35%, reduce CPU consumption by as much as 30%, and achieve up to 37% lower latency when interrupts are coalesced.
    @Article{tai22-cintj,
    author = {Amy Tai and Igor Smolyar and Michael Wei and Dan Tsafrir},
    title = {Optimizing storage performance with calibrated interrupts},
    journal = {ACM Transactions on Storage (TOS)},
    volume = 18,
    number = 1,
    year = 2022,
    month = {February},
    pages = {3:1--3:32}
    }
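
    A toy C model of the calibrated-interrupts decision, under the assumption of a simplified device (names are hypothetical): submissions carry a latency-sensitivity hint—the "modest two-bit change" mentioned above—and the device fires an interrupt eagerly only for hinted completions, otherwise coalescing as usual.

    #include <stdbool.h>

    enum cint_hint { CINT_NONE, CINT_URGENT };

    struct completion { int req_id; enum cint_hint hint; };

    /* Device-side logic, evaluated per completion. */
    static bool should_fire_interrupt(struct completion c,
                                      int *pending, int coalesce_max)
    {
        (*pending)++;
        if (c.hint == CINT_URGENT || *pending >= coalesce_max) {
            *pending = 0;               /* one interrupt covers the batch */
            return true;
        }
        return false;                   /* keep delaying; amortize cost */
    }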

  • The benefits of general-purpose on-NIC memory
    Boris Pismenny, Liran Liss, Adam Morrison, Dan Tsafrir
    ASPLOS '22: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    February, 2022, Lausanne, Switzerland, pages 1130–1147
    Abstract, BibTeX, PDF, Definitive
    We propose to use the small, newly available on-NIC memory (“nicmem”) to keep pace with the rapidly increasing performance of NICs. We motivate our proposal by accelerating two types of workload classes: NFV and key-value stores. As NFV workloads frequently operate on headers—rather than data—of incoming packets, we introduce a new packet-processing architecture that splits between the two, keeping the data on nicmem when possible and thus reducing PCIe traffic, memory bandwidth, and CPU processing time. Our approach consequently shortens NFV latency by up to 23% and increases its throughput by up to 19%. Similarly, because key-value stores commonly exhibit skewed distributions, we introduce a new network stack mechanism that lets applications keep frequently accessed items on nicmem. Our design shortens memcached latency by up to 43% and increases its throughput by up to 80%.
    @InProceedings{pismenny22-nicmem,
    author = {Boris Pismenny and Liran Liss and Adam Morrison and Dan Tsafrir},
    title = {The benefits of general-purpose {on-NIC} memory},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2022,
    month = {February},
    pages = {1130--1147},
    address = {Lausanne, Switzerland}
    }
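
    A hypothetical C sketch of the header/data split described above; the two device hooks are made-up stand-ins for driver/NIC functionality, not an actual API.

    #include <stdint.h>

    struct rx_meta {
        uint8_t  hdr[128];       /* headers, DMAed to host memory  */
        uint32_t nicmem_off;     /* payload's offset within nicmem */
        uint32_t data_len;
    };

    void rewrite_headers(uint8_t *hdr);                  /* NFV logic (stub) */
    void nic_tx_split(const uint8_t *hdr, uint32_t hlen, /* device TX (stub) */
                      uint32_t nicmem_off, uint32_t dlen);

    /* Fast path: edit headers in host memory, transmit the payload by
     * reference; the data never traverses PCIe or host DRAM. */
    static void nfv_forward(struct rx_meta *m)
    {
        rewrite_headers(m->hdr);
        nic_tx_split(m->hdr, sizeof m->hdr, m->nicmem_off, m->data_len);
    }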

2021

  • Rowhammering storage devices
    Tao Zhang, Boris Pismenny, Donald E. Porter, Dan Tsafrir, Aviad Zuck
    HotStorage '21: ACM Workshop on Hot Topics In Storage and File Systems
    July, 2021, Virtual event
    Abstract, BibTeX, PDF, Definitive
    Peripheral devices like SSDs are growing more complex, to the point they are effectively small computers themselves. Our position is that this trend creates a new kind of attack vector, where untrusted software could use peripherals strictly as intended to accomplish unintended goals. To exemplify, we set out to rowhammer the DRAM component of a simplified SSD firmware, issuing regular I/O requests that manage to flip bits in a way that triggers sensitive information leakage. We conclude that such attacks might soon be feasible, and we argue that systems need principled approaches for securing peripherals against them.
    @InProceedings{zhang21-rowhssd,
    author = {Tao Zhang and Boris Pismenny and Donald E. Porter and Dan Tsafrir and Aviad Zuck},
    title = {Rowhammering storage devices},
    booktitle = {ACM Workshop on Hot Topics In Storage and File Systems (HotStorage)},
    year = 2021,
    month = {July},
    address = {Virtual event}
    }

  • Optimizing storage I/O with calibrated interrupts
    Amy Tai, Igor Smolyar, Michael Wei, Dan Tsafrir
    OSDI '21: USENIX Symposium on Operating Systems Design and Implementation
    July, 2021, pages 129–145
    Abstract, BibTeX, PDF, Definitive
    After request completion, an I/O device must decide either to minimize latency by immediately firing an interrupt or to optimize for throughput by delaying the interrupt, anticipating that more requests will complete soon and help amortize the interrupt cost. Devices employ adaptive interrupt coalescing heuristics that try to balance between these opposing goals. Unfortunately, because devices lack the semantic information about which I/O requests are latency-sensitive, these heuristics can sometimes lead to disastrous results.

    Instead, we propose addressing the root cause of the heuristics problem by allowing software to explicitly specify to the device if submitted requests are latency-sensitive. The device then “calibrates” its interrupts to completions of latency-sensitive requests. We focus on NVMe storage devices and show that it is natural to express these semantics in the kernel and the application and only requires a modest two-bit change to the device interface. Calibrated interrupts increase throughput by up to 35%, reduce CPU consumption by as much as 30%, and achieve up to 37% lower latency when interrupts are coalesced.
    @InProceedings{tai21-cinter,
    author = {Amy Tai and Igor Smolyar and Michael Wei and Dan Tsafrir},
    title = {Optimizing storage {I/O} with calibrated interrupts},
    booktitle = {USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
    year = 2021,
    month = {July},
    pages = {129--145}
    }

  • Dealing with (some of) the fallout from meltdown
    Nadav Amit, Michael Wei, Dan Tsafrir
    SYSTOR '21: ACM International Conference on Systems and Storage
    June, 2021, Virtual event, pages 1–6
    Abstract, BibTeX, PDF, Definitive
    The meltdown vulnerability allows users to read kernel memory by exploiting a hardware flaw in speculative execution. Processor vendors recommend “page table isolation” (PTI) as a software fix, but PTI can significantly degrade the performance of system-call-heavy programs. Leveraging the fact that 32-bit pointers cannot access 64-bit kernel memory, we propose “Shrink,” a safe alternative to PTI, which is applicable to programs capable of running in 32-bit address spaces. We show that Shrink can restore the performance of some workloads, suggest additional potential alternatives, and argue that vendors must be more open about hardware flaws to allow developers to design protection schemes that are safe and performant.
    @InProceedings{amit21-melt,
    author = {Nadav Amit and Michael Wei and Dan Tsafrir},
    title = {Dealing with (some of) the fallout from meltdown},
    booktitle = {ACM International Conference on Systems and Storage (SYSTOR)},
    year = 2021,
    month = {June},
    pages = {1--6},
    address = {Virtual event}
    }

  • Characterizing, exploiting, and detecting DMA code injection vulnerabilities in the presence of an IOMMU
    Alex Markuze, Shay Vargaftik, Gil Kupfer, Boris Pismenny, Nadav Amit, Adam Morrison, Dan Tsafrir
    EuroSys '21: ACM European Conference on Computer Systems
    April, 2021, pages 395–409
    Abstract, BibTeX, PDF, Definitive
    Direct memory access (DMA) renders a system vulnerable to DMA attacks, in which I/O devices access memory regions not intended for their use. Hardware input-output memory management units (IOMMUs) can be used to provide protection. However, an IOMMU cannot prevent all DMA attacks because it only restricts DMA at page-level granularity, leading to sub-page vulnerabilities.

    Current DMA attacks rely on simple situations in which write access to a kernel pointer is obtained due to sub-page vulnerabilities and all other attack ingredients are available and reside on the same page. We show that DMA vulnerabilities are a deep-rooted issue and it is often the kernel design that enables complex and multistage DMA attacks. This work presents a structured top-down approach to characterize, exploit, and detect them.

    To this end, we first categorize sub-page vulnerabilities into four types, providing insight into the structure of DMA vulnerabilities. We then identify a set of three vulnerability attributes that are sufficient to execute code injection attacks.

    We built analysis tools that detect these sub-page vulnerabilities and analyze the Linux kernel. We found that 72% of the device drivers expose callback pointers, which may be overwritten by a device to hijack the kernel control flow.

    Aided by our tools' output, we demonstrate novel code injection attacks on the Linux kernel; we refer to these as compound attacks. All previously reported attacks are single-step, with the vulnerability attributes present in a single page. In compound attacks, the vulnerability attributes are initially incomplete. However, we demonstrate that they can be obtained by carefully exploiting standard OS behavior.
    @InProceedings{markuze21-ioatk,
    author = {Alex Markuze and Shay Vargaftik and Gil Kupfer and Boris Pismenny and Nadav Amit and Adam Morrison and Dan Tsafrir},
    title = {Characterizing, exploiting, and detecting {DMA} code injection vulnerabilities in the presence of an {IOMMU}},
    booktitle = {ACM European Conference on Computer Systems (EuroSys)},
    year = 2021,
    month = {April},
    pages = {395--409}
    }

  • Autonomous NIC offloads
    Boris Pismenny, Haggai Eran, Aviad Yehezkel, Liran Liss, Adam Morrison, Dan Tsafrir
    ASPLOS '21: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2021, Detroit, MI, pages 18–35, awarded Best Paper
    Abstract, BibTeX, PDF, Definitive
    CPUs routinely offload to NICs network-related processing tasks like packet segmentation and checksum. NIC offloads are advantageous because they free valuable CPU cycles. But their applicability is typically limited to layer-4 protocols (TCP and lower), and they are inapplicable to layer-5 protocols (L5Ps) that are built on top of TCP. This limitation is caused by a misfeature we call “offload dependence,” which dictates that L5P offloading additionally requires offloading the underlying layer-4 protocols and related functionality: TCP, IP, firewall, etc. The dependence of L5P offloading hinders innovation, because it implies hard-wiring the complicated, ever-changing implementation of the lower-level protocols.

    We propose “autonomous NIC offloads,” which eliminate offload dependence. Autonomous offloads provide a lightweight software-device architecture that accelerates L5Ps without having to migrate the entire layer-4 TCP/IP stack into the NIC. A main challenge that autonomous offloads address is coping with out-of-sequence packets. We implement autonomous offloads for two L5Ps: (i) NVMe-over-TCP zero-copy and CRC computation, and (ii) https authentication, encryption, and decryption. Our autonomous offloads increase throughput by up to 3.3x, and they deliver CPU consumption and latency that are as low as 0.4x and 0.7x, respectively. Their implementation is already upstreamed in the Linux kernel, and they will be supported in the next generation of Mellanox NICs.
    @InProceedings{pismenny21-l5o,
    author = {Boris Pismenny and Haggai Eran and Aviad Yehezkel and Liran Liss and Adam Morrison and Dan Tsafrir},
    title = {Autonomous {NIC} offloads},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2021,
    month = {April},
    pages = {18--35},
    address = {Detroit, MI}
    }
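
    A simplified C sketch of the autonomy idea, with illustrative names: the NIC accelerates the L5P only while TCP data arrives in sequence, and falls back to software until the stack resynchronizes it.

    #include <stdbool.h>
    #include <stdint.h>

    struct l5p_offload {
        uint32_t expected_seq;   /* next TCP sequence the NIC tracks */
        bool     in_sync;        /* may the NIC process packets?     */
    };

    /* Per-packet fast path on the NIC. */
    static bool offload_rx(struct l5p_offload *o, uint32_t seq, uint32_t len)
    {
        if (o->in_sync && seq == o->expected_seq) {
            o->expected_seq = seq + len;   /* stay autonomous */
            return true;
        }
        o->in_sync = false;   /* out of sequence: software takes over */
        return false;
    }

    /* Software later points the NIC at the next record boundary. */
    static void offload_resync(struct l5p_offload *o, uint32_t seq)
    {
        o->expected_seq = seq;
        o->in_sync = true;
    }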

2020

  • Predicting execution times with partial simulations in virtual memory research: why and how
    Mohammad Agbarya, Idan Yaniv, Jayneel Gandhi, Dan Tsafrir
    MICRO '20: IEEE/ACM International Symposium on Microarchitecture
    October, 2020, Global Online Event, pages 456–470
    Abstract, BibTeX, PDF, Definitive
    Computer architects frequently use cycle-accurate simulations, which incur heavy overheads. Recently, virtual memory studies increasingly employ a lighter-weight methodology that utilizes partial simulations—of only the memory subsystem—whose output is fed into a mathematical linear model that predicts execution runtimes. The latter methodology is much faster, but its accuracy is only assumed, never rigorously validated.

    We question the assumption and put it to the test by developing Mosalloc, the Mosaic Memory Allocator. Mosalloc backs the virtual memory of applications with arbitrary combinations of 4KB, 2MB, and 1GB pages (each combination forms a “mosaic” of pages). Previous studies used a single page size per execution (either 4KB or 2MB) to generate exactly two execution samples, which defined the aforementioned linear model. In contrast, Mosalloc can generate many samples, allowing us to test instead of assume the model's accuracy. We find that prediction errors of existing models can be as high as 25%–192%. We propose a new model that bounds the maximal error below 3%, making it more reliable and useful for exploring new ideas.
    @InProceedings{agbarya20-mosalloc,
    author = {Mohammad Agbarya and Idan Yaniv and Jayneel Gandhi and Dan Tsafrir},
    title = {Predicting execution times with partial simulations in virtual memory research: why and how},
    booktitle = {IEEE/ACM International Symposium on Microarchitecture (MICRO)},
    year = 2020,
    month = {October},
    pages = {456--470},
    address = {Global Online Event}
    }
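
    The linear model in question can be illustrated with a short, self-contained C program (all sample numbers below are made up): it fits runtime = a*walk_cycles + b by least squares and reports per-sample prediction error—the quantity that many samples, rather than two, make testable.

    #include <stdio.h>

    int main(void)
    {
        /* (walk_cycles, runtime) pairs, e.g., from different mosaics. */
        double x[] = { 1.0e9, 3.0e9, 5.0e9, 8.0e9 };
        double y[] = { 10.0,  12.1,  14.0,  17.2 };   /* seconds */
        int n = 4;

        /* Ordinary least squares for y = a*x + b. */
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i]; sxx += x[i]*x[i]; sxy += x[i]*y[i];
        }
        double a = (n*sxy - sx*sy) / (n*sxx - sx*sx);
        double b = (sy - a*sx) / n;

        /* Report the fitted model's per-sample prediction error. */
        for (int i = 0; i < n; i++) {
            double pred = a*x[i] + b;
            printf("sample %d: predicted %.2fs actual %.2fs (%.1f%% error)\n",
                   i, pred, y[i], 100.0*(pred - y[i])/y[i]);
        }
        return 0;
    }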

  • RAIDP: replication with intra-disk parity for cost-effective storage of warm data
    Eitan Rosenfeld, Aviad Zuck, Nadav Amit, Michael Factor, Dan Tsafrir
    EuroSys '20: ACM European Conference on Computer Systems
    April, 2020, Heraklion, Crete, Greece, pages 26.1–26.17
    Abstract, BibTeX, PDF, Definitive
    Distributed storage systems often triplicate data to reduce the risk of permanent data loss, thereby tolerating at least two simultaneous disk failures at the price of 2/3 of the capacity. To reduce this price, some systems utilize erasure coding. But this optimization is usually only applied to cold data, because erasure coding might hinder performance for warm data.

    We propose RAIDP—a new point in the distributed storage design space between replication and erasure coding. RAIDP maintains only two replicas, rather than three or more. It increases durability by utilizing small disk “add-ons” for storing intra-disk erasure codes that are local to the server but fail independently from the disk. By carefully laying out the data, the add-ons allow RAIDP to recover from simultaneous disk failures (add-ons can be stacked to withstand an arbitrary number of failures). RAIDP retains much of the benefits of replication, trading off some performance and availability for substantially reduced storage requirements, networking overheads, and their related costs. We implement RAIDP in HDFS, which triplicates by default. We show that baseline RAIDP achieves performance close to that of HDFS with only two replicas, and performs within 21% of it with an update-oriented variant, while halving the storage and networking overheads and providing similar durability.
    @InProceedings{rosenfeld20-raidp,
    author = {Eitan Rosenfeld and Aviad Zuck and Nadav Amit and Michael Factor and Dan Tsafrir},
    title = {{RAIDP}: replication with intra-disk parity for cost-effective storage of warm data},
    booktitle = {ACM European Conference on Computer Systems (EuroSys)},
    year = 2020,
    month = {April},
    pages = {26.1--26.17},
    address = {Heraklion, Crete, Greece}
    }
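
    A schematic C sketch of the add-on parity mechanism, assuming fixed-size chunks and one parity chunk per group (the layout and names are illustrative, not RAIDP's actual scheme).

    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK 4096
    #define GROUP 8                      /* chunks covered by one parity */

    /* Incrementally recompute a group's parity after chunk `old` is
     * overwritten by `new_` (classic RAID-style parity update). */
    static void addon_update_parity(uint8_t parity[CHUNK],
                                    const uint8_t old[CHUNK],
                                    const uint8_t new_[CHUNK])
    {
        for (size_t i = 0; i < CHUNK; i++)
            parity[i] ^= old[i] ^ new_[i];
    }

    /* Reconstruct a lost chunk from the GROUP-1 survivors plus parity. */
    static void addon_reconstruct(uint8_t out[CHUNK],
                                  const uint8_t *survivors[GROUP - 1],
                                  const uint8_t parity[CHUNK])
    {
        for (size_t i = 0; i < CHUNK; i++) {
            uint8_t v = parity[i];
            for (size_t c = 0; c < GROUP - 1; c++)
                v ^= survivors[c][i];
            out[i] = v;
        }
    }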

  • IOctopus: outsmarting nonuniform DMA
    Igor Smolyar, Alex Markuze, Boris Pismenny, Haggai Eran, Gerd Zellweger, Austin Bolen, Liran Liss, Adam Morrison, Dan Tsafrir
    ASPLOS '20: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2020, Lausanne, Switzerland, pages 101–115, awarded Best Paper
    Abstract, BibTeX, PDF, Definitive
    In a multi-CPU server, memory modules are local to the CPU to which they are connected, forming a nonuniform memory access (NUMA) architecture. Because non-local accesses are slower than local accesses, the NUMA architecture might degrade application performance. Similar slowdowns occur when an I/O device issues nonuniform DMA (NUDMA) operations, as the device is connected to memory via a single CPU. NUDMA effects therefore degrade application performance similarly to NUMA effects.

    We observe that the similarity is not inherent but rather a product of disregarding the intrinsic differences between I/O and CPU memory accesses. Whereas NUMA effects are inevitable, we show that NUDMA effects can and should be eliminated. We present IOctopus, a device architecture that makes NUDMA impossible by unifying multiple physical PCIe functions—one per CPU—in a manner that makes them appear as one, both to the system software and externally to the server. IOctopus requires only a modest change to the device driver and firmware. We implement it on existing hardware and demonstrate that it improves throughput and latency by as much as 2.7× and 1.28×, respectively, while freeing developers from the need to combat (what appeared to be) an unavoidable type of overhead.
    @InProceedings{smolyar20-oct,
    author = {Igor Smolyar and Alex Markuze and Boris Pismenny and Haggai Eran and Gerd Zellweger and Austin Bolen and Liran Liss and Adam Morrison and Dan Tsafrir},
    title = {{IOctopus}: outsmarting nonuniform {DMA}},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2020,
    month = {March},
    pages = {101--115},
    address = {Lausanne, Switzerland}
    }

2019

  • Refreshing ATC
    Dan Tsafrir, Dahlia Malkhi
    ATC '19: USENIX Annual Technical Conference
    July, 2019, Renton, WA, pages xi–xxviii
    BibTeX, PDF, Definitive
    @InProceedings{tsafrir19-refreshATC,
    author = {Dan Tsafrir and Dahlia Malkhi},
    title = {Refreshing {ATC}},
    booktitle = {USENIX Annual Technical Conference (ATC)},
    year = 2019,
    month = {July},
    pages = {xi--xxviii},
    address = {Renton, WA}
    }

  • Apps can quickly destroy your mobile's flash: why they don't, and how to keep it that way
    Tao Zhang, Aviad Zuck, Donald E. Porter, Dan Tsafrir
    MobiSys '19: ACM International Conference on Mobile Systems, Applications, and Services
    June, 2019, Seoul, South Korea, pages 207–221
    Abstract, BibTeX, PDF, Definitive
    Although flash cells wear out, a typical SSD has enough cells and sufficiently sophisticated firmware that its lifetime generally exceeds the expected lifetime of its host system. Even under heavy use, SSDs last for years and can be replaced upon failure. On a smartphone, in contrast, the hardware is more limited and we show that, under heavy use, one can easily, and more quickly, wear out smartphone flash storage. Consequently, a simple, unprivileged, malicious application can render a smartphone unbootable (“bricked”) in a few weeks with no warning signs to the user. This bleak result becomes more worrisome when considering the fact that smartphone users generally believe it is safe to try out new applications. To combat this problem, we study the I/O behavior of a wide range of Android applications. We find that high-volume write bursts exist, yet none of the applications we checked sustains an average write rate that is high enough to damage the device (under reasonable usage assumptions backed by the literature). We therefore propose a rate-limiting algorithm for write activity that (1) prevents such attacks, (2) accommodates “normal” bursts, and (3) ensures that the smartphone drive lifetime is longer than a preconfigured lower bound (i.e., its warranty). In terms of user experience, our design only requires that, in the worst case of an app that issues continuous, unsustainable, and unusual writes, the user decides whether to shorten the phone's life or rate limit the problematic app.
    @InProceedings{zhang19-fbrick,
    author = {Tao Zhang and Aviad Zuck and Donald E. Porter and Dan Tsafrir},
    title = {Apps can quickly destroy your mobile's flash: why they don't, and how to keep it that way},
    booktitle = {ACM International Conference on Mobile Systems, Applications, and Services (MobiSys)},
    year = 2019,
    month = {June},
    pages = {207--221},
    address = {Seoul, South Korea}
    }
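
    A hypothetical token-bucket sketch, in C, of the kind of rate limiter the paper proposes: the sustained rate is derived from the drive's warranty budget, while the bucket depth accommodates “normal” bursts. All parameters and names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    struct write_limiter {
        double tokens;    /* bytes currently allowed        */
        double rate;      /* sustained bytes/sec budget     */
        double burst;     /* bucket depth: tolerated burst  */
        double last;      /* time of last refill (seconds)  */
    };

    static bool limiter_admit(struct write_limiter *l, double now,
                              uint64_t bytes)
    {
        l->tokens += (now - l->last) * l->rate;   /* refill */
        if (l->tokens > l->burst)
            l->tokens = l->burst;
        l->last = now;
        if ((double)bytes > l->tokens)
            return false;     /* throttle (or ask the user to override) */
        l->tokens -= (double)bytes;
        return true;
    }

    /* Example budget: a 64 GiB drive rated for 3000 P/E cycles that must
     * last 3 years sustains roughly 64Gi * 3000 / (3*365*86400) seconds,
     * i.e., about 2.2 MB/s on average. */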

  • Storm: a fast transactional dataplane for remote data structures
    Stanko Novakovic, Yizhou Shan, Aasheesh Kolli, Michael Cui, Yiying Zhang, Haggai Eran, Boris Pismenny, Liran Liss, Michael Wei, Dan Tsafrir, Marcos Aguilera
    SYSTOR '19: ACM International Conference on Systems and Storage
    June, 2019, Haifa, Israel, pages 97–108, awarded Best Paper
    Abstract, BibTeX, PDF, Definitive
    RDMA technology enables a host to access the memory of a remote host without involving the remote CPU, improving the performance of distributed in-memory storage systems. Previous studies argued that RDMA suffers from scalability issues, because the NIC's limited resources are unable to simultaneously cache the state of all the concurrent network streams. These concerns led to various software-based proposals to reduce the size of this state by trading off performance. We revisit these proposals and show that they no longer apply when using newer RDMA NICs in rack-scale environments. In particular, we find that one-sided remote memory primitives lead to better performance as compared to the previously proposed unreliable datagram and kernel-based stacks. Based on this observation, we design and implement Storm, a transactional dataplane utilizing one-sided read and write-based RPC primitives. We show that Storm outperforms eRPC, FaRM, and LITE by 3.3x, 3.6x, and 17.1x, respectively, on an InfiniBand cluster with Mellanox ConnectX-4 NICs.
    @InProceedings{novakovic19-dc,
    author = {Stanko Novakovic and Yizhou Shan and Aasheesh Kolli and Michael Cui and Yiying Zhang and Haggai Eran and Boris Pismenny and Liran Liss and Michael Wei and Dan Tsafrir and Marcos Aguilera},
    title = {Storm: a fast transactional dataplane for remote data structures},
    booktitle = {ACM International Conference on Systems and Storage (SYSTOR)},
    year = 2019,
    month = {June},
    pages = {97--108},
    address = {Haifa, Israel}
    }

  • Why and how to increase SSD performance transparency
    Aviad Zuck, Tao Zhang, Philipp Guhring, Donald E. Porter, Dan Tsafrir
    HotOS '19: ACM Workshop on Hot Topics in Operating Systems
    May, 2019, Bertinoro, Italy, pages 192–200
    Abstract, BibTeX, PDF, Definitive
    Even on modern SSDs, I/O scheduling is a first-order performance concern. However, it is unclear how best to optimize I/O patterns for SSDs, because a complex layer of proprietary firmware hides many principal aspects of performance, as well as SSD lifetime. Losing this information leads to research papers drawing incorrect conclusions about prototype systems, as well as real-world systems realizing sub-optimal performance and lifetime. It is our position that a useful performance model of a foundational system component is essential, and the community should support efforts to construct models of SSD performance. We show examples from the literature and our own measurements that illustrate serious limitations of current SSD modeling tools and disk statistics. We observe an opportunity to resolve this problem by reverse engineering SSDs, leveraging recent trends toward component standardization within SSDs. This paper presents a feasibility study and initial results to reverse engineer a commercial SSD's firmware, and discusses limitations and open problems.
    @InProceedings{zuck19-frvrs,
    author = {Aviad Zuck and Tao Zhang and Philipp Guhring and Donald E. Porter and Dan Tsafrir},
    title = {Why and how to increase {SSD} performance transparency},
    booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
    year = 2019,
    month = {May},
    pages = {192--200},
    address = {Bertinoro, Italy}
    }

2018

  • DAMN: overhead-free IOMMU protection for networking
    Alex Markuze, Igor Smolyar, Adam Morrison, Dan Tsafrir
    ASPLOS '18: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2018, Williamsburg, VA, pages 301–315
    Abstract, BibTeX, PDF, Definitive
    DMA operations can access memory buffers only if they are “mapped” in the IOMMU, so operating systems protect themselves against malicious/errant network DMAs by mapping and unmapping each packet immediately before/after it is DMAed. This approach was recently found to be riskier and less performant than keeping packets non-DMAable and instead copying their content to/from permanently-mapped buffers. Still, the extra copy hampers performance of multi-gigabit networking.

    We observe that achieving protection at the DMA (un)map boundary is needlessly constraining, as devices must be prevented from changing the data only after the kernel reads it. So there is no real need to switch ownership of buffers between kernel and device at the DMA (un)mapping layer, as opposed to the approach taken by all existing IOMMU protection schemes. We thus eliminate the extra copy by (1) implementing a new allocator called DMA-Aware Malloc for Networking (DAMN), which (de)allocates packet buffers from a memory pool permanently mapped in the IOMMU; (2) modifying the network stack to use this allocator; and (3) copying packet data only when the kernel needs it, which usually morphs the aforementioned extra copy into the kernel's standard copy operation performed at the user-kernel boundary. DAMN thus provides full IOMMU protection with performance comparable to that of an unprotected system.
    @InProceedings{markuze18-damn,
    author = {Alex Markuze and Igor Smolyar and Adam Morrison and Dan Tsafrir},
    title = {{DAMN}: overhead-free {IOMMU} protection for networking},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2018,
    month = {March},
    pages = {301--315},
    address = {Williamsburg, VA}
    }
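
    A minimal C sketch of a DAMN-style pool, assuming the pool is IOMMU-mapped once at initialization (names and sizes are illustrative): allocation and freeing never touch the IOMMU, so the I/O path contains no (un)map calls.

    #include <stddef.h>

    #define POOL_BUFS 4096
    #define BUF_SIZE  2048

    static char  pool[POOL_BUFS][BUF_SIZE]; /* mapped in the IOMMU once */
    static void *freelist[POOL_BUFS];
    static int   top;

    /* Assumption: called at boot, after `pool` is mapped in the IOMMU. */
    static void damn_init(void)
    {
        for (int i = 0; i < POOL_BUFS; i++)
            freelist[top++] = pool[i];
    }

    static void *damn_alloc(void)     { return top ? freelist[--top] : NULL; }
    static void  damn_free(void *buf) { freelist[top++] = buf; }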

  • Stash in a flash
    Aviad Zuck, Yue Li, Jehoshua Bruck, Donald E. Porter, Dan Tsafrir
    FAST '18: USENIX Conference on File and Storage Technologies
    February, 2018, Oakland, CA, pages 169–188
    Abstract, BibTeX, PDF, Definitive
    Encryption is a useful tool to protect data confidentiality. Yet it is still challenging to hide the very presence of encrypted, secret data from a powerful adversary. This paper presents a new technique to hide data in flash by manipulating the voltage level of pseudo-randomly selected flash cells to encode two bits (rather than one) in the cell. In this model, one “public” bit is interpreted using an SLC-style encoding, and a private bit is extracted using an MLC-style encoding. The locations of the cells that encode hidden data are based on a secret key known only to the hiding user.

    Intuitively, this technique requires that (1) the voltage level in a cell encoding hidden data is not statistically distinguishable from that of a cell storing only public data, and (2) the user can reliably read the hidden data from this cell. Our key insight is that there is a wide enough variation in the range of voltage levels in a typical flash device to obscure the presence of fine-grained changes to a small fraction of the cells, and that the variation is wide enough to support reliably re-reading hidden data. We demonstrate that our hidden data and underlying voltage manipulations go undetected by supervised learning based on support vector machines, which performs similarly to a random guess. The error rates of our scheme are low enough that the data is recoverable months after being stored. Compared to prior work, our technique provides 24x and 50x higher encoding and decoding throughput, doubles the capacity, and is 37x more power efficient.
    @InProceedings{zuck18-flashdeny,
    author = {Aviad Zuck and Yue Li and Jehoshua Bruck and Donald E. Porter and Dan Tsafrir},
    title = {Stash in a flash},
    booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
    year = 2018,
    month = {February},
    pages = {169--188},
    address = {Oakland, CA}
    }
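
    A numeric C sketch of the two-bits-per-cell idea with made-up thresholds (real devices span many program levels): a coarse SLC-style threshold yields the public bit, and a finer MLC-style threshold within the same voltage range yields the hidden bit.

    #include <stdbool.h>

    #define V_SLC  2.0   /* coarse threshold separating the public levels  */
    #define V_FINE 1.0   /* finer threshold within each public level range */

    /* Public bit, SLC view: low voltage reads as 1 (erased), per the
     * usual flash convention. */
    static bool read_public(double v) { return v < V_SLC; }

    /* Hidden bit, MLC-style view: which half of the public level's
     * voltage range the cell was programmed into. */
    static bool read_hidden(double v)
    {
        double base = read_public(v) ? 0.0 : V_SLC;
        return (v - base) >= V_FINE;
    }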

2017

  • Preserving hidden data with an ever-changing disk
    Aviad Zuck, Udi Shriki, Donald E. Porter, Dan Tsafrir
    HotOS '17: ACM Workshop on Hot Topics in Operating Systems
    May, 2017, Whistler, Canada, pages 50–55
    Abstract, BibTeX, PDF, Definitive
    This paper presents a storage system that can hide the presence of hidden data alongside a larger volume of public data. Encryption allows a user to hide the contents of data, but not the fact that sensitive data is present. Under duress, the owner of high-value data can be coerced by a powerful adversary to disclose decryption keys. Thus, private users and corporations have an interest in hiding the very presence of some sensitive data, alongside a larger body of less sensitive data (e.g., the operating system and other benign files); this property is called plausible deniability. Existing plausible deniability systems do not fulfill all of the following requirements: (1) resistance to multiple snapshot attacks where an attacker compares the state of the device over time; (2) ensuring that hidden data won't be destroyed when the public volume is modified by a user unaware of the hidden data; and (3) disguising writes to secret data as normal system operations on public data.

    We explain why existing solutions do not meet all these requirements and present the Ever-Changing Disk (ECD), a generic scheme for plausible deniability storage systems that meets all of these requirements. An ECD stores hidden data inside a large volume of pseudorandom data. Portions of this volume are periodically migrated in a log-structured manner. Hidden writes can then be interchanged with normal firmware operations. The expected access patterns and time until hidden data is overwritten are completely predictable, and insensitive to whether data is hidden. Users control the rate of internal data migration (R), trading write bandwidth to hidden data for longevity of the hidden data. For a typical 2TB disk and setting of R, a user preserves hidden data by entering her secret key every few days or weeks.
    @InProceedings{zuck17-everchng,
    author = {Aviad Zuck and Udi Shriki and Donald E. Porter and Dan Tsafrir},
    title = {Preserving hidden data with an ever-changing disk},
    booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
    year = 2017,
    month = {May},
    pages = {50--55},
    address = {Whistler, Canada}
    }

  • Flash drive lifespan *is* a problem
    Tao Zhang, Aviad Zuck, Donald E. Porter, Dan Tsafrir
    HotOS '17: ACM Workshop on Hot Topics in Operating Systems
    May, 2017, Whistler, Canada, pages 42–49
    Abstract, BibTeX, PDF, Definitive
    When flash was introduced, wear-out was a known problem. Over time, a number of techniques have been developed to estimate the expected number of program/erase cycles under typical usage patterns, and sufficiently over-provision the cells such that the device meets its expected lifespan, even if individual cells fail. This paper started as a simple experiment: measuring whether the lifespan of flash devices in smartphones and other mobile devices matches the estimates. To our surprise, we find that, in a matter of days, simple, unprivileged applications can render the drive of several smartphones (and thus, the phone) inoperable. This result is concerning, as it means that installing malicious or poorly-written software could destroy the device itself. We experimentally demonstrate the problem, discuss reasons why it occurs, and consider potential solutions.
    @InProceedings{zhang17-flifespan,
    author = {Tao Zhang and Aviad Zuck and Donald E. Porter and Dan Tsafrir},
    title = {Flash drive lifespan *is* a problem},
    booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
    year = 2017,
    month = {May},
    pages = {42--49},
    address = {Whistler, Canada}
    }

  • Page fault support for network controllers
    Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro, Sagi Grimberg, Liran Liss, Muli Ben-Yehuda, Nadav Amit, Dan Tsafrir
    ASPLOS '17: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2017, Xi'an, China, pages 449–466
    Abstract, BibTeX, PDF, Definitive
    Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary. Direct I/O thus frees researchers from legacy software, fueling studies that innovate in multitenant setups. Such studies, however, overwhelmingly ignore one serious problem: direct memory accesses (DMAs) of NICs disallow page faults, forcing systems to either pin entire address spaces to physical memory and thereby hinder memory utilization, or resort to APIs that pin/unpin memory buffers before/after they are DMAed, which complicates the programming model and hampers performance.

    We solve this problem by designing and implementing page fault support for InfiniBand and Ethernet NICs. A main challenge we tackle—unique to NICs—is handling receive DMAs that trigger page faults, leaving the NIC without memory to store the incoming data. We demonstrate that our solution provides all the benefits associated with “regular” virtual memory, notably (1) a simpler programming model that rids users from the need to pin, and (2) the ability to employ all the canonical memory optimizations, such as memory overcommitment and demand-paging based on actual use. We show that, as a result, benchmark performance improves by up to 1.9x.
    @InProceedings{lesokhin17-npf,
    author = {Ilya Lesokhin and Haggai Eran and Shachar Raindel and Guy Shapiro and Sagi Grimberg and Liran Liss and Muli Ben-Yehuda and Nadav Amit and Dan Tsafrir},
    title = {Page fault support for network controllers},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2017,
    month = {April},
    pages = {449--466},
    address = {Xi'an, China}
    }

  • Hardware and software support for virtualization   [book]
    Edouard Bugnion, Jason Nieh, Dan Tsafrir
    Morgan & Claypool Publishers
    Synthesis Lectures on Computer Architecture, February, 2017, pages 1–206, volume 12, number 1
    Abstract, BibTeX, Definitive
    This textbook focuses on the core question of the necessary architectural support provided by hardware to efficiently run virtual machines, and of the corresponding design of the hypervisors that run them. Virtualization is still possible when the instruction set architecture lacks such support, but the hypervisor remains more complex and must rely on additional techniques.

    Despite the focus on architectural support in current architectures, some historical perspective is necessary to appropriately frame the problem. The first half of the text provides the historical perspective of the theoretical framework developed four decades ago by Popek and Goldberg. It also describes earlier systems that enabled virtualization despite the lack of architectural support in hardware.

    As is often the case, theory defines a necessary—but not sufficient—set of features, and modern architectures are the result of the combination of the theoretical framework with insights derived from practical systems. The second half of the text describes state-of-the-art support for virtualization in both x86-64 and ARM processors. This textbook includes an in-depth description of the CPU, memory and I/O virtualization of these two processor architectures, as well as case studies on the Linux/KVM, VMware, and Xen hypervisors. It concludes with a performance comparison of virtualization on current-generation x86- and ARM-based systems across multiple hypervisors.
    @Book{bugnion17-virtbook,
    author = {Edouard Bugnion and Jason Nieh and Dan Tsafrir},
    title = {Hardware and software support for virtualization},
    publisher = {Morgan \& Claypool Publishers},
    series = {Synthesis Lectures on Computer Architecture},
    volume = 12,
    number = 1,
    year = 2017,
    month = {February},
    pages = {1--206}
    }

2016

  • Hash, don't cache (the page table)
    Idan Yaniv, Dan Tsafrir
    SIGMETRICS '16: ACM SIGMETRICS International Conference on Measurement and Modeling
    June, 2016, Antibes Juan-les-Pins, France, pages 337–350
    Abstract, BibTeX, PDF, Definitive
    Radix page tables as implemented in the x86-64 architecture incur a penalty of four memory references for address translation upon each TLB miss. These 4 references become 24 in virtualized setups, accounting for 5%–90% of the runtime and thus motivating chip vendors to incorporate page walk caches (PWCs). Counterintuitively, an ISCA'10 paper found that radix page tables with PWCs are superior to hashed page tables, yielding up to 5x fewer DRAM accesses per page walk. We challenge this finding and show that it is the result of comparing against a suboptimal hashed implementation—that of the Itanium architecture. We show that, when carefully optimized, hashed page tables in fact outperform existing PWC-aided x86-64 hardware, shortening benchmark runtimes by 1%–27% and 6%–32% in bare-metal and virtualized setups, without resorting to PWCs. We further show that hashed page tables are inherently more scalable than radix designs and are better suited to accommodate the ever increasing memory size; their downside is that they make it more challenging to support such features as superpages.
    @InProceedings{yaniv16-hvsr,
    author = {Idan Yaniv and Dan Tsafrir},
    title = {Hash, don't cache (the page table)},
    booktitle = {ACM SIGMETRICS International Conference on Measurement and Modeling (SIGMETRICS)},
    year = 2016,
    month = {June},
    pages = {337--350},
    address = {Antibes Juan-les-Pins, France}
    }
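
    A schematic C sketch of a hashed translation, with illustrative constants and structure (not the paper's optimized layout): one hash-indexed probe, plus linear probing on collisions, stands in for the four sequential references of an x86-64 radix walk.

    #include <stdint.h>

    #define HPT_BUCKETS (1u << 20)

    struct hpt_entry { uint64_t vpn; uint64_t pfn; int used; };
    static struct hpt_entry hpt[HPT_BUCKETS];

    static uint64_t hash_vpn(uint64_t vpn)
    {
        return (vpn * 0x9E3779B97F4A7C15ull) >> 44;   /* top 20 bits */
    }

    /* Translate a virtual address; returns -1 on miss (the page-fault
     * path is not shown; the sketch assumes the table is never full). */
    static int64_t hpt_translate(uint64_t vaddr)
    {
        uint64_t vpn = vaddr >> 12;
        for (uint64_t i = hash_vpn(vpn); ; i = (i + 1) % HPT_BUCKETS) {
            struct hpt_entry *e = &hpt[i];
            if (!e->used)
                return -1;                /* not mapped */
            if (e->vpn == vpn)
                return (int64_t)((e->pfn << 12) | (vaddr & 0xFFF));
        }
    }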

  • Synopsis of the ASPLOS '16 wild and crazy ideas (WACI) invited-speakers session
    Dan Tsafrir
    ASPLOS '16: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2016, Atlanta, GA, pages 291–294
    Abstract, BibTeX, PDF, Definitive, Webpage
    The Wild and Crazy Ideas (WACI) session is a longstanding tradition at ASPLOS, soliciting talks that consist of forward-looking, visionary, inspiring, creative, far out or just plain amazing ideas presented in an exciting way. (Amusing elements in the presentations are tolerated ;-) but are in fact optional.)

    The first WACI session took place in 1998. Back then, the call for talks included a problem statement, which contended that “papers usually do not get admitted to [such conferences as] ISCA or ASPLOS unless the systems that they describe are mature enough to run [some standard benchmark suites, which] has a chilling effect on the idea generation process—encouraging incremental research” [1]. The 1998 WACI session turned out to be a great success. Its webpage states that “there were 42 submissions [competing over] only eight time slots, [which resulted in] this session [having] a lower acceptance rate than the conference itself” [2].

    But the times they are a-changin' [3], and the WACI session no longer enjoys that many submissions (Figure 1), perhaps because nowadays there exist many forums for researchers to describe/discuss their preliminary ideas, including: the “hot topics in” workshops [4–7]; a journal like CAL, dedicated to early results [8]; main conferences soliciting short submissions describing “original or unconventional ideas at a preliminary stage” in addition to regular papers [9]; and the many workshops co-located with main conferences, like ISCA '15, which hosted thirteen such workshops [10].

    Regardless of the reason for the declining number of submissions, this time we've decided to organize the WACI session differently to ensure its continued high quality. Instead of soliciting talks via an open call and hoping for the best, we proactively invited speakers whom we believe are capable of delivering excellent WACI presentations. That is, this year's WACI session consists exclusively of invited speakers. Filling up the available slots turned out to be fairly easy, as most of the researchers we invited promptly accepted our invitation. The duration of each talk was set to be eight minutes (exactly as in the first WACI session from 1998) plus two minutes for questions.

    The talks are outlined below. We believe they are interesting and exciting, and we hope the attendees of the session will find them stimulating and insightful.
    @InProceedings{tsafrir16-waci,
    author = {Dan Tsafrir},
    title = {Synopsis of the {ASPLOS} '16 wild and crazy ideas {(WACI)} invited-speakers session},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2016,
    month = {April},
    pages = {291--294},
    address = {Atlanta, GA}
    }

  • vRIO: Paravirtual remote I/O
    Yossi Kuperman, Eyal Moscovici, Joel Nider, Razya Ladelsky, Abel Gordon, Dan Tsafrir
    ASPLOS '16: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2016, Atlanta, GA, pages 49–65
    Abstract, BibTeX, PDF, Definitive
    The traditional “trap and emulate” I/O paravirtualization model conveniently allows for I/O interposition, yet it inherently incurs costly guest-host context switches. The newer “sidecore” model eliminates this overhead by dedicating host (side)cores to poll the relevant guest memory regions and react accordingly without context switching. But the dedication of sidecores on each host might be wasteful when I/O activity is low, or it might not provide enough computational power when I/O activity is high. We propose to alleviate this problem at rack scale by consolidating the dedicated sidecores spread across several hosts onto one server. The hypervisor is then effectively split into two parts: the local hypervisor that hosts the VMs, and the remote hypervisor that processes their paravirtual I/O. We call this model vRIO—paraVirtual Remote I/O. We find that by increasing the latency somewhat, it provides comparable throughput with fewer sidecores and superior throughput with the same number of sidecores as compared to the state of the art. vRIO additionally constitutes a new, cost-effective way to consolidate I/O devices (on the remote hypervisor) while supporting efficient programmable I/O interposition.
    @InProceedings{kuperman16-vrio,
    author = {Yossi Kuperman and Eyal Moscovici and Joel Nider and Razya Ladelsky and Abel Gordon and Dan Tsafrir},
    title = {{vRIO}: {Paravirtual} remote {I/O}},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2016,
    month = {April},
    pages = {49--65},
    address = {Atlanta, GA}
    }

  • True IOMMU protection from DMA attacks: when copy is faster than zero copy
    Alex Markuze, Adam Morrison, Dan Tsafrir
    ASPLOS '16: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    April, 2016, Atlanta, GA, pages 249–262
    Abstract, BibTeX, PDF, Definitive
    Malicious I/O devices might compromise the OS using DMAs. The OS therefore utilizes the IOMMU to map and unmap every target buffer right before and after its DMA is processed, thereby restricting DMAs to their designated locations. This usage model, however, is not truly secure for two reasons: (1) it provides protection at page granularity only, whereas DMA buffers can reside on the same page as other data; and (2) it delays DMA buffer unmaps due to performance considerations, creating a vulnerability window in which devices can access in-use memory.

    We propose that OSes utilize the IOMMU differently, in a manner that eliminates these two flaws. Our new usage model restricts device access to a set of shadow DMA buffers that are never unmapped, and it copies DMAed data to/from these buffers, thus providing sub-page protection while eliminating the aforementioned vulnerability window. Our key insight is that the cost of interacting with, and synchronizing access to, the slow IOMMU hardware—required for zero-copy protection against devices—makes copying preferable to zero-copying.

    We implement our model in Linux and evaluate it with standard networking benchmarks utilizing a 40 Gb/s NIC. We demonstrate that despite being more secure than the safest preexisting usage model, our approach provides up to 5× higher throughput. Additionally, whereas it is inherently less scalable than an IOMMU-less (unprotected) system, our approach incurs only 0%–25% performance degradation in comparison.
    @InProceedings{markuze16-cim,
    author = {Alex Markuze and Adam Morrison and Dan Tsafrir},
    title = {True {IOMMU} protection from {DMA} attacks: when copy is faster than zero copy},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2016,
    month = {April},
    pages = {249--262},
    address = {Atlanta, GA}
    }
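
    A minimal C sketch of the proposed usage model, with illustrative names: DMA is confined to shadow buffers that are mapped once and never unmapped, and the kernel copies data across the boundary.

    #include <string.h>

    #define SHADOW_SIZE 2048

    static char shadow_rx[SHADOW_SIZE];  /* IOMMU-mapped once, for RX DMA */
    static char shadow_tx[SHADOW_SIZE];  /* IOMMU-mapped once, for TX DMA */

    /* RX: the device DMAed into the shadow; copy into the kernel buffer. */
    static void rx_complete(void *kbuf, size_t len)
    {
        memcpy(kbuf, shadow_rx, len < SHADOW_SIZE ? len : SHADOW_SIZE);
    }

    /* TX: copy the kernel buffer out before handing the shadow to DMA. */
    static void tx_prepare(const void *kbuf, size_t len)
    {
        memcpy(shadow_tx, kbuf, len < SHADOW_SIZE ? len : SHADOW_SIZE);
    }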

  • The nom profit-maximizing operating system
    Muli Ben-Yehuda, Orna Agmon Ben-Yehuda, Dan Tsafrir
    VEE '16: ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments
    April, 2016, Atlanta, GA, pages 145–160
    Abstract, BibTeX, PDF, Definitive
    In the near future, cloud providers will sell their users virtual machines with CPU, memory, network, and storage resources whose prices constantly change according to market-driven supply and demand conditions. Running traditional operating systems in these virtual machines is a poor fit: traditional operating systems are not aware of changing resource prices and their sole aim is to maximize performance with no consideration of costs. Consequently, they yield low profits.

    We present nom, a profit-maximizing operating system designed for cloud computing platforms with dynamic resource prices. Applications running on nom aim to maximize profits by optimizing simultaneously for performance and resource costs. The nom kernel provides them with direct access to the underlying hardware and full control over their private software stacks. Since nom applications know there is no single “best” software stack, they adapt their stacks' behavior on the fly according to the current price of available resources and their private utility from them, which differs between applications. We show that in addition to achieving up to 3.9x better throughput and up to 9.1x better latency, nom applications yield up to 11.1x higher profits when compared with the same applications running on Linux and OSv.
    @InProceedings{ben-yehuda16-nom,
    author = {Muli Ben-Yehuda and Orna {Agmon Ben-Yehuda} and Dan Tsafrir},
    title = {The nom profit-maximizing operating system},
    booktitle = {ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE)},
    year = 2016,
    month = {April},
    pages = {145--160},
    address = {Atlanta, GA}
    }

  • Bare-metal performance for virtual machines with exitless interrupts
    Nadav Amit, Abel Gordon, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir
    CACM: Communications of the ACM
    January, 2016, pages 108–116, volume 59, number 1, CACM Research Highlight
    (See technical perspective by Steven Hand.)
    Abstract, BibTeX, PDF, Definitive
    Direct device assignment enhances the performance of guest virtual machines by allowing them to communicate with I/O devices without host involvement. But even with device assignment, guests are still unable to approach bare-metal performance, because the host intercepts all interrupts, including those generated by assigned devices to signal to guests the completion of their I/O requests. The host involvement induces multiple unwarranted guest/host context switches, which significantly hamper the performance of I/O intensive workloads. To solve this problem, we present ExitLess Interrupts (ELI), a software-only approach for handling interrupts within guest virtual machines directly and securely. By removing the host from the interrupt handling path, ELI improves the throughput and latency of unmodified, untrusted guests by 1.3x–1.6x, allowing them to reach 97%–100% of bare-metal performance even for the most demanding I/O-intensive workloads.
    @Article{amit16-elij,
    author = {Nadav Amit and Abel Gordon and Nadav Har'El and Muli Ben-Yehuda and Alex Landau and Assaf Schuster and Dan Tsafrir},
    title = {Bare-metal performance for virtual machines with exitless interrupts},
    journal = {Communications of the ACM (CACM)},
    volume = 59,
    number = 1,
    year = 2016,
    month = {January},
    pages = {108--116}
    }

2015

  • Virtual CPU validation
    Nadav Amit, Dan Tsafrir, Assaf Schuster, Ahmad Ayoub, Eran Shlomo
    SOSP '15: ACM Symposium on Operating Systems Principles
    October, 2015, Monterey, CA, pages 311–327
    Abstract, BibTeX, PDF, Definitive
    Testing the hypervisor is important for ensuring the correct operation and security of systems, but it is a hard and challenging task. We observe, however, that the challenge is similar in many respects to that of testing real CPUs. We thus propose to apply the testing environment of CPU vendors to hypervisors. We demonstrate the advantages of our proposal by adapting Intel's testing facility to the Linux KVM hypervisor. We uncover and fix 117 bugs, six of which are security vulnerabilities. We further find four flaws in Intel virtualization technology, causing a disparity between the observable behavior of code running on virtual and bare-metal servers.
    @InProceedings{amit15-hvfuz,
    author = {Nadav Amit and Dan Tsafrir and Assaf Schuster and Ahmad Ayoub and Eran Shlomo},
    title = {Virtual {CPU} validation},
    booktitle = {ACM Symposium on Operating Systems Principles (SOSP)},
    year = 2015,
    month = {October},
    pages = {311--327},
    address = {Monterey, CA}
    }

  • Securing self-virtualizing Ethernet devices
    Igor Smolyar, Muli Ben-Yehuda, Dan Tsafrir
    USENIX Security Symposium
    August, 2015, Washington, D.C., pages 335–350
    Abstract, BibTeX, PDF, Definitive
    Single root I/O virtualization (SRIOV) is a hardware/software interface that allows devices to “self virtualize” and thereby remove the host from the critical I/O path. SRIOV thus brings near bare-metal performance to untrusted guest virtual machines (VMs) in public clouds, enterprise data centers, and high-performance computing setups. We identify a design flaw in current Ethernet SRIOV NIC deployments that enables untrusted VMs to completely control the throughput and latency of other, unrelated VMs. The attack exploits Ethernet “pause” frames, which enable network flow control functionality. We experimentally launch the attack across several NIC models and find that it is effective and highly accurate, with substantial consequences if left unmitigated: (1) to be safe, NIC vendors will have to modify their NICs so as to filter pause frames originating from SRIOV instances; (2) in the meantime, administrators will have to either trust their VMs, or configure their switches to ignore pause frames, thus relinquishing flow control, which might severely degrade networking performance. We present the Virtualization-Aware Network Flow Controller (VANFC), a software-based SRIOV NIC prototype that overcomes the attack. VANFC filters pause frames from malicious virtual machines without any loss of performance, while keeping SRIOV and Ethernet flow control hardware/software interfaces intact.
    @InProceedings{smolyar15-sriovsec,
    author = {Igor Smolyar and Muli Ben-Yehuda and Dan Tsafrir},
    title = {Securing self-virtualizing {Ethernet} devices},
    booktitle = {USENIX Security Symposium},
    year = 2015,
    month = {August},
    pages = {335--350},
    address = {Washington, D.C.}
    }
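
    To make the attack's vehicle concrete: a PAUSE frame is an ordinary Ethernet MAC Control frame with fixed, spec-defined markers. The minimal C++ sketch below classifies one (the check a VANFC-style filter would apply to frames originating from an SRIOV virtual function); the constants come from IEEE 802.3x, while the surrounding filter hook is hypothetical.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Sketch: recognize an 802.3x PAUSE frame. A VANFC-style filter would
    // drop such frames when they originate from an untrusted VM's virtual
    // function; that enclosing logic is not shown here.
    bool is_pause_frame(const uint8_t* frame, std::size_t len) {
        static const uint8_t kPauseDst[6] = {0x01, 0x80, 0xC2, 0x00, 0x00, 0x01};
        if (len < 16) return false;  // dst(6) + src(6) + ethertype(2) + opcode(2)
        if (std::memcmp(frame, kPauseDst, 6) != 0) return false;
        uint16_t ethertype = uint16_t((frame[12] << 8) | frame[13]);  // MAC Control = 0x8808
        uint16_t opcode    = uint16_t((frame[14] << 8) | frame[15]);  // PAUSE       = 0x0001
        return ethertype == 0x8808 && opcode == 0x0001;
    }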

  • Utilizing the IOMMU scalably
    Omer Peleg, Adam Morrison, Benjamin Serebrin, Dan Tsafrir
    ATC '15: USENIX Annual Technical Conference
    July, 2015, Santa Clara, CA, pages 549–562
    Abstract , BibTeX , PDF , Definitive
    IOMMUs provided by modern hardware allow the OS to enforce memory protection controls on the DMA operations of its I/O devices. An IOMMU translation management design must scalably handle frequent concurrent updates of IOMMU translations made by multiple cores, which occur in high throughput I/O workloads such as multi-Gb/s networking. Today, however, OSes experience performance meltdowns when using the IOMMU in such workloads.

    This paper explores scalable IOMMU management designs and addresses the two main bottlenecks we find in current OSes: (1) assignment of I/O virtual addresses (IOVAs), and (2) management of the IOMMU's TLB.

    We propose three approaches for scalable IOVA assignment: (1) dynamic identity mappings, which eschew IOVA allocation altogether, (2) allocating IOVAs using the kernel's kmalloc, and (3) per-core caching of IOVAs allocated by a globally-locked IOVA allocator. We further describe a scalable IOMMU TLB management scheme that is compatible with all these approaches.

    Evaluation of our designs under Linux shows that (1) they achieve 88.5%–100% of the performance obtained without an IOMMU, (2) they achieve similar latency to that obtained without an IOMMU, (3) scalable IOVA allocation and dynamic identity mappings perform comparably, and (4) kmalloc provides a simple solution with high performance, but can suffer from unbounded page table blowup.
    @InProceedings{peleg15-eskimo,
    author = {Omer Peleg and Adam Morrison and Benjamin Serebrin and Dan Tsafrir},
    title = {Utilizing the {IOMMU} scalably},
    booktitle = {USENIX Annual Technical Conference (ATC)},
    year = 2015,
    month = {July},
    pages = {549--562},
    address = {Santa Clara, CA}
    }
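
    To illustrate the third design (per-core caching in front of a globally locked IOVA allocator), here is a minimal C++ sketch: a thread-local cache stands in for a per-core cache, a toy bump allocator stands in for the real allocator, and all names and batch sizes are invented.

    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <vector>

    namespace {
    std::mutex g_lock;                            // the globally locked allocator's lock
    uint64_t g_next = 1ull << 20;                 // toy bump allocator (stand-in)
    std::vector<uint64_t> g_free;                 // global free list
    thread_local std::vector<uint64_t> t_cache;   // per-core cache (per-thread here)
    constexpr std::size_t kBatch = 32;            // amortization granularity
    }

    uint64_t iova_alloc() {
        if (t_cache.empty()) {                    // refill: one lock acquisition per kBatch allocs
            std::lock_guard<std::mutex> g(g_lock);
            while (t_cache.size() < kBatch) {
                if (!g_free.empty()) { t_cache.push_back(g_free.back()); g_free.pop_back(); }
                else { t_cache.push_back(g_next); g_next += 4096; }
            }
        }
        uint64_t iova = t_cache.back();
        t_cache.pop_back();
        return iova;
    }

    void iova_free(uint64_t iova) {
        t_cache.push_back(iova);                  // O(1), no lock in the common case
        if (t_cache.size() >= 2 * kBatch) {       // flush a batch back under one lock
            std::lock_guard<std::mutex> g(g_lock);
            g_free.insert(g_free.end(), t_cache.end() - kBatch, t_cache.end());
            t_cache.resize(t_cache.size() - kBatch);
        }
    }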

  • rIOMMU: Efficient IOMMU for I/O devices that employ ring buffers
    Moshe Malka, Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir
    ASPLOS '15: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2015, Istanbul, Turkey, pages 355–368
    Abstract , BibTeX , PDF , Definitive
    The IOMMU allows the OS to encapsulate I/O devices in their own virtual memory spaces, thus restricting their DMAs to specific memory pages. The OS uses the IOMMU to protect itself against buggy drivers and malicious/errant devices. But the added protection comes at a cost, degrading the throughput of I/O-intensive workloads by up to an order of magnitude. This cost has motivated system designers to trade off some safety for performance, e.g., by leaving stale information in the IOTLB for a while so as to amortize costly invalidations.

    We observe that high-bandwidth devices—like network and PCIe SSD controllers—interact with the OS via circular ring buffers that induce a sequential, predictable workload. We design a ring IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory page table hierarchy with a circular, flat table. A flat table is adequately supported by exactly one IOTLB entry, making every new translation an implicit invalidation of the former and thus requiring explicit invalidations only at the end of I/O bursts. Using standard networking benchmarks, we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline IOMMU, and that it is within 0.77–1.00x the throughput of a system without IOMMU protection.
    @InProceedings{malka15-riommu,
    author = {Moshe Malka and Nadav Amit and Muli Ben-Yehuda and Dan Tsafrir},
    title = {{rIOMMU}: {Efficient} {IOMMU} for {I/O} devices that employ ring buffers},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2015,
    month = {March},
    pages = {355--368},
    address = {Istanbul, Turkey}
    }
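
    The flat, circular table is the paper's key structure; the toy C++ model below shows why one live translation per ring slot suffices (mapping a slot implicitly invalidates its previous occupant). The field names, the 4KiB-buffer assumption, and the (slot, offset) IOVA encoding are ours, for illustration.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct RingIommu {
        static constexpr std::size_t kEntries = 256;   // one slot per device ring entry
        std::array<uint64_t, kEntries> pfn{};          // physical frame per slot

        // Map the buffer for ring slot i; overwriting the slot's previous
        // occupant is what makes the new translation an implicit
        // invalidation of the old one (no explicit flush per buffer).
        void map(std::size_t i, uint64_t phys_frame) { pfn[i % kEntries] = phys_frame; }

        // Device-side translation: the IOVA encodes (slot, offset).
        uint64_t translate(uint64_t iova) const {
            std::size_t slot = (iova >> 12) % kEntries;  // assume 4KiB buffers
            return (pfn[slot] << 12) | (iova & 0xFFF);
        }
    };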

  • Efficient intra-operating system protection against harmful DMAs
    Moshe Malka, Nadav Amit, Dan Tsafrir
    FAST '15: USENIX Conference on File and Storage Technologies
    February, 2015, Santa Clara, CA, pages 29–44
    Abstract , BibTeX , PDF , Definitive
    Operating systems can defend themselves against misbehaving I/O devices and drivers by employing intra-OS protection. With “strict” intra-OS protection, the OS uses the IOMMU to map each DMA buffer immediately before the DMA occurs and to unmap it immediately after. Strict protection is costly due to IOMMU-related hardware overheads, motivating “deferred” intra-OS protection, which trades off some safety for performance.

    We investigate the Linux intra-OS protection mapping layer and discover that hardware overheads are not exclusively to blame for its high cost. Rather, the cost is amplified by the I/O virtual address (IOVA) allocator, which regularly induces linear complexity. We find that the nature of IOVA allocation requests is inherently simple and constrained due to the manner by which I/O devices are used, allowing us to deliver constant time complexity with a compact, easy-to-implement optimization. Our optimization improves the throughput of standard benchmarks by up to 5.5x. It delivers strict protection with performance comparable to that of the baseline deferred protection.

    To generalize our case that OSes drive the IOMMU with suboptimal software, we additionally investigate the FreeBSD mapping layer and obtain similar findings.
    @InProceedings{malka15-eiovar,
    author = {Moshe Malka and Nadav Amit and Dan Tsafrir},
    title = {Efficient intra-operating system protection against harmful {DMAs}},
    booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
    year = 2015,
    month = {February},
    pages = {29--44},
    address = {Santa Clara, CA}
    }
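
    The optimization exploits the observation that drivers request the same few IOVA range sizes over and over, so recycling freed ranges yields constant-time allocation in the common case. A hedged C++ sketch (the slow path stands in for the baseline tree-based allocator; all names are ours):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Cache of freed IOVA ranges, keyed by size in pages.
    std::unordered_map<std::size_t, std::vector<uint64_t>> freed_by_size;

    uint64_t slow_path_alloc(std::size_t pages) {
        static uint64_t next = 1ull << 20;    // toy stand-in for the real allocator
        uint64_t iova = next;
        next += pages * 4096;
        return iova;
    }

    uint64_t iova_alloc(std::size_t pages) {
        auto it = freed_by_size.find(pages);
        if (it != freed_by_size.end() && !it->second.empty()) {
            uint64_t iova = it->second.back();  // O(1) hit: recycle a freed range
            it->second.pop_back();
            return iova;
        }
        return slow_path_alloc(pages);          // miss: fall back to the baseline
    }

    void iova_free(uint64_t iova, std::size_t pages) {
        freed_by_size[pages].push_back(iova);   // O(1): defer reuse to future allocs
    }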

2014

  • Experience with using the Parallel Workloads Archive
    Dror G. Feitelson, Dan Tsafrir, David Krakov
    JPDC: Journal of Parallel and Distributed Computing
    October, 2014, pages 2967–2982, volume 74, number 10
    Abstract , BibTeX , PDF , Definitive
    Workload data is an invaluable resource for the design and evaluation of computer systems. However, data quality is always an issue. Therefore the experience gained when using data is also a precious resource, and both the data and the experience should be recorded and made available to other researchers. We demonstrate this for the Parallel Workloads Archive, which records job-level usage data from many large-scale parallel supercomputers, clusters, and grids. Data quality problems encountered include missing data, inconsistent data, erroneous data, system configuration changes during the logging period, and nonrepresentative user behavior. Some of these may be countered by filtering out the problematic data items. In other cases, being cognizant of the problems may affect the decision of which datasets to use.
    @Article{feitelson14-pwaexp,
    author = {Dror G. Feitelson and Dan Tsafrir and David Krakov},
    title = {Experience with using the {Parallel} {Workloads} {Archive}},
    journal = {Journal of Parallel and Distributed Computing (JPDC)},
    volume = 74,
    number = 10,
    year = 2014,
    month = {October},
    pages = {2967--2982}
    }

  • The rise of RaaS: resource-as-a-service cloud
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    CACM: Communications of the ACM
    July, 2014, volume 57, number 7
    Abstract , BibTeX , PDF , Definitive
    Over the next few years, a new model of buying and selling cloud computing resources will evolve. Instead of providers exclusively selling server-equivalent virtual machines for relatively long periods of time (as done in today's IaaS clouds), they will increasingly sell to clients individual resources (such as CPU, memory, and I/O resources), reprice them and adjust their quantity every few seconds. Instead of allocating resources on a first-come, first-served basis, they will do so in accordance with market-driven supply-and-demand conditions. We term this nascent economic model of cloud computing the Resource-as-a-Service (RaaS) cloud, and we argue that its rise is the likely culmination of recent trends in the construction of Infrastructure-as-a-Service (IaaS) clouds and of the economic forces operating on both providers and clients.
    @Article{ben-yehuda14-raasj,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {The rise of {RaaS:} resource-as-a-service cloud},
    journal = {Communications of the ACM (CACM)},
    volume = 57,
    number = 7,
    year = 2014,
    month = {July}
    }

  • VSwapper: A memory swapper for virtualized environments
    Nadav Amit, Dan Tsafrir, Assaf Schuster
    ASPLOS '14: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2014, Salt Lake City, UT, pages 349–366, HiPEAC Paper Award
    Abstract , BibTeX , PDF , Definitive
    The number of guest virtual machines that can be consolidated on one physical host is typically limited by the memory size, motivating memory overcommitment. Guests are given a choice to either install a “balloon” driver to coordinate the overcommitment activity, or to experience degraded performance due to uncooperative swapping. Ballooning, however, is not a complete solution, as hosts must still fall back on uncooperative swapping in various circumstances. Additionally, ballooning takes time to accommodate change, and so guests might experience degraded performance under changing conditions.

    Our goal is to improve the performance of hosts when they fall back on uncooperative swapping and/or operate under changing load conditions. We carefully isolate and characterize the causes for the associated poor performance, which include various types of superfluous swap operations, decayed swap file sequentiality, and ineffective prefetch decisions upon page faults. We address these problems by implementing VSwapper, a guest-agnostic memory swapper for virtual environments that allows efficient, uncooperative overcommitment. With inactive ballooning, VSwapper yields up to an order of magnitude performance improvement. Combined with ballooning, VSwapper can achieve up to double the performance under changing load conditions.
    @InProceedings{amit14-vswap,
    author = {Nadav Amit and Dan Tsafrir and Assaf Schuster},
    title = {{VSwapper}: {A} memory swapper for virtualized environments},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2014,
    month = {March},
    pages = {349--366},
    address = {Salt Lake City, UT}
    }

  • Memory swapper for virtualized environments
    Nadav Amit, Dan Tsafrir, Assaf Schuster
    Patent Number: US 9811268 B2 (granted: Nov 2017)
    February, 2014
    BibTeX , Html
    @Misc{amit14-vswappat,
    author = {Nadav Amit and Dan Tsafrir and Assaf Schuster},
    title = {Memory swapper for virtualized environments},
    howpublished = {Patent Number: US 9811268 B2 (granted: Nov 2017)},
    year = 2014,
    month = {February}
    }

2013

  • Deconstructing Amazon EC2 spot instance pricing
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    TEAC: ACM Transactions on Economics and Computation
    September, 2013, pages 16:1–16:20, volume 1, number 3
    Abstract , BibTeX , PDF , Definitive
    Cloud providers possessing large quantities of spare capacity must either incentivize clients to purchase it or suffer losses. Amazon is the first cloud provider to address this challenge, by allowing clients to bid on spare capacity and by granting resources to bidders while their bids exceed a periodically changing spot price. Amazon publicizes the spot price but does not disclose how it is determined. By analyzing the spot price histories of Amazon's EC2 cloud, we reverse engineer how prices are set and construct a model that generates prices consistent with existing price traces. Our findings suggest that usually prices are not market-driven, as sometimes previously assumed. Rather, they are likely to be generated most of the time at random from within a tight price range via a dynamic hidden reserve price mechanism. Our model could help clients make informed bids, cloud providers design profitable systems, and researchers design pricing algorithms.
    @Article{ben-yehuda13-spotj,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {Deconstructing {Amazon} {EC2} spot instance pricing},
    journal = {ACM Transactions on Economics and Computation (TEAC)},
    volume = 1,
    number = 3,
    year = 2013,
    month = {September},
    pages = {16:1--16:20}
    }
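
    In the spirit of the reverse-engineered model, here is a toy C++ price generator: draw the published spot price at random from a tight band above a hidden reserve price. The band width, reserve value, and number of epochs are invented for illustration only.

    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937 rng(42);
        double reserve = 0.035;   // hidden reserve price ($/hr), hypothetical
        std::uniform_real_distribution<double> band(1.0, 1.25);  // tight band above it
        for (int epoch = 0; epoch < 10; ++epoch) {
            double spot = reserve * band(rng);   // random draw within the band
            std::printf("epoch %d: spot price $%.4f\n", epoch, spot);
        }
    }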

  • The nonkernel: a kernel designed for the cloud
    Muli Ben-Yehuda, Omer Peleg, Orna Agmon Ben-Yehuda, Igor Smolyar, Dan Tsafrir
    APSYS '13: ACM Asia-Pacific Workshop on Systems
    July, 2013, Singapore, pages 4:1–4:7
    Abstract , BibTeX , PDF , Definitive
    Infrastructure-as-a-Service (IaaS) cloud computing is causing a fundamental shift in the way computing resources are bought, sold, and used. We foresee a future whereby every CPU cycle, every memory word, and every byte of network bandwidth in the cloud would have a constantly changing market-driven price. We argue that, in such an environment, the underlying resources should be exposed directly to applications without kernel or hypervisor involvement. We propose the nonkernel, an architecture for operating system kernel construction designed for such cloud computing platforms. A nonkernel uses modern architectural support for machine virtualization to securely provide unprivileged user programs with pervasive access to the underlying resources. We motivate the need for the nonkernel, we contrast it against its predecessor, the exokernel, and we outline how one could go about building a nonkernel operating system.
    @InProceedings{ben-yehuda13-nomish,
    author = {Muli Ben-Yehuda and Omer Peleg and Orna {Agmon Ben-Yehuda} and Igor Smolyar and Dan Tsafrir},
    title = {The nonkernel: a kernel designed for the cloud},
    booktitle = {ACM Asia-Pacific Workshop on Systems (APSYS)},
    year = 2013,
    month = {July},
    pages = {4:1--4:7},
    address = {Singapore}
    }

  • Using disk add-ons to withstand simultaneous disk failures with fewer replicas
    Eitan Rosenfeld, Nadav Amit, Dan Tsafrir
    WIVOSCA '13: Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture
    June, 2013, Tel Aviv, Israel
    Abstract , BibTeX , PDF , Definitive
    Contemporary storage systems that utilize replication often maintain more than two replicas of each data item, reducing the risk of permanent data loss due to simultaneous disk failures. The price of the additional copies is smaller usable storage space, increased network traffic, and higher power consumption. We propose to alleviate this problem with SimFail, a storage system that maintains only two replicas and utilizes per-disk “add-ons”, which are simple hardware devices equipped with relatively small memory that proxy disk I/O traffic. SimFail can significantly reduce the risk of data loss due to temporally adjacent disk failures by quickly copying at-risk data from disks to their add-ons. SimFail can further eliminate the risk entirely by maintaining local parity information of disks on their add-ons (such that each add-on holds the parity of its own disk's data chunks). We postulate that SimFail may open the door for cloud providers to reduce the number of data replicas they use from three to two.
    @InProceedings{rosenfeld13-simfail,
    author = {Eitan Rosenfeld and Nadav Amit and Dan Tsafrir},
    title = {Using disk add-ons to withstand simultaneous disk failures with fewer replicas},
    booktitle = {Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA)},
    year = 2013,
    month = {June},
    address = {Tel Aviv, Israel}
    }

  • Using disk add-ons to withstand simultaneous disk failures with fewer replicas
    Dan Tsafrir, Eitan Rosenfeld
    Patent Number: US 9535802 B2 (granted: Jan 2017)
    January, 2013
    BibTeX , Html
    @Misc{tsafrir13-raidppat,
    author = {Dan Tsafrir and Eitan Rosenfeld},
    title = {Using disk add-ons to withstand simultaneous disk failures with fewer replicas},
    howpublished = {Patent Number: US 9535802 B2 (granted: Jan 2017)},
    year = 2013,
    month = {January}
    }

2012

  • The resource-as-a-service (RaaS) cloud
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    HotCloud '12: USENIX Workshop on Hot Topics in Cloud Computing
    June, 2012, Boston, MA
    Abstract , BibTeX , PDF , Definitive
    Over the next few years, a new model of buying and selling cloud computing resources will evolve. Instead of providers exclusively selling server equivalent virtual machines for relatively long periods of time (as done in today's IaaS clouds), providers will increasingly sell individual resources (such as CPU, memory, and I/O resources) for a few seconds at a time. We term this nascent economic model of cloud computing the Resource-as-a-Service (RaaS) cloud, and we argue that its rise is the likely culmination of recent trends in the construction of IaaS clouds and of the economic forces operating on both providers and clients.
    @InProceedings{ben-yehuda12-raas,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {The resource-as-a-service {(RaaS)} cloud},
    booktitle = {USENIX Workshop on Hot Topics in Cloud Computing (HotCloud)},
    year = 2012,
    month = {June},
    address = {Boston, MA}
    }

  • Feasibility of mutable replay for automated regression testing of security updates
    Ilia Kravets, Dan Tsafrir
    RESoLVE '12: Workshop on Runtime Environments, Systems, Layering and Virtualized Environments
    March, 2012, London, UK
    Abstract , BibTeX , PDF , Definitive
    Applying software patches runs the risk of somehow breaking the system, so organizations often prefer to avoid updates or to defer them until extensive testing is performed. But unpatched software might be compromised, and extensive testing might be tedious and time and resource consuming. We consider alleviating the problem by utilizing a new type of “mutable” deterministic execution replay, which would allow us to (1) run the software before and after the patch and to (2) compare the behavior of the two versions in a manner that tolerates legitimate differences that arise from security patches. The potential advantages of this approach are that: it is automatic; it requires no programming skills; it is language-agnostic; it works without source code; and it can be applied while the system is running and serving real production workloads. In this work, we do not implement mutable replay; rather, we assess its feasibility. We do so by systematically studying about 200 security patches applied to the most popular Debian/Linux packages and by considering how mutable replay should be implemented so as to tolerate execution differences that such patches generate.
    @InProceedings{kravets12-mreplay,
    author = {Ilia Kravets and Dan Tsafrir},
    title = {Feasibility of mutable replay for automated regression testing of security updates},
    booktitle = {Workshop on Runtime Environments, Systems, Layering and Virtualized Environments (RESoLVE)},
    year = 2012,
    month = {March},
    address = {London, UK}
    }

  • ELI: bare-metal performance for I/O virtualization
    Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir
    ASPLOS '12: ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    March, 2012, London, UK, pages 411–422, Pat Goldberg Memorial Best Paper Award, HiPEAC Paper Award, CACM Research Highlight
    Abstract , BibTeX , PDF , Definitive
    Direct device assignment enhances the performance of guest virtual machines by allowing them to communicate with I/O devices without host involvement. But even with device assignment, guests are still unable to approach bare-metal performance, because the host intercepts all interrupts, including those interrupts generated by assigned devices to signal to guests the completion of their I/O requests. The host involvement induces multiple unwarranted guest/host context switches, which significantly hamper the performance of I/O-intensive workloads. To solve this problem, we present ELI (ExitLess Interrupts), a software-only approach for handling interrupts within guest virtual machines directly and securely. By removing the host from the interrupt handling path, ELI manages to improve the throughput and latency of unmodified, untrusted guests by 1.3x–1.6x, allowing them to reach 97%–100% of bare-metal performance even for the most demanding I/O-intensive workloads.
    @InProceedings{gordon12-eli,
    author = {Abel Gordon and Nadav Amit and Nadav Har'El and Muli Ben-Yehuda and Alex Landau and Assaf Schuster and Dan Tsafrir},
    title = {{ELI}: bare-metal performance for {I/O} virtualization},
    booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
    year = 2012,
    month = {March},
    pages = {411--422},
    address = {London, UK}
    }

2011

  • Deconstructing Amazon EC2 spot instance pricing
    Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
    CloudCom '11: IEEE International Conference on Cloud Computing Technology and Science
    November, 2011, Athens, Greece
    Abstract , BibTeX , PDF , Definitive
    Cloud providers possessing large quantities of spare capacity must either incentivize clients to purchase it or suffer losses. Amazon is the first cloud provider to address this challenge, by allowing clients to bid on spare capacity and by granting resources to bidders while their bids exceed a periodically changing spot price. Amazon publicizes the spot price but does not disclose how it is determined. By analyzing the spot price histories of Amazon's EC2 cloud, we reverse engineer how prices are set and construct a model that generates prices consistent with existing price traces. We find that prices are usually not market-driven as sometimes previously assumed. Rather, they are typically generated at random from within a tight price interval via a dynamic hidden reserve price. Our model could help clients make informed bids, cloud providers design profitable systems, and researchers design pricing algorithms.
    @InProceedings{ben-yehuda11-spotprice,
    author = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
    title = {Deconstructing {Amazon} {EC2} spot instance pricing},
    booktitle = {IEEE International Conference on Cloud Computing Technology and Science (CloudCom)},
    year = 2011,
    month = {November},
    address = {Athens, Greece}
    }

  • vIOMMU: efficient IOMMU emulation
    Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, Assaf Schuster
    ATC '11: USENIX Annual Technical Conference
    June, 2011, Portland, Oregon, pages 73–86
    Abstract , BibTeX , PDF , Definitive
    Direct device assignment, where a guest virtual machine directly interacts with an I/O device without host intervention, is appealing, because it allows an unmodified (non-hypervisor-aware) guest to achieve near-native performance. But device assignment for unmodified guests suffers from two serious deficiencies: (1) it requires pinning of all the guest's pages, thereby disallowing memory overcommitment, and (2) it exposes the guest's memory to buggy device drivers.

    We solve these problems by designing, implementing, and exposing an emulated IOMMU (vIOMMU) to the unmodified guest. We employ two novel optimizations to make vIOMMU perform well: (1) waiting a few milliseconds before tearing down an IOMMU mapping in the hope it will be immediately reused (“optimistic teardown”), and (2) running the vIOMMU on a sidecore, thereby enabling for the first time the use of a sidecore by unmodified guests. Both optimizations are highly effective in isolation. The former allows bare-metal to achieve 100% performance; the combination of the two allows an unmodified guest to do the same.
    @InProceedings{amit11-viommu,
    author = {Nadav Amit and Muli Ben-Yehuda and Dan Tsafrir and Assaf Schuster},
    title = {{vIOMMU}: efficient {IOMMU} emulation},
    booktitle = {USENIX Annual Technical Conference (ATC)},
    year = 2011,
    month = {June},
    pages = {73--86},
    address = {Portland, Oregon}
    }
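
    "Optimistic teardown" can be pictured as a short-lived cache of dead mappings, sketched below in C++. The names, eviction policy, and the 10ms grace period are invented; a real implementation would perform the actual IOMMU unmap where the comments indicate.

    #include <chrono>
    #include <cstdint>
    #include <unordered_map>

    using Clock = std::chrono::steady_clock;
    struct Cached { uint64_t pfn; Clock::time_point freed_at; };
    std::unordered_map<uint64_t, Cached> teardown_cache;   // keyed by IOVA
    constexpr auto kGrace = std::chrono::milliseconds(10); // made-up grace period

    bool try_reuse(uint64_t iova, uint64_t pfn) {
        auto it = teardown_cache.find(iova);
        if (it == teardown_cache.end() || it->second.pfn != pfn) return false;
        teardown_cache.erase(it);            // mapping is live again: map/unmap saved
        return true;
    }

    void optimistic_unmap(uint64_t iova, uint64_t pfn) {
        teardown_cache[iova] = {pfn, Clock::now()};   // defer the real teardown
    }

    void reap_expired() {                    // called periodically
        auto now = Clock::now();
        for (auto it = teardown_cache.begin(); it != teardown_cache.end();) {
            if (now - it->second.freed_at > kGrace)
                it = teardown_cache.erase(it);        // real IOMMU unmap goes here
            else
                ++it;
        }
    }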

  • A method for guaranteeing program correctness using finegrained hardware speculative execution
    Dan Tsafrir, Robert W. Wisniewski
    Patent Number: US 9195550 B2 (granted: Nov 2015)
    February, 2011
    BibTeX , Html
    @Misc{tsafrir11-specupat,
    author = {Dan Tsafrir and Robert W. Wisniewski},
    title = {A method for guaranteeing program correctness using finegrained hardware speculative execution},
    howpublished = {Patent Number: US 9195550 B2 (granted: Nov 2015)},
    year = 2011,
    month = {February}
    }

2010

  • Using inaccurate estimates accurately
    Dan Tsafrir
    JSSPP '10: Workshop on Job Scheduling Strategies for Parallel Processing
    April, 2010, Atlanta, Georgia, pages 208–221, LNCS Volume 6253, keynote talk
    Abstract , BibTeX , PDF , Definitive
    Job schedulers improve system utilization by requiring users to estimate how long their jobs will run and by using this information to better pack (or “backfill”) the jobs. But, surprisingly, many studies find that deliberately making estimates less accurate boosts (or does not affect) performance, which helps explain why production systems still exclusively rely on notoriously inaccurate estimates.

    We prove these studies wrong by showing their methodology is erroneous. The studies model an estimate e as being correlated with r·F (where r is the runtime of the associated job, F is some "badness" factor, and bigger Fs imply increased inaccuracy). We show this model is invalid, because: (1) it conveys too much information to the scheduler; (2) it induces favoritism of short jobs; and (3) it is inherently different from real user inaccuracy, which associates 90% of the jobs with merely 20 estimate values, hindering the scheduler's ability to backfill.

    We conclude that researchers must stop using multiples of runtimes as estimates, or else their results would likely be invalid. We develop (and propose to use) a realistic model that preserves the estimates' modality and makes it possible to soundly simulate increased inaccuracy by, e.g., associating more jobs with the maximal runtime allowed (an always-popular estimate, which prevents backfilling).
    @InCollection{tsafrir10-keynote,
    author = {Dan Tsafrir},
    title = {Using inaccurate estimates accurately},
    booktitle = {Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP)},
    year = 2010,
    month = {April},
    pages = {208--221},
    address = {Atlanta, Georgia},
    note = {LNCS Volume 6253}
    }
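
    The proposed realistic model boils down to sampling estimates from a small modal set of "round" values in which the maximum allowed runtime is deliberately popular. A toy C++ sampler follows; the specific values and weights are invented for illustration.

    #include <cstddef>
    #include <random>
    #include <vector>

    // Draw a job estimate (seconds) from a modal distribution: a handful
    // of round values, with the site maximum among the most popular.
    double draw_estimate(std::mt19937& rng) {
        static const std::vector<double> vals =      // 10m, 30m, 1h, 4h, 18h (max)
            {600, 1800, 3600, 14400, 64800};
        static const std::vector<double> weights =   // max kept deliberately popular
            {0.25, 0.20, 0.25, 0.10, 0.20};
        static std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
        return vals[pick(rng)];
    }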

2009

  • SCARY iterator assignment and initialization revision 1
    Robert Klarer, Bjarne Stroustrup, Dan Tsafrir, Michael Wong
    ISO/IEC C++ Standards Committee Paper WG21/N2980=09-0170
    October, 2009, Santa Cruz, California
    Abstract , BibTeX , PDF , Definitive
    We propose a requirement that a standard container's iterator types have no dependency on any type argument apart from the container's value_type, difference_type, pointer type, and const_pointer type. In particular, the types of a standard container's iterators should not depend on the container's key_compare, hasher, key_equal, or allocator types.
    @InProceedings{klarer09-scary-rev1,
    author = {Robert Klarer and Bjarne Stroustrup and Dan Tsafrir and Michael Wong},
    title = {SCARY iterator assignment and initialization revision 1},
    booktitle = {ISO/IEC C++ Standards Committee Paper WG21/N2980=09-0170},
    year = 2009,
    month = {October},
    address = {Santa Cruz, California}
    }
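
    In code, the requirement means the following compiles and the assertion holds: iterator types are interchangeable across comparators. Some implementations (e.g., libstdc++) already behave this way, but the standard does not guarantee it, so treat this as a demonstration rather than portable code.

    #include <functional>
    #include <set>
    #include <type_traits>

    using It1 = std::set<int, std::less<int>>::iterator;
    using It2 = std::set<int, std::greater<int>>::iterator;

    int main() {
        // On a SCARY implementation the two iterator types are identical:
        static_assert(std::is_same<It1, It2>::value,
                      "this implementation does not provide SCARY iterators");
        std::set<int, std::less<int>> a{1, 2, 3};
        It2 it = a.begin();          // SCARY assignment across comparator types
        return *it == 1 ? 0 : 1;
    }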

  • Minimizing dependencies within generic classes for faster and smaller programs
    Dan Tsafrir, Robert W. Wisniewski, David F. Bacon, Bjarne Stroustrup
    OOPSLA '09: ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications
    October, 2009, Orlando, Florida, pages 425–444
    Abstract , BibTeX , PDF , Definitive
    Generic classes can be used to improve performance by allowing compile-time polymorphism. But the applicability of compile-time polymorphism is narrower than that of runtime polymorphism, and it might bloat the object code. We advocate a programming principle whereby a generic class should be implemented in a way that minimizes the dependencies between its members (nested types, methods) and its generic type parameters. Conforming to this principle (1) reduces the bloat and (2) gives rise to a previously unconceived manner of using the language that expands the applicability of compile-time polymorphism to a wider range of problems. Our contribution is thus a programming technique that generates faster and smaller programs. We apply our ideas to GCC's STL containers and iterators, and we demonstrate notable speedups and reduction in object code size (a real application runs 1.2x to 2.1x faster and STL code is 1x to 25x smaller). We conclude that standard generic APIs (like STL) should be amended to reflect the proposed principle in the interest of efficiency and compactness. Such modifications will not break old code; they simply increase flexibility. Our findings apply to languages like C++, C#, and D, which realize generic programming through multiple instantiations.
    @InProceedings{tsafrir09-inner,
    author = {Dan Tsafrir and Robert W. Wisniewski and David F. Bacon and Bjarne Stroustrup},
    title = {Minimizing dependencies within generic classes for faster and smaller programs},
    booktitle = {ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA)},
    year = 2009,
    month = {October},
    pages = {425--444},
    address = {Orlando, Florida}
    }
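
    The principle in miniature: hoist members out of dependence on type parameters they do not use. In the invented example below, the iterator depends only on T, so stack<int, PolicyA> and stack<int, PolicyB> share a single iterator instantiation instead of duplicating identical code per Policy.

    #include <cstddef>

    template<typename T>
    class stack_iterator {               // hoisted out: depends on T alone
        T* p_;
    public:
        explicit stack_iterator(T* p) : p_(p) {}
        T& operator*() const { return *p_; }
        stack_iterator& operator++() { ++p_; return *this; }
        bool operator!=(const stack_iterator& o) const { return p_ != o.p_; }
    };

    template<typename T, typename Policy>  // Policy no longer infects the iterator
    class stack {
        T buf_[64];
        std::size_t n_ = 0;
    public:
        void push(const T& v) { buf_[n_++] = v; }
        stack_iterator<T> begin() { return stack_iterator<T>(buf_); }
        stack_iterator<T> end()   { return stack_iterator<T>(buf_ + n_); }
    };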

  • SCARY iterator assignment and initialization
    Robert Klarer, Bjarne Stroustrup, Dan Tsafrir, Michael Wong
    ISO/IEC C++ Standards Committee Paper WG21/N2913=09-0103
    July, 2009, Frankfurt, Germany
    Abstract , BibTeX , PDF , Definitive
    We propose a requirement that a standard container's iterator types have no dependency on any type argument apart from the container's value_type, difference_type, pointer type, and const_pointer type. In particular, the types of a standard container's iterators should not depend on the container's key_compare, hasher, key_equal, or allocator types.
    @InProceedings{klarer09-scary,
    author = {Robert Klarer and Bjarne Stroustrup and Dan Tsafrir and Michael Wong},
    title = {SCARY iterator assignment and initialization},
    booktitle = {ISO/IEC C++ Standards Committee Paper WG21/N2913=09-0103},
    year = 2009,
    month = {July},
    address = {Frankfurt, Germany}
    }

2008

  • Portably solving file races with hardness amplification
    Dan Tsafrir, Tomer Hertz, Dilma Da Silva, David Wagner
    TOS: ACM Transactions on Storage
    November, 2008, page 9, volume 4, number 3
    Superseded by , Abstract , BibTeX , PDF , Definitive
    The file-system API of contemporary systems makes programs vulnerable to TOCTTOU (time-of-check-to-time-of-use) race conditions. Existing solutions either help users to detect these problems (by pinpointing their locations in the code), or prevent the problem altogether (by modifying the kernel or its API). But the latter alternative is not prevalent, and the former is just the first step: programmers must still address TOCTTOU flaws within the limits of the existing API with which several important tasks cannot be accomplished in a portable, straightforward manner. Recently, Dean and Hu [2004] addressed this problem and suggested a probabilistic hardness amplification approach that alleviated the matter. Alas, shortly after, Borisov et al. [2005] responded with an attack termed “filesystem maze” that defeated the new approach.

    We begin by noting that mazes constitute a generic way to deterministically win many TOCTTOU races (gone are the days when the probability was small). In the face of this threat, we (1) develop a new user-level defense that can withstand mazes, and (2) show that our method is undefeated even by much stronger hypothetical attacks that provide the adversary program with ideal conditions to win the race (enjoying complete and instantaneous knowledge about the defending program's actions and being able to perfectly synchronize accordingly). The fact that our approach is immune to these unrealistic attacks suggests it can be used as a simple and portable solution to a large class of TOCTTOU vulnerabilities, without requiring modifications to the underlying operating system.
    @Article{tsafrir08-tocttou:tos,
    author = {Dan Tsafrir and Tomer Hertz and Dilma Da Silva and David Wagner},
    title = {Portably solving file races with hardness amplification},
    journal = {ACM Transactions on Storage (TOS)},
    volume = 4,
    number = 3,
    year = 2008,
    month = {November},
    pages = 9
    }

  • Portably preventing file race attacks with user-mode path resolution
    Dan Tsafrir, Tomer Hertz, David Wagner, Dilma Da Silva
    Technical Report RC24572, IBM T. J. Watson Research Center
    June, 2008, Yorktown Heights, New York, submitted
    Abstract , BibTeX , PDF , Definitive
    The filesystem API of contemporary systems exposes programs to TOCTTOU (time of check to time of use) race-condition vulnerabilities, which occur between pairs of check/use system calls that involve a name of a file. Existing solutions either help programmers to detect such races (by pinpointing their location) or prevent them altogether (by altering the operating system). But the latter alternative is not prevalent, and the former is just the first step: programmers must still address TOCTTOU flaws within the limits of the existing API with which several important tasks cannot be safely accomplished in a portable, straightforward manner. The recent “filesystem maze” attack further worsens the problem by allowing adversaries to deterministically win races and thus refuting the common perception that the risk is small. In the face of this threat, we develop a new algorithm that allows programmers to effectively aggregate a vulnerable pair of distinct system calls into a single operation that is executed “atomically”. This is achieved by emulating one kernel functionality in user mode: the filepath resolution. The surprisingly simple resulting algorithm constitutes a portable solution to a large class of TOCTTOU vulnerabilities, without requiring modifications to the underlying operating system. In contrast to our previous work, the new algorithm is deterministic and incurs significantly less overhead.
    @TechReport{tsafrir08-path-res,
    author = {Dan Tsafrir and Tomer Hertz and David Wagner and Dilma Da Silva},
    title = {Portably preventing file race attacks with user-mode path resolution},
    institution = {IBM T. J. Watson Research Center},
    number = {RC24572},
    year = 2008,
    month = {June},
    address = {Yorktown Heights, New York},
    note = {(submitted for publication)}
    }
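
    The paper's core move is walking the path one component at a time, so each step operates on an object already inspected. A heavily simplified POSIX sketch in C++ (error handling, symlink policy, and ".." treatment omitted; this is our illustration, not the paper's algorithm):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sstream>
    #include <string>

    // Resolve an absolute path component by component. openat() pins each
    // step to the directory we already inspected, closing the
    // check-to-use window for that component; O_NOFOLLOW rejects symlinks.
    int resolve_nofollow(const char* path) {
        int dir = open("/", O_RDONLY | O_DIRECTORY);
        if (dir < 0) return -1;
        std::stringstream ss(path + 1);          // assume an absolute path
        std::string comp;
        while (std::getline(ss, comp, '/')) {
            if (comp.empty()) continue;
            int next = openat(dir, comp.c_str(), O_RDONLY | O_NOFOLLOW);
            close(dir);
            if (next < 0) return -1;
            dir = next;
        }
        return dir;                              // fd for the final component
    }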

  • The murky issue of changing process identity: revising “setuid demystified”
    Dan Tsafrir, Dilma Da Silva, David Wagner
    USENIX ;login
    June, 2008, pages 55–66, volume 33, number 3
    Abstract , BibTeX , PDF , Definitive , Software
    Dropping unneeded process privileges promotes security, but is notoriously error-prone due to confusing set*id system calls with unclear semantics and subtle portability issues. To make things worse, existing recipes to accomplish the task are lacking, related manuals can be misleading, and the associated kernel subsystem might contain bugs. We therefore proclaim the system as untrustworthy when it comes to the subject matter and suggest a defensive, easy-to-use solution that addresses all concerns.
    @Article{tsafrir08-priv,
    author = {Dan Tsafrir and Dilma Da Silva and David Wagner},
    title = {The murky issue of changing process identity: revising ``setuid demystified''},
    journal = {USENIX ;login},
    volume = 33,
    number = 3,
    year = 2008,
    month = {June},
    pages = {55--66}
    }
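
    The article's defensive stance translates to: request the identity change, then verify every UID actually changed, then verify the privilege cannot be regained. A minimal C++ sketch for Linux-style systems that provide setresuid/getresuid (assumes a nonzero target uid; the full recipe also handles groups and portability differences):

    #include <cstdlib>
    #include <unistd.h>

    // Treat the set*id calls as untrustworthy: verify after the fact.
    void drop_privileges_or_die(uid_t uid) {
        if (setresuid(uid, uid, uid) != 0) abort();  // request the drop
        uid_t r, e, s;
        if (getresuid(&r, &e, &s) != 0) abort();     // then verify it happened
        if (r != uid || e != uid || s != uid) abort();
        // Paranoia: regaining root must now fail (uid assumed nonzero).
        if (setuid(0) == 0) abort();
    }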

  • Portably solving file TOCTTOU races with hardness amplification
    Dan Tsafrir, Tomer Hertz, David Wagner, Dilma Da Silva
    FAST '08: USENIX Conference on File and Storage Technologies
    February, 2008, San Jose, California, pages 189–206, awarded Best Paper, Pat Goldberg Memorial Best Paper Award
    Superseded by , Abstract , BibTeX , PDF , Definitive
    The file-system API of contemporary systems makes programs vulnerable to TOCTTOU (time of check to time of use) race conditions. Existing solutions either help users to detect these problems (by pinpointing their locations in the code), or prevent the problem altogether (by modifying the kernel or its API). The latter alternative is not prevalent, and the former is just the first step: programmers must still address TOCTTOU flaws within the limits of the existing API with which several important tasks cannot be accomplished in a portable, straightforward manner. Recently, Dean and Hu addressed this problem and suggested a probabilistic hardness amplification approach that alleviated the matter. Alas, shortly after, Borisov et al. responded with an attack termed “filesystem maze” that defeated the new approach.

    We begin by noting that mazes constitute a generic way to deterministically win many TOCTTOU races (gone are the days when the probability was small). In the face of this threat, we (1) develop a new user-level defense that can withstand mazes, and (2) show that our method is undefeated even by much stronger hypothetical attacks that provide the adversary program with ideal conditions to win the race (enjoying complete and instantaneous knowledge about the defending program's actions and being able to perfectly synchronize accordingly). The fact that our approach is immune to these unrealistic attacks suggests it can be used as a simple and portable solution to a large class of TOCTTOU vulnerabilities, without requiring modifications to the underlying operating system.
    @InProceedings{tsafrir08-tocttou,
    author = {Dan Tsafrir and Tomer Hertz and David Wagner and Dilma Da Silva},
    title = {Portably solving file {TOCTTOU} races with hardness amplification},
    booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
    year = 2008,
    month = {February},
    pages = {189--206},
    address = {San Jose, California}
    }
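
    Not the paper's K-round algorithm, but the basic check/use idiom it hardens, shown here to fix ideas: bind the "use" to a file descriptor and confirm the checked and used objects are the same (device, inode) pair.

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int open_checked(const char* path) {
        struct stat before, after;
        if (lstat(path, &before) != 0 || !S_ISREG(before.st_mode)) return -1;  // check
        int fd = open(path, O_RDONLY | O_NOFOLLOW);                            // use
        if (fd < 0) return -1;
        if (fstat(fd, &after) != 0 ||
            before.st_dev != after.st_dev || before.st_ino != after.st_ino) {
            close(fd);   // the name was switched between the two calls
            return -1;
        }
        return fd;
    }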

  • Specialized execution environments
    Maria Butrico, Dilma Da Silva, Orran Krieger, Michal Ostrowski, Bryan Rosenburg, Dan Tsafrir, Eric Van Hensbergen, Robert W. Wisniewski, Jimi Xenidis
    OSR: ACM Operating Systems Review
    January, 2008, pages 106–107, volume 42, number 1
    BibTeX , PDF , Definitive
    @Article{butrico08-specialized,
    author = {Maria Butrico and Dilma Da Silva and Orran Krieger and Michal Ostrowski and Bryan Rosenburg and Dan Tsafrir and Eric Van Hensbergen and Robert W. Wisniewski and Jimi Xenidis},
    title = {Specialized execution environments},
    journal = {ACM Operating Systems Review (OSR)},
    volume = 42,
    number = 1,
    year = 2008,
    month = {January},
    pages = {106--107}
    }

2007

  • Reducing performance evaluation sensitivity and variability by input shaking
    Dan Tsafrir, Keren Ouaknine, Dror G. Feitelson
    MASCOTS '07: IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
    October, 2007, Istanbul, Turkey, pages 231–237
    Abstract , BibTeX , PDF , Definitive
    Simulations sometimes lead to observed sensitivity to configuration parameters as well as inconsistent performance results. The question, then, is which effects are real and which are coincidental artifacts of the evaluation. The shaking methodology answers this by executing multiple simulations under small perturbations to the input workload, and calculating the average performance result; if the effect persists we can be more confident that it is real, whereas if it disappears it was an artifact. We present several examples where the sensitivity that appears in results based on a single evaluation is eliminated or considerably reduced by the shaking methodology. While our examples come from evaluations of scheduling algorithms for supercomputers, we believe the method has wider applicability.
    @InProceedings{tsafrir07-shake,
    author = {Dan Tsafrir and Keren Ouaknine and Dror G. Feitelson},
    title = {Reducing performance evaluation sensitivity and variability by input shaking},
    booktitle = {IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)},
    year = 2007,
    month = {October},
    pages = {231--237},
    address = {Istanbul, Turkey}
    }
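
    The methodology in miniature: rerun the simulation on many slightly perturbed copies of the workload and average the metric, so effects that survive the shaking are trusted while one-off artifacts wash out. In the C++ sketch below, the ±60-second jitter and run count are placeholders, and the simulator itself is supplied by the caller.

    #include <functional>
    #include <random>
    #include <vector>

    double shaken_result(std::vector<double> arrivals,
                         const std::function<double(const std::vector<double>&)>& simulate,
                         int runs = 50) {
        std::mt19937 rng(1);
        std::uniform_real_distribution<double> jitter(-60.0, 60.0);  // seconds
        double sum = 0;
        for (int i = 0; i < runs; ++i) {
            std::vector<double> shaken = arrivals;
            for (double& t : shaken) t += jitter(rng);   // perturb each arrival
            sum += simulate(shaken);
        }
        return sum / runs;   // average over the perturbed runs
    }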

  • Secretly monopolizing the CPU without superuser privileges
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    USENIX Security Symposium
    August, 2007, Boston, Massachusetts, pages 239–256
    Abstract , BibTeX , PDF , Definitive , Coverage: Slashdot , Ars Technica , ACM TechNews
    We describe a “cheat” attack, allowing an ordinary process to hijack any desirable percentage of the CPU cycles without requiring superuser/administrator privileges. Moreover, the nature of the attack is such that, at least in some systems, listing the active processes will erroneously show the cheating process as not using any CPU resources: the “missing” cycles would either be attributed to some other process or not be reported at all (if the machine is otherwise idle). Thus, certain malicious operations generally believed to have required overcoming the hardships of obtaining root access and installing a rootkit, can actually be launched by non-privileged users in a straightforward manner, thereby making the job of a malicious adversary that much easier. We show that most major general-purpose operating systems are vulnerable to the cheat attack, due to a combination of how they account for CPU usage and how they use this information to prioritize competing processes. Furthermore, recent scheduler changes attempting to better support interactive workloads increase the vulnerability to the attack, and naive steps taken by certain systems to reduce the danger are easily circumvented. We show that the attack can nevertheless be defeated, and we demonstrate this by implementing a patch for Linux that eliminates the problem with negligible overhead.
    @InProceedings{tsafrir07-cheat,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Secretly monopolizing the {CPU} without superuser privileges},
    booktitle = {USENIX Security Symposium},
    year = 2007,
    month = {August},
    pages = {239--256},
    address = {Boston, Massachusetts}
    }
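
    The attack's timing pattern, sketched for understanding only: tick-based accounting bills whoever is on the CPU when the tick fires, so a process that computes between ticks and sleeps across each boundary is billed nothing. The 1ms tick, the duty cycle, and the phase assumption below are illustrative; a real instance would first have to measure the tick's actual phase.

    #include <time.h>

    // Illustration of the billing-evasion pattern (assumes HZ=1000 and,
    // unrealistically, that ticks fire when tv_nsec is a multiple of 1ms).
    int main() {
        const long tick_ns = 1000000;                    // 1ms tick, assumed
        for (long n = 0; n < 1000000; ++n) {
            timespec now;
            clock_gettime(CLOCK_MONOTONIC, &now);
            long into_tick = now.tv_nsec % tick_ns;
            if (into_tick < tick_ns * 8 / 10) {
                // "compute" here: safely between ticks, so the sampling
                // tick never observes this process on the CPU
            } else {
                timespec nap = {0, tick_ns - into_tick + tick_ns / 10};
                nanosleep(&nap, nullptr);                // sleep across the boundary
            }
        }
    }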

  • The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops)
    Dan Tsafrir
    ExpCS '07: ACM Workshop on Experimental Computer Science
    June, 2007, San Diego, California, page 4
    Abstract , BibTeX , PDF , Definitive
    The overhead of a context switch is typically associated with multitasking, where several applications share a processor. But even if only one runnable application is present in the system and supposedly runs alone, it is still repeatedly preempted in favor of a different thread of execution, namely, the operating system that services periodic clock interrupts. We employ two complementary methodologies to measure the overhead incurred by such events and obtain contradictory results.

    The first methodology systematically changes the interrupt frequency and measures by how much this prolongs the duration of a program that sorts an array. The overall overhead is found to be 0.5%–1.5% at 1000 Hz, linearly proportional to the tick rate, and steadily declining as the speed of processors increases. If the kernel is configured such that each tick is slowed down by an access to an external time source, then the direct overhead dominates. Otherwise, the relative weight of the indirect portion is steadily growing with processors' speed, accounting for up to 85% of the total.

    The second methodology repeatedly executes a simplistic loop (calibrated to take 1ms), measures the actual execution time, and analyzes the perturbations. Some loop implementations yield results similar to the above, but others indicate that the overhead is actually an order of magnitude bigger, or worse. The phenomenon was observed on IA32, IA64, and Power processors, the latter being part of the ASC Purple supercomputer. Importantly, the effect is greatly amplified for parallel jobs, where one late thread holds up all its peers, causing a slowdown that is dominated by the per-node latency (numerator) and the job granularity (denominator). We trace the bizarre effect to an unexplained interrupt/loop interaction; the question of whether this hardware misfeature is experienced by real applications remains open.
    @InProceedings{tsafrir07-context,
    author = {Dan Tsafrir},
    title = {The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops)},
    booktitle = {ACM Workshop on Experimental Computer Science (ExpCS)},
    year = 2007,
    month = {June},
    pages = 4,
    address = {San Diego, California}
    }
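
    The second methodology is easy to reproduce in miniature: time repeated executions of a calibrated do-nothing loop and inspect the spread. In the C++ sketch below the iteration count is machine-dependent and hypothetical; calibrate it until one run takes about 1ms.

    #include <chrono>
    #include <cstdio>

    int main() {
        using namespace std::chrono;
        volatile long sink = 0;
        const long iters = 500000;   // calibrate so one run takes ~1ms
        for (int run = 0; run < 20; ++run) {
            auto t0 = steady_clock::now();
            for (long i = 0; i < iters; ++i) sink = sink + i;   // the do-nothing loop
            auto dt = duration_cast<microseconds>(steady_clock::now() - t0);
            // Perturbations beyond the calibrated duration reflect
            // interrupts and other kernel activity.
            std::printf("run %2d: %lld us\n", run, static_cast<long long>(dt.count()));
        }
    }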

  • Backfilling using system-generated predictions rather than user runtime estimates
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    TPDS: IEEE Transactions on Parallel and Distributed Systems
    June, 2007, pages 789–803, volume 18, number 6
    Abstract , BibTeX , PDF , Definitive
    The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). To make such determinations possible, users are required to provide estimates of how long jobs will run, and jobs that violate these estimates are killed. Empirical studies have repeatedly shown that user estimates are inaccurate, and that system-generated predictions based on history may be significantly better. However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction, and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g. using FCFS as the basis), and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm, and are not limited to EASY.
    @Article{tsafrir07-pred,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Backfilling using system-generated predictions rather than user runtime estimates},
    journal = {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
    volume = 18,
    number = 6,
    year = 2007,
    month = {June},
    pages = {789--803}
    }
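
    The predictor itself is two lines of logic: average the same user's last two runtimes, falling back to the user's estimate until enough history accumulates. A paraphrase in C++ (class and method names are ours):

    #include <deque>
    #include <string>
    #include <unordered_map>

    class Predictor {
        std::unordered_map<std::string, std::deque<double>> hist_;   // per-user runtimes
    public:
        double predict(const std::string& user, double user_estimate) const {
            auto it = hist_.find(user);
            if (it == hist_.end() || it->second.size() < 2) return user_estimate;
            return (it->second[0] + it->second[1]) / 2.0;   // average of last two
        }
        void record(const std::string& user, double runtime) {
            auto& h = hist_[user];
            h.push_front(runtime);
            if (h.size() > 2) h.pop_back();   // keep only the two most recent
        }
    };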

  • Fine grained kernel logging with KLogger: experience and insights
    Yoav Etsion, Dan Tsafrir, Scott Kirkpatrick, Dror G. Feitelson
    EuroSys '07: ACM European Conference on Computer Systems
    March, 2007, Lisbon, Portugal, pages 259–274
    Abstract , BibTeX , PDF , Definitive , Software
    Understanding the detailed behavior of an operating system is crucial for making informed design decisions. But such an understanding is very hard to achieve, due to the increasing complexity of such systems and the fact that they are implemented and maintained by large and diverse groups of developers. Tools like KLogger — presented in this paper — can help by enabling fine-grained logging of system events and the sharing of a logging infrastructure between multiple developers and researchers, facilitating a methodology where design evaluation can be an integral part of kernel development. We demonstrate the need for such methodology by a host of case studies, using KLogger to better understand various kernel subsystems in Linux, and pinpointing overheads and problems therein.
    @InProceedings{etsion07-klogger,
    author = {Yoav Etsion and Dan Tsafrir and Scott Kirkpatrick and Dror G. Feitelson},
    title = {Fine grained kernel logging with {KLogger:} experience and insights},
    booktitle = {ACM European Conference on Computer Systems (EuroSys)},
    year = 2007,
    month = {March},
    pages = {259--274},
    address = {Lisbon, Portugal}
    }

2006

  • Process prioritization using output production: scheduling for multimedia
    Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
    TOMM: ACM Transactions on Multimedia Computing, Communications and Applications
    November, 2006, pages 318–342, volume 2, number 4
    Abstract , BibTeX , PDF , Definitive
    Desktop operating systems such as Windows and Linux base scheduling decisions on CPU consumption: processes that consume fewer CPU cycles are prioritized, assuming that interactive processes gain from this since they spend most of their time waiting for user input. However, this doesn't work for modern multimedia applications which require significant CPU resources. We therefore suggest a new metric to identify interactive processes by explicitly measuring interactions with the user, and we use it to design and implement a process scheduler. Measurements using a variety of applications indicate that this scheduler is very effective in distinguishing between competing interactive and noninteractive processes.
    @Article{etsion06-huc,
    author = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
    title = {Process prioritization using output production: scheduling for multimedia},
    journal = {ACM Transactions on Multimedia Computing, Communications and Applications (TOMM)},
    volume = 2,
    number = 4,
    year = 2006,
    month = {November},
    pages = {318--342}
    }

  • Stop polling! The case against OS ticks
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    OSDI '06: USENIX Symposium on Operating Systems Design and Implementation
    November, 2006, Seattle, Washington, Poster Session
    BibTeX , PDF , Definitive , Poster
    @InProceedings{tsafrir06-ticks,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Stop polling! {The} case against {OS} ticks},
    booktitle = {USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
    year = 2006,
    month = {November},
    address = {Seattle, Washington},
    note = {Poster Session}
    }

  • The dynamics of backfilling: solving the mystery of why increased inaccuracy may help
    Dan Tsafrir, Dror G. Feitelson
    IISWC '06: IEEE International Symposium on Workload Characterization
    October, 2006, San Jose, California, pages 131–141
    Abstract , BibTeX , PDF , Definitive
    Parallel job scheduling with backfilling requires users to provide runtime estimates, used by the scheduler to better pack the jobs. Studies of the impact of such estimates on performance have modeled them using a “badness factor” f ≥ 0 in an attempt to capture their inaccuracy (given a runtime r, the estimate is uniformly distributed in [r, (f+1) · r]). Surprisingly, inaccurate estimates (f>0) yielded better performance than accurate ones (f=0). We explain this by a “heel and toe” dynamics that, with f>0, causes backfilling to approximate shortest-job-first scheduling. We further find that the effect of systematically increasing f is V-shaped: average wait time and slowdown initially drop, only to rise later on. This happens because higher fs create bigger “holes” in the schedule (longer jobs can backfill) and increase the randomness (more long jobs appear as short), thus overshadowing the initial heel-and-toe preference for shorter jobs.

    The bottom line is that artificial inaccuracy generated by multiplying (real or perfect) estimates by a factor is (1) just a scheduling technique that trades off fairness for performance, and is (2) ill-suited for studying the effect of real inaccuracy. Real estimates are modal (90% of the jobs use the same 20 estimates) and bounded by a maximum (usually the most popular estimate). Therefore, when performing an evaluation, “increased inaccuracy” should translate to increased modality. Unlike multiplying, this indeed worsens performance as one would intuitively expect.
    @InProceedings{tsafrir06-bfdyn,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {The dynamics of backfilling: solving the mystery of why increased inaccuracy may help},
    booktitle = {IEEE International Symposium on Workload Characterization (IISWC)},
    year = 2006,
    month = {October},
    pages = {131--141},
    address = {San Jose, California}
    }

  • Modeling, evaluating, and improving the performance of supercomputer scheduling
    Dan Tsafrir
    PhD: Technical Report 2006-78, School of Computer Science and Engineering, the Hebrew University
    September, 2006, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    The most popular scheduling policy for parallel systems is FCFS with backfilling (a.k.a. “EASY” scheduling), where short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). This requires users to provide estimates of how long jobs will run, and jobs that violate these estimates are killed so as not to violate subsequent commitments. The de-facto standard of evaluating the impact of inaccurate estimates on performance has been to use a “badness factor” f ≥ 0, such that given a runtime r, the associated estimate is uniformly distributed in [r, r · (f+1)], or is simply r·(f+1). The underlying assumption was that bigger fs imply worse information.

    Surprisingly, inaccurate estimates (f>0) yield better performance than accurate ones (f=0), a fact that has repeatedly produced statements like “inaccurate estimates actually improve performance” or “what the scheduler doesn't know won't hurt it”, in many independent studies. This has promoted the perception that estimates are “unimportant”. At the same time, other studies noted that real user estimates are inaccurate, and that system-generated predictions based on history can do better. But predictions were never incorporated into production schedulers, partially due to the aforementioned perception that inaccuracy actually helps, partially because suggested predictors were too complex, and partially because underprediction is technically unacceptable, as users will not tolerate jobs being killed just because system predictions were too short. All attempts to solve the latter technicality yielded algorithms that are inappropriate for many supercomputing settings (e.g. using preemption, assuming all jobs are restartable, etcetera).

    This work has four major contributions. First, we show that the “inaccuracy helps” common wisdom is merely an unwarranted artifact of the erroneous manner in which inaccurate estimates have been modeled, and that increased accuracy actually improves performance. Specifically, previously observed improvements turn out to be due to a “heel and toe” dynamics that, with f>0, causes backfilling to approximate shortest-job-first scheduling. We show that multiplying estimates by a factor translates to trading off fairness for performance, and that this reasoning works regardless of whether the values being multiplied are actual runtimes (“perfect estimates”) or the flawed estimates that are supplied by users. We further show that the more accurate the values we multiply, the better the resulting performance. Thus, better estimates actually improve performance, and multiplying is in fact a scheduling policy that exercises the fairness/performance tradeoff. Regardless, multiplying is anything but representative of real inaccuracy, as outlined next.

    Our second contribution is developing a more representative model of estimates that, from now on, will allow for a valid evaluation of the effect of inaccurate estimates. It is largely based on noting that human users repeatedly use the same “round” values (ten minutes, one hour etc.) and on the invariant that 90% of the jobs use the same 20 estimates. Importantly, the most popular estimate is typically the maximal allowed. As a result, the jobs associated with this estimate cannot be backfilled, and indeed, the more this value is used, the more EASY resembles plain FCFS. Thus, to artificially increase the inaccuracy one should e.g. associate more jobs with the maximum (a realistic manipulation), not multiply by a greater factor (a bogus boost of performance).

    Our third contribution exploits the above understandings to devise a new scheduler that automatically improves the quality of estimates and puts them to productive use in the context of EASY, while preserving its attractively simple batch essence and refraining from any unacceptable assumptions. Specifically, the problem of underprediction is solved by divorcing kill-time from the runtime prediction, and correcting predictions adaptively at runtime if they prove wrong. The result is a surprisingly simple scheduler, which requires minimal deviations from current practices and behaves exactly like EASY as far as users are concerned. Nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than saving and mining the history for similar jobs, as was done by previous work (a schematic of this predictor appears after this entry). For further performance enhancements, we propose to exploit the “heel and toe” understanding by explicitly using a shortest-job-backfilled-first (SJBF) backfilling order. This directly leads to performance improvements similar to those previously attributed to stunts like multiplying estimates. By still preserving FCFS as the basis, we maintain EASY's appeal and enjoy both worlds: a fair scheduler that nevertheless backfills effectively.

    Finally, our fourth contribution has broader applicability that transcends the supercomputing domain. All of the above results are based on the standard methodology of modeling and simulating real activity logs of production systems, which is routinely practiced in system-related research. The overwhelmingly accepted assumption underlying this methodology is that such real workloads are representative and reliable. We show, however, that real workloads may also contain anomalies that make them non-representative and unreliable. This is a special case of multi-class workloads, where one class is the “real” workload we wish to use in the evaluation, and the other class contaminates the log with “bogus” data. We provide several examples of this situation, including an anomaly we call “workload flurries”: surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Using a workload with such anomalies in effect emphasizes rare and unique events (e.g. occurring for a few days out of two years of logged data), and risks optimizing design decisions for the anomalous workload at the expense of the normal workload. Thus, we claim that such anomalies should be removed from the workload before it is used in evaluations, and that ignoring them is an unjustifiable approach.
    @PhdThesis{tsafrir06-phd,
    author = {Dan Tsafrir},
    title = {Modeling, evaluating, and improving the performance of supercomputer scheduling},
    school = {School of Computer Science and Engineering, the Hebrew University (PhD Thesis)},
    year = 2006,
    month = {September},
    address = {Jerusalem, Israel},
    note = {Technical Report 2006-78}
    }
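
    The following Python fragment is a rough sketch, not the thesis code, of the prediction scheme described above: predict a job's runtime as the average of the same user's two most recent runtimes, keep the kill-time derived from the user estimate, and enlarge a too-short prediction while the job runs. The data structures and the specific correction policy are illustrative assumptions.

    from collections import defaultdict, deque

    class SimplePredictor:
        """Average the runtimes of the last two jobs by the same user."""

        def __init__(self):
            self.history = defaultdict(lambda: deque(maxlen=2))

        def predict(self, user, user_estimate):
            runs = self.history[user]
            if not runs:
                return user_estimate  # no history yet: fall back on the estimate
            return sum(runs) / len(runs)

        def record(self, user, runtime):
            self.history[user].append(runtime)

    def corrected(prediction, elapsed, kill_time):
        # Adaptive correction (an illustrative policy, not the thesis's exact
        # rule): if a job outlives its prediction, push the prediction forward,
        # but never past the kill-time, which is derived from the user estimate
        # and remains the only deadline users are exposed to.
        return min(max(prediction, elapsed * 1.5), kill_time)

    # Backfilling candidates would then be ordered shortest-predicted-first
    # (SJBF), while the queue head keeps its FCFS reservation, e.g.:
    #   queue.sort(key=lambda job: predictor.predict(job.user, job.estimate))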

  • Session-based, estimation-less, and information-less runtime prediction algorithms for parallel and grid job scheduling
    David Talby, Dan Tsafrir, Zviki Goldberg, Dror G. Feitelson
    Technical Report 2006-77, School of Computer Science and Engineering, the Hebrew University
    August, 2006, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    The default setting of most production parallel job schedulers is FCFS with backfilling. Under this setting, users must supply job runtime estimates, which are known to be highly inaccurate and inferior to system-generated predictions. Recent research revealed how to utilize system predictions for backfilling, and this potential performance gain motivates searching for better prediction techniques. We present three prediction techniques using decreasing levels of information, as is suitable for the situation at hand. The first is based on "user sessions": continuous temporal periods of per-user work. This algorithm exploits the entire long-term historical data of the workload, along with user runtime estimates. The second is "estimation-less", that is, it uses historical data only, relieving users from the annoying need to supply estimates. The third is completely "information-less" and is suitable for cases in which neither historical information nor estimates are available, as happens in some grid environments. We evaluate the algorithms by simulating real data from production systems. We find all of them to be successful in terms of both accuracy and performance.
    @TechReport{talby06-predalg,
    author = {David Talby and Dan Tsafrir and Zviki Goldberg and Dror G. Feitelson},
    title = {Session-based, estimation-less, and information-less runtime prediction algorithms for parallel and grid job scheduling},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2006-77},
    year = 2006,
    month = {August},
    address = {Jerusalem, Israel}
    }

  • Instability in parallel job scheduling simulation: the role of workload flurries
    Dan Tsafrir, Dror G. Feitelson
    IPDPS '06: IEEE International Parallel and Distributed Processing Symposium
    April, 2006, Rhodes Island, Greece, page 10
    Abstract , BibTeX , PDF , Definitive
    The performance of computer systems depends, among other things, on the workload. This motivates the use of real workloads (as recorded in activity logs) to drive simulations of new designs. Unfortunately, real workloads may contain various anomalies that contaminate the data. A previously unrecognized type of anomaly is workload flurries: rare surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. We find that long workloads often include at least one such event. We show that in the context of parallel job scheduling these events can have a significant effect on performance evaluation results, e.g. a very small perturbation of the simulation conditions might lead to a large and disproportionate change in the outcome. This instability is due to jobs in the flurry being affected in unison, a consequence of the flurry's repetitive nature. We therefore advocate that flurries be filtered out before the workload is used, in order to achieve stable and more reliable evaluation results (analogously to the removal of outliers in statistical analysis); a schematic filtering procedure appears after this entry. At the same time, we note that more research is needed on the possible effects of flurries.
    @InProceedings{tsafrir06-bfly,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {Instability in parallel job scheduling simulation: the role of workload flurries},
    booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
    year = 2006,
    month = {April},
    pages = {10},
    address = {Rhodes Island, Greece}
    }
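
    As a concrete illustration (not the paper's actual procedure), flurry filtering can be approximated in a few lines of Python: flag time windows in which a single user dominates the submitted jobs, and drop that user's jobs from such windows before the log drives a simulation. All thresholds below are assumptions.

    # Thresholds below are illustrative assumptions, not the paper's values.
    WINDOW = 7 * 24 * 3600  # bucket the log into one-week windows (seconds)
    DOMINANCE = 0.8         # one user submits >80% of a window's jobs
    MIN_JOBS = 1000         # only dense windows can host a flurry

    def filter_flurries(jobs):
        """jobs: list of (submit_time, user) pairs, sorted by submit_time."""
        kept, window, start = [], [], jobs[0][0] if jobs else 0
        for job in jobs:
            if job[0] - start >= WINDOW:
                kept.extend(keep_or_drop(window))
                window, start = [], job[0]
            window.append(job)
        kept.extend(keep_or_drop(window))
        return kept

    def keep_or_drop(window):
        if len(window) >= MIN_JOBS:
            users = [user for _, user in window]
            top = max(set(users), key=users.count)
            if users.count(top) / len(window) > DOMINANCE:
                # A flurry: drop the dominating user's jobs, keep the rest.
                return [job for job in window if job[1] != top]
        return window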

  • Workload sanitation for performance evaluation
    Dror G. Feitelson, Dan Tsafrir
    ISPASS '06: IEEE International Symposium on Performance Analysis of Systems and Software
    March, 2006, Austin, Texas, pages 221–230
    Abstract , BibTeX , PDF , Definitive
    The performance of computer systems depends, among other things, on the workload. Performance evaluations are therefore often done using logs of workloads on current production systems, under the assumption that such real workloads are representative and reliable; likewise, workload modeling is typically based on real workloads. We show, however, that real workloads may also contain anomalies that make them non-representative and unreliable. This is a special case of multi-class workloads, where one class is the “real” workload which we wish to use in the evaluation, and the other class contaminates the log with “bogus” data. We provide several examples of this situation, including a previously unrecognized type of anomaly we call “workload flurries”: surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Using a workload with such anomalies in effect emphasizes rare and unique events (e.g. occurring for a few days out of two years of logged data), and risks optimizing design decisions for the anomalous workload at the expense of the normal workload. Thus, we claim that such anomalies should be removed from the workload before it is used in evaluations, and that ignoring them is actually an unjustifiable approach.
    @InProceedings{feitelson06-clean,
    author = {Dror G. Feitelson and Dan Tsafrir},
    title = {Workload sanitation for performance evaluation},
    booktitle = {IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
    year = 2006,
    month = {March},
    pages = {221--230},
    address = {Austin, Texas}
    }

  • System and method for backfilling with system-generated predictions rather than user runtime estimates
    Dan Tsafrir, Yoav Etsion, David Talby, Dror G. Feitelson
    Patent Number: US 8261283 B2 (granted: Sep 2012)
    February, 2006
    BibTeX , Html
    @Misc{tsafrir06-predpat,
    author = {Dan Tsafrir and Yoav Etsion and David Talby and Dror G. Feitelson},
    title = {System and method for backfilling with system-generated predictions rather than user runtime estimates},
    howpublished = {Patent Number: US 8261283 B2 (granted: Sep 2012)},
    year = 2006,
    month = {February}
    }

2005

  • System noise, OS clock ticks, and fine-grained parallel applications
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson, Scott Kirkpatrick
    ICS '05: ACM International Conference on Supercomputing
    June, 2005, Cambridge, Massachusetts, pages 303–312
    Abstract , BibTeX , PDF , Definitive
    As parallel jobs get bigger in size and finer in granularity, “system noise” is increasingly becoming a problem. In fact, fine-grained jobs on clusters with thousands of SMP nodes run faster if a processor is intentionally left idle (per node), thus enabling a separation of “system noise” from the computation. Paying a cost in average processing speed at a node for the sake of eliminating occasional process delays is (unfortunately) beneficial, as such delays are enormously magnified when one late process holds up thousands of peers with which it synchronizes.

    We provide a probabilistic argument showing that, under certain conditions, the effect of such noise is linearly proportional to the size of the cluster, as is often empirically observed (a toy model reproducing this scaling appears after this entry). We then identify a major source of noise to be the indirect overhead of periodic OS clock interrupts (“ticks”), which are used by all general-purpose OSs as a means of maintaining control. This is shown for various grain sizes, platforms, tick frequencies, and OSs. To eliminate such noise, we suggest replacing ticks with an alternative mechanism we call “smart timers”. This turns out to also be in line with the needs of desktop and mobile computing, increasing the chances that the suggested change will be accepted.
    @InProceedings{tsafrir05-noise,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson and Scott Kirkpatrick},
    title = {System noise, {OS} clock ticks, and fine-grained parallel applications},
    booktitle = {ACM International Conference on Supercomputing (ICS)},
    year = 2005,
    month = {June},
    pages = {303--312},
    address = {Cambridge, Massachusetts}
    }
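
    The linear-scaling claim is easy to reproduce with a toy Monte-Carlo model (ours, not the paper's): if each of N synchronizing processes independently suffers a rare delay with probability p per compute phase, the probability that the barrier as a whole is delayed is 1-(1-p)^N, which is approximately Np when Np is small, so the expected slowdown grows roughly linearly with N until it saturates.

    import random

    def mean_slowdown(n_procs, p_delay, delay, grain, iters=2000):
        """Toy model: per compute phase, each of n_procs processes works
        for `grain` seconds and suffers a `delay`-second noise event with
        probability p_delay; the barrier waits for the slowest process."""
        total = 0.0
        for _ in range(iters):
            noise = max(delay if random.random() < p_delay else 0.0
                        for _ in range(n_procs))
            total += (grain + noise) / grain
        return total / iters

    if __name__ == "__main__":
        # P(some process is hit) = 1 - (1 - p)^N, roughly N*p while N*p is
        # small, so the mean slowdown grows ~linearly with N, then saturates.
        for n in (16, 64, 256, 1024, 4096):
            print(n, round(mean_slowdown(n, p_delay=1e-3,
                                         delay=1e-3, grain=1e-3), 3))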

  • Modeling user runtime estimates
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    JSSPP '05: Workshop on Job Scheduling Strategies for Parallel Processing
    June, 2005, Cambridge, Massachusetts, pages 1–35, Lecture Notes in Computer Science, Volume 3834
    Abstract , BibTeX , PDF , Definitive , Software
    User estimates of job runtimes have emerged as an important component of the workload on parallel machines, and can have a significant impact on how a scheduler treats different jobs, and thus on overall performance. It is therefore highly desirable to have a good model of the relationship between parallel jobs and their associated estimates. We construct such a model based on a detailed analysis of several workload traces. The model incorporates those features that are consistent in all of the logs, most notably the inherently modal nature of estimates (e.g. only 20 different values are used as estimates for about 90% of the jobs). We find that the behavior of users, as manifested through the estimate distributions, is remarkably similar across the different workload traces. Indeed, providing our model with only the maximal allowed estimate value, along with the percentage of jobs that have used it, yields results that are very similar to the original. The remaining difference (if any) is largely eliminated by providing information on one or two additional popular estimates (a schematic restatement of this construction appears after this entry). Consequently, in comparison to previous models, simulations that utilize our model are better at reproducing scheduling behavior similar to that observed when using real estimates.
    @InCollection{tsafrir05-est,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {Modeling user runtime estimates},
    booktitle = {Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP)},
    year = 2005,
    month = {June},
    pages = {1--35},
    address = {Cambridge, Massachusetts},
    note = {Lecture Notes in Computer Science, Volume 3834}
    }
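
    The actual model is available via the Software link above; the Python fragment below is only a schematic restatement of the headline observation, with all constants assumed for illustration: give the maximal allowed estimate its measured share, spread most of the remaining probability over a few popular round values with decaying weights, and sample estimates from the result.

    import random

    def build_estimate_distribution(max_estimate, p_max, round_values):
        # Assign p_max to the maximal allowed estimate and split the
        # remaining probability across popular "round" values with
        # decaying weights (an assumed weighting, for illustration only).
        weights = [1.0 / (rank + 1) for rank in range(len(round_values))]
        scale = (1.0 - p_max) / sum(weights)
        dist = {max_estimate: p_max}
        for value, weight in zip(round_values, weights):
            dist[value] = weight * scale
        return dist  # maps estimate value -> probability; sums to 1

    def sample_estimate(dist):
        values = list(dist)
        return random.choices(values, weights=[dist[v] for v in values], k=1)[0]

    if __name__ == "__main__":
        dist = build_estimate_distribution(
            max_estimate=36000,                    # assumed 10-hour cap
            p_max=0.2,                             # assumed share of "max" jobs
            round_values=[300, 900, 3600, 7200])   # assumed popular values
        print(dist)
        print([sample_estimate(dist) for _ in range(5)])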

  • A short survey of commercial cluster batch schedulers
    Yoav Etsion, Dan Tsafrir
    Technical Report 2005-13, School of Computer Science and Engineering, the Hebrew University
    May, 2005, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    As high-performance computing clusters are getting cheaper, they become more accessible. The various clusters run a host of workload management software suites, which are getting more complex and offer cluster administrators numerous features, scheduling policies, job prioritization schemes, etc. In this paper we survey some of the common commercial workload managers on the market, covering their main features, specifically the scheduling policies and algorithms they support and their priority and queueing mechanisms, while focusing on their default settings.
    @TechReport{etsion05-survey,
    author = {Yoav Etsion and Dan Tsafrir},
    title = {A short survey of commercial cluster batch schedulers},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2005-13},
    year = 2005,
    month = {May},
    address = {Jerusalem, Israel}
    }

  • General purpose timing: the failure of periodic timers
    Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
    Technical Report 2005-6, School of Computer Science and Engineering, the Hebrew University
    February, 2005, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    All general-purpose commodity operating systems use periodic clock interrupts to regain control and measure the passage of time. This is ill-suited for desktop settings, as the fine-grained timing requirements of modern multimedia applications require a high clock rate, which may suffer from significant overhead. It is ill-suited for HPC environments, as asynchronous interrupts ruin the coordination among cluster nodes. And it is ill-suited for mobile platforms, as it wastes significant energy, especially when the system is otherwise idle. To be truly general-purpose, systems should therefore switch to a mechanism that is closer to one-shot timers (set only for specific needs) while avoiding the potentially huge overhead they entail (a minimal event-loop sketch of this approach appears after this entry). With a careful design it is possible to achieve both high accuracy and low overhead, thus significantly extending the applicability of general-purpose operating systems.
    @TechReport{tsafrir05-timing,
    author = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
    title = {General purpose timing: the failure of periodic timers},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2005-6},
    year = 2005,
    month = {February},
    address = {Jerusalem, Israel}
    }
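
    To make the periodic-versus-one-shot distinction concrete, here is a minimal event-loop sketch in Python (ours, not the report's design): instead of waking on every tick, pending timers sit in a heap and the system sleeps precisely until the earliest deadline, so an idle system takes no timer interrupts at all.

    import heapq, time

    class OneShotTimers:
        """Tickless one-shot timers kept in a heap (illustrative only)."""

        def __init__(self):
            self.heap = []  # (deadline, seq, callback); seq breaks ties
            self.seq = 0

        def arm(self, delay, callback):
            heapq.heappush(self.heap,
                           (time.monotonic() + delay, self.seq, callback))
            self.seq += 1

        def run(self):
            while self.heap:
                deadline, _, callback = self.heap[0]
                now = time.monotonic()
                if now < deadline:
                    # Sleep exactly until the earliest deadline: no periodic
                    # wakeups, hence no tick overhead when nothing is pending.
                    time.sleep(deadline - now)
                    continue
                heapq.heappop(self.heap)
                callback()

    if __name__ == "__main__":
        t = OneShotTimers()
        t.arm(0.2, lambda: print("timer armed for 0.2s fired"))
        t.arm(0.1, lambda: print("timer armed for 0.1s fired"))
        t.run()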

2004

  • Desktop scheduling: how can we know what the user wants?
    Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
    NOSSDAV '04: ACM International Workshop on Network and Operating Systems Support for Digital Audio and Video
    June, 2004, Kinsale, Ireland, pages 110–115
    Abstract , BibTeX , PDF , Definitive
    Current desktop operating systems use CPU utilization (or lack thereof) to prioritize processes for scheduling. This was thought to be beneficial for interactive processes, under the assumption that they spend much of their time waiting for user input. This reasoning fails for modern multimedia applications. For example, playing a movie in parallel with a heavy background job usually leads to poor graphical results, as these jobs are indistinguishable in terms of CPU usage. Suggested solutions involve shifting the burden to the user or programmer, which we claim is unsatisfactory; instead, we seek an automatic solution. Our attempts using new metrics based on CPU usage failed. We therefore propose and implement a novel scheme of identifying interactive and multimedia applications by directly quantifying the I/O between an application and the user (keyboard, mouse, and screen activity); a schematic of this metric appears after this entry. Preliminary results indicate that prioritizing processes according to this metric indeed solves the aforementioned problem, demonstrating that operating systems can provide better support for multimedia and interactive applications. Additionally, once user I/O data is available, it opens intriguing new possibilities for system designers.
    @InProceedings{etsion04-huc-pri,
    author = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
    title = {Desktop scheduling: how can we know what the user wants?},
    booktitle = {ACM International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV)},
    year = 2004,
    month = {June},
    pages = {110--115},
    address = {Kinsale, Ireland}
    }
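
    The following Python sketch illustrates the idea schematically (it is not the paper's implementation): track recent keyboard/mouse/screen events per process and fold their decayed rate into the priority, so that a movie player, indistinguishable from a background job in CPU terms, still outranks it. The half-life and weighting are assumptions.

    import time

    class UserIOPriority:
        """Sketch: prioritize by recent user-I/O rate rather than CPU usage."""

        HALF_LIFE = 2.0  # seconds; assumed decay rate of "interactiveness"

        def __init__(self):
            self.score = {}  # pid -> (last_update_time, decayed_event_count)

        def on_user_io(self, pid, events=1):
            # Called whenever keyboard/mouse input or screen output is
            # attributed to process `pid`.
            now = time.monotonic()
            last, rate = self.score.get(pid, (now, 0.0))
            rate *= 0.5 ** ((now - last) / self.HALF_LIFE)
            self.score[pid] = (now, rate + events)

        def priority_boost(self, pid):
            # Higher recent user-I/O activity yields a larger boost on top
            # of the usual CPU-based priority; silent processes get none.
            now = time.monotonic()
            last, rate = self.score.get(pid, (now, 0.0))
            return rate * 0.5 ** ((now - last) / self.HALF_LIFE)

    if __name__ == "__main__":
        sched = UserIOPriority()
        sched.on_user_io(pid=42)          # e.g. a mouse click for pid 42
        print(sched.priority_boost(42))   # > 0: recently interactive
        print(sched.priority_boost(99))   # 0.0: no user I/O observed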

2003

  • Workload flurries
    Dan Tsafrir, Dror G. Feitelson
    Technical Report 2003-85, School of Computer Science and Engineering, the Hebrew University
    November, 2003, Jerusalem, Israel
    Abstract , BibTeX , PDF
    The performance of computer systems depends, among other things, on the workload. Performance evaluations are therefore often done using logs of workloads on current production systems, under the assumption that such real workloads are representative and reliable; likewise, workload modeling is typically based on real workloads. However, real workloads may also contain anomalies that make them non-representative and unreliable. A previously unrecognized type of anomaly is workload flurries: surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Under suitable conditions, such flurries can have a decisive effect on performance evaluation results. The problem is that workloads containing such a flurry are not representative of normal usage. Moreover, creating a statistical model based on such a workload, or using it directly, is also not representative of flurries in general. This motivates the approach of identifying and removing the flurries, so as to allow for an evaluation under normal conditions. We demonstrate this for several evaluations of parallel systems, showing that the anomalies in the workload as embodied by flurries carry over to anomalies in the evaluation results, which disappear when the flurries are removed. Such an evaluation can then be augmented by a separate evaluation of the deviation caused by the flurry.
    @TechReport{tsafrir03-flurry,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {Workload flurries},
    institution = {School of Computer Science and Engineering, the Hebrew University},
    number = {2003-85},
    year = 2003,
    month = {November},
    address = {Jerusalem, Israel}
    }

  • Effects of clock resolution on the scheduling of interactive and soft real-time processes
    Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
    SIGMETRICS '03: ACM SIGMETRICS International Conference on Measurement and Modeling
    June, 2003, San Diego, California, pages 172–183
    Abstract , BibTeX , PDF , Definitive
    It is commonly agreed that scheduling mechanisms in general-purpose operating systems do not provide adequate support for modern interactive applications, notably multimedia applications. The common solution to this problem is to devise specialized scheduling mechanisms that take the specific needs of such applications into account. A much simpler alternative is to better tune existing systems. In particular, we show that conventional scheduling algorithms typically have only little, and possibly misleading, information regarding the CPU usage of processes, because increasing CPU rates have caused the common 100 Hz clock interrupt rate to be coarser than most application time quanta. We therefore conduct an experimental analysis of what happens if this rate is significantly increased. Results indicate that much higher clock interrupt rates are possible with acceptable overheads, and lead to much better information. In addition, we show that increasing the clock rate can provide a measure of support for soft real-time requirements, even when using a general-purpose operating system. For example, we achieve sub-millisecond latency under heavily loaded conditions.
    @InProceedings{etsion03-clock,
    author = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
    title = {Effects of clock resolution on the scheduling of interactive and soft real-time processes},
    booktitle = {ACM SIGMETRICS International Conference on Measurement and Modeling (SIGMETRICS)},
    year = 2003,
    month = {June},
    pages = {172--183},
    address = {San Diego, California}
    }

2002

  • Barrier synchronization on a loaded SMP using two-phase waiting algorithms
    Dan Tsafrir, Dror G. Feitelson
    IPDPS '02: IEEE International Parallel and Distributed Processing Symposium
    April, 2002, Fort Lauderdale, Florida, page 80
    Abstract , BibTeX , PDF , Definitive
    Little work has been done on the performance of barrier synchronization using two-phase blocking, as the common wisdom is that it is useless to spin if the total number of threads in the system exceeds the number of processors. We challenge this and show that it may be beneficial to spin-wait even if the number of threads is up to double the number of processors, especially if the waiting time is at least twice the context-switch overhead (rather than being equal to it); a minimal sketch of such two-phase waiting appears after this entry. We also characterize the alternating synchronization pattern that applications based on barriers tend to fall into, which is quite different from the patterns typically assumed in theoretical analyses.
    @InProceedings{tsafrir02-sync,
    author = {Dan Tsafrir and Dror G. Feitelson},
    title = {Barrier synchronization on a loaded {SMP} using two-phase waiting algorithms},
    booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
    year = 2002,
    month = {April},
    pages = {80},
    address = {Fort Lauderdale, Florida}
    }
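
    A minimal two-phase-waiting sketch in Python (illustrative, not the paper's code): spin for a bounded period before blocking, with the spin budget expressed as a multiple of the context-switch overhead, which is the knob the paper argues should be roughly doubled.

    import threading, time

    def two_phase_wait(predicate, spin_budget):
        """Spin for up to `spin_budget` seconds hoping `predicate` becomes
        true without a context switch; only then fall back to blocking
        (simulated here with short sleeps)."""
        deadline = time.monotonic() + spin_budget
        while time.monotonic() < deadline:
            if predicate():
                return "spin"  # synchronized without paying a context switch
        while not predicate():
            time.sleep(0.0001)  # stand-in for blocking in the kernel
        return "block"

    if __name__ == "__main__":
        CTX_SWITCH = 5e-6  # assumed context-switch overhead, in seconds
        flag = threading.Event()
        threading.Timer(0.001, flag.set).start()
        # Spin budget of ~2x the context-switch overhead, per the paper.
        print(two_phase_wait(flag.is_set, spin_budget=2 * CTX_SWITCH))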

2001

  • Barrier synchronization on a loaded SMP using two-phase waiting algorithms
    Dan Tsafrir
    MSc: Technical Report 2001-82, School of Computer Science and Engineering, the Hebrew University
    September, 2001, Jerusalem, Israel
    Abstract , BibTeX , PDF , Definitive
    Little work has been done on the performance of barrier synchronization using two-phase blocking, as the common wisdom is that it is useless to spin if the total number of threads in the system exceeds the number of processors. We challenge this view and show that it may be beneficial to spin-wait if the spinning period is set to be a bit more than twice the context switch overhead (rather than being equal to it). We show that the success of our approach is due to an inherent property of general-purpose schedulers, which tend to select threads that become unblocked for immediate execution. We find that this property causes applications based on barriers to fall into a previously unnoticed pattern, denoted “alternating synchronization”, which is quite different from the patterns typically assumed in theoretical analyses. By merely choosing an appropriate spinning period, we leverage alternating synchronization to implicitly nudge the system into simultaneously co-scheduling the application's threads, thereby dramatically reducing the overhead of synchronization and significantly improving the performance.
    @MastersThesis{tsafrir01-msc,
    author = {Dan Tsafrir},
    title = {Barrier synchronization on a loaded {SMP} using two-phase waiting algorithms},
    school = {School of Computer Science and Engineering, the Hebrew University (MSc Thesis)},
    year = 2001,
    month = {September},
    address = {Jerusalem, Israel},
    note = {Technical Report 2001-82}
    }